Vighnesh Iyer
Github, LinkedIn

Discussion with Eric Quinnell from Tesla

Disclaimer: these are my words, not Eric's. I have used some creative license.

Eric Quinnell from Tesla came to speak to the SLICE lab on 9/28/2022. Download his slides here.

Title: Industrial Ranting about Current Computer Micro-Architectures


Industrial Silicon design provides incredible opportunities, including designing gaming consoles that elevate you as a veritable demi-god to your nephews, while later handing you a severance package as your machine is out-designed. This talk covers an analysis of current CPU micro-architectures, showing what works, what does not, and why OP Caches and SMT designs are The Worst™.

Be your ISA x86, ARM, or RISC-V – any bad micro-architecture can equally ruin them all! Come learn how to see the big picture in CPU micro-architecture so that you too can one day work on completely unrelated AI Silicon for self-driving cars and robots.


Dr. Eric Quinnell received his BSEE/MSEE from the University of Texas at Austin, followed in 2006 by his Ph.D. on the topic of Floating-Point Fused Multiply-Adders. He began his industrial career at AMD designing floating-point and cryptography units on the Bobcat x86 and Jaguar x86 CPUs, the latter being the CPU core in the Xbox One and PS4.

Eric spent nearly a decade at Samsung on the (now retired) Exynos/Mongoose ARMv8 CPUs on the floating-point, L2/L3 caches, and branch predictor units before joining ARM as the Cortex-A CPU micro-architecture performance research lead. Currently, Eric works at Tesla building the hardware and silicon for the Dojo AI Supercomputer.

Dr. Quinnell has many patents, papers, and all the necessary qualifiers required to come talk to grad-student engineers about all the things. He lives in Austin, TX with his wife Leslie, and three children.


Criticism of Variable Length ISAs

What you're saying is that the relatively small code density improvements from the RISC-V compressed extension are dwarfed by the overhead of variable length instruction fetch and therefore limited decoder width? So we're ok with sacrificing cache area consumed by full fixed-width instructions to increase fetch/decode width? Are there no uArch tricks to still have wide decode with variable length ISAs?
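The serialization problem behind this question can be made concrete. With the RISC-V compressed extension, an instruction's length is determined by its low two opcode bits, so finding where the Nth instruction in a fetch line starts requires knowing the lengths of the previous N-1 instructions. A minimal sketch (the two-bit length rule is from the RISC-V spec; the fetch-buffer walk and function names are illustrative):

```python
def insn_len(halfword: int) -> int:
    # RISC-V rule: low two bits == 0b11 means a 32-bit instruction,
    # anything else is a 16-bit compressed (RVC) instruction.
    return 4 if (halfword & 0b11) == 0b11 else 2

def instruction_starts(fetch_bytes: bytes) -> list[int]:
    """Walk a fetch line, computing each instruction's start offset.

    Each boundary depends on the sum of all previous lengths -- this is
    the serial dependency chain a wide parallel decoder must break with
    predecode bits or speculative length decoding.
    """
    starts, off = [], 0
    while off + 2 <= len(fetch_bytes):
        starts.append(off)
        halfword = int.from_bytes(fetch_bytes[off:off + 2], "little")
        off += insn_len(halfword)
    return starts
```

With a fixed 4-byte ISA the start offsets are just 0, 4, 8, ..., computable for every decode slot in parallel with no scan at all, which is the core of the argument against variable-length encodings at high decode widths.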

Performance Modeling

Why was this phenomenon (the overall bad tradeoff of code density vs. decoder width) not caught in performance modeling? Are the benchmarks just wrong?


How do you extract meaningful benchmarks from emulation or real silicon?

Full Self Driving

When will FSD work?

Tesla Engineering Culture

Proper engineering culture is what makes or breaks a team / company and its innovation

On the Importance of Deletion / Deprecation

Destruction is just as, if not more, important than creation.

Precise Exceptions in DOJO

So DOJO doesn't have precise exceptions. Why?

Academic and Industry uArch Collaboration

Academic and industry uArch collaboration fell through after a UW-Intel patent dispute that led to industry paying settlements - now every company keeps its arch features internal and away from academia.

Ethernet Superiority

Ethernet's physical layer has been progressing faster than PCIe's (see the upcoming 800 Gbps links). We prefer to use Ethernet for chip-to-chip and board-to-board communication. The DOJO fabric is all Ethernet, chosen specifically to minimize the use of PCIe for host-accelerator communication. PCIe's latency requirements force extra complexity for long traces, while Ethernet has no such restriction. They use a combination of copper and optical physical channels.

On Publish or Perish

Eric also mentioned his disdain for the publication mania in academia. He is supportive of efforts like arXiv that have changed the model of information dissemination from conferences and journals to just a website on which nearly anyone can publish. He mentions that nearly all cutting-edge ML research is just published in the open, completely independent from any academic venue.

This conference / journal / peer review culture supports some very bad things:

- Constant use of the same stale and meaningless benchmarks in an attempt to one-up each other.
- Inability to work on long-term vision problems.
- Politics on review committees preventing good papers from being published.
- Peer review that doesn't address any real concerns about a paper - a formality at best and a citation farm at worst.
- Elevation of frequently published people who may not have any real innovation, but rather just constant iteration on a basic block (my note: see KAIST's papers at any circuits conference).

More Notes (from Charles Hong)

Variable length instructions

Why don’t people just accept Apple's uArch as the superior CPU uArch?

Tesla Internals

Benchmarking and Modeling

My Perspective

We Need to Settle Arguments

Well clearly someone is right: it is either Krste (RISC-V compressed instructions aren't a big overhead, we can still do wide decode and retain high code density) or Eric (you can't do wide decode easily and it will be harder to hit optimal uArch to run HPC/server workloads). Who is it? How can we do the experiment in the open?

In mid-October, Eric gave me some clarification on an experiment we could set up (see below):

On Krste or me being right, for my thought experiment add insane numbers of ALUs to your 8-10 wide fixed decoder and the 3x LDs or so to feed it. Cluster the ALUs in big honkin groups and +1 fwd cycle for all I care between clusters, just try to keep alu/agens fast fwd if possible. This may overweight one cluster with early binding bc of dependencies, but y’all can clever your way out of that pretty easily at branch boundaries.

Don’t universal schedule, do dedicated per ALU pls -keep it simple stupid. My assertion is the dumber, wider uarch wins. Width of decode is not about sustained IPC, it’s about burstiness of JIT code so have the junk ALUs to eat the bursts. This WILL require you to fetch >1 basic block, so on brp you’ll need target-of-target redundant storage (just write down BIAS or whatever, always/mostly taken is the common case, will fetch you 2 basic blocks most of the time. See ISCA 2020 Samsung paper for ZAT/ZOT implementation)

Also rename of 8-10 wide — just speculate the common case (2 in, 1 out) and if some op breaks that model (e.g. 3 input, 2 dest, etc), just skid 1 cycle and lose the extra width.
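The rename trick above can be turned into a toy model (the `Op` shape, function name, and cost accounting are my illustration of the idea, not Tesla's or anyone's actual design): speculate that every slot is the common 2-source / 1-destination case, and when an op breaks that model, skid a cycle and give that op its own slot, losing the rest of the width for that cycle.

```python
from dataclasses import dataclass

@dataclass
class Op:
    n_src: int  # source registers to rename
    n_dst: int  # destination registers to allocate

def rename_cycles(ops: list[Op], width: int = 10) -> int:
    """Toy model of common-case rename bandwidth.

    Each of `width` slots per cycle handles at most 2 sources and
    1 destination.  An op that exceeds that (e.g. 3-input or 2-dest)
    forces a 1-cycle skid: it renames alone and the group restarts.
    Returns the number of rename cycles consumed.
    """
    cycles, slot = 0, width  # slot == width forces a fresh cycle
    for op in ops:
        if op.n_src > 2 or op.n_dst > 1:
            # breaks the speculated common case: skid a cycle
            cycles += 1
            slot = width
            continue
        if slot == width:
            cycles += 1
            slot = 0
        slot += 1
    return cycles
```

In this model ten common-case ops rename in one cycle, while a single 3-input op in the middle of a group costs a full extra cycle - the cheap trade Eric is describing, since the common case stays fast and simple.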

Working at Tesla

From my impression, Tesla is a great place to work. The hierarchy doesn't just exist for its own benefit / governance, but actually enables downstream engineers to do what needs to be done. No annoying or incompetent managers. Musk hand-selects the best people to lead each team. Only recently have they been opening up their silicon team to new hires vs poaching. Tesla doesn't seem to be interested in 'maintenance' of software - people are expected to adapt to new APIs, remove deprecated ones, and rewrite often.

A few days later, I saw this conversation from here:

Parag Agrawal texts Elon Musk

April 9, 2022

This just confirms to me that Tesla is the place to be.

Un-Asked Questions