It is a dark, dark world. Mr. Leather-Jacket isn't relenting. With his 10x (and growing) profit margin on the hottest commodity known to man, people are screaming for relief. But no one has stepped up to the plate.
Billions in precious VC capital are lit on fire every month as each AGI optimist tries to outcompete the others. But no matter your model architecture, training data, post-training recipe, token sampling, or slop generation strategy, you are always at his mercy.
He insists: "The more you buy, the more you save". And now he has gone even further: "The more you buy, the more you earn." Indeed, we have no choice but to accept his words and buy his overpriced racks.
Still, some arrogant fools think they can build their own hardware and undercut his fat margin. But after Microsoft delayed their custom AI chip once again, he didn't hesitate to remind the world who's boss:
"What's the point of building an ASIC if it's not going to be better than the one you can buy?"
– Mr. Leather-Jacket
Foiled again! Is there truly no escape?
But what if someone went beyond? Beyond what, you might ask? Why, of course, to defeat Team Green, we must go beyond CUDA. Ah yes, the impenetrable moat of Team Green: the CUDA language, programming model, compiler, a vast set of optimized libraries, and a rock-solid runtime (among many other fortifications).
No one has gone beyond, but many fools continue to dare to try. Today, I had a front-row seat at a showcase of Team Red's latest attempt. Hosted by their own neocloud, TensorWave, they put together a show with many users of and contributors to Team Red's CUDA killer.
So, how did it go? Is Mr. Leather-Jacket shaking in his leathers? Read on to find out.
What Did I Expect?
Before the event started, I looked through the list of speakers and talks. My expectation was for a highly technical event that would give developers the confidence and know-how to write cross-platform kernels and make sure ML models and inference servers have first-class support for AMD hardware.
In particular, I was expecting a discussion on these topics:
- An overview of deficiencies and legacy cruft of the CUDA programming model
- New abstractions over GPU programming primitives provided by HIP
- A detailed overview and examples of the ROCm API and programming model and how it differs from CUDA
- From the Triton speaker: how Triton enables easy portability to AMD GPUs by virtue of its tile abstraction while maintaining peak GPU utilization (a minimal example of what that abstraction looks like is sketched after this list)
- From Jim Keller and Raja Koduri: rethinking the ideal architecture and microarchitecture for the SIMT programming model from first principles, eliminating the graphics-oriented specializations that exist in consumer NVIDIA GPUs
- From AMD: ubiquitous support for ROCm across the product lineup, elimination of kernel driver crashes and misbehaviors, demonstration of end-to-end training of large scale models using upstream ML libraries
- Live demos, coding walkthroughs, demonstration of push-button llama.cpp / ollama / vLLM inference engines running with zero issues on easily available AMD hardware (no Docker, no VMs, no pre-prepared machine, show a setup from scratch)
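To make the tile-abstraction point concrete, here is a minimal Triton vector-add kernel. This is my own sketch, not anything shown at the event: the kernel is written against fixed-size tiles, and Triton's compiler decides how those tiles map onto warps (NVIDIA) or wavefronts (AMD), which is what makes the same source portable across backends.

```python
# Minimal Triton kernel (illustrative sketch): the tile-level code below says nothing
# about warp size or vendor, so the same source can target NVIDIA or AMD backends.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # which tile this program instance handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the ragged final tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x and y are expected to live on the GPU (the "cuda" device on either vendor's stack)
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                   # one program instance per 1024-element tile
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```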
If my expectations had been met, this would have been a watershed moment in the development of ROCm and AMD GPUs as a first-class hardware target, worthy of billions of external investment. So what actually happened?
Let's dive into each session.
Past and Future of Compute
This panel was originally slated to have Jim Keller and Raja Koduri. Hearing from them was highly anticipated by the crowd. But Jim claimed he had some personal affairs to attend to and Raja was sick (or vice versa) and neither of them showed up!
Instead, we had Greg from TensorWave (built GPUOcelot, has worked on DL deployment at Baidu), Davor from Tenstorrent, Micah from Mihira AI (worked at NVIDIA/AIT in the past), and Nicholas, the author of "The CUDA Handbook" (see the recording here).
Why is CUDA NVIDIA's Moat?
Greg started by stating that the SW stack is the most important thing to get right; CUTLASS enables 90+% HW utilization for many kernels, and NVIDIA achieved that by pouring engineer-hours into intense cache blocking and other optimizations.
Micah followed by saying it's not just the SW, but keeping the SW in sync with the HW. You can't stick CUDA on another HW platform and get the same performance; AMD had its own programming model, but it wasn't mature and in sync/ubiquitous across its HW platforms.
Nicholas pointed out that NVIDIA took a risk to build CUDA in the first place, sacrificing die area for general purpose programmable logic. CUDA is also highly portable: taking speculative risks in HW (such as power tradeoffs in Pascal and Volta) is enabled by portability.
Davor said that NVIDIA puts lots of effort into benchmarks: MLPerf, HPC applications, showing that CUDA works for many domains. HW vendors should not try to reinvent the stack - reuse host and device APIs, programming models, device integration (as PCIe cards), just as Tenstorrent does.
The CUDA Language
Micah said that CUDA C++ can't live for much longer; we need new languages as we can't rely on undergrads knowing C++. Davor noted that by the end of his grad school life, all parallel programming became CUDA, and you increasingly need the knowledge and ability to inject code across the full stack (raw PTX, CUDA, CUTLASS, cuDNN, Pytorch). Customers for new HW platforms can't be expected to rewrite lots of primitives on their own.
Greg mentioned that CUDA is a huge ecosystem including Tensor cores with massive local parallelism, NVLink massive scale out, and structured sparsity / MoE support. NVIDIA has optimized for these applications and features, but there could be an opportunity to optimize for a different application target. All these pillars of NVIDIA's architecture are at their limits (supposedly).
Nicholas said that the SOTA is advancing very quickly, and NVIDIA is the best platform for researchers to evaluate their new operators / training techniques. Davor pointed out that SIMT was the original CUDA programming model - there were no Tensor Cores, no Tensor Memory Accelerators. NVIDIA keeps throwing in more stuff, and it isn't even CUDA as it was when Nicholas wrote the book.
Accelerator Programming Models
Someone from the audience asked a question. There are many companies building AI hardware, but CUDA is tuned to NVIDIA's HW. If you try to build another accelerator, it won't be ideal to compile CUDA to that target. What are your thoughts on building consensus for the software stack?
Micah said that the AI world likes numpy-based Python, and if we can compile that to your accelerator, there isn't a CUDA problem anymore. Old CUDA code has CUDA assumptions and features in it, so we should throw it away and rewrite it (like Tenstorrent has done). But there is no solution for porting existing CUDA to your accelerator, and he doesn't know what the bridge is - he can't imagine someone can develop a robust and performant version of a CUDA bridge.
Davor concurred that if you have Pytorch as input, then we can use vertically stovepiped compilers (from the user's perspective). However, if you're talking about porting cuDNN / cuBLAS, that is much harder and will require manual effort.
Greg hoped that PTX would be performance portable, but it actually is not. Today, PTX is tuned for every device architecture (it would be hard to go from PTX to your accelerator). Davor concurred that we can't rely on PTX as the lowest level, and we need to educate developers / go back to first principles for a parallel programming model. He even suggested that we can use Cursor to translate PTX to another low-level IR and this wouldn't be an issue in the future.
Commentary
OK, after reading all this, you may be wondering: doesn't this discussion imply that CUDA can't be easily beaten? My feeling was that the panel just reinforced CUDA's dominance and why it is here to stay. Some of the suggestions by the panelists, such as using LLMs to write kernels for new hardware targets, seem far-fetched.
Crossing the CUDA Moat
Onto the next set of presentations.
Mako
The first talk was by Waleed Atallah of Mako (see the recording). His company builds AI-powered (read: LLM-in-a-loop) GPU kernel generation and optimization tools.
His main claim is that with new DNN operators, more GPU kernels need to be written and tuned.
It took 5 years from the original attention paper to get FlashAttention, so there is clearly lots of inefficiency and hard manual work required to produce optimized / fused GPU kernels.
They've built some kind of "auto-tuned" AI stack where they sweep over the space of torch.compile options and kernel implementations, with a global cache of all the optimizations they've tried and their compiler artifacts.
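I'd guess the sweep looks something like the toy sketch below (my speculation, not Mako's actual stack): enumerate torch.compile option combinations, time each compiled variant on a representative input, and keep the best one, with a real system also caching the winning artifacts per model and shape.

```python
# Toy sketch of a torch.compile option sweep (my speculation, not Mako's actual stack).
import time
import torch

def sweep_compile_options(model: torch.nn.Module, example_input: torch.Tensor):
    candidates = [
        {"mode": "default"},
        {"mode": "reduce-overhead"},
        {"mode": "max-autotune"},
    ]
    best = None
    with torch.no_grad():
        for opts in candidates:
            compiled = torch.compile(model, **opts)
            compiled(example_input)                   # warm-up: triggers compilation
            if example_input.is_cuda:
                torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(10):
                compiled(example_input)
            if example_input.is_cuda:
                torch.cuda.synchronize()
            elapsed = time.perf_counter() - start
            if best is None or elapsed < best[1]:
                best = (opts, elapsed, compiled)
    return best  # (winning options, elapsed time, compiled module)
```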
He mentions that they're working on LLM powered kernel generation and pointed to KernelBench as a good starting point. There is work to be done on teaching an LLM how to optimally use the hardware, add the right documentation in the context, and use profiling data as feedback for optimization.
This guy and Mako aren't incompetent per se, but the talk had nothing to do with ROCm or HIP. The techniques described were also very vague.
Lemurian Labs
Up next was Jay Dawani from Lemurian Labs. Now this guy was a little weird: he said lots of words with very little meaning (see the recording). He blathered on about how they built some compiler and "dynamic runtime" that can "solve heterogeneity" (Pytorch to CPU, GPU, NPU, … and so forth). And of course, his magic compiler is able to do something 30-40% faster than CUDA (the something was never specified).
My initial impression from this talk is that this guy is a smooth-talking scam artist. In the past, he pushed for the design of a custom chip that does math in a logarithmic number system instead of INT8 or the like. It seems this idea, and the custom silicon, has been abandoned, or at least relegated to second priority.
Spectral Compute
Up next was Michael from Spectral Compute (see the recording). Now, this guy and his team are highly competent. They built a compiler called SCALE which can compile unmodified CUDA programs for AMD GPU targets. The work that must have gone into this is insane.
Their compiler is fully compatible with nvcc semantics and quirks; they actually duplicate nvcc's oddities precisely.
They claim to have 90+% compatibility for math APIs and 70% for runtime APIs, so far.
Another caveat is that CUDA code is often written with the assumption of a 32-thread warp; their compiler will detect if your code is written this way and warn you to make it generic, to accommodate AMD's 64-thread wavefront (warp) size.
They even support inline PTX in CUDA kernels by translating it to another IR and then re-targeting it for the AMD APIs.
What's even more impressive is that they can outperform HIP (by 2x on average on MI300X) on the Rodinia benchmark by compiling its CUDA code out-of-the-box. This company has done lots of GPU work, including building a GPU-accelerated regex engine! This was by far the most compelling talk at the summit, and I hope they have the strength to continue their work.
Pre-training Beyond CUDA
OK so far, you may have noticed that the speakers have only barely mentioned AMD GPUs in their talks. This trend shall continue.
Zyphra
The next talk was from Quentin Anthony, a Model Training Lead at Zyphra (see the recording). All I could gather from it is that making sure layer dimensions are powers of 2 is quite important for performance. They wrote a blog post, "The Zyphra Training Cookbook", which I think is pretty good.
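The power-of-2 claim is easy to sanity-check yourself with a crude matmul timing like the one below (my own toy benchmark, not Zyphra's; exact numbers depend heavily on the GPU and library versions), comparing an inner dimension of 4096 against 4095.

```python
# Crude check of the "layer dims should be powers of 2" advice: time a matmul with a
# power-of-2 inner dimension vs. one element off. Results vary by GPU and library version.
import time
import torch

def time_matmul(k: int, iters: int = 50) -> float:
    a = torch.randn(4096, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, 4096, device="cuda", dtype=torch.float16)
    a @ b                                  # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    for k in (4096, 4095):
        print(f"K={k}: {time_matmul(k) * 1e3:.3f} ms per matmul")
```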
UCLA VAST Lab
The next speaker was Neha Prakriya from UCLA (see the recording), from Jason Cong's lab.
They looked into techniques to minimize the corpus of pretraining text you need, using embedding and sampling: sounds sensible.
And then they did some fine-tuning and distillation using their sampling approach with a small 7B Llama and showed it worked well.
They used this technique to develop an HLS coding assistant (it just injects #pragmas in the right places).
OK, fine, but what does this have to do with AMD? Turns out there are lots of people doing this exact same research, but AMD gave her lab an MI300X, so she ran this experiment on one AMD GPU. Everyone else uses NVIDIA, and I'm sure she wanted to too.
RWKV
Next was Eugene Cheah from featherless.ai. I'll admit my knowledge of his RWKV architecture is very weak, so I can't judge, but he claims it achieves linear attention scaling with no quadratic KV-cache growth. Again, hardly any mention of AMD GPUs, though near the end he rambled on about building a "personal AGI" that can tune its memory at night and extract your experiences during the day.
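For what it's worth, the pitch behind linear-attention-style models goes roughly like the sketch below. This is a generic, unnormalized linear-attention recurrence for illustration only, not RWKV's actual update rule: instead of appending keys and values to a cache that grows with sequence length, each token is folded into a fixed-size state.

```python
# Generic (unnormalized) linear-attention recurrence, for illustration only; RWKV's
# actual formulation differs. The point: the state S stays d x d regardless of sequence
# length, whereas a standard KV cache grows linearly (and attention cost quadratically).
import torch

def linear_attention_decode(q_seq: torch.Tensor, k_seq: torch.Tensor, v_seq: torch.Tensor):
    d = q_seq.shape[-1]
    S = torch.zeros(d, d)                      # fixed-size recurrent state
    outputs = []
    for q, k, v in zip(q_seq, k_seq, v_seq):   # one token at a time, as in decoding
        S = S + torch.outer(k, v)              # fold the new token's key/value into the state
        outputs.append(q @ S)                  # output depends only on q and the compact state
    return torch.stack(outputs)
```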
Are You Bored Yet?
So was everyone there. The speakers were so dull that the majority of the audience got up and started gabbing at the back near the bar. The background noise got so loud that the organizers had to keep shushing them, since you could hardly hear the speakers.
Post-Training Beyond CUDA
Undeterred, we move onto the next session.
Higgsfield AI
We heard from a former GenAI engineer at Snapchat, Alex, who now runs Higgsfield AI. This is a video generation tool that pulls from open source video generation models (e.g. Wan) and stitches together an entire pipeline: your product images or ideas, storyboarding, dialogue generation, camera angle and action scripting, and final video generation.
The world's first multi-agent AI video creation platform that turns story ideas into ready-to-watch, long-form content.
Alex showed us a plot of 'pixels generated over time' and noted that models can generate more and more pixels today, to the point where soon they will be able to generate TV shows and movies. He says that AI-generated videos outperform human-made videos on social media, and that in the future all pixels will be model-generated. "AI generated videos are outperforming 'conventional' video on social media" (he actually said this). What an inspiring future!
He showed us an ad they generated for Monster Energy, where a Viking climbs a mountain, reaches the top, and drinks a can of Monster. No cameras, actors, directors, or editors required! So cool! Of course, it looked absolutely ridiculous.
The next step is of course AI-generated music videos. He then showed us an example of a rap music video - made in under a day by one person with "studio level quality and camera control and visual elements that typically require teams of 10 people + weeks of work", "Hollywood level quality", "you can make the final scene of Game of Thrones yourself" (he actually said this). TensorWave didn't share the recording of this talk on Youtube (I wonder why), and I was too busy laughing to take a video, but to get an idea of what the rap video was like, you can ask the Mavericks.
OK, but what does this have to do with AMD? Turns out, he claims, MI300X is 25% faster than "NVIDIA" for some video generation model. OK, fine. Real convincing.
OpenBabylon
The insanity continues with Anton and Yurii from OpenBabylon.
They've built some infrastructure for adapting LLMs for underrepresented languages. That on its own is pretty useful.
But this talk was about the aftermath of the war in Ukraine, specifically LLM-based generation of tests for Ukrainian students, on behalf of their teachers. It's quite simple, really: there is a war, so let's fine-tune an LLM. But Ukraine may not have many servers to host a large model, so they chose a small one.
They noticed that models have worse instruction following and emit more harmful content in non-English languages, and that they encode "Silicon Valley centric biases". They fine-tuned Llama 3.1 8B with "domain-specific knowledge" (i.e. Ukrainian history) and Ukrainian-optimized tokenization. And guess what: the teachers accepted 55% of the test questions that came from their model.
But what about AMD? Well, they used the TensorWave cloud to fine-tune this Llama model. Nice.
Inference Beyond CUDA
At this point, the audience is completely restless and is making lots of noise. The panels have gone on for too long and the event is way behind schedule. But the most important talk is coming.
Introducing the panel: Will Beauchamp of Chai Research, Kyle Bell of TensorWave, and Paul Merolla of MK1 (see the recording).
Kyle didn't say much, and Paul focused on his optimized inference service, MK1. He claimed that MI300X + MK1 > H100, and that MI300X outperforms H100 by 2x in request throughput. MK1 handles the inference service for Chai, which used to use NVIDIA but has now supposedly moved to AMD. MK1 can do CAG (cache-augmented generation) on MI300X, where the higher memory capacity makes it easier. Paul also showed that his inference setup's all-reduce (QuickReduce) is 2.25x faster than RCCL on 2x/4x MI300X.
And then we have to talk about Will, the founder of Chai Research.
You see, Chai is a very serious company, with very skilled ex-quant researchers, all from highly ranked universities. They are paid a hefty sum of $500-750k in cash + equity. Their product, as Will puts it, is a "consumer platform for AI". Now, what does that mean?
His company builds an app: Chai AI
Our app revolutionizes the way we interact with AI chatbots. With our advanced technology, our chatbots think and reply like real characters, complete with authentic voices. You can choose from a massive selection of characters, each with their own unique voice and personality. Whether you're chatting with your favorite celebrity, a beloved character from a movie or TV show, or a historical figure, you'll feel like you're talking to a real person.
So Will, why did you start Chai AI? You see, "all of AI has been shaped by a small group of people (old men, the West)", but maybe teenage girls need access too. He notes, "teenage girls can create better makeup advice vs our AI" and we can incorporate that into the chat generation. "Teenagers are spending 1.5 hrs a day interacting with AI. This is a monumental shift".
As you can see from the Chai Research 2025 Roadmap they're doing quite a good job with teenager 'retention'. These very honorable ex-quants have been spending their days optimizing LLM tokens based on "real-world engagement". And what kind of engagement exactly are we talking about here?
A typical post on /r/ChaiApp (left) and popular chatbots on c.ai (right)
Business is booming, of course. They are very good at using a blend of low-cost LLMs and aggressive sampling with their engagement estimators to shoot tens of trillions of tokens every month straight into the brains of the rotting youth.
Very dark opinion: AI is going to enable a lot of morally questionable people to develop products and services that are very profitable but bad for the customers https://t.co/uZIgOprlsU
— PoIiMath (@politicalmath) March 22, 2025
"Morally questionable people"? Who could that be? All right, that's enough of Chai.
Pre-training Beyond CUDA
There was only one talk in this session, from Koosh Azimian of 310 AI. They do some kind of GenAI for biology. It's not clear whether they develop and train their own models or just provide an interface to open-source models. It seems interesting, but I can't comment much on their models or service.
The AMD angle is that they did some training on 6 nodes of 8x MI250 for 70 days. They claimed that their Pytorch models were ported to run on ROCm in one engineer-day without any code modifications.
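The "no code modifications" claim is plausible because ROCm builds of Pytorch deliberately keep the CUDA-flavored API surface: the same torch.cuda calls and "cuda" device strings dispatch to HIP underneath. A minimal sketch of what such "portable" code looks like (my illustration, not 310 AI's code):

```python
# The same script runs on either a CUDA build or a ROCm build of Pytorch: on ROCm,
# torch.cuda and the "cuda" device string are backed by HIP, so no edits are needed.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", torch.cuda.get_device_name(0) if device.type == "cuda" else "cpu")
print("ROCm/HIP build:", torch.version.hip is not None)   # torch.version.hip is None on CUDA builds

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)
model(x).sum().backward()                                  # identical training code on either backend
```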
Open Source Beyond CUDA
This was a panel / presentation with Philippe Tillet (OpenAI / Triton), Liz Li (AMD), Greg (TensorWave), and Andrey Cheptsov (dstack). Philippe says that OpenAI uses Triton for all of their kernels and that they work somewhat closely with AMD's hardware team. AMD itself has a Triton team which tries to keep the ROCm backend up to date, optimized, and upstreamed, with good performance on MI300X. Philippe then says that there are still cases where MI300 is slower than H100, but he's hoping that within a year or two they will reach parity.
Liz from AMD says that they've been moving their Pytorch CI/CD pipeline in-house, continuing to work on OSS models, making sure huggingface models run out of the box on AMD hardware, and trying to get DeepSeek to run reliably on an AMD cluster.
Andrey from dstack gave a talk on his container orchestration platform. Some Kubernetes-based thing that takes care of storage, data, models, training launches, and so forth. Overall, nothing that interesting.
Finally, we heard from Gregory of ScalarLM (see recording here). This appears to be an LLM training and inference software stack that runs out-of-the-box on MI300X.
Hot Takes Panel
The panelists were Dylan (SemiAnalysis), Anush (AMD, VP of AI), Darrick (TensorWave), Mark (Meta / GPU Mode), and Eugene (the RWKV guy). Now, this was interesting.
SemiAnalysis' Article
Dylan began by talking about his article benchmarking MI300X against H100, which was quite unfavorable towards AMD. He said that when AMD saw this, they got scared. Lisa called him in a panic.
His team had spent 5 months working with MI300X GPUs on TensorWave to compare against H100. Getting started with AMD was very rough: scaled dot product attention wouldn't even compute correct numbers, and there were memory leaks, among many other problems. AMD's Anush worked hard to fix the issues, but by the end of the 5-month period many still lingered; things kind of "worked", but out-of-the-box Pytorch performance was still bad, while NVIDIA gives you near-peak performance right away. NVIDIA has 50k GPUs for internal use that it uses to keep its library software up to standard, while AMD pays much less, ships custom driver builds for individual customers, and has no unified driver.
Lisa talked to SemiAnalysis the next day for 1.5 hrs. She was very receptive to feedback, and the ROCm user group said AMD engineers were happier now that management had finally given them a Pytorch CI/CD pipeline.
AMD's Anush took the brunt of this brutalization. He's very active on Twitter, indeed he's the face of AMD on Twitter. On stage, Anush was asked what AMD can do to improve.
Anush said that he acknowledges all the problems: AMD was working on things that weren't visible to the benchmarker, they had been too focused on hyperscalers, and were only slowly coming down to the long tail of customers. He knows the custom driver builds were bad, and says AMD is catching up now. He urged the audience to contact him directly if they encounter any issues, and says they are taking a first-principles approach to everything (supposedly). After all, the Pytorch CI is up and running, GPU Mode has some AMD GPUs now, and they're trying to get quantizers to work cleanly (with Mark). We're trying hard, stop bullying us so much, OK?
This part of the panel was just so brutal. You get the impression that AMD has no chance of being competitive versus NVIDIA.
TensorWave
The moderator asked why TensorWave went all-in on AMD. Customers ask them for NVIDIA, but they say no. Why?
Darrick said that it was about "alignment of ethos". TensorWave started in 2023 during the compute shortages, when all that existed was NVIDIA's locked, closed ecosystem with minimal open source. They wanted to solve that problem and provide a viable alternative to the market, so naturally they had to use AMD. AMD was the only HW provider that checked all the boxes: the resources to threaten NVIDIA's dominance, and public support of open source.
Yeah I'm sure TensorWave is AMD-only because of "ethos". Surely it has nothing to do with AMD funding their company.
Open Source
The moderator asks for hot takes on open source. What is the right way to do this?
Mark of GPU Mode says that NVIDIA leverages open source (e.g. Pytorch); you need the right leverage. Pytorch makes it easy to safely depend on it for the long term, which encourages contributions. GPU Mode started as just a reading group and quickly evolved to host parties, hackathons, and sponsorships. Open source attracts high-agency people.
Dylan claimed that NVIDIA released the Blackwell ISA only a month ago, precisely to prevent AMD from copying it. Anush remarked that AMD was a fast follower. But Mark didn't buy Dylan's argument: after all, people could just fork Pytorch, but they don't.
Rapid Fire Questions
- What if Taiwan is invaded?
- Dylan: everyone is affected.
- What about non-GPU custom hardware?
- Anush: AMD has a big footprint from embedded up to cluster, but we don't have a uniform software stack yet, we are trying. We're trying to make it possible to run a model on your laptop and port it to Instinct in the cloud without software hacking.
- Dylan: all hardware needs a dev ecosystem. Google can dogfood their own ecosystem including OpenXLA. There is no code in Pytorch to run any other accelerator. There are teams of people doing assembly programming per kernel. It's not going to work for any hardware company that can't make massive investments.
- Darrick: alternative accelerators are not worth offering, and are not good for customers. Only GPUs are flexible enough for new model architectures. Software is also the most mature on GPUs, nothing else.
- What will happen in next 2 years for TensorWave, which customers?
- Darrick: the past year was mostly about inference and software support, but we still need to solve training. This year is about training and scale.
- Is there anything beyond backprop with respect to scaling?
- Anush: at nod.ai we tried a few things. Keep an eye out. We just need to be ready for it.
- Eugene: distributed training and optimizers are still a pipe dream, but could happen eventually.
- NVIDIA created a mixed CPU and GPU product. What about AMD?
- Anush: AMD had silos of hardware excellence (Xilinx for FPGAs, datacenter CPUs). We're moving up the stack, and AMD has done well at backwards compatibility before, so we will keep that up.
- Rank the top 5 AI hardware vendors today and in 2030 if you think non-GPU architectures will become dominant
- Dylan: 1 is NVIDIA, 2 is TPU, 3 is AMD, 4 is Trainium, and 5 is nothing else. Unfortunately Trainium is even below AMD.
Final Talk by Anush - What's Next
And with that, we're onto our final talk of the day by AMD's Anush (see the recording).
He began with the chant of "developers, developers, developers!". Anush pledges to help you get started on ROCm. Lots of people complain about it, but it will get better! We added a Pytorch CI just recently; see, we're trying to improve performance and driver stability.
Epilogue
After this event concluded, I saw this Ask HN thread: "Why hasn’t AMD made a viable CUDA alternative?" I recommend you read it.
This ticket, finally closed after being open for 2 years, is a pretty good microcosm of the problem: users complaining that the docs don't even specify which cards work.
But it goes deeper. A valid complaint is that "this only supports one or two consumer cards!" A common rebuttal is that it works fine on lots of AMD cards if you set an environment flag to force the GPU architecture selection. The fact that this is so close to working on a wide variety of hardware, and yet doesn't, is exactly the vibe you get from the whole ecosystem.
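For reference, the flag people usually mean here is HSA_OVERRIDE_GFX_VERSION, which tells the ROCm runtime to treat your GPU as a supported gfx architecture. Something like the sketch below is the common community workaround (not official guidance, and the right version string depends on your card):

```python
# Common community workaround for consumer AMD cards that ROCm doesn't officially support:
# override the reported gfx architecture before the ROCm runtime initializes.
# "10.3.0" is the value often used for RDNA2 cards; your card may need a different string.
import os
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")   # must be set before importing torch

import torch
print("GPU visible to Pytorch:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```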
About a week after Beyond CUDA, SemiAnalysis released an article "The GPU Cloud ClusterMAX™ Rating System".
Sitting as undisputed #1 is NVIDIA's golden child, CoreWeave, while TensorWave (and any other AMD neocloud) is just a Bronze-tier GPU cloud. Certainly, we can expect the AMD neoclouds to move up the list, but it will take time.
There have been some more favorable articles from SemiAnalysis recently: "AMD 2.0 – New Sense of Urgency" and "AMD vs NVIDIA Inference Benchmark: Who Wins?". But just as AMD is starting to ramp the MI350X, NVIDIA is ramping the B200, and there is no comparison: the B200 wipes the floor with it.
Comedy Show
But wait, Beyond CUDA isn't done yet. You may have thought the summit itself was the comedy show, but no, there was an actual comedy show by the three Tech Roast guys! You should check out their Youtube channel.
There were too many good moments. Their very first roast was aimed at the entire audience who "were rejected from or were too poor to afford to attend GTC" 😆. Well I have to admit, I am too poor to attend GTC.
The comedians moved from 'founder' to 'founder', and it was hard to believe these 'founders' weren't audience plants: from Tixfix.ai, which has nothing to do with AI and whose entire dev team is located in Nepal, to the founder of Fillers AI, a plastic surgeon who uses "AI" to model the impact of lip and cheek fillers. These are real people! Reality is beyond any Silicon Valley parody.
In the final segment, the comedians pulled people from the crowd and had them make fools of themselves. In one act, they forced everyone to line up from least to greatest TC. Dylan, who was made part of this routine, very slyly walked over to the greatest TC corner 😆.
At the very end, Dylan and a girl from the crowd had to act in a skit to prove they weren't AIs. The skit called for the girl to console a man (played by a comic) who didn't have enough money to buy NVIDIA GPUs.
Comic: "NVIDIA GPUs aren't in stock 😭 and I can't afford them either 😭, is there any alternative? What should I do!? I need some GPUs". This poor girl had no clue what GPUs even were, she just went "I'm so sorry, hope you feel better, it's ok". And someone from the audience yelled out "Intel" and she became quite enthusiastic to the sad comic: "OHHH! You can buy Intel GPUs!" 😆
What a fitting end to the night.
Conclusion
I would like to thank TensorWave for being good sports, not giving up, making the event professional with pro videographers and photographers, and for inviting a bunch of us. But honestly, I wouldn't be surprised if this event was shadow-funded by NVIDIA. It just reinforced that AMD is way behind NVIDIA, and that if there is anyone who can go beyond CUDA, it is NVIDIA itself (indeed, just the next day at GTC, NVIDIA's Tile IR was announced).