What The Chip Happened · Investor Intelligence

MLPerf Training v6.0
What Investors Need To See

The new round of MLPerf Training results just dropped. Below is the data that moves the thesis for NVIDIA (NVDA), AMD (AMD), CoreWeave (CRWV), Nebius (NBIS) and the rest of the AI buildout. 227 submitted results, 24 organizations, 7 workloads. Every chart is computed from the published v6.0 results table.

Source: MLCommons MLPerf Training v6.0 (Closed division) · 227 results · 24 orgs · 7 benchmarks · Lower time = better

86.8% of all results ran on NVIDIA silicon (197 of 227)

7 / 7 benchmarks NVIDIA submitted to. AMD: 3 of 7

5.4× GB300 vs H200 generational speedup (Llama2-70B LoRA, 8 GPU)

8,192 GPU max-scale run. CoreWeave + Azure both hit it

The Headline: NVIDIA Still Owns The Bench

MLPerf is the closest thing to a referee in AI hardware. The first read is simple: who even showed up, and on what. NVIDIA submitted across every workload. AMD competed on the three easiest ones and sat out the four that define frontier training.

1. Submission share by accelerator vendor

Count of submitted results, NVIDIA-silicon vs AMD-silicon

For investors: NVIDIA silicon appears in ~87% of all published results. Partners only submit configs that look good, so this is the optimistic case for the field, and AMD is still a sliver of it.
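The share figure can be reproduced from the two counts quoted above. This is a minimal sketch, not the author's chart code: the AMD count is inferred as the remainder, which assumes every result ran on either NVIDIA or AMD silicon, and the `share` helper is a name introduced here for illustration.

```python
# Counts taken from the article. The AMD count is inferred as the remainder,
# which assumes every result ran on either NVIDIA or AMD silicon.
TOTAL_RESULTS = 227
NVIDIA_RESULTS = 197


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


amd_results = TOTAL_RESULTS - NVIDIA_RESULTS  # inferred, not a published figure
nvidia_share = share(NVIDIA_RESULTS, TOTAL_RESULTS)
amd_share = share(amd_results, TOTAL_RESULTS)

print(f"NVIDIA: {NVIDIA_RESULTS} results, {nvidia_share}%")
print(f"AMD (inferred): {amd_results} results, {amd_share}%")
```

Under that two-vendor assumption, the remainder works out to 30 results, or about 13% of the round.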

2. Results per benchmark, by vendor

Where each vendor's silicon actually competed

AMD silicon shows up only on Llama2-70B LoRA, Llama3.1-8B and Flux.1. The harder, larger workloads are NVIDIA-only.

3. Chip-family submission counts

Which accelerators carried the round

Blackwell Ultra (GB300 + B300) alone is 136 of 227 results. Hopper is now a tail. This is a Blackwell round.

4. Benchmark coverage matrix: who can run what

A check means that vendor's silicon posted at least one result on that workload

Workload	What it stresses	NVIDIA	AMD
Llama2-70B LoRA	Fine-tuning, mid-size	✓	✓
Llama3.1-8B	Pre-train, small LLM	✓	✓
Flux.1	Image / diffusion	✓	✓
GPT-OSS-20B	New MoE pre-train	✓	absent
DeepSeek-V3 671B	Frontier MoE, huge scale	✓	absent
Llama3.1-405B	Large dense LLM	✓	absent
DLRM-DCNv2	Recommenders	✓	absent

This is the single most important slide for the NVDA vs AMD debate. AMD has no published result on DeepSeek-V3, Llama3.1-405B, GPT-OSS or DLRM. Those are the workloads hyperscalers actually buy fleets for. Competing on 8-GPU fine-tuning is real progress, but the frontier-scale column is still empty.

Blackwell Ultra: The Generational Leap

This is the round where GB300 (Blackwell Ultra) shows what it does to the last generation. For anyone modeling the H100-to-Blackwell upgrade cycle, these are the multiples that justify the capex.

5. GB300 vs GB200 vs H200: Llama2-70B LoRA

8 GPUs, time-to-train in minutes (lower is better)

GB300 trains 5.4× faster than H200 at the same 8-GPU count. That is the number that pulls forward the refresh.

6. GB300 vs GB200 vs H200: Llama3.1-8B

8 GPUs, time-to-train in minutes (lower is better)

4.05× over H200 on pre-training. The uplift holds across workload types, not just the easy one.

7. GB300 scaling: Llama2-70B LoRA

Time-to-train as GPU count grows, 4 → 512

Scaling from 8 GPUs (5.6 min) to 512 GPUs cuts the same job to 24 seconds (0.40 min). The interconnect (NVLink / NVL72) is doing the work here.
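One way to read a scaling chart like this is to compare the wall-clock speedup with the increase in GPU count. The sketch below uses only the two data points quoted above. The `scaling_summary` helper and the "share of linear" framing are illustrative assumptions, not metrics MLCommons publishes.

```python
def scaling_summary(gpus_small, minutes_small, gpus_large, minutes_large):
    """Compare two runs of the same job at different GPU counts."""
    speedup = minutes_small / minutes_large  # how much shorter the wall-clock time got
    gpu_ratio = gpus_large / gpus_small      # how many times more GPUs were used
    efficiency = speedup / gpu_ratio         # 1.0 would be perfectly linear scaling
    return speedup, gpu_ratio, efficiency


# 8 GPUs at 5.6 min and 512 GPUs at 0.40 min (24 seconds), per the article.
speedup, gpu_ratio, efficiency = scaling_summary(8, 5.6, 512, 0.40)
print(f"{speedup:.1f}x faster on {gpu_ratio:.0f}x the GPUs "
      f"({efficiency:.0%} of linear)")
```

On these inputs the job runs about 14× faster on 64× the GPUs. Scaling well short of linear is common for a job this short, so the ratio is a reading aid rather than a verdict on the interconnect.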

8. GB300 scaling: DeepSeek-V3 671B

Frontier MoE, 256 → 8,192 GPUs

Near-clean scaling on a 671-billion-parameter model out to 8,192 GPUs. This is the rack-scale moat, not the single-chip spec sheet.

NVIDIA vs AMD: Head To Head

On the three workloads where both vendors showed up, here is the honest scoreboard. AMD's MI355X is competitive. It is not ahead.

9. Full chip leaderboard: Llama2-70B LoRA

Best 8-GPU result per chip, minutes (lower is better)

MI355X (8.27 min) lands 4th, behind GB300, B300 and B200, ahead of GB200 and H200. A real result for AMD.

10. Full chip leaderboard: Llama3.1-8B

Best 8-GPU result per chip, minutes (lower is better)

MI355X (86.8 min) again sits mid-pack, beating Hopper but behind the full Blackwell stack.

11. Closest apples-to-apples: B300 vs MI355X

Both 8-GPU air/liquid nodes, same generation timing

Against the comparable 8-GPU Blackwell part, NVIDIA's B300 is ~21-26% faster than MI355X. The gap narrowed versus prior rounds. It did not close.

12. Flux.1: AMD's only non-LLM entry

Image model, best result per chip at matched scale (min)

Flux.1 is the only non-LLM workload AMD entered, but the gap is wide. At matched 64 GPUs, MI325X (92.4 min) takes roughly 2× as long as B300 (46.7 min), and AMD ran older CDNA3 silicon here, not its newest MI355X. Participation, not parity.
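Head-to-head claims such as "X% faster" or "N× slower" can be computed two ways, and the article does not say which convention its percentages use. The sketch below applies both to the Flux.1 times quoted above. The function names are hypothetical and introduced only for illustration.

```python
def speed_ratio(slower_min, faster_min):
    """How many times longer the slower run took."""
    return slower_min / faster_min


def time_saved(slower_min, faster_min):
    """Fraction of the slower run's wall-clock time that the faster run avoids."""
    return 1 - faster_min / slower_min


mi325x_min, b300_min = 92.4, 46.7  # Flux.1 at 64 GPUs, from the article
ratio = speed_ratio(mi325x_min, b300_min)
saved = time_saved(mi325x_min, b300_min)
print(f"MI325X takes {ratio:.2f}x as long; B300 finishes in {saved:.1%} less time")
```

The two conventions give different-looking numbers for the same pair of runs, so it helps to check which one a quoted percentage refers to before comparing it across charts.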

AMD's Real Progress, And The Gap

The bear case on AMD is "they never show up." This round answers part of that. CDNA4 (MI350 series) is a genuine leap over CDNA3. The remaining question is breadth and scale.

13. MI355X vs MI300X: generational uplift

Llama2-70B LoRA, 8 GPUs, minutes (lower is better)

MI355X is 3.46× faster than MI300X on the same job. CDNA4 plus FP4 is a real step. AMD's own silicon is the story here, not the NVIDIA comparison.

14. Who is building AMD systems

AMD-silicon results by submitting organization

OEM breadth is improving: Dell, HPE, Supermicro, Cisco, MiTAC and Oracle all posted MI results. The supply chain is forming even if scale is not there yet.

15. The breadth gap, visualized

Benchmarks attempted: NVIDIA vs AMD

For an AMD bull, the homework is simple. Watch for the first MI355X result on DeepSeek-V3 or a 405B-class model at 1,000+ GPUs. Until that prints, MLPerf positions AMD as a strong node-level option, not a fleet-level NVIDIA replacement.

The Neocloud Scoreboard: CRWV, NBIS, Oracle

MLPerf is now a marketing channel for the GPU clouds. Submitting a clean, large-scale result is a credibility signal to enterprise buyers. Here is who showed up and how big they went.

16. Cloud submitters: results and max scale

Submitted results (bars) and largest single run in GPUs (labels)

Nebius posted the most cloud results (12). CoreWeave and Azure went the largest at 8,192 GPUs. Oracle covered the most workloads.

17. CoreWeave's flex: DeepSeek-V3 scaling

CRWV on GB300, 2,048 → 8,192 GPUs, minutes

CoreWeave's 8,192-GPU DeepSeek run (2.02 min) beat NVIDIA's own 8,192-GPU GB200 result (3.34 min). A neocloud posting a best-in-round number at frontier scale is the CRWV bull case in one chart.

18. Scale leaderboard: every run at 2,048+ GPUs

The biggest jobs in the round. This is where fleet operators prove themselves.

Operator	Chip	Workload	GPUs	Minutes

Only three names run jobs this big: NVIDIA, CoreWeave and Azure. CoreWeave is the only pure-play neocloud on the list, and it owns the top DeepSeek result.

19. Nebius (NBIS) workload spread

Best NBIS result per workload, minutes (mixed scale)

Nebius spread across Llama3.1-8B, GPT-OSS and Flux.1 on B300/GB300. Broad, mid-scale, very current silicon. A "we have the new chips and they work" statement.

The FP4 Training Era Has Started

The quiet structural story in v6.0: four-bit floating point is now being used in actual training, not just inference. This is a software-and-format moat that favors whoever's numerics the ecosystem standardizes on.

20. Lowest precision used in linear layers

Count of results by numeric format

64 of 227 results trained at 4-bit (NVFP4 or MXFP4). A year ago training at FP4 was a research claim. It is now a submitted MLPerf result.

21. The format war: NVFP4 vs MXFP4

Which vendor uses which 4-bit format

NVIDIA pushes its own NVFP4 (39 results); AMD uses the open MXFP4 (25). Two formats, two software stacks. Watch which one model builders adopt by default.
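The FP4 counts above translate into shares as follows. This is a small sketch using only the figures quoted in the text. The variable names are illustrative, and nothing here comes from the MLCommons tooling.

```python
TOTAL_RESULTS = 227
fp4_results = {"NVFP4": 39, "MXFP4": 25}  # counts from the article

fp4_total = sum(fp4_results.values())
fp4_share = round(100 * fp4_total / TOTAL_RESULTS, 1)  # FP4 as % of all results

# Split of the FP4 results between the two formats, in percent.
format_split = {
    name: round(100 * count / fp4_total, 1)
    for name, count in fp4_results.items()
}

print(f"{fp4_total} of {TOTAL_RESULTS} results used FP4 ({fp4_share}%)")
for name, pct in format_split.items():
    print(f"  {name}: {pct}% of FP4 results")
```

On these counts, a little over a quarter of the round trained at 4-bit, with NVFP4 accounting for roughly three in five of those results.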

Ecosystem Breadth & Best-In-Class

The last lens is the supply chain. A platform wins partly by how many partners ship it. Here is OEM participation and the fastest result on record per workload.

22. Submissions by organization

Every submitting org, result count

Dell, HPE and Supermicro lead OEM submissions. 24 organizations submitted, almost all selling NVIDIA-based systems. The channel is the moat as much as the chip is.

23. Best-in-round result per workload

Fastest published time-to-train and the system that set it

Workload	Fastest (min)	Chip	GPUs	Max scale tested
Llama2-70B LoRA	0.40	GB300	512	512
DLRM-DCNv2	0.67	GB300	64	64
DeepSeek-V3 671B	2.02	GB300 (CoreWeave)	8,192	8,192
Llama3.1-8B	4.46	GB200	1,024	1,024
GPT-OSS-20B	7.43	GB300	512	512
Llama3.1-405B	7.07	GB200 (Azure)	8,192	8,192
Flux.1	17.12	GB300	512	512

Every single best-in-round time was set on NVIDIA Blackwell. No workload in this table has a fastest result on non-NVIDIA silicon. That is the position investors are pricing.
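The claim can be checked mechanically against the table rows. The sketch below copies the seven rows as tuples and tests two things: whether every winning chip belongs to the Blackwell family, and whether each record was set at the largest scale tested. The `BLACKWELL_CHIPS` set is an assumption built from the chip names used in this article, not an official product list.

```python
from collections import Counter

# Rows copied from the best-in-round table:
# (workload, fastest minutes, chip, GPUs, max scale tested)
best_in_round = [
    ("Llama2-70B LoRA", 0.40, "GB300", 512, 512),
    ("DLRM-DCNv2", 0.67, "GB300", 64, 64),
    ("DeepSeek-V3 671B", 2.02, "GB300", 8192, 8192),
    ("Llama3.1-8B", 4.46, "GB200", 1024, 1024),
    ("GPT-OSS-20B", 7.43, "GB300", 512, 512),
    ("Llama3.1-405B", 7.07, "GB200", 8192, 8192),
    ("Flux.1", 17.12, "GB300", 512, 512),
]

# Assumption: Blackwell-family chip names as they appear in this article.
BLACKWELL_CHIPS = {"GB200", "GB300", "B200", "B300"}

all_blackwell = all(chip in BLACKWELL_CHIPS for _, _, chip, _, _ in best_in_round)
set_at_max_scale = all(gpus == top for _, _, _, gpus, top in best_in_round)
wins_by_chip = Counter(chip for _, _, chip, _, _ in best_in_round)

print(f"All records on Blackwell: {all_blackwell}")
print(f"All records set at max scale tested: {set_at_max_scale}")
print(f"Records per chip: {dict(wins_by_chip)}")
```

The second check matters for reading the table: each record here coincides with the largest run submitted for that workload, so these times reflect scale as much as per-chip speed.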

24. The one-glance summary

Where each name stands after v6.0

Name	What v6.0 says	Read
NVIDIA (NVDA)	On all 7 workloads, every best-in-round time, FP4 training, 8,192-GPU scaling	Lead intact
AMD (AMD)	MI355X 3.46× over MI300X, mid-pack on 3 of 7, absent from frontier scale	Closing, not closed
CoreWeave (CRWV)	Top DeepSeek-V3 result, 8,192-GPU run, beat NVIDIA's own GB200 number	Credibility win
Nebius (NBIS)	12 results, newest Blackwell, broad mid-scale coverage	Showing up
OEMs (Dell/HPE/SMCI)	Heaviest submitters, almost all NVIDIA-based	Channel locked to NVDA
