The new round of MLPerf Training results just dropped. Below is the data that moves the thesis for NVIDIA (NVDA), AMD (AMD), CoreWeave (CRWV), Nebius (NBIS) and the rest of the AI buildout. 227 submitted results, 24 organizations, 7 workloads. Every chart is computed from the published v6.0 results table.
Source: MLCommons MLPerf Training v6.0 (Closed division) · 227 results · 24 orgs · 7 benchmarks · Lower time = better
86.8% of all results ran on NVIDIA silicon (197 of 227)
7 / 7 benchmarks NVIDIA submitted to. AMD: 3 of 7
5.4× GB300 vs H200 generational speedup (Llama2-70B LoRA, 8 GPUs)
8,192-GPU max-scale run. CoreWeave + Azure both hit it
01
The Headline: NVIDIA Still Owns The Bench
MLPerf is the closest thing to a referee in AI hardware. The first read is simple: who even showed up, and on what. NVIDIA submitted across every workload. AMD competed on the three easiest ones and sat out the four that define frontier training.
1. Submission share by accelerator vendor
Count of submitted results, NVIDIA-silicon vs AMD-silicon
For investors: NVIDIA silicon appears in ~87% of the results partners chose to publish. Partners only submit configs that look good, so this is the optimistic case for the field, and AMD is still a sliver of it.
2. Results per benchmark, by vendor
Where each vendor's silicon actually competed
AMD silicon shows up only on Llama2-70B LoRA, Llama3.1-8B and Flux.1. The harder, larger workloads are NVIDIA-only.
3. Chip-family submission counts
Which accelerators carried the round
Blackwell Ultra (GB300 + B300) alone is 136 of 227 results. Hopper is now a tail. This is a Blackwell round.
4. Benchmark coverage matrix: who can run what
A check means that vendor's silicon posted at least one result on that workload
Workload | What it stresses | NVIDIA | AMD
Llama2-70B LoRA | Fine-tuning, mid-size | ✓ | ✓
Llama3.1-8B | Pre-train, small LLM | ✓ | ✓
Flux.1 | Image / diffusion | ✓ | ✓
GPT-OSS-20B | New MoE pre-train | ✓ | absent
DeepSeek-V3 671B | Frontier MoE, huge scale | ✓ | absent
Llama3.1-405B | Large dense LLM | ✓ | absent
DLRM-DCNv2 | Recommenders | ✓ | absent
This is the single most important slide for the NVDA vs AMD debate. AMD has no published result on DeepSeek-V3, Llama3.1-405B, GPT-OSS or DLRM. Those are the workloads hyperscalers actually buy fleets for. Competing on 8-GPU fine-tuning is real progress, but the frontier-scale column is still empty.
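The coverage matrix above is small enough to encode directly. The sketch below transcribes it into a dictionary and counts workloads per vendor; the names `COVERAGE`, `covered` and `missing` are illustrative only, and the data is just the table as printed here, not a pull from the MLCommons results file.

```python
# Coverage matrix transcribed from the table above.
# True means the vendor's silicon posted at least one result on that workload.
COVERAGE = {
    "Llama2-70B LoRA": {"NVIDIA": True, "AMD": True},
    "Llama3.1-8B": {"NVIDIA": True, "AMD": True},
    "Flux.1": {"NVIDIA": True, "AMD": True},
    "GPT-OSS-20B": {"NVIDIA": True, "AMD": False},
    "DeepSeek-V3 671B": {"NVIDIA": True, "AMD": False},
    "Llama3.1-405B": {"NVIDIA": True, "AMD": False},
    "DLRM-DCNv2": {"NVIDIA": True, "AMD": False},
}


def covered(vendor: str) -> list[str]:
    """Workloads where the vendor posted at least one result."""
    return [w for w, row in COVERAGE.items() if row[vendor]]


def missing(vendor: str) -> list[str]:
    """Workloads where the vendor posted nothing."""
    return [w for w, row in COVERAGE.items() if not row[vendor]]


for vendor in ("NVIDIA", "AMD"):
    print(f"{vendor}: {len(covered(vendor))} of {len(COVERAGE)} workloads")
print("AMD absent from:", ", ".join(missing("AMD")))
```

Re-running the same count against the next round's table is a quick way to see whether the empty column starts to fill in.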
02
Blackwell Ultra: The Generational Leap
This is the round where GB300 (Blackwell Ultra) shows what it does to the previous generation. For anyone modeling the H100-to-Blackwell upgrade cycle, these are the multiples that justify the capex.
5. GB300 vs GB200 vs H200: Llama2-70B LoRA
8 GPUs, time-to-train in minutes (lower is better)
GB300 trains 5.4× faster than H200 at the same 8-GPU count. That is the number that pulls forward the refresh.
6. GB300 vs GB200 vs H200: Llama3.1-8B
8 GPUs, time-to-train in minutes (lower is better)
4.05× over H200 on pre-training. The uplift holds across workload types, not just the easy one.
7. GB300 scaling: Llama2-70B LoRA
Time-to-train as GPU count grows, 4 → 512
Going from 8 GPUs (5.6 min) to 512 GPUs cuts the same job to 24 seconds (0.40 min). The interconnect (NVLink / NVL72) is doing the work here.
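To put a number on that scaling, a minimal sketch using only the two endpoints quoted above (5.6 min at 8 GPUs, 0.40 min at 512 GPUs). The function name `scaling_efficiency` is illustrative. It compares the measured speedup with the ideal linear speedup implied by the GPU count.

```python
def scaling_efficiency(base_gpus: int, base_min: float,
                       big_gpus: int, big_min: float) -> tuple[float, float, float]:
    """Return (measured speedup, ideal linear speedup, efficiency)."""
    speedup = base_min / big_min   # how much faster the larger run finished
    ideal = big_gpus / base_gpus   # speedup if time fell 1:1 with GPU count
    return speedup, ideal, speedup / ideal


# Endpoints quoted in the text: GB300, Llama2-70B LoRA.
speedup, ideal, efficiency = scaling_efficiency(8, 5.6, 512, 0.40)

print(f"Measured speedup: {speedup:.1f}x")
print(f"Ideal speedup:    {ideal:.0f}x")
print(f"Efficiency:       {efficiency:.1%}")
```

Efficiency below 100% is expected on a job this short, since fixed per-run costs likely do not shrink as GPUs are added. The ratio is most useful for comparing vendors or rounds at the same endpoints.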
8. GB300 scaling: DeepSeek-V3 671B
Frontier MoE, 256 → 8,192 GPUs
Near-clean scaling on a 671-billion-parameter model out to 8,192 GPUs. This is the rack-scale moat, not the single-chip spec sheet.
03
NVIDIA vs AMD: Head To Head
On the three workloads where both vendors showed up, here is the honest scoreboard. AMD's MI355X is competitive. It is not ahead.
9. Full chip leaderboard: Llama2-70B LoRA
Best 8-GPU result per chip, minutes (lower is better)
MI355X (8.27 min) lands 4th, behind GB300, B300 and B200, ahead of GB200 and H200. A real result for AMD.
10. Full chip leaderboard: Llama3.1-8B
Best 8-GPU result per chip, minutes (lower is better)
MI355X (86.8 min) again sits mid-pack, beating Hopper but behind the full Blackwell stack.
11. Closest apples-to-apples: B300 vs MI355X
Both 8-GPU air/liquid nodes, same generation timing
Against the comparable 8-GPU Blackwell part, NVIDIA's B300 is ~21-26% faster than MI355X. The gap narrowed versus prior rounds. It did not close.
12. Flux.1: AMD's only non-LLM entry
Image model, best result per chip at matched scale (min)
Flux.1 is the only non-LLM workload AMD entered, but the gap is wide. At matched 64 GPUs, MI325X (92.4 min) takes roughly 2× as long as B300 (46.7 min), and AMD ran older CDNA3 silicon here, not its newest MI355X. Participation, not parity.
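The "roughly 2×" figure follows directly from the two times quoted above. A minimal sketch of the arithmetic, with illustrative variable names:

```python
# Flux.1 at 64 GPUs, minutes, as quoted in the text.
MI325X_MIN = 92.4
B300_MIN = 46.7

# Ratio of time-to-train: how many times longer the slower system took.
time_ratio = MI325X_MIN / B300_MIN

# The same gap expressed as the share of wall-clock time B300 saves.
time_saved = 1 - B300_MIN / MI325X_MIN

print(f"MI325X takes {time_ratio:.2f}x as long as B300")
print(f"B300 finishes in {time_saved:.1%} less time")
```

Quoting both forms avoids the usual ambiguity between "N× slower" and "N% faster" when lower time is better.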
04
AMD's Real Progress: And The Gap
The bear case on AMD is "they never show up." This round answers part of that. CDNA4 (MI350 series) is a genuine leap over CDNA3. The remaining question is breadth and scale.
13. MI355X vs MI300X: generational uplift
Llama2-70B LoRA, 8 GPUs, minutes (lower is better)
MI355X is 3.46× faster than MI300X on the same job. CDNA4 plus FP4 is a real step. AMD's own silicon is the story here, not the NVIDIA comparison.
14. Who is building AMD systems
AMD-silicon results by submitting organization
OEM breadth is improving: Dell, HPE, Supermicro, Cisco, MiTAC and Oracle all posted MI results. The supply chain is forming even if scale is not there yet.
15. The breadth gap, visualized
Benchmarks attempted: NVIDIA vs AMD
For an AMD bull, the homework is simple. Watch for the first MI355X result on DeepSeek-V3 or a 405B-class model at 1,000+ GPUs. Until that prints, MLPerf positions AMD as a strong node-level option, not a fleet-level NVIDIA replacement.
05
The Neocloud Scoreboard: CRWV, NBIS, Oracle
MLPerf is now a marketing channel for the GPU clouds. Submitting a clean, large-scale result is a credibility signal to enterprise buyers. Here is who showed up and how big they went.
16. Cloud submitters: results and max scale
Submitted results (bars) and largest single run in GPUs (labels)
Nebius posted the most cloud results (12). CoreWeave and Azure went the largest at 8,192 GPUs. Oracle covered the most workloads.
17. CoreWeave's flex: DeepSeek-V3 scaling
CRWV on GB300, 2,048 → 8,192 GPUs, minutes
CoreWeave's 8,192-GPU DeepSeek run (2.02 min) beat NVIDIA's own 8,192-GPU GB200 result (3.34 min). A neocloud posting a best-in-round number at frontier scale is the CRWV bull case in one chart.
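For scale, a small sketch of the gap between those two 8,192-GPU DeepSeek-V3 times. The numbers are the ones quoted above. Note that the two runs used different chips (GB300 vs GB200), so the ratio mixes operator and silicon effects.

```python
# DeepSeek-V3 671B at 8,192 GPUs, minutes, as quoted in the text.
COREWEAVE_GB300_MIN = 2.02
NVIDIA_GB200_MIN = 3.34

# Speedup of the CoreWeave run relative to the NVIDIA GB200 run.
speedup = NVIDIA_GB200_MIN / COREWEAVE_GB300_MIN

# The same gap as a reduction in time-to-train.
reduction = 1 - COREWEAVE_GB300_MIN / NVIDIA_GB200_MIN

print(f"CoreWeave GB300 run is {speedup:.2f}x faster")
print(f"Time-to-train reduced by {reduction:.1%}")
```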
18. Scale leaderboard: every run at 2,048+ GPUs
The biggest jobs in the round. This is where fleet operators prove themselves.
Operator | Chip | Workload | GPUs | Minutes
Only three names run jobs this big: NVIDIA, CoreWeave and Azure. CoreWeave is the only pure-play neocloud on the list, and it owns the top DeepSeek result.
19. Nebius (NBIS) workload spread
Best NBIS result per workload, minutes (mixed scale)
Nebius spread across Llama3.1-8B, GPT-OSS and Flux.1 on B300/GB300. Broad, mid-scale, very current silicon. A "we have the new chips and they work" statement.
06
The FP4 Training Era Has Started
The quiet structural story in v6.0: four-bit floating point is now being used in actual training, not just inference. This is a software-and-format moat that favors whoever's numerics the ecosystem standardizes on.
20. Lowest precision used in linear layers
Count of results by numeric format
64 of 227 results trained at 4-bit (NVFP4 or MXFP4). A year ago training at FP4 was a research claim. It is now a submitted MLPerf result.
21. The format war: NVFP4 vs MXFP4
Which vendor uses which 4-bit format
NVIDIA pushes its own NVFP4 (39 results); AMD uses the open MXFP4 (25). Two formats, two software stacks. Watch which one model builders adopt by default.
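The FP4 counts quoted in this section can be cross-checked against each other. The sketch below uses only the figures in the text (39 NVFP4, 25 MXFP4, 227 total) and derives the shares; the variable names are illustrative.

```python
# Counts quoted in the text.
TOTAL_RESULTS = 227
FP4_BY_FORMAT = {"NVFP4": 39, "MXFP4": 25}

fp4_total = sum(FP4_BY_FORMAT.values())

# Share of the whole round that trained at 4-bit.
fp4_share = fp4_total / TOTAL_RESULTS

# Split of the 4-bit results between the two formats.
format_split = {fmt: n / fp4_total for fmt, n in FP4_BY_FORMAT.items()}

print(f"FP4 results: {fp4_total} of {TOTAL_RESULTS} ({fp4_share:.1%})")
for fmt, frac in format_split.items():
    print(f"  {fmt}: {frac:.1%} of FP4 results")
```

The two format counts sum to the 64 results cited for chart 20, so the figures are internally consistent.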
07
Ecosystem Breadth & Best-In-Class
The last lens is the supply chain. A platform wins partly by how many partners ship it. Here is OEM participation and the fastest result on record per workload.
22. Submissions by organization
Every submitting org, result count
Dell, HPE and Supermicro lead OEM submissions. 24 organizations submitted, almost all selling NVIDIA-based systems. The channel is the moat as much as the chip is.
23. Best-in-round result per workload
Fastest published time-to-train and the system that set it
Workload | Fastest (min) | Chip | GPUs | Max scale tested
Llama2-70B LoRA | 0.40 | GB300 | 512 | 512
DLRM-DCNv2 | 0.67 | GB300 | 64 | 64
DeepSeek-V3 671B | 2.02 | GB300 (CoreWeave) | 8,192 | 8,192
Llama3.1-8B | 4.46 | GB200 | 1,024 | 1,024
GPT-OSS-20B | 7.43 | GB300 | 512 | 512
Llama3.1-405B | 7.07 | GB200 (Azure) | 8,192 | 8,192
Flux.1 | 17.12 | GB300 | 512 | 512
Every single best-in-round time was set on NVIDIA Blackwell. No workload in this table has a fastest result on non-NVIDIA silicon. That is the position investors are pricing.
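That claim is easy to check mechanically against the table. A minimal sketch, assuming the chip labels below are the Blackwell-family names used in this article; the records are transcribed from the table above and the names `BEST` and `BLACKWELL` are illustrative.

```python
from collections import Counter

# (workload, fastest minutes, chip, GPUs) transcribed from the table above.
BEST = [
    ("Llama2-70B LoRA", 0.40, "GB300", 512),
    ("DLRM-DCNv2", 0.67, "GB300", 64),
    ("DeepSeek-V3 671B", 2.02, "GB300", 8192),
    ("Llama3.1-8B", 4.46, "GB200", 1024),
    ("GPT-OSS-20B", 7.43, "GB300", 512),
    ("Llama3.1-405B", 7.07, "GB200", 8192),
    ("Flux.1", 17.12, "GB300", 512),
]

# Assumption: these labels cover the Blackwell parts named in this article.
BLACKWELL = {"GB200", "GB300", "B200", "B300"}

# How many best-in-round results each chip holds.
wins_by_chip = Counter(chip for _, _, chip, _ in BEST)

# True only if every fastest result was set on a Blackwell part.
all_blackwell = all(chip in BLACKWELL for _, _, chip, _ in BEST)

print("Best-in-round wins by chip:", dict(wins_by_chip))
print("All set on Blackwell:", all_blackwell)
```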
24. The one-glance summary
Where each name stands after v6.0
Name | What v6.0 says | Read
NVIDIA (NVDA) | On all 7 workloads, every best-in-round time, FP4 training, 8,192-GPU scaling | Lead intact
AMD (AMD) | MI355X 3.46× over MI300X, mid-pack on 3 of 7, absent from frontier scale | Closing, not closed
CoreWeave (CRWV) | Top DeepSeek-V3 result, 8,192-GPU run, beat NVIDIA's own GB200 number |