If you want a full roundup of all the very technical details, you can read NVIDIA’s in-depth architecture overview. We’ll be breaking down the most important stuff.
The New Die Is Absolutely Massive
Right out of the gate, they’re going all out with this new chip. Last generation’s Tesla V100 die was 815mm² on TSMC’s already mature 12nm process node, with 21.1 billion transistors. That was already quite big, but the A100 puts it to shame with 826mm² on TSMC’s much denser 7nm process and a whopping 54.2 billion transistors. Impressive for this new node.
This new GPU features 19.5 teraflops of FP32 performance, 6,912 CUDA cores, 40GB of memory, and 1.6TB/s of memory bandwidth. In one fairly specific workload (sparse INT8), the A100 actually cracks one quadrillion operations per second of raw compute. Of course, that’s integer math with sparsity rather than general-purpose FLOPS, but still, the card is very powerful.
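If you’re curious where that 19.5 TFLOPS figure (and the “much denser process” claim above) comes from, it’s simple arithmetic. The quick Python sketch below assumes the A100’s published boost clock of roughly 1.41 GHz, which isn’t mentioned in this article, and counts each fused multiply-add as two FLOPs:

```python
# Back-of-the-envelope checks on the spec numbers above.
# Assumes the A100's published ~1.41 GHz boost clock and one fused
# multiply-add (2 FLOPs) per CUDA core per clock cycle.

v100_density = 21.1e9 / 815   # transistors per mm^2
a100_density = 54.2e9 / 826
print(f"density gain: {a100_density / v100_density:.1f}x")   # ~2.5x

peak_fp32 = 6912 * 1.41e9 * 2 / 1e12
print(f"peak FP32: {peak_fp32:.1f} TFLOPS")                   # ~19.5 TFLOPS
```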
Then, much like they did with the V100, they’ve taken eight of these GPUs and created a mini supercomputer that they’re selling for $200,000. You’ll likely see it coming to cloud providers like AWS and Google Cloud Platform soon.
However, unlike the V100 system, this isn’t presented as one massive GPU. It’s eight separate GPUs that can be virtualized and rented out on their own for different tasks, with 7x higher memory throughput to boot.
As for putting all those transistors to use, the new chip runs much faster than the V100. For AI training and inference, the A100 offers a 6x speedup for FP32, 3x for FP16, and a 7x speedup for inference when using all of those GPUs together.
Note that the V100 marked in the second graph is the 8-GPU V100 server, not a single V100.
NVIDIA is also promising up to 2x speedup in many HPC workloads:
As for the raw TFLOPS numbers, A100 FP64 double-precision performance is roughly 20 TFLOPS (using the FP64 Tensor Cores), vs. around 8 for V100 FP64. All in all, these speedups are a real generational improvement over Volta, and are great news for the AI and machine learning space.
TensorFloat-32: A New Number Format Optimized For Tensor Cores
With Ampere, NVIDIA is using a new number format designed to replace FP32 in some workloads. Essentially, an FP32 value uses 1 bit for the sign, 8 bits for the range of the number (how big or small it can be), and 23 bits for the precision.
NVIDIA’s claim is that these 23 precision bits aren’t entirely necessary for many AI workloads, and that you can get similar results and much better performance out of just 10 of them. This new format is called TensorFloat-32 (TF32), and the Tensor Cores in the A100 are optimized to handle it. This, on top of the die shrink and increased core counts, is how they’re getting the massive 6x speedup in AI training.
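To make that bit-count comparison concrete, here’s a small NumPy sketch of what dropping from 23 mantissa bits to 10 does to a value. It just truncates the low bits, so treat it as an illustration of the precision loss rather than of how the Tensor Cores actually round:

```python
import numpy as np

def truncate_to_tf32(x):
    """Simulate TF32 storage by zeroing the low 13 mantissa bits of an FP32 value.

    FP32: 1 sign bit, 8 exponent bits, 23 mantissa bits.
    TF32: 1 sign bit, 8 exponent bits, 10 mantissa bits (same range, less precision).
    This sketch truncates instead of rounding, so it only approximates the hardware.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bits &= np.uint32(0xFFFFE000)  # clear the 13 least-significant mantissa bits
    return bits.view(np.float32)

x = np.float32(3.14159265)
print(x, truncate_to_tf32(x))  # 3.1415927 3.140625 -- only ~3 decimal digits survive
```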
They claim that “Users don’t have to make any code changes, because TF32 only runs inside the A100 GPU. TF32 operates on FP32 inputs and produces results in FP32. Non-tensor operations continue to use FP32.” This means it should be a drop-in replacement for workloads that don’t need the added precision.
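As an example of how invisible this is in practice, PyTorch surfaces TF32 as a pair of global switches rather than a new data type you have to use in your code. The sketch below shows the idea; the TF32 path only actually kicks in on Ampere-class hardware:

```python
import torch

# On Ampere GPUs, the framework decides whether FP32 matrix math is allowed to
# take the TF32 Tensor Core path. PyTorch exposes that choice as global flags,
# so the model code itself doesn't change.
torch.backends.cuda.matmul.allow_tf32 = True  # FP32 matmuls may run as TF32
torch.backends.cudnn.allow_tf32 = True        # cuDNN convolutions may run as TF32

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
c = a @ b  # inputs and outputs stay FP32; only the internal math uses TF32 on an A100
```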
Comparing FP32 performance on the V100 to TF32 performance on the A100, you can see where these massive speedups are coming from: TF32 is up to ten times faster (peak TF32 Tensor Core throughput on the A100 is 156 TFLOPS, versus 15.7 TFLOPS of standard FP32 on the V100). Of course, a lot of that gap also comes from Ampere’s other improvements, which make it roughly twice as fast in general, so it isn’t a direct format-to-format comparison.
They’ve also introduced a new concept called fine-grained structured sparsity, which boosts the compute performance of deep neural networks. Basically, certain weights matter less than others, so the matrix math can be compressed to improve throughput. While throwing out data doesn’t seem like a great idea, they claim it does not impact the accuracy of the trained network for inferencing, and simply speeds up the math.
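The specific pattern NVIDIA uses here is 2:4 sparsity: in every block of four weights, two get pruned to zero, which is what lets the Tensor Cores skip half the math. Here’s a bare-bones NumPy sketch of that pruning pattern; the real workflow goes through NVIDIA’s own libraries and usually involves fine-tuning the network afterwards, so this is just an illustration:

```python
import numpy as np

def prune_2_of_4(weights):
    """Sketch of 2:4 fine-grained structured pruning (not NVIDIA's actual tooling).

    In every group of four consecutive weights, the two smallest-magnitude
    values are zeroed out. The Tensor Cores can then skip the zeros, which is
    where the claimed throughput gain for sparse inference comes from.
    """
    w = np.asarray(weights, dtype=np.float32)
    groups = w.reshape(-1, 4)                          # view the weights in blocks of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # indices of the 2 smallest per block
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)       # zero out the 2 smallest
    return (groups * mask).reshape(w.shape)

w = np.random.randn(2, 8).astype(np.float32)
print(prune_2_of_4(w))  # exactly two zeros in every group of four
```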
For sparse INT8 calculations, the peak performance of a single A100 is roughly 1,250 TOPS (tera-operations per second, since INT8 is integer math), a staggeringly high number. Of course, you’ll be hard-pressed to find a real workload cranking only INT8, but speedups are speedups.