I guess, but it's still based on the use of Tensor Cores in addition to the raw FP performance. If you're after raw FP performance, the MI100 has higher numbers on paper. The number of FP workloads that can be accelerated through CUDA and Nvidia's various ML libraries does seem to be quite large, but whether that helps is still up to the individual company/researcher to determine, so it's still useful to compare the "base" FP64/FP32 results.

First of all: the A100 values are a little bit mixed up.
With Tensor Cores:
bfloat16 or FP16 = 312 TFlops (with sparsity up to 624 TFlops)
TF32 = 156 TFlops (with sparsity up to 312 TFlops)
(matrix ops with FP32-equivalent precision for training)
INT8 = 624 TOPS (with sparsity up to 1248 TOPS)
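The sparsity figures above are just 2x the dense numbers (Ampere's 2:4 structured sparsity can at best double throughput), which a quick sketch makes obvious:

```python
# Peak A100 Tensor Core throughput from the list above (dense numbers),
# in TFLOPS (TOPS for INT8); structured sparsity doubles each of them.
dense_peaks = {
    "FP16/bfloat16": 312,  # TFLOPS
    "TF32": 156,           # TFLOPS
    "INT8": 624,           # TOPS
}

for fmt, dense in dense_peaks.items():
    sparse = 2 * dense  # 2:4 structured sparsity gives up to a 2x boost
    print(f"{fmt}: {dense} dense -> up to {sparse} with sparsity")
```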
On top of that, the regular base FP64 performance is 9.7 TFlops, but Ampere can also run FP64 MMA ops on the Tensor Cores at full precision, and Nvidia has extended its CUDA-X libraries to make that easy to use. The effective FP64 throughput for a lot of (or even most?) workloads should therefore be well above 9.7 TFlops.
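For intuition, here's a minimal back-of-envelope sketch of the ideal-case gain for a double-precision matrix multiply. The 19.5 TFlops FP64 Tensor Core peak is Nvidia's published A100 figure, not a number from the post above:

```python
# Ideal time for a dense n x n x n FP64 GEMM (2*n^3 flops) at the
# base FP64 rate vs. the FP64 Tensor Core rate.
BASE_FP64_TFLOPS = 9.7   # A100 base FP64 (from the post)
TC_FP64_TFLOPS = 19.5    # assumption: Nvidia's quoted FP64 Tensor Core peak

def gemm_seconds(n, tflops):
    """Ideal (peak-rate) time in seconds for a dense n^3 GEMM."""
    return (2 * n**3) / (tflops * 1e12)

n = 8192
base_t = gemm_seconds(n, BASE_FP64_TFLOPS)
tc_t = gemm_seconds(n, TC_FP64_TFLOPS)
print(f"{n}^3 DGEMM: {base_t:.3f}s base vs {tc_t:.3f}s on Tensor Cores "
      f"({base_t / tc_t:.2f}x)")
```

Real workloads won't hit either peak, but the ratio (roughly 2x for pure GEMM) is why the effective FP64 number matters more than the base 9.7 TFlops.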
In the end it seems that the MI100 is no match for Ampere, especially not for AI workloads.