First of all: the A100 values are a little bit mixed up.
With Tensor Cores:
bfloat16 or FP16 = 312 TFlops (with sparsity up to 624 TFlops)
TF32 = 156 TFlops (with sparsity up to 312 TFlops)
(TF32 keeps FP32's dynamic range with a shorter mantissa, so it acts as an "FP32-like" precision for training matrix ops)
INT8 = 624 TOPS (with sparsity up to 1248 TOPS)
Additionally, the regular base FP64 rate is 9.7 TFlops, but Ampere can also run FP64 MMA ops on the Tensor Cores at full precision (19.5 TFlops peak), and NVIDIA has extended the CUDA-X libraries to make this easy to use. So for many (maybe even most?) FP64 workloads, the effective throughput should be well above the 9.7 TFlops base figure.
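For anyone who wants to sanity-check these peaks, here is a rough back-of-envelope sketch. The per-SM FMA rates, SM count (108) and boost clock (1410 MHz) are assumed hardware parameters, not taken from this post:

```python
# Back-of-envelope check of the quoted A100 peak numbers.
# Assumed parameters: 108 SMs, ~1410 MHz boost clock.
SMS = 108
CLOCK_HZ = 1.410e9

def peak_tflops(fma_per_sm_per_clk):
    # Each fused multiply-add counts as 2 floating-point ops.
    return SMS * fma_per_sm_per_clk * 2 * CLOCK_HZ / 1e12

# Assumed per-SM FMA throughput per clock:
#   FP16/BF16 Tensor Core: 1024 FMA -> ~312 TFlops
#   TF32 Tensor Core:       512 FMA -> ~156 TFlops
#   FP64 CUDA cores:         32 FMA -> ~9.7 TFlops
#   FP64 Tensor Core:        64 FMA -> ~19.5 TFlops
for name, fma in [("FP16/BF16 TC", 1024), ("TF32 TC", 512),
                  ("FP64 base", 32), ("FP64 TC", 64)]:
    print(f"{name}: {peak_tflops(fma):.1f} TFlops")
```

The dense numbers all line up with the spec sheet this way; the sparsity figures are just these doubled.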
In the end it seems the MI100 is no match for Ampere, especially with regard to AI workloads.