To put it more clearly, the 'Megatron-DeepSpeed' DL framework was ported to Fugaku, and its dense matrix multiplication library was accelerated for the 'Transformer' architecture, so as to maximize distributed training performance.
Let's make that more clear...
Megatron (an NVIDIA-developed framework that excels at multi-GPU AI acceleration) was used together with DeepSpeed, a Microsoft-written optimization library.
The framework ensures many GPUs (or, in Fugaku's case, many CPU nodes) can be used at once to train the model.
The library itself helps by accelerating the model during training and inference, introducing many cool things like mixed-precision loss scaling, memory partitioning (ZeRO), and several forms of parallelism, while still being fairly memory efficient.
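To make that concrete, here's a minimal sketch of what a DeepSpeed-style configuration might look like. The key names (`train_batch_size`, `fp16`, `zero_optimization`, `gradient_clipping`) are real DeepSpeed config keys, but the values here are illustrative, not taken from the Fugaku work.

```python
# Illustrative DeepSpeed-style config (values are made up for this example).
# "zero_optimization" partitions optimizer state across workers (ZeRO),
# and "fp16" enables mixed-precision training with loss scaling.
ds_config = {
    "train_batch_size": 32,             # global batch size across all workers
    "fp16": {"enabled": True},          # mixed precision with loss scaling
    "zero_optimization": {"stage": 1},  # stage 1: partition optimizer states
    "gradient_clipping": 1.0,           # keep gradient norms in a sane range
}

print(sorted(ds_config))
```

In a real run, a dict like this would be handed to `deepspeed.initialize` along with the model; the point here is just that the memory and precision tricks are switched on through configuration rather than code changes.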
When you add on a self-attention architecture like the Transformer, you have a model that can pick out, or give more weight to, the more important sections of the input.
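That "highlighting" is literally a weighted average: each token scores every other token, the scores go through a softmax, and the weights decide how much of each token flows into the output. Here's a tiny NumPy sketch (no learned Q/K/V projections, just the core mechanism, not the actual Fugaku code):

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over x of shape (tokens, dim).
    For simplicity, queries, keys, and values are all x itself."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # pairwise token similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: rows sum to 1
    return weights @ x, weights                      # weighted mix of tokens

# 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = self_attention(x)
print(out.shape, w.shape)  # (4, 8) (4, 4)
```

Each row of `w` shows how strongly one token "attends" to the others; the dense matrix multiplications in there (`x @ x.T` and `weights @ x`) are exactly the kind of operation the Fugaku port accelerated.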
This type of AI acceleration, all of it put together, is very good at Natural Language Processing.
I think what I've said is true but I'm still learning.