44GB over 900,000 cores is only 50KB per core
If you compute the amount of silicon area or transistors per core, it's pretty clear that these are closer to what Nvidia calls a "core". In other words, something more like a SIMD lane.
of cache memory which is tiny compared with consumer CPUs.
I doubt it works as a cache; it's probably directly addressable. Cache lookups require associative memory, which wastes die space and power on something that really isn't necessary when your access patterns are predictable. The usual approach is a double-buffering scheme: a DMA engine drains and fills the inactive buffers while you compute on the contents of the active buffer.
The IBM Cell processor, best known from Sony's PS3, worked this way. Each of its 8 SPEs had 256 kB of SRAM that it used like this. The SPEs had no direct access to system memory and instead relied on DMA transfers to copy everything in and out of their scratchpad memory.
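To make the double-buffering idea concrete, here's a rough Python sketch. It is only a toy model, not how Cerebras or the Cell actually implement it: a single worker thread stands in for the DMA engine, plain lists stand in for the scratchpad buffers and main memory, and the names `dma_fill`, `sum_streamed` and `SCRATCHPAD_WORDS` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

SCRATCHPAD_WORDS = 4  # hypothetical size of each local buffer, in elements


def dma_fill(source, offset, buffer):
    """Stand-in for a DMA transfer: copy one chunk of 'main memory' into a buffer.

    Returns the number of valid elements written (0 once the source is exhausted).
    """
    chunk = source[offset:offset + len(buffer)]
    buffer[:len(chunk)] = chunk
    return len(chunk)


def sum_streamed(source):
    """Sum `source` while only ever holding two small buffers of it locally."""
    buffers = [[0] * SCRATCHPAD_WORDS, [0] * SCRATCHPAD_WORDS]
    total = 0
    offset = 0
    active = 0
    # One worker thread plays the role of the DMA engine (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(dma_fill, source, offset, buffers[active])
        while True:
            valid = pending.result()  # wait until the active buffer is filled
            if valid == 0:
                break
            offset += valid
            # Kick off the fill of the inactive buffer...
            pending = dma.submit(dma_fill, source, offset, buffers[1 - active])
            # ...and compute on the active buffer while that transfer is in flight.
            total += sum(buffers[active][:valid])
            active = 1 - active  # swap roles for the next round
    return total
```

Calling `sum_streamed(list(range(10)))` walks the data four elements at a time and never touches the source directly from the compute step. The point is that compute only ever reads the active buffer, so no tag lookup or associativity is needed; the access pattern is known ahead of time and the transfers are scheduled explicitly.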
As a specific purpose chip with specifically written code, it likely doesn't need a big memory for random access. (Unless I am misunderstanding what they mean by on-chip memory)
I think a big use case for external DRAM access is streaming in weights, in the event you don't have enough SRAM to keep them all on-die. That's probably what eats most of the massive memory bandwidth on Nvidia's Hopper, for instance.