Okay, looking at that Nvidia doc some more, I think I know what Habana did. They probably found the latency sweet spot for their chip, then looked at the closest V100 result below that latency.
However, that means if you increase the batch size to where the V100 does well, you're probably past the point where Goya's internal memory is exhausted.
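To make that concrete, here's a rough back-of-the-envelope sketch in Python. The numbers are invented for illustration (the per-image activation footprint and on-chip memory budget are hypothetical, not published Goya specs); the point is just that activation memory grows linearly with batch size, so past some batch you spill off-chip and latency climbs.

```python
# Hypothetical numbers -- not actual Goya specs.
ON_CHIP_MEMORY_MB = 48        # assumed on-chip memory budget
ACTIVATION_MB_PER_IMAGE = 6   # assumed peak activation footprint per image

def fits_on_chip(batch_size):
    """Rough check: do peak activations for this batch fit on-chip?"""
    return batch_size * ACTIVATION_MB_PER_IMAGE <= ON_CHIP_MEMORY_MB

for batch in (1, 4, 8, 16, 32):
    print(batch, "on-chip" if fits_on_chip(batch) else "spills off-chip")
```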
AFAIK, 6 ms isn't some kind of magic latency number for the cloud (where these chips would be used), so it's probably better to compare Goya's best throughput against the V100's best throughput.
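In case it helps, here's a toy sketch of the two ways of reading a benchmark table: pick the best throughput under a latency cutoff (what I'm guessing Habana did), versus pick the best throughput outright. The (batch, latency, throughput) rows are made up for illustration, not real V100 numbers.

```python
# Invented (batch, latency_ms, throughput_img_s) rows for illustration.
v100_results = [
    (1,    1.2,   830),
    (8,    4.8,  1670),
    (32,  11.5,  2780),
    (128, 39.0,  3280),
]

def best_under_latency(results, cutoff_ms):
    """Best throughput among configs that meet the latency cutoff."""
    eligible = [r for r in results if r[1] <= cutoff_ms]
    return max(eligible, key=lambda r: r[2]) if eligible else None

def best_overall(results):
    """Best throughput, ignoring latency entirely."""
    return max(results, key=lambda r: r[2])

print(best_under_latency(v100_results, cutoff_ms=6.0))  # the cherry-picked point
print(best_overall(v100_results))                       # the fairer comparison
```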
But we're overlooking something. Goya is clearly using integer arithmetic to hit that number, while the V100 is using fp16. What they don't say is the accuracy hit you take from their 8-bit approximation. And if you're going to use 8-bit anyway, you can use Nvidia's new TU102 and probably get nearly the same performance (its int8 throughput should be about 2x the V100's fp16 throughput, according to Nvidia's numbers).
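To get a feel for what an 8-bit approximation does, here's a minimal symmetric-quantization sketch in Python/NumPy. This is a generic scheme, not necessarily what Habana actually does on Goya: quantize fp32 values to int8, dequantize, and measure the rounding error (the accuracy hit on a real network depends on how that error propagates through the layers).

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of an fp32 array to int8."""
    scale = np.abs(x).max() / 127.0                      # map max magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to fp32 approximations."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)       # stand-in for weights
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w)
print(f"max abs error: {err.max():.5f}, mean abs error: {err.mean():.5f}")
```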