Move Over GPUs: Startup's Chip Claims to Do Deep Learning Inference Better

Lucian Armasu · Sep 19, 2018

Habana Labs, a new AI chip startup, promises to deliver much higher inference performance than even a machine learning-optimized GPU.

bit_user · Sep 19, 2018

Something seems wrong with their benchmark, if a V100 only rates 2x as fast as Intel Xeon. I'm skeptical even 56 Xeon cores would be that fast.

Anyhow, V100 is old news. Turing is 2x to 4x faster still.

alextheblue · Sep 19, 2018

bit_user :

I can't help but think this wouldn't be a match for Nvidia's tensor cores if they (Nvidia) built a chip that was basically a big 100W Tensor block. But maybe I'm wrong. As of today, though, this Habana design does have the best performance in that power envelope. Of course, that's all just on paper.

WINTERLORD · Sep 19, 2018

So my next motherboard will have one of these for real-time ray tracing. Kinda makes me wonder now what AMD will offer in the future.

bit_user · Sep 20, 2018

alextheblue :

You're declaring a winner based on just one benchmark and so little other information?

Anyway, according to this, Nvidia claims 6,275 images/sec with a single V100:

https://images.nvidia.com/content/pdf/inference-technical-overview.pdf

But that really doesn't tell us how well their architecture performs on different types of networks. The fact that it seems to rely on internal memory means it probably hits a brick wall as you increase network size and complexity.

bit_user · Sep 20, 2018

Okay, looking at that Nvidia doc some more, I think I know what Habana did. They probably found the latency that was the sweet spot for their chip, and then picked the closest V100 result below that latency.

However, what that means is if you increase the batch size to where V100 does well, you're probably past the point where the Goya's internal memory is exhausted.

AFAIK, 6 ms isn't some kind of magic latency number for the cloud (where these chips would be used). So, it's probably better to compare Goya's best throughput against the V100's best throughput.
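To show why the choice of latency cap matters, here's a rough Python sketch of the two ways to pick a "headline" number from a batch-size sweep. The data points are made up purely for illustration (they are not from Nvidia's document or from Habana), and the names `Point`, `best_under_latency`, and `best_overall` are just mine.

```python
from typing import NamedTuple, Optional, Sequence


class Point(NamedTuple):
    """One measurement from a batch-size sweep."""
    batch_size: int
    latency_ms: float  # time to finish one whole batch

    @property
    def images_per_sec(self) -> float:
        # Throughput implied by finishing batch_size images every latency_ms.
        return self.batch_size / self.latency_ms * 1000.0


def best_under_latency(points: Sequence[Point], cap_ms: float) -> Optional[Point]:
    """Highest-throughput point whose latency does not exceed cap_ms."""
    eligible = [p for p in points if p.latency_ms <= cap_ms]
    return max(eligible, key=lambda p: p.images_per_sec) if eligible else None


def best_overall(points: Sequence[Point]) -> Point:
    """Highest-throughput point, ignoring latency entirely."""
    return max(points, key=lambda p: p.images_per_sec)


# HYPOTHETICAL sweep for some accelerator; not measured data.
SWEEP = [Point(1, 2.0), Point(8, 5.0), Point(64, 20.0), Point(128, 35.0)]

if __name__ == "__main__":
    capped = best_under_latency(SWEEP, cap_ms=6.0)
    peak = best_overall(SWEEP)
    print("best under 6 ms:", capped, capped.images_per_sec)
    print("best overall:   ", peak, peak.images_per_sec)
```

With these invented numbers, the point picked under a 6 ms cap has less than half the throughput of the unconstrained peak, which is the kind of gap a carefully chosen latency cap can hide.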

But we're overlooking something. Goya is clearly using integer arithmetic to hit that number, while the V100 is using fp16. What they don't say is the hit you take on accuracy from using their 8-bit approximation. And if you're going to use 8-bit, then you can use Nvidia's new TU102 and probably get nearly the same performance (should be about 2x of the V100's fp16 throughput, according to Nvidia's numbers).
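For anyone wondering where the accuracy hit comes from, here's a minimal sketch of 8-bit quantization in plain Python. It assumes simple symmetric, per-tensor scaling, which is only one common scheme; I don't know what Goya or Nvidia's tools actually do, and the function names and sample weights are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize_int8(values: Sequence[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: Sequence[int], scale: float) -> List[float]:
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


def max_abs_error(values: Sequence[float]) -> float:
    """Largest round-trip error introduced by quantizing and dequantizing."""
    quantized, scale = quantize_int8(values)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


if __name__ == "__main__":
    # Made-up weights; note how the small one collapses to zero.
    weights = [0.5, -1.27, 0.003, 1.0]
    q, scale = quantize_int8(weights)
    print("int8 values:", q)
    print("scale:", scale)
    print("max round-trip error:", max_abs_error(weights))
```

The round-trip error per value is bounded by half the scale, so a tensor with a few large outliers loses precision on all of its small values. Whether that matters for final accuracy depends on the network, which is exactly the number Habana didn't give.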

bit_user · Sep 20, 2018

alextheblue :

First, Nvidia's GPUs rely on the CUDA cores to drive the Tensor cores. I'm not sure how many CUDA cores you can remove before impacting Tensor core performance, but the answer might be that they're already optimally balanced. That means all you could really remove is the fp64 support and the graphics blocks. That said, I keep wondering if you couldn't use the texture units for on-the-fly weight decompression.

The CUDA cores are also valuable for implementing various layer types that can't simply be modeled with tensor products.

softwarehouse · Sep 20, 2018

Too slow Seagate. Still stuck at 14TB?

Here's 16TB in action.

https://www.win-raid.com/t3548f45-The-XP-Yeager-Project-TB-Breaking-the-TB-Capacity-Barrier.html
