GPUs have higher memory bandwidth than CPUs and tolerate longer memory latency, but they don't address the problem that in graph processing much of that bandwidth is wasted on data that is fetched, compared, and discarded. One idea is smarter memory chips that can compare and discard internally, reducing the time and energy spent shuttling results back to the processors.
A motivating example seems to be that the full bandwidth can be used to fetch useful values when traversing dense arrays, but not sparse matrices.
Consider a dense 2D matrix that stores the elements of a row in adjacent memory locations (a row-major representation). The processor can automatically predict the addresses of the next few elements and prefetch them, and memory can send a stream of adjacent values. When striding (such as walking down a column), the processor can still predict the next address and prefetch it, but only one or two values in each cache line are needed, wasting bandwidth.
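To make the contrast concrete, here is a minimal C sketch of the two access patterns; the array sizes and names are arbitrary:

```c
#include <stdio.h>

#define ROWS 4
#define COLS 4

int main(void) {
    /* Row-major dense matrix: elements of a row are adjacent in memory. */
    double m[ROWS][COLS] = {0};

    /* Row traversal: addresses increase by sizeof(double); the prefetcher
     * sees a unit stride and every byte of each fetched cache line is used. */
    double row_sum = 0.0;
    for (int j = 0; j < COLS; j++)
        row_sum += m[1][j];

    /* Column traversal: addresses jump by COLS * sizeof(double). The stride
     * is still predictable, so prefetching works, but each cache line
     * contributes only one useful value; the rest of the line is wasted. */
    double col_sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        col_sum += m[i][2];

    printf("%f %f\n", row_sum, col_sum);
    return 0;
}
```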
In a sparse matrix, the elements may be represented as rows that alternate column number and value (or as a row of column numbers and a parallel row of values). The processor cannot easily predict the address of the desired value. It has to scan along the row (or do a binary search) to find whether an element exists and where in the row it is. During this scan or binary search, most of the fetched data is discarded. In the binary-search case, additional time is also lost waiting for each round trip to memory and back.
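A rough sketch of that lookup, assuming one row stored as parallel arrays of sorted column numbers and values (the sparse_row type and sparse_row_get name are just illustrative):

```c
#include <stddef.h>

/* One sparse row stored as parallel arrays: sorted column numbers and
 * their values (a hypothetical layout, for illustration only). */
typedef struct {
    const int    *cols;   /* sorted column indices present in this row */
    const double *vals;   /* value stored for each present column */
    size_t        n;      /* number of stored (column, value) pairs */
} sparse_row;

/* Binary search for column c. Each probe lands on a different cache line,
 * most of whose contents are discarded, and each probe must wait for the
 * previous round trip to memory. Returns 1 and writes *out if found. */
int sparse_row_get(const sparse_row *r, int c, double *out) {
    size_t lo = 0, hi = r->n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (r->cols[mid] == c) { *out = r->vals[mid]; return 1; }
        if (r->cols[mid] < c)  lo = mid + 1;
        else                   hi = mid;
    }
    return 0;   /* element is implicitly zero */
}
```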
Graph networks might be represented similarly to a sparse matrix: each node has a list of its neighbors, each labeled with a relationship id and a neighbor node id. So again the processor frequently scans down a list of neighbors until it finds the relationship id it wants. A hash table could be used instead, but hash tables may not fill cache lines densely, and they may still contain lists when there are multiple neighbors with the same relationship.
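Something like the following scan, where the edge record and find_neighbors helper are hypothetical stand-ins for whatever layout a real graph store uses:

```c
#include <stddef.h>

/* Hypothetical edge record: each node owns an array of these. */
typedef struct {
    int rel_id;     /* relationship (edge label) id */
    int neighbor;   /* neighbor node id */
} edge;

/* Scan a node's neighbor list for edges with the wanted relationship.
 * Every edge record is pulled across the memory bus and compared, but
 * only the matches are kept; the rest of the bandwidth is wasted. */
size_t find_neighbors(const edge *edges, size_t n_edges,
                      int wanted_rel, int *out, size_t out_cap) {
    size_t found = 0;
    for (size_t i = 0; i < n_edges && found < out_cap; i++) {
        if (edges[i].rel_id == wanted_rel)
            out[found++] = edges[i].neighbor;
    }
    return found;
}
```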
To make memory smarter, it should return only matching values, not values that will be immediately discarded and never used. One approach is to put a filter on the memory output buffer, offloading some of the associative matching and discarding. Another is to add simple processing on the memory side to do the scan or binary search. A third is to build memories that are addressed associatively, so a whole row can be queried and only values with a matching id or column number are queued in the output buffer.
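As a software model only (no real hardware interface is implied), a filtered read might behave like this, with the compare-and-discard happening on the memory side so only matches cross the bus:

```c
#include <stddef.h>

/* Hypothetical tagged entry: the tag could be a column number or a
 * relationship id, depending on what the row holds. */
typedef struct {
    int    tag;
    double value;
} tagged_entry;

/* Model of a filtered read command to a "smart" memory module: the module
 * scans the row internally and queues only matching entries into out.
 * The names here are illustrative, not an actual memory protocol. */
size_t mem_filtered_read(const tagged_entry *row, size_t n,
                         int wanted_tag, tagged_entry *out, size_t out_cap) {
    size_t kept = 0;
    for (size_t i = 0; i < n && kept < out_cap; i++) {
        if (row[i].tag == wanted_tag)
            out[kept++] = row[i];   /* only matches leave the memory side */
    }
    return kept;
}
```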
To make processors better for graphs, they might have shorter cache lines, or at least the ability to avoid filling a whole cache line when it is not wanted (which requires multiple presence and dirty bits per line). They also need a way to communicate with the smarter memory.
On the other hand, the applications appear to be analyzing large networks, so these would be specialized hardware devices for governments and organizations with deep pockets. Unless, that is, someone can come up with a consumer application as intensive as video games and cryptomining have been for GPUs...