GPUs have higher memory bandwidth than CPUs and tolerate longer memory latency, but they don't address the problem that in graph processing much of that bandwidth is wasted on data that is fetched, compared, and discarded. One idea is smarter memory chips that can compare and discard internally, reducing the time and energy spent shuttling results back to the processors.
A motivating example seems to be that the full bandwidth can be used to fetch useful values when traversing dense arrays, but not sparse matrices.
Consider a dense 2D matrix that stores the elements of a row in adjacent memory locations (a row-major representation). The processor can automatically predict the addresses of the next few elements and prefetch them, and memory can send a stream of adjacent values. When striding (such as walking down a column), the processor can still predict the next address and prefetch it, but only one or two values in each cache line are needed, wasting bandwidth.
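To make the contrast concrete, here is a minimal C sketch of the two access patterns; the array sizes and names are arbitrary:

```c
#include <stdio.h>

#define ROWS 4
#define COLS 4

int main(void) {
    /* Row-major dense matrix: elements of a row are adjacent in memory. */
    double m[ROWS][COLS] = {0};

    /* Row traversal: addresses increase by sizeof(double); the prefetcher
     * sees a unit stride and every byte of each fetched cache line is used. */
    double row_sum = 0.0;
    for (int j = 0; j < COLS; j++)
        row_sum += m[1][j];

    /* Column traversal: addresses jump by COLS * sizeof(double). The stride
     * is still predictable, so prefetching works, but each cache line
     * contributes only one useful value; the rest of the line is wasted. */
    double col_sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        col_sum += m[i][2];

    printf("%f %f\n", row_sum, col_sum);
    return 0;
}
```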
In a sparse matrix, the elements may be represented as rows that alternate column number and value (or as a row of column numbers and a parallel row of values). The processor cannot easily predict the address of the desired value. It has to scan along the row (or do a binary search) to find whether an element exists and where in the row it is. During this scan or binary search, most of the fetched data is discarded. In the binary-search case, additional time is also lost waiting for each round trip to memory and back.
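A rough sketch of that lookup, assuming one row stored as parallel arrays of sorted column numbers and values (the sparse_row type and sparse_row_get name are just illustrative):

```c
#include <stddef.h>

/* One sparse row stored as parallel arrays: sorted column numbers and
 * their values (a hypothetical layout, for illustration only). */
typedef struct {
    const int    *cols;   /* sorted column indices present in this row */
    const double *vals;   /* value stored for each present column */
    size_t        n;      /* number of stored (column, value) pairs */
} sparse_row;

/* Binary search for column c. Each probe lands on a different cache line,
 * most of whose contents are discarded, and each probe must wait for the
 * previous round trip to memory. Returns 1 and writes *out if found. */
int sparse_row_get(const sparse_row *r, int c, double *out) {
    size_t lo = 0, hi = r->n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (r->cols[mid] == c) { *out = r->vals[mid]; return 1; }
        if (r->cols[mid] < c)  lo = mid + 1;
        else                   hi = mid;
    }
    return 0;   /* element is implicitly zero */
}
```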
Graph networks might be represented similarly to a sparse matrix: each node has a list of its neighbors, each labeled with a relationship id and a neighbor node id. So again the processor frequently scans down a list of neighbors until it finds the relationship id it wants. A hash table could be used instead, but hash tables may not fill cache lines densely, and they may still contain lists when there are multiple neighbors with the same relationship.
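Something like the following scan, where the edge record and find_neighbors helper are hypothetical stand-ins for whatever layout a real graph store uses:

```c
#include <stddef.h>

/* Hypothetical edge record: each node owns an array of these. */
typedef struct {
    int rel_id;     /* relationship (edge label) id */
    int neighbor;   /* neighbor node id */
} edge;

/* Scan a node's neighbor list for edges with the wanted relationship.
 * Every edge record is pulled across the memory bus and compared, but
 * only the matches are kept; the rest of the bandwidth is wasted. */
size_t find_neighbors(const edge *edges, size_t n_edges,
                      int wanted_rel, int *out, size_t out_cap) {
    size_t found = 0;
    for (size_t i = 0; i < n_edges && found < out_cap; i++) {
        if (edges[i].rel_id == wanted_rel)
            out[found++] = edges[i].neighbor;
    }
    return found;
}
```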
To make memory smarter, it should return only matching values, not values that will be immediately discarded and never used. One approach is to put a filter on the memory output buffer, offloading some of the associative matching and discarding. Another is to add simple processing on the memory side to do the scan or binary search. A third is to build memories that are addressed associatively, so a whole row can be queried and only values with a matching id or column number are queued in the output buffer.
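As a software model only (no real hardware interface is implied), a filtered read might behave like this, with the compare-and-discard happening on the memory side so only matches cross the bus:

```c
#include <stddef.h>

/* Hypothetical tagged entry: the tag could be a column number or a
 * relationship id, depending on what the row holds. */
typedef struct {
    int    tag;
    double value;
} tagged_entry;

/* Model of a filtered read command to a "smart" memory module: the module
 * scans the row internally and queues only matching entries into out.
 * The names here are illustrative, not an actual memory protocol. */
size_t mem_filtered_read(const tagged_entry *row, size_t n,
                         int wanted_tag, tagged_entry *out, size_t out_cap) {
    size_t kept = 0;
    for (size_t i = 0; i < n && kept < out_cap; i++) {
        if (row[i].tag == wanted_tag)
            out[kept++] = row[i];   /* only matches leave the memory side */
    }
    return kept;
}
```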
To make processors better for graphs, they might have shorter cache lines, or at least the ability to avoid filling a whole cache line when it is not wanted (which requires multiple presence and dirty bits per line). They also need a way to communicate with the smarter memory.
On the other hand, the applications appear to be analyzing large networks, so these would be specialized hardware devices for governments and organizations with deep pockets. Unless, that is, someone can come up with a consumer application as intensive as video games and cryptomining have been for GPUs...