I just noticed they used a batch size of 1 for that data! GPUs are much faster and more efficient with large batch sizes, as the paper subsequently acknowledges. When they increase the batch size to 64, the GPU gets a 47x speedup (it was already 2.6x as fast as Intel's Loihi chip to begin with), while being about 50x as energy-efficient as it was at batch size 1! So that cuts Intel's efficiency benefit to a mere ~2x, unless you have some sort of workload that absolutely cannot be batched (e.g. realtime/embedded control). Once you add the efficiency gains of Pascal or even Maxwell, their massive efficiency advantage should completely fall away. And we haven't even gotten to Turing's Tensor cores.
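For anyone who wants to sanity-check that arithmetic, here's a rough back-of-envelope sketch (just ratio math in Python). The 2.6x, 47x, and ~50x figures are the ones quoted above; the ~100x batch-1 efficiency advantage is simply what's implied by the ~2x remainder after a ~50x gain, not a number taken from the paper.

```python
# Back-of-envelope ratios for the batch-size argument above.
# Quoted figures: 2.6x (speed at batch 1), 47x (batch-64 speedup), ~50x (efficiency gain).
# The ~100x batch-1 Loihi advantage is a derived/hypothetical value, not a paper number.

gpu_speed_vs_loihi_batch1 = 2.6    # GPU already 2.6x faster than Loihi at batch size 1
gpu_speedup_at_batch64    = 47.0   # GPU throughput gain going from batch 1 to batch 64
gpu_efficiency_gain       = 50.0   # GPU inferences-per-joule gain from batching

# Implied Loihi energy-efficiency edge at batch size 1 (derived from the ~2x remainder):
loihi_edge_batch1 = 2.0 * gpu_efficiency_gain            # ~100x

gpu_speed_vs_loihi_batch64 = gpu_speed_vs_loihi_batch1 * gpu_speedup_at_batch64
loihi_edge_batch64 = loihi_edge_batch1 / gpu_efficiency_gain

print(f"GPU throughput vs Loihi at batch 64: ~{gpu_speed_vs_loihi_batch64:.0f}x faster")  # ~122x
print(f"Loihi efficiency edge after batching: ~{loihi_edge_batch64:.0f}x")                # ~2x
```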
BTW, the GPU is so slow that it's even outrun by their Xeon E5-2630 CPU! They don't say which version, but I'm guessing v3 or v4, which would have AVX2. Worse, that GPU is only about 2x the speed of the Jetson TX1, whose Tegra X1 SoC is essentially what powers the Nintendo Switch. So, we're talking about a really crap GPU.
Maybe I'll give the paper a full read, and post up anything more I find. So far, what sounded like a very promising development now seems like a PR whitewashing exercise.