PC Doldrums: Quarterly Shipments Hit Lowest Levels Since 2007

Status
Not open for further replies.

A cache, regardless of the underlying technology, still needs a map of which addresses are at which location in said cache. The worst-case first-word access latency to DRAM is over 60ns plus lookup time, vs a constant 10-15ns total for SRAM. If you meant replacing RAM with HBM, HBM is still much slower and much higher latency than L3. The only way they might be able to get away with ditching L3 would be to make L2 bigger, but that isn't going to happen until larger L2 can be made without negatively impacting latency between the CPU cores and L2.
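To make the "map of addresses" point concrete, here is a minimal sketch of a direct-mapped cache lookup. The line size, the number of sets and the class name are assumptions picked for illustration, not figures for any real CPU, and real caches are usually set-associative rather than direct-mapped.

```python
# Toy direct-mapped cache: every parameter here is an illustrative assumption.
LINE_BYTES = 64      # bytes per cache line
NUM_SETS = 1024      # number of lines in the cache

def split_address(addr):
    """Split an address into (tag, set index, byte offset)."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, index, offset

class DirectMappedCache:
    def __init__(self):
        # The "map": one stored tag per set, None meaning the set is empty.
        self.tags = [None] * NUM_SETS

    def access(self, addr):
        """Return True on a hit; on a miss, install the line and return False."""
        tag, index, _ = split_address(addr)
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag
        return False

cache = DirectMappedCache()
first = cache.access(0x12345)        # cold miss
second = cache.access(0x12345 + 8)   # same 64-byte line, so a hit
conflict = cache.access(0x12345 + LINE_BYTES * NUM_SETS)  # same set, different tag
```

The tag comparison is the lookup step that has to happen on every access, whatever memory technology holds the data itself.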


You can achieve the same "benefits" by simply having enough RAM in your system to keep frequently used data there and update software to keep data unpacked in RAM instead of discarding everything and reloading from storage as most games currently do when reloading a level. That's just inefficient software design and can be addressed without introducing any new technology by making better use of available memory.
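A rough sketch of what "keep it unpacked in RAM" could look like in software. Everything here is hypothetical: `unpack_level()` stands in for whatever slow read-and-decompress step a game actually performs, and the level names are made up.

```python
# Hypothetical names throughout: unpack_level() stands in for a slow
# read-and-decompress step, and the level names are made up.
unpack_calls = 0

def unpack_level(name):
    """Pretend to load a level from storage and unpack it (the expensive part)."""
    global unpack_calls
    unpack_calls += 1
    return {"name": name, "assets": [f"{name}-asset-{i}" for i in range(3)]}

_resident = {}  # unpacked levels kept in RAM, keyed by level name

def get_level(name):
    """Return the unpacked level, touching storage only on first use."""
    if name not in _resident:
        _resident[name] = unpack_level(name)
    return _resident[name]

get_level("harbor")
get_level("harbor")   # reload of the same level: served from RAM
get_level("castle")
```

A real implementation would also need an eviction policy for when memory runs short, which this sketch leaves out.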

Before introducing new technology to address a "problem", actually check that there is a problem in the first place.
 
Just thought I'd throw in a tidbit about HBM. It's graphics memory. It's fundamentally different than CPU RAM. In GPUs, you don't care about latency, just bandwidth. The GPU can request data from its RAM well before it's needed, as the GPU's workload is insanely repetitive by nature.

The CPU, on the other hand, benefits less from absolute bandwidth than it does from low latency, at least on the scale we're talking about. It has no idea what it will need next until the last moment (compared to the GPU, at least). HBM would make for absolutely terrible CPU RAM. In order to have enough look-ahead to fully utilize HBM, you'd need a pipeline several times longer than Prescott, and that's not at all what you want in a CPU. It would be effectively an order of magnitude slower than DDR4 in that application despite its ridiculous bandwidth. You can imagine the implications of using it in place of the CPU's cache.
 

No, there's no lookup time. It's not a cache.


No, because there'd be the loading overhead you mentioned before. Furthermore, if it were feasible to keep it unpacked in RAM, then it'd also be feasible to keep it unpacked in storage.

I thought you were posing some case where highly-compressible data were stored that wasn't feasible to store in its uncompressed state.


That's what we're talking about eliminating.


Better yet if you can store it uncompressed and use it in-place from 3D XPoint. Then, you never pay the price of loading it.


Before nit-picking, actually check that we're talking about the same thing.
 

Not really. GPUs primarily use SMT, on a massive scale, to hide latencies. For instance, each "Streaming Multiprocessor" in Pascal (i.e. what we would call a "core", in the CPU world) switches between up to 64 "warps" (what we'd call a "thread", in the CPU world) to hide memory access latency. Intel GPUs support up to 7 threads per core.

Graphics has so much concurrency that GPUs can afford to take the "lazy" approach. That's far better than burning a lot of die space and power on fancy prefetchers, and then wasting precious bandwidth when they're inevitably wrong.

GPUs use concurrency to paper over so many things. They're very much about "keeping it simple", especially when doing so can save die area or power. The more power & area-efficient you can make each GPU core, the more cores you can afford to have and/or the faster you can afford to clock them.
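As a back-of-the-envelope illustration of hiding latency with many warps, here is a simple round-robin model. The cycle counts are assumed for illustration only and are not measured Pascal figures; the function names are made up for this sketch.

```python
def issue_utilization(warps, compute_cycles, memory_latency):
    """Fraction of cycles the core has a ready warp, in a simple round-robin model.

    Each warp alternates `compute_cycles` of work with a `memory_latency`
    stall; other warps run during the stall.
    """
    per_warp = compute_cycles / (compute_cycles + memory_latency)
    return min(1.0, warps * per_warp)

def warps_to_hide(compute_cycles, memory_latency):
    """Smallest warp count that keeps the core busy in this model."""
    # Ceiling of (compute + latency) / compute, using integer arithmetic.
    return -(-(compute_cycles + memory_latency) // compute_cycles)

# Assumed figures, for illustration only: 10 cycles of work per 400-cycle miss.
one_warp = issue_utilization(1, 10, 400)
many_warps = issue_utilization(64, 10, 400)
needed = warps_to_hide(10, 400)
```

Under these assumed numbers, a single warp leaves the core idle almost all the time, while 64 resident warps are more than enough to cover the stall without any prefetcher.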


First, the premise of my point is specifically to decrease access latency by putting HBM2 next to the CPU die. Intel's Knights Landing has a mode where it can use its 16 GB of HMC as a huge L3 cache, for exactly this reason.

Second, I think CPUs are much more dependent on hardware prefetchers than GPUs. In many cases, their memory accesses are quite regular. Also, due to lack of concurrency, CPUs are less able to hide memory access latency. So, the architects put a lot of effort into making their prefetchers fairly sophisticated.


So, why did Intel use HMC in Knights Landing?


Exactly why would it be worse than DDR4?

In this analysis of HMC v1, the author estimates that with a request size of just 64 bytes (a typical L1/L2 cacheline size) you can achieve 71.9% efficiency at 56% reads.

https://www.ece.umd.edu/~blj/papers/thesis-PhD-paulr--HMC.pdf
 

The depth of the execution pipeline and memory latency are two completely unrelated topics. What the CPU needs to cope with latency is instruction re-order with sufficient eligible instructions to fill the execution pipeline and modern Intel CPUs already have a much deeper re-order queue than Prescott. Prescott's reorder buffer could accommodate up to 126 instructions while modern Intel CPUs were at 192 last time I read about it.

Of course, the worse the memory latency, the harder the scheduler has to work to find eligible instructions and fill execution ports when the re-order queue gets held up by cache/memory-bound dependencies.
 
In my business line (school systems and hospitals), I can tell you there will be even fewer "PC" sales in the near future. Corporations and schools are tired of buying computers every other year, and they are quickly moving to centralized, server-based virtual workstations, running desktop Lite (basically a terminal: screen, mouse and keyboard) for users to use. In 5 years, PC sales (business sales) will have dropped a lot more than you think.

IT will no longer upgrade desktops or buy computers for users to mess with. With everything centralized, OS upgrades become a cinch and full control over users becomes inescapable.
 

FWIW, this analysis of HMC (v1) quotes a t_RAS of 27.5 ns, based on:
Gwangsun Kim, John Kim, Jung Ho Ahn, and Jaeha Kim. Memory-centric System Interconnect Design with Hybrid Memory Cubes. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT ’13, pages 145–156, Piscataway, NJ, USA, 2013. IEEE Press.
https://www.ece.umd.edu/~blj/papers/thesis-PhD-paulr--HMC.pdf

In his performance baseline, he uses "a latency of t_RCD + t_CL + t_Burst = 19 ns for a DDR3-1600 device, which represents the minimum time to open a row and stream data out of that row."
Yeah, I know that's his "minimum", but in non-realtime performance estimation it makes more sense to look at "typical" than your worst-case example.
 

You pay the price of a 200+GB X-Point DIMM, which is likely going to be much more expensive than a similar-capacity NVMe SSD; increased storage requirements, due to having to store data in unpacked format if you want to use it in-situ without additional processing overhead; and 20-100X slower access latency than DDR4.

As for examples of "compressible data", think of procedurally generated textures, geometry and vector art. These take hundreds if not thousands of times more space once rasterized for use as a texture or object by GPUs. Normal textures also go through multiple levels of scaling and filtering for LOD mipmaps, you'd need to store all that duplicate data too if you wanted to eliminate reloading.
 


Okay, so memory has a latency that depends on what you're asking it for. It will bring up a specific row to start, and then it will access a specific column. Repeated accesses to the same row on different columns is fast. Changing row is slow.
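The row/column behaviour described above can be sketched with a toy single-bank, open-row model. The timing values and row size are assumptions chosen for round numbers; real parts differ and also have multiple banks, refresh, and burst timing that this ignores.

```python
# Assumed timings in nanoseconds and an assumed row size; real parts differ.
T_RP, T_RCD, T_CL = 14, 14, 14
ROW_BYTES = 8192

def access_latencies(addresses):
    """Latency of each access for a single bank with an open-row policy."""
    open_row = None
    latencies = []
    for addr in addresses:
        row = addr // ROW_BYTES
        if row == open_row:
            latencies.append(T_CL)                  # row hit: column access only
        elif open_row is None:
            latencies.append(T_RCD + T_CL)          # bank idle: activate, then read
        else:
            latencies.append(T_RP + T_RCD + T_CL)   # row conflict: precharge first
        open_row = row
    return latencies

same_row = access_latencies([0, 64, 128, 192])
row_hopping = access_latencies([0, 8192, 16384, 24576])
```

With these assumed timings, walking within one row costs 14 ns per access after the first, while hopping to a new row every time costs 42 ns.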

With parallel workloads, this pattern is utilized to allow each compute unit to have access to the data it needs at the same time as the rest of the compute units. In a GPU, the rows can be quite wide, as there are plenty of compute units that will access adjacent portions of RAM simultaneously. HBM and HMC are both approaches that use massively oversized rows. They can do this because they support many compute units.

Now, that increased width means that it takes HBM and HMC longer to switch rows. That's fine in a GPU or massively parallel compute card, as they operate in a predictable manner that is highly repetitive. The same compute units do the same thing repetitively until the task is complete. With Xeon Phi, there are 244 threads, and each thread does the same thing. In the specific case of that Xeon Phi, there are only 61 compute units. This means that each compute unit will repeat roughly the same task four times. This makes it pretty easy to figure out what they'll need next, as the input data tends to be arranged side-by-side.

Now, CPUs on the other hand (usually) only access a small piece of any specific row at a time. This means that they spend more time switching between rows, and that in turn means that the tradeoffs that benefit GPUs and compute cards don't work with the CPU. This is one reason why they need far more sophisticated prediction circuits/algorithms.

CPUs also differ from compute cards and GPUs in another crucial aspect: a CPU often has each core working on something completely different than the other cores. That means that each core will probably need totally different rows at the same time. This is where cache comes in. Adding HBM doesn't solve the issue that cache addresses in a CPU. In a GPU, it does. This is due to HBM having wide enough rows that it can feed every compute unit at the same time without switching rows at all. Older GDDR modules didn't have wide enough rows to pull that off. It's also worth mentioning that GPUs still have L1 cache to handle working data, even in the HBM models. L1 is a fundamental part of the compute unit, and cannot be moved off of the CPU/GPU/compute core. If I'm not mistaken, modern HBM GPUs also have L2 cache, as L2 also performs vital roles in collating data. The only cache that HBM is capable of replacing entirely, even on a GPU, is L3.

Lastly, cache isn't DRAM. It's SRAM. HBM is still DRAM, and therefore has all of DRAM's issues. SRAM outpaces DRAM by a wide margin. To put it in perspective, Skylake has a cache bandwidth of nearly 1 TB/s. Even looking exclusively at bandwidth, HBM can only match the cache in a modern CPU. It certainly won't offer any improvement.

I do stand corrected on the pipeline comment. Regardless, HBM would still impose a massive restriction on IPC in a modern CPU. It's engineered for a very different task than as a CPU cache.
 

At the beginning, GPUs had no L1, then they got L1. They had no L2, now they have L2. As GPUs get used for more general compute-centric stuff, I wouldn't be surprised to see them pick up more general computing stuff... like L3. They can get away without it because current workloads are still mostly linear, uniform and predictable but that may change.
 

I'm talking about a hypothetical point in time where 3D XPoint is comparable in $/GB with NAND flash.


In extreme cases, there's no way around computing as-needed. The benefit that 3D XPoint DIMMs would offer is to persistently cache the result, should it make sense to do so. Obviously, procedural textures will always be computed on-the-fly. That's what GPUs are built for.


The size of MIP Maps converges as 4/3rds the size of the original.
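The 4/3 figure is the geometric series 1 + 1/4 + 1/16 + ..., since each mip level has a quarter of the texels of the one above it. A quick check, assuming a square power-of-two texture and a full chain down to 1x1 (the function name is just for this sketch):

```python
def mip_chain_texels(width, height):
    """Total texels in a full mip chain, halving each side down to 1x1."""
    total = 0
    while True:
        total += width * height
        if width == 1 and height == 1:
            return total
        width = max(1, width // 2)
        height = max(1, height // 2)

base = 1024 * 1024
ratio = mip_chain_texels(1024, 1024) / base
```

For a 1024x1024 base level the ratio comes out just under 4/3, and it gets closer as the base texture grows.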

But, I wonder if the stored size of 3D models is now dominated by geometry, since so many textures are now procedural.
 

See above, where the paper I cited says you can get pretty good efficiency with just 64-byte request sizes.

Anyway, Pascal uses 32-lane SIMD, translating to a 128-byte request size (32 lanes x 4 bytes each). But, in 3D rendering, most requests will be scatter-gather, not linear.


In dense linear algebra operations, sure. Not so much for 3D rendering.


Nope. These are true threads - not like GPU "threads". Every one of the 244 threads can be simultaneously executing a different program, in a separate process. That's what makes Xeon Phi special. It has nearly the parallelism of a GPU, but it can still execute legacy code.


Just to be clear, Xeon Phi implements hyperthreading in a way that's functionally equivalent to your desktop CPU. They're completely independent. You really shouldn't try to explain things you don't understand.


Well, the fact that cache lines tend to be about 64 bytes is a pretty good indication of the typical access locality. If access were sparser, then we should see smaller cache lines.


This is rubbish. Again, to pick on Nvidia's Pascal, they have 2x 32-lane SIMD engines per SM. Those can switch between 64 warps, and there's no requirement or assumption that they're accessing consecutive memory locations. In fact, each lane of the SIMD (which they call a "thread") can do an indirect load, which is often referred to as scatter-gather loads/stores.


It's dwarfed by the registers, however. GPUs use cache in very different ways than CPUs. As you say, L2 is mainly used to stage & collate loads and stores. By comparison with desktop CPUs, the amount of L2 that even the most monstrous GPUs have is tiny. I doubt we'll see a GPU with L3.


Tell it to the architect of Knights Landing.
 

No. Cache is very power-inefficient and GPUs have thousands of hardware threads with which to hide latency. These guys claim 60% of the power dissipated by modern CPUs is from the cache hierarchy:

http://www.csm.ornl.gov/SOS20/documents/Sohmers.pptx

GPUs are fundamentally about power-efficiency.

Plus, the more threads you have, the more likely it is that what you might want will have gotten evicted from L3 by the time you want it. GPUs move such massive amounts of data through that shared caches really don't make much sense. This is probably why Knights Landing has no L3 (other than MCDRAM).

Again, what lets them get away with it is employing SMT on such a massive scale.
 

Depends on what you use it for. For conventional GPU workloads, shaders are all working independently on their own small piece of the overall puzzle with very little to no interaction with each other. Once you introduce the need to buffer data between cores executing tightly coupled threads to smooth out variances in execution speeds between related threads, you will want L3 to avoid completed intermediate results hogging the L2 on one core or thrashing the destination core's L2 before it is done doing whatever it was doing on its previous data.

Will GPUs and Knights Landing be used to run that sort of more desktop-like code? Maybe, maybe not. If they do, they'll need L3 or some other functionally similar structure to keep data on-die as it transits between cores to avoid the huge penalties associated with evicting to RAM and having to reload from it moments later.
 
GPGPU has been around for ~15 years. OpenCL has been with us for about a decade, with CUDA (and other proprietary efforts) predating it. So, it's safe to say the GPU compute community has considered how to adapt shader-oriented architectures for efficient, general parallel execution.


Caches are a hack. Their prevalence in general-purpose CPUs is a by-product of having to support & optimize legacy software, and then having to drag along software developers into the realm of parallel computing. They prioritize generality and ease of use over the efficiency of explicitly addressable local memory.

GPU computing didn't have the first problem, and the second was mitigated by the willingness and adventurousness of any software developers treading this path. With the price of shared caches so dear, they opted to expose the memory hierarchy and to force software to be explicit about thread collaboration.

In OpenCL and CUDA, threads are organized into work groups or blocks, respectively. Memory is allocated as private, group-local (i.e. __local or __shared__), or global. Furthermore, OpenCL and CUDA both provide explicit mechanisms for asynchronous data transfers.


It has a mode where its MCDRAM (HMC) can be used as a 16 GB L3 cache.


In the other mode, the MCDRAM is instead explicitly addressable. This is probably for those primarily treating it as a GPU, and using APIs like OpenCL to manage the MCDRAM pool.
 

Many everyday algorithms in desktop software are inherently sequential in nature and simply don't lend themselves to efficient parallelization. If you make a thread-safe hash tree, sparse array or most other STL constructs that programmers take for granted, performance quickly ends up bottlenecked by synchronization objects or by burning cycles busy-waiting, and you are better off running your algorithms non-threaded with a global readers-writer lock to protect the whole object during updates.

Another entire class of common, intrinsically sequential algorithms that we all use countless times on a daily basis is parsers. They don't parallelize well at all, because they cannot do their job without knowing the context that got established prior to any given point within the input stream.
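A small sketch of why parser context matters, using a made-up task: counting commas that fall outside double-quoted strings. The function and the sample input are hypothetical; the point is only that a chunk cannot be parsed correctly without the state left by everything before it.

```python
def count_separators(text, in_quotes=False):
    """Count commas outside double quotes; return (count, final quote state)."""
    count = 0
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "," and not in_quotes:
            count += 1
    return count, in_quotes

line = 'a,"b,c",d,e'
whole, _ = count_separators(line)

# Naive split into two chunks, each parsed as if it began outside quotes.
left, right = line[:4], line[4:]
naive = count_separators(left)[0] + count_separators(right)[0]

# Correct chunking has to carry the state forward, which serializes the work.
n_left, state = count_separators(left)
chained = n_left + count_separators(right, state)[0]
```

The naive split miscounts because the second chunk starts inside a quoted string and has no way of knowing it. Carrying the state across chunks fixes the count, but then the second chunk cannot start until the first has finished.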

Parallel computing isn't a viable solution for everything.
 

I'm not really sure who you're arguing with now, but I'd never claim that CPUs optimized for sequential and lightly-threaded workloads won't be with us for the foreseeable future. Caches, super-scalar, OoO, and all.

IMO, the more interesting question is what happens with FPGAs and machine-learning optimized processors. For a while, it looked like the big horsepower in the computing world would simply be divided into two classes of chips: CPUs and GPUs. Not to say that FPGAs and ASICs haven't been used for various specialized applications, but that's what we seemed to be converging towards. Now, I'm starting to think that the dominance of AI and machine learning might yield a new class of processors that will at least gain prevalence in the cloud, if not also mobile SoCs (and you can't necessarily rule out the possibility of them appearing in desktops).
 

You called L3 and strong single-threaded cores a kludge to support legacy software. I merely pointed out that there are algorithms where parallel computing is simply not practically feasible. We'll be needing at least some strong single-threaded cores well beyond the foreseeable future. For most other things, there's (i)GPGPU which is far more power-efficient.

At some point in the future, we'll probably see x86 CPUs going big.LITTLE, with 4-8 full-blown dual-threaded cores to handle heavy timing-critical loads and 4-16 simplified quad-threaded cores for all the background stuff running on the CPU.
 

I said no such thing.

What I said was that caches are a hack. It's true. That doesn't mean they're bad. It does mean they're not a globally optimal solution. But that doesn't mean CPUs can just do away with them so easily. However, being a somewhat clean slate, GPUs did a better job of reducing their dependence on caches - especially shared ones.

My original point about L3 was just that perhaps HBM or HMC could obviate the need for them. We don't need to rehash that whole issue - I'm just restating what I actually did say.

How you got from that to the idea that I thought GPUs were the best solution to every problem or that I think strong OoO cores are bad and will disappear is quite simply mystifying.
 


So, are you saying that storage hierarchies used in datacenter applications are also a hack?

Or are you suggesting that we shouldn't be using DRAM at all, and should just use SRAM for everything (or vice-versa)?

Or are you saying that it's not a hardware engineer's job to consider how easily a programmer can utilize his/her product?

Or are you suggesting that a "properly" designed CPU has no need for cache at all, and the existence of any type of cache is an indication of more fundamental problems with the CPU?

Or are you suggesting that there's a more elegant solution that provides better general-purpose performance than caches provide?

Regardless, I am having trouble seeing any valid points among the options above. They're all shaky at best, or simply incorrect/based on faulty assumptions at worst.

Lastly, what exactly is the problem with having a cache? It dramatically boosts performance in many applications without any particular drawbacks, and in modern CPUs, they provide cross-core communication channels of dramatically higher bandwidth than anything off the CPU. Basically, they don't hurt anything, but they do benefit quite a number of things, so where's the problem?
 

You do know that L1 and even part of the register file is shared between 32+ shader cores in all GPUs in recent history, right? That makes GPUs ESPECIALLY dependent on shared caches. The L2 is sitting on top of the memory controllers, which makes it shared as well and more comparable to L3 in desktop CPUs.

As for GPUs being a "clean slate", I think the main reason GPUs could get away with little to no cache before is simply that rendering graphics consisted mainly of fetching texture data from memory and that data is of such time-limited relevance (only needed long enough to spit out a few pixels) that it isn't worth caching most of the time. Shaders were also so limited in length and complexity that they didn't need external stores either. That is no longer the case with GPGPU and today's heavily compute-driven graphics. HBM2 is no substitute for sufficiently large and fast caches, which is why AMD doubled the L2 cache size to 4MB with Vega despite using HBM2.
 
I'm replying only for the sake of explaining my previous statement. However, I think debating such matters would be getting too far off topic.

Not all caching is bad, but the further you get from the application logic, the more crude and inefficient it becomes. The closer the caching is done to the application, the more intelligently the cache can be managed.


In the ideal world, you don't have hardware burning a lot of power trying to guess those things the software will do that the compiler already knows. Compilers are smart enough to do a better job of telling the hardware what to prefetch, when, and also when to flush and evict things (in many cases). That's not to say you necessarily replace all existing cache with explicitly addressable SRAM or HBM2 buffers and make the hardware completely dumb, but in some cases it's actually better than using an associative cache.

See the above link for further argument of this position. Cache burns a lot of power and die space ain't free, so the more you can simplify or eliminate it, the better you can scale (i.e. massively parallel architectures).

Again, I point to Knights Landing's explicitly addressable mode for its MCDRAM as one example of this. Whether it's used as a software-managed cache or the entire working state of the algorithm can fit within it is left up to the software to decide (i.e. in that mode).

It's no accident that IBM's Cell was so groundbreakingly fast or that the Sunway TaihuLight's SW26010 is so efficient - both employ explicitly-addressable local memory. CUDA and OpenCL provide constructs to ease the burden of programmers, on such architectures, offloading much of the housekeeping to the runtime environment.

And it's no accident that GPUs, which offer the most power-efficient computation of the generally programmable platforms available today, have no L3 and less total L2 than even mainstream desktop CPUs.


Well, that'll have to be too bad. I didn't love the idea, at first, but then it began to grow on me. We're not going to resolve it here. So, you can think about it some more or reject it out of hand. Not my problem.
 

Sure, they use caches, but not in the way that you were saying. They don't use shared caches for communication between the threads - they use them primarily for efficient batching of memory accesses.


Sure, it started out that way. But they quickly grew to a scale where any feasible amount of L3 would be pointless. They use explicitly-managed, on-die shared memory for low latency. When they have to go off-die, the massive amount of SMT and comparatively low clock speeds can often hide the latency of doing so. It works out to be a more efficient and scalable approach.


Okay, so where's their L3?


If you're telling me that 256-core (AKA compute unit) Vega's 4 MB of L2 is used the same way as the 1.375 MB L3 per core of Intel's <= 28-core Scalable Xeons, I'm not believing it. And that would be an impasse it seems we cannot bridge.

Seriously, you're talking about 16 kB per compute unit or 1 kB per SIMD lane. Barely a drop in the bucket, compared with the Xeons. Put another way, Vega's memory bus can pass 121 kB per second per byte of last level cache, whereas a 28-core Xeon can only pass 3.3 kB per second per byte of its last level cache. Perhaps that might drive home the point that these caches exist for very different purposes.
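For anyone who wants to check the arithmetic, here is a sketch that reproduces those figures. The cache sizes and unit counts come from the post; the 16 lanes per unit is implied by the 16 kB and 1 kB figures. The peak memory bandwidths (484 GB/s for Vega's HBM2, 128 GB/s for a six-channel DDR4-2666 Xeon) are assumptions that are not stated in the post, chosen because they reproduce its numbers.

```python
KB, MB, GB = 1e3, 1e6, 1e9  # decimal units, which reproduce the post's ratios

# Vega: 4 MB of L2, using the post's count of 256 units and 16 lanes per unit.
per_unit_kib = 4 * 1024 / 256       # binary units give the round 16 kB figure
per_lane_kib = per_unit_kib / 16

# Last-level cache sizes in bytes.
vega_l2 = 4 * MB
xeon_l3 = 1.375 * MB * 28           # 1.375 MB of L3 per core, 28 cores

# Assumed peak memory bandwidths (not stated in the post).
vega_bw = 484 * GB
xeon_bw = 128 * GB

# Bandwidth per byte of last-level cache, in kB per second per byte.
vega_kb_per_s_per_byte = vega_bw / vega_l2 / KB
xeon_kb_per_s_per_byte = xeon_bw / xeon_l3 / KB
```

Under those assumptions the ratios come out at about 121 and 3.3 respectively, matching the figures above.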
 