Report: Intel Haswell-E Halo Platform Will Have Eight Cores


HT works exactly the same as having more physical cores as far as software is concerned and does not require any HT-specific programming to use.

What does require architecture-specific optimization is fancier stuff, like using the second HT thread (or AMD's second integer core) as an intelligent cache pre-fetcher or in some other supporting role for the thread/core that does the heavy lifting.

Unless you go out of your way to implement architecture-specific tweaks, every HT thread and every int-core looks exactly the same to software.
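
For example, a plain C++ program just asks the OS how many logical processors exist and spawns that many threads; it neither knows nor cares which of them are HT siblings. A minimal sketch (the busy-loop workload is obviously just a placeholder):

```cpp
#include <iostream>
#include <thread>
#include <vector>

// Placeholder per-thread workload; any CPU-bound function would do.
static void work(unsigned id) {
    volatile unsigned long long x = 0;
    for (unsigned long long i = 0; i < 100000000ULL; ++i) x += i;
    (void)id;
}

int main() {
    // Reports logical processors (physical cores x HT threads per core).
    // The OS schedules onto all of them identically; no HT-specific code needed.
    unsigned n = std::thread::hardware_concurrency();
    std::cout << "Logical processors visible to software: " << n << "\n";

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back(work, i);
    for (auto& t : pool)
        t.join();
}
```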
 

Should be a $600 part too, just like the...3930K, is it?


Ah. Interesting. Thanks! :)


Um. See, I really can't prove anything a year in advance. I could post a few links and dump the info I've gathered over the last few months into this post, but I'm just too lazy at the moment.

But in short,
Broadwell has more or less been confirmed to be BGA/PGA only, without any LGA part.
There will be a Haswell "Refresh" next year for the desktop with the 9-series chipsets. These will be LGA. Compatible with 8-series? Most probably.
Haswell-E will also come out next year and fit into LGA 2011-3.
Skylake comes next for the desktop in 2015.

That's all folks.

p.s. I do remember seeing one indication of a Broadwell LGA part, but everything since has pointed to a Haswell Refresh. Unless this Haswell Refresh turns out to be Broadwell, which I don't think it is, the above info seems to be the most likely case as of now.

p.p.s. AMD wouldn't love Intel even if they decided not to release anything at all for a year.
 
Well, more cores need more memory bandwidth. That's why the Sandy Bridge-E platform has quad-channel memory. Maybe when DDR4 hits, dual-channel will be enough to keep 6 cores fed.

Then again, the iGPU can probably consume all of that memory bandwidth all by itself. So you probably need both on-die graphics memory (which is making an appearance in mobile Haswell and, I think, the Xbox One CPU) and DDR4.

Also, most desktop users don't need 4 cores, let alone 8. Even most games don't benefit much from going beyond quad-core. I think Intel got the market segmentation right: 6+ cores belong on a high-end platform and only make much sense for high-end users and workstations/servers.
 
I think I'd be surprised if you didn't.
😉

We're talking about very different things, here. Any GPU-compute task is currently driven by the CPU. The computation model is that the CPU prepares some work and ships it off to the GPU. In the meantime, the CPU can go do something else or queue up more work, but if we're talking about something like a game, the CPU often needs to get the results before it can finish that frame and move on to the next. So, pipelining opportunities are limited.

Now, the CPU is waiting for the data to transfer to the card, then for the GPU to process it, and finally for the GPU to return some result. If any of those steps includes a significant amount of data transfer, then the time the CPU must wait for that data directly affects the amount of time to process a frame, which directly affects the framerate.

And we're not talking about "a few cycles more or less", but rather something like feeding 100 MB of scene geometry to the physics engine, for instance. And let's just worry about one direction. At 4, 8, or 16 GB/sec, that transfer takes 25 ms, 12.5 ms, or 6.25 ms. No big deal, eh? Well, if you're doing it every frame, then the maximum framerate you could reach would be 40, 80, or 160 fps.

Right now, you're probably thinking that 80 fps sounds pretty good, so 8x it is! But let's not forget that we're assuming exclusive use of the bus, and there's a lot of graphics data being shipped over it as well. It also assumes the GPU's computation time is zero, which it's not. And the CPU has other chores that are either dependent on the results of the GPU computation or a dependency of it. So maybe you can see why bus speed is so important: transaction time can matter as much as, or more than, the total amount of stuff you could cram over the bus if you were simply sending nonstop.

Of course, I have no idea how realistic 100 MB is, and clever game devs will try to keep as much data on the graphics card as possible and overlap as many operations as they can. But even if you cut my number down to 10 MB and assume the bus is available 50% of the time, you still end up potentially spending 30% of your frame time at 60 fps waiting on just the bus, in the PCIe 3.0 x4 case.
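
For what it's worth, here's the same back-of-the-envelope arithmetic as a quick C++ sanity check (the payload sizes, link speeds and the 50% bus-availability figure are just the assumptions from this post, not measurements):

```cpp
#include <cstdio>

int main() {
    const double frame_budget_ms = 1000.0 / 60.0;        // ~16.7 ms per frame at 60 fps

    // 100 MB per frame over roughly PCIe 3.0 x4 / x8 / x16 (~4, 8, 16 GB/s).
    const double payload_mb = 100.0;
    const double bus_gb_per_s[] = { 4.0, 8.0, 16.0 };
    for (double bw : bus_gb_per_s) {
        double transfer_ms = payload_mb / (bw * 1000.0) * 1000.0;  // MB / (MB/s) -> s -> ms
        printf("%5.1f GB/s: %6.2f ms per transfer -> max %5.1f fps\n",
               bw, transfer_ms, 1000.0 / transfer_ms);
    }

    // Revised case from above: 10 MB per frame, bus free only 50% of the time, x4 link.
    double waiting_ms = 10.0 / (4.0 * 1000.0) * 1000.0 / 0.5;
    printf("10 MB @ 4 GB/s, 50%% bus availability: %.1f ms = %.0f%% of a 60 fps frame\n",
           waiting_ms, 100.0 * waiting_ms / frame_budget_ms);
}
```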

Don't think we won't start seeing this, since the APUs used in the upcoming consoles have no bus separating the CPU and GPU. They both have fast, wide datapaths to unified memory, as well as possibly a shared L3 cache. Game devs will certainly be doing a lot more heterogeneous computing, to use AMD's parlance.

Fortunately, this point can quite easily be refuted, on the basis of PCIe scaling studies Tom's has actually done. In an article they wrote about PCIe 2.0 scaling, nearly three years ago, they stated:

we did see a fairly large difference between x8 and x16 slots

Source: http://www.tomshardware.com/reviews/pcie-geforce-gtx-480-x16-x8-x4,2696-17.html
Which flies in the face of your above assertion.

That's completely out of context. We're talking about this platform, not something coming out in 2015 or 2016! I was saying that you have to look at the demands of the software that will be out during the first couple years after this platform is released, in order to judge whether PCIe x4 or x8 would be a bottleneck on it, because that's the minimum period of time when people who buy these will actually be using them.
 

You shouldn't be feeding geometry data to the GPU; you upload it to the GPU's RAM once and then reuse that. The only traffic afterward is scene updates and the CPU does not need to update the whole scene at once. Even without caching scene geometry to GPU RAM, it can still upload geometry one object at a time as it gets done processing each of them and process the next object while geometry for the previous one is being transferred.


If you keep everything you can possibly leave in GPU RAM in GPU RAM, the bus would be free for control and data ~100% of the time and if programmers do their job right, they would be interleaving GPU commands/data transfers with other processing to avoid stalling on IO backlog.
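
To make the retained approach concrete, here's a rough sketch of the pattern; the gpu_* helpers are hypothetical stand-ins for real API calls (think glBufferData once at load, glBufferSubData for the per-frame deltas):

```cpp
#include <cstdio>
#include <vector>

// Hypothetical GPU API stubs standing in for real buffer-upload calls.
struct GpuBuffer { size_t bytes = 0; };

GpuBuffer gpu_upload_once(const std::vector<float>& verts) {
    // One big transfer at load time; the data then stays resident in GPU RAM.
    printf("upload %zu bytes once\n", verts.size() * sizeof(float));
    return { verts.size() * sizeof(float) };
}

void gpu_update_region(GpuBuffer&, size_t offset, const std::vector<float>& patch) {
    // Per-frame partial update of only what actually changed.
    printf("update %zu bytes at offset %zu\n", patch.size() * sizeof(float), offset);
}

void gpu_draw(const GpuBuffer&) { /* issue draw call referencing the resident data */ }

int main() {
    std::vector<float> scene(1 << 20, 0.f);        // static level geometry
    GpuBuffer vbo = gpu_upload_once(scene);        // cost paid once, not per frame

    for (int frame = 0; frame < 3; ++frame) {
        std::vector<float> moved(256, float(frame)); // only the dynamic objects
        gpu_update_region(vbo, 0, moved);            // small per-frame bus traffic
        gpu_draw(vbo);                               // the rest never crosses the bus again
    }
}
```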


No matter how fast the interconnect between CPU and GPU is, you still wouldn't want to process a large geometry/scene blob in full before starting to send it to the GPU, even with shared memory and a zero-copy API that technically has 0 ms transfer latency. The GPU still needs early access to scene data to start rendering the next frame, and the GPU likely needs more cycles to render the frame than the software needs to prepare the data.
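
A minimal sketch of that kind of chunked overlap (the prepare/transfer functions are placeholder stand-ins, not a real driver API): each object's transfer is kicked off as soon as it is ready, while the CPU moves on to preparing the next one, instead of building the whole blob first.

```cpp
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>
#include <vector>

using namespace std::chrono;

struct Object { int id; };

// Placeholder: CPU-side preparation of one object's data (animation, culling, ...).
Object prepare(int id) { std::this_thread::sleep_for(milliseconds(2)); return {id}; }

// Placeholder: transfer of one object's data toward the GPU.
void transfer(Object o) { std::this_thread::sleep_for(milliseconds(2)); printf("sent %d\n", o.id); }

int main() {
    const int objects = 8;
    std::vector<std::future<void>> inflight;

    // While object i is in flight, the CPU is already preparing object i+1,
    // so the GPU gets early access to scene data instead of waiting on one big blob.
    for (int i = 0; i < objects; ++i) {
        Object o = prepare(i);
        inflight.push_back(std::async(std::launch::async, transfer, o));
    }
    for (auto& f : inflight) f.get();   // drain the outstanding transfers
}
```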


"fairly large" being 1-5% in most games they used, which is negligible in my book.

Here is a newer review of PCIe performance scaling using Ivy Bridge, an HD 7970 and a GTX 680...
http://www.techpowerup.com/reviews/Intel/Ivy_Bridge_PCI-Express_Scaling/23.html

In most cases, there is only a 1-2% difference between x4 and x8.
 
This is quickly descending into technical nitpicking that's of questionable relevance to the original point. I'm just saying...

Of course you wouldn't spoon-feed primitives to the GPU one at a time. That's obviously not what I meant. But you need to send/re-send geometry that changed, hence my revised figure of 10 MB. However, not being a game dev on an AAA title, I can't say for sure how much scene geometry must be updated by the CPU every frame, or even comment on whether a "retained mode" is what modern games actually use.

In pretty much any game worth playing, stuff is happening in the game world around the player. Therefore, it would not be possible just to dump everything in GPU memory (assuming it's big enough) and simply sit back and tweak camera angles. And if the GPU is being used to accelerate physics or AI, as in my example, then some data must flow back to the CPU, hence the blocking.

The article I cited is nearly three years old. If they saw a difference way back then, surely the disparity is larger now. As for the article you cited, the problem with their conclusions is that they averaged over all resolutions, including ones that are heavily fill-rate limited and too slow for any serious gamer to actually use. If you restricted it to resolutions that are actually usable, I think you'd probably see the gap has widened slightly since 2010.

Now, by the time Haswell-E systems hit the streets, even that article will probably be 2.5 years old and the situation will have shifted further. And that's just on the release date of Haswell-E. In the year+ that follows, while it's still the highest-end platform available, the situation will continue to evolve. So, all of this is very speculative and only time will tell the answer.
 

The article I cited does not average over all resolutions. Each test case has separate graphs for every resolution used and for the most common modern resolution (1920x1080), there is less than 5% difference between x4 and x16 in most games tested.

As for "the disparity getting larger", I'm not really seeing that. Drivers, APIs, game frameworks/engines and programmers are getting better at working around latency and non-deterministic GPU/driver/API behavior. To make SLI/CFX work better, they have no choice but to improve on that since everything is becoming that much more unpredictable in the process with more hardware and software involved. Getting the most out of future GPUs will likely need it just as much due to the large amount of compute resources that need scheduling and the inherent challenge of keeping 2000+ GPU threads busy without overwhelming the host CPU. You cannot do that if your software is written to keep the GPU on a tight leash so the whole rendering/physics pipeline needs to become more loosely coupled to make things more efficient.

Sound familiar? This is fundamentally the same sort of effort programmers need to make to leverage multi-core/multi-thread CPUs. The more large-scale parallelism you are attempting to exploit, the more vitally important it becomes to reduce or eliminate inter-dependencies between threads and stages while structuring algorithms to tolerate more latency. Code with multiple or complex inter-dependencies and low tolerance for latency will usually scale miserably, sometimes to the point of performing worse than its non-threaded equivalent.
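
As a toy C++ illustration of that point (the sleep-based "jobs" are placeholders for a GPU dispatch or a worker thread): the tightly coupled version blocks on the long-latency job before doing anything else, while the loosely coupled version overlaps the wait with independent work and finishes in roughly half the time.

```cpp
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

using namespace std::chrono;

// Placeholder for a long-latency job (a GPU dispatch, a worker thread, I/O, ...).
int long_latency_job() {
    std::this_thread::sleep_for(milliseconds(10));
    return 42;
}

// Placeholder for CPU work that does not depend on the job above.
void independent_cpu_work() {
    std::this_thread::sleep_for(milliseconds(10));
}

int main() {
    // Tightly coupled: block on the result, then do the independent work. ~20 ms.
    auto t0 = steady_clock::now();
    int r1 = long_latency_job();
    independent_cpu_work();
    auto tight = duration_cast<milliseconds>(steady_clock::now() - t0).count();

    // Loosely coupled: overlap the wait with the independent work. ~10 ms.
    auto t1 = steady_clock::now();
    std::future<int> f = std::async(std::launch::async, long_latency_job);
    independent_cpu_work();
    int r2 = f.get();
    auto loose = duration_cast<milliseconds>(steady_clock::now() - t1).count();

    std::cout << "serial: " << tight << " ms, overlapped: " << loose
              << " ms (results " << r1 << ", " << r2 << ")\n";
}
```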

Even if you eliminate communication-bound latency, you still have to deal with computation-bound latency, so you still need to write code with (usually unpredictable) latency in mind. This is not going away even with GPGPU or whatever its brand-specific name and complementary features might be.

As for PCIe 4.0, with Broadwell seemingly getting delayed, I would not be too surprised if it launched with the necessary hardware on-chip, just waiting for PCIe 4.0 certification if it lacks that at launch, much like SB-E did with PCIe 3.0.
 