AMD CPU speculation... and expert conjecture


8350rocks

Distinguished


The new consoles won't have to contend for resources now, so those additional threads can be run concurrently. Developers get a big win there!

Except the GPU already does far more work than the CPU, so there isn't much left for the CPU to do aside from feeding the GPU and running the easy-to-run threads (audio, UI, OS, etc.). So I doubt you are going to see MORE work put on the GPU, since that would quickly bottleneck. And 60 FPS is the IN thing now, which is going to further limit what devs can do.

Well, with current setups...yes. The GPU on PS4 had the compute functions increased by something like 8 times...which will allow more offloading. We will have to see how it turns out...but I think the results will be good.

Except some tasks CAN'T be broken up. And you have to worry about all the extra bottlenecks, race conditions, priority inversions, and the like that can occur under VERY specific circumstances that make debugging a PITA. Then people whine and complain that the game engine stinks, how the game was released as a beta, etc.

Yes, that's true, some tasks have to be run serially...whether or not coding languages will allow ways to make that more efficient in the future is only speculation. We will have to see how it turns out. Also, debugging is always a PITA...LOL. It's merely to what degree it is a PITA that would be in contention. :)
 


Yes they will be fighting for resources; you still have to have multiple threads accessing the same data, which still acts as the primary bottleneck. Sure, you increased throughput, but the bottleneck remains in place.

Well, with current setups...yes. The GPU on PS4 had the compute functions increased by something like 8 times...which will allow more offloading. We will have to see how it turns out...but I think the results will be good.

It doesn't work like that; each GPU shader can be used for either compute or rendering. It's not either-or like it was back in the days of separate vertex and pixel shaders.

If the GPU is busy with rendering, guess what? It won't be able to do compute at the same time. All the compute pipelines do is give the system a hook to allow offloading; they don't solve the fundamental problem of the GPU already being tasked to capacity.

Yes, that's true, some tasks have to be run serially...whether or not coding languages will allow ways to make that more efficient in the future is only speculation. We will have to see how it turns out. Also, debugging is always a PITA...LOL. It's merely to what degree it is a PITA that would be in contention. :)

It's not a programming language problem, it's a fundamental problem of computing. You would need to fundamentally re-design computer architecture to get around the fact that the entire design was built around serial processing.
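
To put a rough number on that, Amdahl's law: the serial fraction of a job caps the total speedup no matter how many cores (or shaders) you throw at it. A quick sketch, with a purely hypothetical 25% serial share:

```cpp
#include <cstdio>

// Amdahl's law: speedup = 1 / (serial + (1 - serial) / cores)
double amdahl(double serial_fraction, int cores) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
}

int main() {
    const double serial = 0.25; // hypothetical: 25% of the work can't be parallelized
    for (int cores : {2, 4, 8, 64, 1024}) {
        std::printf("%4d cores -> %.2fx speedup\n", cores, amdahl(serial, cores));
    }
    // Even with 1024 cores the speedup tops out just under 4x (1/0.25),
    // which is why piling on cores/shaders doesn't fix serial code.
}
```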

As for debugging, YOU try debugging all the race conditions, priority inversions, and memory leaks that start to show up in a multithreaded environment on dozens of different configs. I have, and it is NOT a fun task. A single-threaded context is easy, and there really shouldn't be any such problems that make it past debugging. You have no idea the problems I've seen over the years, and most of them tend to be rare (they never show up in testing) and need specific platform performance conditions to appear (again: never show up in testing).
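
For anyone who hasn't had the pleasure, here's about the smallest possible (made-up) example of the kind of bug that never shows up in testing: two threads bumping a shared counter with no lock. It prints 200000 almost every time... until it doesn't, on someone else's machine.

```cpp
#include <iostream>
#include <thread>

long long counter = 0; // shared and unsynchronized -- this is the bug

void hammer() {
    for (int i = 0; i < 100000; ++i)
        ++counter; // read-modify-write race: increments can be silently lost
}

int main() {
    std::thread a(hammer), b(hammer);
    a.join();
    b.join();
    // Expected 200000; under load the result is often less, and it varies
    // from run to run -- exactly the kind of thing that slips past testing
    // and only bites on some user's particular config.
    std::cout << counter << '\n';
}
```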
 

8350rocks

Distinguished
While I agree that x86 was designed for serial operations from the onset, the innovations of HSA and hUMA are going in a new direction. I think Intel will ultimately have to capitulate to the ever-increasing number of proponents of HSA and hUMA. Once the innovations AMD has implemented become more and more consolidated into everyday workloads, and the performance achieved on lesser hardware rivals or exceeds that of high-performance legacy x86 designs, there will be no way for Intel to argue. Brute force doesn't solve everything.

I am very curious to see what effects the consoles and the new Kaveri APUs have on software development, as it will ultimately be software that is the limiting factor in performance once the PC is capable of using all of its hardware to its maximum potential.

The direction you speak of is coming; in my mind, I cannot see any way that HSA does not contribute to far more parallel capability in hardware. At that point it does become a coding language issue, though software will evolve. Adobe is already working on HSA implementations in its programs, and I think there will be other companies doing the same to push HSA along even further.

If we see Photoshop run 20-40% faster on a 4-core Kaveri than it does on a 3770K/4770K...then the hardware race outcome is determined, and Intel will have to accommodate the fact that AMD's advancements equate to more efficiency.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Still, it's just a 7870-ish class GPU. They can't use too much of it for compute or there won't be enough resources left to render the game.
 

8350rocks

Distinguished
From the interview with Mark Cerny:

http://www.gamasutra.com/view/feature/191007/

"The 'supercharged' part, a lot of that comes from the use of the single unified pool of high-speed memory," said Cerny. The PS4 packs 8GB of GDDR5 RAM that's easily and fully addressable by both the CPU and GPU.

If you look at a PC, said Cerny, "if it had 8 gigabytes of memory on it, the CPU or GPU could only share about 1 percent of that memory on any given frame. That's simply a limit imposed by the speed of the PCIe. So, yes, there is substantial benefit to having a unified architecture on PS4, and it’s a very straightforward benefit that you get even on your first day of coding with the system. The growth in the system in later years will come more from having the enhanced PC GPU. And I guess that conversation gets into everything we did to enhance it."

The CPU and GPU are on a "very large single custom chip" created by AMD for Sony. "The eight Jaguar cores, the GPU and a large number of other units are all on the same die," said Cerny. The memory is not on the chip, however. Via a 256-bit bus, it communicates with the shared pool of ram at 176 GB per second.

"One thing we could have done is drop it down to 128-bit bus, which would drop the bandwidth to 88 gigabytes per second, and then have eDRAM on chip to bring the performance back up again," said Cerny. While that solution initially looked appealing to the team due to its ease of manufacturability, it was abandoned thanks to the complexity it would add for developers. "We did not want to create some kind of puzzle that the development community would have to solve in order to create their games. And so we stayed true to the philosophy of unified memory."

In fact, said Cerny, when he toured development studios asking what they wanted from the PlayStation 4, the "largest piece of feedback that we got is they wanted unified memory."

There he is discussing unified memory and the growth expected on the GPU...in this next bit he talks about the modifications done to the GPU, specifically to allow rendering and compute all at once.

Cerny is convinced that in the coming years, developers will want to use the GPU for more than pushing graphics -- and believes he has determined a flexible and powerful solution to giving that to them. "The vision is using the GPU for graphics and compute simultaneously," he said. "Our belief is that by the middle of the PlayStation 4 console lifetime, asynchronous compute is a very large and important part of games technology."







Cerny envisions "a dozen programs running simultaneously on that GPU" -- using it to "perform physics computations, to perform collision calculations, to do ray tracing for audio."

But that vision created a major challenge: "Once we have this vision of asynchronous compute in the middle of the console lifecycle, the question then becomes, 'How do we create hardware to support it?'"

One barrier to this in a traditional PC hardware environment, he said, is communication between the CPU, GPU, and RAM. The PS4 architecture is designed to address that problem.

"A typical PC GPU has two buses," said Cerny. "There’s a bus the GPU uses to access VRAM, and there is a second bus that goes over the PCI Express that the GPU uses to access system memory. But whichever bus is used, the internal caches of the GPU become a significant barrier to CPU/GPU communication -- any time the GPU wants to read information the CPU wrote, or the GPU wants to write information so that the CPU can see it, time-consuming flushes of the GPU internal caches are required."

The three "major modifications" Sony did to the architecture to support this vision are as follows, in Cerny's words:

•"First, we added another bus to the GPU that allows it to read directly from system memory or write directly to system memory, bypassing its own L1 and L2 caches. As a result, if the data that's being passed back and forth between CPU and GPU is small, you don't have issues with synchronization between them anymore. And by small, I just mean small in next-gen terms. We can pass almost 20 gigabytes a second down that bus. That's not very small in today’s terms -- it’s larger than the PCIe on most PCs!

•"Next, to support the case where you want to use the GPU L2 cache simultaneously for both graphics processing and asynchronous compute, we have added a bit in the tags of the cache lines, we call it the 'volatile' bit. You can then selectively mark all accesses by compute as 'volatile,' and when it's time for compute to read from system memory, it can invalidate, selectively, the lines it uses in the L2. When it comes time to write back the results, it can write back selectively the lines that it uses. This innovation allows compute to use the GPU L2 cache and perform the required operations without significantly impacting the graphics operations going on at the same time -- in other words, it radically reduces the overhead of running compute and graphics together on the GPU."

•Thirdly, said Cerny, "The original AMD GCN architecture allowed for one source of graphics commands, and two sources of compute commands. For PS4, we’ve worked with AMD to increase the limit to 64 sources of compute commands -- the idea is if you have some asynchronous compute you want to perform, you put commands in one of these 64 queues, and then there are multiple levels of arbitration in the hardware to determine what runs, how it runs, and when it runs, alongside the graphics that's in the system."


"The reason so many sources of compute work are needed is that it isn’t just game systems that will be using compute -- middleware will have a need for compute as well. And the middleware requests for work on the GPU will need to be properly blended with game requests, and then finally properly prioritized relative to the graphics on a moment-by-moment basis."

This concept grew out of the software Sony created, called SPURS, to help programmers juggle tasks on the CELL's SPUs -- but on the PS4, it's being accomplished in hardware.
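
To make the many-queues idea Cerny describes above a bit more concrete, here's a toy model of prioritized compute queues feeding an arbiter. This is purely illustrative code with made-up names; it is not Sony's SDK, AMD's driver, or the actual hardware arbitration logic.

```cpp
#include <cstdio>
#include <queue>
#include <string>
#include <vector>

// Toy model of the idea only: a handful of independent compute queues, each
// with a priority, and an "arbiter" that drains the highest-priority work
// first. On the real hardware there are 64 such queues and multiple levels
// of arbitration; here it's just a loop.
struct ComputeQueue {
    int priority;                       // made-up policy: higher runs first
    std::queue<std::string> jobs;
};

bool dispatch_next(std::vector<ComputeQueue>& queues) {
    ComputeQueue* best = nullptr;
    for (auto& q : queues)
        if (!q.jobs.empty() && (!best || q.priority > best->priority))
            best = &q;
    if (!best) return false;
    std::printf("dispatch: %s\n", best->jobs.front().c_str());
    best->jobs.pop();
    return true;
}

int main() {
    std::vector<ComputeQueue> queues(3);        // pretend there are 64 of these
    queues[0].priority = 1; queues[0].jobs.push("game physics step");
    queues[1].priority = 5; queues[1].jobs.push("middleware audio raycast");
    queues[2].priority = 3; queues[2].jobs.push("collision broadphase");
    while (dispatch_next(queues)) {}            // game and middleware work get blended by priority
}
```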

The team, to put it mildly, had to think ahead. "The time frame when we were designing these features was 2009, 2010. And the timeframe in which people will use these features fully is 2015? 2017?" said Cerny.

"Our overall approach was to put in a very large number of controls about how to mix compute and graphics, and let the development community figure out which ones they want to use when they get around to the point where they're doing a lot of asynchronous compute."

Cerny expects developers to run middleware -- such as physics, for example -- on the GPU. Using the system he describes above, you can run at peak efficiency, he said.

"If you look at the portion of the GPU available to compute throughout the frame, it varies dramatically from instant to instant. For example, something like opaque shadow map rendering doesn't even use a pixel shader, it’s entirely done by vertex shaders and the rasterization hardware -- so graphics aren't using most of the 1.8 teraflops of ALU available in the CUs. Times like that during the game frame are an opportunity to say, 'Okay, all that compute you wanted to do, turn it up to 11 now.'"

Sounds great -- but how do you handle doing that? "There are some very simple controls where on the graphics side, from the graphics command buffer, you can crank up or down the compute," Cerny said. "The question becomes, looking at each phase of rendering and the load it places on the various GPU units, what amount and style of compute can be run efficiently during that phase?"

I think that should explain much of the "how"...also, a lot of system resources that would have been tied up by things like audio now have dedicated units for them specifically. This allows other parts of the hardware to not be tied up running other threads.

Another thing the PlayStation 4 team did to increase the flexibility of the console is to put many of its basic functions on dedicated units on the board -- that way, you don't have to allocate resources to handling these things.

"The reason we use dedicated units is it means the overhead as far as games are concerned is very low," said Cerny. "It also establishes a baseline that we can use in our user experience."








"For example, by having the hardware dedicated unit for audio, that means we can support audio chat without the games needing to dedicate any significant resources to them. The same thing for compression and decompression of video." The audio unit also handles decompression of "a very large number" of MP3 streams for in-game audio, Cerny added.

This last bit discusses potential bottlenecks and how they've planned to overcome them:

One thing Cerny was not at all shy about discussing are the system's bottlenecks -- because, in his view, he and his engineers have done a great job of devising ways to work around them.

"With graphics, the first bottleneck you’re likely to run into is memory bandwidth. Given that 10 or more textures per object will be standard in this generation, it’s very easy to run into that bottleneck," he said. "Quite a few phases of rendering become memory bound, and beyond shifting to lower bit-per-texel textures, there’s not a whole lot you can do. Our strategy has been simply to make sure that we were using GDDR5 for the system memory and therefore have a lot of bandwidth."

That's one down. "If you're not bottlenecked by memory, it's very possible -- if you have dense meshes in your objects -- to be bottlenecked on vertices. And you can try to ask your artists to use larger triangles, but as a practical matter, it's difficult to achieve that. It's quite common to be displaying graphics where much of what you see on the screen is triangles that are just a single pixel in size. In which case, yes, vertex bottlenecks can be large."

"There are a broad variety of techniques we've come up with to reduce the vertex bottlenecks, in some cases they are enhancements to the hardware," said Cerny. "The most interesting of those is that you can use compute as a frontend for your graphics."

This technique, he said, is "a mix of hardware, firmware inside of the GPU, and compiler technology. What happens is you take your vertex shader, and you compile it twice, once as a compute shader, once as a vertex shader. The compute shader does a triangle sieve -- it just does the position computations from the original vertex shader and sees if the triangle is backfaced, or the like. And it's generating, on the fly, a reduced set of triangles for the vertex shader to use. This compute shader and the vertex shader are very, very tightly linked inside of the hardware."

It's also not a hard solution to implement, Cerny suggested. "From a graphics programmer perspective, using this technique means setting some compiler flags and using a different mode of the graphics API. So this is the kind of thing where you can try it in an afternoon and see if it happens to bump up your performance."

These processes are "so tightly linked," said Cerny, that all that's required is "just a ring buffer for indices... it's the Goldilocks size. It's small enough to fit the cache, it's large enough that it won't stall out based on discrepancies between the speed of processing of the compute shaders and the vertex shaders."
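
To picture what that "triangle sieve" is doing, here's a rough CPU-side sketch of the core test: run only the position math and throw away backfacing triangles before the real vertex pass. The actual implementation is a compute shader feeding that ring buffer of indices; this is just the idea, not their code.

```cpp
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };  // screen-space positions from a position-only "vertex shader" pass

// Signed area of the projected triangle: <= 0 means backfacing (or degenerate)
// under a counter-clockwise front-face convention -- conventions vary per engine.
static float signed_area(Vec2 a, Vec2 b, Vec2 c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// The "sieve": do only the position work, drop triangles the rasterizer would
// reject anyway, and emit a reduced index list for the real vertex shader pass.
std::vector<uint32_t> sieve(const std::vector<Vec2>& pos,
                            const std::vector<uint32_t>& indices) {
    std::vector<uint32_t> surviving;
    for (size_t i = 0; i + 2 < indices.size(); i += 3) {
        Vec2 a = pos[indices[i]], b = pos[indices[i + 1]], c = pos[indices[i + 2]];
        if (signed_area(a, b, c) > 0.0f) {       // keep front-facing triangles only
            surviving.push_back(indices[i]);
            surviving.push_back(indices[i + 1]);
            surviving.push_back(indices[i + 2]);
        }
    }
    return surviving;                            // this is what would land in the ring buffer of indices
}
```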

He has also promised Gamasutra that the company is working on a version of its performance analysis tool, Razor, optimized for the PlayStation 4, as well as example code to be distributed to developers. Cerny would also like to distribute real-world code: "If somebody has written something interesting and is willing to post the source for it, to make it available to the other PlayStation developers, then that has the highest value."
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Intel isn't ignoring HSA. Their drivers have been upgraded for Haswell and now perform quite well for compute. It's one of the more shocking things about Haswell: the OpenCL performance of the HD 4600 is faster than that of the A10-6800K.

http://www.tomshardware.com/reviews/core-i7-4770k-haswell-review,3521-3.html


 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


The compute and rendering still have an inverse relation though. The resources are fixed. If you dial up the compute performance you're dialing down the rendering performance.

If all 64 "sources" are used, then you have zero rendering capacity left. In a game like Crysis 3 you're looking at a min/avg of 33/51 fps at 1080p. You could use maybe 30% of the resources for compute and still average over 30 fps.
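
Quick sanity check on that 30% figure, assuming (simplistically) that frame rate scales linearly with the shader resources left for rendering:

```cpp
#include <cstdio>

int main() {
    const double avg_fps = 51.0, min_fps = 33.0;   // Crysis 3 @ 1080p numbers from above
    const double compute_share = 0.30;             // hypothetical slice handed to compute
    // Linear-scaling assumption: rendering only gets the remaining 70% of the GPU.
    std::printf("avg: %.1f fps, min: %.1f fps\n",
                avg_fps * (1.0 - compute_share),
                min_fps * (1.0 - compute_share));
    // ~35.7 fps average (still over 30), but the minimum drops to ~23 fps.
}
```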

If they push the graphics more with, say, Crysis 4, then how much will be left for compute if it has to use even more resources for the render work?
 




Hence the problem I am arguing: If the target is a steady 60FPS on a 7xxx class GPU, then you are going to be VERY limited in how much compute you can use on the GPU, unless you significantly tone down the graphics to compensate.

And as far as HSA goes, it's not that big a change from a development perspective. Yes, the single addressing scheme is nice, but that's already abstracted out; the developer doesn't care. So HSA, by itself, isn't going to cause any real change in software development, because it's all handled at a much lower level.
 

8350rocks

Distinguished
For console developers...it doesn't make nearly as big a difference as it does for console games ported to PC.

Additionally...many functions that are run on the GPU don't even use SPs; some of it can be done by the rasterization hardware alone. The Mark Cerny interview basically stated that they can offload to the GPU dynamically: when you need heavy rendering, they can load up the CPU cores, and when you're not rendering as heavily, you can use the compute features on the GPU to speed up the other tasks you're running at that time.

It sounds as though they intend to utilize the GPU nearly 100% as much as possible and use the CPU cores as a sort of "overflow" for the GPU system.
 

noob2222

Distinguished
Nov 19, 2007
2,722
0
20,860


One thing you're missing in that part of the equation is the disparity in raw CPU power. Yes, Intel's OpenCL looks better, but look at it another way.

[image: vegas-pro-12-opencl.png]


The A10-5800K speeds up by 130% ... Intel speeds up by 40% ... Also, you're essentially comparing a dual-module, stripped-down AMD to a full quad-core + HT Intel. Not really a fair comparison just looking at the raw numbers.
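
That's the catch with quoting speedups instead of absolute numbers. With completely made-up baselines, just to illustrate the point:

```cpp
#include <cstdio>

int main() {
    // Hypothetical render-time baselines (NOT from the chart), purely to show
    // why a bigger percentage speedup doesn't automatically mean a faster result.
    double amd_cpu_only   = 100.0;  // seconds, slower CPU baseline
    double intel_cpu_only =  55.0;  // seconds, faster CPU baseline
    double amd_ocl   = amd_cpu_only   / 2.3;  // "speeds up by 130%" -> 2.3x
    double intel_ocl = intel_cpu_only / 1.4;  // "speeds up by 40%"  -> 1.4x
    std::printf("AMD with OpenCL:   %.1f s\n", amd_ocl);    // ~43.5 s
    std::printf("Intel with OpenCL: %.1f s\n", intel_ocl);  // ~39.3 s
    // The AMD chip gains far more from OpenCL, yet the Intel part can still
    // finish first because its CPU baseline was higher to begin with.
}
```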
 

jdwii

Splendid
OK, here we go again: CPU tasks are extremely serial, and performance would go DOWN if we brought everything over to the GPU. On top of this, even the HD 7970 GHz Edition is used at a higher rate than a CPU in gaming, meaning you would lower performance even more since it would take GPU resources away; this technology only works best for heavily multithreaded tasks. One GPU core/shader is MUCH weaker than even a 600 MHz ARM core in a cheap $20 flip phone. For gaming, OpenCL (or whatever technology mimics this process) won't take off. Also, when a person compares a gaming PC with a server PC they should be banned from existence, unless they're talking about cloud gaming.
And to finish this off: if AMD didn't care about the high-end CPU market or improving their single-threaded performance, they would not be putting so much focus onto Steamroller's per-clock improvements or making a 5 GHz CPU.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


But that OpenCL is mostly, if not exclusively, on the CPU side, not the real HSA, which is mainly OpenCL on the GPU side. The latter should provide (depending on the GPU and the code) up to an order of magnitude better performance... meaning it probably won't fit within current charts' bar lengths (too long to represent) lol...

Example:
(old but illustrative, as I posted before... in 2011 they were even calling it "fusion")
Optimizing VLC Using OpenCL and Other Fusion Capabilities

"" VLC is a free and open source cross-platform multimedia player and framework that plays back most types of multimedia files. In our optimization work, we first integrated an AMD-exclusive algorithm, Steady Video™, into VLC, to stabilize shaky video played with VLC. Then we used OpenCL to optimize VLC’s scaling filter, which is used to enlarge or shrink the video on the fly during playback. This OpenCL optimization has achieved speedups of up to 10x on Llano and 18x on Trinity compared to a competing CPU. A de-noise filter in VLC is used to reduce the noise in video. Due to the algorithm’s data dependency nature between pixels, it is hard to parallelize. Through our optimization efforts, we have implemented an OpenCL version which is already faster on Trinity APU than on a competing CPU. Together these features/optimizations will enable a much better video playback experience on HSA platforms. ""

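The reason the scaling filter maps onto OpenCL so well (and the de-noise filter doesn't) is that every output pixel can be computed on its own. Here's a minimal nearest-neighbor scale just to sketch that shape of problem; this is not VLC's actual code.

```cpp
#include <cstdint>
#include <vector>

// Nearest-neighbor scale of an 8-bit grayscale image. Every output pixel
// depends only on the source image, never on other output pixels, so the
// two loops flatten into one big grid of independent work-items -- exactly
// the shape a GPU/OpenCL kernel wants. A temporal de-noise filter, by
// contrast, has pixel-to-pixel dependencies that make this split much harder.
std::vector<uint8_t> scale_nn(const std::vector<uint8_t>& src, int sw, int sh,
                              int dw, int dh) {
    std::vector<uint8_t> dst(static_cast<size_t>(dw) * dh);
    for (int y = 0; y < dh; ++y) {          // on a GPU: one work-item per (x, y)
        for (int x = 0; x < dw; ++x) {
            int sx = x * sw / dw;           // map destination pixel back to source
            int sy = y * sh / dh;
            dst[static_cast<size_t>(y) * dw + x] = src[static_cast<size_t>(sy) * sw + sx];
        }
    }
    return dst;
}
```
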
(That result on HSW isn't the OpenCL of HSA; if it is, it's a damn weak result, a tiny, weeny result lol)

Nice try, Intel... lol... trying to delude the ignorant...

HSA is mostly a runtime and a compiler; I doubt Intel will follow that. Besides, it also has some *hardware* specification, which I also very much doubt Intel will follow (without being a member, that is).

Intel will only follow HSA after AMD is dead, and the first thing they will do is try to control it to their advantage. And if ARM is any threat, since ARM is very big on HSA, Intel will never follow it unless ARM submits... performance is in the software, and Intel knows it and plays with it.

(And to think people almost rip their eyes out discussing hardware charts, when in fact those charts show *SOFTWARE* variations of those benchmarks lol)

 


In theory, yes, assuming the OS is written in such a way that thread runtime and scheduling are deterministic. In Sony's case, that may well be the case. For the XB1, a modified Win 8 kernel? I'm not so sure.

See the problem? You have to code at a VERY low level, so you know ahead of time "Oh, I'm done rendering, so I can use more of the GPU." You open yourself to a LOT of potential post-release problems if this doesn't hold. And the new consoles have a much larger OS, and it seems likely that, with it being a PC architecture, devs won't be as willing to keep doing manual thread management now that there's an OS that can handle it for them.



Correct, a single GPU execution unit is slow. You gain performance because you have a couple HUNDRED to throw at a problem. For parallel tasks, the individual performance loss per "core" is insignificant if you scale the problem large enough. But for non-scaling tasks? OpenCL/Directcompute/CUDA/DirectX/etc will leech performance.
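
Rough numbers for "individually slow, collectively fast", built from the ~1.8 TFLOPS figure quoted above (the per-CPU-core figure is a ballpark assumption, not a measurement):

```cpp
#include <cstdio>

int main() {
    // PS4-class GPU: 18 CUs x 64 lanes = 1152 shader ALUs at 0.8 GHz,
    // 2 FLOPs per lane per clock (fused multiply-add) -> ~1.84 TFLOPS total.
    const double lanes = 18 * 64;
    const double gpu_total_gflops = lanes * 0.8 * 2.0;        // ~1843 GFLOPS
    const double per_lane_gflops  = gpu_total_gflops / lanes; // ~1.6 GFLOPS each
    // Ballpark for one modern x86 core (an assumption, not a measured number):
    // ~3 GHz x 8-wide AVX x 2 FLOPs/clock -> ~48 GFLOPS, i.e. a single CPU core
    // is on the order of 30x faster than one GPU lane on serial work.
    const double cpu_core_gflops = 3.0 * 8 * 2;
    std::printf("GPU total: %.0f GFLOPS, per lane: %.1f GFLOPS, one CPU core: ~%.0f GFLOPS\n",
                gpu_total_gflops, per_lane_gflops, cpu_core_gflops);
}
```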
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


The OpenCL GPU benchmarks are only using the embedded GPU. The HD4600 is on i3, i5 and i7 CPUs, so it is a fair comparison. It has CPU and combined scores mixed in but I was ignoring those due to the massive price difference.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Unless the reviewer is intending to be misleading, there are separate scores for CPU, GPU and Combined CPU+GPU. Otherwise why make the distinction?

Intel went from ZERO OpenCL support in Sandy to decent in Ivy, and higher still in Haswell. AT has numbers for the 5200, which brings the scores even higher, but the price is steep and it's mobile only, not desktop.
 

GOM3RPLY3R

Honorable
Mar 16, 2013
658
0
11,010


That's true; however, building on that point about meaningful processing: what console game truly uses more than 10 threads? My point exactly.
 

noob2222

Distinguished
Nov 19, 2007
2,722
0
20,860

Most of them aren't GPU-only; most of it is CPU+GPU or CPU. The one that is IGP-only is LuxMark, and even then their numbers look pretty fishy.
[image: luxmark.png]

Compared to:
[image: Luxmark_gpu.jpg]

[image: luxmark-igp.gif]

[image: opencl-luxmark-igp.png]


The only CPU+GPU test here, but it has the hasbeen 4770K:
[image: Lux.png]


So why is it that Tom's test is the only one out there that has the 5800K slower than the 3770K on IGP alone, when the rest show it nearly twice as fast, and even competitive against the 4770K's combined score? Tom's showed it at only 1/3 the performance, instead of 2445 to 2354.

In fact, on GPU-only tests, Tom's is the only one that showed even a remotely close race on IGP, with Intel actually winning with the 3770K... that's a big WTF.
 

8350rocks

Distinguished
Yes, XB1's presale numbers are abysmal, and to add an interesting note to the AMD vs. Intel debate...it looks like Intel is "slipping". Instead of Broadwell next year, they are doing a Haswell "refresh".

http://www.fudzilla.com/home/item/31701-intel-plans-haswell-refresh-in-q2-2014

Imagine Steamroller launching and having to compete with hasfail for a full 18 months, by which time, if Broadwell does finally poke its head out of obscurity, Excavator will likely be hitting the shelves.

Interesting...eh?
 

montosaurous

Honorable
Aug 21, 2012
1,055
0
11,360
I'm not expecting anything interesting from Intel until Skylake or Haswell-E anyways, which is 2 years away. Intel hasn't broken tick-tock before, so if they do it now I would be surprised. 14nm seems like a big step however, so it is possible.
 

8350rocks

Distinguished
Honestly, I would not be at all surprised if their tri-gate process has issues shrinking past 22nm lithography on bulk wafers. Once you get down to that small a transistor node, you essentially need to be using PD-SOI or FD-SOI for thermal envelopes to stay in an acceptable realm.

AMD has stated they were going to bulk for 28nm, but recently GloFo announced that 28nm HKMG PD-SOI would be available soon. I suspect AMD knew this and was holding out for the 28nm SHP process to get the kinks worked out before they went forward with Steamroller, especially considering GloFo knows that TSMC (AMD's other chief source for chips/wafers) has a working 28nm bulk process and even has 20nm bulk worked out (for ULV so far). GloFo would have to have a better process to ensure that AMD did the bulk of their business there, considering the yield fiasco with the 32nm chips.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810




SA has been saying that for a while. Intel is trying to head off ARM by focusing heavily on mobile, so Broadwell may end up as BGA-only mobile parts in 2014. Power consumption hasn't gone down that much for desktop parts, and that market is stagnating anyway.

It does give AMD some chance to catch up before Intel releases their 8 core desktop chips with DDR4.
 

montosaurous

Honorable
Aug 21, 2012
1,055
0
11,360


Yeah I assume Broadwell will be BGA anyways. Skylake might be the last mainstream LGA from Intel. I've only built with AMD thus far, but I would surely only build with them if Intel went full BGA except for Extreme series and server chips.
 

8350rocks

Distinguished
Ahh...yes...the $1000-1200 Hasfail-E.

LOL...AMD should surely be worried about that...seeing as the 8-core Steamrollers will likely cost 20-25% of what those things will run! Unless they're 4x faster than Steamroller (doubt it), they wouldn't even enter a perf/$ discussion in my mind. That will be for the e-peen Intel crowd who want to talk about the fact that they wasted $750-1000 too much on a CPU.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


Because if you follow
http://www.tomshardware.com/gallery/monte-carlo-opencl,0101-386890-0-2-3-1-png-.html
http://www.tomshardware.com/gallery/black-scholes-opencl,0101-386858-0-2-3-1-png-.html
http://www.tomshardware.com/gallery/binomial-opencl,0101-386851-0-2-3-1-png-.html

the results on GPU vs CPU+GPU are almost identical, and according to my example and many others I followed, it's not a lack of GPU-side potential, it's a lack of proper *software*. That is, the culprit behind the similar results is not the GPU hardware, it's software that doesn't target it as it should.

The difference could be up to an order of magnitude (10x)... or clearly more with HSA-certified software... But these first results aren't that bad for some tests (a good deal above CPU-alone for some). Also, the reviewer isn't being misleading; unless he wrote the benchmark software himself, he has to take that software on good faith, and that is the greatest mistake of all reviewers lol... at least they could have asked for MD5 signatures and pointed to the website of origin.

Interpretation "in context" is up to anyone...

And yes, HSA software, according to what I read, will come "certified"; that is, it must pass a test of some kind to be able to carry the logo.

 