AMD CPU speculation... and expert conjecture



By one million, and I think that was due to sales flushing out remaining units as the PS4 moved in. Before that, though, the 360 was outselling it.

For me, I like the XB1's media features, but honestly both of them are watered-down PCs. Not worth our time.



I am surprised AMD is saying no to the ultra-low-end tablet space. It sells a lot more than the high end does. Still, it's their own choice, I guess.

I saw the new DA info. It could be Mantle-enabled, or not. If it is Origin-exclusive, I won't care either way.



I don't think the XB1 fails as a gaming console; it still does that well. Not 1080p-well, but neither does the PS4. Only a mid-range PC does 1080p gaming well.

H.265 was only ratified as a standard in April 2013, and the XB1 was pretty far into development by then, so I doubt they would just add it in at the last minute, especially since a brand-new encoding standard means there could be issues. The good news is that support can be added via software updates. Will that be as good as a pure hardware decoder? No, but it will still add support. Still, the XB1 supports H.264, which I have found to be pretty impressive, and H.265 seems to improve compression efficiency, but not by an insane amount over H.264.
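If anyone wants to check the compression claim for themselves, a rough way is to encode the same clip with both codecs at roughly comparable quality settings and compare file sizes. A minimal sketch, assuming ffmpeg with libx264/libx265 is installed and that sample.mp4 is a placeholder clip of your own:

# Rough H.264 vs. H.265 file-size comparison via ffmpeg (a sketch, not a rigorous test).
import os
import subprocess

def encode(codec, crf, out_name):
    subprocess.run([
        "ffmpeg", "-y", "-i", "sample.mp4",
        "-c:v", codec, "-crf", str(crf),
        "-preset", "medium", "-an",   # drop audio so only video size is compared
        out_name,
    ], check=True)
    return os.path.getsize(out_name)

h264_size = encode("libx264", 23, "out_h264.mp4")
h265_size = encode("libx265", 28, "out_h265.mp4")  # x265's CRF scale differs from x264's
print(f"H.264: {h264_size / 1e6:.1f} MB, H.265: {h265_size / 1e6:.1f} MB")
print(f"H.265 came out roughly {100 * (1 - h265_size / h264_size):.0f}% smaller on this clip")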

And yes, the 360 was a year ahead. The PS3, however, had massive hype behind it. It also lost Sony a ton of money, thanks to them shoving what was at the time a $1,000 Blu-ray player into a $600 console.

BTW, consoles have never outclassed PCs and will always be catching up. By the time the PS3 came out with its 7800 GTX variant, we had the 8800 GTX/Ultra. We knew the specs of the current consoles long ago, and as I said all along, "PCs have been gaming on 'next gen' for two years," considering that they use CPUs based on AMD's module design (albeit lower-end) and GPUs based on AMD's GCN design.

PCs will always have better gaming capabilities; they just won't get fully used. By the time a GTX 680 or HD 7970 is fully utilized, it will probably be at least three more years.



I don't really consider console gamers hardcore gamers. If you look at it in terms of a hardcore player being one who plays a game for hours upon hours, like say an MMO where people spend months if not years' worth of hours playing, then console players don't even come close. Most games, especially ones designed with consoles in mind, are normally 20 hours each. The few exceptions would be RPGs or open-world games, and even those still don't come close. I have more hours in GW2 and TF2 combined than most of the rest of my family has in every game they have ever played combined.

In reality, PC gamers are the real hardcore crowd, with casuals encroaching on that territory. Hell, gaming itself used to be a nerdy thing to do; now even the jocks who would pick on PC enthusiasts play games. Of course, they also love the Madden games, but hey, it is what it is.



That is an interesting CPU. I am curious what type of RAM they have pushing that much bandwidth, when DDR3 is pretty well tapped out; DDR4 should be able to, though.
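For reference, the theoretical peak of a DDR interface is just transfer rate x 8 bytes per 64-bit channel x channel count, which shows how far plain DDR3/DDR4 is from the numbers being quoted. A quick back-of-the-envelope sketch (the module speeds are common examples, not a claim about what this chip actually uses):

# Theoretical peak bandwidth: transfers/s * 8 bytes per 64-bit channel * channel count.
def peak_bw_gbs(mt_per_s, channels, bytes_per_channel=8):
    return mt_per_s * 1e6 * bytes_per_channel * channels / 1e9

configs = {
    "DDR3-1866, dual channel": (1866, 2),
    "DDR3-2133, dual channel": (2133, 2),
    "DDR4-2400, quad channel": (2400, 4),
}
for name, (rate, ch) in configs.items():
    print(f"{name}: ~{peak_bw_gbs(rate, ch):.1f} GB/s")
# ~29.9, ~34.1 and ~76.8 GB/s respectively; nowhere near the hundreds of GB/s that
# stacked or on-package memory is supposed to deliver.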

As well, I do love the standard "it can do this 1000x faster, because we said so" claim.

I guess we will have to wait and see how many companies are willing to shell out the cash needed to rewrite their software for PowerPC (the same CPU family that was in the PS3), as it is not a direct port.
 

juanrga



That news was given before and discussed here. As mentioned before, AMD is avoiding direct competition with ARM tablets because the Mullins/Beema APUs are not competitive enough. I also linked to a recent report showing the huge losses (billions of dollars per year) that Intel is suffering because it is unable to compete in the mobile space.

AMD has learned the lesson from Intel's disaster and is avoiding that ultra-competitive market.
 

8350rocks



Ultra-competitive??? It is only Intel and AMD... how is that ultra-competitive?
 

juanrga



No, it isn't just Intel and AMD. If you read the links and explanations given before, you will know that Intel is losing billions of dollars per year due to the strong competition from other players, that some analysts are urging Intel to give up its ambitions, and that AMD has just said "no" to this ultra-competitive tablet market.

I recall that I explained in this forum why neither Beema nor Mullins would be competitive in that market. I also explained why they were rejected for servers. Some people in this forum voiced very strong disagreement with my claims (including personal attacks). Some of them are gone from this thread now, but once again AMD is doing what I predicted it would do.
 

etayorius




Actually, it was the Xbox 360 that was using the PowerPC Xenon; the PS3 used Cell.
 
Actually, it was the Xbox 360 that was using the PowerPC Xenon; the PS3 used Cell.

Both were using the POWER architecture; the Xenon CPU in the 360 was derived from the Cell's PPE. The Wii (and GameCube) used the older PPC 7xx architecture; they are PPC 750 derivatives. [Note the PPC 750 launched in 1997. That's how old the architecture is.]
 

8350rocks



POWER is a phenomenal uarch if you have the tools to master it though...
 

juanrga

Some time ago I predicted that discrete GPUs will be killed off by about 2018--2020, when APUs will be ~10x faster than any dGPU. I mentioned that AMD's long-term plan is

CPU + dGPU --> APU + dGPU --> ultra-high-performance APU

Several people here disagreed strongly, and some claimed that discrete GPUs will be with us forever. The math says otherwise. This is what Nvidia's research team claims about the future of their own GPUs:

In this time frame, GPUs will no longer be an external accelerator to a CPU; instead, CPUs and GPUs will be integrated on the same die with a unified memory architecture.

I like seeing that one of the main GPU companies agrees with me!

I know that Intel thinks the same. In fact, Intel wants to accelerate the killing of discrete cards. I am quite sure that AMD thinks the same (because the laws of physics are the same for everyone), but today's AMD remains silent about future plans beyond official roadmaps.
 

truegenius



AFAIK, CPU and GPU integration on the same die is meant to minimize the time it takes the CPU and GPU to communicate and to raise the bandwidth between them, thus removing the communication bottleneck, which means more performance in GPU-assisted tasks. Processing tasks that need tons of data exchanged between CPU and GPU will get a huge benefit from this tech, for example hardware video conversion.

But

gaming does not require such communication speeds (only productivity work needs higher CPU-GPU communication); gaming needs higher GPU-VRAM speed, which is not possible using system RAM, and let's not forget the power requirements of GPUs. This is why we say that PCIe 2.0 x16 is enough for gaming, which means CPU-GPU communication is not a bottleneck in gaming (rough numbers below).

Thus, by this I mean to say that IMO we won't see the death of dedicated gaming GPUs as long as gaming is alive, but workstation cards may suffer (not sure about higher-end workstation cards) and we may see CPUs/APUs designed for workstations (if we can get enough RAM bandwidth).
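To put rough numbers on that (the link and memory figures below are commonly quoted theoretical peaks, used only for scale):

# Rough comparison of CPU<->GPU link bandwidth vs. on-card VRAM bandwidth.
pcie_gbs = {
    "PCIe 2.0 x16": 16 * 0.5,    # ~500 MB/s per lane after 8b/10b encoding
    "PCIe 3.0 x16": 16 * 0.985,  # ~985 MB/s per lane after 128b/130b encoding
}
vram_gbs = {
    "R9 290X GDDR5, 512-bit": 320.0,
    "GTX Titan GDDR5, 384-bit": 288.0,
}
for name, bw in {**pcie_gbs, **vram_gbs}.items():
    print(f"{name}: ~{bw:.0f} GB/s")
# The card's local VRAM is roughly 20-40x faster than the PCIe link, which is why
# the link rarely limits games while VRAM bandwidth very much does.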
 


Just like the 68k was before it. Both were better than x86 was...
 

8350rocks

You know, I think a lot of people think Intel has the most advanced fabs in the world because they forget about IBM. Look at POWER8, a 650mm^2 die on 22nm FD-SOI with sustainable yields.

Then a single POWER8 processor can run 96 threads...96! Eat your heart out HTT. POWER8 might seriously run Intel back out of the high end server market again...
 

jdwii



Yeah, that makes way more sense; I find it almost impossible to get more power from one die vs. two. It seems a little weird to state such a thing; maybe for servers, but for gaming (bandwidth over latency)? Maybe stacked chips, so we can have the power of a 290X plus an i7 with 16GB of RAM and GDDR7 or whatever. It just seems impossible when we can't even hit 14nm easily. Not to mention software really needs to get better.
 


Intel does have the most advanced fabs. They were the ones who pushed HKMG, they were the first with FinFETs, and they do have 14nm up, just with a few yield issues. IBM is not nearly the hardware powerhouse they used to be.

As for the 4-way SMT, this is not the first instance of it by any means. Larrabee, which is now the basis of the Xeon Phi stuff, had 4-way multithreading capabilities.

I am interested to see what kind of power consumption and temperatures IBM gets on that CPU, considering that Intel stayed away from SOI because their research said it would run into issues beyond 32nm, and AMD even left it when GloFo went to a different approach.
 

8350rocks



Research shows SOI has lower power consumption with capacity for higher clocks. Look at the SOI Consortium findings; there were many in there. The cost of SOI wafers vs. bulk is why Intel went bulk with FinFET.

However, you act as though FinFET and SOI are mutually exclusive; they are not. Intel has conceded they will have to go to FinFET on UTBB FD-SOI past ~10nm to continue shrinking, as bulk wafers will not produce reliable enough yields. Though if everyone in the market abandons FD-SOI, then who is going to make the advances needed to shrink that far on FD-SOI? STMicro still does lots of research on FD-SOI nodes, and the consortium will still do research, but fabs will have to commit to increasingly costly tooling-up processes to convert back to SOI in three nodes' time.

Additionally, IBM could call their 22nm a 14nm the way Intel does, because Intel's 14nm is actually much closer to 22nm due to the way they name their nodes. So in actuality, IBM and Intel are on the same node size; IBM just names their node by the largest transistor feature while Intel names theirs by the smallest (a common practice of Intel's to give the impression of a greater process advantage than actually exists). In truth, Intel's 22nm is what the rest of the industry would have called 26nm, back when AMD was on 32nm. But does 26nm (a half-node shrink) sound far superior, or does 22nm sound better?

That is all just a marketing ploy...

You must also take into consideration that getting a sustainable yield on a 650mm^2 die at that node on FD-SOI is FAR more impressive than having massive yield issues with FinFET on bulk at the same node with dies approximately 25-30% of the size.


 


I never said FinFET and SOI were mutually exclusive, just that Intel does happen to have the most advanced process tech in the industry. I understand that SOI and bulk are the two types of wafer production. My point is that Intel spends more money on process tech than even the consortiums do, and that they will use whatever is viable. That is why they went hafnium and HKMG before anyone else.

As for process size naming, I would have to research it to be able to say for sure. Either way, 14nm being the smallest transistor feature is still impressive.

As for where to go next, I have read that the main issue is not so much bulk vs. SOI but rather what materials are being used, as silicon is starting to hit its limits. The next step, for say 7nm and beyond, is either SiGe or just Ge, as it has better properties than silicon.

Of course there are also graphene and carbon nanotubes, but those are probably much more costly than SiGe/Ge.

I am all for competition, but there is still the question: if FD-SOI is "better," why did TSMC (the largest for-hire fab) not go with it? If it were more cost-effective and better, they would have. They produce in such high volume that the savings would benefit them and their customers.

As I said, Intel spends a ton of money on this. They will go wherever they can save money and reduce power.

I can't wait to see the reviews on this chip, just to see what the power/heat numbers are. It is a big chip, larger than even Nvidia's Titan, so I am sure it will use quite a bit of power.
 
Some time ago I predicted that discrete GPUs will be killed off by about 2018--2020, when APUs will be ~10x faster than any dGPU. I mentioned that AMD's long-term plan is

CPU + dGPU --> APU + dGPU --> ultra-high-performance APU

Hahahahahaha

Physics says not just no, but hell no. Space and heat are the primary issues, and no single-chip solution will be faster than a dual-chip solution for this same reason. Heat dissipation is limited by surface area, and dCPU + dGPU has 2~4x more surface area than an APU would have. This is without getting into dedicated high-speed memory for the dGPU. And before you utter "but but stacked RAM!!!": whatever you put on the APU, you can put 8~16x as much as separate chips on the dGPU and 16~32x as much as separate DIMMs for the dCPU. Local memory would only ever act as cache.
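To put some rough numbers behind the heat point (TDPs are the official figures; die sizes are the commonly published approximations, so treat this as a sketch rather than a measurement):

# Power density for the split dCPU + dGPU case vs. one hypothetical merged die.
parts = {
    "i5-4670K (Haswell quad, ~177 mm^2)": (84.0, 177.0),   # (TDP in W, approx die area in mm^2)
    "GTX 780 (GK110, ~561 mm^2)":         (250.0, 561.0),
}
total_w = sum(w for w, _ in parts.values())
total_area = sum(a for _, a in parts.values())

for name, (w, area) in parts.items():
    print(f"{name}: {w / area:.2f} W/mm^2 under its own heatsink")
print(f"Split setup: {total_w:.0f} W spread over {total_area:.0f} mm^2 of silicon and two coolers")
# Fold the same ~334 W into a single ~600 mm^2 die and one package/heatsink has to
# shed all of it; that is the surface-area argument in a nutshell.
print(f"Merged die (assume ~600 mm^2): {total_w / 600:.2f} W/mm^2 through one package")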
 
Then a single POWER8 processor can run 96 threads...96! Eat your heart out HTT. POWER8 might seriously run Intel back out of the high end server market again...

At 650mm^2, POWER8 is non-competitive against Intel Xeons in the market segments you'd actually use Xeons in. Its primary competitor is the Oracle SPARC T5:

https://en.wikipedia.org/wiki/SPARC_T5

The T5 has 16 cores, each capable of 8 threads, for a total thread count of 128 per chip, and you typically deploy these chips in 2-, 4-, or 8-socket configurations with 256GB to 1TB of system memory. The die is 478mm^2 at 28nm. The biggest difference between them is cache size: IBM uses a gigantic cache in the hopes of maintaining high single-thread speeds in their DB2 software. Oracle went with a wide implementation where the chip is smaller and cheaper but is designed to run fast with dozens of simultaneous threads in the Oracle RDBMS and OWLS software.
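The headline thread counts are just cores times SMT ways, for what it's worth (core counts and SMT widths here are the published figures for these chips):

# Hardware threads = physical cores * SMT ways per core.
chips = {
    "IBM POWER8 (12 cores, SMT8)":      (12, 8),
    "Oracle SPARC T5 (16 cores, SMT8)": (16, 8),
}
for name, (cores, smt) in chips.items():
    print(f"{name}: {cores * smt} hardware threads")
# 96 threads for POWER8 and 128 for the T5, the figures quoted in this exchange.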

These uarchs are for big-iron, high-throughput systems where you're processing tens of thousands of transactions per minute. They aren't cheap, but they get the job done and, most importantly, don't die.
 

juanrga



1) The "about 10x faster" claim only considered CPU-->GPU communication. Thus applies to gaming. For compute one has to consider also CPU-->GPU and other nuisances, increasing the performance gap.

2) Just because ancient programs used less than 640KB doesn't mean that 640KB is enough for everybody. Current games are developed with current hardware limits in mind. Future games don't need to be limited by the constraints of today's hardware.

3) The slow PCIe 3 is already bottlenecking current games. Developers alleviate this bottleneck by using tricks such as low-resolution textures, texture quilting (building large textures by repetition of small textures), texture compression... E.g., Unreal Engine uses textures compressed at a 1:4 or 1:8 ratio (depending on the DXTn format used) before sending them over the PCIe slot. The compressed textures are then decompressed by the GPU.

4) Another important point is that gaming GPUs are not designed in a vacuum, but use the same basic architecture as compute GPUs. R&D costs are then distributed across all the cards that use the architecture. If you develop an architecture for exclusive use in gaming cards, then all the cost has to be transferred to those cards, which implies ultra-expensive cards that no gamer would purchase.

This is essentially the same reason why AMD has not released a Steamroller FX CPU for gamers. Previous FX CPUs shared R&D costs with the Opterons used in servers and HPC. But Steamroller is not competitive enough for server/HPC CPUs; thus, once the Steamroller architecture was dropped from Opteron CPUs, it became evident that AMD couldn't release an FX version, as a mere question of cost.

I recall that I explained this last year, when I claimed that no FX Steamroller was coming to the desktop. Several posters ignored my point and said "wait for the roadmap." The 2014/2015 desktop roadmap proved me right. I did the math. AMD did too.

5) Just because current PCs use slow DDR3 for "system RAM" doesn't imply that "system RAM" has to be slower than "VRAM". The PS4 uses GDDR5 as "system RAM". Next year Intel releases a 'CPU' with 8--16 GB of MCDRAM with a sustained bandwidth of 500GB/s. The Nvidia Titan's VRAM peaks at 288GB/s.

The Nvidia research team that made the above claim is designing an ultra-high-performance APU whose "system RAM" provides 1.6TB/s of bandwidth, i.e. more than 5x the bandwidth of the GDDR5 on the Nvidia Titan designed by the same team.

6) Another point is that games are evolving towards GPU computing. Ancient games used the CPU for everything and the GPU only for basic display, but current GPUs have evolved towards offloading 3D graphics computation from the CPU:

[Image: gpu.jpg]


The next step is offloading the physics and AI computations to the GPU as well. In fact, the APU used in the PS4 has been specifically designed to compute physics and AI on the GPU. Watch Sony's talks. Existing games and demos are already computing physics on the PS4's GPU.

==============================================

Once again:

The past is CPU + dGPU

The future is high-performance APU

The transition for AMD is clearly APU + dGPU

Why do you believe that AMD is enabling APU-dGPU Crossfire? Why do you believe that Mantle has asymmetric multi-GPU support? Why do you believe that AMD is giving talks about using the dGPU for rendering while offloading the post-processing to the APU?

http://gearnuke.com/in-depth-look-at-amd-mantle/



Except that the math says otherwise. The Nvidia research team did the math before making the claim. So did I.
 

juanrga



Indeed. IBM is selling its fabs because it cannot compete against Intel and others.

POWER8 beats the Xeon in raw performance but loses in efficiency (performance per watt). That is one of the reasons why IBM is joining with Nvidia to compete against Intel.



There is agreement that FinFET is more suitable than FD-SOI for high-performance applications. This is why Intel chose FinFETs some time ago, TSMC chose FinFETs for 16nm, Samsung chose FinFETs for 14nm, GloFo chose FinFETs for 14nm, USC chose FinFETs for 14nm, and even IBM chose FinFETs for 14nm. Only STMicro chose FD-SOI for 14nm, and how many customers do they have? Zero?

Please don't link to SOI Consortium marketing again. They have been claiming for years that FD-SOI is going to dominate the world, but it never happens.

It has been shown that FinFETs are scalable beyond 10nm. There are doubts about whether SOI is.

https://www.semiwiki.com/forum/content/3128-soi-future-flop.html

Based on what companies are doing today, the installed and planned capacity for the companies, the likelihood of changes at 10nm and the difficulties of scaling FDSOI to 7nm, I expect the leading edge logic market to look like this:

Node    FD-SOI    FinFET on bulk    FinFET on SOI
14nm    2.1%      96.2%             1.7%
10nm    5.0%      93.5%             1.5%
7nm     1.6%      96.9%             1.5%

Of course, FinFET and SOI are not mutually exclusive, but the cost of doing both doesn't compensate for the small performance improvement over FinFET on bulk. This is why in the above table you see only 1.5-1.7% for FinFET on SOI.
 

juanrga



You can delete the quote from the Nvidia research team, but I can reintroduce it:

In this time frame, GPUs will no longer be an external accelerator to a CPU; instead, CPUs and GPUs will be integrated on the same die with a unified memory architecture.

You can write "hahahahaha", but your post is filled with elementary physics mistakes that neither Nvidia nor Intel/AMD engineers are making. :lol:

You are assuming current CPU:GPU ratios. That is your first mistake. About 50% of the die of Kaveri APUs is devoted to the GPU, and of course a separate GPU would have double the available area for the same die size, but the APU that the Nvidia research team is designing devotes about 90% of the die to the GPU. This means a separate GPU of the same total size would only have about 11% more die space. And that a priori 11% better performance for the discrete GPU (assuming perfect scaling) is completely wiped out by the interconnect, resulting in the APU being much faster than the GPU.

The same goes for heat. Their APU is rated at 300W, and a separate GPU would be rated at about 333W. Now add the power consumed moving data through the interconnect at speeds sufficient to feed a separate GPU, and the result (as Nvidia's engineers also found) is that a separate GPU cannot be rated that high and still fit the ~20MW constraint for the whole HPC system, resulting in a loss of raw performance. This was your second mistake.
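For what it's worth, the arithmetic behind those two paragraphs can be laid out explicitly; the 90% GPU share and 300W budget are the figures claimed above, and the rest simply follows from them:

# If ~90% of the APU die is GPU, a same-size discrete GPU only gains ~11% more GPU area.
gpu_fraction = 0.90
extra_area = 1.0 / gpu_fraction - 1.0      # (1 - 0.9) / 0.9 = 0.111...
print(f"Extra GPU area on a dedicated die: ~{extra_area * 100:.0f}%")

# Apply the same scaling to the power budget.
apu_watts = 300.0
dgpu_watts = apu_watts / gpu_fraction
print(f"Equivalent discrete-GPU power budget: ~{dgpu_watts:.0f} W before interconnect costs")
# The claim is that this ~11% headroom is smaller than what an off-package interconnect
# costs in power and latency, so the integrated part comes out ahead. Whether that holds
# in practice is exactly what the rest of this thread is arguing about.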

You are extrapolating the thermal/power 'laws' you know from current architectures, but the exascale level plays by different rules.

E.g., the current technological gap between wire energy (for a 256-bit transfer) and DFMA energy is a factor of about 6. For exascale this gap increases to a factor of 23, and it goes against a CPU + dGPU combo by a factor of about ten after considering the linear scaling of energy with technology feature size.

You are making the same kind of mistake as those engineers who believe they can extrapolate Newtonian laws to bodies moving at half the speed of light.

Of course AMD's engineers came to the same conclusion about discrete GPUs as Nvidia's or Intel's engineers. This is why AMD's chief engineer mentioned that they selected an APU for exascale HPC. A CPU + dGPU is slower. I gave the link before.

In fact, everyone working on exascale agrees. I have been reviewing Japan's latest report on their exascale project, and they come to the same conclusion: only a single heterogeneous die can provide the required performance. Their prototype design is a 16L+256T chip (i.e., 16 CPU cores @ 4GHz plus 256 GPU-like cores @ 1GHz on the same die). The total performance of the chip is 9 TFLOPS (DP).

After this short explanation of why a separate GPU will actually be slower than an APU, I hope you don't reply asking about or mentioning something weird about the CPU.

The same story with memory. The Nvidia research team uses ultra-fast memory, but the CPU only needs about 10% of the total bandwidth. This means a separate GPU would have only about 11% more bandwidth available, which is completely wiped out by the interconnect bottleneck, resulting in the APU being much faster than the separate GPU.

Stacked RAM is not incompatible with using additional DIMMs to expand system memory. This is the approach Intel is taking next year: 8-16GB of stacked RAM plus DDR4 DIMMs. It is the approach taken by AMD as well for its HPC APU.

The Nvidia research team takes a different approach. Their design uses an MCM package with 256GB of system RAM alongside the APU die. Probably the stacks are replaceable instead of soldered, because there is no mention of any DIMMs in the design.




To be fair, IBM POWER is doing infinitely better than SPARC designs in high-performance applications. See the red area below? That is IBM's share. See the white area? That is SPARC. It is almost dead now.

[Image: Processor families in TOP500 supercomputers]

 
1) The "about 10x faster" claim only considered CPU-->GPU communication. Thus applies to gaming. For compute one has to consider also CPU-->GPU and other nuisances, increasing the performance gap.

Physics fail. You simply aren't going to get a powerful enough APU on a single die, due to yield/power constraints. dCPU + dGPU will always be faster as a result.

2) Just because ancient programs used less than 640KB doesn't mean that 640KB is enough for everybody. Current games are developed with current hardware limits in mind. Future games don't need to be limited by the constraints of today's hardware.

No, they will be constrained by the hardware of the time, just like always.

3) The slow PCIe 3 is already bottlenecking current games. Developers alleviate this bottleneck by using tricks such as low-resolution textures, texture quilting (building large textures by repetition of small textures), texture compression... E.g., Unreal Engine uses textures compressed at a 1:4 or 1:8 ratio (depending on the DXTn format used) before sending them over the PCIe slot. The compressed textures are then decompressed by the GPU.

Decompression of textures is simple enough to do, just like decompression of audio is seamless on the CPU. We have benchmarks that show essentially no performance advantage for PCIe 3.0 x16 versus PCIe 1.1 x16; maybe 2-3 FPS. PCIe isn't a major performance bottleneck.
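The block-compression numbers are easy to sanity-check: DXT1 stores each 4x4 pixel block in 8 bytes and DXT5 in 16 bytes, versus 64 bytes for uncompressed RGBA8, which is where the 1:8 and 1:4 ratios come from. A rough sketch of what that means for one large texture crossing the bus:

# Size of a 4096x4096 RGBA8 texture, raw vs. DXT-compressed, and rough PCIe transfer time.
width = height = 4096
raw_bytes = width * height * 4            # 4 bytes per pixel (RGBA8)
blocks = (width // 4) * (height // 4)     # number of 4x4 pixel blocks
dxt1_bytes = blocks * 8                   # 8 bytes per block  -> 1:8
dxt5_bytes = blocks * 16                  # 16 bytes per block -> 1:4

pcie3_x16_gbs = 15.75                     # approximate theoretical peak, GB/s
for name, size in [("raw RGBA8", raw_bytes), ("DXT1", dxt1_bytes), ("DXT5", dxt5_bytes)]:
    ms = size / (pcie3_x16_gbs * 1e9) * 1e3
    print(f"{name}: {size / 2**20:.0f} MiB, ~{ms:.2f} ms over PCIe 3.0 x16")
# Even the uncompressed texture crosses the bus in a few milliseconds, which fits the
# benchmarks showing little FPS difference between PCIe generations.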

4) Another important point is that gaming GPUs are not designed in a vacuum, but use the same basic architecture as compute GPUs. R&D costs are then distributed across all the cards that use the architecture. If you develop an architecture for exclusive use in gaming cards, then all the cost has to be transferred to those cards, which implies ultra-expensive cards that no gamer would purchase.

Partially true, but GPUs of all types work on the same basic principles, so you aren't going to see significant design variance between compute and gaming GPUs. At the end of the day, they're simply massively parallel floating-point processors.

5) Just because current PCs use slow DDR3 for "system RAM" doesn't imply that "system RAM" has to be slower than "VRAM". The PS4 uses GDDR5 as "system RAM". Next year Intel releases a 'CPU' with 8--16 GB of MCDRAM with a sustained bandwidth of 500GB/s. The Nvidia Titan's VRAM peaks at 288GB/s.

The Nvidia research team that made the above claim is designing an ultra-high-performance APU whose "system RAM" provides 1.6TB/s of bandwidth, i.e. more than 5x the bandwidth of the GDDR5 on the Nvidia Titan designed by the same team.

RAM speed isn't a major bottleneck. Even on APUs, you see performance gains flatline around DDR3-2133. Secondly, DDR is still the main choice because other forms of RAM are a LOT more expensive, making them uneconomical for general use. If performance mattered more than cost, we'd be using Rambus RAM rather than DDR right now.

6) Another point is that games are evolving towards GPU computing. Ancient games used the CPU for everything and the GPU only for basic display, but current GPUs have evolved towards offloading 3D graphics computation from the CPU:

[Image: gpu.jpg]


The next step is offloading the physics and AI computations to the GPU as well. In fact, the APU used in the PS4 has been specifically designed to compute physics and AI on the GPU. Watch Sony's talks. Existing games and demos are already computing physics on the PS4's GPU.

Anything that scales and is vector-based should really have the option of being offloaded to the GPU. First it was rendering, then physics. You could even make a case for AI. Of course, overloading the GPU relative to the CPU starts becoming a concern at that point. What we can do is limited by how fast GPUs gain performance.

We've been doing physics on the GPU since NVIDIA acquired Ageia. And I've been the most vocal person here saying that it's about bloody time we start moving physics over, since multi-object interactions simply kill the CPU (too much to handle with so few resources).

That being said, with PhysX being the only major physics API that's GPU-accelerated, I doubt we'll see this become a major trend.
 
Juan, we've been down this road; you have no idea what you're talking about. You're rehashing old arguments with half-baked understandings that have already been disproven.

There is a physical limit to the number of transistors you can fit on a die of a specific size. The larger the die, the more uneconomical it becomes to produce; the smaller the die, the bigger a problem thermal dissipation becomes. This is a problem AMD's APUs have been fighting for a while now.

An i5-4670K has an expected thermal dissipation of 84W. A GTX 780 has an expected dissipation of 250W. Combined, they require 334W of heat to be removed at expected load; in reality, you can expect 100% load on both to produce more than 334W of heat. If that were a single chip, you're beyond water cooling and into phase-change or Peltier territory. That's without getting into dual-dGPU setups. Any future design that would allow a powerful APU would also allow an even more powerful dCPU + dGPU combination.

Another thing that pops up: an off-board dGPU will always, 100% of the time, without fail, be faster at processing batch SIMD data than anything you put on a CPU. The R9 290X has 2816 stream processors (44 CUs) connected to a dedicated 4GB of 512-bit GDDR5 memory with 320GB/s of bandwidth. The floating-point (SIMD) performance is 5632 GFLOPS single precision and 704 GFLOPS double precision. The die is 6.2 billion transistors and takes up 438mm^2. The TDP is 250W.

The most powerful APU currently is the A10-7850K, with 512 stream processors (8 CUs) connected to a shared 128-bit DDR3 bus that runs at up to 34.1GB/s of bandwidth. Floating-point (SIMD) performance is 737.3 GFLOPS single precision and 46.1 GFLOPS double precision. The die is 245mm^2 and contains 2.41 billion transistors. The TDP is 95W.

Notice the massive difference in scale between those two. The 7850K has absolutely no hope of ever being in the same category as the R9 290X in raw performance. It has neither the raw size nor the heat dissipation capacity to compete.
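Those FLOPS figures aren't magic, by the way; for GCN parts they fall straight out of shader count x clock x 2 ops per cycle (one fused multiply-add). A quick check against the numbers above, using the usual published clocks:

# Peak single-precision GFLOPS for GCN: shaders * clock (GHz) * 2 ops per cycle (FMA).
def peak_sp_gflops(shaders, clock_ghz):
    return shaders * clock_ghz * 2

chips = {
    # name: (stream processors, clock in GHz, DP rate as a fraction of SP)
    "R9 290X":              (2816, 1.000, 1 / 8),
    "A10-7850K (GPU half)": (512, 0.720, 1 / 16),
}
for name, (sp, clk, dp_rate) in chips.items():
    sp_gflops = peak_sp_gflops(sp, clk)
    print(f"{name}: {sp_gflops:.1f} SP GFLOPS, {sp_gflops * dp_rate:.1f} DP GFLOPS")
# R9 290X: 5632.0 SP / 704.0 DP; A10-7850K: 737.3 SP / 46.1 DP, matching the post above.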

It has a single advantage: low latency when processing small batches of FP instructions. I say small because the dGPU will always be faster but incurs a transfer latency. When the instruction count is small enough that the transfer takes longer than the execution itself, it becomes inefficient to bundle the work off to the dGPU for processing. Unfortunately, we already have something that does exactly that job: the SIMD FPU. An external dedicated SIMD FPU coprocessor would always be faster than an integrated one, yet typical code has such a low density of SIMD FPU instructions that there was rarely a situation where you'd want a powerful coprocessor. Instead, a local 8087 coprocessor proved faster with the small amount of 8087 code that was present. The same logic holds for SSE/AVX/FMA instructions; their density is such that it's better to have a small local FPU than a big remote FPU. That changes radically when you get into GPGPU and heavy data processing like rasterization and physics calculations. At that point the amount of SIMD FPU code skyrockets, and now the large, powerful remote option becomes the faster choice.
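That break-even point can be sketched with a toy model: offloading only pays once the work is big enough that the GPU's throughput advantage outweighs the fixed cost of moving the data and launching the work. All the constants below are illustrative placeholders, not measurements:

# Toy model: offload to the dGPU only when transfer overhead < time saved on compute.
def cpu_time_s(flops):
    return flops / 100e9                      # assume ~100 GFLOPS of usable CPU SIMD

def gpu_time_s(flops, bytes_moved):
    transfer = bytes_moved / 15.75e9 + 20e-6  # PCIe 3.0 x16 plus ~20 us of launch latency
    return transfer + flops / 5000e9          # assume ~5 TFLOPS usable on the dGPU

for n in (10_000, 1_000_000, 100_000_000):
    flops = 100 * n                           # say ~100 flops of work per element
    data = 8 * n                              # and one 8-byte value moved per element
    winner = "GPU" if gpu_time_s(flops, data) < cpu_time_s(flops) else "CPU"
    print(f"{n:>11,} elements: CPU {cpu_time_s(flops) * 1e3:.3f} ms, "
          f"GPU {gpu_time_s(flops, data) * 1e3:.3f} ms -> {winner}")
# Small batches stay on the local SIMD units; large batches amortize the transfer and
# the big remote FPU wins, which is the point being made above.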

Short version: you're wrong. Long version: any technology that would enable a powerful APU would also enable an even more powerful dGPU, at least a 4~5x difference. Any technology that enables fast local memory would also enable faster dedicated memory on the remote dGPU, at a 16~32x difference. GDDR5 is just DDR3 customized for high bandwidth; GDDR6/7 would then be based on DDR4 with the same level of customization for higher bandwidth.
 
To be fair, IBM POWER is doing infinitely better than SPARC designs in high-performance applications. See the red area below? That is IBM's share. See the white area? That is SPARC. It is almost dead now.

The fact that you said this shows you have no clue how big iron works. It's like saying diesel trucks are dead because there aren't any in Formula 1 racing.

Most big-iron systems aren't supercomputers. They are smaller implementations that run LoB and business software for businesses. IBM and Oracle are the two choices for financial analysis, forecasting, and modeling, along with sales tracking and medical databases. The typical configuration involves either IBM DB2 or Oracle RDBMS running on POWER / SPARC, with some form of managed J2EE web platform that runs all the business applications. The way J2EE works in managed mode is that you deploy a single administrative instance (you can actually do multiple), then as many worker installations as you want. The admin interface then deploys the web applications into the virtual containers running on the worker installations, with database connections using JDBC and other Java-based data factories.

Take something like the T5-2. You would run three T5-2s with one T5-4. The T5-4 would host the Oracle RDBMS with 512GB of system memory and 10GbE network connections, or even LACP configured across 4~8 1GbE network connections. Each of the T5-2s would also run LACP connections. Inside each T5-2 you would deploy three or four J2EE worker installations; the T5-4 would get the Oracle RDBMS and the administrative installation inside their own containers. From there you can deploy hundreds of web applications, with centralized management, redundancy, and clustering, into those nine to twelve worker zones. You might even throw in an installation of DSEE if you want to use a separate LDAP for user access and control. There are also dozens of Fusion Middleware applications that can be deployed to interconnect this system with other, non-Oracle systems. There will also be specialized line-of-business applications on the system that often need to communicate with back-end web applications inside that portal. This is the reason Oracle and IBM are designing chips with ridiculous thread counts: these environments actually have that many worker threads doing stuff. Single-thread performance no longer matters, only the total amount of I/O you can keep pumping through the machine.

That is what an enterprise-class business portal looks like. It costs a million or so USD, most of which is licensing for the database and J2EE web management software. You'll want to toss in some sort of shared storage system if you don't already have an enterprise SAN, so that may or may not add to the cost. Millions of large businesses across the world use a system very similar to that to manage their business and financial data. Those quarterly sales reports, guess where they are produced? Certainly not on some finance guy's desktop computer.

And since those systems aren't supercomputers, they will never show up on a "Top 500" chart.
 