I think you are confusing work queues with multithreading.
Those are special registers that contain the starting address of each thread, and that is how it keeps track of so many threads at *the same time*.
In SPARC the *block-multithreading* model is a different story, because the TLB (translation lookaside buffer) can be kept warm, and the L1 caches, which may contain pre-fetched, pre-decoded instructions, are also kept warm for all of those threads (current versions are 8 threads per core). But the pipeline only executes one thread at a time, dictated by its internal PC (program counter) logic, not by the OS (operating system) scheduler. Basically it takes advantage of the many pipeline bubbles that always exist in common code to perform fast internal context switches... in the end *the pipeline* only executes one thread at a time, but it can be quite efficient: even if there were only 1 thread to execute, the pipeline would always be full of bubbles from one clock cycle to the next, and if you have 2 or more threads on 1 core targeting those bubbles, the original thread can execute in almost exactly the same time frame as if it were alone, yet the core gives "the illusion" that it is executing more than one thread at a time.
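The bubble-filling argument above can be sketched with a toy simulator (everything here is invented for illustration: the function name, the latency numbers): one pipeline issues from whichever thread is ready, so a stalled thread's wait cycles overlap another thread's useful work instead of being wasted.

```python
def pipeline_cycles(threads):
    # Toy model: each thread is a list of stall ("bubble") counts, one per
    # instruction, i.e. how many cycles the thread waits after issuing it.
    # A single pipeline issues at most one instruction per cycle, from the
    # first thread that is ready; stalled threads just wait in parallel.
    ready = [0] * len(threads)   # cycle at which each thread can issue again
    nxt = [0] * len(threads)     # next instruction index per thread
    cycle = 0
    while any(n < len(t) for n, t in zip(nxt, threads)):
        for i, t in enumerate(threads):
            if nxt[i] < len(t) and ready[i] <= cycle:
                ready[i] = cycle + 1 + t[nxt[i]]  # issue cycle + its bubbles
                nxt[i] += 1
                break                             # only one issue per cycle
        cycle += 1
    return cycle

a = [3, 3, 3, 3]   # 4 instructions, each followed by a 3-cycle bubble
b = [3, 3, 3, 3]
print(pipeline_cycles([a]))      # -> 13: one thread alone, bubbles wasted
print(pipeline_cycles([a, b]))   # -> 14: a second thread rides the bubbles
```

Two threads finish in 14 cycles where one alone needs 13: thread `a` issues on exactly the same cycles as when it runs alone, which is the "almost the same time frame" effect described above.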
SMT (simultaneous multithreading, a.k.a. HTT in Intel lingo) and AMD CMT (cluster multithreading) are evolutions of this logic in which 2 threads really are executed at the same time by the pipeline (it has advantages and some drawbacks). The difference between the 2 is that CMT has dedicated hardware, almost in a logic of co-processors (clusters); actually the FlexFPU *IS* a co-processor, meaning it can track the progress of a thread semi-independently. So 3 threads at the same time per module (I DON'T KNOW) could be possible, but only for a fraction of a split second upon an OS-dictated context switch, while the FPU finishes executing/writing back a few instructions left over from a previous thread. In practice and in reason it should be said that each module only executes 2 threads at a time, because the pipelines must be flushed (any register/cache state must be written back to memory to provide consistency) upon those OS context switches.
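The contrast between one-thread-per-cycle multithreading and SMT can be shown with another toy model (all names and numbers are made up for illustration): give each thread a dependent chain that can only fill one issue slot per cycle, and a 2-wide pipeline. Without SMT, one thread owns the whole cycle and the second slot goes empty; with SMT, the other thread fills it in the same cycle.

```python
def cycles_to_drain(threads, width, smt):
    # threads: instruction counts per thread; each thread is modeled as a
    # dependent chain, so it can issue at most ONE instruction per cycle.
    remaining = list(threads)
    cycle = 0
    while any(remaining):
        if smt:
            # SMT: the `width` slots of one cycle are shared among threads
            issued = 0
            for i in range(len(remaining)):
                if remaining[i] and issued < width:
                    remaining[i] -= 1
                    issued += 1
        else:
            # block/interleaved MT: one thread owns the whole cycle, but
            # its dependent chain still fills only one of the slots
            live = [i for i, r in enumerate(remaining) if r]
            remaining[live[cycle % len(live)]] -= 1
        cycle += 1
    return cycle

print(cycles_to_drain([4, 4], width=2, smt=False))  # -> 8 cycles
print(cycles_to_drain([4, 4], width=2, smt=True))   # -> 4 cycles
```

This is exactly the case SMT is built for: when one thread cannot fill the pipeline's width, the second thread's instructions occupy the spare slots in the *same* cycle.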
IMHO AMD CMT is clearly superior, not so much because of having quite a few additional dedicated resources for 2 threads (the integer cores of so much polemic), which could/should always boost 2 threads at the same time... but more because AMD uses multithreading logic throughout its pipeline, i.e. it is divided into thread <domains> in the sense explained above about block-multithreading... it is *Vertical Multithreading*...
In BD and PD the logic is one cycle per thread on each domain of the pipeline, that is, one domain deals with one thread's instructions on one cycle, then the next cycle it switches to the other thread, and on and on (very similar to the interleaved multithreading exercises of Cray). In Steamroller the logic is changed to 2 cycles per thread, making it a true vertical block-multithreading scheme.
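The two grant schedules described above can be sketched as follows (the function and its parameters are hypothetical, purely to visualize the pattern):

```python
def schedule(n_cycles, n_threads, cycles_per_thread):
    # Returns which thread "owns" a pipeline domain on each cycle under a
    # round-robin grant of `cycles_per_thread` consecutive cycles.
    return [(c // cycles_per_thread) % n_threads for c in range(n_cycles)]

# BD/PD style: the domain alternates threads every single cycle
print(schedule(8, 2, 1))  # -> [0, 1, 0, 1, 0, 1, 0, 1]
# Steamroller style as described above: 2-cycle blocks per thread
print(schedule(8, 2, 2))  # -> [0, 0, 1, 1, 0, 0, 1, 1]
```

Same total cycles per thread either way; the 2-cycle block just keeps one thread's state in a domain for longer before switching.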
In this context, there aren't exactly 2 decoders on Steamroller... the above is "as if", but it is an illusion. As revealed in an RWT thread (rumor or not, I don't know), SR will have the same 4 decode pipes, naturally considerably beefed up, but still 4, and *the way I see it* the difference is that it will have 2 dedicated decode-domain input buffers and 2 dedicated output buffers that share the 4 decode pipes in an SMT fashion (simultaneous multithreading, that is, decoding from 2 threads simultaneously, like the FlexFPU and the L2, but on a vertical logic) plus a block-multithreading fashion. It can be tremendously more effective for those same 4 pipes, courtesy of the vertical multithreading scheme.
This is just to show how versatile and superior this "vertical multithreading" is... as if each group of pipeline stages in a *domain bordered by input/output buffers* were in itself a "vertical co-processor" (well, a bit exaggerated lol)... and how much easier it will be to replace or improve those <domains> without having to re-design a whole chip, as in the traditional/synchronous pipelines of Intel cores. The AMD BD uarch is an asynchronous/semi-synchronous pipeline: it could much more easily gain efficiency by making parts run ahead... could much more easily change the resources/characteristics of each domain... could much more easily make each module with 3 or 4 integer thread cores/clusters, or a number of *heterogeneous cores/clusters*... it could even, without much difficulty, put ARM AArch64 integer cores where the x86 cores are now lol...
Yes, many opinions paint the BD uarch as a failure (propaganda is rampant in all competitive, lucrative businesses), but IMHO it is not a failure; its *POTENTIAL* is clearly superior to anything Intel has... this asynchronousness and modularity *potential* is very, very difficult to get right (no wonder it took, as rumored, 8 years to finish the first iteration), but it could provide an accelerated path for improvements in successive iterations.