Collection of AMD K10 data

ches111 · Feb 10, 2007

AJ,

Scientia carries a significant presence in AMDZone. He was forthright in letting people know that.

Not like some others who have an obvious opinion but do not state their affiliation/s.

That is all I was recognizing.

Similar to JkFlipFLop who obviously does not like John Kerry but also has been up front about being an Intel employee.

Edited to add: NMDante as he has been up front about being an Intel employee as well.

Ranman68k · Feb 10, 2007

You guys are giving me a warm, fuzzy feeling inside.

Kum Bah Ya

Everyone is being so nice! Hopefully, this trend will be contagious!

Very informative thread!

ajfink · Feb 10, 2007

AJ,

Scientia carries a significant presence in AMDZone. He was forthright in letting people know that.

Not like some others who have an obvious opinion but do not state their affiliation/s.

That is all I was recognizing.

Similar to JkFlipFLop who obviously does not like John Kerry but also has been up front about being an Intel employee.

I know who Scientia is, I was just making a little joke. It's all good, 8)

I also enjoy it when people get along.

As to K10 needing to be a beast compared to C2D... it won't be. On the desktop side it will be on par at least, no doubt, until Intel can crank the clock speed up (and then maybe AMD will be able to do the same?). On the server side, Intel has a much, much tougher task set for it. Current Opteron systems are still competitive in the 4S+ space, and Barcelona will be the new 800-pound gorilla.

gOJDO · Feb 10, 2007

@Scientia
First of all, welcome to THG. I had very low expectations about you. I am glad that I was wrong. At least you have proven that you are a man with balls, unlike the sock puppet noobs who came only to troll and spam here. You might like AMD, but you know your stuff and I respect that. I hope that you will make THG better, balancing the C2 euphoria with your knowledge about AMD. Thank you for your input in this thread.

@Jack
I will ask Jake to make this thread sticky and I will update it with any useful data. Also I would love to see any benchmarks. Someone who has benched an ES (but can't publish the results) told me that it has wonderful power management and that it outperforms Clovertown in FP by 40%, clock for clock.

@Baron
About the L1, I found two conflicting claims: that it will be 128kB (64kB i/64kB d) and that it will be 64kB (32kB i/32kB d). According to AMD slides it is 64kB total. Can you provide any data that will determine which is true?

bigwater · Feb 10, 2007

VERY good job gOJDO. I was getting sick of the Intel-AMD battles. 8)

Scientia_From_AMDZ · Feb 10, 2007

Thank you. I know that the question of whether I've posted here before under an alias has been raised many times but, no, this is the first time. I got fed up with anonymous posting on my blog and Sharikou's blog. Anonymous posting seems to just make it easier to flame and removes any accountability from the poster. Also, I didn't care for the spoof posting that Sharikou180 has been doing (posting with someone else's ID). Any person with integrity will always own what they say even if they say something really stupid or incorrect. You say, "Oops, I was wrong" and then move on.
------------

To be honest, I've never had a lot of faith in MS's ability to do robust load sharing and management on multi-cpu systems. It is possible that MS could get it right with Vista but I think it is more likely that versions of Unix and Linux will remain the standard on server applications.

MS has a great history of snatching knowledge from other areas, as they did when they bought MS-DOS from another company, and then with Xerox PARC and MacOS for Windows, and then VMS for Windows NT. The original team leader for Windows was from Xerox PARC and after he quit Windows was completed by MS's Mac applications team leader. The lead for Windows NT designed VMS, which is why the initials are one letter higher: V->W, M->N, and S->T. I guess they copied that from HAL. Maybe they'll get it right this time. However, I guess any improvement would be good.

I don't believe the bit instructions LZCNT/POPCNT have anything to do with alignment. I believe POPCNT counts the number of 1 bits in a word and LZCNT counts the number of leading zero bits. Itanium has POPCNT, which makes it much faster on certain parity benchmarks. However, several of the other hardware improvements should indeed help with alignment, including the prefetch buffer, which at 32 bytes is large enough to not break long instructions in half. I recall when the Motorola 68000 had to have instructions and data aligned on word boundaries.
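
For reference, the two bit operations can be sketched in software. This is only a rough Python illustration, not the hardware instructions themselves; the function names `popcnt` and `lzcnt` and the 32-bit default width are assumptions picked for the example.

```python
def popcnt(x: int, width: int = 32) -> int:
    """Count the 1 bits in the low `width` bits of x."""
    mask = (1 << width) - 1
    return bin(x & mask).count("1")


def lzcnt(x: int, width: int = 32) -> int:
    """Count the zero bits above the highest set bit, within `width` bits."""
    mask = (1 << width) - 1
    return width - (x & mask).bit_length()


print(popcnt(0b1011))  # three bits are set
print(lzcnt(1))        # only bit 0 is set in a 32-bit word
```

Neither operation looks at addresses, which is why they would not be expected to affect alignment.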

Yes, in terms of performance it looks like:

K8 -> K10

will be the same jump as

Yonah -> Core 2 Duo

or K7 -> K8

There is definitely some convergence going on with Intel and AMD architecture. For example it is interesting the way that Intel used a hybrid bus on Tulsa to increase performance; this is a half step to native quad. With the large number of similarities between K10 and the C2D architecture you have to wonder how similar they will be when Intel has both IMC and P2P and both are producing quad cores on 45nm. Intel and AMD also both appear to be pursuing GPU based computation.

Scientia_From_AMDZ · Feb 10, 2007

Thank you for your input in this thread.

You're welcome.

About the L1, I found two conflicting claims: that it will be 128kB (64kB i/64kB d) and that it will be 64kB (32kB i/32kB d). According to AMD slides it is 64kB total. Can you provide any data that will determine which is true?

It would seem unlikely to me that AMD would suddenly reduce L1 size. It has been the same ever since K7. Also, to reduce cache size AMD would need to make it greater than 2-way associativity to maintain the same performance. And, on the die shots the L1 cache size seems the same. I suppose it is possible that they've changed it but I haven't thought of any reason that would make it worth the effort.
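
To make the associativity point concrete, here is a rough sketch of how size, ways, and line size determine the number of sets. It assumes 64-byte cache lines; the function name `cache_sets` is made up for the example, and the 32kB configurations are hypothetical, not anything AMD has announced.

```python
def cache_sets(size_bytes: int, ways: int, line_bytes: int = 64) -> int:
    """Number of sets = total size / (associativity * line size)."""
    if size_bytes % (ways * line_bytes) != 0:
        raise ValueError("size must be a multiple of ways * line size")
    return size_bytes // (ways * line_bytes)


KB = 1024
print(cache_sets(64 * KB, ways=2))  # 64kB, 2-way
print(cache_sets(32 * KB, ways=2))  # hypothetical 32kB, 2-way
print(cache_sets(32 * KB, ways=4))  # hypothetical 32kB, 4-way
```

Halving the size at the same associativity halves the number of sets, so more addresses compete for each set. Adding ways is the usual way to offset that.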

piesquared · Feb 10, 2007

Great posts guys/gals.

Anyway, I posted this over at xtremesystems, but I'll post it here too. Strictly speculation though, so I don't know if it belongs.
IMO K10's frequency numbers seem low. I've read on several occasions about how well IBM's 65nm process is scaling. Of course it's a different design, and call it a hunch, but I'm thinking Barcelona is going to clock a lot higher. I don't remember seeing much documentation with an AMD stamp on it (such as the ones gOJDO posted) about clocks. I can't imagine AMD designed silicon that would only scale a few hundred MHz for their server chips.

[edit]
Unless of course those few hundred MHz make a sizable performance increase.

dasickninja · Feb 10, 2007

I don't pretend to understand anything in here, but hopefully this thing overclocks like a dog.

ajfink · Feb 10, 2007

Core Duo -> C2D was actually a fairly small performance difference in a lot of benchmarks: anywhere from 1% to the mid-teens, and the high end of that was heavy multimedia work.

Scientia_From_AMDZ · Feb 10, 2007

As long as Intel is on bulk silicon while AMD is on SOI, Intel chips will always have more overclocking potential. It is the nature of bulk silicon that you have to allow a bigger thermal margin while SOI can be close to the limit. Because of the larger margin Intel chips can usually be clocked higher when done carefully while AMD chips with lower margin cannot. Intel will lose this margin if they move to SOI.

Remember Flippies and HD punches? Intel's margin is the same; usually it can be used but it isn't guaranteed by the factory.

CaptRobertApril · Feb 10, 2007

How it will compete against the current Core2 Quad depends on its frequency. IMO, clock for clock both will perform similarly.

Superb post, gOJDO. Given the fact that the current roadmaps show that K10 will not be issued in any frequency over 2.5-2.6GHz for the next 18 months, do you believe that there is anything in the specs that can substantiate the bandied 40-70% performance increase over C2Q?

gOJDO · Feb 10, 2007

How it will compete against the current Core2 Quad depends on its frequency. IMO, clock for clock both will perform similarly.

Superb post, gOJDO. Given the fact that the current roadmaps show that K10 will not be issued in any frequency over 2.5-2.6GHz for the next 18 months, do you believe that there is anything in the specs that can substantiate the bandied 40-70% performance increase over C2Q?

We don't know for sure what the release frequencies of Barcelona will be, because we don't have any official AMD info. According to HKEPC and DailyTech, the highest-clocked Barcelona (quad-core) this year will be 2.5GHz. It will have over 40% more memory bandwidth per CPU than Clovertown. I am very certain that it will have a faster FPU, but I am not sure by how much. I have some info that, clock for clock, its FPU is 40% faster, but I don't know how the 40% is measured. So, it is possible that it will perform 40% faster for certain kinds of software, maybe even more if the software is bandwidth dependent. But most server software is much more INT dependent. Compared to K8 (still IMO), K8L will have a 5% to 10% faster ALU, but it will still be slower than C2, clock for clock.

Scientia_From_AMDZ · Feb 10, 2007

40% could also be performance per watt. For example, it is possible that AMD is comparing an Opteron system to a dual or quad FSB Clovertown system with FBDIMM and ending up with 29% less power consumption. As I recall, AMD specifically said that the number could include performance per watt. Or it could be a combination.
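
The arithmetic behind that reading is easy to check. A minimal sketch, assuming performance is unchanged and only power drops by 29%; the function name and the equal-performance assumption are mine for illustration, not AMD's stated method.

```python
def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Relative gain in performance per watt versus a baseline of 1.0."""
    return perf_ratio / power_ratio - 1.0


# Same performance (ratio 1.0) at 29% less power (ratio 0.71).
gain = perf_per_watt_gain(perf_ratio=1.0, power_ratio=1.0 - 0.29)
print(f"{gain:.1%}")
```

With those inputs the gain works out to roughly 41%, so a "40%" claim would not need any increase in raw speed at all.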

BTW, I wouldn't expect higher clocks in 2007. Barcelona uses a different transistor design than Brisbane and it will take some time to bring the speed up. It is possible that they could be bumping the speed by the time they begin 45nm. Supposedly, Barcelona is designed for a 45nm shrink, not to mention that K10 will be modular by the time DC 2.0 is released. I would say that any clock increases will either be with Brisbane or not until 2008 with K10.

CaptRobertApril · Feb 10, 2007

Good point, gOJDO and Scientia. I guess that the only way to really find out is when they are actually physically benchmarked by an independent third party. Until then, it's all just vapourware speculation.

Don't get me wrong, I am seriously in the market for dual quads and would love nothing better than to go AMD (I'm on a San Diego 3700 right now). But price/performance ratio is my religion and I'm just gonna have to make a rational decision when these puppies are out and the initial price burst is over.

gOJDO · Feb 10, 2007

This slide:

confuses me. I found a lot of articles stating that the L1 would be 128kB, same as on K8. So:
1. the 64kB L1 on the AMD slide is a typo or
2. the L1 is 4-way

from this thread:

The L3 cannot be hooked to L2 via the Xbar. If it were configured like this the normal traffic between L2 and L3 would consume too much bandwidth on the Xbar.

The L3 is not connected to L2 at all. It is connected to the L1. But how?
If it is via the Xbar, then it would have to share the same bus with the data from RAM. Doesn't sound very logical to me.
I think that the L1 has a 128-bit bus to L2 and a 128-bit bus to L3. This also makes the 128-bit (64-bit per direction) L2 to L1 bus sound more logical, because otherwise the 128-bit bus would be a bottleneck for the 256-bit L1. The L1 can do two 128-bit loads per cycle, one load and one write per cycle, or only one write per cycle. So, here goes the logic: one load from L2, the other from L3. L2 is used for storing the executed code. It is exclusive, thus the stored data can't exist in L3 at the same time (L3 could be inclusive, but the exclusivity of L2 prevents the data duplication). So, what do you think?
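
The bottleneck argument can be put in numbers. This is only a sketch of my speculation above: it assumes each bus delivers its full width every cycle, and the function and variable names are made up for the example.

```python
def bytes_per_cycle(bus_bits: int, transfers_per_cycle: int = 1) -> int:
    """Peak bytes moved per clock for a bus of the given width."""
    return bus_bits // 8 * transfers_per_cycle


# L1 peak load demand: two 128-bit loads per cycle.
l1_load_demand = bytes_per_cycle(128, transfers_per_cycle=2)

# Supply from a single 128-bit bus (L2 only) versus two of them (L2 and L3).
one_bus_supply = bytes_per_cycle(128)
two_bus_supply = 2 * bytes_per_cycle(128)

print(l1_load_demand, one_bus_supply, two_bus_supply)
```

Under those assumptions a single 128-bit bus covers only half of the L1's peak load rate, while two such buses match it.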

m25 · Feb 10, 2007

I think the answer is that this slide is much older and less accurate than the other data you submitted.

gOJDO · Feb 10, 2007

The slide is from the same presentation, by Ben Sander, 10/10/2006, "AMD FPF 2006".

m25 · Feb 10, 2007

Yes, just that: a PRESENTATION, and that diagram with thin lines has a pretty childish look, so I think the problem is there: a typo.

Ranman68k · Feb 10, 2007

This slide:

confuses me. I found a lot of articles stating that the L1 would be 128kB, same as on K8. So:
1. the 64kB L1 on the AMD slide is a typo or
2. the L1 is 4-way

In the context of this slide, it is possible the slide is describing the L1 cache for data and not the L1 cache for instruction.

Scientia_From_AMDZ · Feb 11, 2007

That is basically the same diagram as Phil Hester's June Analyst Day presentation, page 14.

As far as I can tell, the diagram simply didn't have enough room to show both the L1 Instruction and L1 Data caches. I believe the diagrams were simplified to make it easier to show the 3 levels of cache. I would say that the diagram only shows one of the two L1 caches and therefore does not show total L1 size.

gOJDO · Feb 11, 2007

What do you think about the L1 to L2 and the L1 to L3 buses?
Do you think that L2 is connected to L3 at all?

Scientia_From_AMDZ · Feb 11, 2007

Yes, I have a lot of respect for Hans. He was the one who refuted the notion that K8L had 4 instruction issue. It is too bad that he stopped posting technical articles on his website. Those were first rate. I know of another gentleman who did very detailed analysis of memory and cache on AMD and Intel processors but then he stopped when K8 was introduced.

That thread you linked to was brutal. It had already deteriorated into flaming by the 2nd page and just got worse from there. Yes, it does have real information in it but the flames and personal attacks really knock things down. I once described that on AMDZ as like eating ice cream with sawdust in it. This thread is going smoothly with lots of information and discussion and no flames and that makes a huge difference in how easy it is to read.

Yes, everything I've seen would suggest 64KB each for Data and Instructions. BTW, I've just seen that AMD's latest press release is titled Ultimate Datacenter Performance-Per-Watt. This again makes me think that the 40% number is mostly power draw and not processor speed.

Scientia_From_AMDZ · Feb 11, 2007

What do you think about the L1 to L2 and the L1 to L3 buses?
Do you think that L2 is connected to L3 at all?

I can see why L3 would only need to be connected to L1 when L3 gets a hit. There wouldn't be any reason to pass this to L2. However, L3 gets updated when L2 overflows. So, it would seem that L2 would need to be connected to L3 to transfer whatever gets pushed out.

However, from the diagram it rather looks like AMD has some kind of crossbar switch on the L1 cache controller. Perhaps this is the most flexible arrangement if not the most direct.

gOJDO · Feb 11, 2007

However, L3 gets updated when L2 overflows. So, it would seem that L2 would need to be connected to L3 to transfer whatever gets pushed out.

Why push out to L3? The L2 is directly connected to the ODMC, bypassing the crossbar.
