Smithfield: 2Q05, 2.8GHz for $240!!!


IIB

Distinguished
Dec 2, 2001
417
0
18,780
A mostly overlooked aspect of Intel not having an arbiter while AMD has an arbiter plus an on-die memory controller is the following:

Cache snooping, an operation where CPU0 looks in CPU1's cache for data it needs, is going to have less than half the latency and twice the bandwidth on the AMD platform, since the path is all on-die and runs at much higher throughput.
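The claim can be put into a rough model. A back-of-envelope sketch, where every hop latency is a made-up illustrative number (not a measured or vendor figure), just to show why an on-die path wins:

```python
# Back-of-envelope comparison of cache-to-cache ("snoop") transfer paths.
# All hop latencies below are illustrative assumptions, NOT vendor specs.

def snoop_latency_ns(hops):
    """Total latency of a snoop: the sum of every hop on the path (ns)."""
    return sum(hops.values())

# Dual-core K8: the request stays on-die, routed by the crossbar/SRQ.
on_die = {
    "core0 -> crossbar": 5,
    "crossbar -> core1 cache": 5,
    "data return": 10,
}

# Two FSB CPUs: the request crosses the bus to the northbridge and back.
off_die = {
    "core0 -> FSB": 20,
    "northbridge arbitration": 20,
    "FSB -> core1 cache": 20,
    "data return over FSB": 40,
}

print(snoop_latency_ns(on_die))   # on-die path total
print(snoop_latency_ns(off_die))  # bus path total
```

With these (assumed) numbers the on-die path comes in well under half the bus path's latency, which is the shape of the argument being made here.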

AMD's dual core is going to be much more effective than Intel's.

A DP Pentium 4 2.8GHz without HT is NOT an attractive offer at $240 when compared to a $240 Athlon 64 3800+ (given that by then it drops to today's 3500+ price level).
The Athlon 64 gets you between 20% and 35% more performance in HL2. You can probably apply smaller performance gains to most other single-threaded applications.
I suspect that in most threaded applications the Pentium 4 will do only a little better than catching up with the A64. Remember that today's 2.8GHz Prescott already enjoys a ~10% boost in threaded applications due to its HT. Even in good threaded workstation-type (not server) workloads, a dual-Xeon setup won't get much better than a 30% advantage over a single hyperthreaded Pentium 4. So that's what we're looking at here... a slight victory for the Pentium 4 in a small portion of threaded apps. And that's for a 300+(!) square mm CPU against an 84mm² CPU,
not to mention heat and power consumption.

BUT - AND THAT'S A BIG ONE - much more importantly for AMD: people who are not technology enthusiasts will LOVE the idea of getting a dual 2.8GHz CPU for that cheap, because the obvious reasoning is 2.8x2 = 5.6, and nobody puts up a 5.6 number, or even close to it, for $240 (or any amount of money, for that matter).

That's the real problem... and that's what AMD's strategy will have to focus on - education. AMD, as always, puts its faith in the customer's IQ and awareness.

At worst, AMD's best defensive option is to put up cheap dual-core Winchesters at around 160 square mm. They would still be reasonable at $240, and even at low speed bins (1.8GHz) they would eat Smithfield for breakfast... The only problem is that if this gets too popular, AMD might, again, run into slight manufacturing problems (slight, because 95% of the CPUs they sell would still be about 20-40% smaller than last year's).

I hope the English is readable, because there is no way I'm spell-checking it..


This post is best viewed with common sense enabled
Edited by iiB on 01/29/05 11:43 PM.
 

Xeon

Distinguished
Feb 21, 2004
1,304
0
19,280
Cache snooping, an operation where CPU0 looks in CPU1's cache for data it needs, is going to have less than half the latency and twice the bandwidth on the AMD platform, since the path is all on-die and runs at much higher throughput.
The arbiter will keep track of that. The latencies and bandwidth between the cores won't be an issue either; they don't need to move something like 3 gigs between them. Which brings me to the clear and obvious question about your point that AMD has more bandwidth and lower latencies: how do you know the latencies? Last I checked, AMD and Intel have stated nothing in regard to latencies, since they won't matter.

Regardless of latencies and bandwidth, the information is just a few system ticks away.

AMD's dual core is going to be much more effective than Intel's.
Got some numbers to back that one up? Truth be known, AMD's solution will most likely be more efficient, but Intel may yet pull a rabbit out of the hat.

A DP Pentium 4 2.8GHz without HT is NOT an attractive offer at $240 when compared to a $240 Athlon 64 3800+ (given that by then it drops to today's 3500+ price level).
How so? You get two processors for the price of one. Unless you actually believe that two-core machines will muster no real-world advantages; then by all means, believe what you want.

The Athlon 64 gets you between 20% and 35% more performance in HL2.
Point being? Besides, real-world gameplay is much different from rail benchmarks. I find it amusing that people now dismiss 3DMark as a reliable benchmark; what makes a rail benchmark any different?

I suspect that in most threaded applications the Pentium 4 will do only a little better than catching up with the A64.
I was unaware the P4 lost to the A64 in threaded benchmarks. As for "catching up", I don't know what to say to that; it just confuses me.

and that's for a 300+(!)
It's 215 actually, but you seem to be on a rant, so 300 it is.

That's the real problem... and that's what AMD's strategy will have to focus on - education. AMD, as always, puts its faith in the customer's IQ and awareness.
Sure they do. Ah yes, the PR-rating deal just screams honesty to me.

Xeon

<font color=red>Post created with being a dickhead in mind.</font color=red>
<font color=white>For all emotional and slanderous statements contact THG for all law suits.</font color=white>
 

P4Man

Distinguished
Feb 6, 2004
2,305
0
19,780
>Vectorized code such as SSE and SSE2 run-in bundles 128bit,
>Intel added and widened this feature for the Prescott core,
>where the processor tries to alleviate stress off the
>memory subsystem by bursting the data whether it be to a
>GPU, IDE controller or memory. Ill get you a link since you
>obviously missed that tidbit of information.

Sure, go ahead and link; it's far better than your reformulated Intel PDFs. It doesn't change the fact that what you said earlier makes *zero* sense. Hint: double precision applies to floating point, not integer.

>Did I miss something? A P4 receives and sends 4x a clock
>tick so no 2x200 would not be the same as 4x200, to be sure
>I am amazed by what you said because it makes no sense what
>so ever.

Yeah, you missed something: the stupidity of your original claim.

> As far as I can tell all existing HT optimized code will
>work “ideally” on smithy.

No, SMP-optimized code will.

> N-Force 5 should be just as interesting I would think.

Except that Intel never designed a memory controller for K7, so it's anyone's guess how much headroom there still is. My guess: pretty much none. I doubt nVidia will do a better job than Intel, and it's a fact there is no historical evidence indicating anything else. There is, however, plenty of historical evidence of Intel seriously outperforming every other third-party memory controller (VIA, SiS, ATI, ...) on the market. What makes you think this is going to be any different?

>So why did I pay for XP if 99% of the time my second core
>will remain idle,

Because you care about the 1% of the time when CPU performance matters (like when playing a game or whatever).

>It doesn’t work that way with HT enabled P4’s goes back and
>forth. I don’t see it changing with smithy.

Yeah, back and forth between WHAT? Going back and forth is only useful if you have two CPU-bound apps or threads to go back and forth between, which, for most people, is almost never.

>How does power even support a argument that a dual core K8
>would be about the same speed as a 2x Opteron

Huh? Fairly simple, really. A CPU can be held back by timing issues or by power issues. For AMD, I don't see power being the limiting factor, at least not with a single core, so I can believe dual-core CPUs would not be a lot slower (in clock speed). For Intel, it's pretty clear that when you apply exotic cooling they can clock considerably higher, implying thermal limitations rather than timing or transistor switching speed. Hence, doubling the power isn't likely to have a positive effect on clock speed. Is that too complicated for you?

>Been dual channel for quite some time.

If so, scratch the "nice performance boost" part.

>What the hell are you talking 1st gen smithies are equipped
>with the arbiter chip, the arbiter chip controls bus
>transactions, current Northbridge’s are not equipped to
>deal with 2 chips on one bus.

Among the AGTL+ design specs are glueless 4+ way SMP and dual independent buses. If current chipsets do not support SMP, it's marketing, nothing else; the bus is perfectly capable of it, as are the chipsets, if they are not crippled.

>But bandwidth isn't the real concern for those processors at this point, thermal output has to be managed.

Ah, so you do agree now ?

>Personally 3-7% is more realistic of a theory,

Show me one real-world app that gains anywhere near 7% using SSE3 compilation versus SSE2 and I'll be impressed. Show me 50 apps, and I might agree "3-7%" is an average one should expect.

>Last I checked the 945, 955 chipsets would support 1066

Sure, they might; but how about the motherboards? Signal integrity is a motherboard design issue, not a chipset issue. You just reinforced my point: if there is no 1066 FSB option today, it's most likely because the motherboards would be too hard/expensive to produce.

= The views stated herein are my personal views, and not necessarily the views of my wife. =
 

Xeon

Distinguished
Feb 21, 2004
1,304
0
19,280
Sure, go ahead and link; it's far better than your reformulated Intel PDFs. It doesn't change the fact that what you said earlier makes *zero* sense. Hint: double precision applies to floating point, not integer.
Yes, and your point is? The statement was clear; I just used a valid example, since SSE, SSE2, and SSE3 already run in bundles, as does MMX. But whatever; you argue for the sake of arguing.

Yeah, you missed something: the stupidity of your original claim.
What sends and receives 4x a clock tick? Oh well, not my problem if you can't understand what I am saying; like I care, for that matter.

No, SMP-optimized code will.
Is that not what I said? Specialized HT code works quite well on dual-processor machines.

What makes you think this is going to be any different?
Wow, you're pretty stupid. VIA was smoking Intel-based chipsets; that was about the time when VIA didn't have a "license" for the P4 FSB. Intel dealt with it quite well by scaring motherboard manufacturers and OEMs into not using it. Ask Crashman; I know he remembers.

Because you care about the 1% of the time when CPU performance matters (like when playing a game or whatever).
HMMMMMM K!

Yeah, back and forth between WHAT?
Registers...

Going back and forth is only useful if you have two CPU-bound apps or threads to go back and forth between, which, for most people, is almost never.
So then what does Windows do? Obviously it can't balance loads, manage threads, or do memory management.

Huh? Fairly simple, really. A CPU can be held back by timing issues or by power issues. For AMD, I don't see power being the limiting factor, at least not with a single core, so I can believe dual-core CPUs would not be a lot slower (in clock speed). For Intel, it's pretty clear that when you apply exotic cooling they can clock considerably higher, implying thermal limitations rather than timing or transistor switching speed. Hence, doubling the power isn't likely to have a positive effect on clock speed. Is that too complicated for you?
Do you even know what you are talking about? Clock speed limits, sure: they come from voltage limitations, which pose transistor switching limitations. But in the end, all it is, is signal integrity due to parts of the CPU switching faster than the rest of the CPU.

But as for a dual-core K8 vs. a dual Opteron, I still don't see where your argument is coming from. If the socket can deliver the correct amperage, why would either situation differ? Overclocking would also be only marginally different in each situation, since the HT bus is quite capable of handling higher speeds.

Among the AGTL+ design specs are glueless 4+ way SMP and dual independent buses. If current chipsets do not support SMP, it's marketing, nothing else; the bus is perfectly capable of it, as are the chipsets, if they are not crippled.
HMMMMM K!

Ah, so you do agree now ?
Have I ever disagreed about the extreme nature of the thermal output of the Prescott cores?

Show me one real-world app that gains anywhere near 7% using SSE3 compilation versus SSE2 and I'll be impressed. Show me 50 apps, and I might agree "3-7%" is an average one should expect.
The latest builds of LAME and Gordian Knot, last I checked. Why the argument on this is beyond me *shakes head*.

Sure, they might; but how about the motherboards? Signal integrity is a motherboard design issue, not a chipset issue. You just reinforced my point: if there is no 1066 FSB option today, it's most likely because the motherboards would be too hard/expensive to produce.
They make 1066 boards; what the hell is your point?

Xeon

 

juin

Distinguished
May 19, 2001
3,323
0
20,780
My dear P4man

Xeon was implying that Prescott can make better use of vectorized memory operations. So I don't get your point, if there is any.

The northbridge will always think there is only 1 CPU, as a simple log of transactions will be able to route any I/O and keep cache coherency. The MCH won't see any change.



I need to change my user name.
 

P4Man

Distinguished
Feb 6, 2004
2,305
0
19,780
>The statement was clear I just used a valid example since
>SSE,2,3 already run in bundled, as well MMX bundles, but
>whatever you argue for the point of arguing.

Let's rewind this. Vapor claimed "For one, a single Scotty core eats bandwidth like no other, imagine two." Which is correct: P4 performance does depend heavily on FSB bandwidth; compare the P4A with the P4B and C if you don't believe this. So if you add a core while keeping the same bandwidth, per-core bandwidth obviously halves in the worst case, meaning you should not expect stellar scaling going from single to dual core.

Your confused statements about "vectorized streaming code bundles" and "double precision integers" are not really a counter-argument. But show us that link, and we might understand what you are trying to get at.

>What sends and receives 4x a clock tick? Oh well, not my
>problem if you can't understand what I am saying,

Again, rewind. Replying to the same Vapor statement you said:
" bandwidth issues will be moot with considerations P4's run on a quad data rate memory subsystem, CPU1 takes rise fall of the 1/2 tick CPU2 gets the rise fall of the second 1/2 of system tick"

Which is bogus, of course. It doesn't matter how you reach 800 MT/s; whether it be single-pumped 800 MHz or octal-pumped 100 MHz, it's still 800 MT/s. Adding a second core will reduce the maximum bandwidth per core to 400 MT/s under the worst circumstances.
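The arithmetic here is easy to sanity-check. A throwaway sketch (the 64-bit bus width and 800 MT/s are the standard P4 FSB figures; the clean halving is the worst-case assumption from above):

```python
# Transfers per second depend only on (base clock x pump factor),
# not on how the pumping is achieved; a second core halves the
# worst-case per-core share of that fixed total.

def fsb_mt_per_s(base_clock_mhz, pump_factor):
    """Effective transfer rate in mega-transfers per second."""
    return base_clock_mhz * pump_factor

# Quad-pumped 200 MHz and single-pumped 800 MHz are the same 800 MT/s.
assert fsb_mt_per_s(200, 4) == fsb_mt_per_s(800, 1) == 800

bus_width_bytes = 8                                  # 64-bit front-side bus
peak_mb_s = fsb_mt_per_s(200, 4) * bus_width_bytes   # 6400 MB/s total
per_core_worst = peak_mb_s / 2                       # two cores contending

print(per_core_worst)  # worst-case MB/s available to each core
```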

>Is that not what I said specialized HT code works quite
>well on dual processor machines?

No, but it was a minor nitpick. You claimed HT-optimized code would work well on Smithfield, and I made a slight correction stating that SMP-optimized code would, since Smithfield will behave like an SMP computer, not an HT computer, and there are some minor differences between the two (like avoiding cache thrashing). But yes, HT-optimized code will most likely work pretty well too.

>Wow, you're pretty stupid. VIA was smoking Intel-based
>chipsets; that was about the time when VIA didn't have a
>"license" for the P4 FSB

You're a young snot, I presume; if you weren't, you'd have known Intel memory controllers have pretty much always led the pack when it comes to implementation. Yes, at one point VIA had a theoretically faster solution when they offered 133 MHz memory and Intel kept the BX at 100, but even then the BX was usually faster. And no, the PX400 didn't smoke the 845; they were roughly equal for a few months until Intel released the 865/875. Sorry, there is just not much historical evidence of Intel being outperformed by anyone else when it comes to memory controllers, and there is tons of evidence to the contrary. I don't see what nVidia could bring to the table in this respect, but I guess time will prove me right or wrong.

>Do you even know what you are talking about? Clock speed
>limits, sure: they come from voltage limitations, which pose
>transistor switching limitations. But in the end, all it is,
>is signal integrity due to parts of the CPU switching
>faster than the rest of the CPU

Of course not. There comes a point where thermal density or power requirements are just too great to cope with in a commercial desktop product. Prescott is pretty close to that limit already, and if we never see 5 GHz Prescotts, it's most likely because of power/heat, not because of switching speed.

>But as for a dual K8 vs. a dual Opteron I still don’t see
>where you argument is coming from,

I assume you meant K8 vs P4 here. Simple: K8 isn't nearly as power-limited as Prescott; it's signal-propagation or transistor-speed limited. Now, if you double the die, neither of these gets any worse (well, not per core, though statistics will play tricks on the dual-core CPU); power issues, however, will roughly double. Apply a tiny bit of common sense, and you'll see it's much more likely that dual-core K8s will run at speeds close to current K8 top speeds than that dual-core Prescotts will. If you can't grasp that... what can I say?
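The headroom argument can be sketched in a couple of lines. All wattages and the budget below are illustrative assumptions, not measured figures; the point is only the shape of the comparison:

```python
# Toy headroom check: a dual-core part fits the same thermal/power budget
# at full clock only if twice the per-core power stays under that budget;
# otherwise the clocks must drop. Numbers are illustrative assumptions.

def fits_budget(per_core_w, cores, budget_w):
    """True if `cores` cores at this per-core power fit the budget."""
    return per_core_w * cores <= budget_w

BUDGET_W = 130          # assumed desktop cooling/power-delivery budget

prescott_core_w = 115   # assumed: a single core already near the budget
k8_core_w = 55          # assumed: a single core well under the budget

print(fits_budget(prescott_core_w, 2, BUDGET_W))  # no headroom -> downclock
print(fits_budget(k8_core_w, 2, BUDGET_W))        # headroom -> clocks hold
```

Under these assumptions, the core that starts near the budget has to give up clock speed when doubled, while the one that starts well under it does not, which is the argument being made above.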

> if the socket can deliver the correct amperage why would
> either situation differ?

*If* the socket (and PSU, ...) can, and *if* the HSF can cope with the increased power without substantially increasing core temperature, then nothing much changes. But what if it can't? It's not like motherboard designers or HSF manufacturers had an easy time designing power circuitry for a single Prescott.

>Have I ever disagreed about the extreme nature of the
>thermal output of the Prescott cores?

I guess not, but then how hard can it be to see that 2 Prescott cores will represent an even bigger challenge, most likely imposing clock speed limits? I am even convinced the current Prescott is mostly power-limited.

>The latest builds of LAME and Gordian Knot, last I checked

Toss me a link, please. Remember: to gauge the impact of SSE3, we'd need the same app compiled with and without SSE3 support, running on the same CPU. Also, video encoding is probably the single most important app that can benefit from SSE3; you cannot expect it to be representative of an overall speedup.

>They make 1066 boards what the hell is your point?

That these boards are too expensive to be produced as mass-market products?

 

Xeon

Distinguished
Feb 21, 2004
1,304
0
19,280
Let's rewind this. Vapor claimed "For one, a single Scotty core eats bandwidth like no other, imagine two." Which is correct: P4 performance does depend heavily on FSB bandwidth; compare the P4A with the P4B and C if you don't believe this. So if you add a core while keeping the same bandwidth, per-core bandwidth obviously halves in the worst case, meaning you should not expect stellar scaling going from single to dual core.

Your confused statements about "vectorized streaming code bundles" and "double precision integers" are not really a counter-argument. But show us that link, and we might understand what you are trying to get at.
I never stated there would not be a bandwidth issue. I bloody well stated that Intel has added some features to try to make that shortcoming less of an issue.

Considering that both cores working on one thread will be the most likely situation, bandwidth issues become moot. With Xeons, both cores are busy at it, and that 533, and now 800, FSB gets gobbled up quite quickly.

As for your inability to follow what I am saying about bundled data: find it yourself. I have nothing to prove; this is a forum, hardly a point of real consequence.

Which is bogus, of course. It doesn't matter how you reach 800 MT/s; whether it be single-pumped 800 MHz or octal-pumped 100 MHz, it's still 800 MT/s. Adding a second core will reduce the maximum bandwidth per core to 400 MT/s under the worst circumstances.
Did I argue that point? NO!!! While both cores are busy with one thread, 1/2 bandwidth means jack; they could even run it 1/2-tick style, as I stated. Who knows? I was stating the most likely scenarios.

No, but it was a minor nitpick; you claimed HT-optimized code would work well on Smithfield,
What? I said "ideally". Whatever, man; learn English.

You're a young snot, I presume; if you weren't, you'd have known Intel memory controllers have pretty much always led the pack when it comes to implementation. Yes, at one point VIA had a theoretically faster solution when they offered 133 MHz memory and Intel kept the BX at 100, but even then the BX was usually faster. And no, the PX400 didn't smoke the 845; they were roughly equal for a few months until Intel released the 865/875. Sorry, there is just not much historical evidence of Intel being outperformed by anyone else when it comes to memory controllers, and there is tons of evidence to the contrary. I don't see what nVidia could bring to the table in this respect, but I guess time will prove me right or wrong.
Right....

Of course not. There comes a point where thermal density or power requirements are just too great to cope with in a commercial desktop product. Prescott is pretty close to that limit already, and if we never see 5 GHz Prescotts, it's most likely because of power/heat, not because of switching speed.
I suppose. I'm sick and tired of arguing with you.

I assume you meant K8 vs P4 here. Simple: K8 isn't nearly as power-limited as Prescott; it's signal-propagation or transistor-speed limited. Now, if you double the die, neither of these gets any worse (well, not per core, though statistics will play tricks on the dual-core CPU); power issues, however, will roughly double. Apply a tiny bit of common sense, and you'll see it's much more likely that dual-core K8s will run at speeds close to current K8 top speeds than that dual-core Prescotts will. If you can't grasp that... what can I say?
K, sure, ya, you're right, or whatever you want me to say.

*If* the socket (and PSU, ...) can, and *if* the HSF can cope with the increased power without substantially increasing core temperature, then nothing much changes. But what if it can't? It's not like motherboard designers or HSF manufacturers had an easy time designing power circuitry for a single Prescott.
You're right: "if". If I cared to argue the points anymore, or if this mattered outside the forum.

Toss me a link, please. Remember: to gauge the impact of SSE3, we'd need the same app compiled with and without SSE3 support, running on the same CPU. Also, video encoding is probably the single most important app that can benefit from SSE3; you cannot expect it to be representative of an overall speedup.
Am I your b*tch? Go out and get it yourself.

That these boards are too expensive to be produced as mass-market products?
I wouldn't know; I don't know everything.

Xeon

 

P4Man

Distinguished
Feb 6, 2004
2,305
0
19,780
>Considering that both cores working on one thread will be
>the most likely situation

Ahem... no. Two cores can't work on a single thread.
:rolleyes:

>I don’t know everything.

Agreed

 

Xeon

Distinguished
Feb 21, 2004
1,304
0
19,280
Ahem... no. Two cores can't work on a single thread.
:rolleyes:
Why? HT fools the OS into thinking there are 2 processors; both virtual processors work away on one thread. Vanderpool does that on a grand scale, allowing multiple instances of an OS.

I really don't understand what your point is to begin with. It's all OS-level thread management: 1 thread divided amongst 2 processors.

Wish you had such humility.

Xeon

 

P4Man

Distinguished
Feb 6, 2004
2,305
0
19,780
>Wish you had such humility.

I guess my way of being humble is not making firm statements about things I know nothing about, and not resorting to name-calling when I am corrected by someone.

>Why? HT fools the OS into thinking there are 2 processors,

Yes

>both virtual processors work away on one thread.

How many times do I have to say "NO!" before you believe me or look it up? HT allows the CPU to work on 2 threads more or less simultaneously (one per virtual CPU), but there is no way 2 CPUs (virtual or physical) can speed up the execution of a single thread. <b>THAT IS THE WHOLE FRIGGING PROBLEM WITH HT</b> (and SMP and dual core): most consumer code is single-threaded.
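The point can be demonstrated in a few lines. A minimal sketch (Python's `threading` is used only to illustrate scheduling, and the names are made up): each step of a single thread depends on the previous step's result, so a second CPU has nothing independent to execute; two *separate* threads, by contrast, can each be scheduled on their own core of an SMP or dual-core machine.

```python
# A single thread is a sequential dependency chain; two independent
# threads are schedulable units. Illustrative sketch only.
import threading

def dependent_chain(n):
    # Each iteration needs the previous value of `acc`, so no second
    # processor (virtual or physical) can take over part of this loop.
    acc = 0
    for i in range(n):
        acc = acc + i
    return acc

results = {}

def worker(name, n):
    results[name] = dependent_chain(n)

# Two whole threads CAN run concurrently, one per processor; one
# chain cannot be split between them.
t1 = threading.Thread(target=worker, args=("a", 1000))
t2 = threading.Thread(target=worker, args=("b", 2000))
t1.start(); t2.start()
t1.join(); t2.join()

print(results["a"], results["b"])  # 499500 1999000
```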

 

endyen

Splendid
Excuse me for interrupting, but
Why? HT fools the OS into thinking there are 2 processors; both virtual processors work away on one thread. Vanderpool does that on a grand scale, allowing multiple instances of an OS.
Do you have a clue what you are talking about?
Here's a thread: 1 + 0. So what? Are you saying that 1 virtual chip will get the 0 while the other gets the 1? Even the newest newb here knows two chips can't divide a single thread.
 

Xeon

Distinguished
Feb 21, 2004
1,304
0
19,280
I guess my way of being humble is not making firm statements about things I know nothing about, and not resorting to name-calling when I am corrected by someone.
Yeah, you should shut up; you don't know what you are talking about.

How many times do I have to say "NO!" before you believe me or look it up? HT allows the CPU to work on 2 threads more or less simultaneously (one per virtual CPU), but there is no way 2 CPUs (virtual or physical) can speed up the execution of a single thread. THAT IS THE WHOLE FRIGGING PROBLEM WITH HT (and SMP and dual core): most consumer code is single-threaded.
It's OS-level thread manipulation; it's still coming off the OS's main thread(s). Otherwise, my 8606 handling 424 threads and 37 processes would really bog that CPU down, would you not think?

If the code is specifically spun for it, it will make the necessary calls and run accordingly on alternate threads. I don't know what your argument is; this is how the technology works. It's quite OS-dependent, hence the performance hit on 2K: 2K thinks it's a real second processor, and that's not what it is, just better resource balancing.

As for the second processor not speeding up a single thread: if it's OS-level manipulation, why would the OS not be able to split up the load of the software? *Most* software, and their various internal cycles, are not system-tick dependent.

Just like SMP code is not 2 independent threads of code; it's one core thread that, if a second, third, or whatever CPU is detected, can utilize those additional resources.

Xeon

 

Era

Distinguished
Apr 27, 2001
505
0
18,980
Slicing a single thread into two or more pieces and executing those pieces simultaneously on an HT-capable CPU (single or dual core), or in SMP, is a no-go. A thread is the "smallest fragment" of written/compiled code and is not even meant to be sliced into pieces to be processed in parallel.

Think of a thread 1000 machine instructions long. Now share it among 1000 CPUs and process them in parallel.


In reply to:
--------------------------------------------------------------------------------

You're a young snot, I presume; if you weren't, you'd have known Intel memory controllers have pretty much always led the pack when it comes to implementation. Yes, at one point VIA had a theoretically faster solution when they offered 133 MHz memory and Intel kept the BX at 100, but even then the BX was usually faster. And no, the PX400 didn't smoke the 845; they were roughly equal for a few months until Intel released the 865/875. Sorry, there is just not much historical evidence of Intel being outperformed by anyone else when it comes to memory controllers, and there is tons of evidence to the contrary. I don't see what nVidia could bring to the table in this respect, but I guess time will prove me right or wrong.



--------------------------------------------------------------------------------

Right....
Right!!!
 

slvr_phoenix

Splendid
Dec 31, 2007
6,223
1
25,780
Even the newest newb here knows two chips can't divide a single thread.
Any <i>existing</i> chips, no. Of course not. That'd be silly. But what about in the fanciful dream world of "the future"?

Imagine a multi-core CPU with a processing-arbiter that, in a manner similar to pre-fetching, can adequately "guess" how registers will be utilized. It could redistribute pieces of a workload to other cores with relabelled registers. So long as the pre-processing guessed right and split the workload in a way that the registers didn't need to be shared across cores during execution, it should be possible. Any data that needs to be shared or moved after execution could then be copied over by the processing-arbiter in preparation for the next cycle.

In fact, if the processing-arbiter were double-clocked like the integer units of a P4, then it could possibly even copy/move data between the registers of the multiple cores during the half-clock, to expedite shared integer operations in CPUs of that style.

Or, going one step further: if one virtualized an entire range of registers shared by all cores, instead of splitting them into individual sets of registers separate on each core, then the copying/moving would be completely unnecessary, as all cores would have access to the other registers. It would also eliminate nearly every penalty for a pre-processing "guess" miss, other than the possible lack of 100% utilization.

The real key to anything like this ever working would, of course, be that the processing-arbiter would have to be capable of breaking down the code and recreating the same execution using different code with 100% accuracy. The processing-arbiter would have to run very efficiently, possibly even wasting the first cycle of any new execution to pre-load the code and rewrite it, so that its feed can remain one cycle ahead of the actual execution.

And to take it a step further: if the processing-arbiter were keeping track of threads and their usage of the cores, it is possible that it could not only decide when best to split a single thread and when to run multiple threads, but might even be smart enough to jam pieces of code from more than one thread into a single core for execution that cycle, like a similar (if not improved) version of Intel's HT. And if capable of that, it might even be capable of optimizing badly written code to utilize more registers and f/iops per cycle than was originally programmed, by recognizing any independent operations that could be executed this cycle instead of the next.

Of course, no such animal as this wondrous processing-arbiter exists, and I highly doubt that Intel (or AMD, or anyone else) will manage to create such a concept any time soon. (Except for maybe Sony's Cell.) But if multiple cores could be united by such a device, then there would be definite advantages of dual-core over dual-CPU. Though with the extreme utilization possible, it would also bring heat levels to their maximums quite easily. And if the technology weren't completely free of bugs, then some people's software might suddenly become bug-ridden because of the processing-arbiter and not because of their actual code.

Ah, "the future", how I long to see thee. I don't think this would be possible above 4 cores per die, however, because after that the distances between the processing-arbiter and the cores would either become asymmetrical or a lot of die space would be wasted between cores. It likely would work best with just two cores. Well... unless you went 3D and stacked cores upon cores instead of laying them side by side. :\

<pre>I just want to say <font color=red>I wuv you</font color=red>.
And I mean it fwom the <font color=red>bottom of my hawt</font color=red>.</pre><p>
 

endyen

Splendid
Imagine a multi-cored CPU with a processing-arbiter that in a manner similar to pre-fetching can adequately 'guess' how registers will be utilized.
Well that is a little strange. A thread is the equivalent of a prime number. Dividing it could only result in an incomplete answer.
On the other hand, if pre-fetch were capable of foretelling the future, the gambling establishment would get pissed. Even Intel wouldn't want to fool with those guys. Then of course concepts of self-determination would get kind of quashed.
 

zeezee

Distinguished
Jun 19, 2004
142
0
18,680
Try to make an OS run the following single-threaded code on two processors...

MOV AX,5
MOV BX,6
ADD AX,BX

Remember, both CPUs have their own AX & BX registers.

...won't work!

You either have to recompile the code to optimize for both CPUs (which is stupid, as you will probably need a third core just for this) or resort to the other poster's Futuristic Imaginary CPU, which sounds easier.
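Sketched in a high-level language instead of assembly (the "register files" here are just dicts, and the whole setup is invented for illustration), the same point looks like this: each core keeps its own private copy of AX and BX, so a write on one core is invisible to the other without explicit communication.

```python
# Toy illustration of per-core private registers: core 2 never sees the
# AX that core 1 wrote, so splitting the code naively gives 6, not 11.
import threading

def run_on_core(program, result):
    regs = {"AX": 0, "BX": 0}   # every core has its own register file
    for op, dest, val in program:
        if op == "MOV":
            regs[dest] = val
        elif op == "ADD":
            regs[dest] += regs[val]
    result.append(regs)

res1, res2 = [], []
t1 = threading.Thread(target=run_on_core, args=([("MOV", "AX", 5)], res1))
t2 = threading.Thread(target=run_on_core,
                      args=([("MOV", "BX", 6), ("ADD", "AX", "BX")], res2))
t1.start(); t2.start(); t1.join(); t2.join()
print(res1[0]["AX"], res2[0]["AX"])  # 5 6 -- not the intended 11
```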
 

P4Man

Distinguished
Feb 6, 2004
2,305
0
19,780
> I don't know what your argument is; this is how the
> technology works.

sigh..

= The views stated herein are my personal views, and not necessarily the views of my wife. =
 

zeezee

Distinguished
Jun 19, 2004
142
0
18,680
A thread is the "smallest fragment" of written/compiled code and
is not even meant to be sliced into pieces to be processed in parallel.
Let’s refine the above statement like:

… to be processed in parallel <b>by two processors</b>.

Parallel processing is nothing new and has been implemented on x86 CPUs since the early Pentiums (as far as I remember). Even single-core CPUs have multiple ALU/FPU units to process multiple instructions at once, whenever they can be run in parallel, i.e. when the second instruction doesn’t depend on the first’s result.

This doesn’t work on multiple CPUs/cores for two reasons.

One, registers can’t be shared.

Two, a CPU can look ahead only by a few instructions to identify other instructions which can be run in parallel (explanations available upon request).
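That dependency check can be shown with a tiny sketch (the `(op, dest, src)` instruction tuples are invented, not a real ISA model): two instructions can issue in the same cycle only when the second one neither reads nor writes the register the first one writes.

```python
# Toy dual-issue check: pair two instructions only if they are
# independent. Detects RAW and WAW hazards on the first instruction's
# destination register; WAR hazards are ignored to keep the sketch short.

def can_pair(first, second):
    """first/second are (op, dest, src) tuples; True if independent."""
    op1, dest1, src1 = first
    op2, dest2, src2 = second
    return dest1 not in (dest2, src2)

# "add ax, bx" then "add cx, dx": no shared registers, can run together.
print(can_pair(("add", "ax", "bx"), ("add", "cx", "dx")))  # True
# "add ax, bx" then "add cx, ax": second reads ax, so it must wait.
print(can_pair(("add", "ax", "bx"), ("add", "cx", "ax")))  # False
```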
 

Xeon

Distinguished
Feb 21, 2004
1,304
0
19,280
Finally someone puts up a valid point, I was ready to walk away not thinking anyone would mention that.

Xeon

<font color=red>Post created with being a dickhead in mind.</font color=red>
<font color=white>For all emotional and slanderous statements contact THG for all law suits.</font color=white>
 

slvr_phoenix

Splendid
Dec 31, 2007
6,223
1
25,780
Well that is a little strange. A thread is the equivalent of a prime number. Dividing it could only result in an incomplete answer.
You're right that dividing it results in incomplete answers, but the job of the processing-arbiter would be not only to divide the thread, but also to recombine those incomplete answers into a complete one.

For example, here's a generic example of assembly instructions for a single thread:
mov a, b
mov d, 0
mov c, 4
mov d, 1
mul a, 4
mov d, 2
add c, 9
mov d, 3
mul c, 2
add a, 2
mov d, 4
mul a, a
mul c, d
mov d, c
mov b, a
mov a, 1
add b, c

Now look at how that can be re-programmed by a processing-arbiter to execute across multiple CPUs:

CPU1:
mov a, b
mul a, 4
add a, 2
mul a, a

CPU2:
mov a, 4
add a, 9
mul a, 2

CPU3:
mov a, 0
mov a, 1
mov a, 2
mov a, 3
mov a, 4

processing-arbiter (handled at tail end of cycle):
mov CPU1-d, CPU3-a
mov CPU1-c, CPU2-a

CPU1:
mul c, d
mov d, c
mov b, a
mov a, 1
add b, c

And if the processor were designed to share a big pool of registers, with the processing-arbiter handing out virtualized labels so that all cores share the same registers, then the processing-arbiter wouldn't even need to do its two moves at the end of the cycle, thus shortening the pipe.

The whole concept of the processing-arbiter is to break apart instructions and reorder them so that they can be run across multiple CPUs without changing the end results. It's in effect still a single thread, just distributed across multiple cores so that the execution units of all cores are utilized as fully as possible.

And in fact the processing-arbiter could possibly even be designed to do things like identify the wasted processing done in things like:
mov a, 0
mov a, 1
mov a, 2
mov a, 3
mov a, 4

And shrink it down into:
mov a, 4

If a multi-core processor could have a processing-arbiter that does this, then it would help bridge the gap between single-threaded and multi-threaded apps, so that multi-core systems wouldn't sit as idle when running heavy single threads. Unfortunately, I don't think it could ever be developed effectively for multiple-CPU systems, as the communication between CPUs would just be too slow, and the processing-arbiter would also have to deal with a lot more synchronization issues than it would in a multi-cored CPU.
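A toy sketch of the arbiter idea described above (purely illustrative, with an invented `(op, dest, src)` tuple format): group a single thread's instructions into independent chains by the registers they touch, so each chain could in principle go to its own core, and drop redundant stores like the `mov a, 0..4` run. Chains that later merge (like `mul c, d` in the example) are left to the arbiter's end-of-cycle moves and are not handled here.

```python
# Split a linear instruction stream into independent dependency chains,
# one chain per disjoint set of registers, then eliminate dead stores.

def split_into_chains(program):
    chains = {}   # representative register -> its chain of instructions
    owner = {}    # register -> representative register of its chain
    for op, dest, src in program:
        regs = [r for r in (dest, src) if isinstance(r, str)]
        rep = next((owner[r] for r in regs if r in owner), None)
        if rep is None:          # no existing chain touches these registers
            rep = dest
            chains[rep] = []
        chains[rep].append((op, dest, src))
        for r in regs:
            owner[r] = rep
    return chains

def drop_dead_stores(chain):
    # a mov immediately overwritten by another mov to the same register
    # did no useful work, so it can be dropped (the "mov a, 0..4" case)
    out = []
    for ins in chain:
        if out and ins[0] == "mov" and out[-1][0] == "mov" and out[-1][1] == ins[1]:
            out.pop()
        out.append(ins)
    return out

program = [
    ("mov", "a", "b"), ("mov", "d", 0), ("mov", "c", 4),
    ("mul", "a", 4),   ("add", "c", 9), ("mov", "d", 4),
    ("add", "a", 2),   ("mul", "c", 2), ("mul", "a", "a"),
]
chains = split_into_chains(program)   # three chains: a/b, d, and c
for rep, chain in chains.items():
    print(rep, drop_dead_stores(chain))
```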

<pre>I just want to say <font color=red>I wuv you</font color=red>.
And I mean it fwom the <font color=red>bottom of my hawt</font color=red>.</pre><p>
 

zeezee

Distinguished
Jun 19, 2004
142
0
18,680
I don’t see any added value in a code-arbiter (I think that’s what you call it) chip in the way you demonstrate it.

It sounds funny when you first come up with lines of redundant and meaningless code, then shrink it down to a meaningful level, and then claim that it’s the virtue of a revolutionary (and imaginary) chip.

Try to decorate your code with a few branch instructions and let’s see what happens…
 

juin

Distinguished
May 19, 2001
3,323
0
20,780
Intel has a patent on micro-threads; it allows the CPU to break a thread into multiple very small threads to run on an SMP CPU.

The paper I have read had an SMT CPU in mind, not a dual core. Also, Alpha was working on this. It would allow the CPU to have very wide stage lanes, 8 or 16 ALUs and 4 to 8 FPUs, allowing maximum flexibility.

I need to change my user name.
 

slvr_phoenix

Splendid
Dec 31, 2007
6,223
1
25,780
Try to decorate your code with a few branch instructions and let’s see what happens…
What happens is that it becomes even more powerful, because then, like EPIC, it can start processing several branches simultaneously before it even knows which branch to take. Then it just ditches the unused branches when it knows which one to go with.
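That "eager execution" idea can be sketched like so (names and setup are invented for illustration, not how any real chip does it): start both sides of the branch before the condition is known, then keep only the taken side's result.

```python
# Toy eager execution: speculate down both paths of a branch while the
# condition resolves, then discard whichever path was not taken.
from concurrent.futures import ThreadPoolExecutor

def eager_branch(condition_fn, if_taken, if_not_taken):
    with ThreadPoolExecutor(max_workers=2) as pool:
        taken = pool.submit(if_taken)          # both paths start running
        not_taken = pool.submit(if_not_taken)  # before the condition is known
        result = taken.result() if condition_fn() else not_taken.result()
    return result  # the untaken side's work is simply thrown away

print(eager_branch(lambda: 3 > 2, lambda: "fast path", lambda: "slow path"))
# → fast path
```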

<pre>I just want to say <font color=red>I wuv you</font color=red>.
And I mean it fwom the <font color=red>bottom of my hawt</font color=red>.</pre><p>