Can AMD salvage QFX with an in-house chipset?

A chipset is not going to make this turkey fly.....

Baron, it is a failure of the architecture.... the cHT and NUMA arrangement was never meant for the desktop.

Perhaps when quad core comes out and if they can cut the latency by 80% between the two CPUs, then you will see something. Until then, anytime you strap in a second CPU it will hurt performance.

No wonder AMD didn't buy Nvidia. How dare they mess up the launch of the "performance at any cost" desktop. Oops! By "at any cost," I hope they didn't mean losing loyal AMD fanboys.
 
Well, the QFX has been released, and while certain scenarios show incredible promise, other areas are saddled with too much baggage.

The fact that a dual Opteron can be outfitted with SLI in a workstation and offer good performance without these power levels implies that the total package could have been done better.

The Opteron 285 runs at 2.6GHz, and this graph shows that without the additional SLI power the AMD system draws 322W and the dual 5160 draws 267W (full load).

Looking at the various articles around the web, it seems as though only Anand managed to actually find suitable tasks that were reasonable for multi-tasking. In his case he used Blu-ray movies, which totally killed all the dual-core systems.

His power numbers were also at least 100W lower than other test sites'. He turned on CnQ and got the idle power down to within 4W of the C2Q system. Of course this didn't dent the 456W the system drew at "full load" with an 8800GTX (which I believe draws 225W+).

And these are FX-74 numbers. The FX-70 is shown to use even less, so I believe OEMs can get reasonable workstation power levels out of it in the next few months, especially if AMD releases a new rev (they sorely need to drop power by at least 10%, perhaps more, and the lessons learned can help get Agena down below the reported 125W).

Hexus is also reporting that they can show a defect in the NUMA implementation of the Asus board BIOS (this post was going to be called "Did Asus and nVidia drop the QFX ball"), which may be why games are suffering so much from latency problems.

One review (most are posted at AMDZone) stated that AMD is reporting that node interleave mode will need to be turned off for Vista and on for XP.

I believe AMD reported that they would release their own branded chipset, and hopefully it will be less power hungry than the 680a, which is reported to use more power than even the 975X. Having two of them surely doesn't help. nVidia does have a two-socket SLI chipset in the 3600, and ASUS' implementation is only $300. Even the $400 Asus 680i for Intel implements fewer PCIe lanes and has smaller power reserves.

Because the 7950GT and the probably forthcoming 8950GT only require two slots for Quad SLI. I can see the need for four low-end GPUs for certain content creators, but even three PCIe slots can't really be used right now, as no "Havok"-type apps or cards have been released, except for the server (AMD Stream).

Only time will tell if AMD had planned to create an entire reference system based on an Ati chipset while allowing nVidia to be the launch partner.

But the real judgement is that only expert builders will make QFX something that isn't too loud or hot, and while Vista X64 may do wonders for it in multithreaded apps, it is not yet ready for prime time.

Let's go AMD! Show your true potential.
The NUMA support in Vista is supposed to be better than in Windows XP, so Vista should give better performance. For NUMA to work well in this two-node setup, node-to-node traffic should be minimized. I know that in multi-node systems it can sometimes be an advantage to use node interleave mode, such as in databases. The problem with Quad FX on the hardware level is the HT link. I do not think HT 2.0 is adequate, and it would take a FULL HT 3.0 link to do the job, viz. a 2.6GHz link that is 32 bits wide. If you look at some of the memory benchmark figures you will see that they are very low, probably caused by excessive cross-node traffic.
 
Intel shows better scaling in the 'VRAD Map Compilation' benchmark, but AMD shows better scaling in the 'Particle Systems Test' benchmark. Hardly an advantage to either camp in terms of pure scaling.

However, regardless of scaling, it is clear that Intel is significantly faster than AMD in both tests.

The only reason AMD appears to scale better in the graphs is that you are comparing the scaling of a 2.8GHz FX-62 against 3GHz FX-74s, whereas with Intel you are comparing a 2.93GHz X6800 against a 2.66GHz QX6700.

Notice how I calculated scaling by comparing at the same clockspeeds?

Paints a whole different picture, doesn't it Baron?

Myth debunked! Nothing to see here people, just AMD fanboy dribble. Move along now...


Let's look at the numbers (all numbers according to Windows XP Calculator):

For the particle systems tests

5200+ -- 26
FX 62 -- 28

FX70 -- 55
FX72 -- 58

Scaling:
2.6GHz - 111%
2.8GHz - 107%
no 3GHz numbers



Core 2

E6700 -- 44
C2Q 6700 -- 85


Scaling:
2.66GHz - 97%
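
For what it's worth, here's a quick sanity check of those percentages - a minimal sketch of my own (not from any review) that assumes "scaling" means the quad setup's gain over the matching dual-core score at the same clock. It reproduces the 111%/107% AMD figures; with the scores as listed, the Core 2 pair works out closer to 93%, so the 97% above presumably comes from slightly different chart readings.

# Sketch: "scaling" = percentage gain of the quad-core setup's score over
# the dual-core score at the same clock, using the numbers listed above.
pairs = {
    "AMD 2.6GHz (5200+ -> FX-70)":     (26, 55),
    "AMD 2.8GHz (FX-62 -> FX-72)":     (28, 58),
    "Intel 2.66GHz (E6700 -> QX6700)": (44, 85),
}
for name, (dual, quad) in pairs.items():
    gain = (quad - dual) / dual * 100
    print(f"{name}: {gain:.1f}% scaling")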


This shows greater scaling though clockspeed seems to have a negative effect.
It also shows a reversal from single- or dual-threaded games, but Carmack is busy creating a multi-core engine along with Valve and others, so QFX has a bright 2007 future. Optimizations

MAY EVEN (disclaimer due to negative reaction)

increase performance in current single/dual-threaded games (though keeping as much data as possible in the right banks will help a lot).
 
The NUMA support in Vista is supposed to be better than in Windows XP, so Vista should give better performance. For NUMA to work well in this two-node setup, node-to-node traffic should be minimized. I know that in multi-node systems it can sometimes be an advantage to use node interleave mode, such as in databases. The problem with Quad FX on the hardware level is the HT link. I do not think HT 2.0 is adequate, and it would take a FULL HT 3.0 link to do the job, viz. a 2.6GHz link that is 32 bits wide. If you look at some of the memory benchmark figures you will see that they are very low, probably caused by excessive cross-node traffic.

You amplify my point. As I explained several times, 2GB per socket should allow all game data to be loaded to one set of RAM. I was hoping for an 8GB max to allow 4GB per socket but....

It has nothing to do with HT. It has to do with load/swap in Windows, along with pre-scheduling to handle cases where apps need to call OS/Win32 functions. Vista has totally improved this code with an emphasis on multi-core functionality.

I'm sure that a side-by-side comparison will show much more "fragmentation" in XP/2003 than in Vista/Longhorn. Latency tests (Hexus) show that a 10% improvement would boost QFX above the FX-62 because of the available bandwidth: with CPU 0/1 having its own path to memory and all of a game's data (up to 2GB, with swapping) kept on CPU 2/3, watch out - the 5% I predicted will happen.

To Dirk, you need a driver that enables quick OS returns from CPU 0/1.
 
The NUMA support in Vista is supposed to be better than in Windows XP, so Vista should give better performance. For NUMA to work well in this two-node setup, node-to-node traffic should be minimized. I know that in multi-node systems it can sometimes be an advantage to use node interleave mode, such as in databases. The problem with Quad FX on the hardware level is the HT link. I do not think HT 2.0 is adequate, and it would take a FULL HT 3.0 link to do the job, viz. a 2.6GHz link that is 32 bits wide. If you look at some of the memory benchmark figures you will see that they are very low, probably caused by excessive cross-node traffic.

You amplify my point. As I explained several times, 2GB per socket should allow all game data to be loaded to one set of RAM. I was hoping for an 8GB max to allow 4GB per socket but....

It has nothing to do with HT. It has to do with load/swap in Windows, along with pre-scheduling to handle cases where apps need to call OS/Win32 functions. Vista has totally improved this code with an emphasis on multi-core functionality.

I'm sure that a side-by-side comparison will show much more "fragmentation" in XP/2003 than in Vista/Longhorn. Latency tests (Hexus) show that a 10% improvement would boost QFX above the FX-62 because of the available bandwidth: with CPU 0/1 having its own path to memory and all of a game's data (up to 2GB, with swapping) kept on CPU 2/3, watch out - the 5% I predicted will happen.

To Dirk, you need a driver that enables quick OS returns from CPU 0/1.
If a core accesses its local memory then it has a bandwidth of 12.8GB/s (using DDR2-800). If it accesses the memory of another node, then this transfer has to take place across the HT 2.0 link, which is only 4GB/s (8GB/s aggregate). So as well as the added latency associated with the node hop, the transfer would take LONGER than it would from local memory - this is why the HT link bandwidth is important.
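
For anyone wondering where those figures come from, here is a rough back-of-the-envelope calculation (it assumes dual-channel DDR2-800 per socket and a 16-bit, 1GHz, double-pumped HT 2.0 link, which is my reading of the platform, not something stated in this thread):

# Local memory: 2 channels x 8 bytes x 800 MT/s
ddr2_800_local = 2 * 8 * 800e6 / 1e9          # ~12.8 GB/s
# HT 2.0 link: 16 bits (2 bytes) x 1 GHz x 2 (double data rate), per direction
ht20_per_dir = 2 * 1e9 * 2 / 1e9              # ~4.0 GB/s per direction
print(f"local memory: {ddr2_800_local:.1f} GB/s")
print(f"HT 2.0 link : {ht20_per_dir:.1f} GB/s per direction, {2 * ht20_per_dir:.1f} GB/s aggregate")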
 
(...) In the best-case scenario for QFX - RAM-wise - you have 4GB of RAM, which even Vista's process load can't overwhelm, so all of the OS processes AND data can be loaded to CPU 0 and CPU 1.

If the current request allocates more RAM than is available in the first set, the process is loaded into the second set with CPU 2 and CPU 3. Now if swapping is required because the process "over-fills" the second set, it should maintain a contiguous line by not reloading data to CPU 0/1's RAM banks for the process that caused the overfill.

In this scenario, all game threads (single, dual or multi) should remain on CPU 2/3 with their data. It may require a patch on AMD's part to assure this (like the dual-core patch for sync), but this scenario should allow the two cores to fully use all of their potential, bandwidth- and processing-wise.

For cases where OS code needs to be called, it should just be passing the necessary data and receiving a return from CPU 0/1.
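
On Vista and later this kind of placement could, in principle, be forced by hand, since the Win32 API exposes node-aware allocation (VirtualAllocExNuma is a Vista-era call). A rough sketch only - the affinity mask, node number, and allocation size below are purely illustrative choices of mine, not an AMD or Microsoft recommendation:

import ctypes
from ctypes import wintypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)

MEM_RESERVE = 0x2000
MEM_COMMIT = 0x1000
PAGE_READWRITE = 0x04

kernel32.GetCurrentProcess.restype = wintypes.HANDLE
kernel32.GetCurrentThread.restype = wintypes.HANDLE
kernel32.SetThreadAffinityMask.restype = ctypes.c_size_t
kernel32.SetThreadAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
kernel32.VirtualAllocExNuma.restype = ctypes.c_void_p
kernel32.VirtualAllocExNuma.argtypes = [
    wintypes.HANDLE, ctypes.c_void_p, ctypes.c_size_t,
    wintypes.DWORD, wintypes.DWORD, wintypes.DWORD,
]

# Pin the calling (game) thread to cores 2 and 3 (the second socket) ...
kernel32.SetThreadAffinityMask(kernel32.GetCurrentThread(), 0b1100)

# ... and commit 256 MB with NUMA node 1 as the preferred node, so the
# working set stays local to the socket that runs the game threads.
buf = kernel32.VirtualAllocExNuma(
    kernel32.GetCurrentProcess(), None, 256 * 1024 * 1024,
    MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, 1,
)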

I expect big things with a refined implementation. Even you realized the defects in the first rev of C2D.

Well, that'd be stretching the pass a bit, wouldn't it? My guess would be that 1MB of L2 cache per core wouldn't help much, would it? Does AMD use 'NUCA', cache-wise?
http://portal.acm.org/citation.cfm?id=1077690


Cheers!
 
Only time will tell if AMD had planned to create an entire reference system based on an Ati chipset while allowing nVidia to be the launch partner.
I still hope to see an AMD/ATI 4x4 motherboard later on. You should wait until that happens instead of rushing to buy the current AMD/nVidia offering for Xmas. You already have two good dual-core systems that should be more than enough to tide you over until then.

If you read my posts, my Xmas plan was based on DX10 cards being reasonably priced (I won't be buying a DX9 card). In my mind they aren't, even as an addition to a whole system.

Vista x64 Ultimate won't be available until Feb 01/07, so unless I get Vista Business now I won't be buying anyway. I would be buying this for more multithreaded games - 107% min is worth my money - but more so for business apps that are already multithreaded. Doom 4/Quake 5 WILL be multi-threaded, and the Valve tests show my investment will have a return - regardless of C2D or C2Q.
 
I write this as a former owner of an Opteron 270 setup (2 x dual-core at 2.0 GHz per core, 1 MB L2 cache per core, ccNUMA using 2 nodes of dual-channel Registered ECC DDR1-400).

The Opteron 200/800 series did not support Cool'n'Quiet as Registered DIMMs need a constant flow of energy to work and the memory controller is integrated (has pros and cons).

Since these new CPUs likely support Cool'n'Quiet, I'd recommend just disabling C'n'Q to get the stuttering to stop.

In addition to this add /usepmtimer to C:\boot.ini
http://search.microsoft.com/results.aspx?q=boot.ini+%2Fusepmtimer&l=1&mkt=en-US&FORM=QBME1

ccNUMA really wasn't designed for games; it will likely stutter a fair bit if it is used and /usepmtimer is not used at the same time.

You can run ccNUMA enabled or disabled in Windows XP, XP x64 Edition, 2003 Server (x64), and Vista (and Linux, etc. too). It depends on what gives you the best performance.

"Node Interleaving" is disabling ccNUMA: instead of having the OS control it, the 'chipset' controls it. It will stutter less, but memory performance will not aggregate as well as it does with ccNUMA. (Latency vs. throughput argument.)



The idea that ccNUMA vs. node interleaving is for one OS and not the others is total garbage. All Windows kernels based on NT 5.1 will support it; it's just that games may not scale well (stuttering) if ccNUMA is used.

Vista will not improve ccNUMA performance in gaming. The technology is years old, well established, and cannot be improved very much at all (from a latency perspective within an OS's kernel).

Just google "NUMA" and "ccNUMA"; it is an idea perhaps a decade old now.

http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access



VALVE'S SMP TESTS SHOW GREATER THAN 100% SCALING!!

If a core accesses its local memory then it has a bandwidth of 12.8GB/s (using DDR2-800). If it accesses the memory of another node, then this transfer has to take place across the HT 2.0 link, which is only 4GB/s (8GB/s aggregate). So as well as the added latency associated with the node hop, the transfer would take LONGER than it would from local memory - this is why the HT link bandwidth is important.

Bear in mind that one socket's cores can benefit from the L1/L2 cache in the other socket's cores, and then, if required, access its memory.

ccNUMA is designed to aggregate throughput, not combat latency. For a ccNUMA gaming system /usepmtimer must be added to C:\boot.ini - I cannot stress this enough. Perhaps some other IT websites need to redo their tests?


For anyone considering Quad-FX I strongly recommend they look to Opteron 2000 series instead.


PS: Sorry - This post is a bit all over the place, but meh, it'll do.
 
Well, the QFX has been released, and while certain scenarios show incredible promise, other areas are saddled with too much baggage.

The fact that a dual Opteron can be outfitted with SLI in a workstation and offer good performance without these power levels implies that the total package could have been done better.

The Opteron 285 runs at 2.6GHz, and this graph shows that without the additional SLI power the AMD system draws 322W and the dual 5160 draws 267W (full load).

Looking at the various articles around the web, it seems as though only Anand managed to actually find suitable tasks that were reasonable for multi-tasking. In his case he used Blu-ray movies, which totally killed all the dual-core systems.

His power numbers were also at least 100W lower than other test sites'. He turned on CnQ and got the idle power down to within 4W of the C2Q system. Of course this didn't dent the 456W the system drew at "full load" with an 8800GTX (which I believe draws 225W+).

And these are FX-74 numbers. The FX-70 is shown to use even less, so I believe OEMs can get reasonable workstation power levels out of it in the next few months, especially if AMD releases a new rev (they sorely need to drop power by at least 10%, perhaps more, and the lessons learned can help get Agena down below the reported 125W).

Hexus is also reporting that they can show a defect in the NUMA implementation of the Asus board BIOS (this post was going to be called "Did Asus and nVidia drop the QFX ball"), which may be why games are suffering so much from latency problems.

One review (most are posted at AMDZone) stated that AMD is reporting that node interleave mode will need to be turned off for Vista and on for XP.

I believe AMD reported that they would release their own branded chipset, and hopefully it will be less power hungry than the 680a, which is reported to use more power than even the 975X. Having two of them surely doesn't help. nVidia does have a two-socket SLI chipset in the 3600, and ASUS' implementation is only $300. Even the $400 Asus 680i for Intel implements fewer PCIe lanes and has smaller power reserves.

Because the 7950GT and the probably forthcoming 8950GT only require two slots for Quad SLI. I can see the need for four low-end GPUs for certain content creators, but even three PCIe slots can't really be used right now, as no "Havok"-type apps or cards have been released, except for the server (AMD Stream).

Only time will tell if AMD had planned to create an entire reference system based on an Ati chipset while allowing nVidia to be the launch partner.

But the real judgement is that only expert builders will make QFX something that isn't too loud or hot, and while Vista X64 may do wonders for it in multithreaded apps, it is not yet ready for prime time.

Let's go AMD! Show your true potential.

So Baron, how do you explain the comparison - double the cost, double the heat, custom boards, double the power consumption for a mere -20% (yes, less) performance versus Intel's Q6700, which runs at lower frequencies and is drop-in compatible with current 965/975 mobos? And the fact that Intel's off-die memory controller and old FSB design can take on AMD's latest? Go figure.

I wonder if any manufacturer other than ASUS is game enough to make such a board for it...
 
I write this as a former owner of an Opteron 270 setup (2 x dual-core at 2.0 GHz per core, 1 MB L2 cache per core, ccNUMA using 2 nodes of dual-channel Registered ECC DDR1-400).

The Opteron 200/800 series did not support Cool'n'Quiet as Registered DIMMs need a constant flow of energy to work and the memory controller is integrated (has pros and cons).

Since these new CPUs likely support Cool'n'Quiet, I'd recommend just disabling C'n'Q to get the stuttering to stop.

In addition to this add /usepmtimer to C:\boot.ini
http://search.microsoft.com/results.aspx?q=boot.ini+%2Fusepmtimer&l=1&mkt=en-US&FORM=QBME1

ccNUMA really wasn't designed for games; it will likely stutter a fair bit if it is used and /usepmtimer is not used at the same time.

You can run ccNUMA enabled or disabled in Windows XP, XP x64 Edition, 2003 Server (x64), and Vista (and Linux, etc. too). It depends on what gives you the best performance.

"Node Interleaving" is disabling ccNUMA: instead of having the OS control it, the 'chipset' controls it. It will stutter less, but memory performance will not aggregate as well as it does with ccNUMA. (Latency vs. throughput argument.)



The idea that ccNUMA vs. node interleaving is for one OS and not the others is total garbage. All Windows kernels based on NT 5.1 will support it; it's just that games may not scale well (stuttering) if ccNUMA is used.

Vista will not improve ccNUMA performance in gaming. The technology is years old, well established, and cannot be improved very much at all (from a latency perspective within an OS's kernel).

Just google "NUMA" and "ccNUMA"; it is an idea perhaps a decade old now.

http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access




VALVE'S SMP TESTS SHOW GREATER THAN 100% SCALING!!

If a core accesses its local memory then it has a bandwidth of 12.8GB/s (using DDR2-800). If it accesses the memory of another node, then this transfer has to take place across the HT 2.0 link, which is only 4GB/s (8GB/s aggregate). So as well as the added latency associated with the node hop, the transfer would take LONGER than it would from local memory - this is why the HT link bandwidth is important.

Bear in mind that one socket's cores can benefit from the L1/L2 cache in the other socket's cores, and then, if required, access its memory.

ccNUMA is designed to aggregate throughput, not combat latency. For a ccNUMA gaming system /usepmtimer must be added to C:\boot.ini - I cannot stress this enough. Perhaps some other IT websites need to redo their tests?


For anyone considering Quad-FX I strongly recommend they look to Opteron 2000 series instead.


PS: Sorry - This post is a bit all over the place, but meh, it'll do.

ccNUMA was designed for high-bandwidth apps, such that app data is loaded nearest the "process" CPU. When implemented properly it only requires fitting more data into those closest banks.

For example, if an app uses 1.9GB peak, then swapping needs to happen for the process, and if data is only swapped back and forth to the same banks, the "fragmentation" won't cause latency problems with cross-CPU accesses.

The fact that you saw the tests with the Valve engine means that you realize that QFX is not designed to take advantage of the current "limited-threaded" game landscape but the future multi-core-optimized landscape. Vista x64, by default, will automatically have optimizations in x64 (similar to KernelGuard) that will boost performance for multi-core systems.

Of course it will be up to drivers to optimize both transfers and load/swap, but it is not an insurmountable task given the general-purpose nature of AMD64's instruction set.

HT's transfer rate is ONLY a factor when the "working set" is greater than the available bank space AND the total RAM usage is greater than system RAM.

Effective optimizations created SSE and, to some extent, AMD64. QFX is alive and well, though it's suffering from poor optimization. It is almost reminiscent of Core 2's initial hiccups, which caused a MANDATORY replacement rev, as irrelevant as that may be.
 
So Baron, how do you explain the comparison - double the cost, double the heat, custom boards, double the power consumption for a mere -20% (yes, less) performance versus Intel's Q6700, which runs at lower frequencies and is drop-in compatible with current 965/975 mobos? And the fact that Intel's off-die memory controller and old FSB design can take on AMD's latest? Go figure.

I wonder if any manufacturer other than ASUS is game enough to make such a board for it...

It's very simple. I'm comparing an FX-70 with an optimized implementation against my current AMD system and DX10 multi-threaded games.

I have never said that all of a sudden K8 would be faster core for core than C2. It has shown its great potential with future tech such as Blu-ray (AnandTech) and with the Valve tests.

As far as other manufacturers making boards, I have already commented that AMD will need to produce a reference platform that improves upon the current implementation to get more OEMs on board (The Inq, I believe, went so far as to say that OEMs didn't want to use AM2).

The ripple effect could even cause a shift in the number of sockets. AMD should have some kind of AgenaFX demo by March or so that will show the added potential of the enhanced architecture.

I still believe in the concept and I hope that AMD does too so they will put money into getting the power down.
 
Baron is stumped; ideals meet reality in a fiery blur.

I wonder about a Baron/Casewhite link; the information is pretty similar. However, it's just not working for AMD yet. How do you compensate for a 27% rise in memory latency?

AMD does great on long-winded equations, but it's sooo different from the consumer desktop. You can't have HPC computing on a PC; the end result is slower app processing.

I think NUMA could be AMD's NetBurst; they should rethink that. You can't bring HPC to the desktop and rule, I don't care how many petaflops you have on your record.

If that's what 4x4 is about, AMD just lost until they ditch that notion.
While HPC designers refer to desktop benches as superficial, the whole array of desktop computing applications follows this superficial trend.
I doubt a desktop will do the depth of computing that an HPC system does for a long time.


What do you mean? I have never expected QFX to overtake C2's per-clock advantage, but I did expect it to narrow the gap between the two, which it does.

QFX is not about NUMA's ability to transfer between cores. It's about dual sockets that use the per-socket memory efficiently.

Let's not get back into the numbers game of process load/swap. Maybe Jack can enlighten you as to how 90nm 2.6GHz Opterons are running at 55W to allay your fears. Though that's not to say that AMD will try to lower per-socket power with an F4 rev.
 
Almost all Win32 processes are limited to 2 GB of address space; the exception is applications designed to take advantage of /3GB and similar switches in BOOT.INI. Games do not fall into this category.

Search for /3GB on www.microsoft.com for examples (There are a few other similar methods used too, but games do not use any of them as games target consumers with 'normal' PCs). 8)
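
For reference, a hypothetical boot.ini entry using that switch might look like the line below (the application also has to be linked as large-address-aware to actually see anything above 2 GB; this is just an illustration, not a recommendation for Quad FX):

[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /noexecute=optin /fastdetect /3GB /USERVA=2900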

All allocated memory in Win32 / Win x64 is virtual by nature, and managed by, well... the memory manager.

Virtual simply means the physical and virtual addresses do not match up; it may also imply swapping or paging to disk (or to other RAM - no joke).

The problem with ccNUMA is that one process may use 2 nodes, but it may allocate 400 MB to one node and 800 MB to the other.

SANDRA has a test that demonstrates this very well, since the early Opteron days.

As for Vista having "significantly better multi-processor performance" I have yet to see anything that indicates this is even remotely true. In fact most data points to the exact opposite.

The Windows kernel has not changed much since NT 4.0, NT 5.0 (Windows 2000), NT 5.1 (Windows XP, 2003 Server), etc. They all support 32 CPU cores, and beyond NT 5.1 (the Win2K3 Server x64 kernel) there isn't really all that much to improve.
However, in the consumer variants they limit the processor core count to below 32; the kernel can still do it, it has just been given an artificially low ceiling.

Do consider that NT 4.0 was available for DEC Alpha processors (Compaq bought out DEC, then became HP/Compaq); this was around a decade ago. The same goes for SGI MIPS processors. These systems used (cc)NUMA years before AMD was selling it to you, and it hasn't changed all that much. (There is not much left to 'optimize' in the Vista kernel to get more performance from ccNUMA systems; Vista is mostly a fresh paint job, GUI-wise.)

I think you may have ccNUMA and swapping/paging confused with something else (in terms of how it works on a hardware + low-level software basis).


As for the /usepmtimer usage:
[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /noexecute=optin /fastdetect /usepmtimer

// (Do not wrap the line)


:? I'm not exactly sure what you think ccNUMA is, or how it works, but it doesn't seem to fall in line with the industry-accepted definitions and platform descriptions of NUMA and/or ccNUMA systems (in terms of how they work at a hardware/kernel level). 8O
 
That is pretty much right...

Baron, the low-power parts AMD produces are actually binning out higher; they are partitioning their best bin, undervolting and downclocking to hit power.

P = C*V^2*F, so by undervolting and underclocking one can achieve a roughly cubic reduction in power. Hence a 10% reduction in voltage and frequency gives roughly a 27% reduction in power (it falls to 0.9^3, about 73% of the original).

As soon as I get a chance I will find your quotes about 100nm spacing being the reason they got power down, along with your assertion that it would damage yields.
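
A quick sanity check on that cube (this ignores static/leakage power, which does not scale the same way):

# Dynamic power: P ~ C * V^2 * F.  Drop voltage and frequency by 10% each.
scale = 0.9**2 * 0.9          # V^2 term times F term = 0.9^3
print(f"power falls to {scale:.3f} of the original (~{(1 - scale) * 100:.0f}% reduction)")
# -> power falls to 0.729 of the original (~27% reduction)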
 
Almost all Win32 processes are limited to 2 GB of address space; the exception is applications designed to take advantage of /3GB and similar switches in BOOT.INI. Games do not fall into this category.

Search for /3GB on www.microsoft.com for examples (There are a few other similar methods used too, but games do not use any of them as games target consumers with 'normal' PCs). 8)

All allocated memory in Win32 / Win x64 is virtual by nature, and managed by, well... the memory manager.

Virtual simply means the physical and virtual addresses do not match up; it may also imply swapping or paging to disk (or to other RAM - no joke).

The problem with ccNUMA is that one process may use 2 nodes, but it may allocate 400 MB to one node and 800 MB to the other.

SANDRA has a test that demonstrates this very well, since the early Opteron days.

As for Vista having "significantly better multi-processor performance" I have yet to see anything that indicates this is even remotely true. In fact most data points to the exact opposite.

The Windows kernel has not changed much since NT 4.0, NT 5.0 (Windows 2000), NT 5.1 (Windows XP, 2003 Server), etc. They all support 32 CPU cores, and beyond NT 5.1 (the Win2K3 Server x64 kernel) there isn't really all that much to improve.
However, in the consumer variants they limit the processor core count to below 32; the kernel can still do it, it has just been given an artificially low ceiling.

Do consider that NT 4.0 was available for DEC Alpha processors (Compaq bought out DEC, then became HP/Compaq); this was around a decade ago. The same goes for SGI MIPS processors. These systems used (cc)NUMA years before AMD was selling it to you, and it hasn't changed all that much. (There is not much left to 'optimize' in the Vista kernel to get more performance from ccNUMA systems; Vista is mostly a fresh paint job, GUI-wise.)

I think you may have ccNUMA and swapping/paging confused with something else (in terms of how it works on a hardware + low-level software basis).


As for the /usepmtimer usage:
[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /noexecute=optin /fastdetect /usepmtimer

// (Do not wrap the line)


:? I'm not exactly sure what you think ccNUMA is, or how it works, but it doesn't seem to fall in line with the industry-accepted definitions and platform descriptions of NUMA and/or ccNUMA systems (in terms of how they work at a hardware/kernel level). 8O

You know, it's funny that I really didn't read your whole post - which is moot, btw, because I tested the first ACPI Alpha on Win2000 - and it was IMMEASURABLY FASTER than any other chip even close to the same clock speed.

C2Q also uses the same subsystem but excels because of its architecture/implementation. My whole point has been load/swap and OS scheduling. I have said numerous times that I am not thinking of this mainly for games, though the above-60fps performance AND severe multi-threading/tasking capabilities make it a worthy upgrade to prep for 8 cores.

The funny thing is that for my needs I could go with a 5200+(2.8GHz = 11x260), 8800GTS SLI, with 8GB RAM, and then Agena.

I want two sockets. It has shown the promise I wanted (EXCEPT IN THE POWER AREA - though FX70 is tamer with careful construction).
 
That is pretty much right...

Baron, the low-power parts AMD produces are actually binning out higher; they are partitioning their best bin, undervolting and downclocking to hit power.

P = C*V^2*F, so by undervolting and underclocking one can achieve a roughly cubic reduction in power. Hence a 10% reduction in voltage and frequency gives roughly a 27% reduction in power (it falls to 0.9^3, about 73% of the original).

As soon as I get a chance I will find your quotes about 100nm spacing being the reason they got power down, along with your assertion that it would damage yields.

Baron, I will save you the trouble.

The third part of that equation above, capacitance, is a harder one to manipulate, but it is certainly one variable that can be changed.

The problem with the C in the equation above is that when you target the poly line width (which is what you are referring to), the yield immediately drops off. In fact, this is how AMD and Intel squeeze out the enthusiast-level parts.

They are called bouquet lots: special lots that are tagged through the line. At key points down the line, certain process steps are intentionally dialed in to a particular value - a value that is great for transistor performance but horrid for actually getting good yield. So, in order to get the total dollars per wafer pass to be the same, they need to charge very high prices for these parts.... I would not be surprised if AMD is losing money on their lower-end FX-7x line.

Anyway, back to the physics. Capacitance is defined as dQ/dV, but physically it is equal to k*eps0*A/D, where A is the area of the parallel charge planes, k is the dielectric constant, and D is the distance between those planes. To get power down (or, for high-end parts, to give you more room to scale frequency up), you need to make C smaller; since k is a constant and D affects short-channel effects, it is A that you make smaller. Thus the poly lines that define the area of the gate over the channel are made narrower - this is called a critical dimension, and it can be adjusted at the lithography step when patterning the poly gate electrode lines.
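
To put a rough number on that, here is a parallel-plate toy model (the dielectric constant, oxide thickness, and gate geometry below are purely illustrative values I picked, not real process data):

EPS0 = 8.854e-12                      # permittivity of free space, F/m
k, D = 3.9, 2e-9                      # SiO2-like dielectric and a 2 nm gap (illustrative)

def gate_cap(width_nm, length_nm):
    # C = k * eps0 * A / D for a simple parallel-plate capacitor
    area = (width_nm * 1e-9) * (length_nm * 1e-9)
    return k * EPS0 * area / D

c_nominal = gate_cap(200, 65)         # made-up nominal poly line width
c_trimmed = gate_cap(200, 55)         # same gate with a narrower critical dimension
print(f"capacitance ratio: {c_trimmed / c_nominal:.2f}")   # ~0.85, i.e. ~15% less C (and dynamic power)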

Heck, they likely do this too to get those power numbers down; since they are pushing a 2.6 GHz Opteron, the volume will be very low and the price they can get will be relatively high....

The point is, it is nothing special like you are so ignorantly trying to say. In fact, you are fairly dimwittedly stupid.

Jack

Careful Jack - you'll make his head explode.
 
I'm going to go out on a limb here and play devil's advocate. An in-house chipset will not fix the problems that loom over AMD's FX processors. The problem lies with their design. Overall I think the design is just bad. There is no simple way to say it other than that the design is flawed and horrid. There are several reasons behind that: the poor latency, the low headroom for overclocking, and the ridiculous amount of power that's required to run the 4X4. Honestly, SLI 8800GTXs are bad enough, but take that and add in the Quad platform and you're going to be chewing through the amps on the rails faster than Clinton lied about the scandal in the White House. Given the design implemented here, AMD surely could have done much better. I personally do not know how they could have revised or changed the design, but the more I read about the 4X4 the more I see it as a hashed-together fusion of the Opteron processors and the K8 implementation. Honestly, I mean come on, do the guys over at AMD expect us to buy into this crap and believe that this is the ultimate enthusiast system?

The guys really set their sights too high and should have kept working on K8L and AM3. To be honest, the 4X4 is just a total waste of their time and money. I'm going to give AMD credit for trying to go in a different direction and try something like this, but the Quad FX is only a stopgap for them until they can roll out K8L and 65nm production.

Moving on though, I would have tried to make improvements to the architecture, or just start from scratch, return to my roots, and go with something different besides a fusion of an aging architecture and the rebranding of a server-based processor as the FX series. Right now the price, power consumption, and benchmarks have AMD staring down the barrel of defeat.

(But honestly, I am rooting for AMD and hope they bring their A game and get it going soon. It's understandable that they are struggling with 65nm and don't have as many resources as the boys in blue, but for the underdog who was kicking Intel's tail for 2 years running, they really fricked this 4X4 idea up.)
 
..... I tested the first ACPI Alpha on Win2000 - and it was IMMEASURABLY FASTER than any other chip even close to the same clock speed .....

C2Q also uses the same subsystem but excels because of its architecture/implementation. My whole point has been load/swap and OS scheduling. I have said numerous times that I am not thinking of this mainly for games, though the above-60fps performance AND severe multi-threading/tasking capabilities make it a worthy upgrade to prep for 8 cores. .....

The funny thing is that for my needs I could go with a 5200+(2.8GHz = 11x260), 8800GTS SLI, with 8GB RAM, and then Agena.

I want two sockets. It has shown the promise I wanted (EXCEPT IN THE POWER AREA - though FX70 is tamer with careful construction).

Then explain why the Microsoft Windows 2000 on DEC Alpha documentation is always talking about "reducing paging", while you are always talking about increasing paging/swapping?

http://msdn2.microsoft.com/en-us/library/ms810461.aspx
http://search.microsoft.com/results.aspx?mkt=en-US&setlang=en-US&q=alpha+%22windows+2000%22

Until you do I'll be clicking [Ignore] when I encounter your posts, to ignore every post you've made in this thread (and to save a ****load of scrolling, as you do full quotes needlessly - trollish behaviour IMHO).

As for your comment about "I have said numerous times that I am not thinking of this mainly for games, though the above 60fps", immediately followed by your other comment "The funny thing is that for my needs I could go with a 5200+ (2.8GHz = 11x260), 8800GTS SLI..."

It sounds to me as though you are a gamer, since you're not running Registered, ECC, or FB-DIMMs.


I'd like to thank the admins for adding an [Ignore] button. :trophy: - I've only needed to use it on a small group of people thankfully, mostly to save reading utter tripe and scrolling time due to full quotes being made w/o purpose.

To the others reading this thread providing support, I thank you all. 8)

UPDATE: AMD needs a +46.55172% gain in performance over the 2.8 GHz quad-core solution just to keep pace, and this is assuming the extra delta continues to scale at 107%, which it will not; it'll keep dropping off the higher the clockspeed or IPC gets compared to the system bus.

Perhaps 2 x +21.05854% gains stacked together, e.g. [clockspeed] + [something else] compounded, but they will not get this without a massive redesign of the CPU core, and K8L on 65nm is not that redesign.

eg: 4 x 3.4 GHz cores on 65 nm using K8L, vs 4 x 2.8 GHz cores on 90 nm using K8. This will make up for +21.42856 % of the gain.

Where will the other +20.68966 % (compounded) performance come from ?

Even borrowing a few OOO / pre-fetching tricks from the Intel NGMA (Core 2 Duo, Xeon 5100 / 5300, etc) they'll only get another +10 % to +15 % of the 'required' +20.68966 %, leaving them offering about 95% of Intel Core 2 Quad performance, likely at a higher overall price.

It'll take at least 6 - 9 months once 65 nm is ramped up before they'll even be able to offer a hypothetical 3.4 GHz quad-core part too, time in which Intel will just gain more of a lead.
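
For what it's worth, the compounding above checks out arithmetically (a quick sketch using the percentages from this post):

target = 1.4655172                     # the +46.55172% overall gain
clock = 3.4 / 2.8                      # hypothetical 65nm clock bump, ~+21.43%
print(f"two stacked +21.06% gains: {1.2105854**2:.4f}")   # ~1.4655, matches the target
print(f"clockspeed contribution  : {clock:.4f}")
print(f"left to find elsewhere   : {target / clock:.4f}")  # ~1.2069, i.e. ~+20.7%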



I still want to see a 45/45/10 market share balance between Intel / AMD / Others (including Sun Microsystems, SGI, Via, etc).
 
Well, the QFX has been released, and while certain scenarios show incredible promise, other areas are saddled with too much baggage.

The fact that a dual Opteron can be outfitted with SLI in a workstation and offer good performance without these power levels implies that the total package could have been done better.

The Opteron 285 runs at 2.6GHz, and this graph shows that without the additional SLI power the AMD system draws 322W and the dual 5160 draws 267W (full load).

Looking at the various articles around the web, it seems as though only Anand managed to actually find suitable tasks that were reasonable for multi-tasking. In his case he used Blu-ray movies, which totally killed all the dual-core systems.

Mmmmm, the only way to reduce this latency issue between cores is to do two things: 1) reduce the number of memory reads/transfers between cores, and 2) increase the speed of the link.

I believe Intel has already accomplished this by increasing the cache for each processor. I bet if AMD came out with a dual-core processor with 2 or more megs of L2 cache we would see a big performance boost for the 4X4 system. At least that is what it looks like AMD could do to boost the performance of the 4X4 system using the K8 core processor.

Jack, does this ring true to your thinking?

His power numbers were also at least 100W lower than other test sites'. He turned on CnQ and got the idle power down to within 4W of the C2Q system. Of course this didn't dent the 456W the system drew at "full load" with an 8800GTX (which I believe draws 225W+).

And these are FX-74 numbers. The FX-70 is shown to use even less, so I believe OEMs can get reasonable workstation power levels out of it in the next few months, especially if AMD releases a new rev (they sorely need to drop power by at least 10%, perhaps more, and the lessons learned can help get Agena down below the reported 125W).

Hexus is also reporting that they can show a defect in the NUMA implementation of the Asus board BIOS (this post was going to be called "Did Asus and nVidia drop the QFX ball"), which may be why games are suffering so much from latency problems.

One review (most are posted at AMDZone) stated that AMD is reporting that node interleave mode will need to be turned off for Vista and on for XP.

I believe AMD reported that they would release their own branded chipset, and hopefully it will be less power hungry than the 680a, which is reported to use more power than even the 975X. Having two of them surely doesn't help. nVidia does have a two-socket SLI chipset in the 3600, and ASUS' implementation is only $300. Even the $400 Asus 680i for Intel implements fewer PCIe lanes and has smaller power reserves.

Because the 7950GT and the probably forthcoming 8950GT only require two slots for Quad SLI. I can see the need for four low-end GPUs for certain content creators, but even three PCIe slots can't really be used right now, as no "Havok"-type apps or cards have been released, except for the server (AMD Stream).

Only time will tell if AMD had planned to create an entire reference system based on an Ati chipset while allowing nVidia to be the launch partner.

But the real judgement is that only expert builders will make QFX something that isn't too loud or hot, and while Vista X64 may do wonders for it in multithreaded apps, it is not yet ready for prime time.

Let's go AMD! Show your true potential.
The NUMA support in Vista is supposed to be better than in Windows XP, so Vista should give better performance. For NUMA to work well in this two-node setup, node-to-node traffic should be minimized. I know that in multi-node systems it can sometimes be an advantage to use node interleave mode, such as in databases. The problem with Quad FX on the hardware level is the HT link. I do not think HT 2.0 is adequate, and it would take a FULL HT 3.0 link to do the job, viz. a 2.6GHz link that is 32 bits wide. If you look at some of the memory benchmark figures you will see that they are very low, probably caused by excessive cross-node traffic.

Bang on correct. 4x4 as a platform will be much better with the socket 1207+ and HT 3.0.
 
Okay, so I'm checking my morning e-mails here, and this thread has grown from about 4 posts when I went to sleep to 4 pages when I woke up.

The general gist of this thread (so far as I can tell) is that BM seems to be making the same argument over...and...over...and...over...again.

This whole thing was summed up on page 1!