News China's Loongson CPU seems to have IPC equivalent to Zen 4 and Raptor Lake -- but slow speeds and limited core count keeps 3A6000 well behind moder...

Status
Not open for further replies.
It seems like the Loongson CPU is using a short pipeline to reach IPC similar to AMD's and Intel's 14+ stage designs. The downside of a short pipeline is a low frequency ceiling, but it makes matching a competitor's IPC much easier. That just shows how much more advanced its competitors are: they reach a similar IPC with 14+ stage pipelines and clock to 6 GHz.
 

LuxZg

Distinguished
I believe this should be viewed against the 12nm production limitation. I wonder what would be possible if they had access to a cutting-edge process. Considering they're being compared to companies that have been working on CPUs since "forever", they're doing pretty well. Now, if SMIC can do 7nm and possibly 5nm, and if the government can help with this design the way Huawei had huge resources for their ARM CPUs, they could be catching up. Modern processors barely advance gen on gen; most advances come from the manufacturing process and platform improvements (PCIe 5.0, DDR5, etc.). So AMD/Intel won't be running away, and if China can keep pushing, they'll get closer and closer. Just saying... The good news is that this is, and will stay, a product for the internal market. Well, good news for Intel/AMD, bad for consumers and competition. For me there's no bad competition; the old corporations need a kick in the tush to start moving from time to time.
 
My 12nm AMD Ryzen 2700X can do 4.3 GHz all day, and my 32nm i7-3960X from 2012 can do 5 GHz all day. 12nm is not the reason it's clocked at 2.5 GHz; it's directly related to the length of the pipeline, which is why I'm saying Loongson is nowhere near on par with AMD and Intel. Short-pipeline Pentium IIIs at 1.13 GHz performed similarly to long-pipeline Pentium 4s at 1.7 GHz, because long pipelines trade IPC for stability at high clock speeds. This pipeline effect is why Loongson's design is well behind AMD and Intel. For AMD and Intel to reach the same IPC as Loongson while using a long pipeline stable at 6 GHz demonstrates a microarchitecture years ahead of Loongson's.
 

DavidC1

Distinguished
It seems like the Loongson CPU is using a short pipeline to yield similar IPC to AMD's and Intel's 14+ stage pipelines.
It could be, but we can't say for sure.
Clockspeeds aren't just a function of architecture, but the capability of the circuit designers and layout engineers in the team.
The IPC of leading x86 CPUs is partly being held back by the ISA. I'm eager to see how the upcoming ARM cores from Qualcomm, AMD, Nvidia and others will compare.
The 90's called, they want their claims back.

Modern CPUs with billions of transistors aren't limited by the decoders anymore. That era is long past; it ended when engineers had transistor budgets measured in the hundreds of thousands.

The simple reason ARM vendors are leading is that both AMD and Intel have fallen completely flat on their faces over the years. You can tell by comparing AMD and Nvidia: the latter has had far, far fewer missteps.

DEC in the 90's led everyone else despite having the same RISC instruction set. The DEC team eventually ended up working at Apple through the acquisition of PA Semi. Along with them came other brilliant engineers like Gerard Williams III, who moved on to found Nuvia and break records again. As soon as Gerard left Apple, we got four generations of minimal gains from Apple's chips.

Because we're outsiders, it's very easy to fall into the trap that often-clueless high-level managers with finance degrees fall into: the belief that it's the "tech" that makes the difference, when it's the ingenuity, perseverance, and creativity of the people that do.

The no-clues talked about the complexity of Sapphire Rapids as the reason for its downfall, when people in the know had said for years that the management team under Krzanich fired the ENTIRE SPR validation team at some point. That couldn't have had anything to do with SPR being delayed for 3+ years, could it?
 

bit_user

Polypheme
Ambassador
The 90's called, they want their claims back.
Look, I've been through this debate so many times I can't even keep track of whether or how many times I've debated it with you. So, I'm going to keep this short & sweet.
  1. AArch64 has an undeniable advantage in that its ISA is cheaper to decode and easier to scale up decoding. ARM's Cortex-X4 has a 10-way decoder, while Intel's Golden Cove is currently at just 6-way, and only a couple of those are fully general.
  2. AArch64 has a more efficient memory consistency model.
  3. AArch64 has twice as many general-purpose registers.
  4. AArch64's SVE2 enables flexible scaling of vector width, without rewriting code. In a CPU with AVX10/512, the extra width potentially goes to waste, when executing code written with AVX2 or AVX10/256.
  5. AArch64 has a 3-operand format, for most instructions, reducing the need for additional mov's.

Intel's announcement of APX is clear evidence that Intel is worried. Of the advantages I listed above, it only takes #3 and #5 off the table. In exchange, it adds yet another prefix byte, which makes #1 even worse.

Modern CPUs with billions of transistors aren't limited by the decoders anymore.
Their width can be a bottleneck. So, even if you take the area impact off the table, decoders still matter. I believe the power they require can also be relevant in certain contexts.

the management team led by Krzanich fired the ENTIRE SPR validation team at some point. That couldn't have had anything to do with SPR being delayed for 3+ years, could it?
Most of Sapphire Rapids' delay was due to manufacturing process, not validation. The bugs plaguing Sapphire Rapids only accounted for about the final year of delays in its launch date.
 
It could be circuit design and layout, but that seems too rudimentary to mess up badly enough to be the culprit. I'm leaning towards a short pipeline. One of the hardest parts of microarchitecture design is a high-confidence branch predictor, which becomes more and more important as the pipeline gets longer, because a misprediction flushes the entire pipeline and forces the instructions to be re-run. A short pipeline can achieve high IPC with a low-confidence branch predictor, since the number of instructions flushed is at most equal to the number of stages. This is why the Pentium 4 was regarded as hot garbage: its 20+ stage pipeline was mated to a misprediction-prone branch predictor, even though the Pentium III's branch predictor was objectively worse. I'm theorizing that Loongson doesn't have the experience in branch prediction to build a predictor that can be mated to a long-pipeline design. However, I could be wrong.
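The flush-cost argument above can be sketched as a toy model. Every number here is an illustrative assumption (branch frequency, mispredict rate, and treating the flush penalty as equal to the pipeline depth), not a measurement of any real chip:

```python
def mispredict_overhead(depth, branch_freq=0.2, mispredict_rate=0.05):
    """Average extra cycles per instruction lost to branch mispredicts.

    depth: stages flushed and refilled on a mispredict (the penalty)
    branch_freq: fraction of instructions that are branches (~1 in 5)
    mispredict_rate: 5% miss rate, i.e. a fairly weak predictor
    """
    return branch_freq * mispredict_rate * depth

def effective_ipc(base_ipc, depth, **kw):
    # CPI = ideal CPI + per-instruction mispredict overhead
    return 1.0 / (1.0 / base_ipc + mispredict_overhead(depth, **kw))

# Same 4-wide core, same weak predictor, two pipeline depths:
short_ipc = effective_ipc(4.0, depth=10)   # Pentium III-like depth
long_ipc = effective_ipc(4.0, depth=20)    # Pentium 4-like depth
print(f"10-stage: {short_ipc:.2f} IPC, 20-stage: {long_ipc:.2f} IPC")
```

With the same weak predictor, the 10-stage machine keeps noticeably more of its ideal throughput than the 20-stage one, which is the point about low-confidence predictors pairing better with short pipelines.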
 

bit_user

Polypheme
Ambassador
I'm leaning towards a short pipeline, as one of the hardest parts of microarchitecture design is a high-confidence branch predictor, which becomes more and more important the longer the pipeline gets, since a misprediction causes the entire pipeline to be flushed and the instructions to be re-run.
Branch prediction seems like the kind of thing which can be somewhat decoupled from the pipeline itself. Also, as painful as it is to invalidate speculative executions, probably the bigger penalty associated with mispredictions is the latency hit of fetching instructions down the path you failed to predict, especially if it's nowhere in the cache hierarchy, since DRAM can easily be around 500 cycles away!
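To give a rough sense of scale for that fetch-latency point: the ~500-cycle DRAM figure comes from the post above, while the refill depth and the other latencies are ballpark assumptions purely for illustration:

```python
REFILL = 15  # assumed front-end stages to refill after a flush

# Assumed ballpark latencies (cycles) to fetch the correct path:
fetch_latency = {"L1i": 0, "L2": 12, "L3": 40, "DRAM": 500}

for level, extra in fetch_latency.items():
    print(f"mispredict with correct path in {level}: ~{REFILL + extra} cycles")
```

Once the correct path misses the cache hierarchy, the fetch latency dwarfs the pipeline-refill cost, whatever the exact pipeline depth.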

BTW, don't forget that memory prefetchers are an important form of prediction.

A short pipeline can achieve high IPC with a low-confidence branch predictor, since the number of instructions flushed is at most equal to the number of stages. This is why the Pentium 4 was regarded as hot garbage, with its 20+ stage pipeline mated to a misprediction-prone branch predictor, even though the Pentium III's branch predictor was objectively worse.
This model of IPC is too simplistic. Since the days of Core 2, pipelines have again started to get longer, yet IPC has continued to increase.
 
Oh, I wasn't saying that pipeline length is the only thing that affects IPC; the parts feeding the pipeline matter too. And today's AMD/Intel designs actually have a dynamic pipeline length: I believe Zen and Intel cores can be 14-19 stages depending on the workload. That's not much of an increase from the original Core architecture's 12 stages or Core 2's 16 stages. Mind you, Core came right after Netburst Prescott with its 31-stage pipeline. The Sandy Bridge through Kaby Lake architectures all maintained a 14-19 stage dynamic pipeline, so not really much longer than the original Core. I can't find any data on the number of stages in Alder Lake and newer architectures, but it seems both Intel and AMD have decided 14-19 stages is the optimal length.

I'm simply saying that shortening the pipeline is an easy way to raise IPC to compete with AMD/Intel if the 3A6000's architecture can't hit similar IPC at the same pipeline length. And if they did shorten the pipeline for that reason, it could explain why 3 GHz under liquid nitrogen cooling is the maximum stable clock, assuming SMIC's 12nm process and Loongson's circuit design and layout are competently done.

And my point about branch predictor design was just to reinforce that it's much simpler to increase IPC by shortening the pipeline, since longer pipelines require an ever more sophisticated front end (branch prediction, instruction fetch, buffers, schedulers, etc.) and back end (cache population and hit rates, etc.) to maintain the same IPC as a shorter pipeline.

But you're right that front-end and back-end optimization is key to increasing IPC in long-pipeline designs: the longer the pipeline, the closer to perfect the front/back end must be to keep it fully utilized. As in my Pentium III/4 example, going from a 10-stage to a 20-stage pipeline required an extra 566 MHz of clock speed to match the Pentium III's performance, despite the Pentium 4's improved front/back end. And when Prescott came out, its 31-stage pipeline was so long that AMD had the better processor even though it couldn't clock nearly as high.
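A quick sanity check on that Pentium III/4 comparison, using the clock figures as quoted in the thread and assuming equal performance with performance modeled simply as IPC times clock:

```python
p3_clock = 1.13  # GHz, short (10-stage) pipeline
p4_clock = 1.70  # GHz, long (20-stage) pipeline

# If perf = IPC * clock and the two chips perform the same,
# the IPC ratio is the inverse of the clock ratio:
ipc_ratio = p4_clock / p3_clock
print(f"P3 IPC ~= {ipc_ratio:.2f}x P4 IPC")
```

Under these assumptions, doubling the pipeline cost roughly a third of the per-clock throughput, which the extra 566 MHz only just bought back.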
 
Zen only has a 19-stage pipeline; Intel's current P-core CPUs have had a 15-20 stage pipeline since Alder Lake.
 
I keep saying it: the Chinese are probably happy with a CPU that's good enough for what they need in the domestic market for general business use, and each generation keeps getting better. They've come a long way in a short period of time. Give them another 5 years and who knows, they could start being a real threat to AMD/Intel, especially if they sell overseas at low prices and can prove there's nothing dodgy in the CPU hardware/coding/drivers, etc., to give the West an excuse to ban them.
 

frogr

Distinguished
Wouldn't it be better to compare it with other 4-core, 8-thread processors? In a 16-core, 32-thread processor there is more memory bus and/or L3 cache contention, which results in lower IPC.
 

bit_user

Polypheme
Ambassador
I didn't watch the video, but I assumed they compared the CPUs in single-threaded testing. If so, then it doesn't much matter how many cores the x86 CPUs have, so long as we're talking about the midrange or better desktop CPUs. Even if you went down to a monolithic 8700G Ryzen or an Alder Lake H0 die, I doubt the IPC changes much.
 

bit_user

Polypheme
Ambassador
This was just another China chip propaganda article. I think so many of these have to be published to get extra money.
Don't be so cynical.

Granted, they just reported on some 3rd party reviewer's unverifiable claims, so it's not really worth much. However, I'm definitely interested in developments in China's CPU scene, even if it means a little propaganda slips through. The truth will eventually come to light.
 

DavidC1

Distinguished
Look, I've been through this debate so many times I can't even keep track of whether or how many times I've debated it with you. So, I'm going to keep this short & sweet.
ISA has an impact, but technical limitations are DWARFED by the capability of the *people* and the teams they form.

For example, both AMD and Intel have this crazy focus on clocks, so they need ridiculous stage counts closing in on the Netburst P4, while ARM vendors stick with far fewer. Cortex-X4 is at 10 stages, and Apple's A9 was at 9 stages.

Intel claimed that each added pipeline stage costs 2-4% in performance due to increased branch-mispredict penalties. The uop cache hit rate is in reality only about 60%, so in very branchy scenarios where the uop cache hit rate is low, we're talking 7, 8, or even 9 extra stages for Intel/AMD, resulting in anywhere from a 14-35% difference in performance due to pipeline stages alone!
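For what it's worth, the arithmetic behind that range, using the 2-4% per-stage figure quoted above. Whether the per-stage hits add linearly or compound is my assumption, so both readings are shown:

```python
def additive(extra_stages, per_stage):
    # Straight sum: n stages at p% each
    return extra_stages * per_stage

def compounded(extra_stages, per_stage):
    # Multiplicative: each stage shaves p% off what remains
    return 1 - (1 - per_stage) ** extra_stages

low = additive(7, 0.02)    # 7 extra stages at 2% each
high = additive(9, 0.04)   # 9 extra stages at 4% each
print(f"additive: {low:.0%} to {high:.0%}")
print(f"compounded: {compounded(7, 0.02):.0%} to {compounded(9, 0.04):.0%}")
```

The additive reading gives roughly 14-36%, matching the quoted range; compounding lands slightly lower, around 13-31%.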
Most of Sapphire Rapids' delay was due to manufacturing process, not validation. The bugs plaguing Sapphire Rapids only accounted for about the final year of delays in its launch date.
Sapphire Rapids' delay had other causes, but firing not just a few members, but the whole team, would be a deciding factor. You can choose to believe it or not, but it's reported by people who had sources at Intel (a few of them are on Anandtech).

It could be circuit design and layout but that seems too rudimentary to mess up and this be the culprit.
Rudimentary or not, it can't be ruled out, since you're comparing established engineers and teams with multiple decades of experience at top-tier manufacturers against a relatively new and inexperienced one.

Fact: the original Itanium's design was shoddy, and the successor Itanium 2 improved it to such a degree that it went from a 10-stage-pipeline, 800 MHz chip to an 8-stage-pipeline, 1 GHz chip on the same Intel 0.18µm process. Rumblings were that they could have pushed the chip to 1.2 GHz but chose not to, probably for power reasons. The huge difference was that the original chip was designed by inexperienced engineers at Intel, while the second generation was taken over by experienced ones at HP.
 

DavidC1

Distinguished
Zen has a 14-19 stage pipeline according to my sources; can you provide a source for Alder Lake's pipeline? I'm very interested.
Zen only has a 19-stage pipeline; Intel's current P-core CPUs have had a 15-20 stage pipeline since Alder Lake.
Both Zen and Alder/Raptor have a similar number of pipeline stages. I say similar because Intel has a bit more.

When you talk about pipeline stages, it refers to the integer pipeline; FP pipelines are quite a bit longer, and it's integer performance that's harder to gain.

The varying 14-19 stages is because both chips have a uop cache: 19 stages on a uop-cache miss, 14 stages on a uop-cache hit. The uop cache itself adds a few stages, so without it you'd maybe be talking 16-17.

Nehalem was 16 stages, and Intel introduced the uop cache in Sandy Bridge, which turned it into a 14-18 stage design. Golden Cove adds an extra stage, so we're probably talking 15-19 for Intel versus 14-18 for AMD.
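The hit/miss depths above imply a simple expected flush depth. This is just a weighted average for illustration; the 60% uop-cache hit rate is the figure quoted earlier in this thread, not a measurement:

```python
def avg_flush_depth(hit_depth, miss_depth, hit_rate):
    # Mispredict recovery restarts from the uop cache on a hit,
    # from the legacy decode path on a miss.
    return hit_rate * hit_depth + (1 - hit_rate) * miss_depth

avg = avg_flush_depth(14, 19, 0.60)
print(f"expected flush depth at 60% hit rate: {avg:.1f} stages")
```

At a 60% hit rate the weighted average comes out to 16 stages, right in the 16-17 "no uop cache" ballpark mentioned above.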
 
Nothing I didn't already know, but I'm sure others may benefit. Thanks!
 