News Next-gen Intel Arrow Lake-S CPU spotted with 24 threads and no AVX-512 functionality

The CPU inside the test system features 24 threads, a 3 GHz frequency, and lacks AVX-512.
This is no surprise, because Intel discloses ISA extensions of upcoming CPUs a couple of years in advance. They have yet to disclose AVX-512 or AVX10/512 support on an E-core.

Speaking of which, if/when they do, it'll be AVX10/512 - not AVX-512. However, the whole business of introducing AVX10 at a baseline width of 256 bits telegraphed their intention to hold client CPUs at 256-bit for the foreseeable future.

this feature segmentation is not likely to last forever. Intel has plans to bring AVX10 and AVX-512 to its future E-core CPU architectures
No, you misread their messaging. I've been through their documents (the whitepaper and datasheet) multiple times, and they decidedly characterize 512-bit (including AVX10/512) as legacy and even go so far as to say that hybrid CPUs will use AVX10/256. I'll probably have to go back and pull up the precise quotes...
 
Regarding core count, this leak (if true) shows their intention to stick with the 8P + 16E configuration in the top-spec die.

FWIW, my take on the rumor of hyper-threading going away is one of cautious skepticism. I can see why they might do it, given all the threads contributed by the E-cores, but I can also see them not wanting to cede any ground on multithreaded performance to AMD. Perhaps what they'll do is take a page from an older playbook and disable HT in some of the lower-end and mid-range models.
 
No, you misread their messaging. I've been through their documents (the whitepaper and datasheet) multiple times, and they decidedly characterize 512-bit (including AVX10/512) as legacy and even go so far as to say that hybrid CPUs will use AVX10/256. I'll probably have to go back and pull up the precise quotes...
No, YOU misread.
The Intel AVX-512 ISA will be frozen as of the introduction of Intel AVX10 and all CPUID feature flags will continue to be enabled on future P-core processors for legacy support.
This means they will keep the coder calls for AVX-512 so that no old software has to be rewritten; that's what legacy support is.
All new subsequent vector instructions will be enumerated only as part of Intel AVX10.
From here on out, all new CPUs will have AVX10 (even if they do have legacy support).
Apart from a few special cases, those instructions will be supported at all vector lengths, with 128-bit and 256-bit vector lengths being supported across all processors, and 512-bit vector lengths additionally supported on P-core processors.
AVX10 will work in the same way as DirectX or Vulkan: the coder will not have to know exactly what the hardware can do; they will only need to code what they want the hardware to do. Coders will be coding for AVX10, and the small cores will be handling up to 256 bits while the big cores will be handling up to 512.
with 128-bit and 256-bit vector lengths being supported across all processors,

and 512-bit vector lengths additionally supported on P-core processors.
Maybe "a few special cases" could mean all of the mainstream CPUs, but that would be an incredibly bad way of phrasing that.
Apart from a few special cases,
https://cdrdv2-public.intel.com/784343/356368-intel-avx10-tech-paper.pdf
 
Perhaps what they'll do is take a page from an older playbook and disable HT in some of the lower-end and mid-range models.
That would be the stupidest thing ever: taking performance away from the tiers that need it the most to stay desirable.
They did that back then because there was nothing to compete against them at those tiers; that's not so now.
The only reason Intel would remove hyperthreading would be to reintroduce it in the next gen for a huge boost in numbers, which would boost sales.
But that is way too big a risk to take.
 
That would be the stupidest thing ever: taking performance away from the tiers that need it the most to stay desirable.
They did that back then because there was nothing to compete against them at those tiers; that's not so now.
The only reason Intel would remove hyperthreading would be to reintroduce it in the next gen for a huge boost in numbers, which would boost sales.
But that is way too big a risk to take.
At least this is not a segmentation issue this time. Supposedly, Arrow Lake loses hyperthreading because Intel was going to introduce its replacement, "Rentable Units". But it's not ready in time, so Arrow Lake P-cores end up with neither hyperthreading nor Rentable Units.

It's difficult for me to believe that they couldn't put hyperthreading back into the design, since it's been around for over 20 years. So I'm not sure about the rumor. No worries, we'll find out soon enough.

If Arrow Lake has a decent +15% or better single-threaded performance increase, it could mostly cancel out the removal of hyperthreading. Some games and applications aren't helped by it at all.
 
At least this is not a segmentation issue this time. Supposedly, Arrow Lake loses hyperthreading because Intel was going to introduce its replacement, "Rentable Units". But it's not ready in time, so Arrow Lake P-cores end up with neither hyperthreading nor Rentable Units.
What is this, cookie baking?!
I was going to use stevia, so I left out the sugar, and now it has neither?!

That's not how that works; they can't just rip out pages of the design because they ran out of time and manufacture half a CPU.
If they aren't ready, they will delay. They've done that often enough.
If Arrow Lake has a decent +15% or better single-threaded performance increase, it could mostly cancel out the removal of hyperthreading. Some games and applications aren't helped by it at all.
So even if it works as expected, some games and apps aren't helped at all, so overall it would be a loss.
Why would anybody think that they will remove HTT?

Rentable units will break up threads into demanding and less demanding parts; that is perfect for making hyperthreading work even better, since it works on resources left empty by the main thread.
If a core can run one demanding part and one light part at the same time with hyperthreading, it will be faster than doing it on two separate cores, especially if the second one is an E-core with much lower clocks; it's also less complicated and needs fewer resources overall.
Until now, hyperthreading had no clue whether the thread it's going to run would be demanding or not, so it could get two demanding threads issued to the same physical core at once; Rentable Units would help that tremendously.
If the hyperthreads are all filled up, then threads would go to the E-cores.
Or at the very least, the Thread Director can evaluate where a thread will run faster and act accordingly.
 
What is this, cookie baking?!
I was going to use stevia, so I left out the sugar, and now it has neither?!

That's not how that works; they can't just rip out pages of the design because they ran out of time and manufacture half a CPU.
If they aren't ready, they will delay. They've done that often enough.
I'm just telling you what I've heard, which Tom's Hardware itself has mentioned a couple of times now. Don't be shocked if Arrow Lake launches with 24 cores (8+16), 24 threads.
 
No, YOU misread.
*sigh*

We've been through this before.

This means they will keep the coder calls for AVX-512 so that no old software has to be rewritten; that's what legacy support is.
The key phrase is "P-core processors", by which they mean non-hybrid CPUs containing only P-cores. So, they're just saying that product lines which currently support AVX-512 will continue to receive AVX-512 support. That's not relevant to the discussion.

From here on out, all new CPUs will have AVX10 (even if they do have legacy support).
Not "all new CPUs", since Arrow Lake won't have it.

AVX10 will work in the same way as DirectX or Vulkan: the coder will not have to know exactly what the hardware can do; they will only need to code what they want the hardware to do. Coders will be coding for AVX10, and the small cores will be handling up to 256 bits while the big cores will be handling up to 512.
Not only is this wrong, but so is your DirectX/Vulkan analogy. Sure, in each case you can query the hardware to see what it supports, but then your program needs to have a codepath to exploit or deal with the hardware's capabilities or limitations.

For instance, if you detect the CPU only implements AVX10/256 v2, then you can't execute instructions using 512-bit operands or any instructions added in v3 or later.

Likewise, if the hardware is capable of 512-bit support, your program needs a dedicated codepath written using 512-bit operands, if you want to gain any benefit from it. In this regard, it's really no different than the current AVX2 vs. AVX-512 situation, except that AVX10/256 and AVX10/512 are identical except for operand size - not semantics or instruction set.

Fun fact: AVX-512 already supports different operand sizes: 128-bit, 256-bit, and 512-bit. AVX10 continues this theme, except that it makes 512-bit support optional.
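
To make the "separate codepaths" point concrete, here's a minimal sketch of how AVX2 vs. AVX-512 runtime dispatch typically looks with GCC/Clang. The function names are mine, purely for illustration; tail handling and a scalar fallback are omitted:

```c
#include <immintrin.h>
#include <stddef.h>

/* 512-bit codepath, compiled for AVX-512F */
__attribute__((target("avx512f")))
static void add_arrays_512(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));
    }
}

/* 256-bit codepath, compiled for AVX2 */
__attribute__((target("avx2")))
static void add_arrays_256(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
}

void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    /* Runtime query: pick the widest path this CPU actually supports. */
    if (__builtin_cpu_supports("avx512f"))
        add_arrays_512(dst, a, b, n);
    else if (__builtin_cpu_supports("avx2"))
        add_arrays_256(dst, a, b, n);
    /* else: scalar fallback (omitted) */
}
```

Both codepaths have to exist in the binary; the query only selects between them. That's the part no API can make disappear.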
Yes, I've been through that, multiple times. There's another document, as well.
 
If Arrow Lake has a decent +15% or better single-threaded performance increase, it could mostly cancel out the removal of hyperthreading.
According to this, it doesn't. Geekbench SC (single-core) only improves 9% to 13%:

[leaked benchmark chart: Geekbench single-core scores]


And the SPECint and SPECfp (1-copy) only improve 4-8% and 3-6%, respectively.

That's relative to the i9-13900K, as well. Compared to the i9-14900K, the improvement would be even less.

Some games and applications aren't helped by it at all.
True. Some gamers either disable it in BIOS or use tools like Process Lasso to restrict games' use of it.

IMO, the problem isn't hyperthreading, but rather bad APIs which limit the kernel's understanding of how the game is trying to use threads and make the game jump through hoops to try and deal with the hardware's capabilities and limitations.
 
That's not how that works; they can't just rip out pages of the design because they ran out of time and manufacture half a CPU.
If they aren't ready, they will delay. They've done that often enough.
Although we're getting pretty far back in history, the Pentium 4 "Prescott" cores were designed with x86-64 support but didn't enable it. I think that's because it was buggy or incomplete and they decided to push the thing out the door, anyhow.

Rentable units will break up threads into demanding and less demanding parts; that is perfect for making hyperthreading work even better, since it works on resources left empty by the main thread.
If a core can run one demanding part and one light part at the same time with hyperthreading, it will be faster than doing it on two separate cores, especially if the second one is an E-core with much lower clocks; it's also less complicated and needs fewer resources overall.
What you're describing is incredibly complex and hard to implement (well). We need to see real data on how well it works. It's not something you can reliably simulate in your brain.

Until now, hyperthreading had no clue whether the thread it's going to run would be demanding or not,
The Thread Director (introduced in 12th gen) now gives the OS kernel insight into such things. Of course, only hybrid CPUs have a Thread Director, but the basic approach should apply to SMT as well as P-core vs. E-core.
 
On a lighter note, is it just me? Or do you wish they'd bring back *****bridge code names? I still regularly use my Sandy Bridge-E computer from 2013, and it would be awesome to upgrade to dirtybridge, rustybridge, dustybridge, ricketybridge, etc. Haha, I can list many more, but I'll leave the joke there.
 
On a lighter note, is it just me? Or do you wish they'd bring back *****bridge code names?
For a while, it seemed like the suffixes were tied to the CPU socket. Sandy Bridge and Ivy Bridge shared a socket. Haswell and Broadwell were supposed to, also. So did Skylake and Kaby Lake, but then it seems Intel grew too fond of Lake names...

I do like how they've separated out the core names from the CPU names, and those seem to have fairly robust conventions. Also, the server CPUs now seem to have their own naming conventions, which is helpful.
  • cove = P-core
  • mont = E-core
  • rapids = Server; P-cores only
  • forest = Server; E-cores only
  • lake = Client

The different client market segments also have useful suffixes: U, P, H, HX, S, T.
 
The key phrase is "P-core processors", by which they mean non-hybrid CPUs containing only P-cores. So, they're just saying that product lines which currently support AVX-512 will continue to receive AVX-512 support. That's not relevant to the discussion.
So the CPUs they won't be making anymore? Those will get AVX10? They are doing this for legacy CPUs that will be Pentiums and Celerons only?
Are you for real?

Every core in a CPU is a processor.
A core that is a P-core rather than an E-core is a P-core processor.
https://www.tomshardware.com/news/cpu-core-definition,37658.html
A CPU core is a CPU’s processor.
Not "all new CPUs", since Arrow Lake won't have it.
New from the point of view of the paper.
Not only is this wrong, but so is your DirectX/Vulkan analogy. Sure, in each case you can query the hardware to see what it supports, but then your program needs to have a codepath to exploit or deal with the hardware's capabilities or limitations.
Do you even know computers?
They make standards like DirectX and Vulkan so that devs DON'T query hardware. They write one codepath for one API instead of having to write one for every piece of hardware in existence.
What you're describing is incredibly complex and hard to implement (well). We need to see real data on how well it works. It's not something you can reliably simulate in your brain.


The Thread Director (introduced in 12th gen) now gives the OS kernel insight into such things. Of course, only hybrid CPUs have a Thread Director, but the basic approach should apply to SMT as well as P-core vs. E-core.
Oh, heeeey, look at that: we have something that has info on these types of things and can manage threads and cores? What about that.
Yes, I agree: sorting out threads, breaking them up into heavy and light parts, and then sending them to completely different types of cores seems a lot less complicated than sending them to the same type of cores... not!

Either way, Intel already did a lot of work on Thread Director, so no matter how complicated it is, if it can provide additional performance they would be crazy not to use it.
 
So the CPUs they won't be making anymore?
This seems like a willful misunderstanding. If you're not going to be serious, I won't take your replies seriously.

They make standards like DirectX and Vulkan so that devs DON'T query hardware. They write one codepath for one API instead of having to write one for every piece of hardware in existence.
This is not accurate.

Let's look at Vulkan. It has four versions: 1.0, 1.1, 1.2, and 1.3. Then, within each version there are oodles of extensions.

"Beyond the core specification, everything in Vulkan is optional, all 280+ extensions. Meaning that for developers who are building applications that tap into features that go beyond the core spec – which has quickly become almost everything not written for a smartphone – there hasn’t been good guidance available on what extensions are supported on what platforms, or even what extensions are meant to go together.

The freedom to easily add extensions to Vulkan is one of the standard’s greatest strengths, but it’s also a liability if it’s not done in an organized fashion. And with the core spec essentially locked at the (OpenGL) ES 3.1 level for the time being, this means that the number of possible and optional extensions has continued to bloom over the last 6 years."

Source: https://www.anandtech.com/show/1722...released-fighting-fragmentation-with-profiles

So, game or app developers first have to decide which version of Vulkan they're going to require. Then, they're faced with the dilemma of either being restricted to the core functionality of OpenGL ES 3.1, which is rather primitive (to say the least), or depending on some mix of extensions.

The idea that you just target Vulkan and it somehow magically runs everywhere suggests that you don't really "know computers". Hardware differs, software is complex, and versioning is always a pain.
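
For the record, here's roughly what that query-then-branch pattern looks like against the actual Vulkan API; a bare-bones sketch, assuming the Vulkan loader and headers are installed (the extension I check for is just an example):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <vulkan/vulkan.h>

/* Ask the Vulkan implementation whether an instance extension exists. */
static int instance_extension_supported(const char *name)
{
    uint32_t count = 0;
    vkEnumerateInstanceExtensionProperties(NULL, &count, NULL);

    VkExtensionProperties *props = malloc(count * sizeof *props);
    vkEnumerateInstanceExtensionProperties(NULL, &count, props);

    int found = 0;
    for (uint32_t i = 0; i < count; i++) {
        if (strcmp(props[i].extensionName, name) == 0) {
            found = 1;
            break;
        }
    }
    free(props);
    return found;
}

int main(void)
{
    /* Which codepath the app takes depends on what the query returns. */
    if (instance_extension_supported("VK_KHR_get_physical_device_properties2"))
        puts("extension present: take the extended codepath");
    else
        puts("extension missing: fall back to a core-only codepath");
    return 0;
}
```

Device-level extensions work the same way, except you query each VkPhysicalDevice. Either way, the app still has to carry a codepath for both outcomes.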
 
So, game or app developers first have to decide which version of Vulkan they're going to require. Then, they're faced with the dilemma of either being restricted to the core functionality of OpenGL ES 3.1, which is rather primitive (to say the least), or depending on some mix of extensions.

The idea that you just target Vulkan and it somehow magically runs everywhere suggests that you don't really "know computers". Hardware differs, software is complex, and versioning is always a pain.
This seems like a willful misunderstanding. If you're not going to be serious, I won't take your replies seriously.
You are not writing code directly for the hardware but for the API.
 
This seems like a willful misunderstanding. If you're not going to be serious, I won't take your replies seriously.
Parroting that back to me doesn't work if the rest of your reply is nonsensical. It's also kinda funny to see someone try to call that post "not serious".

You are not writing code directly for the hardware but for the API.
The article is talking about Vulkan - the API. Try learning a thing or two.
 
Parroting that back to me doesn't work if the rest of your reply is nonsensical. It's also kinda funny to see someone try to call that post "not serious".


The article is talking about Vulkan - the API. Try learning a thing or two.
So, game or app developers first have to decide which version of Vulkan they're going to require. Then, they're faced with the dilemma of either being restricted to the core functionality of OpenGL ES 3.1, which is rather primitive (to say the least), or depending on some mix of extensions.
So you are saying that you can in fact select a Vulkan version and stick with its core functionality... but I'm still wrong, because you can't stand being wrong.
 
So you are saying that you can in fact select a Vulkan version and stick with its core functionality... but I'm still wrong, because you can't stand being wrong.
In practice, Vulkan-based apps always use extensions. That's the normal way to use Vulkan. It's not "write once, run everywhere", as you characterized it. Only the most basic apps could get away with doing that.

The equivalent of what you're describing would be like compiling code to use baseline x86-64, from 20 years ago, and not using any AVX* functionality. Once you use any form of AVX, you need to do a query to see if it's supported. If using AVX10, you not only need to query its presence, but also the version of AVX10 you want to use. Then, if you want to use 512-bit operands, you also need to query if they're supported. Then, if you want to support both 256-bit and 512-bit AVX10 operands, you need to have separate code paths, because that's how AVX10 works.
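
Here's that exact detection sequence sketched in C, based on my reading of Intel's AVX10 spec linked earlier; the CPUID leaf/bit positions come from that document, so verify them against the current revision before trusting this:

```c
#include <stdio.h>
#include <cpuid.h>  /* GCC/Clang helper for the CPUID instruction */

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    /* Step 1: is AVX10 present at all?
     * Per the spec: CPUID.(EAX=07H,ECX=01H):EDX[19].
     * (A real implementation should first confirm the max supported leaf.) */
    __cpuid_count(0x07, 0x01, eax, ebx, ecx, edx);
    if (!(edx & (1u << 19))) {
        puts("no AVX10: fall back to AVX2 / AVX-512 paths");
        return 0;
    }

    /* Step 2: which version, and which vector widths? Leaf 24H, subleaf 0. */
    __cpuid_count(0x24, 0x00, eax, ebx, ecx, edx);
    unsigned version = ebx & 0xffu;          /* EBX[7:0]: AVX10 version */
    int has_256 = !!(ebx & (1u << 17));      /* EBX[17]: 256-bit vectors */
    int has_512 = !!(ebx & (1u << 18));      /* EBX[18]: 512-bit vectors */

    printf("AVX10.%u  256-bit:%d  512-bit:%d\n", version, has_256, has_512);

    /* Step 3: only now can the program choose between its 256-bit and
     * 512-bit codepaths, both of which must already exist in the binary. */
    return 0;
}
```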

Also, don't throw a tantrum. It'd be a lot easier if you did a little homework up-front (or simply didn't say things you're not sure about), instead of whinging on the back end when someone corrects errors in your posts.
 
Once you use any form of AVX, you need to do a query to see if it's supported. If using AVX10, you not only need to query its presence, but also the version of AVX10 you want to use. Then, if you want to use 512-bit operands, you also need to query if they're supported. Then, if you want to support both 256-bit and 512-bit AVX10 operands, you need to have separate code paths, because that's how AVX10 works.
Not according to Intel, but I guess you'll say that they are wrong as well.
You can be a dev and happily write code only for the AVX-512 equivalent in AVX10.
The converged version will run that at 256-bit length on all available cores, so you might lose some performance doing that, but you will also have a much easier time developing.

Then, if you want to, you can also do optimization and code specifically for the P-cores and 512-bit only.
The converged version of the Intel AVX10 vector ISA will include Intel AVX-512 vector instructions with an AVX512VL feature flag, a maximum vector register length of 256 bits, eight 64-bit mask registers, and new versions of 256-bit instructions supporting embedded rounding. This converged version will be supported on both P-cores and E-cores. While the converged version is limited to a maximum 256-bit vector length, Intel AVX10 itself is not limited to 256 bits, and optional 512-bit vector use is possible on supporting P-cores.
 
Not according to Intel, but I guess you'll say that they are wrong as well.
It's right there in the instruction encodings! From Intel, themselves!!

" For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer
listed in the above opcode table"

https://cdrdv2-public.intel.com/784267/355989-intel-avx10-spec.pdf

The operand width is baked right into the opcodes! There's no way to support multiple different operand sizes, other than by having multiple code paths (or doing runtime code-generation).

You can be a dev and happily write code only for the AVX-512 equivalent in AVX10.
It will only run on CPUs supporting AVX10/512.

The converged version will run that at 256-bit length on all available cores, so you might lose some performance doing that, but you will also have a much easier time developing.
That's not at all how it works.
 
It's right there in the instruction encodings! From Intel, themselves!!
" For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-​
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table"​

The operand width is baked right into the opcodes! There's no way to support multiple different operand sizes, other than by having multiple code paths (or doing runtime code-generation).


It will only run on CPUs supporting AVX10/512.


That's not at all how it works.
What you linked is about the transitional AVX10.1, which will only be on servers and still works in the old ways; 10.2 will be on both server and client and will work on P+E cores together.

Intel AVX10 Version 1 will be introduced for early software enablement and supports the subset of all the Intel AVX-512 instruction set available as of future Intel Xeon processors with P-cores, codenamed Granite Rapids, that is forward compatible to Intel AVX10. This version will not include the new 256-bit vector instructions supporting embedded rounding or any of the new instructions and will serve as the transition base version from Intel AVX-512 to Intel AVX10. Intel AVX10 Version 2 will include the 256-bit instruction forms supporting embedded rounding as well as a suite of new Intel AVX10 instructions covering new AI data types and conversions, data movement optimizations, and standards support. All new instructions will be supported at 128-, 256-, and 512-bit vector lengths with limited variances. All Intel AVX10 versions will implement the new versioning enumeration scheme.
 
What you linked is about the transitional AVX10.1, which will only be on servers and still works in the old ways; 10.2 will be on both server and client and will work on P+E cores together.
No, the misunderstanding is yours. The instruction encoding isn't changing between AVX10.1 and AVX10.2. The operand size is baked right into the opcodes.

It's detailed here:

In bit 6 of byte 3 (the EVEX L' vector-length bit). Hard-coded directly in the instruction stream.

Furthermore, there's no mechanism for simultaneously executing the same code with one operand size on P-cores and a different operand size on E-cores.
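
To illustrate "baked right into the opcodes", here's a toy decoder for that field. The prefix bytes below are made up for the example; the point is that a 256-bit and a 512-bit form of the same instruction are literally different byte sequences:

```c
#include <stdio.h>

/* evex points at a 4-byte EVEX prefix: 0x62, P0, P1, P2.
 * The vector length lives in L'L, bits 6:5 of the 4th byte (P2). */
static int evex_vector_length(const unsigned char *evex)
{
    unsigned ll = (evex[3] >> 5) & 0x3u;
    switch (ll) {
    case 0:  return 128;
    case 1:  return 256;
    case 2:  return 512;
    default: return -1;  /* 11b is reserved */
    }
}

int main(void)
{
    /* Two hypothetical prefixes that differ only in the L'L field. */
    const unsigned char evex_256[4] = { 0x62, 0xF1, 0x74, 0x28 };  /* L'L=01 */
    const unsigned char evex_512[4] = { 0x62, 0xF1, 0x74, 0x48 };  /* L'L=10 */

    printf("%d-bit\n", evex_vector_length(evex_256));  /* prints 256-bit */
    printf("%d-bit\n", evex_vector_length(evex_512));  /* prints 512-bit */
    return 0;
}
```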
 