News Intel's First Discrete GPU, the DG1, Listed with 96 EUs

twotwotwo

Honorable
Even if nothing in the silicon were inherently better, a discrete card might be able to run a bit faster just because it gets its own power and thermal budget apart from the CPU's. It could also get a latency/BW boost from RAM on the card, especially GDDR or HBM but even eDRAM like Crystalwell.

If they ship it reasonably close to their schedule in sufficient volume (to be seen given manufacturing issues), and the initial pricing and marketing aren't delusional, they've probably successfully wedged their way into the discrete GPU market. It doesn't have to be the new performance leader, just offer a legit jump over iGPUs of the time at a proportional price.
 

bit_user

Titan
Ambassador
as many as Tiger Lake’s integrated graphics.
This is the key point. If these EUs are comparable to those of their traditional iGPUs, you're only talking about a 768-shader GPU. So, maybe about GTX 1050 Ti-level performance.

Of course, they could always make the new EUs much wider... which would only be reasonable, considering how much narrower they are than everybody else (8 vs 32 or 64).

Although by working together with the CPU, DG1 might double Tiger Lake's graphics performance.
Yeah, I'll believe it when I see it. IMO, it would make more sense to run physics on the iGPU and do rendering on the dGPU, for instance. Perhaps some engines already do that, since they wouldn't even have to be the same architecture.

The information originated from the Eurasian Economic Union on Wednesday
Not saying it's fake news, or anything, but...

From http://www.eaeunion.org/?lang=en#about
The Member-States of the Eurasian Economic Union are the Republic of Armenia, the Republic of Belarus, the Republic of Kazakhstan, the Kyrgyz Republic and the Russian Federation.
...interesting.

The driver also mentioned DG2 with Gen 12 HP and possibly up to 512 EUs.
This is nuts! And I really mean nuts, because that's not the most efficient way to scale. As I said, they should really make the EUs wider before making them more numerous. There's a reason for SIMD's enduring popularity. You'd think Intel, with their 16-wide AVX-512, would get this.
 
This is nuts! And I really mean nuts, because that's not the most efficient way to scale. As I said, they should really make the EUs wider before making them more numerous. There's a reason for SIMD's enduring popularity. You'd think Intel, with their 16-wide AVX-512, would get this.
Adding more of already available units is much more efficient than designing something new from the ground up...
Let them release their first gen; they will make wider units in the future.
 

JayNor

Honorable
Intel states 8K60 video encoding on Tiger Lake. Their Xe GPU is probably going to occupy more die area than all their cores combined. Interesting design choices...
 

InvalidError

Titan
Moderator
Their Xe GPU is probably going to occupy more die area than all their cores combined. Interesting design choices...
You make the choices that should get you the most sales in your target markets. As for other stuff accounting for more die area than the cores themselves, you can say the same thing about Ryzen: Zen/Zen+ had cores occupying about a third of the die area, and that goes down to around a quarter with Zen 2, largely thanks to the doubled L3 cache. The cores are getting smaller, but the amount of infrastructure required to keep them fed is increasing.
 

SBnemesys

Commendable
From all the leaks, I am far from impressed. But I won't make any judgements until we get to see the final product. I honestly think Intel is making a mistake, though. They really need to focus on their CPUs, considering AMD is smashing it right now.
 

InvalidError

Titan
Moderator
I honestly think Intel is making a mistake, though. They really need to focus on their CPUs, considering AMD is smashing it right now.
The CPU stuff is already done; Intel is two generations ahead on architecture. What is bogging Intel down is lacking the process required to make it actually work without the substantial clock frequency penalty seen on mobile Ice Lake vs mobile Coffee Lake. The Tiger Lake ES leaks look promising and may very well give Zen 3 a run for its money.
 

bit_user

Titan
Ambassador
Adding more of already available units is much more efficient than designing something new from the ground up...
They've been working on this for... how long? Since probably 2016, at least, with Gen11 as a waypoint.

And their iGPUs were not designed to scale up, so a lot of redesign was clearly necessary.

At the ISA level, they've already introduced major incompatibilities:

Nearly every instruction field, opcode, and register type is updated, and there are other big changes, like the removal of the hardware register scoreboard logic - which now leaves it up to the compiler to ensure data coherency between register reads and writes - and a new sync hardware instruction.


Compared to all that, doubling or quadrupling their SIMD width should be easy.

Even on a superficial level, it wouldn't make sense for Intel to release something that's not competitive, simply because it's "easy". That's how companies lose large amounts of money and how product lines get cancelled. They can't afford to go to market with something they're not confident will be competitive.

I'm just wondering why they think an entry level card with 96 EUs will be competitive, when you consider that Radeon VII has only 60 CUs and even the mighty Titan V has just 80 SMs. What AMD and Nvidia both know is that each unit adds overhead. So, it's important to strike the right balance between the number and their width.
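For rough scale, here's my own back-of-the-envelope lane math behind that comparison (counting only FP32 lanes, ignoring clocks and per-lane throughput, and assuming 8 lanes per EU and 64 per CU/SM):

Code:
// Back-of-the-envelope FP32 lane counts (assumed: 8 lanes/EU, 64 lanes/CU or SM)
#include <stdio.h>

int main(void) {
    int dg1     = 96 * 8;    //  768 lanes across 96 EUs
    int radeon7 = 60 * 64;   // 3840 lanes across 60 CUs
    int titan_v = 80 * 64;   // 5120 lanes across 80 SMs
    printf("DG1: %d  Radeon VII: %d  Titan V: %d\n", dg1, radeon7, titan_v);
    return 0;
}

So Intel is spending more front ends than either of those big dies to feed a fraction of the lanes, which is exactly the balance I'm questioning.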

If you're going to offer counter-points, please try harder.
 

bit_user

Titan
Ambassador
I honestly think Intel is making a mistake, though. They really need to focus on their CPUs, considering AMD is smashing it right now.
Intel's 2018 revenues were $70.8 B. AMD's were $6.48 B. The two companies are on completely different scales.

Intel can walk and chew gum at the same time. They have something like five major divisions in the company. And they've been building GPUs or iGPUs for well over 20 years.


In fact, it would be a lot harder for Intel to redirect personnel and resources to their CPU team (or manufacturing, per I.E.'s point). Once an organization reaches a certain size, it's a lot easier for them to branch out into more areas than to have people stepping all over each other, trying to work on the same products. Ideas along these lines were most famously described in the classic book The Mythical Man-Month.


Finally, they need competitive iGPUs for the lucrative laptop market. Precisely because AMD is showing a resurgence on the CPU front, and has long had an edge on the GPU front, Intel can't afford not to have a competitive GPU offering if they want any hope of keeping that market locked up.
 

InvalidError

Titan
Moderator
I'm just wondering why they think an entry level card with 96 EUs will be competitive, when you consider that Radeon VII has only 60 CUs and even the mighty Titan V has just 80 SMs.
Why are you expecting Intel's 96 EUs to be competitive with VII or a Titan when Intel itself is labeling it as entry-level, which would be more along the lines of RX570/RX5500/GTX1650 at best? It isn't the number of EUs that matters; it is how much stuff you cram into them (not very much for DG1) and how fast you can run them.

Having more EUs may have some amount of overhead, but so does making EUs wider: you need wider everything to match (IO from registers, IO from local buffers, IO from cache, register count, etc.), and all of that extra width needs to get managed and scheduled too. I would expect Intel to model its architectures in C or some other language to determine the optimal combination of EU count and EU width for its 3D rendering pipeline by throwing scenes grabbed from actual games and apps at it. I'm fairly sure Intel did its homework before settling on plentiful narrow EUs instead of fewer, fatter ones.
 

bit_user

Titan
Ambassador
Why are you expecting Intel's 96 EUs to be competitive with VII or a Titan when Intel itself is labeling it as entry-level which would be more along the lines of RX570/RX5500/GTX1650 at best?
No, that wasn't my point. Did you even read my first post in this thread? It was my only post before the one to which you replied. I don't feel like I should have to repeat myself, just because you can't be bothered to scroll up to get some more context.

It isn't the number of EUs that matter,
Oh, but it does. Their existing iGPUs have dual-issue EUs. So, by scaling up the number of EUs before making them wider, they're burning power & die space to redundantly fetch, decode, and schedule the same instructions that AMD or Nvidia would dispatch within a wider SIMD.

it is how much stuff you cram into them
With 512 of them in DG2, one doesn't get the impression it'll be very much.

how fast you can run them.
This is somewhat orthogonal to the question at hand.

Having more EUs may have some amount of overhead but so does making EUs wider, need wider everything to match (IO from registers, IO from local buffers, IO from cache, register count, etc.)
No, that's not overhead. If doubling SIMD size costs twice the area but delivers 2x the benefit, then it doesn't count as overhead.

Of course, I'm not so naive as to think that you can double a datapath for only 2x the area, but whatever the overhead, it's definitely less than doubling the number of EUs, because a lot of the execution & control logic is constant with respect to SIMD width. SIMD scales very well - especially in graphics.

AMD and Nvidia have been doing this for a long time and seem to think the optimal SIMD size is at least 32.

I would expect Intel to model its architectures in C or some other language to determine the optimal combination of EU count and EU width for its 3D rendering pipeline by throwing scenes grabbed from actual games and apps at it.
I'd hope so.

I'm fairly sure Intel did its homework before settling on plentiful narrow EUs instead of fewer fatter ones.
You read too much into the current architecture, at your peril. It has a long legacy, and I'm fairly certain Intel drank the Kool-Aid about how GPU compute was going to be the next big thing. They certainly didn't provision the fp64 and int32 capabilities of their iGPUs through Gen9.5 based on any games.
 

InvalidError

Titan
Moderator
No, that's not overhead. If doubling SIMD size costs twice the area but delivers 2x the benefit, then it doesn't count as overhead. And it wouldn't cost twice the area per EU, because a lot of the execution & control logic is constant with respect to SIMD width. SIMD scales very well - especially in graphics.
The control logic may be mostly unchanged but access to shared registers and buffers grows at a rate of N while the data routing matrices for things like swizzle operations and getting the output of any given unit to the input of whatever other units require that result (for things like register store bypass to reduce latency) scale with N^2.
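A toy illustration of those growth rates (hypothetical numbers, just to show the shape of the curves): structures that only fan out to each lane grow with N, while an any-output-to-any-input routing matrix grows with N^2.

Code:
// Hypothetical scaling sketch: per-lane wiring (~N) vs. an any-to-any
// routing matrix (~N^2) as the lane count N doubles.
#include <stdio.h>

int main(void) {
    for (int n = 4; n <= 64; n *= 2)
        printf("N=%2d  per-lane ~%2d  full crossbar ~%4d crosspoints\n",
               n, n, n * n);
    return 0;
}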
 

bit_user

Titan
Ambassador
The control logic may be mostly unchanged but access to shared registers and buffers grows at a rate of N while the data routing matrices for things like swizzle operations and getting the output of any given unit to the input of whatever other units require that result (for things like register store bypass to reduce latency) scale with N^2.
You're thinking of these like they're CPUs, but they're not. You don't do swizzle operations on GPU SIMD words, because the whole programming model is to keep the lanes independent. That's why Nvidia likes to call each SIMD lane a "core" and the instance of data that it's processing as a "thread".

While Nvidia does support shuffling w/o a round-trip to shared memory, the capability is a lot more restrictive than you seem to think:

Code:
T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);

So, it's basically 4 data movement patterns:
  • broadcast
  • shift up
  • shift down
  • xor (butterfly)
No NxN random access.
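For anyone who hasn't used these, here's a minimal sketch of my own (not from any Nvidia doc) of the classic warp-wide sum reduction, built entirely out of the shift-down pattern above. Launch it as warp_sum<<<1, 32>>>(d_in, d_out) with 32 floats in d_in.

Code:
// Minimal warp-wide sum reduction using only __shfl_down_sync.
// Each of the 32 lanes contributes one element; lane 0 ends up with the total.
__global__ void warp_sum(const float *in, float *out)
{
    float v = in[threadIdx.x];                        // one element per lane
    for (int offset = 16; offset > 0; offset >>= 1)   // 16, 8, 4, 2, 1
        v += __shfl_down_sync(0xffffffffu, v, offset);
    if (threadIdx.x == 0)
        *out = v;                                     // lane 0 holds the sum
}

Note that even this only ever moves data by a fixed per-step offset; it never needs arbitrary lane-to-lane routing.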
Regarding access to on-chip memory:
Shared Memory
Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.
To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.
However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.
Of course, you still have to route every SIMD lane to the port of every bank, and that obviously doesn't scale linearly, but I doubt it's more than N log N.
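To make the bank-conflict part concrete, here's a minimal single-tile transpose sketch of my own (assuming the usual 32 banks of 4-byte words): the column-wise shared-memory read would be a 32-way conflict with a 32-wide row, and padding each row by one word spreads the column across all 32 banks.

Code:
// Single 32x32 tile transpose. The +1 padding makes the column-wise read
// tile[x][y] hit 32 different banks; with [TILE][TILE] it would hit one
// bank 32 times (a 32-way conflict). Launch with a dim3(32, 32) block.
#define TILE 32

__global__ void transpose_tile(const float *in, float *out)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 word of padding per row
    int x = threadIdx.x, y = threadIdx.y;

    tile[y][x] = in[y * TILE + x];           // coalesced read, conflict-free write
    __syncthreads();
    out[y * TILE + x] = tile[x][y];          // coalesced write, padded column read
}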
 

InvalidError

Titan
Moderator
You're thinking of these like they're CPUs, but they're not. You don't do swizzle operations on GPU SIMD words, because the whole programming model is to keep the lanes independent.

Of course, you still have to route every SIMD lane to the port of every bank, and that obviously doesn't scale linearly, but I doubt it's more than N log N.
No swizzle on GPUs? So you have never done something like "out = f.xyyz * g.wyxz + h.w"? Guess that would be why Nvidia says it is an "often overlooked" optimization - still very much there. That's one 4x4 matrix for each swizzleable operand to present any part of any float4 to any FMA input.

The size of a non-blocking switch is M x N: M incoming operands x N places those operands may need to get to. Double the number of shader units and you double the potential number of operands that need to get tapped and the number of possible destinations. Since warps operate on the premise of every thread having access to everything it needs whenever it gets scheduled, there isn't a whole lot of room to scale data-moving structures shared between shaders slower than that.

Data-moving structures are way up there with power efficiency among the reasons why GPUs only go to ~2 GHz.
 

bit_user

Titan
Ambassador
No swizzle on GPUs? So you have never done something like " out = f.xyyz * g.wyxz + h.w " ? Guess that would be why Nvidia says it is an "often overlooked" optimization, still very much there. That's one 4x4 matrix for each swizzleable operand to present any part of any float4 to any FMA input.
No, you still don't get it. That's not how they use SIMD. They regard lane-swap operations as a form of "thread communication". They treat each SIMD lane as a scalar processor, which means they don't need lane-swaps to implement that.

You're forgetting what SIMD originally means:

It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.


The way CPUs implement it is mostly not about SIMD programming, but mainly about optimizing vector and DSP operations. Intel just decided to start using the term when they created SSE.

Anyway, I never said you couldn't do lane-swap, just that it's not central to the programming model, the way you think it is. That's why they only support the limited communication patterns I listed.

Also, please cite your sources.
 

bit_user

Titan
Ambassador
Thank you.

That was taken right out of one of Nvidia's GPUGems2, chapter 35.
I'm not saying it wasn't - just that you don't need lane-swaps for it, if those elements aren't all packed into different lanes of a single SIMD word. That's the part you're missing.

Anyway, you might be on to something. Maybe the reason Intel's GPUs have been stuck on 4-wide SIMD is that they approached it from a CPU mindset, instead of a classical SIMD vector processor mindset. And without fundamentally breaking that model, they face the sort of scalability issues you mentioned. If so, that could be their undoing.

GPUs aren't that hard. Imagine if Intel, of all players, found a way to screw it up.
 

InvalidError

Titan
Moderator
I'm not saying it wasn't - just that you don't need lane-swaps for it, if those elements aren't all packed into different lanes of a single SIMD word. That's the part you're missing.
Vector registers and data are packed as xyzw. Unless all you are doing is scalar and dot products, you will need to swizzle at some point, be it explicitly or implicitly via a convenient keyword that abstracts boilerplate math like the vector cross-product away. Quite a bit of swizzling is involved in the 3D coordinate rotation matrices, using either the Euler angles method or the vector-to-vector one, which is kind of an important part of transforming geometry and applying viewport perspective before rendering it.

You may not be using swizzles (at least not directly) for whatever it is you are doing, but it is essential to much of the math at the core of what general-purpose shaders need to be able to do.
 

bit_user

Titan
Ambassador
Vector registers and data are packed as xyzw. Unless all you are doing is scalar and dot products, you will need to swizzle at some point,
Beware, this will probably blow your friggin' mind: you can implement vector arithmetic on scalar processors!

Now, imagine treating a wide SIMD machine as a collection of scalar machines. Boom. Your swizzles just disappeared! Each component in your vectors is now in the same lane of a separate vector register! Instead of swizzles, you just reference a different register!

If you want to scale up to large SIMD sizes, the first thing you don't do is anything that doesn't scale well.
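To make that concrete, here's a minimal sketch of my own (not Intel's or Nvidia's code) in the same CUDA style as the shuffle listing above: store the vectors as separate x/y/z arrays (structure-of-arrays) and let each thread own one whole 3D vector. The cross product that supposedly demands swizzles becomes plain scalar math per lane, with no lane-to-lane data movement anywhere.

Code:
// Structure-of-arrays cross product: thread i computes c[i] = a[i] x b[i].
// Every value a thread touches lives in its own registers, so the hardware
// never has to route data between SIMD lanes.
__global__ void cross3(const float *ax, const float *ay, const float *az,
                       const float *bx, const float *by, const float *bz,
                       float *cx, float *cy, float *cz, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x1 = ax[i], y1 = ay[i], z1 = az[i];
    float x2 = bx[i], y2 = by[i], z2 = bz[i];

    cx[i] = y1 * z2 - z1 * y2;   // ordinary per-lane multiplies and subtracts
    cy[i] = z1 * x2 - x1 * z2;
    cz[i] = x1 * y2 - y1 * x2;
}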

You may not be using swizzles (at least not directly) for whatever it is you are doing, but it is essential to much of the math at the core of what general-purpose shaders need to be able to do.
Then what the heck was the point of all that?

Obviously, it's important - that's why GLSL and HLSL both have specialized, short-hand syntax for it. I never meant that you didn't need the operation, at a logical level, simply that you didn't need to do it by actually swizzling SIMD lanes.

Now, by this point, I'm pretty sure that what Intel did was a fairly direct implementation of the virtualized GPU defined in the Direct3D API. That neatly explains why they went to a dual-issue model to reach 8-wide - instead of simply doubling their SIMD size - and haven't surpassed 8 lanes per EU since then. In the beginning, it probably seemed like a fine idea... perhaps AMD and Nvidia even started out here, too.
 

InvalidError

Titan
Moderator
Now, imagine treating a wide SIMD machine as a collection of scalar machines. Boom. Your swizzles just disappeared! Each component in your vectors is now in the same lane of a separate vector register! Instead of swizzles, you just reference a different register!
The swizzles don't disappear; you merely expanded them. From a hardware point of view, you still need to pull the same xyz values from registers, except now you need to hog six separate scalar registers to do a single 3D cross-product operation instead of two vector registers, and you have three products to write back to the register file instead of only one.

If you want to scale up to large SIMD sizes, the first thing you don't do is anything that doesn't scale well.
There is no point in objecting to swizzling on GPUs, as it is nothing more than an ALU input mux manipulation; it is there and free whether you use it or not.

Doubling the number of ALUs connected to a register file will double the number of data lines regardless of whether they are scalar or vector, no miracles to be had there. How much worse it gets beyond that depends on how many of the extra selectable registers are made available to each ALU.
 

bit_user

Titan
Ambassador
The swizzles don't disappear, you merely expanded them.
sigh

I only said the SIMD lane-swizzles disappeared. What do I have to do to make it clear that THE ONLY POINT I'M MAKING IS THAT YOU DON'T DO SIMD LANE SWIZZLES??? Not as a matter of course. Obviously neither the math nor algorithms change - just how you implement them.

From a hardware point of view, you still need to pull the same xyz values from registers, except now you need to hog six separate scalar registers to do a single 3D cross-product operation instead of two vector registers,
No, no, no. You're not getting this.

The concept is that you imagine the code as if it were scalar code, on a scalar machine. Then, lash 32 or 64 of those together, and slot them each into its own SIMD lane. So, what would be a single scalar operation now turns into N scalar operations, where N is the SIMD width.

There is no point in objecting to swizzling on GPUs as it is nothing more than an ALU input mux manipulation, it is there and free whether you use it or not.
GPUs are optimized for power-efficiency and area-efficiency. That's what lets them scale up. So, you have to do everything according to those objectives.

You have to drop the CPU mindset of trying to optimize single-thread performance. GPUs are about concurrency.
 

InvalidError

Titan
Moderator
No, no, no. You're not getting this.
Your last reply clearly shows that you have failed at making the distinction between vectors in the SIMD sense and vectors in the MATHEMATICAL or spatial sense of an n-dimensional direction, mainly 3D in GPUs' case. Sure, you can do the same calculations by treating each part of a 3D vector individually and going out of your way to re-arrange your data accordingly, but it makes no sense on hardware designed to handle any 3x3/4x4 FMA as an atomic operation. Chewing through 3D cross-products and similar operations is what GPUs got vector units for long before GPGPU became a thing.
 

bit_user

Titan
Ambassador
Your last reply clearly shows that you have failed
Aw, no need to get crabby.

I was always crystal clear that I was talking about GPU SIMD lanes. You're the one who conflated that with components of mathematical vectors. Don't blame me for that.

Sure, you can do the same calculations by treating each part of a 3D vector individually and going out of your way to re-arrange your data accordingly,
Which is what AMD and Nvidia actually do.

but it makes no sense on hardware designed to handle any 3x3/4x4 FMA as an atomic operation.
Which is something that exists only in your imagination.

Though I have yet to find a complete EU ISA reference, I've so far only seen them discuss dot product instructions, which is not something you find in either Nvidia or AMD.

Intel definitely has some "horizontal" instructions in there, deviating from a pure SIMD programming model. That's going to hold them back...

Chewing through 3D cross-products and similar operations is what GPUs got vector units for long before GPGPU became a thing.
It's so funny how you just decided this. Instead of making assumptions about how they work, why don't you try actually reading?

Maybe GPUs of the distant past worked differently, but those are about as relevant as vacuum tubes to this discussion.