A leak gives some interesting details about Intel's first discrete GPU.
Intel's First Discrete GPU, the DG1, Listed with 96 EUs
> as many as Tiger Lake's integrated graphics

This is the key point. If these EUs are comparable to those of their traditional iGPUs, you're only talking about a 768-shader GPU (96 EUs × 8 ALU lanes each). So, maybe about GTX 1050 Ti-level performance.
> Although by working together with the CPU, DG1 might double Tiger Lake's graphics performance.

Yeah, I'll believe it when I see it. IMO, it would make more sense to run physics on the iGPU and do rendering on the dGPU, for instance. Perhaps some engines already do that, since they wouldn't even have to be the same architecture.
> The information originated from the Eurasian Economic Union on Wednesday

Not saying it's fake news, or anything, but...
The Member-States of the Eurasian Economic Union are the Republic of Armenia, the Republic of Belarus, the Republic of Kazakhstan, the Kyrgyz Republic and the Russian Federation.
> The driver also mentioned DG2 with Gen 12 HP and possibly up to 512 EUs.

This is nuts! And I really mean nuts, because that's not the most efficient way to scale. As I said, they should really make the EUs wider before making them more numerous. There's a reason for SIMD's enduring popularity. You'd think Intel, with their 16-wide AVX-512, would get this.
> This is nuts! And I really mean nuts, because that's not the most efficient way to scale. As I said, they should really make the EUs wider before making them more numerous. There's a reason for SIMD's enduring popularity. You'd think Intel, with their 16-wide AVX-512, would get this.

Adding more of already-available units is much more efficient than designing something new from the ground up...
> Their Xe GPU is probably going to occupy more layout than all their cores combined. Interesting design choices...

You make the choices that should get you the most sales in your target markets. As for other stuff accounting for more die area than the cores themselves, you can say the same thing about Ryzen: Zen 1/1+ had cores occupying about a third of the die area, and that goes down to around a quarter with Zen 2, largely thanks to the doubled L3 cache. The cores are getting smaller, but the amount of infrastructure required to keep them fed is increasing.
> I honestly think Intel is making a mistake, though. They really need to focus on their CPUs, considering AMD is smashing it right now.

The CPU stuff is already done; Intel is two generations ahead on architecture. What is bogging Intel down is lacking the process required to make it actually work without the substantial clock-frequency penalty seen on mobile Ice Lake vs. mobile Coffee Lake. The Tiger Lake ES leaks look promising; it may very well give Zen 3 a run for its money.
> Adding more of already-available units is much more efficient than designing something new from the ground up...

They've been working on this for... how long? Since probably 2016, at least, with Gen11 as a waypoint.
Nearly every instruction field, opcode, and register type is updated, and there are other big changes, like removing the hardware register scoreboard logic (it's now up to the compiler to ensure data coherency between register reads and writes) and adding a new sync hardware instruction.
> I honestly think Intel is making a mistake, though. They really need to focus on their CPUs, considering AMD is smashing it right now.

Intel's 2018 revenue was $70.8 B; AMD's was $6.48 B. The two companies are on completely different scales.
> I'm just wondering why they think an entry-level card with 96 EUs will be competitive, when you consider that Radeon VII has only 60 CUs and even the mighty Titan V has just 80 SMs.

Why are you expecting Intel's 96 EUs to be competitive with a VII or a Titan when Intel itself is labeling it as entry-level, which would be more along the lines of an RX 570/RX 5500/GTX 1650 at best? It isn't the number of EUs that matters; it's how much stuff you cram into them (not very much, for DG1) and how fast you can run them.
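For a rough sense of scale, a back-of-envelope estimate (the clock speed is purely an assumption, since none has leaked): 96 EUs × 8 FP32 lanes each = 768 lanes; at, say, 1.5 GHz, with an FMA counting as two operations, that's 768 × 2 × 1.5 GHz ≈ 2.3 TFLOPS. That's GTX 1650 territory, nowhere near Radeon VII's ~13 TFLOPS.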
> Why are you expecting Intel's 96 EUs to be competitive with a VII or a Titan when Intel itself is labeling it as entry-level, which would be more along the lines of an RX 570/RX 5500/GTX 1650 at best?

No, that wasn't my point. Did you even read my first post in this thread? It was my only post before the one to which you replied. I don't feel like I should have to repeat myself, just because you can't be bothered to scroll up to get some more context.
> It isn't the number of EUs that matters,

Oh, but it does. Their existing iGPUs have dual-issue EUs. So, by scaling up the number of EUs before making them wider, they're burning power & die space to redundantly fetch, decode, and schedule the same instructions that AMD or Nvidia would dispatch within a wider SIMD.
> it's how much stuff you cram into them

With 512 of them in their DG2, one doesn't get the impression it'll be very much.
> and how fast you can run them.

This is somewhat orthogonal to the question at hand.
> Having more EUs may have some amount of overhead, but so does making EUs wider: you need wider everything to match (I/O from registers, I/O from local buffers, I/O from cache, register count, etc.)

No, that's not overhead. If doubling the SIMD width costs twice the area but delivers 2x the benefit, then it doesn't count as overhead.
> I would expect Intel to model its architectures in C or some other language to determine the optimal combination of EU count and EU width for its 3D rendering pipeline by throwing scenes grabbed from actual games and apps at it.

I'd hope so.
> I'm fairly sure Intel did its homework before settling on plentiful narrow EUs instead of fewer, fatter ones.

You read too much into the current architecture, at your peril. It has a long legacy, and I'm fairly certain Intel drank the Kool-Aid about how GPU compute was going to be the next big thing. They certainly didn't provision the fp64 and int32 capabilities of iGPUs through Gen9.5 based on any games.
> No, that's not overhead. If doubling the SIMD width costs twice the area but delivers 2x the benefit, then it doesn't count as overhead. And it wouldn't cost twice the area per EU, because a lot of the execution & control logic is constant with respect to SIMD width. SIMD scales very well - especially in graphics.

The control logic may be mostly unchanged, but access to shared registers and buffers grows at a rate of N, while the data-routing matrices - for things like swizzle operations, and getting the output of any given unit to the input of whatever other units require that result (e.g., register-store bypass to reduce latency) - scale with N^2.
> The control logic may be mostly unchanged, but access to shared registers and buffers grows at a rate of N, while the data-routing matrices - for things like swizzle operations, and getting the output of any given unit to the input of whatever other units require that result (e.g., register-store bypass to reduce latency) - scale with N^2.

You're thinking of these like they're CPUs, but they're not. You don't do swizzle operations on GPU SIMD words, because the whole programming model is to keep the lanes independent. That's why Nvidia likes to call each SIMD lane a "core" and the instance of data that it's processing a "thread".
T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);
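For a concrete (and purely illustrative - the kernel and its name are mine, not from CUDA's docs) sketch of how those intrinsics get used: a warp-wide sum where every lane reads another lane's value, with no packed-vector swizzle anywhere in sight.

__global__ void warpSum(const float* in, float* out)
{
    // Each of the warp's 32 lanes loads one element.
    float v = in[threadIdx.x];

    // Butterfly reduction: at each step, every lane fetches a value
    // held by another lane. This is lane-to-lane ("thread")
    // communication, not a swizzle of a packed vector register.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffff, v, offset);

    // All lanes now hold the full sum; lane 0 writes it out.
    if (threadIdx.x == 0)
        *out = v;
}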
Shared Memory
Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.

To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.

However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.
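To make the bank mechanics concrete, here's a minimal CUDA sketch (mine, not from the quoted docs) of the classic padding trick: a 32x32 shared-memory tile padded to 33 columns, so reading down a column hits 32 different banks instead of causing a 32-way conflict. It assumes a 32x32 thread block and a square matrix whose dimension n is divisible by 32.

__global__ void transposeTile(const float* in, float* out, int n)
{
    // The +1 column of padding shifts each row by one bank, so a
    // column access touches 32 distinct banks (conflict-free)
    // instead of hitting one bank 32 times.
    __shared__ float tile[32][33];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    // Write out the transposed tile; these reads walk down a column.
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}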
> You're thinking of these like they're CPUs, but they're not. You don't do swizzle operations on GPU SIMD words, because the whole programming model is to keep the lanes independent.

No swizzle on GPUs? So you have never done something like "out = f.xyyz * g.wyxz + h.w"? Guess that would be why Nvidia says it is an "often overlooked" optimization - still very much there. That's one 4x4 matrix for each swizzleable operand, to present any part of any float4 to any FMA input.
Of course, you still have to route every SIMD lane to the port of every bank, and that obviously doesn't scale linearly, but I doubt it's more than N log N.
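Rough numbers, just to bound that claim: a full crossbar between 32 lanes and 32 bank ports is on the order of 32^2 = 1,024 crosspoints, while a log-stage network (butterfly/omega) needs on the order of 32 × log2(32) = 160 switch points - the trade-off being that some access patterns then take multiple passes.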
> No swizzle on GPUs? So you have never done something like "out = f.xyyz * g.wyxz + h.w"? Guess that would be why Nvidia says it is an "often overlooked" optimization - still very much there. That's one 4x4 matrix for each swizzleable operand, to present any part of any float4 to any FMA input.

No, you still don't get it. That's not how they use SIMD. They regard lane-swap operations as a form of "thread communication". They treat each SIMD lane as a scalar processor, which means they don't need lane-swaps to implement that.
It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.
> No, you still don't get it. That's not how they use SIMD.

That was taken right out of Nvidia's GPU Gems 2, chapter 35.
> That was taken right out of Nvidia's GPU Gems 2, chapter 35.

I'm not saying it wasn't - just that you don't need lane-swaps for it, if those elements aren't all packed into different lanes of a single SIMD word. That's the part you're missing.
> I'm not saying it wasn't - just that you don't need lane-swaps for it, if those elements aren't all packed into different lanes of a single SIMD word. That's the part you're missing.

Vector registers and data are packed as xyzw. Unless all you are doing is scalar and dot products, you will need to swizzle at some point, be it explicitly or implicitly, via a convenient keyword that abstracts boilerplate math like the vector cross-product away. Quite a bit of swizzling is involved in the 3D coordinate-rotation matrices, using either the Euler-angles method or the vector-to-vector one, which is kind of an important part of transforming geometry and applying viewport perspective before rendering it.
> Vector registers and data are packed as xyzw. Unless all you are doing is scalar and dot products, you will need to swizzle at some point,

Beware, this will probably blow your friggin' mind: you can implement vector arithmetic on scalar processors!
> You may not be using swizzles (at least not directly) for whatever it is you are doing, but it is essential to much of the math at the core of what general-purpose shaders need to be able to do.

Then what the heck was the point of all that?
> Now, imagine treating a wide SIMD machine as a collection of scalar machines. Boom. Your swizzles just disappeared! Each component in your vectors is now in the same lane of a separate vector register! Instead of swizzles, you just reference a different register!

The swizzles don't disappear; you merely expanded them. From a hardware point of view, you still need to pull the same xyz values from registers, except now you need to hog six separate scalar registers to do a single 3D cross-product operation instead of two vector registers, and you have three products to write back to the register file instead of only one.
> If you want to scale up to large SIMD sizes, the first thing you don't do is anything that doesn't scale well.

There is no point in objecting to swizzling on GPUs, as it is nothing more than an ALU input-mux manipulation; it is there, and free, whether you use it or not.
> The swizzles don't disappear; you merely expanded them.

*sigh*
> From a hardware point of view, you still need to pull the same xyz values from registers, except now you need to hog six separate scalar registers to do a single 3D cross-product operation instead of two vector registers,

No, no, no. You're not getting this.
> There is no point in objecting to swizzling on GPUs, as it is nothing more than an ALU input-mux manipulation; it is there, and free, whether you use it or not.

GPUs are optimized for power-efficiency and area-efficiency. That's what lets them scale up. So, you have to do everything according to those objectives.
> No, no, no. You're not getting this.

Your last reply clearly shows that you have failed at making the distinction between vectors in the SIMD sense and vectors in the MATHEMATICAL or spatial sense of an n-dimensional direction - mainly 3D, in GPUs' case. Sure, you can do the same calculations by treating each part of a 3D vector individually and going out of your way to re-arrange your data accordingly, but it makes no sense on hardware designed to handle any 3x3/4x4 FMA as an atomic operation. Chewing through 3D cross-products and similar operations is what GPUs got vector units for, long before GPGPU became a thing.
> Your last reply clearly shows that you have failed

Aw, no need to get crabby.
> Sure, you can do the same calculations by treating each part of a 3D vector individually and going out of your way to re-arrange your data accordingly,

Which is what AMD and Nvidia actually do.
> but it makes no sense on hardware designed to handle any 3x3/4x4 FMA as an atomic operation.

Which is something that exists only in your imagination.
> Chewing through 3D cross-products and similar operations is what GPUs got vector units for, long before GPGPU became a thing.

It's so funny how you just decided this. Instead of making assumptions about how they work, why don't you try actually reading?