[citation][nom]markem[/nom]Wow, so nvidia wants to follow apple (The biggest patent troll), maybe nvidia should learn something from current events (aka samsung vs apple law suits), especially if they have not even created the tech or maybe intel and amd should start applying for patents, then we will see how far CPU and gpu progresses...nVidia wants to kill progression and competition as is apple currently (In the end - troll always loses)[/citation]
How is this trolling? This is Nvidia patenting a new basis for their next architecture. This is no worse than Nvidia patenting Fermi and Kepler.
[citation][nom]kronos_cornelius[/nom]Nvidia is definitely no patent troll... To accuse the company of such is to confuse the meaning of the word.I don't think I get the patent from this short description. Given that they have general processing cores (CUDA) and graphics cores(regular GL), is this patent only for graphics rendering ? or general processing ?[/citation]
This should be useful for both gaming and compute workloads. Compute already tends to scale across many cores far better than gaming does, so if anything, something like this probably matters more for gaming than for compute.
I'll try to explain this a little. If I make a GPU with one core, it won't have multi-core scaling issues. If I make a GPU with 64 cores, it probably still won't have scaling problems. However, every time you add more cores, the distance between each core and the rest of the hardware on the GPU (measured in the wiring and transistors a signal has to cross) increases, and eventually that leads to scaling problems. I.e., if you have a GPU with 1024 cores arranged in one monolithic block, the cores furthest from whatever they need to talk to (memory controllers and other fixed-function hardware, or even the cores on the far side of the chip) have to wait a long time to communicate with those components. That hurts scaling (among the other causes of the scaling drop-off).
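To put a rough (and completely made-up) number on that, here's a toy model in Python: drop N cores on a square grid and measure the average distance to a memory controller sitting on one edge of the die. None of this is real hardware data; the point is just that the average trip keeps growing (roughly with the square root of the core count) as you add cores.

[code]
# Toy model, not real hardware: N cores on a square grid, one memory
# controller in the middle of the bottom edge. Average Manhattan distance
# stands in for wire length / hop count.
import math

def avg_distance_to_controller(num_cores):
    side = math.isqrt(num_cores)            # assume a square layout of cores
    cx, cy = side / 2.0, 0.0                # controller on the bottom edge
    total = 0.0
    for x in range(side):
        for y in range(side):
            total += abs(x - cx) + abs(y - cy)
    return total / (side * side)

for n in (64, 256, 1024, 4096):
    print(n, "cores ->", round(avg_distance_to_controller(n), 1), "hops on average")
[/code]

The exact numbers mean nothing; what matters is that they keep climbing as the core count goes up.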
One thing that AMD and Nvidia have done to reduce this problem is arrange the shader cores in blocks (i.e. Kepler has blocks of 192 shaders, Fermi had blocks of 32 (48 in some of the smaller chips), and Cayman and GCN group theirs into blocks of 64). By using multiple smaller blocks instead of a single large block, scaling improves somewhat, because the cores mostly only need to talk to other parts of their own block instead of hardware on the other side of the GPU.
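Here's a crude illustration of why that helps, using the same toy square-grid model as above (my numbers, not anyone's real floorplan): compare the average trip between two random points across a GK104-sized pool of cores with the average trip inside a single Kepler-sized block.

[code]
# Rough sketch: in a flat sea of cores, a unit may need to reach anything on
# the chip, so the average trip scales with the chip. If most traffic stays
# inside a block, the average trip scales with the block instead.
import math

def avg_random_hop(num_units):
    """Average Manhattan distance between two random points on a
    sqrt(N) x sqrt(N) grid (roughly 2/3 of the side length)."""
    side = math.isqrt(num_units)
    return 2.0 * side / 3.0

chip_cores = 1536        # a GK104-sized pool of shaders
block_size = 192         # a Kepler-style block

print("flat chip, average hops:       ", round(avg_random_hop(chip_cores), 1))
print("inside one block, average hops:", round(avg_random_hop(block_size), 1))
[/code]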
However, with so many blocks becoming necessary, or with the blocks themselves becoming very large (Kepler's 192-core blocks are huge as these things go, yet GK104 still needs eight of them, each almost as large by transistor count as one of the lower-end Fermi GPUs), this is no longer enough to keep scaling up. A hierarchical structure can improve scaling further. For example, look at Bulldozer. Sure, it's called a modular architecture, but think about it: instead of just cores, it has modules. That's like adding an additional hierarchical level to the CPU rather than just throwing more cores into the same level. It's not really the same as what Nvidia is trying to do (it's more comparable to the change from a single block of cores to multiple blocks), but look at how it affected scaling. Under Windows 8, which has a scheduler better optimized for it, Bulldozer shows impressive performance scaling across many cores (even if each core is slower than a Phenom II core, which was already pretty slow by today's standards compared to a Nehalem core, let alone a Sandy Bridge or Ivy Bridge core).
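If you want a feel for why adding yet another level helps, here's the same kind of back-of-the-envelope math with a completely hypothetical traffic split (the 90/9/1 figures and the sizes are mine, not from AMD, Nvidia, or the patent):

[code]
# Hypothetical: 90% of traffic stays inside a block, 9% inside a cluster of
# blocks, 1% crosses the whole chip. The expected trip barely grows compared
# to a flat design where every trip is potentially chip-wide.
import math

def avg_random_hop(num_units):
    side = math.isqrt(num_units)
    return 2.0 * side / 3.0                  # avg distance on a square grid

block, cluster, chip = 192, 768, 1536        # made-up sizes, just for scale
traffic = {"block": 0.90, "cluster": 0.09, "chip": 0.01}

expected = (traffic["block"]   * avg_random_hop(block) +
            traffic["cluster"] * avg_random_hop(cluster) +
            traffic["chip"]    * avg_random_hop(chip))

print("flat design, average trip:         ", round(avg_random_hop(chip), 1))
print("hierarchical design, expected trip:", round(expected, 1))
[/code]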
What this style of computing is supposed to be like (as far as I can tell) is that it will change the way the cores and the rest of the hardware talk to each other and process the data. It makes the scheduling process more hierarchical at the hardware level. I.e., instead of each block being a semi-independent component that does more or less all of the processing steps for the data fed into it (say, a portion of the screen's pixels), each *block* would do all of the processing for one specific part of the schedule. One block does something like the first part of the work, then hands the data to the next block for the next part of the work, and so on. This could improve scaling because it should reduce the amount of data that gets shuffled all over each block and across the GPU, and it makes the data movement more uniform and more point-to-point (instead of data going from anywhere on the GPU to anywhere else on the GPU, it goes from one block to the next block in line, giving a more ordered and focused path from start to finish).
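In software terms it would look something like a fixed pipeline of stages, with each block owning exactly one stage. The stage names below are my own guesses, nothing from the patent:

[code]
# Hand-wavy sketch of "each block owns one part of the schedule": work flows
# block -> block in a fixed order instead of every block doing every stage
# and scattering traffic across the whole chip.

def vertex_stage(item):        # pretend block 0 only ever does this
    return {"pos": item * 2}

def raster_stage(item):        # pretend block 1 only ever does this
    return {"pixels": [item["pos"], item["pos"] + 1]}

def shade_stage(item):         # pretend block 2 only ever does this
    return [p * 0.5 for p in item["pixels"]]

PIPELINE = [vertex_stage, raster_stage, shade_stage]

def run(work):
    for stage in PIPELINE:     # data always moves to the *next* block,
        work = stage(work)     # never back across the chip
    return work

print(run(3))                  # -> [3.0, 3.5]
[/code]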
So, this would improve parallelism (pretty much multi-core scaling, i.e. how well the load of each step in the schedule splits across multiple more or less identical units, such as splitting a job across many shader cores), because everything that needs to talk will be right next to each other, and the data transfers can be optimized much more aggressively because they become uniform and predictable. For example, a bus such as the extremely fast ring buses that Intel uses (another notable example being the Cell chip in the PS3, if I remember correctly) could be put to use.
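Here's a tiny sketch of why a ring suits that kind of traffic (again, toy code, not anything Intel or Nvidia actually does): a strict block-to-next-block pipeline only ever needs one hop per transfer, while random all-to-all traffic on the same ring averages several hops.

[code]
# Toy one-directional ring of 8 blocks. A block->next-block pipeline always
# costs exactly 1 hop; random all-to-all traffic averages (N - 1) / 2 hops.
NUM_BLOCKS = 8

def ring_hops(src, dst, n=NUM_BLOCKS):
    """Hops needed going one way around the ring from src to dst."""
    return (dst - src) % n

# Pipeline traffic: every transfer is exactly one hop.
print([ring_hops(i, (i + 1) % NUM_BLOCKS) for i in range(NUM_BLOCKS)])

# Random all-to-all traffic on the same ring: 3.5 hops on average here.
total = sum(ring_hops(s, d) for s in range(NUM_BLOCKS) for d in range(NUM_BLOCKS))
print(total / NUM_BLOCKS ** 2)
[/code]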
However, I'm pretty much grasping at straws at this point, so if an engineer who knows more about this could come in and clear things up for us, I think I can speak for most of the commenters on this article when I say we'd be grateful.