OpenCL In Action: Post-Processing Apps, Accelerated


Th-z

Distinguished
May 13, 2008
74
0
18,630


Hmm, it's neither A nor B. It's two related numbers produced from the same test: apply enhancement(s) to a video and measure both CPU usage and render speed at the same time.



Real time is 100% = a full 30 fps. If you have enhancements enabled and the render speed is as fast as normal playback speed, that's 100%. 50% means half as fast; in other words, it's rendering at 15 fps (50% of normal playback speed), so it will take twice as much time to complete. So I think the author is right on this one.
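A tiny sketch of that arithmetic, using the article's one-minute-clip example (Python, purely illustrative):

[code]
# Render time implied by vReveal's percent-of-real-time metric.
clip_seconds = 60          # one-minute clip
percent_of_real_time = 50  # speed reported by vReveal

render_seconds = clip_seconds / (percent_of_real_time / 100)
print(render_seconds)      # 120.0 -> twice the clip length, i.e. two minutes
[/code]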
 

lucas1024

Distinguished
Oct 30, 2010
29
0
18,540
I love OpenCL and GPUs, and I really appreciate what you are trying to do with your extended coverage, but this article just doesn't cut it. For example, take statements like "...the sheer complexity of a discrete GPU is undeniably superior..." If complexity made things superior, then CPUs would rule all, and a discrete GPU is not really any more complex than an integrated one.

And then the results are spread over a billion graphs rather than compared side by side on the same graph. This could be written much more clearly and organized better.
Thanks for trying, though.
 

cangelini

Contributing Editor
Editor
Jul 4, 2008
1,878
9
19,795
[citation][nom]Th-z[/nom]William, on page "Benchmark Results: ArcSoft Total Media Theatre SimHD". After enabling GPU acceleration, most actually have their CPU utilizations increased. It seems counter-intuitive, can you explain why?[/citation]
When I was editing this, my hypothesis was that closer-to-real-time frame rates were achieved by offloading one big piece of the workload to the graphics cores, allowing the host processor to be better utilized. So yes, you end up with higher CPU utilization, but the division of labor simultaneously facilitates the performance you'd be looking for, too. Again, that's my guess.
 

DjEaZy

Distinguished
Apr 3, 2008
1,161
0
19,280
[citation][nom]deanjo[/nom]Umm, ya pretty much "just apple" from creation, to the open standard proposal, to getting it accepted, to influencing the hardware vendors to support it. Apple designed it so that it would be cross-platform to begin with; that was kind of the whole idea behind it.[/citation]
... that's not bad...
 

Rock_n_Rolla

Distinguished
Sep 28, 2009
209
0
18,710
Very nice... I hope application developers, particularly those in the "multimedia" scene, like Sony, for example, with its Vegas applications that use GPU acceleration through OpenCL to speed up digital content creation, further expand their support for this vendor-neutral API for a better, faster computing experience using their applications.

Also, for those companies engaged in high-definition digital effects and 3D rendering applications, adding support for OpenCL will really help their applications render complex 3D scenes where very fast computation is heavily needed. Using GPU acceleration to aid the processors and speed up render times will surely give a lot of positive results, especially for a company where delivering projects on time, or as quickly as possible, is the highest priority...
 

Th-z

Distinguished
May 13, 2008
74
0
18,630


Thanks for the reply, Chris. Yes, that's certainly possible. William only mentioned that enabling the GPU has an advantage over a CPU-only process, but we don't know how the five systems fare against each other. And Intel HD Graphics 3000 doesn't support OpenCL, so does the higher utilization mean the CPU is processing OpenCL work for the GPU? That probably doesn't make sense; I'm also guessing :D
 

bit_user

Titan
Ambassador
[citation][nom]alchemist07[/nom]APU should have a benefit over discrete GPU, there is an ADVANTAGE in sharing the same memory space since the GPU and CPU now write to the same space and don't need to transfer data between each others memory space.[/citation]It all depends on the workload. If the task involves memory-intensive processing by the GPU, with relatively small inputs & outputs, then the higher memory bandwidth of a dedicated GPU far outweighs the benefits provided by an APU's shared memory.

On the other extreme, if a task involves lots of communication between the GPU and CPU, then the lack of copies and lower latency communication could turn in favor of an APU design.

I'd like to point out that the compute-to-bandwidth ratio of a GPU already suggests it's starved for bandwidth (and why else do you think AMD and NVidia are messing around with exotic stuff like GDDR5 and 384-bit interfaces?). For single precision, the HD 7970 is something like 14:1 (GFLOPS/GBps). In the DSP world, they strive for something more balanced, ideally being able to read two numbers & write out one (this would be 1:12, since single-precision floats are 4 bytes).

I think the reason AMD hasn't been focusing on increasing raw GFLOPS, for the last 2 generations, is that the HD 5870's beastly floating point capacity was starved & otherwise under-utilized. As evidence for this, I point to the fact that the number of transistors between that card's Cypress XT and the HD 7970's Tahiti XT almost exactly doubled & core clock increased by 9%, yet single-precision floating-point performance increased by only 39%. Memory bandwidth increased by 72%. These numbers all point to a conclusion by AMD that their bottleneck (for their primary workloads, at least) is memory bandwidth, not compute.
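As a rough back-of-the-envelope check of those figures (a sketch in Python using the commonly quoted board specs; exact numbers vary a bit by source):

[code]
# Approximate published specs: single-precision GFLOPS and memory bandwidth in GB/s.
hd5870 = {"gflops": 2720.0, "gbps": 153.6}   # Cypress XT
hd7970 = {"gflops": 3789.0, "gbps": 264.0}   # Tahiti XT

# Compute-to-bandwidth ratio of the HD 7970 (GFLOPS per GB/s).
print(hd7970["gflops"] / hd7970["gbps"])         # ~14.4 : 1

# A "balanced" DSP-style target: read two floats, write one per FLOP
# = 12 bytes per FLOP, i.e. roughly a 1 : 12 ratio of GFLOPS to GB/s.

# Generation-over-generation gains, Cypress XT -> Tahiti XT.
print(hd7970["gflops"] / hd5870["gflops"] - 1)   # ~0.39 -> +39% compute
print(hd7970["gbps"] / hd5870["gbps"] - 1)       # ~0.72 -> +72% bandwidth
[/code]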

This also answers the question of why the GPUs in AMD & Intel APUs aren't faster or from the higher-end of AMD's lineup. The extra performance would mostly go to waste.
 

bit_user

Titan
Ambassador
[citation][nom]lucas1024[/nom]with statements like "...the sheer complexity of a discrete GPU is undeniably superior..." - if complexity made things superior, then CPUs would rule all[/citation]The GPU advantage comes from using a sea of simple, wide processors running at a more power-efficient clock frequency. Pair this with as much memory bandwidth as possible and you get vast amounts of raw computing power that can be harnessed by sufficiently parallel workloads.

At its core, the GPU approach is really one of simplicity. Where some complexity gets re-introduced is in the hardware scheduling, busses, caches, local memories, etc. But I'd have to agree that the GPU approach is to make the hardware simple & wide, placing the burden of complexity on the software. That's why CUDA and OpenCL are such important enabling technologies in helping software start to utilize the capabilities of the hardware.

Conversely, you could look at the modern x86 CPU as the opposite. The CPU is doing lots of work (at the cost of much complexity) to make life easier for the software. It translates the instruction set to a more efficient one, schedules instructions in parallel & out-of-order, predicts branches, runs multiple threads on the same core, runs at a more aggressive clock speed, and automatically prefetches data. All so that software can get the most benefit with the least amount of parallelism.
 

bit_user

Titan
Ambassador
[citation][nom]Antilycus[/nom]Apple is all about using closed systems just as much as MS

...

because Apple, while popular, is only going to kill open standards[/citation]I disagree, and I'm no Apple lover. I do think it's important to understand what they did and why. By all accounts, Apple was an influential force in driving OpenCL. If they're so fond of closed standards, as you say, then why would they do this?

Because Apple embraces open standards from its suppliers. This allows them to bid suppliers against each other and it gives them a common interface for GPGPU across all their platforms.

Apple would probably rather have gone the DirectCompute route, but they don't have enough market share to make a go of that. So, they compromised and decided that they had a better chance of success with an open standard than by trying to push their own proprietary one.
 

alchemist07

Distinguished
Feb 3, 2012
3
0
18,510
Hi Again,

Great article. Next time, could you include a bar or some other information to show the total power usage? The Sandy Bridge cores are quite efficient in terms of power, so it would be interesting to see what the power difference is between using a Llano at 50% and a Sandy Bridge at 100%.

 

peevee

Distinguished
Dec 5, 2011
58
0
18,630
The worst article ever.
"Rather than indicate frames per second (which pegs at 30 and stays there, telling us very little), vReveal spits back a percentage of real-time at which a render job is operating. This is probably a more meaningful number to the average user. For instance, if a one-minute video clip is rendering at 50%, the render job takes two minutes to complete."

Total BS: if it goes at 30 frames per second, and the number of seconds stays the same, then the time to complete will also stay the same. Strange software, though, if it cannot go faster than real time when resources are available.
 

peevee

Distinguished
Dec 5, 2011
58
0
18,630
And why was the Intel laptop equipped with an HDD, while the AMD desktops were equipped with SSDs? The most obvious effect is in the last test, where the 1080p video with the Basic effect had only 70% of the CPU utilized while not reaching real time, indicating that the bottleneck is elsewhere: probably the slow notebook HDD, or maybe just one stick of memory installed, halving the memory throughput essential for such a task.
 

enisj

Distinguished
Feb 9, 2012
1
0
18,510
I wonder whether OpenCL could be used to offload some of the memory-intensive workload involved in calculating effects like delay, reverb, and spatialisation in DAWs like Ableton Live away from the CPU to the GPU. I know that some GPU-accelerated VSTs exist, but they are few and far between and run on specific hardware; OpenCL should allow GPU acceleration on most newer GPUs/APUs.
 

bit_user

Titan
Ambassador
[citation][nom]enisj[/nom]I wonder whether OpenCL could be used to offload some of the memory-intensive workload away from the CPU to the GPU involved in calculating effects like delay, reverb and spatialisation[/citation]If the workload is truly compute- or memory-I/O-bound, then it should be easy to implement most audio processing operations in OpenCL. Several years ago, I heard about model-based reverb being GPU-accelerated. That might have even predated CUDA.

I suspect at least some DAW operations are really limited by disk I/O, however. And for a lot of audio processing, a fast multicore CPU has more than enough power. In those cases, the speed of an operation is really a question of how well-designed the software & plugins are. Do they use SSE/AVX? Are they multi-threaded? Do they make efficient use of CPU cache?
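For what it's worth, a single-tap feedforward echo is the kind of per-sample operation that maps onto OpenCL almost trivially. Here's a minimal sketch using pyopencl; the kernel and buffer names are mine, not from any existing plugin, and note that a feedback delay or a reverb tail is harder to parallelize because each output sample depends on earlier outputs:

[code]
import numpy as np
import pyopencl as cl

# out[i] = in[i] + mix * in[i - delay]  (feedforward, so every sample is independent)
KERNEL = """
__kernel void echo(__global const float *in_buf, __global float *out_buf,
                   const int delay, const float mix) {
    int i = get_global_id(0);
    float delayed = (i >= delay) ? in_buf[i - delay] : 0.0f;
    out_buf[i] = in_buf[i] + mix * delayed;
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, KERNEL).build()

samples = np.random.randn(48000 * 10).astype(np.float32)   # ten seconds at 48 kHz
mf = cl.mem_flags
in_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=samples)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, samples.nbytes)

# 250 ms delay tap, mixed in at half volume.
prg.echo(queue, samples.shape, None, in_buf, out_buf,
         np.int32(48000 // 4), np.float32(0.5))

result = np.empty_like(samples)
cl.enqueue_copy(queue, result, out_buf)
[/code]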
 
Reading this, along with my own hands-on time with the A8-3870K... the future looks very bright for APUs. As soon as we move on from DDR3-based system memory, and CPUs are using memory closer in spec to what current GPUs are using, those APUs are really going to be awesome.

I've been tweaking the settings on this lil system, and my first attempt to use AMD's Video Converter built into the AMD Vision Engine Control Center just ended up with a crash after about 20 minutes. I haven't gone back to play with it some more yet. Playing some less GPU-intensive games at 1080p, like Super Street Fighter IV Arcade Edition (high settings, no AA) and DiRT 3 (low settings), has been a lot of fun and looks stunning for how little power it pulls from the wall socket.

If this article had just one more thing, it would've been the average watts from the wall during testing. We know the TDPs of the products, and my 3870K has a max TDP of 117 W with its current settings as reported by CPU-Z, but at-the-wall readings, with the drives and everything, can be a lot less or a little more depending on the workload when the APU is stressed at its current 3400 MHz speed.
 