How To Calculate Theoretical GPU FLOPS?

Icaraeus

How do I calculate the theoretical performance of my Sapphire R9 270X? On paper the standard 270X is rated at 2.69 TFLOPS, but I don't know how AMD arrived at that number. I've overclocked mine as far as it will go, so I'm wondering what it could theoretically achieve, just for the fun of it.

Default GPU Clock - 1070 MHz
Modified GPU Clock - 1190 MHz (+120 MHz)
Default VRAM Clock - 1400 MHz
Modified VRAM Clock - 1510 MHz (+110 MHz)

Before overclock:

Pixel Fillrate - 34.2 GPixel/s
Texture Fillrate - 85.6 GTexel/s
Bandwidth - 179.2 GB/s

After overclock:

Pixel Fillrate - 38.1 GPixel/s (+3.9 GPixel/s)
Texture Fillrate - 95.2 GTexel/s (+9.6 GTexel/s)
Bandwidth - 193.3 GB/s (+14.1 GB/s)
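
For anyone wondering where figures like these come from, here is a rough sketch of the usual arithmetic: pixel fillrate is ROPs times core clock, texture fillrate is TMUs times core clock, and bandwidth is bus width times the effective memory data rate. The ROP and TMU counts are the ones quoted later in this thread. The 256-bit bus width and the four transfers per clock for GDDR5 are assumptions on my part, not something stated in the post, and the function names are just made up for illustration.

```python
# Rough sketch of theoretical fillrate and bandwidth arithmetic.
ROPS = 32                  # from the thread
TMUS = 80                  # from the thread
BUS_WIDTH_BITS = 256       # assumption: not stated in the thread
TRANSFERS_PER_CLOCK = 4    # assumption: GDDR5 moves 4 transfers per memory clock


def pixel_fillrate_gpixels(core_mhz, rops=ROPS):
    """One pixel per ROP per core clock, in GPixel/s."""
    return rops * core_mhz / 1000


def texture_fillrate_gtexels(core_mhz, tmus=TMUS):
    """One texel per TMU per core clock, in GTexel/s."""
    return tmus * core_mhz / 1000


def bandwidth_gb_s(mem_mhz, bus_bits=BUS_WIDTH_BITS, transfers=TRANSFERS_PER_CLOCK):
    """Bytes per transfer times transfers per second, in GB/s."""
    return bus_bits / 8 * mem_mhz * transfers / 1000


for label, core_mhz, mem_mhz in (("stock", 1070, 1400), ("overclocked", 1190, 1510)):
    print(
        f"{label}: "
        f"{pixel_fillrate_gpixels(core_mhz):.1f} GPixel/s, "
        f"{texture_fillrate_gtexels(core_mhz):.1f} GTexel/s, "
        f"{bandwidth_gb_s(mem_mhz):.1f} GB/s"
    )
```

Under those assumptions the results line up with the before and after numbers listed above.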

Maximum OC temp on stress-test: 70 degrees
Maximum voltage: 1.295V
 
Okay, so considering my 270X currently has:

TMUs - 80
ROPs (Raster Operations) - 32
Core Clock - 1180 MHz
Mem Clock - 1570 MHz
VDDC - 1.257V

80 x 32 x 1180 = 3,020,800 = 3.02 TeraFlops, if I calculated correctly.

Does the memory clock have any effect on the theoretical performance?
 
Oh right, my mind completely blanked on the whole bandwidth part, I remember now! It puzzles me, however, why the GTX 970 and 980 have 256-bit VRAM buses. That seems quite narrow, and I wouldn't expect it to be very good for anything over 2GB of VRAM (they pack 4GB of GDDR5), unless Nvidia did something to the architecture that I haven't looked into.
 


Sorry, wrong.

The maximum theoretical floating point throughput is a function of the gross number of shaders.

I'll use the HD 7970 clocked at 1 GHz as an example because it's what I have.

The HD 7970 is constructed from 32 compute units.

Each compute unit has 4 SIMD execution units.

Each SIMD unit is 16 ALUs wide.

32 compute units * 4 SIMDs per compute unit * 16 ALUs per SIMD = 2048 total shaders

In each cycle, each shader can perform one multiply operation and one accumulate operation (called a MAC, or Multiply-Accumulate). Although the hardware executes this as a single instruction, it is counted as two separate floating point operations when tallying FLOPS.

2048 shaders * 2 floating point operations per cycle * 1 billion cycles per second = 4096 gigaflops
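
Here is the same HD 7970 arithmetic as a small sketch, using only the numbers given above. The variable names are just for illustration.

```python
# HD 7970 at 1 GHz, using the figures from this post.
COMPUTE_UNITS = 32
SIMDS_PER_COMPUTE_UNIT = 4
ALUS_PER_SIMD = 16
FLOPS_PER_SHADER_PER_CYCLE = 2   # one multiply plus one accumulate
CLOCK_GHZ = 1.0                  # 1 billion cycles per second

# Total shader (ALU) count across the whole chip.
shaders = COMPUTE_UNITS * SIMDS_PER_COMPUTE_UNIT * ALUS_PER_SIMD

# Shaders * operations per cycle * billions of cycles per second = gigaflops.
peak_gflops = shaders * FLOPS_PER_SHADER_PER_CYCLE * CLOCK_GHZ

print(f"{shaders} shaders, {peak_gflops:.0f} gigaflops peak")
```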

The same logic holds true for the R9 270X.
 


So if it isn't TMUs x ROPS x GPU Core Clock then what would it be exactly? I don't really understand the whole SIMD and ALU part.
 


peak floating point throughput = shaders * 2 * clock frequency
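
As a rough sketch, that formula can be wrapped in a small function and applied to the 270X. Note the assumptions: the shader count of 1280 and the 1050 MHz reference clock are taken from the card's published specs, not from anything stated in this thread, and the function name is made up for illustration. The 1070 MHz and 1190 MHz clocks are the ones from the original post.

```python
def peak_gflops(shaders, clock_mhz, flops_per_cycle=2):
    """Peak throughput = shaders * operations per cycle * clock, in gigaflops."""
    return shaders * flops_per_cycle * clock_mhz / 1000


R9_270X_SHADERS = 1280   # assumption: published spec, not stated in this thread

clocks_mhz = {
    "reference (assumed 1050 MHz)": 1050,
    "Sapphire default": 1070,
    "overclocked": 1190,
}

for label, mhz in clocks_mhz.items():
    tflops = peak_gflops(R9_270X_SHADERS, mhz) / 1000
    print(f"{label}: {tflops:.2f} teraflops")
```

With those assumed inputs, the reference clock gives the 2.69 TF figure quoted in the original question, and the overclock lands a little over 3 teraflops.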

In reality, hitting peak throughput is damned near impossible. AMD realized this when their older Radeon HD architectures (HD 6000 series and prior) had a theoretical computational edge over the competition from Nvidia, yet generally underperformed in comparison. They radically changed the architecture to create GCN, which does a much better job of keeping the shaders busy.

Whereas the HD 6000 series and prior were based on a VLIW MIMD design, GCN is based on a pure SIMD design.
 
Solution


That's purely a result of the way AMD has the core laid out. It's just a coincidence.
 


Thanks for the info @Pinhedd, it seemed rather convenient though, I must say! I haven't really dug into that stuff. Thanks again!
 


In the classical fixed-function graphics pipeline, the numbers of TMUs, ROPs, vertex shaders, and pixel shaders were typically the same. However, over time the complexity of shader programs has grown far faster than the complexity of texture manipulation and rasterization. Thus, manufacturers have unified the shaders and decoupled them from the rest of the pipeline. Core configuration is usually expressed in the form Shaders:TMUs:ROPs.
 


Well, considering that the 3dfx Voodoo 2 was the first to feature TMUs, things have gotten rather complex now. Hard to keep up! :)