A paper from two programmers shows efficiency gains of 50x when running the same program on an FPGA vs. a CPU.
FPGA Demo Shows Efficiency Gains Compared to x86 Chip
> From what I can tell, the main point of the whitepaper isn't that the FPGA can run the program more efficiently.

Well, they do seem to place a lot of emphasis on power-efficiency.
> once the FPGA has been programmed it's essentially an ASIC,

Not at all. A true ASIC is much faster, primarily due to having a much better-optimized layout and probably lots of other bespoke customizations.
> and saying that an ASIC executes some function far faster/more efficiently than a general-purpose CPU isn't a revelation.

Depends on whether they're using fixed-function arithmetic or whether they have to synthesize it. I think FPGAs have been on a trend of incorporating more fixed-function units for a while. If they have to synthesize things like multipliers, that puts the FPGA at a potential disadvantage relative to a CPU, which has far more optimized implementations of these building blocks.
> It seems the important thing they achieved is being able to use the same C source code to either compile and run in software, generate a hardware model and simulate performance, and/or generate the FPGA image (directly from C code rather than an HDL).

People have been using tools like OpenCL to target FPGAs for more than a decade.
> This improves ease and speed of development. Maybe to the point where it starts making sense to use FPGAs in heretofore impractical use cases.

The thing that jumps out at me is that we know hardware is better for ray tracing. People have been doing it in ASICs since about 20 years ago, and in GPUs for almost 5. We've also been getting more purpose-built ASICs for other high-profile applications of FPGAs, like crypto-mining and AI. That leaves the CPU to do what it does best: be the generalist and the traffic cop. The main places I see where FPGAs are still relevant are things like DSP in ultra-low-power, low-volume embedded devices, and when you need the absolute minimum latency, for things like software-defined networking and high-frequency trading.
> Contrary to your analysis, we do use multiple threads, based on OpenMP (see simulator_main.cpp#L247); indeed, the CPU shows all cores running at close to 100% usage (power is measured only while the CPU is doing the calculations). And we use intrinsic vector types (see c_compat.h#L51).

I didn't say you weren't using threads. I just said I didn't see where/how it was done, and that it could be a bottleneck depending on that.
> I don't see where we use "unnecessary fdivs and sqrt".

I'm not sure which source files I should be looking at, but your float_shift() function asks for a true division, rather than multiplying by the reciprocal. Whether or not it's a real divide depends on your compiler and options. I saw where you use -ffast-math in the Makefile, but I don't know the details of how clang implements it. Check the generated assembly to know for sure.
> About OpenCL, you can run it on an FPGA, but we think it's too high-level and not convenient for general hardware design.

Yes, it's not for general hardware design. It's mainly for using FPGAs to accelerate compute-intensive algorithms. A lot of FPGAs have general-purpose, hardwired CPU cores (usually ARM A53 or similar) which can do the housekeeping and orchestrate the compute-intensive parts running on the gate array.
> About our demo being "trivial", as we said, we consider it a feature,

Yeah, it wasn't meant as an insult. Small is always a good place to start, but you ought to have a plan for how to scale up. My point was really that it's not clear to me how you handle the situation where you start to run out of gates. There are/were companies who would take source code and compile part of it into a hardware design and part of it into executable code. I'm thinking that's where you probably want to go. However, like I said, it's not an untrodden path. I know of two examples, and I'm not even a hardware engineer/researcher.
> we suggest that you need to review the code in full to spot real issues with the project.

It did become apparent there was more going on in your project's C/C++ code than I was fully able to review. However, I have limited time and provided what feedback I could. At least, for now.
> For example, we don't use stock sqrt but an optimized one, so the possible pitfalls that you mention aren't really there. In our case we selected a fast algorithm that gives enough precision and is implemented using just 4 multiplications.

I would just point out that the VSQRTPS instruction can operate on 8x fp32 elements with a throughput of roughly one instruction per 6 cycles on Skylake cores. I haven't had much luck finding instruction throughput & latency data on AMD CPUs, but perhaps Zen 2 is similar?