A paper from two programmers shows efficiency gains of 50x when running the same program on an FPGA vs. a CPU.
FPGA Demo Shows Efficiency Gains Compared to x86 Chip
> From what I can tell, the main point of the whitepaper isn't that the FPGA can run the program more efficiently.

Well, they do seem to place a lot of emphasis on power-efficiency.
> once the FPGA has been programmed it's essentially an ASIC,

Not at all. A true ASIC is much faster, primarily due to having a much better-optimized layout and probably lots of other bespoke customizations.
> and saying that an ASIC executes some function far faster/more efficiently than a general-purpose CPU isn't a revelation.

Depends on whether they're using fixed-function arithmetic or whether they have to synthesize it. I think FPGAs have been on a trend of incorporating more fixed-function units for a while. If they have to synthesize things like multipliers, that puts the FPGA at a potential disadvantage relative to a CPU, which has far more optimized implementations of these building blocks.
> It seems the important thing they achieved is being able to use the same C source code to either compile and run in software, generate a hardware model and simulate performance, and/or generate the FPGA image (directly from C code rather than an HDL).

People have been using tools like OpenCL to target FPGAs for more than a decade.
> This improves ease and speed of development. Maybe to the point where it starts making sense to use FPGAs in heretofore impractical use cases.

The thing that jumps out at me is that we know hardware is better for ray tracing. People have been doing it in ASICs since about 20 years ago, and in GPUs for almost 5. We've also been getting more purpose-built ASICs for other high-profile applications of FPGAs, like crypto-mining and AI. That leaves the CPU to do what it does best: be the generalist and the traffic cop. The main places I see where FPGAs are still relevant are things like DSP in ultra-low-power, low-volume embedded devices, and when you need the absolute minimum latency, for things like software-defined networking and high-frequency trading.
> Contrary to your analysis, we do use multiple threads, based on OpenMP (see simulator_main.cpp#L247); indeed, the CPU shows all cores running at close to 100% usage (power is measured only while the CPU is doing the calculations). And we use intrinsic vector types (see c_compat.h#L51).

I didn't say you weren't using threads. I just said I didn't see where/how it was done, and that it could be a bottleneck depending on that.
> I don't see where we use "unnecessary fdivs and sqrt".

I'm not sure which source files I should be looking at, but your float_shift() function asks for a true division, rather than multiplying by the reciprocal. Whether or not it's a real divide depends on your compiler and options. I saw where you use -ffast-math in the Makefile, but I don't know the details of how clang implements it. Check the generated assembly to know for sure.
> About OpenCL, you can run it on an FPGA, but we think it's too high-level and not convenient for general hardware design.

Yes, it's not for general hardware design. It's mainly for using FPGAs to accelerate compute-intensive algorithms. A lot of FPGAs have general-purpose, hardwired CPU cores (usually ARM A53 or similar) which can do the housekeeping and orchestrate the compute-intensive parts running on the gate array.
> About our demo being "trivial", as we said, we consider it a feature,

Yeah, it wasn't meant as an insult. Small is always a good place to start, but you ought to have a plan for how to scale up. My point was really that it's not clear to me how you handle the situation where you start to run out of gates. There are/were companies who would take source code and compile part of it into a hardware design and part of it into executable code. I'm thinking that's where you probably want to go. However, like I said, it's not an untrodden path. I know of two examples, and I'm not even a hardware engineer/researcher.
> we suggest that you need to review the code in full to spot real issues with the project.

It did become apparent there was more going on in your project's C/C++ code than I was fully able to review. However, I have limited time and provided what feedback I could. At least, for now.
> For example, we don't use stock sqrt but an optimized one, so the possible pitfalls that you mention aren't really there. In our case we selected a fast algorithm that gives enough precision and is implemented using just 4 multiplications.

I would just point out that the VSQRTPS instruction can operate on 8x fp32 elements with a throughput of roughly one instruction per 6 cycles on Skylake cores. I haven't had much luck finding instruction throughput & latency data on AMD CPUs, but perhaps Zen 2 is similar?