Very interesting results; it's good you've spent the time gathering data.
Jawed, where do you think the bottleneck is?
You got me curious to find out exactly how expensive 1MM taps is. It certainly isn't the bottleneck.
The bottleneck is that it's not commercially realisable as a stand-alone DAC. The next step from Blu 2 is a million taps in a device of the size, portability and power consumption of Hugo or Mojo. Nothing else is worth doing. At least not for Chord. Some other competitor? Sure, they can try. If it sounds better than DAVE or Hugo 2, then yay.
The power consumption of a general-purpose processor running FFT-based convolution is the plainest bottleneck. There's also the matter of latency. Latency can be reduced by running a partitioned convolution algorithm, but then power consumption goes up. This page is very interesting (I got there via your link to Convolver):
https://www.ludd.ltu.se/~torger/brutefir.html#whatis
on the subject of performance, latency and practicality. As you can see from that page, long FIR has been practical in real time for 15+ years on an ordinary consumer PC.
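For anyone who wants to see what partitioned convolution actually looks like, here's a minimal sketch in plain NumPy. It's my own illustration, not BruteFIR's or Chord's code, and the function name and structure are just for this post. The block size is the latency.

```python
# Minimal sketch of uniformly partitioned FFT convolution (overlap-add),
# in plain NumPy. Illustrative only. The filter is split into block-sized
# partitions whose spectra are multiplied against a frequency-domain delay
# line of recent input-block spectra.
import numpy as np

def partitioned_fft_convolve(x, h, block=4096):
    """Stream x through FIR filter h with `block` samples of latency."""
    nfft = 2 * block                              # zero-padded FFT size
    n_parts = -(-len(h) // block)                 # ceil(len(h) / block)
    # Pre-compute the spectrum of each filter partition once.
    H = [np.fft.rfft(h[i * block:(i + 1) * block], nfft) for i in range(n_parts)]
    # Frequency-domain delay line: one slot per partition, newest first.
    fdl = [np.zeros(nfft // 2 + 1, dtype=complex) for _ in range(n_parts)]

    n_blocks = -(-len(x) // block)
    out = np.zeros(n_blocks * block)
    tail = np.zeros(block)                        # overlap carried between blocks

    for start in range(0, n_blocks * block, block):
        X = np.fft.rfft(x[start:start + block], nfft)
        fdl.insert(0, X)                          # push newest input spectrum
        fdl.pop()                                 # drop the oldest
        # Multiply-accumulate each delayed input spectrum with its partition.
        Y = sum(Xd * Hp for Xd, Hp in zip(fdl, H))
        yb = np.fft.irfft(Y, nfft)
        out[start:start + block] = yb[:block] + tail
        tail = yb[block:]                         # second half overlaps into the next block
    return out[:len(x)]                           # streaming view: same length as the input
```

Back-of-envelope: a million taps at a 4096-sample block is roughly 245 partition spectra to multiply-accumulate per block, plus one forward and one inverse FFT. Halve the block to halve the latency and the multiply-accumulate work per second roughly doubles, which is where the extra power goes.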
When running a million taps you need to use 64-bit (double-precision) arithmetic. If you want to introduce GPUs as a solution, you have to bear in mind there are only about 5 chips out there that aren't absurdly slow when running double precision. Most GPUs are 1/16 or 1/32 rate on DP, entirely nullifying any theoretical benefit they might have over an ordinary CPU. Also, the fast double-precision GPUs are silly expensive, etc.
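To make the precision point concrete, here's a quick sketch that computes one million-tap output sample with naive sequential accumulation in float32 and in float64 and compares both against a compensated-sum reference. The coefficients are random stand-ins, nothing to do with any real filter:

```python
# Quick illustration of why the accumulation wants to be 64-bit: one
# million-tap dot product accumulated naively in float32 and in float64,
# both compared to math.fsum (compensated summation) as a reference.
# Random data stands in for real filter coefficients and signal history.
import numpy as np
from math import fsum

rng = np.random.default_rng(0)
taps = 1_000_000
h = rng.uniform(-1.0, 1.0, taps)       # stand-in filter coefficients
x = rng.uniform(-1.0, 1.0, taps)       # stand-in signal history

ref = fsum(h * x)                      # high-accuracy reference sum

acc32 = np.float32(0.0)
for p in h.astype(np.float32) * x.astype(np.float32):
    acc32 += p                         # every add rounds to a 24-bit mantissa

acc64 = 0.0
for p in h * x:
    acc64 += p                         # every add rounds to a 53-bit mantissa

print("float32 accumulation error:", abs(float(acc32) - ref))
print("float64 accumulation error:", abs(acc64 - ref))
```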
If your test processor were a phone - which could represent a device that replaces Hugo or Mojo with built-in 1-million-tap processing - then whatever numbers you produced would be more interesting! See if it could run for 5 hours, say, off battery...
As it happens, back in 2016 we had a discussion about FFT-based algorithms:
https://www.head-fi.org/threads/chord-electronics-dave.766517/page-273#post-12778587
https://www.head-fi.org/threads/chord-electronics-dave.766517/page-275#post-12782076
That discussion obviously pre-dated Blu 2, which moved the goalposts rather dramatically.
When you have a specific algorithm to run, a general-purpose processor (CPU or GPU) is going to be worse than an FPGA and utterly comical in the face of an ASIC. That's why so much AI research is focussing on FPGAs and ASICs. It's why investment banks use FPGAs, not GPUs, for their high-performance, low-latency workloads.
GPUs are great for proof of concept, because of their programmability and dense compute capability. Programmability could be attractive to Rob (it's likely easier to make C work than an FPGA when building a million taps) but that doesn't get you a device with a million taps in a portable replacement for Hugo.
It's worth noting that GPUs have hardly progressed in compute capability in the last 9 years. A factor of approximately 5x is not very impressive compared to the previous 9 years:
https://techreport.com/review/17618/amd-radeon-hd-5870-graphics-processor/5
And that chip has very high double-precision capability (1/4 rate); to get 11x that today requires Volta V100
https://www.anandtech.com/show/1136...v100-gpu-and-tesla-v100-accelerator-announced
which took 8 years with a chip that's 2.4x bigger. So in effect it took 8 years to get roughly 4.5x higher capability in the same chip area (cost), though probably at half the power. And this GPU (in a consumer card that's just been released) is about 4x the cost of the card from 2009, adjusted for inflation.
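For what it's worth, the per-area arithmetic works out like this, using only the figures quoted above rather than re-measured specs:

```python
# Per-area arithmetic from the paragraph above, using only the quoted
# figures: ~11x the double-precision throughput, delivered by a die
# roughly 2.4x larger, over about 8 years.
dp_gain = 11.0
die_ratio = 2.4
years = 8
per_area = dp_gain / die_ratio          # ~4.6x per unit of die area (the ~4.5x above)
annual = per_area ** (1.0 / years)      # ~1.21x per year, compounded
print(f"per-area gain: {per_area:.1f}x, annual rate: {annual:.2f}x")
```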
So if you have a DSP algorithm you want to deploy to 10s of thousands of customers, an FPGA (or a grid of them; they're usually small and low power) is going to be preferable to a GPU. If you have a one-off application installation or you're changing the application all the time, then sure, use a GPU. Or a supercomputer full of them.
---
You can find some of my GPU related coding here:
https://github.com/JawedAshraf
though it's for noise reduction in video, not audio processing.