OK, then you should have clarified that your target market is telecom, not deep learning applications.
However, if you do decide to go after deep learning (seeing as it is a much faster-growing and potentially much bigger market), I have a few questions for you:
1. Will I be able to take my highly optimized Torch/Theano/custom CUDA code and run it on your chip with minimal modifications? Especially considering that even some of the latest CUDA code is not compatible with older GPU architectures.
2. How much will your devbox cost compared to the Nvidia devbox?
3. Will I get much better performance (at 16-bit precision) as a result?
Keep in mind that I'm talking about the 2016/2017 version of the Nvidia devbox (Pascal should have at least 30% better performance than the current Maxwell cards, and probably more than that if they manage to move to a 14nm process).
Regarding the "magic" compiler, have you been watching the development of the Mill CPU? It was designed by people with strong DSP and compiler-design backgrounds, and they also put a lot of emphasis on the compiler. Is that project dying? After two years of hype, it seems like it never got out of the "simulated on FPGA" phase... What can you learn from them?
It's a pretty big leap from lots of little parallel (and I assume 1D) FFTs to a deep learning processor. The REX Computing chip looks like a DMA-driven systolic array with beefier data-processing units than any I've seen before.
There are certainly applications for this sort of processor (embarrassingly parallel batches of small, independent units of work), but I'd be highly skeptical that this guy has anything close to a "magical compiler(tm)", given his inaccurate understanding and significant underestimation of the competition. That's dangerously close to Intel's absurd "recompile and run" nonsense for Xeon Phi (it's anything but that)...