
NVLink should arrive in 2016 - http://wccftech.com/nvidias-gp100-pascal-flagship-pack-4096-...

Do you have any benchmarks on *gemm or FFTs?



While you will be able to interconnect GPUs with NVLink in ~2016, the first CPUs supporting it won't arrive until 2018/2019.

We will be releasing benchmark data when we have first silicon back next year. As of right now, our numbers are based on FPGA prototype implementations.


So basically you're shamelessly promoting a powerpoint processor versus shipping hardware? Pot.kettle.tapeout. I thought that was Intel's specialty.

It's a very long road to a shipping chip and a lot can happen between now and then. And afterwards, it's an even longer road to the equivalent of nvcc, nvvp, cuda-gdb, cuBLAS, cuFFT, cub, cuRand, and cuDNN with solid linux/windows/mac support. And it's all free. NVIDIA achieves this through its nearly bottomless pockets from the gaming/industrial complex that has allowed them to weather many near-death experiences.

What's your burn rate BTW?

In the meantime, a $1000 GTX Titan X (which you can easily buy off Amazon) delivers 27 GFLOPS/W of FP32. So much for "a 10 to 25x increase in energy efficiency for the same performance level compared to existing GPU and CPU systems." At least get your facts straight. And while we're talking facts, the real challenge here is that it delivers 6.7 GFLOPS/$. That seems like a tough number for a startup to beat. Again, what's your burn rate? For that's how it's ended so far for the other contenders. Why are you different?
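
For reference, here's the back-of-envelope arithmetic behind those two figures (a rough sketch assuming the advertised ~6.7 TFLOPS FP32 peak, the 250 W board TDP, and the ~$1000 street price; these are approximations, not a benchmark):

    // Rough efficiency arithmetic for a Titan X-class card. All inputs are
    // approximate/assumed; this is a sanity check, not a measurement.
    #include <cstdio>

    int main() {
        const double peak_gflops = 6700.0;  // advertised FP32 peak, approximate
        const double tdp_watts   = 250.0;   // board TDP, approximate
        const double price_usd   = 1000.0;  // street price, approximate
        printf("GFLOPS/W: %.1f\n", peak_gflops / tdp_watts);  // ~27
        printf("GFLOPS/$: %.1f\n", peak_gflops / price_usd);  // ~6.7
        return 0;
    }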

I have no problem believing NVIDIA will one day be disrupted, but nothing I read on your web site made me feel in any way that it's your architecture. Suggestion: lose the router and replace it with something simpler after studying NVIDIA's memory controller and hierarchy. They considered a lot of crazy things too and there are many reasons why GPUs ended up the way they are with a very clean programming abstraction that automagically subsumes SIMD, multithreading, and multicore.


As for being a "powerpoint processor", we have gone through multiple prototypes with FPGA prototypes and physical simulation with high speed timing, and we are using that knowledge to get our performance and efficiency metrics. I don't want to release full benchmark data at this point based on software and FPGA simulations of an unfinished design, as it is that... unfinished.

Burn rate is very low (relatively speaking)... Our full-time team, which we are building up (we're hiring!), is only 7 people. Our current runway is around 18 to 20 months, and we have some unannounced (and not included in that runway) funding coming along.

Semiconductor economics are also in our favor, with our ~100mm^2 chip being a lot cheaper (per unit, and in terms of GFLOPs/$) than NVIDIA's ~650mm^2 (and up) chips.

As for "getting my facts straight" I was using actual Titan X numbers I have seen for real applications (e.g. nbody simulation)... it only gets around 4000GFLOPs single precision compared to their advertised 7000. I was being generous to NVIDIA saying 20 GFLOPs/watt at best when they are currently at ~16 (before you include the CPU). If you know about NVIDIA, you should know they have a long history of bullshitting numbers.

As for our design, there is a lot more interesting stuff that we haven't disclosed for obvious reasons, but I do stand by all of our numbers. When it comes to our network on chip, the main reason we are keeping it this way is that it is a general-purpose design... of course, we would be doing things very differently if we wanted to have something application-specific for machine learning.


I've been looking at the history of specialized hardware for running neural networks. Starting with Intel's ETANN, every single chip initiative has failed. The reason has typically been the same - the "silicon steamroller": general-purpose hardware keeps progressing fast enough that specialized chips are never attractive enough in cost/performance by the time they finally hit the market (which was always later than planned).

With Nvidia throwing all their weight behind acceleration of deep learning applications, they're advancing on multiple fronts: Tegra X1 claims 1 Tflops @ 10W (half precision). That's what they're shipping today; what about 2016?

Aside from rapidly improving GPUs, you now have rapidly improving FPGAs. For example, Altera has put thousands of hard multipliers on their latest Stratix 10 chip. They claim 10 Tflops SP and 80 Gflops/W: https://www.altera.com/products/fpga/stratix-series/stratix-... Microsoft is already using them to power Bing search (apparently they use NN-based algorithms for that).

What makes you think your chip - if it's actually out in 2016, which can easily slip to 2017 - can compete with Pascal chips from Nvidia, or the next gen FPGAs from Altera/Intel or Xilinx?

Who is your market exactly?


To broadly cover most of what you brought up: we think our "secret sauce", which will both allow us to actually make a commercially viable processor and allow us to compete, is the fact that we are developing some very advanced software in house that takes advantage of many new techniques. I can't go into too many details, but we actually think we have developed something close to the long-talked-about "magic" compiler.

When it comes to something like the X1, you have to remember NVIDIA loves exaggerating their benchmark results. In reality, it gets around 80% of its theoretical peak, and if you move to double precision, it only gets around 40 GFLOPs on Linpack (out of a theoretical 64 DP GFLOPs). Even if you take NVIDIA at face value and say they get full theoretical peak (64 GFLOPs double precision) at 10 watts, that only gets you 6.4 GFLOPs/Watt.

One thing that has been completely disregarded in this and my previous posts is that the GFLOPs numbers we have been quoting are for Linpack and matrix-matrix (Level 3 BLAS) workloads, which represent a very small slice of real world applications... I would say Level 3 BLAS benchmarks get around 90% of theoretical peak on GPU systems on average, but as soon as you get into other application spaces (Level 1 and Level 2 BLAS, or anything dealing with a lot of memory movement), GPUs really start to fall off and only get ~10-20% of theoretical peak. Our architecture is built to actually be able to reach theoretical peak in a "perfect" case (which for us is our hand-written FFT kernel) in all 3 BLAS levels. In reality, I would expect us to hit at least 85-90% of theoretical peak in all of those floating point domains.
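
To make the BLAS-level point concrete, here is a rough arithmetic-intensity sketch (assumed sizes and rounded machine numbers, purely illustrative) of why Level 3 kernels can sit near peak while Level 1/2 kernels hit the memory wall:

    // Arithmetic intensity (FLOPs per byte of DRAM traffic) for the three BLAS
    // levels, using an assumed n=4096 and single precision. Illustrative only.
    #include <cstdio>

    int main() {
        const double n = 4096.0;
        const double B = 4.0;  // bytes per float

        double ai_l1 = (2 * n)         / (3 * n * B);      // saxpy: read x,y; write y
        double ai_l2 = (2 * n * n)     / (n * n * B);      // sgemv: reading A dominates
        double ai_l3 = (2 * n * n * n) / (3 * n * n * B);  // sgemm: read A,B; write C

        printf("FLOP/byte  L1: %.2f  L2: %.2f  L3: %.0f\n", ai_l1, ai_l2, ai_l3);
        // With ~7000 GFLOPS of FP32 peak and ~336 GB/s of memory bandwidth
        // (Titan X-class, approximate), anything under ~20 FLOP/byte is
        // bandwidth-bound, so Level 1/2 kernels can't get near peak no matter
        // how many FLOPS the chip advertises.
        return 0;
    }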

As for Altera and FPGAs... they are a royal PITA to program for, and will never be all that efficient. Altera's 10 TFLOP/s number is ONLY for single precision float, and is based on adding up the theoretical capabilities of all the DSP slices... in reality, you would never be able to hit that given the memory limitations of feeding all the DSP slices. In the small print, Altera even admits their 10 TFLOPs number is BS, as they list the highest FP32 number as 9.2 TFLOPs. Again, that theoretical 80 GFLOPs/Watt (which I would be surprised if it hits 50 in the real world) is only for single precision, where we are aiming for 128. All the while we are a hell of a lot easier to program for.
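
To illustrate the "adding up the DSP slices" point, this is roughly how such a headline peak gets constructed (the block count and clock below are assumed round figures for illustration, not vendor-confirmed data):

    // How an FPGA "peak TFLOPS" headline is typically built: every hardened
    // FP32 DSP block doing one fused multiply-add (2 FLOPs) every cycle.
    // Block count and clock are assumed illustrative values.
    #include <cstdio>

    int main() {
        const double dsp_blocks      = 5760.0;  // assumed FP32 DSP block count
        const double flops_per_cycle = 2.0;     // one FMA per block per cycle
        const double clock_ghz       = 0.8;     // assumed Fmax with all blocks busy
        double peak_tflops = dsp_blocks * flops_per_cycle * clock_ghz / 1000.0;
        printf("peak: %.1f TFLOPS\n", peak_tflops);  // ~9.2, assuming zero routing or memory stalls
        return 0;
    }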

As for short term competition, we know for a fact that Intel and NVIDIA will not be hitting our efficiency levels in this decade, and in addition we will have a cost advantage. Compared to FPGAs, we have the ability to actually port existing code over without huge performance sacrifices (C-to-gates and Altera's OpenCL work are abysmal for performance) and without changing your development team or learning how to write RTL (you can write code for our chip in any language that LLVM has a frontend for).

As for market, the initial target market we are actively working on is large-constellation FFT type workloads... think LTE-Advanced and "5G" basestation processing, as that's where we have our best numbers (25x efficiency over the best DSPs in that space). Beyond that we are looking at the larger "HPC" category, and 5+ years out, I hope to be able to expand to more general purpose markets.


Ok, then you should have clarified that your target market is telecom, and not deep learning applications.

However, if you do decide to go after deep learning (seeing as it is a much faster growing, and potentially much bigger market), I have a few questions for you:

1. Will I be able to take my highly optimized Torch/Theano/custom CUDA code and run it on your chip with minimal modifications? Especially taking into consideration that even some of the latest CUDA code is not compatible with older GPU architectures?

2. How much will your devbox cost, compared to the Nvidia devbox?

3. Will I get much better performance (16 bit) as a result?

Keep in mind that I'm talking about the 2016/2017 version of the Nvidia devbox (Pascal should have at least 30% better performance compared to current Maxwell cards, and probably more than that if they manage to move to a 14nm process).

Regarding the "magic" compiler, have you been watching the development of Mill CPU? Designed by the guys with strong DSP and compiler design background. They also put a lot of emphasis on the compiler. Is that project dying? After two years of hype, it seems like it never got out of the "simulated on FPGA" phase... What can you learn from them?


It's a pretty big leap from lots of little parallel (and I assume 1D) FFTs to a deep learning processor. The rexcomputing chip looks like a DMA-driven systolic array with beefier data-processing units than I've ever seen before.

https://en.wikipedia.org/wiki/Systolic_array

There are certainly applications for this sort of processor (embarrassingly parallel batches of small independent units of work), but I'd be highly skeptical that this guy has anything close to a "magical compiler(tm)" given his inaccurate understanding and significant underestimation of the competition. That's dangerously close to Intel's absurd "recompile and run" nonsense for Xeon Phi (It's anything but that)...


One more thing I wanted to add. As someone who often evaluates new processor architectures, I can tell you n-body is anything but a real application. It's a nice demo (pretty too), but real applications (even those similar to n-body) have far more complicated inner loops along with sophisticated culling algorithms that effectively reduce the computation to O(n). So far, GPUs have been the only platform flexible enough to do this well. If you pitched me with n-body, I'd laugh you out of the room. Show me compelling AlexNet performance (both forwards and backwards) and you'll have my undivided attention. The few apps that get 10-20% of peak performance tend to show even worse efficiency on CPUs in my experience, because they're fundamentally I/O-bound.
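
For context, the whole n-body demo boils down to a kernel like the one below (a minimal all-pairs sketch, not any vendor's actual sample code), which is exactly why it flatters nearly any wide floating-point machine: one tiny regular inner loop, no data-dependent control flow, no culling.

    // Minimal naive O(n^2) n-body kernel: each thread accumulates the
    // gravitational acceleration on one body from all others. pos.w holds the mass.
    #include <cuda_runtime.h>

    __global__ void nbody_naive(const float4* pos, float3* acc, int n, float eps2) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 pi = pos[i];
        float3 ai = make_float3(0.0f, 0.0f, 0.0f);
        for (int j = 0; j < n; ++j) {                 // all-pairs loop: O(n^2) total work
            float4 pj = pos[j];
            float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
            float r2 = dx * dx + dy * dy + dz * dz + eps2;
            float invR = rsqrtf(r2);
            float s = pj.w * invR * invR * invR;      // m_j / r^3
            ai.x += dx * s; ai.y += dy * s; ai.z += dz * s;
        }
        acc[i] = ai;
    }

    int main() {
        const int n = 4096;                           // assumed body count for the demo
        float4* pos; float3* acc;
        cudaMalloc(&pos, n * sizeof(float4));
        cudaMalloc(&acc, n * sizeof(float3));
        cudaMemset(pos, 0, n * sizeof(float4));       // placeholder data; a real run uploads positions/masses
        nbody_naive<<<(n + 255) / 256, 256>>>(pos, acc, n, 1e-4f);
        cudaDeviceSynchronize();
        cudaFree(pos); cudaFree(acc);
        return 0;
    }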

But unfortunately, those are exactly the sort of pitches I see: unrealistic or overly simplified demos that tell me jack about the real world utility of a new architecture.

Switching to a new platform is a Mt. Doom of technical debt no matter how fantastic it is. You need to make climbing Mt. Doom at least sound like a good idea. That said, I've been on your side of the fence several times in the past. Hear me now, believe me later I guess.

You can defer and say you are targeting telecom and that's great. But you just picked a fight with NVIDIA here over deep learning and you're 7 people with <2 years of funding left in the bank. Put up or shut up.


"Put up or shut up"

I think we should give him some respect. He's developed a novel processor, and started a funded company before he turned 18.

Even if he fails to sell this particular design, we need more people like him.


What can I say? I respect a shipping product and nothing short of it. He entered the conversation pitting his powerpoint processor against shipping hardware.

Have we not tired of all the broken promises of Kickstarter and IndieGogo yet?


"As for "getting my facts straight" I was using actual Titan X numbers I have seen for real applications (e.g. nbody simulation)... it only gets around 4000GFLOPs single precision compared to their advertised 7000."

Nonsense. I regularly get ~5.5 TFLOPS out of them running cuBLAS SGEMMs in neural networks. You can get that down to ~1.4 if you do your best to choose stupid small values for m, n, and k, but that's a relatively minor bug and I think it's fixed in cuDNN's latest kernels.

But even if it isn't, if you're willing to download Scott Gray's maxas: https://github.com/NervanaSystems/maxas, you can hit 6.4 TFLOPS with his hand-coded SGEMM. Similarly, one can do the same for convolutional layers with Andrew Lavin's maxDNN: https://github.com/eBay/maxDNN
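
For anyone who wants to reproduce that kind of number themselves, the measurement is straightforward; here's a minimal sketch (no error checking, and the 4096^3 problem size is just an assumed "large enough" value):

    // Time one large cuBLAS SGEMM and report 2*m*n*k / elapsed_seconds.
    // Matrix contents are left uninitialized since only the timing matters here.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const int m = 4096, n = 4096, k = 4096;
        const float alpha = 1.0f, beta = 0.0f;
        float *A, *B, *C;
        cudaMalloc(&A, sizeof(float) * m * k);
        cudaMalloc(&B, sizeof(float) * k * n);
        cudaMalloc(&C, sizeof(float) * m * n);

        cublasHandle_t handle;
        cublasCreate(&handle);
        // Warm-up call so the timed run doesn't include library setup.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, A, m, B, k, &beta, C, m);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, A, m, B, k, &beta, C, m);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("SGEMM: %.2f TFLOPS\n", 2.0 * m * n * k / (ms * 1e-3) / 1e12);

        cublasDestroy(handle);
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }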

Your hubris is amusing, but I reiterate that until you have a shipping chip, all you have is a powerpoint processor. I'm sure you disagree. Good luck with that. But do come back when you have numbers (real numbers on real DNNs like AlexNet, VGG, or GoogLeNet as opposed to synthetic fantasy networks that fit your architecture well). See Nervana for a company doing this well so far.

"If you know about NVIDIA, you should know they have a long history of bullshitting numbers."

I know them very well and, well, pot.kettle.bs... IMO they're going to stay the leader in parallel computing technology right until Intel stops sniffing its own tailpipe and/or AMD hires a better driver team (they have promising HW). And both of them need to study NVIDIA's engagement with the academic community and one-up it rather than deny its efficacy.



