Nvidia Digits DevBox (nvidia.com)
193 points by hendler on Aug 9, 2015 | hide | past | favorite | 150 comments


Nvidia owns deep learning. They are alone at the top. Intel and AMD aren't even in the picture. I think this could end up being a bigger business than graphics accelerators. There's a huge opportunity here for the first company to put out a specialized deep learning chip that can beat GPUs (which is definitely possible; probably by 10x or more).


Shameless self promotion... my startup (http://rexcomputing.com) is producing a standalone chip capable of 64 GFLOPs/watt double precision (128 GFLOPs/watt single precision), compared to NVIDIA's next generation chips only hitting 20 GFLOPs/watt single precision... and that is before you take into account the power wasted by the CPU controlling the NVIDIA GPU.

Our biggest plus factor compared to a GPU is that we are a fully standalone/independent chip that does not need a CPU with your main system memory attached to it. Large machine learning data sets are getting into terabytes in size, and the biggest bottleneck with GPUs is the PCIe link, which limits them to 16 GB/s and adds a whole lot of additional latency. In our case, we have a direct connection to DRAM (we've been looking at DDR4 and HMC). In addition, we have designed the architecture to allow massive scalability, with up to 384 GB/s of aggregate chip-to-chip bandwidth... NVIDIA's NVLink is aiming for 80 GB/s in the 2018/2019 timeframe and will still need a connected CPU to issue jobs.

EDIT: I should also mention that our chip is fully general purpose, but we perform really well when it comes to dense matrix math (Most deep learning), with a 10 to 15x efficiency advantage over GPU. Our real killer app is FFTs, which GPUs do abysmally on, and our current benchmarks are showing a 25x efficiency advantage over the best DSPs and FPGAs built for large constellation FFTs.
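For context on why FFT efficiency is such a differentiator: the FFT takes the O(n^2) discrete Fourier transform down to O(n log n), and the transform is easy to sanity-check against its definition. A generic numpy sketch (nothing specific to any vendor's hardware):

```python
import numpy as np

def naive_dft(x):
    """O(n^2) DFT straight from the definition; the FFT computes the
    same result in O(n log n), which is why it dominates signal processing."""
    n = len(x)
    k = np.arange(n)
    # DFT matrix: W[j, k] = exp(-2*pi*i*j*k/n)
    W = np.exp(-2j * np.pi * np.outer(k, k) / n)
    return W @ x

x = np.random.rand(256)
print(np.allclose(naive_dft(x), np.fft.fft(x)))  # True
```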


Single precision isn't low enough. You want half precision or maybe even lower. You also want to throw IEEE 754 out the window. Save on power and area: no denormals, no infinities, no NaNs, relaxed precision requirements. It may even be worth looking at exotic things like logarithmic number systems or analog logic (deep learning should tolerate noise extremely well). You're also going to need vast amounts of memory bandwidth, which means on-package memory, and probably specialized caches and compute units for convolution.
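The bandwidth argument is easy to see in a few lines of numpy (a toy illustration of the tradeoff, not a claim about any specific chip): halving precision halves the memory traffic, while the typical relative error stays tiny by deep learning standards.

```python
import numpy as np

w = np.random.randn(10000).astype(np.float32)  # "full" precision weights
w16 = w.astype(np.float16)                     # half precision copy

print(w.nbytes, w16.nbytes)  # 40000 20000 -- half the bytes to move
rel_err = np.abs(w16.astype(np.float32) - w) / np.abs(w)
print(np.median(rel_err) < 1e-3)  # True: typical error ~1e-4, tolerable for training
```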

A truly specialized deep learning chip probably wouldn't be useful for much else, but it would be a monster at deep learning. And the thing about deep learning is it scales really well. If you have a 10x faster machine you're almost certain to set world records on any machine learning benchmark you try.


While I personally dislike IEEE float, we decided to remain compliant for our first chip, as that is a checkbox for a lot of businesses that we want to sell into. We are looking at a new variable precision floating point format, called unum, created by one of our advisers (and HPC industry legend) John Gustafson. Unum would be fantastic for deep learning, as you would only use the precision actually required, thus bringing the program size and memory bandwidth numbers down and total energy efficiency up at least ~30%-50% over the same system with IEEE float... You can check out a previous HN discussion on it here: https://news.ycombinator.com/item?id=9943589

We still have the option of including a 16 bit (half precision float) packed SIMD mode into our FPUs, which would add a bit of complexity (bringing our efficiency numbers down a bit for the double precision float, which we like to talk about as it is over 10x better than anything out there), but if there is enough customer interest we may decide to include it.


Variable precision sounds scary, but maybe it could work. You should look into building support for your chips into Theano, Torch7, and/or Caffe. You should be able to hide Unum or any other quirks behind the interfaces of those libraries, so people can drop in their existing models with no work. If you can show a significant training speed advantage over GPUs, you'll have a market.


You should check unum out... It actually has higher accuracy than IEEE float, as it does not have "rounding", which causes errors over time. Unum is to floating point as floating point is to integer.

The other nice thing is that it is a superset of IEEE float, and has a "IEEE mode" where you can convert to IEEE float, which is also jokingly called the "guess" function.
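Public unum implementations didn't exist yet when this was written, but the rounding-error point is easy to demonstrate with plain Python, using exact rational arithmetic as a stand-in for a "no rounding" number system:

```python
from fractions import Fraction

# 0.1 has no exact binary representation, so every float addition rounds
float_sum = sum([0.1] * 10)
exact_sum = sum([Fraction(1, 10)] * 10)

print(float_sum == 1.0)  # False -- accumulated rounding error
print(exact_sum == 1)    # True -- exact arithmetic, no drift
```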

As for support, that is the plan. Right now our customers have been most interested in high end signal processing, so we have been taking time to port FFTW, but supporting Theano/Torch/Caffe, etc is a relatively straightforward process.


Didn't Cray have something like that, a variable floating point format where you could choose how many bits (up to a limit) you wanted? I vaguely remember hearing about it in a talk the Computer History Museum hosted.


I just added a bit about unums to Wikipedia, as there didn't seem to be much there.

https://en.wikipedia.org/wiki/John_Gustafson_(scientist)#Unu...

Feel free to correct/improve it anyone.


The End of Error (http://www.amazon.com/The-End-Error-Computing-Computational/...) is the book describing them in great detail... I'm working with John to put together a publicly accessible wiki, which I hope will be up in the next month or two. That being said, the book is worth having.


NVLink would be in 2016 - http://wccftech.com/nvidias-gp100-pascal-flagship-pack-4096-...

Do you have any benchmarks on *gemm or FFTs?


While you will be able to interconnect GPUs with NVLink in ~2016, the first CPUs supporting it won't be until 2018/2019.

We will be releasing benchmark data when we have first silicon back next year. As of right now, our numbers are based on FPGA prototype implementations.


So basically you're shamelessly promoting a powerpoint processor versus shipping hardware? Pot.kettle.tapeout. I thought that was Intel's specialty.

It's a very long road to a shipping chip and a lot can happen between now and then. And afterwards, it's an even longer road to the equivalent of nvcc, nvvp, cuda-gdb, cuBLAS, cuFFT, cub, cuRand, and cuDNN with solid linux/windows/mac support. And it's all free. NVIDIA achieves this through its nearly bottomless pockets from the gaming/industrial complex that has allowed them to weather many near-death experiences.

What's your burn rate BTW?

In the meantime, a $1000 GTX TitanX (which you can easily buy off Amazon) delivers 27 GFLOPS/W of FP32. So much for "a 10 to 25x increase in energy efficiency for the same performance level compared to existing GPU and CPU systems." At least get your facts straight. And while we're talking facts, the real challenge here is that they deliver 6.7 GFLOPS/$. That seems like a tough squeeze for a startup to beat. Again, what's your burn rate? For that's how it's ended so far for the other contenders. Why are you different?

I have no problem believing NVIDIA will one day be disrupted, but nothing I read on your web site made me feel in any way that it's your architecture. Suggestion: lose the router and replace it with something simpler after studying NVIDIA's memory controller and hierarchy. They considered a lot of crazy things too and there are many reasons why GPUs ended up the way they are with a very clean programming abstraction that automagically subsumes SIMD, multithreading, and multicore.


As for being a "powerpoint processor", we have gone through multiple FPGA prototypes and physical simulation with high speed timing, and we are using that knowledge to get our performance and efficiency metrics. I don't want to release full benchmark data at this point based on software and FPGA simulations of an unfinished design, as it is just that... unfinished.

Burn rate is very low (relatively speaking)... Our full size team, which we are building up (We're hiring!), is only 7 people. Our current runway is around 18 to 20 months, and we have some unannounced (and not included in that runway) funding coming along.

Semiconductor economics are also in our favor, with our ~100mm^2 chip being a lot cheaper (per unit, and in terms of GFLOPs/$) than NVIDIA's ~650mm^2 (and up) chips.

As for "getting my facts straight", I was using actual Titan X numbers I have seen for real applications (e.g. nbody simulation)... it only gets around 4000 GFLOPs single precision compared to their advertised 7000. I was being generous to NVIDIA in saying 20 GFLOPs/watt at best when they are currently at ~16 (before you include the CPU). If you know about NVIDIA, you should know they have a long history of bullshitting numbers.

As for our design, there is a lot more interesting stuff that we haven't disclosed for obvious reasons, but I do stand by all of our numbers. When it comes to our network on chip, the main reason we are keeping it this way is because it is a general purpose design... of course, we would be doing things very differently if we wanted to have something application specific for machine learning.


I've been looking at the history of specialized hardware for running neural networks. Starting with Intel ETANN, every single chip initiative has failed. The reason has typically been the same - "silicon steamroller", meaning that general purpose hardware is always progressing fast enough to make any specialized chips not attractive enough in terms of cost/performance ratio, when they finally hit the market (which was always later than planned).

With Nvidia throwing all their weight behind acceleration of deep learning applications, they're advancing on multiple fronts: Tegra X1 claims 1 Tflops @ 10W (half precision). That's what they're shipping today, what about 2016?

Aside from rapidly improving GPUs, you now have rapidly improving FPGAs. For example, Altera has put thousands of hard multipliers on their latest Stratix 10 chip. They claim 10 Tflops SP, and 80 Gflops/W: https://www.altera.com/products/fpga/stratix-series/stratix-... Microsoft is already using them to power Bing search (apparently they use NN based algorithms for that).

What makes you think your chip - if it's actually out in 2016, which can easily slip to 2017 - can compete with Pascal chips from Nvidia, or the next gen FPGAs from Altera/Intel or Xilinx?

Who is your market exactly?


To broadly cover most of what you brought up, we think our "secret sauce", which will both allow us to actually make a commercially viable processor and allow us to compete, is the fact that we are developing some very advanced software in house taking advantage of many new techniques. I can't go into too many details, but we actually think we have developed something close to the long talked about "magic" compiler.

When it comes to something like the X1, you have to remember NVIDIA loves exaggerating their benchmark results. In reality, it gets around 80% of its theoretical peak, and if you apply it to double precision, it only gets around 40 GFLOPs on Linpack (out of a theoretical 64 DP GFLOPs). Even if you take NVIDIA at face value and say they hit full theoretical peak (64 GFLOPs double precision) at 10 watts, that only gets you 6.4 GFLOPs/Watt.

One thing that has been completely disregarded in this and my previous posts is that the GFLOPs numbers we have been quoting are for Linpack and matrix-matrix (Level 3 BLAS) workloads, which cover a very small number of real world applications... I would say Level 3 BLAS benchmarks get around 90% of theoretical peak on GPU systems on average, but as soon as you get into other application spaces (Level 1 and Level 2 BLAS, or anything dealing with a lot of memory movement), GPUs really start to fall off, only getting ~10-20% of theoretical peak.

Our architecture is built to actually reach theoretical peak with a "perfect" kernel (which in our case is our hand-written FFT kernel) at all 3 BLAS levels. In reality, I would expect us to hit at least 85-90% of theoretical peak in all of those floating point domains.
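The Level 1 vs. Level 3 gap described here comes down to arithmetic intensity (flops per byte of memory traffic). A rough back-of-envelope sketch (the operand counts are the standard textbook figures, not measurements of any particular chip):

```python
n = 4096
bytes_per_word = 8  # double precision

# Level 1 BLAS (axpy: y = a*x + y): 2n flops, ~3n words of memory traffic
axpy_intensity = (2 * n) / (3 * n * bytes_per_word)

# Level 3 BLAS (gemm: C = A*B + C): 2n^3 flops, ~4n^2 words of memory traffic
gemm_intensity = (2 * n**3) / (4 * n**2 * bytes_per_word)

print(round(axpy_intensity, 3))  # 0.083 flops/byte: hopelessly memory-bound
print(round(gemm_intensity))     # 256 flops/byte: compute-bound, peak is reachable
```

With thousands of times more reuse per byte, gemm can saturate the FPUs while axpy-style code sits waiting on DRAM, regardless of vendor.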

As for Altera and FPGAs... they are a royal PITA to program for, and will never be all that efficient. Altera's 10 TFLOP/s number is ONLY for single precision float, and is based on adding up the theoretical capabilities of all the DSP slices... in reality, you would never be able to hit that with the memory limitations going to all the DSP slices. In the small print, Altera even admits their 10 TFLOPs number is BS, as they list the highest FP32 number as 9.2 TFLOPs. Again, that theoretical 80 GFLOPs/Watt (which I would be surprised if it hits 50 in the real world) is only for single precision, where we are aiming for 128. All the while we are a hell of a lot easier to program for.

As for short term competition, we know for a fact that Intel and NVIDIA will not be hitting our efficiency levels in this decade, and in addition we will have a cost advantage. Compared to FPGAs, we have the ability to actually port existing code over without huge performance sacrifices (C-to-gates and Altera's OpenCL work are abysmal for performance) or having to change your development team/learn how to write RTL (you can write code for our chip in any language that LLVM has a frontend for).

As for market, our initial target market we are actively working on is large constellation FFT type workloads... think LTE-advanced and "5G" basestation processing as that's where we have our best numbers (25x efficiency over the best DSPs in that space). Beyond that we are looking at the larger "HPC" category, and in the 5+ years out, I hope to be able to expand to more general purpose markets.


Ok, then you should have clarified that your target market is telecom, and not deep learning applications.

However, if you do decide to go after deep learning (seeing as it is a much faster growing, and potentially much bigger market), I have a few questions for you:

1. Will I be able to take my highly optimized Torch/Theano/custom CUDA code, and run it on your chip with minimal modifications? Especially taking into consideration that even some latest CUDA code is not compatible with older GPU architectures?

2. How much will your devbox cost, compared to the Nvidia devbox?

3. Will I get much better performance (16 bit) as a result?

Keep in mind that I'm talking about 2016/2017 version of Nvidia devbox (Pascal should have at least 30% better performance compared to current Maxwell cards, and probably more than that if they manage to move to 14nm process).

Regarding the "magic" compiler, have you been watching the development of Mill CPU? Designed by the guys with strong DSP and compiler design background. They also put a lot of emphasis on the compiler. Is that project dying? After two years of hype, it seems like it never got out of the "simulated on FPGA" phase... What can you learn from them?


It's a pretty big leap from lots of little parallel (and I assume 1D) FFTs to a deep learning processor. The rexcomputing chip looks like a DMA-driven systolic array with beefier data-processing units than I've ever seen before.

https://en.wikipedia.org/wiki/Systolic_array

There are certainly applications for this sort of processor (embarrassingly parallel batches of small independent units of work), but I'd be highly skeptical that this guy has anything close to a "magical compiler(tm)" given his inaccurate understanding and significant underestimation of the competition. That's dangerously close to Intel's absurd "recompile and run" nonsense for Xeon Phi (It's anything but that)...


One more thing I wanted to add. As someone who often evaluates new processor architectures, n-body is anything but a real application. It's a nice demo (pretty too), but real applications (even those that are similar to n-body) have far more complicated inner loops along with sophisticated culling algorithms to effectively reduce the computation to O(n). So far, GPUs have been the only platform flexible enough to do this well. If you pitched me with n-body, I'd laugh you out of the room. Show me compelling AlexNet performance (both forwards and backwards) and you'll have my undivided attention. Those few apps that get 10-20% peak performance tend to show even worse efficiency on CPUs in my experience because they're fundamentally I/O-bound.

But unfortunately, those are exactly the sort of pitches I see: unrealistic or overly simplified demos that tell me jack about the real world utility of a new architecture.

Switching to a new platform is a Mt. Doom of technical debt no matter how fantastic it is. You need to make climbing Mt. Doom at least sound like a good idea. That said, I've been on your side of the fence several times in the past. Hear me now, believe me later I guess.

You can defer and say you are targeting telecom and that's great. But you just picked a fight with NVIDIA here over deep learning and you're 7 people with <2 years of funding left in the bank. Put up or shut up.


"Put up or shut up"

I think we should give him some respect. He's developed a novel processor, and started a funded company before he turned 18.

Even if he fails to sell this particular design, we need more people like him.


What can I say? I respect a shipping product and nothing short of it. He entered the conversation pitting his powerpoint processor against shipping hardware.

Have we not tired of all the broken promises of Kickstarter and IndieGogo yet?


"As for "getting my facts straight" I was using actual Titan X numbers I have seen for real applications (e.g. nbody simulation)... it only gets around 4000GFLOPs single precision compared to their advertised 7000."

Nonsense. I regularly get ~5.5 TFLOPS out of them running cuBLAS SGEMMs in neural networks. You can get that down to ~1.4 if you do your best to choose stupid small values for m, n, and k, but that's a relatively minor bug and I think it's fixed in cuDNN's latest kernels.

But even if it isn't, if you're willing to download Scott Gray's maxas: https://github.com/NervanaSystems/maxas, you can hit 6.4 TFLOPS with his hand-coded SGEMM. Similarly, one can do the same for convolutional layers with Andrew Lavin's maxDNN: https://github.com/eBay/maxDNN
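For anyone wanting to reproduce this kind of figure, the convention is 2*n^3 floating point operations per n-by-n matrix multiply. A numpy version of the same measurement (CPU BLAS here, so expect far less than a Titan X, but the arithmetic is identical):

```python
import numpy as np
import time

n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

a @ b  # warm up the BLAS library before timing
t0 = time.perf_counter()
c = a @ b
dt = time.perf_counter() - t0

# one n x n SGEMM is 2*n^3 floating point operations
print(f"{2 * n**3 / dt / 1e9:.1f} GFLOPS")
```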

Your hubris is amusing, but I reiterate that until you have a shipping chip, all you have is a powerpoint processor. I'm sure you disagree. Good luck with that. But do come back when you have numbers (real numbers on real DNNs like AlexNet, VGG, or Googlenet as opposed to synthetic fantasy networks that fit your architecture well). See Nervana for a company doing this well so far.

"If you know about NVIDIA, you should know they have a long history of bullshitting numbers."

I know them very well and, well, pot.kettle.bs... IMO they're going to stay the leader in parallel computing technology right until Intel stops sniffing its own tailpipe and/or AMD hires a better driver team (they have promising HW). And both of them need to study NVIDIA's engagement with the academic community and one-up it rather than deny its efficacy.


Xeon Phi Knights Landing, a product available within months, will do 500 GB/s.


> fully general purpose

Programmed via OCL? Don't I need some kind of OS/host CPU functionality to load the inputs into DRAM and retrieve them?


We're building our own LLVM backend, allowing us to support any language with an LLVM frontend. You do not need an OS or host CPU to move memory (unlike a lot of similar many-core processors), and each core is fully capable of loading/storing from any core's scratchpad (either on a single chip or over multiple connected chips) or to/from DRAM.


Any idea how much your chip is likely to cost?


I'm not an expert in these things, but I was under the impression that when things like this make it to consumers, they've run a big training data set, taken a snapshot of the trained neural network, then it's used as a static snapshot.

Then, if outliers/problems are encountered in the real world, rather than training the end user's neural network alone, the outliers are fed back to the centralised system, and later a new snapshot might be sent out.

After all, people want their self-driving cars and voice recognition phones to work out of the box. And if there's a misclassification you correct on your phone, you want it to propagate to your tablet, PC and smart fridge :)

If things are done that way, I would have thought the market would end up with only a handful of customers - there won't be a need for a deep learning chip in every phone. There might be a world market for maybe five deep learning systems :)


The world market is bigger than 5 systems. I'd argue it looks more like relational databases in ~1980. Most people to this day do not know what a relational database is, but they probably use one multiple times a day. Over the years, databases have gotten bigger, faster, more complex, more powerful, easier to use, and more open.

Deep learning--and machine learning more generally, because that's what we're really talking about here--is going to go through the same process. NVIDIA is positioning themselves to be the vertically integrated supplier in this market. The really interesting part is that no one is sure yet what kinds of applications these tools will open up -- it is likely to be an enabling platform/toolset in much the same way relational databases, app stores, or IaaSs were.

AMD and Intel are woefully behind. Intel because it's not yet a big enough market (at a $137B market cap compared to NVIDIA's $12B, an opportunity needs to be literally 10x as big for Intel to be interested). Why AMD is not all over this I'm not sure... If it's interesting for NVIDIA, it should be very interesting to a $1.6B operation like AMD.


> Deep learning--and machine learning more generally, because that's what we're really talking about here

For Nvidia's market, imo we have to be careful to separate out which kind of ML specifically. Certain kinds of training of deep networks get large speedups from GPUs, which isn't true of all ML techniques. So if Nvidia wants to get a big win in this space by being a vertically integrated tool supplier leveraging GPU hardware, the strategy is tied to a bet on a specific subset of ML techniques, those where the ratio of GPU:CPU performance is especially large.


"Why AMD is not all over this I'm not sure..."

NVidia have been very good at encouraging the use of their proprietary CUDA. They also have a slight FLOPS advantage I'm told (where AMD have an integer advantage)


It helps when CUDA allows for C++, Fortran and any other language with a compiler that spits out PTX, while OpenCL stayed for so long as stone-age C.


> Why AMD is not all over this

AMD are barely treading water in the markets they already compete in. Both Intel (CPU) and NVIDIA (GPU) have stolen back enormous mind and market share [0]. I doubt they could afford the outlay of capital and research time to build something competitive. Maybe if they were in better shape financially a braver board/CEO could pivot them away from the CPU/GPU markets they've lost to something different. Their current road-map takes AMD through to 2016, after that who knows.

[0] NVIDIA's 9XX series GPUs have been incredibly aggressively priced and marketed. AMD effectively ceded the high end consumer (gaming etc) market since September last year. Gaming parts are big PR for both companies, so it's quite a bad sign IMO.


"Maybe if they were in better shape financially a braver board/CEO could pivot them away from the CPU/GPU markets they've lost to something different."

Yes, wouldn't it be great to let the market leaders in these sectors have a total uncontested monopoly...


> Why AMD is not all over this I'm not sure...

They're barely staying alive, and have been axing branches left and right to "focus". Like selling off their smartphone graphics branch a year after the iPhone was announced… it's now rather successful under Qualcomm (Adreno is an anagram of Radeon).


> The world market is bigger than 5 systems.

He's referencing this:

https://en.wikipedia.org/wiki/Thomas_J._Watson#Famous_misquo...


The chips that are best at running neural nets are the same ones that are best at training. If you want to run state-of-the-art deep neural nets in your phone, e.g. for translation, image recognition, always-on voice recognition, etc, then you're going to need dedicated hardware to fit in a phone's power constraints. State-of-the-art deep neural nets are already too large to run well on a phone CPU; today's applications either drastically cut down the neural net (compromising on performance) or send the data to a server for processing.
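"Drastically cut down" in practice often means quantizing weights to fewer bits. A minimal symmetric 8-bit quantization sketch in numpy (illustrative only; real schemes vary per framework):

```python
import numpy as np

w = np.random.randn(4096).astype(np.float32)  # trained float32 weights

# symmetric linear quantization: map [-max|w|, +max|w|] onto int8
scale = np.abs(w).max() / 127.0
q = np.round(w / scale).astype(np.int8)
w_hat = q.astype(np.float32) * scale  # dequantized approximation

print(q.nbytes / w.nbytes)                           # 0.25 -- 4x smaller
print(np.abs(w - w_hat).max() <= scale / 2 * 1.001)  # True: error <= half a step
```

Four times less memory and bandwidth per weight is exactly the kind of saving that makes a net fit a phone's power budget.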


I would also add AR/VR. Running 4K screens at high framerates or actively parsing surroundings to overlay new data on top requires a bit of computing power.


Is the field of machine learning really stable enough to warrant an ASIC? Seems to me a set of new techniques is released every year now. The lead-time on ASICs is ~1 year. So I don't see how a Deep Learning ASIC could keep pace.


New training techniques are released on a monthly basis, but the basic structure of convolutional neural nets has actually been around since the '90s and hasn't materially changed. I think that's a stable enough target to build an ASIC for. Even if you're unable to implement the year's hottest training technique for some reason, if you're 10x bigger/faster you'll likely still set world records using last year's techniques.
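That stable core structure is small enough to write down. A naive single-channel 2D convolution in numpy (valid padding, stride 1; a sketch of the operation, not an efficient implementation):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image and take dot products
    (cross-correlation, as convnet layers actually compute it)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

img = np.random.rand(8, 8)
edge = np.array([[1.0, -1.0]])  # trivial horizontal edge detector
print(conv2d(img, edge).shape)  # (8, 7)
```

An ASIC essentially hardwires this loop nest (plus the matching gradient computations), which is why the stability of the operation matters so much.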


Yes, NVIDIA owns Deep Learning. And they will continue to own it for at least the next 2 years. But I would be remiss not to point out that you can build one of these things yourself for less than half the price at which they're selling it.

And while I think one can build a deep learning ASIC, a 10x better ASIC seems like a tough bet to me. I mean on the surface it sounds good, but the devil is in the details here and by the time you've built something flexible enough to run every reasonable variant of a neural network both forwards and backwards, you start making the sort of decisions that make your processor look more and more like a GPU and that magical perf delta drops. Also if you start today, you need to target tomorrow's GPUs, not the $1000 consumer model you can buy on Amazon now.

That said, I'm looking forward to Altera's new hardcoded floating point-enhanced FPGAs. Too bad I have no idea how much they cost.


Even less if you use AMD GPUs - https://www.reddit.com/r/linux/comments/2zgpj8/15000_nvidia_...

I wonder why the popular deep learning frameworks mainly use CUDA instead of OpenCL. Is it because of better Linux GPU drivers? Wondering why AMD isn't jumping on deep learning.

The Altera FP-capable FPGAs sound really interesting too. 10 TFLOPS, OpenCL support?

http://www.slideshare.net/embeddedvision/a04-altera-singh

Looks like they're about to be bought by Intel? http://www.electronicsweekly.com/news/business/altera-import...

Does this mean FPGA co-processors in the future from Intel?


I would love to see AMD jump in the ring, and there's even an OpenCL port of Caffe in progress: https://github.com/BVLC/caffe/pull/2610

But its performance is less than half that of a GTX 980 running CUDA. Still, AMD is silly not to try and improve on this IMO.


Wonder what's causing the perf drop. Would make sense to Intel to push openCL for deep learning too


Bad compilers would be my guess. And not only would it make sense for Intel to push OpenCL for Deep Learning, but it would make sense (IMO of course) for them to push OpenCL as a much better abstraction for accessing the vector units in all their CPUs in a multicore fashion across the board.


"Altera's new hardcoded floating point-enhanced FPGAs. Too bad I have no idea how much they cost."

Probably in the same ballpark as current high end Altera chips: over $30k. If you need a thousand of them for your datacenter, think about what kind of ASIC you could design for $30 million. Or think about how many Titan X cards you can buy for $30k :)


Could you elaborate as to why this is a huge opportunity business wise?


Deep learning has already completely taken over speech recognition, face recognition, and object recognition. Going forward deep learning is the technology that will solve sensing and perception for machines. You're going to want sensing and perception in your phone and every computer you use, but even more than that: every drone, every self-driving car, every kind of robot is going to need multiple deep learning chips. Each of those chips is going to need vastly more FLOPS and more memory bandwidth than the biggest CPUs and GPUs of today.

Beyond sensing and perception, I believe that deep learning will also be the technology that solves planning, natural language, reasoning, and creativity for machines. This is much more speculative, but you can see the beginnings of planning in the DeepMind Atari work, the beginnings of natural language processing and reasoning in machine translation and various question answering systems, and the beginnings of creativity in Deep Dream and other generative models people have done. Of course once all those pieces are solved then AI is solved. The market for a technology that solves AI is practically unlimited.

Intel is constantly searching for the next big application that will require more processing power so people need to buy faster CPUs and they can justify spending $X billion on their next fab. In recent years they've been struggling to find it. Deep learning is it. The appetite for FLOPS and memory bandwidth is unbounded for the foreseeable future. Unfortunately for Intel, CPUs are weak on both compared to GPUs. Maybe Xeon Phi can morph into a deep learning system?


I am similarly very bullish on this field, even regarding claims I once thought incredible: that it can eventually understand the semantics behind documents, do question answering, and even reason.

What blew my mind was this lecture, given by Hinton https://drive.google.com/file/d/0B8i61jl8OE3XdHRCSkV1VFNqTWc...

The main idea is that reasoning is just a sequence of 'thought vectors' that can be encoded within a recurrent neural network.

Sounded almost outlandish to me until I watched Richard Socher's lectures and started to understand that words can be represented as vectors, and that these vectors can be then encoded into new vectors of even higher representation. 'Thought vectors' may not be so outlandish after all.
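The word-vector idea can be illustrated with hand-picked toy vectors (the four vectors below are made up for the example; trained embeddings like word2vec exhibit this behavior at scale):

```python
import numpy as np

# toy 4-d "word vectors", hand-crafted so that gender and royalty
# live in separate dimensions (real embeddings learn this from text)
vec = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.9, 0.1, 0.2]),
    "woman": np.array([0.1, 0.1, 0.9, 0.2]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# the classic analogy: king - man + woman should land nearest queen
target = vec["king"] - vec["man"] + vec["woman"]
print(max(vec, key=lambda w: cos(target, vec[w])))  # queen
```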


You will likely be interested in this recent presentation by Hinton where he explains how he believes the brain does deep learning: https://www.youtube.com/watch?v=kxp7eWZa-2M


Thanks, I'll watch that tonight!


>Going forward deep learning is the technology that will solve sensing and perception for machines.

I would take this (and any such very self-assured predictions) with a grain of salt.


In the coming years we will see a lot more applications powered by deep learning. If someone releases a chip that provides 10x performance compared to the current GPU-based method, they will sell a lot of chips.


No doubt -- the problem is the assumption that 10x performance is possible. Though if anyone can deliver, it will (imo) be Intel: it will be a process war, and they seem to be able to pack in more transistors than anyone else.



Intel may have to find a different niche. Their Xeon Phi approach to parallelism is very interesting, speaking as someone dabbling in AGI, but is not comparably well suited for deep learning.


How well suited are FPGAs for deep learning?


The answer to the question "How well suited are FPGAs for <insert field where GPUs or vanilla processors do not excel>?" is always "Far better than the GPUs or processors, but with an abysmal power usage". In this kind of new development, FPGAs are usually used as a test before moving to ASICs.



Basically everything you interact with on a daily basis involves (deep) machine learning. Everything from advertising to logistics to drug discovery to credit evaluations to packet routing on the internet either involves deep learning in its operation, or is presently implemented with some optimal strategy that was discovered via deep learning. Machine learning runs the world.

As for the business opportunity, if you are in one of these industries then one of the prime ways you differentiate from your competitors is how well your algorithm works. And how well it works depends a great deal on how much training data you are able to feed into it, which in this era of information saturation is basically a hardware limitation.

Give me a 10x more powerful machine learning system, and I'll give you a few basis points advantage over the competition, and that'll make you, not them, the dominant player.


Packet routing? If there is an example of this, it is very far from the norm. Routing on the Internet is driven first and foremost by peering arrangements.


Now ask yourself how decisions are made regarding new infrastructure deployments.

But what I actually had in mind was QoS.


New infrastructure deployment is not "packet routing", and QoS is usually as strongly tied to policy as BGP is.

If you have an example of QoS relying on ML, I would be interested in reading more.


Who is using deep learning for packet routing and QoS decisions?


No one I know of, as of the last time I checked on this topic. However, it seems possible to plug ML in on top of something like OpenFlow. PM me if you're interested in looking this up further.


Thanks. I was mostly interested in seeing a source for maaku's statement of "Now ask yourself how decisions are made regarding new infrastructure deployments", as it seemed pretty condescending to me.


Even in AI, deep learning is not the state of the art in the majority of domains. E.g., in QA, Watson is still the best at Jeopardy-style general QA. In general game play, statistical methods rule, etc.


OpenCL is slowly being supported by some of the main deep learning packages. The annoying part of Nvidia's GPUs is the drivers and closed-source CUDA. Their cuDNN extension is a proprietary blob (with bugs) that boosts performance but remains out of reach for scientists and developers. My hunch is that OpenCL will become the standard for deep learning, thus opening things up to a larger set of hardware options.


Especially in the (possibly very lucrative) market of deep-learning-powered devices, like self-driving cars.

nVidia has already demoed hardware in that area.


Is deep learning suitable for self-driving cars? I would expect that cars, like other safety critical systems, need to be formally proven to work as intended, which AFAIK can't really be done with neural networks.


I don't think it would be possible to formally prove that e.g. any image recognition system will work correctly given all realistic real world inputs, so that requirement doesn't work out anyways. It's more likely that they are required to perform extensive real world tests + maybe beforehand tests against video material.


Where did you get this idea from?

Formal proofs even in avionics are very limited, and planes aren't exactly dropping out of the sky. Safety critical systems need a degree of auditing and testing, to be sure, but formal proofs have never been a requirement since they aren't that practical.


This is just FUD. Formal proofs are quite prevalent. Look at ACL2 for a start.


Not FUD at all. Pointing at a system doesn't provide any evidence that proofs are prevalent in industry. Check out:

https://chess.eecs.berkeley.edu/hcssas/papers/Cleaveland-pos...

> Although formal proofs of correctness are (rightly) touted as providing superior guarantees to those of testing, DO-178B makes no mention of formal methods, except in an Annex given over to techniques deemed not yet mature enough for certification purposes.

Or http://ti.arc.nasa.gov/m/pub-archive/1023h/1023%20(Denney).p...:

> In principle, formal methods offer many advantages for aerospace software development: they can help to achieve ultra-high reliability, and they can be used to provide evidence of the reliability claims which can then be subjected to external scrutiny. However, despite years of research and many advances in the underlying formalisms of specification, semantics, and logic, formal methods are not much used in practice. In our opinion this is related to three major shortcomings. First, the application of formal methods is still expensive because they are labor- and knowledge-intensive. Second, they are difficult to scale up to complex systems because they are based on deep mathematical insights about the behavior of the systems (i.e., they rely on the “heroic proof”). Third, the proofs can be difficult to interpret, and typically stand in isolation from the original code.

The situation will just get worse as we increasingly move away from hand-written logic to machine-learned logic. Now, there is probably a place for formal methods in this newer order, but it will take a while for researchers to find it, and it is definitely not going to be at the level of verifying hand-encoded logic that doesn't exist.


I meant to point out formal verification is used in the industry quite a bit (not in aviation alone).

1. Replacing testing with formal verification at Intel (2009)

http://link.springer.com/chapter/10.1007/978-3-642-02658-4_3...

2. Fifteen years of formal property verification at Intel (2008)

http://link.springer.com/chapter/10.1007/978-3-540-69850-0_8

3. seL4: formal verification of an OS kernel

http://dl.acm.org/citation.cfm?id=1629596

There is such a big need that DARPA has even come up with a contest.

http://www.darpa.mil/program/crowd-sourced-formal-verificati...

I could go on, but I would rather not spend my time combating FUD and the general incivility here.


Formal verification is used a lot, but it is not a rule and definitely not a requirement by the FAA for aircraft certification. The idea that formal verification would be obviously necessary for the logic of a self-driving car is what was being challenged here. You could call that FUD, but you obviously have an agenda to push, I guess.


> not a requirement by the FAA for aircraft certification.

I never said that. Where did you get that?

> Safety critical systems need a degree of auditing and testing, to be sure, but formal proofs have never been a requirement since they aren't that practical. (emphasis mine)

I was challenging this idea.

Are systems for self-driving cars the only safety critical systems? What about Intel manufactured chips used in those same cars and flights and a gazillion other devices?

You have an agenda to push and got called out on facts :)

Have a good day. (Someone just went around down voting my other posts. Not you)


I guess it depends on how you interpret "that", which I meant to refer to "requirement" (they are not practical enough to be a requirement). But I guess I should have been more clear in my sentence structure.


Still not clear but we will leave it here :)

Btw, I use both deep learning and knowledge-based systems in my work (but not formal verification), but I don't like the advertising that goes on in AI by academics from all camps (don't want another AI winter).


Fair enough. I have no skin in this game since I'm a PL researcher...


Intel uses verification after they fumbled the Pentium FPU; verifying silicon is a good idea (also, it's complex but doable).

Now good luck verifying a machine learning system beyond the basics.


Umm that is the point being made ...


I don't necessarily disagree with your points here, but for completeness you should note that DO-178B has been moving to DO-178C for a few years now, which includes references to formal methods for systems verification, specifically DO-333. Also, that paper by Denney is 10 years old, and some strides have been made since then to improve formal method tools for viable verification.


Even with a lack of formal proof, a reproducible, and therefore thoroughly testable, black-box AI is still much safer than the diverse population of human drivers, which includes drunk drivers, the partially blind, psychopaths and so on.


There are certain subsystems of self driving (visual recognition) that are inherently learning/pattern based.


People are doing deep learning even on mobile GPUs. Even on VC4 on Raspberry Pi.


installed standard Ubuntu 14.04 w/ Caffe, Torch, Theano, BIDMach, cuDNN v2, and CUDA 7.0

whoa - are you telling me that the nVidia drivers on Linux are so stable that they are building a commercial deep learning system on top of them? Is this the same thing as the normal graphics drivers?


The difference is between "given this particular hardware and OS setup, the driver will work correctly, guaranteed." vs. "on your (discontinued) Sony laptop with a strange hardware interface running the beta Slackware release the driver will probably work"


It's also the difference between "using our tools implementing our API on our hardware" vs. "trying to figure out the right thing when every component has a slightly different take on the API spec and the applications using the API make mistakes that we have to try to correct for with unreliable heuristics". Developers in the scientific computing world care about correctness a lot more than game developers working under unrealistic deadlines and with no commitment to long-term maintenance.


that doesn't make sense to me. If the driver works and does not crash the operating system (kernel panic FTW), that's good enough at this point.

nVidia uses CUDA or OpenGL - so it's not quite a question of a proprietary API.

At this point, I'm not worried about "framerate on my Linux box isn't as good as Windows"... it's more "it works...".


An OpenGL driver capable of running most commercial games is horrifically more complex than a CUDA driver. An OpenGL driver that merely works according to spec is useless in practice. To achieve any practicality for non-trivial use cases an OpenGL driver has to take an attitude of "do what I mean, not what I say", much like web browsers and Windows' backwards compatibility. CUDA doesn't have those problems. NVidia never has to deal with developers complaining that their broken code worked fine on some other vendor's platform. They don't have to worry about programs relying on some esoteric decades-old feature that NVidia doesn't care about but had to implement anyways for standards compliance. And since CUDA is operating in the professional segment of the market, they can take their time when it comes to compatibility with bleeding-edge versions of other OS components.


Also, a compute API makes a lot more sense than OpenGL, with an understandable mapping to actual GPU resources.


AFAIK Linux with Nvidia GPUs is used in plenty of supercomputers and Hollywood special effects companies, so the drivers must be stable enough.


In Nvidia's deep learning lab last week [0] they described setting up on Ubuntu as "the easy way" and Windows the "not so easy way" [1], adding that their developers are on Linux so no-one has really tested it on Windows.

[0] https://developer.nvidia.com/deep-learning-courses

[1] http://on-demand.gputechconf.com/gtc/2015/webinar/deep-learn...


Nvidia's drivers for Linux are pretty darn good these days.


Those for laptops are pretty shitty; we are still using third-party programs to support Optimus technology (with not-great results).


I just wasted two days trying to get bumblebee to work on Debian with my Optimus laptop. :( Ultimately had to switch to Ubuntu and the Nvidia drivers (which apparently default to 'Prime' there).


Optimus technology is a fairly stupid idea in the first place, though; it exists for business reasons rather than technical ones.


Up to this day, it's not really possible for a laptop with Linux to drive a 4K external display without too much inconvenience. Even Broadwell has problems driving 4K with SST @ 60 Hz, so Optimus is useful.


What's stupid about it? Sounds pretty smart to not have to use the powerful GPU if I'm only browsing the web, to me.


Intel's got the good fabs, but they can't get along well enough with the folks who know how to make a good GPU, so it's impossible to get a desktop or mobile system, at any price, with a decent GPU without actually paying for two entirely different GPUs.


"The powerful GPU."

You've been sold the idea that a "powerful GPU" needs to suck a lot of power all the time.

There is no real reason a "powerful GPU" shouldn't be able to scale its power usage way down when doing something simple like browsing the web. The only reason NVidia weren't able to do the "low power" thing on these systems is they weren't able to be the ones putting their GPUs on the same die as the CPU like Intel (& AMD) were. But of course they still wanted part of the action, so people ended up being sold this massive engineering bodge and told it's a good thing.


I have a laptop that has an Intel CPU and a powerful discrete AMD GPU. Even though on-die GPUs have gotten better, there is still a considerable difference in performance between on-die and dedicated. AMD and NVidia both realize that there is an onboard chip that can do the common stuff, and so they have added the ability to turn the dedicated GPU off when it isn't needed.


If you imagine a properly, holistically-designed product, which wasn't full of chips from different warring companies, the high power GPU could be used to augment the power of the on-die GPU, instead of having to turn it off and deal with a whole bunch of mad signal-switching issues. AMD products can do this to an extent with crossfire, but generally, this is a world that we don't live in.


Laptop or desktop? Because laptop drivers for Ubuntu 14.04 are still not that good. I'm on a Thinkpad T430s with nvidia (I disable Optimus in the BIOS).


for CUDA only.

and keep in mind, it is a binary blob, proprietary to the core. in fact, CUDA the protocol itself is kinda proprietary.

if you base your solution on OpenCL, you can use all the vendors: nvidia, amd, IBM, intel, altera, etc...

but NVIDIA spends billions on marketing to convince you that only CUDA matters. the same way sony spent millions to convince you that only laser disc^H^H^H^H^H^H mini disc ^H^H^H^H^H memory stick ^H^H^H^H^H Blu-ray, matters.


the nVidia driver is the most stable and feature-complete driver you can get on Linux.


Yep. A godsend in comparison to the AMD drivers.


The price for a custom build with these specs is ~$8k: http://pcpartpicker.com/p/NP4MNG Spending that money on an EC2 GPU instance would be a better use of it unless you really need a local workstation.


I'd recommend a local custom build over EC2. EC2 GPU instances are virtualized via a hypervisor, which dramatically reduces performance on multi-GPU networks, and this doesn't take into account the large amounts of disk space needed for the training set.


does anyone know which hypervisor they use? Can I build a local EC2-style GPU instance with these GPUs? I'm quite amazed that they are able to get the drivers, etc. working with these GPUs on top of a hypervisor



There are dedicated instances specifically for this purpose.


agreed - but which hypervisor do they use?


Normally Xen?


Really? If you're serious enough to need something like this, you likely need it for more than 4-5 months which is the break even point for a single instance (for your quoted price anyway) and I'm pretty sure this'll have much higher performance.
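The back-of-envelope version of that break-even, assuming 24/7 usage, the ~$8k build quoted upthread, and a g2.8xlarge on-demand rate of roughly $2.60/hour (my assumption for 2015-era pricing, not a figure from this thread):

```python
# Rough break-even: local build cost vs. renting an EC2 GPU instance 24/7.
build_cost = 8000.0   # ~$8k custom build quoted upthread, USD
ec2_rate = 2.60       # assumed g2.8xlarge on-demand rate, USD/hour

breakeven_hours = build_cost / ec2_rate
breakeven_months = breakeven_hours / (24 * 30)  # 30-day months

print(round(breakeven_months, 1))  # prints 4.3
```

Which lands squarely in the 4-5 month range claimed above, and that's before accounting for the local box's higher per-GPU performance.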


But that $25 promo discount


We built our own quad-Titan devbox a few months ago, with the same general components as this, except we used a Core i7-5960X and threw in a few 1TB Samsung SSDs in RAID. It came in just at $9,000 USD hardware cost, versus the ~$15,000 I think Nvidia is charging. Still, I'm sure they aren't making a ton of money, and you get a hardware guarantee with configuration (but the config wasn't so bad...).


Agree - we crunched the numbers and came up with the same figures to do-it-yourself (~USD$9K). Although, I remember from the day this was announced (few months back) that Nvidia were loud-and-proud that they weren't going to make money from this. Each box was hand built and tested, so not deemed to be a large-scale device - they recognized that it's a niche market.

$15K is probably OK(ish) if you figure in your own time for the DIY build...probably a few days. Plus you get some vendor support, warranty on the whole package, a certified working stack, a future test-bed for CUDA upgrades (will work first), etc., as you say.

In wild agreement. Save maybe 30% doing a custom build...so they aren't adding a huge mark-up, as they would for a gaming machine. Apparently...someone at Nvidia is looking a bit further into the future than just the short-term revenue.


newegg has been selling quad 12GB Titan X GPU combo packs for a while. single-click add-to-cart for 18 components: http://www.newegg.com/Product/ComboBundleDetails.aspx?ItemLi...


FWIW, the case is a Corsair Carbide Air 540 with hard drive sleds in the two 5.25" bays: http://www.corsair.com/en-us/carbide-series-air-540-high-air...

Makes sense to me, since you want the best airflow possible getting to the cards in a multi-GPU setup, and unlike conventional cases, the Air 540 doesn't have a drive cage between the front fans and the video cards.


I wonder how Nvidia building their own machines goes over with the many, many third party partners building similar rigs. On the Supercomputing 2014 showroom floor it seemed like half the booths were selling something like this and were covered in Nvidia branding.


This is a developer platform, in a rather inconvenient form factor for any sort of at-scale deployment. The partners honestly probably love it, because it means they won't be hit with support requests when a driver's acting up, and will just be getting the sales to handle the finished product.


I think Boxx Apexx-5 boxes are already on par with these (even more powerful).

http://www.boxxtech.com/products/apexx-5


That unit with only one K40 Tesla appears to be $12,000. If the nVidia box is really $15,000 (with 4 Titans, 9TB SATA, etc), then the nVidia box looks like a much better value (and probably more powerful).


Keep in mind this has a dual-socket motherboard. My colleagues bought 2 of those boxes with 4 Titan Xs, and they were actually cheaper than the price on the web site.


The Pascal cards are going to be much better, with HBM2 memory and possibly even actual double-precision performance (which isn't a problem for deep learning, but still...)


NVidia in particular are very good at selling the future - I'll warn you that much.


True 'dat, but combined with the first GPU die shrink in years, there's a decent chance a generational jump is actually coming. Whether the first version will be bug-free is much more questionable...


If a move from 28nm to 14nm FINFET plus HBM at the same time doesn't drastically increase performance, something is very wrong.


Why would I buy this, vs. renting a cluster of ec2 gpu nodes?


Latency, support, specifications. The EC2 GPU nodes have less, slower, memory, graphics cards with half the performance and a third of the memory, and less drive space. Additionally, said compute resources are not next to you, which means certain things, like deep learning combined with AR are not possible. If you're just doing number crunching, you may be fine with EC2 instances, but any sort of realtime development, you probably want a supported platform right next to you.


These GPUs are more than twice as fast as Amazon's GPUs. They also have 4x the memory and there is probably more GPU to GPU bandwidth as well. Doing deep learning on clusters is not practical yet due to bandwidth issues. If you want to train state-of-the-art models you need the biggest single machine you can get, and this is it.
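A quick back-of-envelope calculation shows why the interconnect, not the compute, is the wall. The figures below are rough 2015-era numbers consistent with this thread (PCIe 3.0 x16 at ~16 GB/s, Titan X on-card bandwidth of ~336 GB/s), so treat them as illustrative:

```python
# Rough 2015-era bandwidth figures, as assumptions:
pcie_gbps = 16.0       # host <-> GPU over PCIe 3.0 x16, GB/s
gpu_mem_gbps = 336.0   # Titan X on-card GDDR5, GB/s
dataset_gb = 1000.0    # a 1 TB training set, streamed once per epoch

# Time to move the full dataset across each link:
pcie_seconds = dataset_gb / pcie_gbps        # over PCIe: 62.5 s per epoch
on_card_seconds = dataset_gb / gpu_mem_gbps  # from card memory: ~3 s

print(round(pcie_seconds, 1), round(on_card_seconds, 1))
```

A ~20x gap per epoch, before adding any inter-node network hops, which is why a single big box beats a cluster for training right now.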


Data locality and performance.


Titans are $1.5k each, so that's $6k down before you even account for the rest of the hardware to run it. Ouch.


The whole machine off the lot is going to be less than a drywall contractor's Ford F150, yet the potential payback is many times higher while the operating costs are several orders of magnitude lower. Throw in the three-year depreciation and the box is a bargain so long as it can be put to work.


The whole machine off the lot is going to be less than a drywall contractor's Ford F150...so long as it can be put to work

Yes. You're a drywall contractor: you go to Craigslist and start humping some jobs. Make some bucks and buy you a nice F150.

Conversely, there aren't a lot of "need deep learning in my dentistry" ads on CL -- and no easy or clear-cut path from point A to point B.

So the investment makes sense, if the right conditions are true. To the average developer, those conditions look very murky and uncertain to gauge. So while it's obviously some sweet tech, it takes a lot more than tech.


I don't disagree. The price is the price because the price makes sense for businesses. Prosumers may buy one for the same reason pump-and-valve salespeople buy 911 GT3s: to drive something fast slowly. It's on a continuum with the drywall feller ordering the 6" lift and 34" tires, leather seats and dual climate control, when all he really needed was the four-wheel drive and the diesel V8.

Which BTW, the drywall impresario who's buying a new truck for his business isn't finding jobs on Craigslist. He's got business relationships with serious people who call him when a bid needs bidding and drywall needs hanging. It's deal flow just as it is for the sort of person who needs a 4 GPU box and CUDA code for their business. It's only a lot of money if it sits idle.

If there isn't a business case, there isn't a business case and buying it is an inefficient allocation of resources.


On the other hand, there are lots of researchers twiddling their thumbs waiting for code to run right now. The bet is that this activity will continue to grow quickly.

Of course this isn't for the average developer. Which is why they are marketing it to researchers and maybe AI startups.


The boxes will be $15,000 USD. Lead time is 8-10 weeks.


Poster above says the GPUs cost $6k total, and cases that can fit 4 GPUs aren't that expensive or elusive. So I guess the rest of the system is made from unobtainium, or they figure deep learning is the kind of red-hot topic that attracts enough suckers with money that this will fly.


Or people for whom the money value of their time exceeds the margin over a custom build that nVidia is asking.


I can spec out a system like that in less than 1h, but let's say it takes 2h, ending up with a total hardware cost of ~$9k plus $200 to have it assembled. Are you really valuing your time at more than $2000 an hour?


If someone's willing to pay, sure! But more realistically, a ML researcher doesn't want to know the differences between Haswell-E and Skylake, or which DIMMs are rated at which speed, or what's the optimal Linux driver situation, still less to know how to fix any of the myriad things that can go wrong. When Caffe starts screwing up, it's really nice to be able to call nVidia and make your overheating Titan their problem.


I'm down with you about calling support when your hardware craps out. But an ML researcher who doesn't know the difference between Haswell-E and Skylake, etc., will likely make suboptimal choices when using the hardware. Using GPUs efficiently requires in-depth understanding of how they function and how they interact with the rest of the system.


This is true, and also, I think it's perfectly likely that nVidia is trying to soak people buying with other people's money.



I'm working on probabilistic programming. Hierarchical models are very close to deep learning. PyMC3 has a Theano backend, so this kind of setup is very exciting. Anyone else with the same thoughts/interests?
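A minimal numpy sketch of the partial-pooling idea at the heart of hierarchical models, on hypothetical data (a real PyMC3 model would infer the group and population parameters jointly by sampling on the Theano graph, which is exactly the part a GPU box accelerates):

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical data: 5 groups, each with a few noisy observations of its mean.
true_means = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
obs = true_means[:, None] + rng.randn(5, 4)  # 4 observations per group, unit noise

group_means = obs.mean(axis=1)  # no pooling: each group estimated on its own
grand_mean = obs.mean()         # complete pooling: one shared mean for everyone

# Partial pooling: shrink each group estimate toward the grand mean.
# The weight trades off within-group noise against between-group spread.
noise_var = 1.0 / obs.shape[1]  # variance of a group mean (unit noise, n=4)
between_var = group_means.var()
weight = between_var / (between_var + noise_var)
shrunk = grand_mean + weight * (group_means - grand_mean)

# Shrunk estimates always sit between the two extremes, group by group.
print(np.all((shrunk - grand_mean) ** 2 <= (group_means - grand_mean) ** 2))
```

Deep nets stack many such levels of shared structure, which is why the hardware requirements converge.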


NVidia should also market this to people who want to do molecular dynamics and other GPU-enabled physics sims locally.


I have one of these for electromagnetics sim:

http://www.microway.com/product/whisperstation-tesla/


Why not get a server tower case and motherboard while you're at it? Supermicro has some good ones.


A machine like this, you're buying the support, with the hardware as an add-on. This isn't made for the deployments, it's made for the development and debugging, where being able to call nvidia at any hour and get a decent engineer is worth it.



