I'm not sure I understand what you're getting at. If Python's performance in the ML space is not sufficient, then the community would have quickly moved on from it and built something better.
And that something better is certainly not Julia and it's definitely not Swift.
> Yes, I know the heavy lifting is done by C/C++/Rust/Cuda/Blas/Numba and so on, but, when you run simulations for millions of steps, you end up with billions of python function calls.
For ML purposes, if your model isn't running on the GPU, it doesn't matter if you're using Swift, Rust, or whatever, your stuff is gonna be slow. Like it or not, Python is one of the best glue languages out there, and the reason why libraries like TensorFlow and Torch are not used in their native language (C++) [0] is that they're significantly simpler to use from Python, and the performance overhead is usually one function call (e.g. run_inference(...)), not billions.
If you find yourself writing a simulation, and you need to optimize away billions of function calls, you can use the C++ API provided by TensorFlow.
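To make the "overhead is one function call" point concrete, here's a minimal sketch of the glue pattern, with NumPy standing in for a framework's native runtime; `run_inference` and the shapes are made up for illustration, not any real API:

```python
import numpy as np

def run_inference(batch):
    # One Python-level call; the matmul below runs in optimized C/BLAS,
    # so interpreter overhead is paid once per batch, not once per element.
    weights = np.random.default_rng(0).standard_normal((batch.shape[1], 8))
    return batch @ weights

batch = np.ones((1024, 64))
out = run_inference(batch)  # a single Python call for 1024 inputs
print(out.shape)  # (1024, 8)
```

The per-call cost is amortized over the whole batch, which is why glue-language overhead rarely shows up in GPU-bound workloads.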
>If you find yourself writing a simulation, and you need to optimize away billions of function calls, you can use the C++ API provided by TensorFlow.
I prefer to just use Julia and get that for free. You can also write custom CUDA kernels in pure Julia, differentiate through arbitrary Julia code, compose libraries that know nothing about each other, etc.
Sure, that's fine. But I'm assuming GGP is talking about TensorFlow, since that's what this post is about. If you don't need TensorFlow, then this whole conversation is kinda moot, and you can use whatever you like.
So nobody needs TensorFlow, yet it and PyTorch are pretty much the only frameworks widely adopted for ML, and they're used from Python. A quick Google search will tell you that; I don't really feel like rebutting a baseless statement.
> If Python's performance in the ML space is not sufficient, then the community would have quickly moved on from it and built something better.
Python's performance is sufficient when the bottleneck is actually the computation done on the accelerator. In my flavour of ML, we use small models, think a 3-layer NN, 64 neurons wide, and in some cases a small CNN. During training, most of the models reportedly use <150MB.
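For scale, a quick back-of-the-envelope parameter count for a network like that (the input/output sizes below are invented, since the comment doesn't give them) shows why such models can't keep an accelerator busy:

```python
def mlp_params(sizes):
    # weights (in * out) plus biases (out) for each consecutive layer pair
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))

# e.g. 32 inputs -> 64 -> 64 -> 4 outputs (sizes are illustrative)
n = mlp_params([32, 64, 64, 4])
print(n, "params,", n * 4, "bytes as float32")  # 6532 params, 26128 bytes as float32
```

A model measured in kilobytes finishes each forward/backward pass almost instantly, so the Python-side loop around it dominates the wall clock.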
Most of the community finds Python sufficient because they do not need to interleave training and simulating.
> For ML purposes, if your model isn't running on the GPU, it doesn't matter if you're using Swift, Rust, or whatever, your stuff is gonna be slow. Like it or not, Python is one of the best glue languages out there, and the reason why libraries like TensorFlow and Torch are not used in their native languages (C++) [0] is because they're significantly simpler to use in Python and the performance overhead is usually one function call (e.g run_inference(...)) and not billions.
You don't know what I am running or doing, so your comment comes off as ignorant.
This is the setup:
Run X simulated steps on the CPU, collect X samples, store them in a buffer of size either X or Z >> X, train for Y steps sampling from the buffer, copy the model back to the CPU, and repeat for a total of 1M steps; then repeat the whole thing 10+ times to get a good average of the performance.
All of that without any hyperparameter tuning.
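The loop described above, sketched in plain Python; `simulate_one_step` and `train_on` are placeholder stubs, not any framework's API, and the step counts are shrunk so the sketch runs quickly:

```python
import random

def simulate_one_step():
    # Stand-in for one environment step producing a transition/sample.
    return random.random()

def train_on(batch):
    # Stand-in for one gradient update on a sampled minibatch.
    return sum(batch) / len(batch)

def run_experiment(total_steps=1_000, X=100, Y=5, Z=500):
    buffer = []
    steps = 0
    while steps < total_steps:
        for _ in range(X):              # X simulated steps on the CPU
            buffer.append(simulate_one_step())
        buffer = buffer[-Z:]            # bounded buffer of size Z >> X
        for _ in range(Y):              # Y training steps from the buffer
            train_on(random.sample(buffer, min(32, len(buffer))))
        steps += X                      # (copy model back to CPU, repeat)
    return steps

print(run_experiment())  # 1000
```

Every iteration of the inner simulation loop is a Python-level call, which is where the "billions of function calls" over 1M steps comes from.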
Now, you also need to note that a non-trivial amount of work is also done in Python to augment the inputs as necessary. If the work is done in NumPy, there's usually little overhead, but it is often the case that the environment I am simulating is wrapped in wrapper functions that modify its behavior; e.g. I may need to remember the last 4 observations the environment produced, and so on. All these modifications quickly accumulate, and for a single experiment of 1M steps you end up with billions of Python calls. The community uses more or less the same package/framework as the intermediary between simulations and models.
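A toy illustration of the kind of wrapper meant here (hypothetical class names, not any particular library): a "remember the last 4 observations" wrapper adds at least one extra Python call per step, and wrappers stack, so the per-step call count multiplies:

```python
from collections import deque

class ToyEnv:
    def step(self):
        return 1.0  # stand-in for a real observation

class LastFourWrapper:
    """Keeps the last 4 observations; one extra Python call per step."""
    def __init__(self, env):
        self.env = env
        self.frames = deque(maxlen=4)

    def step(self):
        obs = self.env.step()    # every step passes through this layer
        self.frames.append(obs)
        return list(self.frames)

env = LastFourWrapper(ToyEnv())  # add more wrappers and each env.step()
for _ in range(6):               # becomes a chain of Python calls
    stacked = env.step()
print(len(stacked))  # 4
```

With three or four such wrappers, a single simulated step already costs several interpreter-level calls before the model ever runs.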
The issue is so prominent that the community as a whole moved from running simulations on a single core to having multiple parallel actors collect transitions/data points, which also requires new theory. Furthermore, there have been many proposed architectures for distributed and asynchronous training, because the bottleneck is not the GPU or the model but rather how fast you can collect transitions. In fact, there was a distributed architecture by Google that literally sends the transitions over the network to a few GPUs; the network cost is amortized because you get to run hundreds of simulations concurrently.
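The parallel-actor pattern can be sketched like this; threads are used only to keep the example self-contained (real systems use separate processes or machines, not least because of the GIL), and `actor` is a made-up stand-in for a simulation worker:

```python
from concurrent.futures import ThreadPoolExecutor

def actor(actor_id, n_steps):
    # Each actor runs its own copy of the simulation and returns
    # the list of transitions it collected.
    return [(actor_id, step) for step in range(n_steps)]

# 4 actors, each collecting 250 transitions concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = pool.map(actor, range(4), [250] * 4)

transitions = [t for chunk in results for t in chunk]
print(len(transitions))  # 1000
```

The learner then consumes the merged transition stream, so collection throughput scales with the number of actors rather than with single-core Python speed.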
IIRC, a maintainer of a popular project/framework that I contribute to saw improvements upwards of 2x when using C++ over Python.
> Most of the community finds python sufficient because they do not need to interleave training and simulating.
That's kind of my point. Most of the community has models running on GPUs and doesn't care too much about CPU-bound workloads, training or otherwise.
If you do care about that, you are in a relatively small niche of machine learning, statistically speaking. I am not denying its existence; I'm just saying that your stack will have to be different if you want to extract the maximum level of performance out of your CPU.
> IIRC, a maintainer of a popular project/framework that I contribute saw improvements upwards of 2x when using C++ over python.
That's not surprising at all. Like I mentioned in the parent, if you profiled your code, found that Python function calls are the main bottleneck, and believe it's worth investing time in getting rid of them, you can use the C++ APIs of Caffe/TF/PyTorch/whatever.
I personally don't work in simulations, so I haven't run into your problem. In the deep learning world, the CPU is unusable for any task (training, inference, evaluation, etc.), so I've never been concerned with things like function call overhead.
I work with bog-standard deep learning and this does come up, albeit not in the research stage that most people are familiar with. The closer you get to deployment, the less adequate Python becomes and the more you struggle with artificial limitations like the GIL. https://news.ycombinator.com/item?id=20301619 had a good discussion on whether we've collectively "overfit" on this slow glue + fast matrix accelerator model.
[0] https://www.tensorflow.org/api_docs/cc