I'm not sure I understand what you're getting at. If Python's performance in the ML space is not sufficient, then the community would have quickly moved on from it and built something better.
And that something better is certainly not Julia and it's definitely not Swift.
> Yes, I know the heavy lifting is done by C/C++/Rust/Cuda/Blas/Numba and so on, but, when you run simulations for millions of steps, you end up with billions of python function calls.
For ML purposes, if your model isn't running on the GPU, it doesn't matter if you're using Swift, Rust, or whatever, your stuff is gonna be slow. Like it or not, Python is one of the best glue languages out there, and the reason why libraries like TensorFlow and Torch are not used in their native language (C++) [0] is that they're significantly simpler to use from Python, and the performance overhead is usually one function call (e.g. run_inference(...)), not billions.
If you find yourself writing a simulation, and you need to optimize away billions of function calls, you can use the C++ API provided by TensorFlow.
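To make the "overhead is one function call" point concrete, here's a minimal sketch of the glue pattern, with NumPy standing in for a framework's native runtime; `run_inference` and the shapes are made up for illustration, not any real API:

```python
import numpy as np

def run_inference(batch):
    # One Python-level call; the matmul below runs in optimized C/BLAS,
    # so interpreter overhead is paid once per batch, not once per element.
    weights = np.random.default_rng(0).standard_normal((batch.shape[1], 8))
    return batch @ weights

batch = np.ones((1024, 64))
out = run_inference(batch)  # a single Python call for 1024 inputs
print(out.shape)  # (1024, 8)
```

The per-call cost is amortized over the whole batch, which is why glue-language overhead rarely shows up in GPU-bound workloads.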
>If you find yourself writing a simulation, and you need to optimize away billions of function calls, you can use the C++ API provided by TensorFlow.
I prefer to just use Julia and get that for free. You can also write custom CUDA kernels in pure Julia, differentiate through arbitrary Julia code, compose libraries that know nothing about each other, etc.
Sure, that's fine. But I'm assuming GGP is talking about TensorFlow, since that's what this post is about. If you don't need TensorFlow, then this whole conversation is kinda moot, and you can use whatever you like.
So nobody needs TensorFlow, yet it and PyTorch are pretty much the only frameworks widely adopted for ML, and they're used from Python. A quick Google search will tell you that; I don't really feel like rebutting a baseless statement.
> If Python's performance in the ML space is not sufficient, then the community would have quickly moved on from it and built something better.
Python's performance is sufficient when the bottleneck is actually the computation done on the accelerator. In my flavour of ML, we use small models, think a 3-layer NN, 64 neurons wide, and in some cases a small CNN. During training, most of the models reportedly use <150MB.
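For scale, a quick back-of-the-envelope parameter count for a network like that (the input/output sizes below are invented, since the comment doesn't give them) shows why such models can't keep an accelerator busy:

```python
def mlp_params(sizes):
    # weights (in * out) plus biases (out) for each consecutive layer pair
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))

# e.g. 32 inputs -> 64 -> 64 -> 4 outputs (sizes are illustrative)
n = mlp_params([32, 64, 64, 4])
print(n, "params,", n * 4, "bytes as float32")  # 6532 params, 26128 bytes as float32
```

A model measured in kilobytes finishes each forward/backward pass almost instantly, so the Python-side loop around it dominates the wall clock.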
Most of the community finds Python sufficient because they do not need to interleave training and simulating.
> For ML purposes, if your model isn't running on the GPU, it doesn't matter if you're using Swift, Rust, or whatever, your stuff is gonna be slow. Like it or not, Python is one of the best glue languages out there, and the reason why libraries like TensorFlow and Torch are not used in their native languages (C++) [0] is because they're significantly simpler to use in Python and the performance overhead is usually one function call (e.g run_inference(...)) and not billions.
You don't know what I am running or doing, so your comment comes off as ignorant.
This is the setup:
Run X simulated steps on the CPU, collect X samples, store them in a buffer of size either X or Z >> X, train for Y steps sampling from the buffer, copy the model back to the CPU, and repeat for a total of 1M steps; then repeat the whole thing 10+ times to get a good average of the performance.
All of that without any hyperparameter tuning.
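The loop described above, sketched in plain Python; `simulate_one_step` and `train_on` are placeholder stubs, not any framework's API, and the step counts are shrunk so the sketch runs quickly:

```python
import random

def simulate_one_step():
    # Stand-in for one environment step producing a transition/sample.
    return random.random()

def train_on(batch):
    # Stand-in for one gradient update on a sampled minibatch.
    return sum(batch) / len(batch)

def run_experiment(total_steps=1_000, X=100, Y=5, Z=500):
    buffer = []
    steps = 0
    while steps < total_steps:
        for _ in range(X):              # X simulated steps on the CPU
            buffer.append(simulate_one_step())
        buffer = buffer[-Z:]            # bounded buffer of size Z >> X
        for _ in range(Y):              # Y training steps from the buffer
            train_on(random.sample(buffer, min(32, len(buffer))))
        steps += X                      # (copy model back to CPU, repeat)
    return steps

print(run_experiment())  # 1000
```

Every iteration of the inner simulation loop is a Python-level call, which is where the "billions of function calls" over 1M steps comes from.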
Now, you also need to note that a non-trivial amount of work is also done in Python to augment the inputs as necessary. If the work is done in NumPy, there's usually little overhead, but it is often the case that the environment I am simulating is wrapped in wrapper functions that modify its behavior; e.g. I may need to remember the last 4 observations the environment produced, and so on. All these modifications quickly accumulate, and for a single experiment of 1M steps you end up with billions of Python calls. The community uses more or less the same package/framework as the intermediary between simulations and models.
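A toy illustration of the kind of wrapper meant here (hypothetical class names, not any particular library): a "remember the last 4 observations" wrapper adds at least one extra Python call per step, and wrappers stack, so the per-step call count multiplies:

```python
from collections import deque

class ToyEnv:
    def step(self):
        return 1.0  # stand-in for a real observation

class LastFourWrapper:
    """Keeps the last 4 observations; one extra Python call per step."""
    def __init__(self, env):
        self.env = env
        self.frames = deque(maxlen=4)

    def step(self):
        obs = self.env.step()    # every step passes through this layer
        self.frames.append(obs)
        return list(self.frames)

env = LastFourWrapper(ToyEnv())  # add more wrappers and each env.step()
for _ in range(6):               # becomes a chain of Python calls
    stacked = env.step()
print(len(stacked))  # 4
```

With three or four such wrappers, a single simulated step already costs several interpreter-level calls before the model ever runs.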
The issue is so prominent that the community as a whole moved from running simulations on a single core to having multiple parallel actors collect transitions/data points, which also requires new theory. Furthermore, there have been many proposed architectures for distributed and asynchronous training, because the bottleneck is not the GPU or the model but rather how fast you can collect transitions. In fact, there was a distributed architecture by Google that literally sends the transitions over the network to a few GPUs; the network cost is amortized because you get to run hundreds of simulations concurrently.
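The parallel-actor pattern can be sketched like this; threads are used only to keep the example self-contained (real systems use separate processes or machines, not least because of the GIL), and `actor` is a made-up stand-in for a simulation worker:

```python
from concurrent.futures import ThreadPoolExecutor

def actor(actor_id, n_steps):
    # Each actor runs its own copy of the simulation and returns
    # the list of transitions it collected.
    return [(actor_id, step) for step in range(n_steps)]

# 4 actors, each collecting 250 transitions concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = pool.map(actor, range(4), [250] * 4)

transitions = [t for chunk in results for t in chunk]
print(len(transitions))  # 1000
```

The learner then consumes the merged transition stream, so collection throughput scales with the number of actors rather than with single-core Python speed.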
IIRC, a maintainer of a popular project/framework that I contribute to saw improvements upwards of 2x when using C++ over Python.
> Most of the community finds python sufficient because they do not need to interleave training and simulating.
That's kind of my point. Most of the community has models running on GPUs and doesn't care too much about CPU-bound workloads, training or otherwise.
If you do care about that, you are in a relatively small niche of machine learning, statistically speaking. I am not denying its existence; I'm just saying that your stack will have to be different if you want to extract the maximum level of performance out of your CPU.
> IIRC, a maintainer of a popular project/framework that I contribute saw improvements upwards of 2x when using C++ over python.
That's not surprising at all. Like I mentioned in the parent, if you profiled your code, found that Python function calls are the main bottleneck, and believe it's worth investing time in getting rid of them, you can use the C++ APIs of Caffe/TF/PyTorch/whatever.
I personally don't work in simulations, so I haven't run into your problem. In the deep learning world, the CPU is unusable for any task (training, inference, evaluation, etc.), so I've never been concerned with things like function call overhead.
I work with bog-standard deep learning and this does come up, albeit not in the research stage that most people are familiar with. The closer you get to deployment, the less adequate Python becomes and the more you struggle with artificial limitations like the GIL. https://news.ycombinator.com/item?id=20301619 had a good discussion on whether we've collectively "overfit" on this slow glue + fast matrix accelerator model.
[0] https://www.tensorflow.org/api_docs/cc