As a web developer, I've often heard neckbeards bickering about Python's performance, but until recently I didn't have a real point of reference for how bad it can be.
I've started working on a side project that processes geo data in AppEngine. My dataset includes many long lists of numbers (lats, longs, altitudes, timestamps, etc.). A 700-route dataset is about 25MB in a SQLite database, but trying to access any significant portion of it quickly maxes out the 4GB of RAM available on either of my dev machines (which is more than I could reasonably expect to be provisioned in the cloud). I mentioned this as a potential bug to the relevant Googler at I/O this year and he basically said "that's not us, that's Python."
It's mindboggling how quickly you can burn through your RAM in CPython. Hopefully you can prove something that will eventually make its way back into CPython and lift everyone's boats. Unfortunately, even if Falcon helped on my dev machine, I can't imagine it being taken up on cloud platforms like AppEngine.
If you don't have a lot of experience with Python, you may not know how some of the "magic" parts actually work inside. Some operations, however, allocate extra copies of data that you don't need, and if you do that a lot, you will chew through memory fast. There are often two or more ways of accomplishing the same thing, with one way creating duplicates of the data and the other not. Both have valid applications, and if you're only dealing with small amounts of data the difference doesn't really matter.
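A minimal sketch of the "two ways of doing the same thing" point: `sorted()` and slicing both hand back brand-new lists, while `list.sort()` works in place. With a few numbers it's invisible; with millions of records each copy doubles your footprint.

```python
nums = list(range(100_000))

# sorted() returns a brand-new list: a second copy of all 100k references.
dup = sorted(nums)

# list.sort() reorders the same list in place; no duplicate is created.
nums.sort()

# A slice also allocates a fresh list holding its own references:
head = nums[:1000]
```

Neither form is "wrong"; you just have to know which one you're reaching for when the data is large.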
Where a lot of people who are new to Python run into problems is that the language is deceptively easy. They try writing Python code that's simply a direct analogue of how they would write Java or C#. The resulting code will run, but it will often be slow and a lot more verbose than necessary. Very often the way you would do something in Java or C# is the worst possible way to do it in Python. Conversely, the best way to do it in Python often has no direct analogue in Java or C#. With Python, the learning curve is shallow, but it's very long, and there's lots to learn if you want to reap all the benefits.
Without knowing what your data or algorithms are, it's pretty difficult to give any sensible detailed advice. However, if you are dealing with long "lists" of numbers, perhaps what you really want is long "arrays" of numbers. Lists and arrays are not the same thing in Python.
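To make the list-vs-array distinction concrete, here's a rough comparison (assuming CPython; exact byte counts vary by version). A list of n ints holds n pointers to full Python int objects, while the stdlib `array` module packs the values themselves into one contiguous buffer:

```python
import array
import sys

n = 100_000
as_list = list(range(n))
as_array = array.array('d', range(n))  # packed C doubles, 8 bytes each

# sys.getsizeof on a list counts only the pointer table, not the int
# objects it points to (~28 bytes apiece), so add those in for the
# real footprint.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(i) for i in as_list)

# The array stores its values inline; the container is the whole cost.
array_bytes = sys.getsizeof(as_array)
```

For numeric data like lats/longs, the array is several times smaller before you've done anything clever at all.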
It's easy to burn through RAM in Python because it's easy to keep unnecessary data around. Iterating through a large dataset is better than storing it all (and all the subsequent 'filtered' data) in memory at once.
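One way to act on that: use generators so only one record is alive at a time, instead of materializing the full dataset plus every filtered intermediate. A sketch, where `read_points` stands in for whatever actually produces your records:

```python
def read_points():
    # Stand-in for a real data source (DB cursor, file, etc.);
    # yields one record at a time instead of building a list.
    for i in range(1_000_000):
        yield (i, i * 0.5)

# Eager version: every intermediate stage sits in memory at once.
#   points = list(read_points())
#   filtered = [p for p in points if p[1] > 100]

# Lazy version: the filter is applied as records stream past.
total = sum(1 for p in read_points() if p[1] > 100)
```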
I haven't. I don't have any complex math in mind (yet), just some simple transformations. The problem is that even something as simple as checking a list for potential duplicates becomes really RAM intensive for sufficiently large lists. (I'm not even doing deep equality, just comparing metadata.)
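For the metadata-comparison case specifically, duplicate detection doesn't have to hold the records themselves in memory: keep a set of the (small) metadata keys and stream everything else. A hypothetical sketch, with the record shape and key function made up for illustration:

```python
def find_duplicates(records, key):
    """Yield records whose metadata key has been seen before.

    Only the set of keys is kept in memory, not the records, so RAM
    use scales with the number of distinct keys rather than with the
    size of the full dataset.
    """
    seen = set()
    for rec in records:
        k = key(rec)
        if k in seen:
            yield rec
        else:
            seen.add(k)

# Hypothetical route records; only name + start point are compared.
routes = [
    {"name": "a", "start": (1, 2), "points": [...]},
    {"name": "b", "start": (3, 4), "points": [...]},
    {"name": "a", "start": (1, 2), "points": [...]},
]
dups = list(find_duplicates(routes, key=lambda r: (r["name"], r["start"])))
```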
I still have plenty more work to do on the project. I think I'll end up fanning out each list iteration into a series of smaller chunks to keep me from blowing through all the RAM on any one request.
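The chunked-iteration idea can be sketched with `itertools.islice`, which pulls a fixed number of items off an iterator per pass (the helper name here is my own, not from any particular library):

```python
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`.

    Only one chunk is in memory at a time, so peak RAM is bounded by
    `size` regardless of how long the underlying sequence is.
    """
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block

batches = list(chunks(range(10), 4))
```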
Numpy supports lots of array math, but another way to think of it is as an api for working directly with memory (and values stored as platform types instead of python objects).
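A quick illustration of that "raw memory" view of numpy: a million float64 samples live in one contiguous 8MB buffer rather than a million boxed Python objects, and arithmetic runs over the buffer in C:

```python
import numpy as np

# One contiguous 8 MB block: 1,000,000 values x 8 bytes per float64.
lats = np.zeros(1_000_000, dtype=np.float64)

# Vectorized arithmetic touches the raw buffer directly, without
# creating a Python object per element.
lats += 0.5
```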
I often find this problem with python, although usually it is the parsing code; the code that loads all the data up, that actually uses the high-water-mark of RAM.
For example, I had a 100MB JSON file that I tried to use the stdlib json library to load. It quickly used >8GB (my machine's RAM) and started paging, dragging everything to a halt. This is partly because the stdlib JSON parser is written in python.
Now, if you switch to a small, clever implementation called cjson[1], it can load the whole thing without going past 300-400MB of RAM, and the high-water mark at the end is just the data itself. Much better!
So, in summary, make sure it's the important part of your code that's using all the RAM, and not some "hello world"-quality stdlib code that's killing you. If it is, and there isn't a cjson for the job, I've found wrapping C/C++ libraries with Cython[2] a simple way to solve the problem without too much hassle (generally only a couple of days' work at a time if you're tight and only wrap the functions you actually need to use yourself).
[1] https://pypi.python.org/pypi/python-cjson - although there's a 1.5.1 out there somewhere with a fix for a bug that loses precision on floats, and that's the version I use personally. It's so hard to find that I keep a copy of the source in my Dropbox for when I need it!
[2] http://cython.org/ - although of course actually using Cython means you can't take advantage of PyPy, IronPython, and other "faster" implementations, because you're tied to the CPython C interface forever.