Tips for saving memory with Pandas

marcobonzanini.com

30 points by bigsassy 4 years ago · 2 comments

MrPowers 4 years ago

Here are the big tips I think the article missed:

Use the new string dtype, which requires way less memory; see this video: https://youtu.be/_zoPmQ6J1aE. `object` columns are really memory hungry, and this new type is a game changer.
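A minimal sketch of the comparison, assuming a recent pandas (>= 1.3) with pyarrow installed; the Arrow-backed `string[pyarrow]` dtype is where the big savings show up, and the data here is made up for illustration:

```python
import pandas as pd

# Made-up data for illustration: a column of repeated short strings.
words = ["alpha", "beta", "gamma", "delta"] * 250_000
s_object = pd.Series(words)                   # default object dtype
s_arrow = s_object.astype("string[pyarrow]")  # Arrow-backed string dtype

print(f"object dtype:          {s_object.memory_usage(deep=True) / 1e6:.1f} MB")
print(f"string[pyarrow] dtype: {s_arrow.memory_usage(deep=True) / 1e6:.1f} MB")
```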

Use Parquet and leverage column pruning. `usecols` with `read_csv` doesn't give you column pruning; you need a columnar file format and the `columns` argument of `read_parquet`. You can never truly "skip" a column with a row-based file format like CSV. Spark's optimizer does column projections automagically; with Pandas you need to do them manually.
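A rough sketch of what that looks like (file name and columns are made up; assumes pyarrow or fastparquet is installed for Parquet support):

```python
import pandas as pd

# Write a wide DataFrame to Parquet once (made-up data)...
wide = pd.DataFrame({f"col{i}": range(1_000) for i in range(50)})
wide.to_parquet("wide.parquet")

# ...then read back only the columns you need. Because Parquet is columnar,
# the other 48 columns are never read off disk; with read_csv + usecols the
# whole file still has to be scanned.
slim = pd.read_parquet("wide.parquet", columns=["col0", "col1"])
print(slim.shape)  # (1000, 2)
```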

Use predicate pushdown filtering to limit the data that's read into the DataFrame. Here's a blog post I wrote on this: https://coiled.io/blog/parquet-column-pruning-predicate-push...
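A quick sketch of predicate pushdown in pandas (file and column names are made up for illustration):

```python
import pandas as pd

# The filters argument is handed to the pyarrow engine, which skips row
# groups that can't match the predicate, so filtered-out data is never
# loaded into memory.
df = pd.read_parquet(
    "events.parquet",
    engine="pyarrow",
    columns=["year", "user_id", "amount"],  # column pruning
    filters=[("year", ">=", 2020)],         # predicate pushdown
)
```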

Use a technology like Dask (each partition in a Dask DataFrame is a Pandas DataFrame) that doesn't require everything to be stored in memory and can run computations in a streaming manner.
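A small sketch of the Dask pattern (path and columns are made up; assumes dask and pyarrow are installed):

```python
import dask.dataframe as dd

# Each partition is an ordinary pandas DataFrame, and nothing is read
# until .compute() is called.
ddf = dd.read_parquet("events/*.parquet", columns=["user_id", "amount"])

# The aggregation runs partition by partition, so the full dataset never
# has to fit in memory at once; only the (small) result does.
result = ddf.groupby("user_id")["amount"].sum().compute()
print(result.head())
```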

mint2 4 years ago

It’s useful to point out pitfalls of lower precision too. I don’t usually see these articles go over that.

Operations on a few million rows of float32s can give strange results, for example when summing: df["colA"].sum() can be fairly different from df.sort_index()["colA"].sum(). It's a trap for the unwary.
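A small sketch of the effect (the data is made up, and the exact discrepancy depends on the values and on how the underlying sum is implemented, but float32 addition is not associative, so different row orders can give different totals):

```python
import numpy as np
import pandas as pd

# A few million float32 values clustered around 1000.
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=1000.0, scale=1.0, size=2_000_000), dtype="float32")

print(s.sum())                                   # one ordering
print(s.sample(frac=1.0, random_state=1).sum())  # shuffled ordering
print(s.astype("float64").sum())                 # higher-precision reference
```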
