Ask HN: Are there code standards for using Pandas DataFrames in production?

1 point by bradmerlin 6 years ago · 1 comment · 1 min read

I'm working on a Python project where some modelling has been implemented using Pandas. I'm helping to add an API over the modelling logic. When I see a function that accepts a dataframe (sometimes many dataframes), it isn't obvious what that function requires (e.g. which columns, maybe even their types) without reading through all of its code.

Accepting individual Series instead doesn't seem right either, because a function sometimes needs several columns whose rows are related.

Is there an accepted way to define this sort of function that lets the caller easily understand which columns (or even types) are required? Or am I missing something obvious and this isn't a real problem?

I can think of a few ways to do it (mostly thinking decorators) but it'd be awesome to hear what people are doing in the real world.
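To make the decorator idea concrete, here is a minimal sketch, not an established convention. The names `requires_columns` and `add_margin` and the columns `price` and `cost` are hypothetical. The check only assumes the argument exposes a `.columns` attribute, as a pandas DataFrame does, so nothing in the sketch imports pandas.

```python
import functools
import inspect


def requires_columns(**required):
    """Declare which columns each dataframe argument must contain.

    Hypothetical helper, not part of pandas. It relies only on the
    argument having a `.columns` attribute (true for pandas.DataFrame).
    """
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Map positional and keyword arguments to parameter names.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            for arg_name, columns in required.items():
                frame = bound.arguments[arg_name]
                missing = [c for c in columns if c not in frame.columns]
                if missing:
                    raise ValueError(
                        f"{func.__name__}: argument '{arg_name}' "
                        f"is missing columns {missing}"
                    )
            return func(*args, **kwargs)

        # Expose the declaration so a caller or API layer can read it
        # without opening the function body.
        wrapper.required_columns = required
        return wrapper
    return decorator


@requires_columns(sales=["price", "cost"])
def add_margin(sales):
    # Needs two columns whose rows are related, so it takes the whole frame.
    return sales["price"] - sales["cost"]
```

With this in place, the requirement is visible at the definition and through `add_margin.required_columns`, and a frame without `cost` fails with a `ValueError` before any modelling code runs. Checking dtypes as well would be a natural extension, but it is left out here.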

redff0000 6 years ago

Sounds like you're mostly interested in provenance and lineage. Pandas doesn't really help with that and I haven't seen any serious efforts to build on top of pandas for it.

If you want to roll your own code, you could use decorators or override pandas methods such as joins to build a provenance graph.
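As a rough illustration of the decorator route, the sketch below records one edge per named input each time a transform runs. Everything here is hypothetical (`traced`, `upstream`, the `PROVENANCE` list, and the dataset names), and it tracks names that the author declares by hand rather than inspecting the dataframes themselves.

```python
import functools

# Edges of a tiny provenance graph: (input name, transform name, output name).
PROVENANCE = []


def traced(output, inputs):
    """Record which named inputs a transform consumed to produce `output`.

    Hypothetical helper: the names are declared by the author, not inferred.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Only record the edges once the transform has succeeded.
            for name in inputs:
                PROVENANCE.append((name, func.__name__, output))
            return result
        return wrapper
    return decorator


def upstream(name):
    """Return every dataset name that `name` was derived from, directly or not."""
    found = set()
    stack = [name]
    while stack:
        current = stack.pop()
        for source, _transform, target in PROVENANCE:
            if target == current and source not in found:
                found.add(source)
                stack.append(source)
    return found


@traced(output="clean_sales", inputs=["raw_sales"])
def clean(raw_sales):
    return raw_sales  # placeholder for real cleaning logic


@traced(output="report", inputs=["clean_sales", "prices"])
def build_report(clean_sales, prices):
    return clean_sales, prices  # placeholder for a real join
```

After `clean(...)` and `build_report(...)` have both run, `upstream("report")` walks the recorded edges back to `raw_sales`, which is the lineage question in its simplest form.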

If you don't want to roll code, you can treat the transform as a black box and define your inputs and outputs in terms of database/store tables.
