Ask HN: Too much code cleaning, not enough results (data science)

1 point by throwawaystress 6 years ago · 3 comments · 1 min read


How important is it for data scientists to have clean, modular, reusable code? Here’s my problem: while working on a project, I’ll start off in Jupyter notebooks, toying around with the data, doing some EDA, etc. Eventually I’ll pull out some of that code into functions in a Python file, and call those functions from the notebook. Neat.

The problem is, as I get more and more functions, I want to organize them more, make them more generalizable and consistent, etc. I’ll also get carried away with organizing files and source control, cleaning up my notes, and making documentation to explain what models/data/source files/results exist, what they mean, etc.

And then I realize I’ve been spending less and less time getting results, and more on this “overhead”. I struggle to balance the desire to rush ahead and get results with the compulsion to make the code “beautiful” and to have the project in the cleanest possible state. I’ve seen plenty of other projects with terrible organization, no documentation, and confusing, poorly formatted code. But if I’m not producing value, my neatness doesn’t matter.

All in all, I’m feeling pretty unproductive because of these habits. Any advice?

lordkrandel 6 years ago

It depends, so I'm asking you some questions to give you ideas.

How much of this code is going to be read, reused, modified, studied by you or other people?

Is it opensource or foundational?

Is it of any interest for the general public?

Could you spend that time on something else that is more productive?

Does this refactor teach you a new technique?

Can you find or develop an auto-formatter that makes messy code just neat and clean?

If you are building models for a process or phenomenon, can the results be the subject of an article, maybe in the future, to show your techniques and ask for feedback? Notebooks are just great for that.

itqwertz 6 years ago

A good rule to follow is to get it done dirty, add some tests, then refactor. Real-world code is not always pretty or academic quality.

Automation is also a good way to get rid of monotonous tasks and boilerplate.
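
As a rough sketch of what "get it done dirty, add some tests" could look like for code pulled out of a notebook: the helper `clean_ages`, its column name, and the sample rows below are all made up for illustration, and the 0 to 120 range is an arbitrary assumption. The point is only that one small assertion pins down the current behavior before any refactoring starts.

```python
# Hypothetical helper lifted from a notebook; names and data are invented.
def clean_ages(rows):
    """Drop rows with a missing or implausible age; cast the rest to int."""
    cleaned = []
    for row in rows:
        raw = row.get("age")
        if raw in (None, ""):
            continue  # missing value
        age = int(float(raw))
        if 0 <= age <= 120:  # assumed plausibility range
            cleaned.append({**row, "age": age})
    return cleaned


def test_clean_ages():
    rows = [
        {"id": 1, "age": "34"},
        {"id": 2, "age": ""},     # missing
        {"id": 3, "age": "250"},  # implausible
        {"id": 4, "age": 41.0},
    ]
    assert clean_ages(rows) == [{"id": 1, "age": 34}, {"id": 4, "age": 41}]


test_clean_ages()
```

With a check like this in place, the function body can be rewritten later (say, with pandas) and the test says whether the results still match.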

  • throwawaystressOP 6 years ago

    Does that work with data science work, though? Along the way you build many models and many kinds of ad hoc analyses that can build up. I’ve yet to see someone write tests. For the most part, I’ve only seen people write big long scripts that they call, setting some global constants at the top. I’m aspiring to be better than that, but it seems counter to the goal of getting results quickly.
