In defense of Cheap Data Science

Back in 2014, soon after I joined the World Bank, I went to Senegal with “Iamthecode.” One of the goals was to run a Data Science workshop and training. As excited as I was to teach the latest and greatest of remote sensing, data munging, satellite images, … I failed to notice a pretty basic linchpin. It’s embarrassing even to remember it: my fancy maxed-out MacBook Pro laptop was worth three years of their full average income. Half the people didn’t have a laptop (only a desktop at their job or university), and those who did had machines that were several years old.

Months later our team managed to procure three gaming laptops (not an easy conversation with the IT department). They are just as unaffordable for most people in the countries we worked with, but gaming laptops offer the best balance of performance, cost and weight. We used them for things like processing drone maps in the field in Kosovo or rural Argentina (cloud processing is unsuitable at those connectivity speeds), but we also used them to train neural networks and do other Data Science work.

Processing drone cadastral maps in Kosovo, pulling in every available machine with a CPU or GPU.

Data Science has disproportionately large benefits for development and in developing countries. Analytics are usually rudimentary to start with, and they run on a substantially smaller quantity and lower quality of available data. On top of that, the urgency of addressing the digital divide is compounded by the missing digital dividends. “Digital Dividends” is, in fact, the title of the World Bank report that came out in 2016 (I do remember discussing the topic of this post with the authors). If Data Science has a role in development, one of our Innovation Labs’ goals was to prove it, with development data, for development outcomes, on typical development infrastructure.

Many of the tools for Data Science are both free and usable on modest laptops. Linux, Python, Git, QGIS, Perl, Bash, … there is a ton one can do on virtually any computer. However, I see more and more expensive tools proliferating (expensive in license cost, but also in minimum hardware requirements or in assuming fast connectivity). Tools like Hadoop, TensorFlow, Amazon Redshift, ArcGIS, ….

Cheap Data Science is, then, the design principle of gracefully degrading to a modest computer and poor connectivity. It’s ok if it takes more time, or if you need to batch your Internet needs. The important thing is that you don’t leave out those without that MacBook Pro or the Internet connection to spin up a powerful AWS EC2 instance.
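To make “gracefully degrading to a modest computer” a bit more concrete, here is a minimal sketch of one way it can look in Python: computing summary statistics over a CSV one row at a time, so memory stays flat however big the file is. It is only an illustration under my own assumptions. The function name `streaming_stats` and the column `area_m2` are hypothetical, and I use Welford's online algorithm as one reasonable choice, not as the only way to do it.

```python
import csv
import io
import math


def streaming_stats(rows, column):
    """Return (count, mean, std) of a numeric column, one row at a time.

    Uses Welford's online algorithm, so memory use stays constant
    regardless of how many rows the input has. `std` is the
    population standard deviation.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for row in rows:
        try:
            value = float(row[column])
        except (KeyError, ValueError, TypeError):
            continue  # skip missing or malformed values instead of failing
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
    if count == 0:
        return 0, float("nan"), float("nan")
    return count, mean, math.sqrt(m2 / count)


# Hypothetical sample data; in practice `rows` would be
# csv.DictReader(open("some_large_file.csv", newline="")).
sample = io.StringIO("plot_id,area_m2\n1,100\n2,\n3,300\n")
count, mean, std = streaming_stats(csv.DictReader(sample), "area_m2")
print(count, mean, std)
```

It is slower than loading everything into memory on a big machine, but it finishes on a small one, which is the trade-off described above.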

When I left the Bank in January of this year, one of the reasons was to learn about development from “the other side” as much as possible. I got rid of my MacBook, and I bought the cheapest laptop I could travel with and still do most of my work on. A not-so-modest $300 ThinkPad. Still, roughly a tenth of the price of the MacBook. I also bought a $40 Raspberry Pi, to see how much I could squeeze out of it.

$300, $3,000, $100 and $40 computers. How much Data Science could you do with each?

Since then, I have done all my personal and professional work on the “cheap laptop”, including coding consultancies, moving to Bhutan to train a team and build a logistics system, and working with huge satellite images at Satellogic. It is possible to deliver amazing professional-grade Data Science products with very little hardware and connectivity. It also forced me to resort to useful tricks that reminded me how things were back when the Internet was slow and expensive. Offline installs from USB drives, setting up intranet and torrent services. Pushing Git to an intranet machine instead of to GitHub …

It is also, I would contend, increasingly hard to do Cheap Data Science. Some anecdotal evidence. Back in Bhutan, most of the online Data Science courses I wanted to use with the team relied on fancy multimedia videos and on sockets for remote in-browser code execution. Not a chance in our setting, where bandwidth is limited and jammed during office hours. I had to resort to good old long PDF manuals (usually written in LaTeX). The great people at fast.ai were especially helpful and keen to help me troubleshoot this. Still, it feels to me that much of the latest data science progress uses frameworks that are simply unable to run on slow computers. In fact, many general tools like nvm, npm and Docker don’t have easily packageable commands for offline distribution. If you have 4 people installing the same software, you’ll need to download each little sub-dependency for each laptop from the remote source.
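The workaround for the “4 people, 4 downloads” problem is to download once and share locally. A minimal sketch of that idea in Python follows, assuming a cache folder that every laptop can reach (a shared drive or an intranet machine). Everything named here is hypothetical: `fetch_once`, the example URL, and the fake downloader that stands in for a real one (which could wrap something like `urllib.request.urlopen`). It is not a replacement for a proper package mirror, just the principle.

```python
import hashlib
import tempfile
from pathlib import Path


def fetch_once(url, cache_dir, download):
    """Return the local path for `url`, downloading only on a cache miss.

    `download` is any callable that takes a URL and returns bytes.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length file name.
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not target.exists():
        partial = target.with_suffix(".part")
        partial.write_bytes(download(url))
        partial.replace(target)  # rename only after the download completed
    return target


# Demo with a fake downloader (an assumption, so the sketch runs offline).
calls = []


def fake_download(url):
    calls.append(url)
    return b"package bytes"


with tempfile.TemporaryDirectory() as shared:
    for _laptop in range(4):  # four people asking for the same file
        path = fetch_once("https://example.org/pkg.tar.gz", shared, fake_download)
        data = path.read_bytes()

print(len(calls))  # number of times the remote source was actually hit
```

Writing to a `.part` file first means an interrupted download on a flaky connection never gets mistaken for a finished one.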

It worries me that the advance of Data Science is creating an unnecessarily high entry barrier. That the usual computer setup becomes the minimal setup. That we are leaving out possibly the segment of potential Data Scientists for whom life could change the most, and whose work would change the lives of the people who need it the most.

Ironically, I write to you from a MacBook I’ve just bought today. As much as it pains me morally, I must also recognize that the cheap set-up probably cost me ~30% of my efficiency. Others, especially those who had the same computer I had, might disagree with this. I could probably have had the same efficiency if I had spent more time tweaking and maintaining it. Maybe that’s my point. I want to focus on delivering value, not on maintaining the system I use to do it, if I can avoid it. I don’t know. In the Digital Dividends framework, I could not afford to lose that 30% of efficiency.

I don’t know if this means I failed the experiment of lobbying for “Cheap Data Science.” It has certainly forced me to deepen the awareness I only glimpsed in Senegal or when we bought the gaming laptops. And I do hope reading this has helped you at least to consider whether the software you run, and the software you build, degrades gracefully.