Make + SQL: An old new way for Data Science workloads

Well I can hear you say: Why? Apparently both technologies are quite old and both have been dropped out of favor.

Make has been with us from 1976 and SQL is even older first appearing in early seventies. They both have an arcane reputation at least. Even the most modern version of Make which is GNU Make is considered quirky and it is avoided for new projects. There are several reasons why is that the case but I will list only 2 for brevity.

Tools that generate Makefiles such as autoconf/automake have raised the threshold to understand the build procedure and are blamed for slow and difficult to debug builds.
Newer languages are coming with their own simpler – “better” build tool, tailored for them.

These reasons led to overproduction of next gen build tools like Ant, Maven, Meson, SCons, CMake, ninja, cargo, bazel and probably many more.

Similarly SQL hits a number of nerves among developers. Some of them are:

SQL does not compose well with the rest of the program.
SQL is not type safe.
Cannot declare variables (inside a query) that leads to duplication, refactoring resistance and ultimately to ossification.

The pain inflicted is so annoying that there are frameworks (ORMs) that try to hide SQL from the main application’s business logic. But SQL is not easy to hide and almost impossible to ignore.

And now we come the right side of the equation: Data Science. I do not feel like giving a definition that you could easily find in Wikipedia. Instead I will say that in many cases modern tools are not well suited for certain workloads. For example using pandas in Jupyter notebooks is great until you hit memory limits.

To clear things up let’s define some rules of engagement. What I am proposing below is not a silver bullet applicable in all cases, but it is something worth evaluating when certain conditions hold true. These are:

Your dataset is updated outside of your control and you need to rerun a workflow to produce derived data – maybe to feed your next generation LLM or something.
Your dataset is too big to fit in memory – even with compressing tricks these kind of libraries may have. So basically we are exchanging RAM with HDD here but we are doing it in an organized (tables) and optimized way (indices) that is informed about our access patterns and not in a crude horizontal way like virtual memory (swap).
Joining tables is a hard requirement. Join in SQL is the matching and combination of rows from distinct sets (tables) in case you have never heard that before. If you do not need to join (or self join) then you can skip this article as you could probably escape with a few scratches and a single pass, first order solution, like awk, sed or SQL itself wrapped in a shell script.

Well if you read so far and you started nodded affirmatively in the bullets above here is the sales pitch. What if:

You could have variables inside SQL queries? Not only on values but also on column names, table names and database names.
You could have ifs, loops and other high level constructs on query generation without escaping to stored procedures. That means you can have some rudimentary code sharing for your SQLs.
You could specify dependencies and automatically construct a DAG (Directed Acyclic Graph) and never rebuild a table that doesn’t need rebuilding. That means your workflow is restartable.
You could exploit the non dependent parts of your DAG and build them in parallel.
You could print / inspect the above mentioned DAG for obvious bugs and parallelization opportunities.

All these with only requirements access to

SQL server – and its client let’s say MariaDB but MySQL, PostgreSQL, SQLite and possibly DuckDB should be possible with some adjustments.
GNU Make
Standard Unix utilities such as cat, grep, gzip, sed, awk etc.

So if you are sold let’s roll up our sleeves and let’s see how is done.

First the boilerplate:

# Specified per project# Current DATABASE if no DATABASE was defined in the command line
DATABASE ?= dbtest
# Targets to build
TARGETS = weather_seismic
# Common Infrastructure - you can split it in an include file if you like
# Debugging
#DEBUG_SHELL:=1
#DRY_RUN := echo
# Use one shell per Makefile target invocation. This allows the use of shell variables inside the same target
.ONESHELL:
.DELETE_ON_ERROR:
# Maybe you prefer bash? Leave commented to use system's default (dash in Debian)
#SHELL=/bin/bash
.SHELLFLAGS = -ec $(if $(DEBUG_SHELL),-xv,)
.PHONY : all clean FORCE
all: $(TARGETS)
MAKEFILE:= $(realpath $(firstword $(MAKEFILE_LIST)))
MAKEDIR:= $(patsubst %/,%,$(dir $(MAKEFILE)))
# Use InnoDB
#ENGINE := ENGINE = InnoDB
# Use default -- MyISAM for my configuration because it handles better my use cases.
ENGINE :=
# Connects to server without DB. Required for DB creation.
MYSQLWODB = mysql $(MYSQL_EXTRA) -N --show-warnings -n
MYSQL = $(DRY_RUN) mysql $(MYSQL_EXTRA) -N --show-warnings -n -v -v -v $(DATABASE)
# silent variant
MYSQLS = $(DRY_RUN) mysql $(MYSQL_EXTRA) -N --show-warnings -n $(DATABASE)
# Outputs to stdout without headers. Required for checks.
MYSQLC = $(DRY_RUN) mysql $(MYSQL_EXTRA) -N $(DATABASE)
# Outputs to stdout with headers and tab separated
MYSQLH = $(DRY_RUN) mysql $(MYSQL_EXTRA) -B $(DATABASE)
GRAPHVIZ?= dot
#GRAPHVIZ ?= sfdp -Goverlap=false -Gconcentrate=true -Gsplines=true
GRAPHVIZ_RATIO_MONITOR_4_3 = 0.7500
GRAPHVIZ_RATIO_MONITOR_16_9 = 0.5625
GRAPHVIZ_RATIO_A4_PORTRAIT = 1.5397
GRAPHVIZ_RATIO_A4_LANDSCAPE = 0.6495
GRAPHVIZ_RATIO = $(GRAPHVIZ_RATIO_A4_LANDSCAPE)
make-db.log: $(MAKEFILE) FORCE
	$(MAKE) -f $< -nd | sed -e 's,$(MAKEDIR)/,,g' > $@
make-db.png: make-db.log ## Create the dependency graph in PNG
	make2graph < $< | sed -e 's/digraph G {/digraph G { ratio = $(GRAPHVIZ_RATIO)/g' | $(GRAPHVIZ) -Tpng -o $@
make-db.svg: make-db.log ## Create the dependency graph in SVG
	make2graph < $< | sed -e 's/digraph G {/digraph G { ratio = $(GRAPHVIZ_RATIO)/g' | $(GRAPHVIZ) -Tsvg -o $@
# Automatic Documentation for Makefile targets. Taken from https://news.ycombinator.com/item?id=30137254
.PHONY: help
help: ## Show this help screen
	@grep -E '^[^:]+:.*?##.*$$' $(MAKEFILE_LIST) | grep -v MAEKFILE_LIST | sort -k1,1 | sed -e 's,^[^:]*/,,g' -e 's/:/: /g' | awk 'BEGIN {FS = ":[^:]*?## "}; {printf "\033[1m%-30s\033[0m %s\n", $$1, $$2}'
# This is from https://blog.jgc.org/2015/04/the-one-line-you-should-add-to-every.html
# It allows to print variables like this:
# make print-DATABASE
print-var-%: ; @echo Variable: $*=$($*) ## print variables specified after print-var- prefix
#based on https://stackoverflow.com/questions/23908756/getting-the-list-of-dependencies-of-a-target
define print_deps
$(shell $(MAKE) -f $(MAKEFILE) -qp FORCE | sed -ne '/^$(strip $(1)):/ p;' | sed -e 's/^$(strip $(1)): *//g')
endef
print-deps-%: ## print dependencies of specified target specified after print-deps- prefix
	@echo "Dependencies for $(subst print-deps-,,$@): $(call print_deps,$(subst print-deps-,,$@))"

The basic building block is the capability the SQL client to execute scripts from its stdin wrapped by the shell capability of multiline EOF marked strings. Let’s say we have a file named weather.txt with some columns and we want to load it in a table conveniently named weather.

weather: weather.txt
	@$(MYSQL) << EOF
	DROP TABLE IF EXISTS $@;
	CREATE TABLE $@ (id INT, tstamp DATETIME, lat DOUBLE, lon DOUBLE, air_direction INT, air_speed DOUBLE, humidity DOUBLE, rain DOUBLE);
	#LOAD DATA INFILE '$<' INTO TABLE $@;
	# or maybe the LOCAL variant?
	LOAD DATA LOCAL INFILE '$<' INTO TABLE $@;
	ALTER TABLE $@ ADD INDEX(id);
	EOF
	touch $@;

Some notes:

The target assumes that a database named $(DATABASE) already exists. This can be done with another target that is prerequisite for the weather table but I will leave it as an exercise for the determined reader.
It is preferred to use automatic variables such as $@ or $< freely in order to minimize the use of explicit names. This makes the code easier to refactor and to copy paste fragments.
You can access Makefile variables with the $(VARIABLE) notation or shell variables with $$VARIABLE inside the multiline EOF string. Loads of hair pulling fun is waiting you in the road to become an expert in shell quoting and Make evaluation order.
Loading the file weather.txt has its own challenges. You may have to download it first, put it is some sanctioned directory where the MySQL server can find and is allowed to read it or specify the LOCAL parameter of the LOAD command at expense of performance.
In practice the data never arrives clean. You probably want to pass them through awk, sed or some custom cleanup. You can automate much of these with Makefile functions but we will leave this for another day.
Since GNU Make cannot tell if a database table exists, we fake that by creating (touching) a file with the same name. If the file exists and it is newer from its dependencies the target (table) will not be rebuilt again.

Now let’s do the same with another table with seismic data.

seismic: seismic.txt
	@$(MYSQL) << EOF
	DROP TABLE IF EXISTS $@;
	CREATE TABLE $@ (id INT, tstamp DATETIME, lat DOUBLE, lon DOUBLE, seismic_type INT, seismic_energy DOUBLE);
	#LOAD DATA INFILE '$<' INTO TABLE $@;
	# or maybe the LOCAL variant?
	LOAD DATA LOCAL INFILE '$<' INTO TABLE $@;
	ALTER TABLE $@ ADD INDEX(id);
	EOF
	touch $@;

Finally let’s relate these two in order to see if there is any correlation.

weather_seismic: weather seismic
	@$(MYSQL) << EOF
	DROP TABLE IF EXISTS $@;
	CREATE TABLE $@ (weather_id INT, seismic_id INT, tdiff TIME, lat DOUBLE, lon DOUBLE, correlation DOUBLE);
	# Do your data science magic here..
	# INSERT INTO $@ SELECT * FROM  weather w, seismic s WHERE...?
	# End of magic
	ALTER TABLE $@ ADD INDEX(weather_id);
	ALTER TABLE $@ ADD INDEX(seismic_id);
	EOF
	touch $@;

Bonus Points:

You can use the make-db.png (or svg) target to create dependency graphs. They can be pretty complicated and a graphical overview always help to tame complexity. make2graph is part of makefile2graph debian package or else it can be found at https://github.com/lindenb/makefile2graph. You can see an example at the top image.
You can write per target documentation by appending ## in the line of the target definition. You can invoke the documentation by specifying: make help
You can print the value of any single variable by specifying: make print-var-DATABASE (to show the $(DATABASE))
You can print all the dependencies of a target by specifying: make print-deps-TARGET to show all dependencies of a TARGET.
You can employ GNU Make functions to reduce boiler plate and consolidate some code. You can even generate Makefile targets on the fly with the eval builtin function, but then your simple and nifty Makefile starts to move to uncharted lispy code auto-generation muddy waters.
Finally you can invoke make with – j 4 to instruct make to execute 4 targets in parallel assuming it is allowed by the dependency graph. The example above as is evident by the picture at the start can utilize only 2 cores simultaneously.

That’s all folks.

Interested in a wide range of topics from deep learning and big-data to embedded, from web GUI (GWT) to API and code consolidation. Maintainance is also fun for me unlike many other normal people... View all posts by vasvir