make is a command-line program of the same breed as classic UNIX programs like grep and ssh. A powerful tool that has stood the test of time, make is available in terminals everywhere serious computing is done.

Originally developed as a build automation tool, make can be used for any workflow that involves running programs and making files. Here is how to use this classic tool in a modern data science project.
Der Anfang ist das Ende
We start where we will end - at the Makefile we will develop together in this post:

# Makefile
all: ./data/clean.json

./data/raw.json: ./ingest.py
	mkdir -p data
	./ingest.py

./data/clean.json: ./data/raw.json ./clean.py
	./clean.py
Don’t worry if this doesn’t make sense now! By the end you’ll understand how it all works.
Notation
Just a quick note to make sure we are on the same page. Shell commands start with a $, with shell output shown below unindented:
$ shell-command
printed output
Filenames are given at the top of the code block, separated from the start of the file by one line:

file.name
first_line_of_file - commonly a shebang like #!/usr/bin/env python3
second_line_of_file
The shell code blocks were run with zsh on macOS, and are bash compatible. The Python code was run with Python 3.8.10.
Anatomy of a Makefile
A Makefile has three components:

- targets - files you are trying to make, or a PHONY target,
- dependencies - targets that need to be made before a target,
- a workflow - a sequence of TAB-indented steps needed to make your target.

target: dependencies
<TAB>workflow
<TAB>workflow
<TAB>workflow
Much of the power of a Makefile comes from being able to make targets depend on other targets. The Makefile below shows a simple pipeline with one dependency - end depends on begin. Both begin and end are PHONY targets - meaning they do not create a file:
# Makefile
.PHONY: begin end

begin:
	echo "The beginning is the end"

end: begin
	echo "Der Anfang ist das Ende"
A Makefile can represent & manage complex data pipelines. A workflow can do anything you can do with a shell, making even a single target arbitrarily powerful.
Running a Makefile
Take the Makefile below, which creates an empty data.html file:

# Makefile
data.html:
	echo "making data.html"
	touch data.html
Running make without a target will run the first target - in our case the only target, data.html. make prints out the commands it runs:

$ make
echo "making data.html"
making data.html
touch data.html
If we run this again, we see that make behaves differently - it doesn’t make data.html again:
$ make
make: `data.html' is up to date.
If we reset our pipeline (by deleting data.html), running make will run our pipeline again:

$ rm data.html; make
echo "making data.html"
making data.html
touch data.html
Above we have demonstrated one useful feature of make - intelligent re-execution of pipelines. Under the hood, make uses the timestamps on files to understand what to run (or not run).

Before we get too carried away, let's set out our motivation for using make in a data project.
Why make for data science?
Workflow documentation
Documenting the project workflow is a basic quality of a good data project.

Most projects need only one Makefile - making this file a natural, central place for your project (second only to the README.md). It's an anchor your project is set up around.

A Makefile is excellent documentation - machine readable & executable, the best kind. Like any text file, it's easy to track changes in source control.

Creating your data science workflow as a sequence of make targets also has the benefit of massaging your pipelines to be more modular - encouraging functional decomposition of shell or Python scripts.
CLI for free
A Makefile tightly integrates with the shell environment it runs in. We can easily configure variables at runtime, via either shell environment variables or command line arguments.

The Makefile below has two variables - NAME and COUNTRY:
# Makefile
all:
	echo "$(NAME) is from $(COUNTRY)"
We can set our two variables using two different methods:

- export NAME=adam - setting our NAME variable via a shell environment variable,
- COUNTRY=NZ - setting our COUNTRY variable via an argument passed to the make command.

$ export NAME=adam; make COUNTRY=NZ
echo "adam is from NZ"
adam is from NZ
We can also assign the output of shell commands to variables using:

VAR = $(shell echo value)
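As a hypothetical sketch - the TODAY variable and the date format below are our own choices, not part of this project - we could capture today's date once, when make parses the Makefile:

# Makefile
TODAY = $(shell date +%Y-%m-%d)

all:
	echo "Built on $(TODAY)"

Running make would echo the command and then print something like Built on 2021-12-27.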
Intelligent pipeline re-execution
We have already seen intelligent pipeline re-execution in action - it's a powerful way to avoid re-running code that doesn't need to run.

make uses timestamps on files to track what to re-run (or not re-run) - it won't re-run a target that is already up to date, and it will re-run a target when its dependencies change.

This can save you lots of time - no re-running that expensive data ingestion and cleaning step while you are working on model selection.
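To make the timestamp logic concrete, here is a minimal hypothetical sketch - report.html, input.csv and ./report.py are made-up names, not part of our pipeline:

# Makefile
report.html: input.csv
	./report.py

$ make                    # report.html does not exist - make runs ./report.py
$ make                    # report.html is newer than input.csv - nothing to do
$ touch input.csv; make   # input.csv is now newer - make runs ./report.py again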
Our pipeline
We will build a data pipeline - using Python scripts as mocks for real data tasks - with data flowing from left to right.

Our ingestion step creates raw data, and our cleaning step creates clean data:

[diagram: ingest.py -> raw data -> clean.py -> clean data]

We can look at the same pipeline in terms of the dependency between the data artifacts & source code of our pipeline - with dependency flowing from right to left:

[diagram: clean data depends on raw data & clean.py; raw data depends on ingest.py]
Our clean data depends on both the code used to generate it and the raw data. Our raw data depends only on the ingestion Python script.
Developing our pipeline in a Makefile
0. Our pipeline components
Let's look at the two components in our pipeline - an ingestion step and a cleaning step, both of which are Python scripts.

ingest.py writes some data to a JSON file:
# ingest.py
#!/usr/bin/env python3
from datetime import datetime
import json
from pathlib import Path

data = {"data": "raw", "ingest-time": datetime.utcnow().isoformat()}
print("ingesting", data)

fi = Path.cwd() / "data" / "raw.json"
fi.parent.mkdir(exist_ok=True)
fi.write_text(json.dumps(data))
We can run this Python script and use cat to take a look at its JSON output:

$ ./ingest.py; cat data/raw.json
ingesting {'data': 'raw', 'ingest-time': '2021-12-19T13:57:53.407280'}
{"data": "raw", "ingest-time": "2021-12-19T13:57:53.407280"}
clean.py takes the raw data generated and updates the data field to clean:
# clean.py
#!/usr/bin/env python3
from datetime import datetime
import json
from pathlib import Path

data = json.loads((Path.cwd() / "data" / "raw.json").read_text())
data["data"] = "clean"
data["clean-time"] = datetime.utcnow().isoformat()
print("cleaning", data)

fi = Path.cwd() / "data" / "clean.json"
fi.write_text(json.dumps(data))
We can use cat again to look at the result of our cleaning step:

$ ./clean.py; cat data/clean.json
cleaning {'data': 'clean', 'ingest-time': '2021-12-19T13:57:53.407280', 'clean-time': '2021-12-19T13:59:47.640153'}
{"data": "clean", "ingest-time": "2021-12-19T13:57:53.407280", "clean-time": "2021-12-19T13:59:47.640153"}
1. Track pipeline dependencies
Let’s start out with a Makefile that runs our two stage data pipeline.

We are already taking advantage of the ability to create dependencies between our pipeline stages, making our clean target depend on our raw target.

We have also included a top level meta target all, which depends on our clean step:
# Makefile
all: clean

raw:
	mkdir -p data
	./ingest.py

clean: raw
	./clean.py
We can use this Makefile from a terminal by running make, which will run our meta target all:
$ make
mkdir -p data
./ingest.py
ingesting {'data': 'raw', 'ingest-time': '2021-12-19T14:14:54.765570'}
./clean.py
cleaning {'data': 'clean', 'ingest-time': '2021-12-19T14:14:54.765570', 'clean-time': '2021-12-19T14:14:54.922659'}
If we run only the clean step of our pipeline, we still run both the ingestion and cleaning steps. Our clean target depends on raw, and because neither target creates a file with its own name, make considers both always out of date:
$ make clean
mkdir -p data
./ingest.py
ingesting {'data': 'raw', 'ingest-time': '2021-12-19T14:15:21.510687'}
./clean.py
cleaning {'data': 'clean', 'ingest-time': '2021-12-19T14:15:21.510687', 'clean-time': '2021-12-19T14:15:21.667561'}
What if we only want to re-run our cleaning step? Our next Makefile iteration will avoid this unnecessary re-execution.
2. Track pipeline outputs
Now let’s improve our Makefile by changing our targets to be actual files - the files generated by each target:

# Makefile
all: ./data/clean.json

./data/raw.json:
	mkdir -p data
	./ingest.py

./data/clean.json: ./data/raw.json
	./clean.py
Removing any output from previous runs with rm -rf ./data, we can run our full pipeline with make:
$ rm -rf ./data; make
mkdir -p data
./ingest.py
ingesting {'data': 'raw', 'ingest-time': '2021-12-27T13:56:30.045009'}
./clean.py
cleaning {'data': 'clean', 'ingest-time': '2021-12-27T13:56:30.045009', 'clean-time': '2021-12-27T13:56:30.193770'}
Now if we run make a second time, nothing happens:
$ make
make: Nothing to be done for `all'.
If we want to re-run only our cleaning step, we can remove its previous output and run our pipeline again - with make knowing that it only needs to run the cleaning step, using the existing raw data:
$ rm ./data/clean.json; make
./clean.py
cleaning {'data': 'clean', 'ingest-time': '2021-12-27T13:56:30.045009', 'clean-time': '2021-12-27T14:02:30.685974'}
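As an aside, make's -n (dry run) flag prints the commands make would run, without executing them. Had we used it before re-running our pipeline, we would expect to see only the cleaning step:

$ rm ./data/clean.json; make -n
./clean.py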
3. Track source code dependencies
The final improvement we will make to our pipeline is to track dependencies on source code.

Let's update our clean.py script to also track clean-date:
# clean.py
#!/usr/bin/env python3
from datetime import datetime
import json
from pathlib import Path

data = json.loads((Path.cwd() / "data" / "raw.json").read_text())
data["data"] = "clean"
data["clean-time"] = datetime.utcnow().isoformat()
data["clean-date"] = datetime.utcnow().strftime("%Y-%m-%d")
print("cleaning", data)

fi = Path.cwd() / "data" / "clean.json"
fi.write_text(json.dumps(data))
And now our final pipeline:
# Makefile
all: ./data/clean.json

./data/raw.json: ./ingest.py
	mkdir -p data
	./ingest.py

./data/clean.json: ./data/raw.json ./clean.py
	./clean.py
Now, after updating only our clean.py script, make will run just the cleaning step again:

$ make
./clean.py
cleaning {'data': 'clean', 'ingest-time': '2021-12-27T13:56:30.045009', 'clean-time': '2021-12-27T14:10:06.799127', 'clean-date': '2021-12-27'}
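If we instead touched ingest.py, we would expect both steps to re-run - the raw data depends on the ingestion script, and the clean data depends on the raw data (timestamps elided in this sketch):

$ touch ./ingest.py; make
mkdir -p data
./ingest.py
ingesting {'data': 'raw', 'ingest-time': '...'}
./clean.py
cleaning {'data': 'clean', 'ingest-time': '...', 'clean-time': '...', 'clean-date': '...'}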
Summary
That’s it! We hope you have enjoyed learning a bit about make
& Makefile
, and are enthusiastic to experiment with it in your data work.
There is more depth and complexity to make
and the Makefile
- what you have seen so far is hopefully enough to encourage you to experiment and learn more while using a Makefile
in your own project.
Key takeaways are:

- make is a powerful, commonly available tool that can run arbitrary shell workflows,
- a Makefile forms a natural, central point of execution for a project, with a simple CLI that integrates well with the shell,
- make can intelligently re-execute your data pipeline - keeping track of the dependencies between code and data.
Thanks for reading!