make
already installed.
make --version
Download and install Rtools: https://cran.r-project.org/bin/windows/Rtools/
Restart R.
Change RStudio’s settings under “Global Options… > Terminal > New terminals open with” so that it reads “Windows PowerShell”. This makes the RStudio terminal use Windows PowerShell instead of Git Bash, so it will look a little weird and act a little differently.
Open a new terminal. Again, this is a Windows PowerShell terminal.
Always run make in the Windows PowerShell terminal inside RStudio. This pipeline won’t work if you open PowerShell directly, outside of RStudio.
You can always switch the terminal back to Git Bash later.
To verify make is installed, run make --version in the Windows PowerShell.
I am not a big fan of RStudio Projects, but if you like them then there is some RStudio make functionality when you use an RStudio Project (https://stat545.com/make-test-drive.html). This lets you avoid using Windows PowerShell.
Only start this if you are willing to spend a few hours fiddling around with your computer.
You can install Ubuntu on Windows to use all of its powerful features. This is what you would need to do for more advanced computational operations. This is how I use Windows.
But on Ubuntu, you’ll need to install a separate version of R, all of the R packages that you usually use, and git.
NOTE: This strategy could take up to a gigabyte of storage.
Install Ubuntu using these directions: https://ubuntu.com/wsl
Open up Ubuntu (this will be a shell).
Follow the instructions on CRAN (https://cran.r-project.org/) to install R for Ubuntu inside the Ubuntu shell.
In Ubuntu, open up the command line for R by running
R
In the R command prompt, install all of the R packages you will need with install.packages(). You can exit R afterwards with q().
Change RStudio’s settings under “Global Options… > Terminal > New terminals open with” so that it reads “Bash (Windows Subsystem for Linux)”.
Open up a new terminal, and you should now be using Ubuntu for your terminal.
To verify make is installed, run make --version in the Ubuntu terminal.
There are a lot of steps in a data analysis, each relying on its own packages:
(httr) |> (dplyr, tidyr) |> (ggplot2, dplyr) |> (stats, broom, tidymodels) |> (shiny, rmarkdown, ggplot2).
Each of these steps should be done in separate files. In practice, a typical pipeline includes multiple files for each step.
Files downstream in this pipeline typically depend on files upstream in this pipeline.
Here is a really basic example from a recent project of mine. Each node is a file name. The direction of the arrows indicates the dependency between the files. E.g. “sims.R” is used to create “sims_out.csv”.
The top row contains R scripts. The middle row contains some simulation output (sims_out.csv), and the bottom row contains the output of my analyses.
If I make a modification to “time.R”, I would only need to re-generate “time.pdf”, since that is the only downstream file.
However, if I modify “sims.R” and re-generate “sims_out.csv”, then I should also re-generate “simplots.pdf”, “qqplots.pdf”, and “time.pdf” because all of those files are created using “sims_out.csv”.
Having to manually remember to re-run all these scripts is prone to error (because of forgetfulness, tediousness, etc.), so ideally there should be some automated way to know that when an upstream file has been changed, all downstream files need to be re-generated.
This is exactly what make does!
You place all commands for make in a file titled exactly “Makefile”. You can create this file in the terminal via
touch Makefile
Inside the Makefile, you prepare a series of rules of the form
target: prereq_1 prereq_2 prereq_3 ...
	first bash command to make target
	second bash command and so on
Each rule contains three things: a target, prerequisites, and commands.
target is the name of the file that will be generated.
prereq_1, prereq_2, prereq_3, etc. are the names of the files which are used to generate target. These can be datasets, R scripts, etc.
Each subsequent line is a bash command that will be evaluated in the terminal in the order listed. Each of these lines must begin with a tab character, not spaces. This sequence of commands is sometimes called a recipe.
Recipes can use basic file commands (touch, cp, mv, rm, etc…), download files (curl and wget), unzip files (7z and tar), convert image files, compile LaTeX documents, etc…
Run an R script
R CMD BATCH --no-save --no-restore input_file.R output_file.Rout
Make sure to change “input_file.R” and “output_file.Rout”.
The --no-save --no-restore options make sure that you are working with a clean environment and that you don’t save this environment after the command is executed. This is a good thing for reproducibility.
Knit an R Markdown file
Rscript -e "library(rmarkdown);render('rmarkdown_file.Rmd')"
Make sure to change “rmarkdown_file.Rmd”
Run a Python script
python3 input_file.py
Make sure to change “input_file.py”
Download data from the web
wget --no-clobber url_to_data
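To see a complete rule in action without needing R installed, here is a minimal sketch that builds a tiny Makefile in a scratch directory and runs it. The file names (data.csv, summary.txt) are hypothetical and chosen only for illustration, and the sketch assumes make and a POSIX shell are available. The Makefile is written with printf so that the "\t" reliably produces the tab that starts each recipe line.

```shell
#!/bin/sh
# Minimal sketch: one rule with a target, a prerequisite, and a two-line recipe.
# All file names here are hypothetical.
set -e
cd "$(mktemp -d)"   # scratch directory, so no real project is touched

# The prerequisite: a small data file with a header and two rows.
printf 'x,y\n1,2\n3,4\n' > data.csv

# Write the Makefile. The '\t%s\n' format puts a tab in front of each recipe line.
{
  printf '%s\n' 'summary.txt : data.csv'
  printf '\t%s\n' \
    'echo "rows including header:" > summary.txt' \
    'wc -l < data.csv >> summary.txt'
} > Makefile

make              # runs both recipe lines, in the order listed
cat summary.txt   # show what the recipe produced
```

Running make prints each command as it executes it, and summary.txt ends up holding the label followed by the line count of data.csv.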
If you have multiple, related, final outputs, it is common to place these as prerequisites to “phony” targets:
.PHONY : phony_target
phony_target : target1 target2 target3 ...
where “phony_target” is a name you provide to represent the operation being performed.
“Phony” targets are not real files. They are just convenient names to use to describe a collection of targets that should be generated.
At the top of the Makefile, you then list the phony targets after all:
.PHONY : all
all : phony_target1 phony_target2 phony_target3 ...
NOTE: It is important to have all : phony_target1 phony_target2 as the very first rule because, when run without arguments, make only builds the first target in the file (along with everything that target depends on). So if all is first, then make will evaluate all targets.
.PHONY : all
all : phony_target1 phony_target2
.PHONY : phony_target1
phony_target1 : target1 target2
.PHONY : phony_target2
phony_target2 : target3
# Comment 1
target1 : prereq1.R data0.csv data1.RDS
	R CMD BATCH --no-save --no-restore prereq1.R prereq1_out.Rout
# Comment 2
target2 : prereq2.py data2.csv
	python3 prereq2.py
# Comment 3
target3 : prereq3.Rmd
	Rscript -e "library(rmarkdown);render('prereq3.Rmd')"
You can generate all target files by running the following in the terminal
make
You can run just the targets in a phony target by specifying the phony target
make phony_target
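The difference between running make on its own and naming a phony target can be checked with a small experiment. The sketch below is an illustration under assumptions: the target names (figs, report) and the text files standing in for real outputs are hypothetical, and make plus a POSIX shell are assumed to be installed.

```shell
#!/bin/sh
# Sketch: "make figs" builds only the files behind one phony target,
# while plain "make" builds everything listed under "all".
# Target and file names are hypothetical.
set -e
cd "$(mktemp -d)"   # scratch directory

{
  printf '%s\n' '.PHONY : all' 'all : figs report'
  printf '%s\n' '.PHONY : figs' 'figs : fig1.txt fig2.txt'
  printf '%s\n' '.PHONY : report' 'report : report.txt'
  printf '%s\n' 'fig1.txt :'
  printf '\t%s\n' 'echo one > fig1.txt'
  printf '%s\n' 'fig2.txt :'
  printf '\t%s\n' 'echo two > fig2.txt'
  printf '%s\n' 'report.txt :'
  printf '\t%s\n' 'echo report > report.txt'
} > Makefile

make figs               # only fig1.txt and fig2.txt are created
after_figs=$(ls)        # remember which files exist at this point
echo "$after_figs"

make                    # no argument: builds "all", so report.txt appears too
ls
```

After make figs, the listing shows the two figure files but no report.txt. The second call to make creates report.txt and leaves the existing figure files alone.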
Make checks, for each target, whether any of its prerequisites have changed and, if so, re-runs the bash commands of that rule. It decides this by comparing file modification times: a target is rebuilt if it does not exist or if it is older than any of its prerequisites.
Make will not re-run commands if the prerequisites have not changed. That is, if no upstream files to target were modified, then target will not be re-generated. This makes make very efficient.
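This behavior is easy to observe with a two-step chain of files. The following is a minimal sketch, assuming make and a POSIX shell are available. The file names (input.txt, middle.txt, final.txt) are hypothetical stand-ins for a script, its intermediate output, and a final result.

```shell
#!/bin/sh
# Sketch: make skips work when nothing changed, and re-runs every
# downstream rule when an upstream file is modified.
# File names are hypothetical.
set -e
cd "$(mktemp -d)"   # scratch directory

printf 'raw\n' > input.txt

{
  printf '%s\n' 'final.txt : middle.txt'
  printf '\t%s\n' 'cp middle.txt final.txt'
  printf '%s\n' 'middle.txt : input.txt'
  printf '\t%s\n' 'cp input.txt middle.txt'
} > Makefile

make                    # first run: builds middle.txt, then final.txt

second_run=$(make)      # nothing changed, so no recipe is executed
echo "$second_run"

sleep 1                 # make sure the new timestamp is strictly later
touch input.txt         # mark the upstream file as modified

third_run=$(make)       # both downstream files are re-generated
echo "$third_run"
```

The second run only reports that final.txt is up to date, while the third run prints both cp commands again, because the change to input.txt propagates through middle.txt to final.txt.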
tl;dr
So if your file structure is
Makefile
analysis/script.R
analysis/report.Rmd
data/data.csv
Then you need to specify your targets according to this structure
data/data.csv : analysis/script.R
	R CMD BATCH --no-save --no-restore analysis/script.R
NOTE that the following will not work because each command is executed in its own subshell (assuming the working directory is the location of the Makefile):
# DOES NOT WORK
data/data.csv : analysis/script.R
	cd analysis
	R CMD BATCH --no-save --no-restore script.R
But you can get around this by putting these commands on one line, with each command separated by a semicolon:
# Works, but not recommended
data/data.csv : analysis/script.R
	cd analysis; R CMD BATCH --no-save --no-restore script.R
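The subshell behavior can be verified directly by printing the working directory instead of running R. This is a minimal sketch, assuming make and a POSIX shell are available. The phony target names (separate, joined) are hypothetical. It joins the commands with && rather than a semicolon so that the second command only runs if cd succeeds, but either separator keeps both commands in the same shell.

```shell
#!/bin/sh
# Sketch: each recipe line runs in its own shell, so a "cd" on one line
# does not affect the next line. Target names are hypothetical.
set -e
cd "$(mktemp -d)"   # scratch directory
mkdir analysis

{
  printf '%s\n' '.PHONY : separate joined'
  printf '%s\n' 'separate :'
  printf '\t%s\n' 'cd analysis' 'pwd'
  printf '%s\n' 'joined :'
  printf '\t%s\n' 'cd analysis && pwd'
} > Makefile

# -s keeps make from echoing the commands, so only the pwd output is captured.
separate_dir=$(make -s separate)
joined_dir=$(make -s joined)

echo "cd and pwd on separate lines: $separate_dir"
echo "cd and pwd on one line:       $joined_dir"
```

The first path is the directory holding the Makefile, because the cd was forgotten as soon as its line finished. The second path ends in analysis.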
Any file manipulation in “script.R” needs to be done assuming the working directory is where Makefile is (notice the single dot):
library(readr)
dat <- read_csv("./data/data.csv")
However, confusingly, when you render an R Markdown file using knitr, you need to assume the working directory is the location of the R Markdown file, not the location of Makefile. So in “report.Rmd” you would write (notice the double dots)
library(readr)
dat <- read_csv("../data/data.csv")
I created a repo with a very basic example of using a Makefile: https://github.com/data-science-master/pvalue_sims
The files after everything is evaluated are:
These files have the following dependency structure:
The Makefile that organizes this structure is
.PHONY : all
all : sims
.PHONY : sims
sims : analysis/panal.html
analysis/panal.html : analysis/panal.Rmd output/pdat.csv
	Rscript -e "library(rmarkdown);render('analysis/panal.Rmd')"
output/pnull.RDS : analysis/null_sims.R
	R CMD BATCH --no-save --no-restore analysis/null_sims.R output/null_sims.Rout
output/pdat.csv : analysis/add_alt_sims.R output/pnull.RDS
	R CMD BATCH --no-save --no-restore analysis/add_alt_sims.R output/add_alt_sims.Rout
Exercise: Clone this repo and run make in the terminal.
Exercise: What happens when you modify “panal.Rmd” and rerun make?
Exercise: What happens when you modify “add_alt_sims.R” and rerun make?
Exercise: What happens when you modify “null_sims.R” and rerun make?
Clone the penguins repo: https://github.com/data-science-master/penguins
Use your new skills to create a makefile to manage a small project that examines the really cool Palmer Penguins data.
The files in the final report are:
These files have the following dependency structure:
Make sure you have the necessary R packages installed:
library(tidyverse)
library(tidymodels)
library(GGally)
library(palmerpenguins)
library(randomForest)
Modify the Makefile to automatically manage this pipeline.
Run make in the terminal to generate all of the output (penguin_class.csv, penguin_pairs.png, and penguin_report.html).
Change the color scheme in the pairs plot and re-run make.
Correct the date field in the YAML header of the R Markdown file that generates “penguin_report.html” and re-run make.
You usually define variables at the top of a file using an equals sign (=).
my_first_variable = fig.pdf
You access a variable’s value by placing the variable name in parentheses after a dollar sign $.
So when a makefile comes across $(my_first_variable), it will actually read it as fig.pdf.
For example, suppose the file fig_create.R generates multiple pdf files: fig1.pdf, fig2.pdf, and fig3.pdf. Then we could write this rule in the following two equivalent ways:
fig1.pdf fig2.pdf fig3.pdf : fig_create.R
	R CMD BATCH --no-save --no-restore fig_create.R
figs = fig1.pdf fig2.pdf fig3.pdf
$(figs) : fig_create.R
	R CMD BATCH --no-save --no-restore fig_create.R
If you do not like writing R CMD BATCH --no-save --no-restore each time, then you can save this command as a variable. The following two rules are equivalent:
fig.pdf : script.R
	R CMD BATCH --no-save --no-restore script.R
rexec = R CMD BATCH --no-save --no-restore
fig.pdf : script.R
	$(rexec) script.R
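One way to check how make expands variables, without actually running R, is a dry run with make -n, which prints the commands a target would execute instead of executing them. The sketch below assumes make and a POSIX shell are available. The file script.R is an empty, hypothetical stand-in, and only two figure files are used to keep it short.

```shell
#!/bin/sh
# Sketch: use "make -n" (dry run) to see variables after expansion.
# script.R is an empty placeholder, and R itself is never invoked.
set -e
cd "$(mktemp -d)"   # scratch directory
touch script.R

{
  printf '%s\n' 'rexec = R CMD BATCH --no-save --no-restore'
  printf '%s\n' 'figs = fig1.pdf fig2.pdf'
  printf '%s\n' '$(figs) : script.R'
  printf '\t%s\n' '$(rexec) script.R'
} > Makefile

# Ask make what it would run to build fig2.pdf, without running it.
expanded=$(make -n fig2.pdf)
echo "$expanded"
```

The printed command has $(rexec) replaced by the full R CMD BATCH call, which confirms that the rule written with variables matches the long-hand version.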
Exercise: Modify the following Makefile (from the p-value exercise) to use variables that reduce the amount of copying/pasting:
.PHONY : all
all : sims
.PHONY : sims
sims : analysis/panal.html
analysis/panal.html : analysis/panal.Rmd output/pdat.csv
	Rscript -e "library(rmarkdown);render('analysis/panal.Rmd')"
output/pnull.RDS : analysis/null_sims.R
	R CMD BATCH --no-save --no-restore analysis/null_sims.R output/null_sims.Rout
output/pdat.csv : analysis/add_alt_sims.R output/pnull.RDS
	R CMD BATCH --no-save --no-restore analysis/add_alt_sims.R output/add_alt_sims.Rout
There are a lot of automatic variables that you can use to make your Makefile more concise.
Here are the ones I use:
$@: The target of the rule.
$<: The first prerequisite.
$^: All of the prerequisites, with spaces between them.
$(@D): The directory part of the target, with the trailing slash removed.
$(@F): The file part of the target.
$(<D): The directory part of the first prerequisite.
$(<F): The file part of the first prerequisite.
$(basename names): A function, rather than an automatic variable, that extracts all but the suffix of each file name in names.
Example: Suppose I have the following rule:
output/figs/foo.pdf : analysis/scripts/gaa.R data/hii.csv
	R CMD BATCH --no-save --no-restore analysis/scripts/gaa.R
Then the automatic variables take the following values:
$@: output/figs/foo.pdf
$<: analysis/scripts/gaa.R
$^: analysis/scripts/gaa.R data/hii.csv
$(@D): output/figs
$(@F): foo.pdf
$(<D): analysis/scripts
$(<F): gaa.R
$(basename $(<F)): gaa
For example, if you always list the R script that generates the target first, and the datasets that the R script uses after it, then the following is typically how I evaluate the R script.
rexec = R CMD BATCH --no-save --no-restore
figure.pdf : script.R data1.csv data2.csv
	$(rexec) $< $(basename $(<F)).Rout
After make expands the variables and automatic variables, this rule reads as
figure.pdf : script.R data1.csv data2.csv
	R CMD BATCH --no-save --no-restore script.R script.Rout
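The automatic variable values can be confirmed by having a recipe echo them. This is a minimal sketch, assuming GNU make (for $^) and a POSIX shell. It reuses the foo.pdf rule from the example above, but the recipe is replaced by echo commands and the prerequisite files are empty placeholders, so nothing is actually generated.

```shell
#!/bin/sh
# Sketch: print the automatic variables for one rule.
# The prerequisite files are empty placeholders, and the recipe only echoes.
set -e
cd "$(mktemp -d)"   # scratch directory
mkdir -p analysis/scripts data output/figs
touch analysis/scripts/gaa.R data/hii.csv

{
  printf '%s\n' 'output/figs/foo.pdf : analysis/scripts/gaa.R data/hii.csv'
  printf '\t%s\n' \
    'echo "target: $@"' \
    'echo "first prerequisite: $<"' \
    'echo "all prerequisites: $^"' \
    'echo "target directory: $(@D)"' \
    'echo "target file: $(@F)"' \
    'echo "script base name: $(basename $(<F))"'
} > Makefile

# -s suppresses the command echo, so only the expanded values are printed.
report=$(make -s)
echo "$report"
```

Because the recipe never creates output/figs/foo.pdf, every call to make prints the values again, which is convenient for experimenting with other variables.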
Exercise: Re-write the following to use an automatic variable instead of the Rmd’s file name in the recipe.
report.html : report.Rmd
	Rscript -e "library(rmarkdown);render('report.Rmd')"
There are lots of pipeline management competitors to make. The ones you’ll most likely run across are
drake is a make-like R package for pipeline management. It has been superseded by targets.
targets is another make-like R package.
snakemake is a Python-based tool for workflow management that is perhaps more readable than make and allows for more options than only using shell commands. This is probably the most popular alternative to make.
But make has been around since the 1970s, is widely used, and isn’t going anywhere.
It is also relatively simple compared to more sophisticated pipeline management tools, so I think that makes it easier to set up and use, with fewer chances for bugs.
Comments
Use # for comments in a makefile.