Managing Workflows with GNU Make

Author

David Gerard

Published

June 3, 2025

Learning Objectives

  • Understand why multi-step data analyses benefit from an automated build tool.
  • Install GNU Make on your operating system.
  • Write a Makefile with rules, phony targets, variables, and automatic variables, and run it with make.

Installation Mac/Linux

  • If you are using Mac or Linux, you probably have make already installed.
    • To verify, open up the terminal and run make --version

Installation Windows

  • This is one of the cases where Windows is really inferior to Mac and Linux. I will show you two installation approaches. You only need to choose one.

Use Rtools (light)

  • Download and install Rtools: https://cran.r-project.org/bin/windows/Rtools/

  • Restart R.

  • Change RStudio’s settings “Global Options… > Terminal > New terminals open with” so that it reads “Windows PowerShell”. This makes the RStudio terminal Windows PowerShell instead of Git Bash (so it will look a little different and behave a little differently).

  • Open a new terminal. Again, this is a Windows PowerShell.

  • Always run make in the Windows PowerShell terminal from RStudio. This approach won’t work if you open PowerShell directly outside of RStudio.

  • You can always switch the terminal back to Git Bash later.

  • To verify make is installed, run make --version in the Windows PowerShell.

  • I am not a big fan of RStudio Projects, but if you like them, there is some RStudio make functionality when you use an RStudio Project (https://stat545.com/make-test-drive.html). This lets you avoid using Windows PowerShell.

Windows Subsystem for Linux (heavy)

  • Only start this if you are willing to spend a few hours fiddling around with your computer.

  • You can install Ubuntu on Windows to use all of its powerful features. This is what you would need to do for more advanced computational operations. This is how I use Windows.

  • But on Ubuntu, you’ll need to install a separate version of R, all of the R packages that you usually use, and git.

  • NOTE: This strategy could take up to a gigabyte of storage.

  • Install Ubuntu using these directions: https://ubuntu.com/wsl

  • Open up Ubuntu (this will be a shell).

  • Follow the instructions on CRAN (https://cran.r-project.org/) to install R for Ubuntu inside the Ubuntu shell (a rough sketch of the commands appears at the end of this list).

  • In Ubuntu, open up the R command line:

    R
  • In the R command prompt, install all of the R packages you will need with install.packages(). You can exit R afterwards with q().

  • Change RStudio’s settings “Global Options… > Terminal > New terminals open with” so that it reads “Bash (Windows Subsystem for Linux)”.

  • Open up a new terminal, and you should now be using Ubuntu for your terminal.

  • To verify make is installed, run make --version in the Ubuntu terminal.
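
  • For orientation, the setup inside the Ubuntu shell usually looks roughly like the following. This is only a sketch; follow the CRAN page for the current, authoritative commands (Ubuntu’s default r-base package may be older than the CRAN release).

    # rough sketch only -- see the CRAN instructions for Ubuntu
    sudo apt update
    sudo apt install -y r-base git make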

Motivation

  • There are a lot of steps in a data analysis

    • Download data (httr) |>
      tidy data (dplyr, tidyr) |>
      exploratory data analysis (ggplot2, dplyr) |>
      statistical analysis (stats, broom, tidymodels) |>
      present results (shiny, rmarkdown, ggplot2).
  • Each of these steps should be done in separate files. But a more typical pipeline would include multiple files for each step.

    • Download multiple datasets using multiple scripts
    • Break down complex tidying into multiple scripts.
    • Explore many aspects of your data, using multiple scripts.
    • Modify your statistical models based on model-checks.
    • Develop multiple reports on different aspects of your project.
  • Files downstream in this pipeline typically depend on files upstream in this pipeline.

  • Here is a really basic example from a recent project of mine. Each node is a file name. The direction of the arrows indicates the dependency between the files. E.g. “sims.R” is used to create “sims_out.csv”.

    The top row contains R scripts. The middle row contains some simulation output (sims_out.csv), and the bottom row contains the output of my analyses.

  • If I make a modification to “time.R”, I would only need to re-generate “time.pdf”, since that is the only downstream file.

  • However, if I modify “sims.R” and re-generate “sims_out.csv”, then I should also re-generate “simplots.pdf”, “qqplots.pdf”, and “time.pdf” because all of those files are created using “sims_out.csv”.

  • Having to manually remember to re-run all of these scripts is prone to error (because of forgetfulness, tediousness, etc.), so ideally there should be some automated way to know that, when a file upstream has been changed, all downstream files need to be re-generated.

  • This is exactly what make does!

Make

  • You place all commands for make in a file exactly titled “Makefile”. You can create this file in the terminal via

    touch Makefile

Rules

  • Inside the Makefile, you prepare a series of rules of the form

    target: prereq_1 prereq_2 prereq_3 ...
      first bash command to make target
      second bash command and so on
  • Each rule contains three things: a target, prerequisites, and commands.

  • target is the name of the file that will be generated.

  • prereq_1, prereq_2, prereq_3, etc are the names of the files which are used to generate target. These can be datasets, R scripts, etc.

  • Each subsequent line is a bash command that will be evaluated in the terminal in the order listed. This sequence of commands is sometimes called a recipe (a small, complete example appears after this list).

    • IMPORTANT: Make sure each bash command has one tab (not spaces) at the start of the line. If you copy and paste a Makefile from a website, the tabs are usually converted to spaces, which produces an error!
    • From the terminal, it is possible to evaluate R scripts, python scripts, and knit R Markdown files.
    • You can also use the usual bash commands you are used to (touch, cp, mv, rm, etc.).
    • There are tons of other commands that you can install that allow you to do things like download files (curl and wget), unzip files (7z and tar), convert image files, compile LaTeX documents, etc…
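  • For example, here is a complete rule, assuming a hypothetical script fig_create.R that reads data.csv and saves fig.pdf (remember that the recipe line must start with a tab):

    # fig.pdf is the target; fig_create.R and data.csv are its prerequisites
    fig.pdf : fig_create.R data.csv
      Rscript fig_create.R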

Useful bash commands for data science

  • Run an R script

    R CMD BATCH --no-save --no-restore input_file.R output_file.Rout

    Make sure to change “input_file.R” and “output_file.Rout”

    • The --no-save --no-restore options make sure that you are working with a clean environment and that you don’t save this environment after the command is executed. This is a good thing for reproducibility.
  • Render a Quarto Document

    quarto render quarto_file.qmd
  • Use Quarto to render a Jupyter Notebook (.ipynb file):

    quarto render notebook.ipynb --execute
  • Knit an R Markdown file

    Rscript -e "library(rmarkdown);render('rmarkdown_file.Rmd')"

    Make sure to change “rmarkdown_file.Rmd”

  • Run python script

    python3 input_file.py

    Make sure to change “input_file.py”

  • Download data from the web

    wget --no-clobber url_to_data
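  • Any of these commands can be used as the recipe of a rule. For example, here is a sketch of a rule that downloads a raw data file; the URL and file name are hypothetical placeholders:

    # hypothetical URL and output file
    data/raw.csv :
      wget --no-clobber --directory-prefix=data https://example.com/raw.csv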

Phony Targets

  • If you have multiple related final outputs, it is common to place them as prerequisites of a “phony” target:

    .PHONY : phony_target
    phony_target : target1 target2 target3 ...

    where “phony_target” is a name you provide to represent the operation being performed.

  • “Phony” targets are not real files. They are just convenient names to use to describe a collection of targets that should be generated.

  • At the top of the Makefile, you then list all of the phony targets as prerequisites of a phony target called all:

    .PHONY : all
    all : phony_target1 phony_target2 phony_target3 ...
  • NOTE: It is important to have all : phony_target1 phony_target2 ... as the very first rule, because by default make only builds the first target in the file. So if all is first, then running make builds everything.
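
  • Another common phony target is clean, which deletes generated files so that you can rebuild everything from scratch. A minimal sketch (the paths here are hypothetical; adjust them to your own project):

    .PHONY : clean
    clean :
      rm -f output/*.Rout output/*.csv

    You would run this with make clean.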

Comments

  • Use the hash symbol (#) for comments in a Makefile.

Pseudo-code for Makefile

.PHONY : all
all : phony_target1 phony_target2

.PHONY : phony_target1
phony_target1 : target1 target2

.PHONY : phony_target2
phony_target2 : target3

# Comment 1
target1 : prereq1.R data0.csv data1.RDS
  R CMD BATCH --no-save --no-restore prereq1.R prereq1_out.Rout

# Comment 2
target2 : prereq2.py data2.csv
  python3 prereq2.py

# Comment 3
target3 : prereq3.Rmd
  Rscript -e "library(rmarkdown);render('prereq3.Rmd')"

Evaluate a makefile

  • You can generate all target files by running the following in the terminal

    make
  • You can run just the targets in a phony target by specifying the phony target

    make phony_target
  • For each target, make checks whether any of the prerequisites have changed (specifically, whether any prerequisite has a newer modification time than the target) and, if so, re-runs the bash commands of that rule.

  • Make will not re-run the commands if the prerequisites have not changed. That is, if no files upstream of a target were modified, then that target will not be re-generated. This makes make very efficient.
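
  • You can see this behavior in the terminal. Here is a sketch, assuming a Makefile whose rules really do create their target files:

    make              # first run: builds every target
    make              # nothing changed, so make reports everything is up to date
    touch prereq2.py  # update the modification time of one prerequisite
    make              # only the rules that depend on prereq2.py are re-run

    Running make -n (a “dry run”) prints the commands make would execute without actually running them, which is handy for checking a Makefile.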

Working directory considerations

  • tl;dr

    • For R and the terminal: Assume the working directory is the location of the Makefile.
    • For R Markdown: Assume the working directory is the location of the Rmd file.
  • So if your file structure is

    Makefile
    analysis/script.R
    analysis/report.Rmd
    data/data.csv

    Then you need to specify your targets and prerequisites according to this structure:

    data/data.csv : analysis/script.R
      R CMD BATCH --no-save --no-restore analysis/script.R

    NOTE that the following will not work because each line of the recipe is executed in its own subshell (assuming the working directory is the location of the Makefile):

    # DOES NOT WORK
    data/data.csv : analysis/script.R
      cd analysis 
      R CMD BATCH --no-save --no-restore script.R

    But you can get around this by putting these commands on one line, with each command separated by a semicolon:

    # Works, but not recommended
    data/data.csv : analysis/script.R
      cd analysis; R CMD BATCH --no-save --no-restore script.R
  • Any file manipulation in “script.R” needs to be done assuming the working directory is where Makefile is (notice the single dot):

    library(readr)
    dat <- read_csv("./data/data.csv")
  • However, confusingly, when you render an R Markdown file using knitr, you need to assume the working directory is the location of the R Markdown file, not the location of the Makefile. So in “report.Rmd” you would write (notice the double dots)

    library(readr)
    dat <- read_csv("../data/data.csv")
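  • Putting this together, a rule for the hypothetical “report.Rmd” above might look like this, with all paths written relative to the Makefile:

    analysis/report.html : analysis/report.Rmd data/data.csv
      Rscript -e "library(rmarkdown);render('analysis/report.Rmd')"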

A worked example

  • I created a repo with a very basic example of using a Makefile: https://github.com/data-science-master/pvalue_sims

  • The files after everything is evaluated are:

    • Makefile
    • Readme.Rmd
    • Readme.md
    • analysis
      • add_alt_sims.R
      • null_sims.R
      • panal.html
      • panal.Rmd
    • output
      • add_alt_sims.Rout
      • null_sims.Rout
      • pdat.csv
      • pnull.RDS
  • These files have the following dependency structure: “null_sims.R” creates “pnull.RDS”, “add_alt_sims.R” uses “pnull.RDS” to create “pdat.csv”, and “panal.Rmd” uses “pdat.csv” to create “panal.html”.

  • The Makefile that organizes this structure is

    .PHONY : all
    all : sims
    
    .PHONY : sims
    sims : analysis/panal.html
    
    analysis/panal.html : analysis/panal.Rmd output/pdat.csv
        Rscript -e "library(rmarkdown);render('analysis/panal.Rmd')"
    
    output/pnull.RDS : analysis/null_sims.R
        R CMD BATCH --no-save --no-restore analysis/null_sims.R output/null_sims.Rout
    
    output/pdat.csv : analysis/add_alt_sims.R output/pnull.RDS
        R CMD BATCH --no-save --no-restore analysis/add_alt_sims.R output/add_alt_sims.Rout
  • Exercise: Clone this repo and run make in the terminal.

  • Exercise: What happens when you modify “panal.Rmd” and you rerun make?

  • Exercise: What happens when you modify “add_alt_sims.R” and rerun make?

  • Exercise: What happens when you modify “null_sims.R” and rerun make?

Your turn

  • Clone the penguins repo: https://github.com/data-science-master/penguins

  • Use your new skills to create a makefile to manage a small project that examines the really cool Palmer Penguins data.

  • The files in the final repo, after everything is evaluated, are:

    • Makefile
    • Readme.Rmd
    • Readme.md
    • analysis
      • classify_penguins.R
      • penguin_report.html
      • penguin_report.Rmd
      • plot_penguin.R
    • output
      • penguin_class.csv
      • penguin_pairs.png
  • These files have a dependency structure that your Makefile should encode: each output file depends on the script (and any other files) used to create it.

  1. Make sure you have the necessary R packages installed (the following library() calls should run without errors):

    library(tidyverse)
    library(tidymodels)
    library(GGally)
    library(palmerpenguins)
    library(randomForest)
  2. Modify the Makefile to automatically manage this pipeline.

  3. Run make in the terminal to generate all of the output (penguin_class.csv, penguin_pairs.png, and penguin_report.html)

  4. Change the color scheme in the pairs plot and re-run make.

  5. Correct the date field in the YAML header in “penguin_report.Rmd” and re-run make

Variables

  • You usually define variables at the top of the Makefile using an equals sign (=).

    my_first_variable = fig.pdf
  • You access a variable value by placing the variable name in parentheses after a dollar sign $.

  • So when a makefile comes across $(my_first_variable), it will actually read it as fig.pdf.

  • For example, suppose the file fig_create.R generates multiple pdf files: fig1.pdf, fig2.pdf, and fig3.pdf. Then we could write this rule in the following two equivalent ways:

    fig1.pdf fig2.pdf fig3.pdf : fig_create.R
      R CMD BATCH --no-save --no-restore fig_create.R

    figs = fig1.pdf fig2.pdf fig3.pdf
    $(figs) : fig_create.R
      R CMD BATCH --no-save --no-restore fig_create.R
  • If you do not like writing R CMD BATCH --no-save --no-restore each time, then you can save this command as a variable. The following two rules are equivalent:

    fig.pdf : script.R
      R CMD BATCH --no-save --no-restore script.R

    rexec = R CMD BATCH --no-save --no-restore
    fig.pdf : script.R
      $(rexec) script.R
  • Exercise: Modify the following Makefile (from the p-value exercise) to use variables that reduce the amount of copying/pasting:

    .PHONY : all
    all : sims
    
    .PHONY : sims
    sims : analysis/panal.html
    
    analysis/panal.html : analysis/panal.Rmd output/pdat.csv
        Rscript -e "library(rmarkdown);render('analysis/panal.Rmd')"
    
    output/pnull.RDS : analysis/null_sims.R
        R CMD BATCH --no-save --no-restore analysis/null_sims.R output/null_sims.Rout
    
    output/pdat.csv : analysis/add_alt_sims.R output/pnull.RDS
        R CMD BATCH --no-save --no-restore analysis/add_alt_sims.R output/add_alt_sims.Rout

Automatic Variables

  • There are a lot of automatic variables that you can use to make your Makefile more concise.

  • Here are the ones I use:

    • $@: The target of the rule.
    • $<: The first prerequisite.
    • $^: All of the prerequisites, with spaces between them.
    • $(@D): The directory part of the target, with the trailing slash removed.
    • $(@F): The file part of the target.
    • $(<D): The directory part of the first prerequisite.
    • $(<F): The file part of the first prerequisite.
    • $(basename names): Extracts all but the suffix of each file name in names.
  • Example: Suppose I have the following rule:

    output/figs/foo.pdf : analysis/scripts/gaa.R data/hii.csv
      R CMD BATCH --no-save --no-restore analysis/scripts/gaa.R

    Then the following are these automatic variable values:

    • $@: output/figs/foo.pdf
    • $<: analysis/scripts/gaa.R
    • $^: analysis/scripts/gaa.R data/hii.csv
    • $(@D): output/figs
    • $(@F): foo.pdf
    • $(<D): analysis/scripts
    • $(<F): gaa.R
    • $(basename $(<F)): gaa
  • For example, if you always list the R script that generates the target as the first prerequisite, followed by the datasets that the R script uses, then the following is typically how I evaluate the R script.

    rexec = R CMD BATCH --no-save --no-restore
    figure.pdf : script.R data1.csv data2.csv
      $(rexec) $< $(basename $(<F)).Rout

    After the variables and automatic variables are expanded, make would run this as

    figure.pdf : script.R data1.csv data2.csv
      R CMD BATCH --no-save --no-restore script.R script.Rout
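  • One other handy use of $(@D) is making sure the target’s directory exists before the recipe writes to it. A sketch, reusing the hypothetical file names from above:

    output/figs/foo.pdf : analysis/scripts/gaa.R
      mkdir -p $(@D)   # creates output/figs if it does not already exist
      R CMD BATCH --no-save --no-restore $<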
  • Exercise: Re-write the following to use an automatic variable instead of the Rmd’s file name in the recipe.

    report.html : report.Rmd
      Rscript -e "library(rmarkdown);render('report.Rmd')"

Competitors

  • There are lots of pipeline-management competitors to make. The ones you are most likely to run across are

    • targets is a make-like R package.
    • snakemake is a python-based tool for workflow management that is perhaps more readable than make and allows for more options than only using shell commands. This is probably the most popular alternative to make.
  • But make has been around since the 1970s, is widely used, and isn’t going anywhere.

  • It is also relatively simple compared to more sophisticated pipeline-management tools, so I think that makes it easier to set up and use, with fewer chances for bugs.