Learning Objectives

Installation Mac/Linux

Installation Windows

Use Rtools (light)

  • Download and install Rtools: https://cran.r-project.org/bin/windows/Rtools/

  • Restart R.

  • In RStudio, change the setting “Global Options… > Terminal > New terminals open with” so that it reads “Windows PowerShell”. The terminal will then be Windows PowerShell rather than Git Bash, so it will look and behave a little differently.

  • Open a new terminal. Again, this is a Windows PowerShell.

  • Always run make in the Windows PowerShell terminal inside RStudio. This setup won’t work if you open PowerShell directly, outside of RStudio.

  • You can always convert the terminal back to git bash later.

  • To verify make is installed, run make --version in the Windows PowerShell.

  • I am not a big fan of RStudio Projects, but if you like them, RStudio offers some built-in make functionality inside a project (https://stat545.com/make-test-drive.html). This lets you avoid using Windows PowerShell.

Windows Subsystem for Linux (heavy)

  • Only start this if you are willing to spend a few hours fiddling with your computer.

  • You can install Ubuntu on Windows to use all of its powerful features. This is what you would need to do for more advanced computational operations. This is how I use Windows.

  • But on Ubuntu, you’ll need to install a separate version of R, all of the R packages that you usually use, and git.

  • NOTE: This strategy could take up to a gigabyte of storage.

  • Install Ubuntu using these directions: https://ubuntu.com/wsl

  • Open up Ubuntu (this will be a shell).

  • Follow the instructions on CRAN (https://cran.r-project.org/) to install R for Ubuntu inside the Ubuntu shell.

  • In Ubuntu open up the command line for R

    R
  • In the R command prompt, install all of the R packages you will need with install.packages(). You can exit R afterwards with q().

  • In RStudio, change the setting “Global Options… > Terminal > New terminals open with” so that it reads “Bash (Windows Subsystem for Linux)”.

  • Open up a new terminal, and you should now be using Ubuntu for your terminal.

  • To verify make is installed, run make --version in the Ubuntu terminal.

Motivation

Make

Rules

  • Inside the Makefile, you prepare a series of rules of the form

    target: prereq_1 prereq_2 prereq_3 ...
      first bash command to make target
      second bash command and so on
  • Each rule contains three things: a target, prerequisites, and commands.

  • target is the name of the file that will be generated.

  • prereq_1, prereq_2, prereq_3, etc are the names of the files which are used to generate target. These can be datasets, R scripts, etc.

  • Each subsequent line is a bash command that will be evaluated in the terminal in the order listed. This sequence of commands is sometimes called a recipe.

    • IMPORTANT: Make sure each bash command has one tab (not spaces) at the start of the line. If you copy and paste a Makefile from a web site then usually tabs are converted to spaces and produce an error!
    • From the terminal, it is possible to evaluate R scripts, python scripts, and knit R Markdown files.
    • You can also use the usual bash commands you are used to (touch, cp, mv, rm, etc…)
    • There are tons of other commands that you can install that allow you to do things like download files (curl and wget), unzip files (7z and tar), convert image files, compile LaTeX documents, etc…
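Because a missing tab is such a common Makefile error, it can help to write recipe lines with an explicit tab and to scan a pasted Makefile for space-indented lines. The following is a minimal sketch, assuming a POSIX shell; the file names and the rule are made up for illustration, and the grep pattern is only a heuristic (any line that starts with a space is flagged).

```shell
#!/bin/sh
# Sketch: build a tiny Makefile with a real tab, then scan for space-indented lines.
# All file names here are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# printf turns \t into a literal tab, so the recipe line is indented correctly.
printf 'hello.txt :\n\techo hello > hello.txt\n' > Makefile

# A copy whose recipe line starts with spaces, as often happens after pasting.
printf 'hello.txt :\n    echo hello > hello.txt\n' > Makefile.pasted

# Count lines that begin with a space; a correct recipe line begins with a tab.
good_count=$(grep -c '^ ' Makefile)
bad_count=$(grep -c '^ ' Makefile.pasted)
echo "space-indented lines in Makefile: $good_count"
echo "space-indented lines in Makefile.pasted: $bad_count"
```

If the second count is not zero, replace the leading spaces on those lines with a single tab before running make.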

Useful bash commands for data science

  • Run an R script

    R CMD BATCH --no-save --no-restore input_file.R output_file.Rout

    Make sure to change “input_file.R” and “output_file.Rout”

    • The --no-save --no-restore options make sure that you are working with a clean environment and that you don’t save this environment after the command is executed. This is a good thing for reproducibility.
  • Knit an R Markdown file

    Rscript -e "library(rmarkdown);render('rmarkdown_file.Rmd')"

    Make sure to change “rmarkdown_file.Rmd”

  • Run python script

    python3 input_file.py

    Make sure to change “input_file.py”

  • Download data from the web

    wget --no-clobber url_to_data
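Before relying on these commands inside a recipe, it may be worth confirming that the programs are on your PATH. This is a small sketch, assuming a POSIX shell; the list of tools is just the ones named above, and the variable names are illustrative.

```shell
#!/bin/sh
# Report which of the command-line tools used above are available on the PATH.
report=""
for tool in R Rscript python3 wget; do
  if command -v "$tool" > /dev/null 2>&1; then
    status="found"
  else
    status="missing"
  fi
  # Append one "tool: status" line per tool.
  report="${report}${tool}: ${status}
"
done
printf '%s' "$report"
```

Any tool reported as missing needs to be installed before a rule that calls it can succeed.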

Phony Targets

  • If you have multiple, related, final outputs, it is common to place these as prerequisites to “phony” targets:

    .PHONY : phony_target
    phony_target : target1 target2 target3 ...

    where “phony_target” is a name you provide to represent the operation being performed.

  • “Phony” targets are not real files. They are just convenient names to use to describe a collection of targets that should be generated.

  • At the top of the makefile, you then list the phony targets after all:

    .PHONY : all
    all : phony_target1 phony_target2 phony_target3 ...
  • NOTE: It is important to have all : phony_target1 phony_target2 as the very first rule because by default make will only evaluate the very first rule in the file. So if all is first, then make will evaluate all targets.
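To see the effect of placing all first, here is a minimal sketch that can be run in a scratch directory. It assumes make is installed; the phony target names (figures, reports) and the file names are hypothetical, and touch stands in for the real R or Python commands.

```shell
#!/bin/sh
# Sketch: "make figures" builds one group, plain "make" builds everything under all.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
tab=$(printf '\t')   # a literal tab for the recipe lines

cat > Makefile <<EOF
.PHONY : all
all : figures reports

.PHONY : figures
figures : fig1.png fig2.png

.PHONY : reports
reports : report.html

fig1.png :
${tab}touch fig1.png
fig2.png :
${tab}touch fig2.png
report.html :
${tab}touch report.html
EOF

make figures   # only the prerequisites of figures are built
[ -e report.html ] && early=yes || early=no
echo "report.html built by 'make figures': $early"

make           # no target given, so make uses the first rule, all
[ -e report.html ] && final=yes || final=no
echo "report.html built by 'make': $final"
```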

Comments

  • Use a hash sign # for comments in a makefile.

Pseudo-code for Makefile

.PHONY : all
all : phony_target1 phony_target2

.PHONY : phony_target1
phony_target1 : target1 target2

.PHONY : phony_target2
phony_target2 : target3

# Comment 1
target1 : prereq1.R data0.csv data1.RDS
  R CMD BATCH --no-save --no-restore prereq1.R prereq1_out.Rout

# Comment 2
target2 : prereq2.py data2.csv
  python prereq2.py

# Comment 3
target3 : prereq3.Rmd
  Rscript -e "library(rmarkdown);render('prereq3.Rmd')"
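
One way to check the structure of the pseudo-code above without having R or Python installed is a dry run. The sketch below assumes make is installed; it recreates the pseudo-code Makefile in a temporary directory, uses empty stand-in files for the prerequisites, and then calls make -n, which prints the commands that would run without executing them.

```shell
#!/bin/sh
# Sketch: dry-run the pseudo-code Makefile to see which commands make would issue.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
tab=$(printf '\t')   # a literal tab for the recipe lines

# Empty stand-ins so that make can find every prerequisite.
touch prereq1.R data0.csv data1.RDS prereq2.py data2.csv prereq3.Rmd

cat > Makefile <<EOF
.PHONY : all
all : phony_target1 phony_target2

.PHONY : phony_target1
phony_target1 : target1 target2

.PHONY : phony_target2
phony_target2 : target3

target1 : prereq1.R data0.csv data1.RDS
${tab}R CMD BATCH --no-save --no-restore prereq1.R prereq1_out.Rout

target2 : prereq2.py data2.csv
${tab}python prereq2.py

target3 : prereq3.Rmd
${tab}Rscript -e "library(rmarkdown);render('prereq3.Rmd')"
EOF

# -n prints each command instead of running it.
plan=$(make -n)
printf '%s\n' "$plan"
```

Because none of the targets exist yet, the dry run lists the recipe of every rule reachable from all.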

Evaluate a makefile

  • You can generate all target files by running the following in the terminal

    make
  • You can run just the targets in a phony target by specifying the phony target

    make phony_target
  • For each target, make will check whether any of its prerequisites have changed (that is, are newer than the target) and, if so, will re-run the bash commands of that rule.

  • Make will not re-run commands if the prerequisites have not changed. That is, if no upstream files to target were modified, then target will not be re-generated. This makes make very efficient.
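
This skipping behavior can be observed directly. The following is a minimal sketch, assuming make is installed; the file names are hypothetical, and the recipe appends a line to runs.log each time it runs so the runs can be counted. Make decides by comparing file modification times, which is why the script waits a second before editing the prerequisite.

```shell
#!/bin/sh
# Sketch: count how often a recipe runs across three calls to make.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
tab=$(printf '\t')   # a literal tab for the recipe lines

echo "x,y" > data.csv

cat > Makefile <<EOF
summary.txt : data.csv
${tab}cp data.csv summary.txt
${tab}echo ran >> runs.log
EOF

make            # summary.txt is missing, so the recipe runs
make            # nothing changed, so the recipe is skipped
sleep 1         # make sure the next edit gets a later timestamp
echo "1,2" >> data.csv
make            # data.csv is now newer than summary.txt, so the recipe runs again

runs=$(wc -l < runs.log | tr -d ' ')
echo "recipe ran $runs times"
```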

Working directory considerations

  • tl;dr

    • For R and the terminal: Assume the working directory is the location of the Makefile.
    • For R Markdown: Assume the working directory is the location of the Rmd file.
  • So if your file structure is

    Makefile
    analysis/script.R
    analysis/report.Rmd
    data/data.csv

    Then you need to specify your targets according to this structure

    data/data.csv : analysis/script.R
      R CMD BATCH --no-save --no-restore analysis/script.R

    NOTE that the following will not work because each command line is executed in its own subshell, which starts in the directory of the Makefile, so the cd does not carry over to the next line:

    # DOES NOT WORK
    data/data.csv : analysis/script.R
      cd analysis 
      R CMD BATCH --no-save --no-restore script.R

    But you can get around this by putting these commands on one line, with each command separated by a semicolon:

    # Works, but not recommended
    data/data.csv : analysis/script.R
      cd analysis; R CMD BATCH --no-save --no-restore script.R
  • Any file manipulation in “script.R” needs to be done assuming the working directory is where Makefile is (notice the single dot):

    library(readr)
    dat <- read_csv("./data/data.csv")
  • However, confusingly, when you render an R Markdown file using knitr, you need to assume the working directory is the location of the R Markdown file, not the location of Makefile. So in “report.Rmd” you would write (notice the double dots)

    library(readr)
    dat <- read_csv("../data/data.csv")
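
The subshell behavior described above can be checked without R. This is a minimal sketch, assuming make is installed; the rule names separate and joined are made up, and pwd stands in for the R command so the working directory is visible.

```shell
#!/bin/sh
# Sketch: show that cd on its own recipe line does not affect the next line.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir analysis
tab=$(printf '\t')   # a literal tab for the recipe lines

cat > Makefile <<EOF
.PHONY : separate joined

separate :
${tab}cd analysis
${tab}pwd

joined :
${tab}cd analysis; pwd
EOF

# -s keeps make from echoing the commands, so only pwd's output is captured.
separate_dir=$(make -s separate)
joined_dir=$(make -s joined)
echo "cd and pwd on separate lines: $separate_dir"
echo "cd and pwd on one line:       $joined_dir"
```

The first path should be the directory that holds the Makefile, and the second should end in analysis.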

A worked example

Your turn

  1. Make sure you have the necessary R packages installed:

    library(tidyverse)
    library(tidymodels)
    library(GGally)
    library(palmerpenguins)
    library(randomForest)
  2. Modify the Makefile to automatically manage this pipeline.

  3. Run make in the terminal to generate all of the output (penguin_class.csv, penguin_pairs.png, and penguin_report.html)

  4. Change the color scheme in the pairs plot and re-run make.

  5. Correct the date field in the YAML header in “penguin_report.Rmd” and re-run make.

Variables

Automatic Variables

  • There are a lot of automatic variables that you can use to make your Makefile more concise.

  • Here are the ones I use:

    • $@: The target of the rule.
    • $<: The first prerequisite.
    • $^: All of the prerequisites, with spaces between them.
    • $(@D): The directory part of the target, with the trailing slash removed.
    • $(@F): The file part of the target
    • $(<D): The directory part of the first prerequisite.
    • $(<F): The file part of the first prerequisite.
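    • $(basename names): Extracts all but the suffix of each file name in names. (Strictly speaking this is a make function rather than an automatic variable, but it combines well with them.)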
  • Example: Suppose I have the following rule:

    output/figs/foo.pdf : analysis/scripts/gaa.R data/hii.csv
      R CMD BATCH --no-save --no-restore analysis/scripts/gaa.R

    Then the automatic variables take these values:

    • $@: output/figs/foo.pdf
    • $<: analysis/scripts/gaa.R
    • $^: analysis/scripts/gaa.R data/hii.csv
    • $(@D): output/figs
    • $(@F): foo.pdf
    • $(<D): analysis/scripts
    • $(<F): gaa.R
    • $(basename $(<F)): gaa
  • For example, if you always list the R script that generates the target first, and the datasets that the R script uses after it, then the following is typically how I evaluate the R script.

    rexec = R CMD BATCH --no-save --no-restore
    figure.pdf : script.R data1.csv data2.csv
      $(rexec) $< $(basename $(<F)).Rout

    After the variable and the automatic variables are expanded, make runs this as

    figure.pdf : script.R data1.csv data2.csv
      R CMD BATCH --no-save --no-restore script.R script.Rout
  • Exercise: Re-write the following to use an automatic variable instead of the Rmd’s file name in the recipe.

    report.html : report.Rmd
      Rscript -e "library(rmarkdown);render('report.Rmd')"
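
The values listed for the output/figs/foo.pdf rule above can be checked by letting make print them. The sketch below assumes make is installed; it creates empty stand-in files for the two prerequisites and replaces the real recipe with echo commands, so no PDF is actually produced. The labels in the echo lines are illustrative.

```shell
#!/bin/sh
# Sketch: print the automatic variables for the foo.pdf rule from the example.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p analysis/scripts data
touch analysis/scripts/gaa.R data/hii.csv

# Single quotes keep the shell from expanding $@, $<, and so on before make sees them.
# printf repeats the format once per argument, giving one tab-indented echo per line.
# The @ in front of echo stops make from printing the command itself.
printf 'output/figs/foo.pdf : analysis/scripts/gaa.R data/hii.csv\n' > Makefile
printf '\t@echo "%s"\n' \
  'target: $@' \
  'first prerequisite: $<' \
  'all prerequisites: $^' \
  'target directory: $(@D)' \
  'target file: $(@F)' \
  'first prerequisite directory: $(<D)' \
  'first prerequisite file: $(<F)' \
  'first prerequisite file without suffix: $(basename $(<F))' >> Makefile

values=$(make)
printf '%s\n' "$values"
```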

Competitors