Learning Objectives

Prereqs

Motivation

  1. Share your code/methods with others.

  2. Re-use functions for yourself.

Package States

Package Structure

Create a package skeleton

  • You can create a package skeleton with usethis::create_package().

  • Before running this, change your working directory to where you want to create your R package with “Session > Set Working Directory > Choose Directory…”.

  • This will be the “source” state of the package, so you can choose it to be almost anywhere on your computer.

  • Choose a location that is not inside an RStudio project, another R package, another git repo, or inside an R library.

  • Then just type

    usethis::create_package(path = ".")
  • I don’t like RStudio projects, so I typically run

    usethis::create_package(path = ".", rstudio = FALSE, open = FALSE)

    You can use RStudio Projects if you want. But I won’t help with any issues you have with RStudio projects.

  • Example: For this lecture, we will create a simple R package called forloop that reproduces some Base R functions using for-loops. Create a folder called “forloop”, set the working directory to this folder, and run

    usethis::create_package(path = ".", rstudio = FALSE, open = FALSE)

The R folder

Documentation

Formatting

  • Within {roxygen2} blocks, you can format your documentation with LaTeX-style code:

  • \emph{italics}: italics.

  • \strong{bold}: bold.

  • \code{fn()}: formatted in-line code

  • \link{fn}: Link to the documentation of fn(). Do not include parentheses inside \link{}.

  • I often use \code{\link{fn}()} when I link to a function so that it is linked, code-formatted, and followed by parentheses.

  • To link to functions in other packages, use \link[pkg]{fn}

  • \url{}: Include a URL.

  • \href{www}{word}: Provide a hyperlink.

  • \email{}: Provide an email.

  • \doi{}: Provide the DOI of a paper (with a link to that paper).

  • You can make an itemized list with

    #' \itemize{
    #'   \item Item 1
    #'   \item Item 2
    #' }
  • You can make an enumerated list with

    #' \enumerate{
    #'   \item Item 1
    #'   \item Item 2
    #' }
  • You can make a named list with

    #' \describe{
    #'   \item{Name 1}{Item 1}
    #'   \item{Name 2}{Item 2}
    #' }
  • Example: Let’s work together to document our col_means() function.

  • Exercise: Document sum2() and count_na(). Make sure to include the following tags @title, @description, @details, @param, @return, @author, and @examples.

  • Exercise: In the @seealso tag, provide a link to each function. Also provide a link to base::sum(). Use an itemized list.
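
As a rough illustration of how these formatting commands fit together, here is a minimal sketch of a documented col_means(). The function body is an assumption (the notes do not show the implementation); it just computes column means of a numeric matrix with explicit for-loops, in the spirit of {forloop}.

```r
#' @title Column means via for-loops
#'
#' @description Computes the mean of each column of a numeric matrix.
#'
#' @details A teaching version of \code{\link[base]{colMeans}()} that uses
#'   explicit \code{for} loops. It is \emph{not} optimized for speed and
#'   does \strong{not} handle missing values.
#'
#' @param x A numeric matrix.
#'
#' @return A numeric vector of length \code{ncol(x)}.
#'
#' @seealso
#' \itemize{
#'   \item \code{\link[base]{colMeans}()} for the Base R version.
#' }
#'
#' @examples
#' col_means(matrix(1:6, nrow = 2))
col_means <- function(x) {
  means <- numeric(ncol(x))
  for (j in seq_len(ncol(x))) {
    # Accumulate the column total one element at a time
    total <- 0
    for (i in seq_len(nrow(x))) {
      total <- total + x[i, j]
    }
    means[j] <- total / nrow(x)
  }
  means
}
```

Running devtools::document() on a package containing this block should turn the comments into a help page for col_means().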

Namespace

Exporting

  • Include the following tag in the documentation of any function that you want a user to have access to.

    @export
  • This will add the function to the “NAMESPACE” file (which you should not edit by hand).

  • Note that devtools::load_all() makes all functions from your package available, exported or not, so you can test them out. But if a user installs your package and uses library(), they will only have access to the exported functions.

  • You should only export functions you want others to use.

  • Exporting a function means that you have to be very wary about changing it in future versions since that might break other folks’ code.

  • Exercise: Look at the “NAMESPACE” file in {forloop}. Now export all of your functions in {forloop}. Look at the “NAMESPACE” file again.
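
To get a feel for the difference between exported and internal functions, it can help to inspect a package that is already installed. The sketch below uses {stats} purely as a convenient example; any installed package would do.

```r
# Names that {stats} exports, i.e. what a user sees after library(stats)
exported <- getNamespaceExports("stats")

# Every object defined inside the {stats} namespace, exported or not
all_objects <- ls(asNamespace("stats"), all.names = TRUE)

# Objects that exist in the namespace but were never exported
internal <- setdiff(all_objects, exported)

"sd" %in% exported     # sd() is exported, so users can call it directly
length(internal) > 0   # {stats} also keeps some helpers internal
head(internal)
```

Internal functions can still be reached with :::, but users are not expected to rely on them, which is why the package author is free to change them.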

Importing

  • Never use library() or require() in an R package.

  • Package dependencies go in the DESCRIPTION file. You can tell R that your package depends on another package by running:

    usethis::use_package()
  • This will make it so that the package is available when your package is installed.

  • Then, you call functions from other packages via package::function(), where package is the name of the package and function() is the function name from package.

  • You can suggest (but not require) a package to be installed. This is usually done if the functions from the suggested package are not vital, or the suggested package is hard to install (or has a lot of dependencies itself). To do so, also use usethis::use_package() with type = "Suggests".

  • If you suggest a package, whenever you use a function from that package you should first check to see if it is installed with requireNamespace("pkg", quietly = TRUE). E.g.

    if (requireNamespace("pkg", quietly = TRUE)) {
      pkg::fun()
    }
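
One common pattern for a suggested package is to fall back on Base R when it is not installed. Here is a minimal sketch, assuming {na.tools} is listed under "Suggests"; count_missing() is a hypothetical name used only for illustration.

```r
# Hypothetical helper: counts missing values in a vector.
# Uses na.tools::n_na() when {na.tools} is installed, otherwise Base R.
count_missing <- function(x) {
  if (requireNamespace("na.tools", quietly = TRUE)) {
    na.tools::n_na(x)
  } else {
    sum(is.na(x))
  }
}

count_missing(c(1, NA, 3, NA))
```

If there is no sensible fallback, the else branch could instead call stop() with a message telling the user which package to install.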

Technical notes on importing

  • Importing functions from a package is different than including a package in the “Imports” field of the DESCRIPTION file.

    • Importing a function makes it available inside your package so that you do not need to use ::.
    • Including a package in the Imports field makes sure the package is installed.
  • The importing part of a namespace is important to avoid ambiguity.

  • E.g. many packages use c(). We can (rather foolishly) redefine it via

    ## Don't run this
    c <- function(x) {
      return(NA)
    }

    and no package will be affected because they all import c() from the {base} R package.

  • Search Path: The list of packages that R will automatically search for functions. See your search path by running search().

    search()
    ## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
    ## [4] "package:grDevices" "package:utils"     "package:datasets" 
    ## [7] "package:methods"   "Autoloads"         "package:base"
  • Loading a package: Put a package in memory, but do not put it in the search path. You can then use the function by using :: as in dplyr::mutate(). If you call a function via :: then it will load that package automatically.

  • Attaching a package: Both load a package and place it in the search path, so you don’t need to use :: to call a function from that package. This is what library() does.

  • If you import a package (via the DESCRIPTION file), then R loads it but does not attach it, so you still need to use ::.

  • If you import a function (via the namespace), then it becomes available inside your package, so your package code does not need to use ::.

  • Generally, I do not recommend importing functions, but you can do it by using @importFrom anywhere in your package.

  • E.g. adding this line anywhere in your package will make mutate() from {dplyr} available to all of your package's code without ::.

    #' @importFrom dplyr mutate
  • Why do I recommend rarely using @importFrom? Because it can make your code harder to reason about.

    • E.g. if you import dplyr::lag(), then every call to lag() inside your package resolves to the {dplyr} version and not the {stats} version, which is easy to forget and could break your code.
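
The masking, loading, and attaching ideas above can be checked in a fresh R session. This is only a sketch: it uses {tools} as the example package because it ships with R but is normally not attached, and it relies on base::append() calling c() internally, which is how it is written in current versions of R. The masked c() here takes ... so that it can be called with two arguments.

```r
# Mask c() in the global environment (don't do this in real code)
c <- function(...) NA
masked <- c(1, 2)              # our version is found first, so this is NA
via_base <- append(1:2, 3L)    # append() lives in {base} and still finds base::c()
rm(c)                          # undo the masking

# Loading without attaching: using :: loads the namespace only
ext <- tools::file_ext("report.txt")
loaded <- "tools" %in% loadedNamespaces()
attached_before <- "package:tools" %in% search()   # usually FALSE at this point

# Attaching: library() also places the package on the search path
library(tools)
attached_after <- "package:tools" %in% search()
```

After the first half, masked is NA while via_base is unaffected, which is the ambiguity that namespaces protect against.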

Practical suggestions

  • You should try to have as few dependencies as possible. When packages change, that can affect (or break) your package, which means more work on your part to maintain your package.

  • Try to avoid dependencies on new packages or on packages from folks who do not have a history of maintaining their packages.

  • Try to avoid dependencies on packages from the tidyverse (dplyr, tidyr, etc). These packages have changed frequently in the past. The maintainers are great about notifying folks about breaking changes, but it still means more work on your part.

    • E.g., if you only use string manipulation in one spot in your package, try using grepl() instead of stringr::str_detect().
  • Here is a list of nice essays on limiting dependencies: https://www.tinyverse.org/

  • If you do import functions from other packages, put all of those {roxygen2} tags in the same location in one file.

  • Example: Together, let’s modify our col_means() function to one called col_stats() that also allows for calculating the standard deviation. However, sd() comes from the {stats} package, and so we need to make sure to tell R which package it is from.

  • Exercise: Instead of using count_na(), you decide to use the n_na() function from the {na.tools} package. Make these edits to your package now.

  • Exercise: Now import n_na() and remove your use of :: from the previous exercise.
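
For the col_stats() example, one possible version is sketched below. The argument name stat and its allowed values are assumptions made for illustration; the point is only that sd() is called as stats::sd() so that R knows which package it comes from.

```r
# Sketch of col_stats(): column-wise mean or standard deviation.
# x is assumed to be a numeric matrix.
col_stats <- function(x, stat = c("mean", "sd")) {
  stat <- match.arg(stat)
  out <- numeric(ncol(x))
  for (j in seq_len(ncol(x))) {
    if (stat == "mean") {
      out[j] <- mean(x[, j])
    } else {
      # sd() is not in {base}, so refer to it explicitly
      out[j] <- stats::sd(x[, j])
    }
  }
  out
}

x <- matrix(c(1, 2, 3, 4, 6, 8), nrow = 3)
col_stats(x, "mean")
col_stats(x, "sd")
```

Because {stats} is used with ::, it would also need to be listed in the DESCRIPTION file, e.g. via usethis::use_package("stats").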

DESCRIPTION file

Workflow

Including Datasets

External data

  • External data is available to the user. For example, the mpg dataset from the {ggplot2} package is available to us by running

    data("mpg", package = "ggplot2")
    str(mpg)
    ## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
    ##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
    ##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
    ##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
    ##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
    ##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
    ##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
    ##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
    ##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
    ##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
    ##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
    ##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
  • To include data in a package, save it as an RData (".rda") file in a directory called “data” at the top level of the package.

  • You can use usethis::use_data() to save a dataset in the “data” directory. The first argument is the data you want to save.

  • You should document your dataset, using roxygen, in a separate R file in the “R” directory. Typically, folks document all of their data in “R/data.R”.

  • Instead of a function declaration, you just type the name of the dataset. E.g., from my {updog} R package I have the following documentation for the snpdat tibble.

    #' @title GBS data from Shirasawa et al (2017)
    #'
    #' @description Contains counts of reference alleles and total read counts 
    #' from the GBS data of Shirasawa et al (2017) for
    #' the three SNPs used as examples in Gerard et. al. (2018).
    #'
    #' @format A \code{tibble} with 419 rows and 4 columns:
    #' \describe{
    #'     \item{id}{The identification label of the individuals.}
    #'     \item{snp}{The SNP label.}
    #'     \item{counts}{The number of read-counts that support the reference allele.}
    #'     \item{size}{The total number of read-counts at a given SNP.}
    #' }
    #'
    #' @source \doi{10.1038/srep44207}
    #'
    #' @references
    #' \itemize{
    #'   \item{Shirasawa, Kenta, Masaru Tanaka, Yasuhiro Takahata, Daifu Ma, Qinghe Cao, Qingchang Liu, Hong Zhai, Sang-Soo Kwak, Jae Cheol Jeong, Ung-Han Yoon, Hyeong-Un Lee, Hideki Hirakawa, and Sahiko Isobe "A high-density SNP genetic map consisting of a complete set of homologous groups in autohexaploid sweetpotato (Ipomoea batatas)." \emph{Scientific Reports 7} (2017). \doi{10.1038/srep44207}}
    #'   \item{Gerard, D., Ferrão, L. F. V., Garcia, A. A. F., & Stephens, M. (2018). Genotyping Polyploids from Messy Sequencing Data. \emph{Genetics}, 210(3), 789-807. \doi{10.1534/genetics.118.301468}.}
    #' }
    #'
    "snpdat"
  • Never export a dataset (via the @export tag).

  • The @format tag is useful for describing how the data are structured.

  • The @source tag is useful to describe the URL/papers/collection process for the data.
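
To make the “data” directory less mysterious, the sketch below does by hand roughly what usethis::use_data() does for you, using only Base R. The temporary folder and the toy data frame are stand-ins invented for illustration; in a real package you would just call usethis::use_data(toy) from the package root.

```r
# A throwaway folder standing in for the package root
pkg_dir <- file.path(tempdir(), "forloop")
dir.create(file.path(pkg_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# A small made-up dataset
toy <- data.frame(id = c("a", "b", "c"), counts = c(3L, 5L, 8L))

# Save it as data/toy.rda, one object per file, named after the object
rda_path <- file.path(pkg_dir, "data", "toy.rda")
save(toy, file = rda_path)

# Check that it loads back unchanged
env <- new.env()
load(rda_path, envir = env)
round_trip_ok <- identical(env$toy, toy)
round_trip_ok
```

The documentation for toy would then go in “R/data.R”, ending with the string "toy" in place of a function definition.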

Internal Data

  • To use pre-computed data, you place all internal data in the “R/sysdata.rda” file.

  • usethis::use_data() will do this automatically if you use the internal = TRUE argument.

  • E.g. the following will put x and y in “R/sysdata.rda”

    x <- c(1, 10, 100)
    y <- data.frame(hello = c("a", "b", "c"), goodbye = 1:3)
    usethis::use_data(x, y, internal = TRUE)
  • You can use internal data in a package as you normally would use an object that is loaded into memory.

  • Exercise: Create a function called fib() that takes as input n and returns the nth Fibonacci number. Recall that the sequence is \[ 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, \ldots \] where the next number is the sum of the previous two numbers. Put this function in a new R script file (called “fib.R”) and make sure your function is well documented.

  • Exercise: Save the first 1000 Fibonacci numbers as a vector for internal data. Then create a function called fib1000() that just looks up the nth Fibonacci number from this internal vector.
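
Here is one way the two Fibonacci exercises might look. It assumes the sequence is indexed from 1, so that fib(1) is 0 and fib(13) is 144; the name fib_table is made up for illustration. Large Fibonacci numbers are stored as doubles, so they are only approximate beyond about 2^53.

```r
# nth Fibonacci number, with fib(1) = 0 and fib(2) = 1 (an assumed convention)
fib <- function(n) {
  stopifnot(is.numeric(n), length(n) == 1, n >= 1)
  if (n == 1) {
    return(0)
  }
  prev <- 0
  curr <- 1
  for (i in seq_len(n - 2)) {
    nxt <- prev + curr
    prev <- curr
    curr <- nxt
  }
  curr
}

# Pre-compute the first 1000 values once
fib_table <- numeric(1000)
fib_table[1:2] <- c(0, 1)
for (i in 3:1000) {
  fib_table[i] <- fib_table[i - 1] + fib_table[i - 2]
}

# In the package, this vector would be stored with
#   usethis::use_data(fib_table, internal = TRUE)
# and fib1000() would then find it automatically.
fib1000 <- function(n) {
  stopifnot(n >= 1, n <= 1000)
  fib_table[[n]]
}

fib(13)
fib1000(13)
```

The lookup version trades a small amount of stored data for not having to run the loop on every call.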

Documenting a Package