Make sure you have the following packages installed.
pkgvec <- c("usethis", "devtools", "roxygen2", "testthat", "knitr", "covr")
for (pkg in pkgvec) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}
The {usethis} and {devtools} packages automate many of the tedious tasks of package development, allowing you to focus on writing R code. These are the packages we will mostly use.
Share your code/methods with others.
Re-use functions for yourself.
The same R package is in a different format/state at different points of development.
Source package: A directory of files (R scripts, documentation files, test scripts, etc) with a specific structure. This lecture is about developing source packages.
Bundled package: A source package that has been compressed into a single file (along with a few other operations). These usually end in “.tar.gz”. We use the following to create a bundled package from a source package:
devtools::build()
You typically only do this when you are about to submit to CRAN.
Binary package: A ready-to-install version for folks who do not have R development tools. You typically don’t need to worry about this. If you submit to CRAN, then they will create binaries for you.
Installed package: Installing a package decompresses/places your package in the library directory. This makes it so that you can use library() to load a package.
Terminology: A package is a collection of functions, along with documentation, in a specific format. A library is a directory (folder) on your computer that contains installed packages.
Confusingly, you use the library() function to load a package.
You can see/control your active libraries with
.libPaths()
## [1] "/home/dgerard/R/x86_64-pc-linux-gnu-library/4.2"
## [2] "/usr/local/lib/R/site-library"
## [3] "/usr/lib/R/site-library"
## [4] "/usr/lib/R/library"
For example, these are some of the packages in /home/dgerard/R/x86_64-pc-linux-gnu-library/4.2
head(list.files(path = .libPaths()[[1]]))
## [1] "abind" "animation" "annotate" "AnnotationDbi"
## [5] "ape" "ashr"
Ways to install a package:
install.packages(): from CRAN.
BiocManager::install(): from Bioconductor.
devtools::install(): from a local source package.
devtools::install_github(): from a GitHub repository.
In-memory package: makes functions in a package available for use.
Use library() to place an installed package in memory.
Use devtools::load_all() to place a source package in memory. You typically do this during your workflow when you are building your package.
A typical package will have this directory/file structure
.
├── DESCRIPTION
├── .git
├── .gitignore
├── LICENSE
├── LICENSE.md
├── man
│   ├── f1.Rd
│   └── f2.Rd
├── NAMESPACE
├── R
│   └── rcode.R
├── .Rbuildignore
├── README.md
├── README.Rmd
├── src
│   └── cppcode.cpp
└── tests
    ├── testthat
    │   └── test-file.R
    └── testthat.R
Most of these files/folders will be generated by {devtools} and {usethis}, but you should still know what they are.
.git is a hidden directory that git uses to store your version control history. Don’t touch this.
.gitignore is a hidden file used to tell git what files/folders to not place under version control. See the Pro Git Book.
LICENSE and LICENSE.md contain the license that your code is distributed under. Typical open-source licenses are MIT and GPL-3.
The man (for “manual”) folder contains files that hold your package’s documentation. E.g. whenever you use help(), it uses information from a file in the man folder. This package has two documented functions, f1() and f2().
NAMESPACE is a file that determines which functions your package exports for users and which functions it imports from other packages.
The R folder contains R script files (ending in “.R”) that hold all of your R code.
.Rbuildignore is a hidden file which tells R which files/folders to exclude from the package bundle. You use regular expressions to determine which files to ignore.
Use usethis::use_build_ignore() to add files/folders to .Rbuildignore.
README.md is the file that other developers typically first look at, and it is the front page of your package’s GitHub website. README.Rmd is an R markdown file that generates README.md.
src is a folder that contains C++ files (ending in “.cpp”).
tests is a folder that contains R code for unit-tests, which are automatic checks that you write to determine if your R package works as you intend.
You can create a package skeleton with usethis::create_package().
Before running this, change your working directory to where you want to create your R package with “Session > Set Working Directory > Choose Directory…”.
This will be the “source” state of the package, so you can choose it to be almost anywhere on your computer.
Choose a location that is not inside an RStudio project, another R package, another git repo, or inside an R library.
Then just type
usethis::create_package(path = ".")
I don’t like RStudio projects, so I typically run
usethis::create_package(path = ".", rstudio = FALSE, open = FALSE)
You can use RStudio Projects if you want. But I won’t help with any issues you have with RStudio projects.
Example: For this lecture, we will create a simple R package called forloop that reproduces some base R functions using for-loops. Create a folder called “forloop”, set the working directory to this folder, and run
usethis::create_package(path = ".", rstudio = FALSE, open = FALSE)
Here, we will discuss how programming is a little different compared to working in an R script or an R Markdown file in interactive mode.
All R code in a package should be a function definition (with very few exceptions).
fname <- function(arg1 = val1, arg2 = val2, ...) {
  ## code here
  return(retval)
}
Don’t have R code outside of a function definition in your package until you really understand the benefits of exceptions to this rule.
All R code should go in R scripts (ending in “.R”) not R markdowns (ending in “.Rmd”)
Use informative file names. Put only related R functions into the same file (e.g. a main function and some helpers).
As you add or modify function definitions, you should test them interactively. That is, iteratively:
Use devtools::load_all() to load a source package into memory.
In a typical R script (outside of an R package), code is run when you run it. In an R package, code is run when the package is built. So, for example, suppose you include the following lines of code in your package:
x <- Sys.time()
x
## [1] "2022-11-16 13:45:41 EST"
Then x will be the time of the package build. If you want the time that a user runs some code, include this statement in a function.
ftime <- function() {
  return(Sys.time())
}
ftime()
## [1] "2022-11-16 13:45:41 EST"
When you alias a function from another package, don’t do
foo <- pkg::bar
instead, do
foo <- function(...) pkg::bar(...)
This is because foo is defined as pkg::bar at the build time of your package. So if the {pkg} maintainers fix an issue in bar(), your aliased function will still be the incorrect version of bar() until a user rebuilds your package.
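The difference can be seen without a second package. The sketch below is only an illustration: bar(), foo_copy(), and foo_wrap() are hypothetical names, and redefining bar() in the same session stands in for the {pkg} maintainers releasing a fix.

```r
## Hypothetical stand-in for pkg::bar() before the maintainers fix it.
bar <- function() "old behavior"

## Alias by copy: the function object is captured right now.
foo_copy <- bar

## Alias by wrapper: bar() is looked up each time foo_wrap() is called.
foo_wrap <- function(...) bar(...)

## Stand-in for the maintainers releasing a fixed version of bar().
bar <- function() "fixed behavior"

foo_copy() ## still uses the old definition
foo_wrap() ## picks up the new definition
```

The copy keeps whatever bar() was when the assignment ran, which in a package means build time. The wrapper defers the lookup to call time.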
Don’t modify a user’s R landscape (the global settings and the behavior of functions/objects outside of your package). With rare exceptions, here are some things to not do:
Don’t change the working directory with setwd().
Don’t attach packages with library() or require().
Don’t run scripts with source(). Use devtools::load_all() while developing a package (but never have devtools::load_all() in your package).
Don’t change global settings with options() or par().
Don’t use set.seed() to alter the random number generation for a user.
Don’t change the environment or locale with Sys.setenv() or Sys.setlocale().
Example: Let’s work together to build a function called col_means() that will take as input a data frame and return a vector of column means. We will not use the colMeans() function.
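Here is one possible version of col_means(), as a minimal sketch rather than the definitive answer. It assumes every column is numeric and has no missing values; the variable names inside the function are just illustrative choices.

```r
## One possible col_means(): column means of a data frame via for-loops.
## Assumes every column is numeric and contains no missing values.
col_means <- function(df) {
  stopifnot(is.data.frame(df))
  means <- rep(NA_real_, times = ncol(df))
  names(means) <- names(df)
  for (j in seq_len(ncol(df))) {
    total <- 0
    for (i in seq_len(nrow(df))) {
      total <- total + df[[j]][[i]]
    }
    means[[j]] <- total / nrow(df)
  }
  return(means)
}

col_means(data.frame(a = c(1, 2, 3), b = c(10, 20, 30)))
```

Note that everything lives inside the function definition, and nothing outside of it (working directory, options, attached packages) is modified.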
Exercise: Create an R script file in your package called “sum.R” via
usethis::use_r(name = "sum")
In this file, create a function called sum2() that takes as input a numeric vector x and returns its sum. Use a for-loop to calculate the sum (not the sum() function).
Exercise: Include an na.rm argument that defaults to FALSE. It removes NA’s if TRUE and does not if FALSE.
Exercise: Create a function called count_na() that will use a for-loop to count how many NA’s there are in a vector.
Exercise: There are a couple of edge cases you should worry about. If the length of x is 0, then you should return NA_real_. If all values of x are NA, then you should return NA_real_ (use count_na() to check for this). Edit your function to make these changes now. Test it out on
sum2(c(NA, NA, 1), na.rm = TRUE) ## should be 1
sum2(c(), na.rm = TRUE) ## should be NA
sum2(c(NA, NA, NA), na.rm = TRUE) ## should be NA
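If you get stuck, here is one possible solution to compare against. It is a sketch under one assumption that the exercise leaves open: when na.rm = FALSE and x contains an NA, it returns NA_real_ (which mirrors what sum() does).

```r
## Count the missing values in x with a for-loop.
count_na <- function(x) {
  n_missing <- 0L
  for (i in seq_along(x)) {
    if (is.na(x[[i]])) {
      n_missing <- n_missing + 1L
    }
  }
  return(n_missing)
}

## Sum the elements of x with a for-loop.
sum2 <- function(x, na.rm = FALSE) {
  ## Edge cases: empty input, or nothing but NA's.
  if (length(x) == 0 || count_na(x) == length(x)) {
    return(NA_real_)
  }
  total <- 0
  for (i in seq_along(x)) {
    if (is.na(x[[i]])) {
      if (na.rm) {
        next
      }
      ## Assumed behavior: an NA with na.rm = FALSE makes the sum NA.
      return(NA_real_)
    }
    total <- total + x[[i]]
  }
  return(total)
}
```

Both functions go in “R/sum.R”, since count_na() is a helper for sum2().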
Documentation: Describing:
Documentation is vital for
You should be writing documentation while you are writing R code
Documentation in an R package is in “.Rd” files in the “man” folder. This format is rather esoteric, so we’ll use {roxygen2} to generate them automatically.
{roxygen2} documentation is provided by comments above a function, where each line begins with #'.
#'
#' Documentation goes here
#'
fn <- function() {
  ## Function code here.
}
After you write some documentation, you can run
devtools::document()
and {roxygen2} will automatically update your documentation.
You can then look at your documentation by using ? or help().
{roxygen2} comments are formatted as tag-value pairs, where tags begin with an at sign @.
Values of a tag extend from the tag to the next tag.
A typical {roxygen2} documentation looks like this
#' @title One line description of what the function does.
#'
#' @description One paragraph description of what the function does
#'
#' @details
#' Long documentation, detailing exactly what the function does
#'
#' @param arg1 What is arg1?
#' @param arg2 What is arg2?
#'
#' @return What is returned?
#'
#' @author Your name
#'
#' @examples
#' ## Some example code goes here
fn <- function(arg1, arg2) {
  ## Function code here
}
@param: Each argument should be documented. You should state
@examples: Include a few lines of example R code. Do not use @example, as this expects a path to a file that contains the example code.
@return: What does your function return (numeric vector, character matrix, etc.)? Describe not just its type but what it is (posterior probabilities, summation, geometric means, etc.).
Use @inheritParams to reuse the parameter documentation from one function in a different function.
The following will use fn1()’s documentation for a in fn2().
#' @param a An argument
#'
fn1 <- function(a) {

}
#' @inheritParams fn1
#' @param b Another argument
#'
fn2 <- function(a, b) {

}
If you want to document your function, but do not want {roxygen2} to create a man file for it, then add the @noRd tag.
Documenting is very good, but having a manual page means that you need to maintain it for other users, so I usually document all of my functions but only have man pages for my exported functions.
Within {roxygen2} blocks, you can format your documentation with LaTeX-style code:
\emph{italics}: italics.
\strong{bold}: bold.
\code{fn()}: formatted in-line code.
\link{fn}: Link to the documentation of fn(). Do not include parentheses inside \link{}.
I often do \code{\link{fn}()} when I link to a function so that it is linked, code-formatted, and followed by parentheses.
To link to functions in other packages, use \link[pkg]{fn}.
\url{}: Include a URL.
\href{www}{word}: Provide a hyperlink.
\email{}: Provide an email.
\doi{}: Provide the DOI of a paper (with a link to that paper).
You can make an itemized list with
#' \itemize{
#' \item Item 1
#' \item Item 2
#' }
You can make an enumerated list with
#' \enumerate{
#' \item Item 1
#' \item Item 2
#' }
You can make a named list with
#' \describe{
#' \item{Name 1}{Item 1}
#' \item{Name 2}{Item 2}
#' }
Example: Let’s work together to document our col_means() function.
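To see the tags and formatting commands working together, here is a sketch of a fully documented function. The function cumsum2() is a hypothetical example made up for this illustration (it is not part of the lecture's package), and “Your name” is a placeholder.

```r
#' @title Cumulative sum using a for-loop
#'
#' @description Calculates the running total of a numeric vector without
#'     calling \code{\link[base]{cumsum}()}.
#'
#' @details
#' This is a teaching example. Missing values are not handled specially:
#' once an \code{NA} is encountered, all later elements are \code{NA}.
#'
#' @param x A numeric vector.
#'
#' @return A numeric vector the same length as \code{x}, where element
#'     \code{i} is the sum of the first \code{i} elements of \code{x}.
#'
#' @author Your name
#'
#' @examples
#' cumsum2(c(1, 2, 3))
cumsum2 <- function(x) {
  out <- numeric(length(x))
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[[i]]
    out[[i]] <- total
  }
  return(out)
}
```

Since {roxygen2} blocks are ordinary R comments, this file still runs as plain R code. The “.Rd” file only appears in “man” after you run devtools::document().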
Exercise: Document sum2() and count_na(). Make sure to include the following tags: @title, @description, @details, @param, @return, @author, and @examples.
Exercise: In the @seealso tag, provide a link to each function. Also provide a link to base::sum(). Use an itemized list.
A namespace tells R what functions come from what packages.
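Each package has a namespace. You use :: to tell R which package to use a function from. Otherwise, it wouldn’t know how to distinguish between, e.g.
dplyr::lag()
stats::lag()
Your package namespace will determine which of your functions are available to users (exports) and which functions from other packages your package uses (imports).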
Include the following tag in the documentation of any function that you want a user to have access to.
@export
This will add the function to the “NAMESPACE” file (which you should not edit by hand).
Note that devtools::load_all() will attach all functions from your package so you can test them out. But if a user installs your package and uses library(), they will only have access to exported functions.
You should only export functions you want others to use.
Exporting a function means that you have to be very wary about changing it in future versions since that might break other folks’ code.
Exercise: Look at the “NAMESPACE” file in {forloop}. Now export all of your functions in {forloop}. Look at the “NAMESPACE” file again.
Never use library() or require() in an R package.
Package dependencies go in the DESCRIPTION file. You can tell R that your package depends on another package by running:
usethis::use_package()
This will make it so that the package is available when your package is installed.
Then, you call functions from other packages via package::function(), where package is the name of the package and function() is the function name from package.
You can suggest (but not require) a package to be installed. This is usually done if the functions from the suggested package are not vital, or the suggested package is hard to install (or has a lot of dependencies itself). To do so, also use usethis::use_package() with type = "Suggests".
If you suggest a package, whenever you use a function from that package you should first check to see if it is installed with requireNamespace("pkg", quietly = TRUE). E.g.
if (requireNamespace("pkg", quietly = TRUE)) {
  pkg::fun()
}
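The check above says what to do when the suggested package is installed. A common companion pattern is to fail with an informative message when it is not. The sketch below assumes a made-up package name, somepkg, and made-up functions fancy_summary() and somepkg::fun(); none of these are real.

```r
## "somepkg" is a hypothetical suggested package, used only for illustration.
fancy_summary <- function(x) {
  if (!requireNamespace("somepkg", quietly = TRUE)) {
    stop(
      "Package 'somepkg' is needed for fancy_summary(). Please install it.",
      call. = FALSE
    )
  }
  ## Only reached when the suggested package is installed.
  somepkg::fun(x)
}
```

This way a user without the suggested package gets a clear instruction instead of a confusing “there is no package called” error from deep inside your code.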
Importing functions from a package is different than including a package in the “Imports” field of the DESCRIPTION file.
The importing part of a namespace is important to avoid ambiguity.
E.g. many packages use c(). We can (rather foolishly) redefine it via
## Don't run this
c <- function(x) {
  return(NA)
}
and no package will be affected because they all import c() from the {base} R package.
Search Path: The list of packages that R will automatically search for functions. See your search path by search().
search()
## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
Loading a package: Put a package in memory, but do not put it in the search path. You can then use the function by using :: as in dplyr::mutate(). If you call a function via :: then it will load that package automatically.
Attaching a package: Both load a package and place it in the search path, so you don’t need to use :: to call a function from that package. This is what library() does.
If you import a package (via the DESCRIPTION file), then it loads it, it does not attach it, so you need to use ::.
If you import a function (via the namespace), then it is made available to the code inside your package, so you do not need to use :: there.
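You can watch the difference between loading and attaching in an interactive session. This sketch uses {tools}, which ships with R, and assumes a default session in which {tools} has not been attached with library().

```r
## Is {tools} on the search path? (Not in a default session.)
"package:tools" %in% search()

## Calling a function via :: loads the {tools} namespace ...
tools::file_ext("rcode.R")
isNamespaceLoaded("tools")

## ... but does not attach it, so the search path is unchanged.
"package:tools" %in% search()
```

After the :: call, isNamespaceLoaded("tools") is TRUE, yet file_ext() on its own is still not found unless you attach {tools}.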
Generally, I do not recommend importing functions, but you can do it by using @importFrom anywhere in your package.
E.g. adding this line anywhere in your package will make mutate() available inside your package without dplyr::.
#' @importFrom dplyr mutate
Why do I recommend rarely using @importFrom? Because that could make coding more complicated.
E.g. if you import dplyr::lag(), then every call to lag() inside your package will use the version from {dplyr} and not {stats}, which makes it harder to see where a function comes from and could break code that expected stats::lag().
You should try to have as few dependencies as possible. When packages change, that can affect (or break) your package, which means more work on your part to maintain your package.
Try to avoid dependencies on new packages or on packages from folks who do not have a history of maintaining their packages.
Try to avoid dependencies on packages from the tidyverse (dplyr, tidyr, etc). These packages have changed frequently in the past. The maintainers are great about notifying folks about breaking changes, but it still means more work on your part.
E.g. use grepl() instead of stringr::str_detect().
Here is a list of nice essays on limiting dependencies: https://www.tinyverse.org/
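As a rough guide, here are a few base R counterparts to common {stringr} calls. The vector fruit is made up for this illustration, and the pairings are approximate rather than exact drop-in replacements.

```r
fruit <- c("apple", "banana", "cherry")

## Base R counterpart of stringr::str_detect(fruit, "an").
grepl(pattern = "an", x = fruit)

## Base R counterpart of stringr::str_replace(fruit, "a", "A").
## Like str_replace(), sub() only replaces the first match.
sub(pattern = "a", replacement = "A", x = fruit)

## Base R counterpart of stringr::str_to_upper(fruit).
toupper(fruit)
```

Watch the argument order (base R puts the pattern first, {stringr} puts the string first) and missing values (grepl() returns FALSE for NA input, while str_detect() returns NA).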
If you do import functions from other packages, put all of those {roxygen2} tags in the same location in one file.
Example: Together, let’s modify our col_means() function to one called col_stats() that also allows for calculating the standard deviation. However, sd() comes from the {stats} package, and so we need to make sure to tell R which package it is from.
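One way col_stats() could look is sketched below. The stat argument and its two allowed values are assumptions made for this illustration, and the columns are assumed to be numeric with no missing values.

```r
## One possible col_stats(): column means or standard deviations.
## The `stat` argument is a hypothetical design choice for this sketch.
col_stats <- function(df, stat = c("mean", "sd")) {
  stopifnot(is.data.frame(df))
  stat <- match.arg(stat)
  out <- rep(NA_real_, times = ncol(df))
  names(out) <- names(df)
  for (j in seq_len(ncol(df))) {
    if (stat == "mean") {
      out[[j]] <- sum(df[[j]]) / nrow(df)
    } else {
      ## sd() lives in {stats}, so call it with :: inside a package.
      out[[j]] <- stats::sd(df[[j]])
    }
  }
  return(out)
}

col_stats(data.frame(a = c(1, 2, 3), b = c(10, 20, 30)), stat = "sd")
```

Even though {stats} comes with every R installation, you would still list it in DESCRIPTION with usethis::use_package("stats").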
Exercise: Instead of using count_na(), you decide to use the n_na() function from the {na.tools} package. Make these edits to your package now.
Exercise: Now import n_na() and remove your use of :: from the previous exercise.
The file called “DESCRIPTION” (with no filename extension) contains “meta” information about your package (like the authors, the dependencies, the license, etc)
usethis::create_package() gives you a template for DESCRIPTION which you can fill in. Most of the options are self-explanatory.
Package: mypackage
Title: What The Package Does (one line, title case required)
Version: 0.1
Authors@R: person("First", "Last", email = "first.last@example.com",
role = c("aut", "cre"))
Description: What the package does (one paragraph)
Depends: R (>= 3.1.0)
License: What license is it under?
LazyData: true
Each line consists of a field and a value, separated by a colon. For example, above the “Package” field has the value “mypackage”.
“Package”: Choose a package name that is
“Title”: In fewer than 65 characters, describe what your package does. Must be in title case and cannot end in a period.
“Description”: A paragraph describing what the package does. Each line should be less than 80 characters long and each continuation line should be indented by 4 spaces.
“Authors@R”: This field contains executable R code. See the help file for person().
The first two arguments (given and family) are the first and last names of the person.
The email should be the email of the individual.
The role should be a vector containing possible values of "aut" for author, "cre" for creator/maintainer (only one person should be "cre"), and "ctb" for contributor (only providing minor edits). There are other roles that are possible.
The comment argument is a named character vector with additional notes. The most common value of comment is your ORCID number with comment = c(ORCID = "number here").
If you have more than one author, put the person() calls in a vector.
Authors@R: c(person("Jane", "Doe", email = "janedoe@american.edu", role = c("aut", "cre")),
    person("John", "Doe", email = "johndoe@american.edu", role = c("aut")))
“License”: What license should you distribute this package under? Don’t edit this by hand. You should use one of the following (in decreasing order of restrictiveness)
Proprietary license: No one can use your package (CRAN won’t accept this)
usethis::use_proprietary_license()
GPL-3 license: If other folks make derivatives of your package, they also have to release them as open source under a GPL-3 license:
usethis::use_gpl3_license()
MIT license: Folks can use your stuff as long as they distribute the license with your code.
usethis::use_mit_license()
CC0 license: You place your stuff in the public domain, and anyone can use it for any reason.
usethis::use_cc0_license()
“Version”: At least two integers separated by dots (.) or dashes (-) like 1.0.2 or 1.0-11. You should usually only have three integers (at most).
Version numbers should only increase. If your current version is 1.0.1, don’t change it to 0.9.0.
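If you are unsure how R will order two version strings, you can check with package_version() from base R. The version strings below are just the examples from the text plus one extra pair.

```r
## package_version() parses version strings so they compare by component.
package_version("1.0.1") > package_version("0.9.0")

## Dots and dashes are interchangeable separators.
package_version("1.0-11") == package_version("1.0.11")

## Comparison is numeric, not alphabetical: 1.0.10 is newer than 1.0.9.
package_version("1.0.10") > package_version("1.0.9")
```

All three comparisons return TRUE.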
“Imports”: What packages does your package depend on? If you run usethis::use_package(), then {usethis} will edit the DESCRIPTION file for you. But typically the imports field looks something like this
Imports:
    pkg1,
    pkg2
“Suggests”: What packages do you suggest installing to use your package (but are not required)? Again, use usethis::use_package(), but this time with the type = "Suggests" argument. It will look like this in the DESCRIPTION file:
Suggests:
    pkg1,
    pkg2
Exercise: Edit the “Package”, “Title”, “Authors@R” and “Description” sections of your “DESCRIPTION” file
Exercise: Use a GPL-3 license.
Keep your working directory at the top level of your R package at all times.
Iterate the following until done:
devtools::document() (if you’ve made any changes that impact help files or NAMESPACE)
devtools::load_all() if you haven’t made those changes.
devtools::test()
devtools::check()
load_all() is how we can load a source package into memory. This will load all functions (both exported and non-exported).
External data is available to the user. For example, the mpg dataset from the {ggplot2} package is available to us by running
data("mpg", package = "ggplot2")
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
To include data in a package, simply add it, in the format of an RData file, in a directory called “data”.
You can use usethis::use_data() to save a dataset in the “data” directory. The first argument is the data you want to save.
You should document your dataset, using roxygen, in a separate R file in the “R” directory. Typically, folks document all of their data in “R/data.R”.
Instead of a function declaration, you just type the name of the dataset. E.g., from my {updog} R package I have the following documentation for the snpdat tibble.
#' @title GBS data from Shirasawa et al (2017)
#'
#' @description Contains counts of reference alleles and total read counts
#' from the GBS data of Shirasawa et al (2017) for
#' the three SNPs used as examples in Gerard et. al. (2018).
#'
#' @format A \code{tibble} with 419 rows and 4 columns:
#' \describe{
#' \item{id}{The identification label of the individuals.}
#' \item{snp}{The SNP label.}
#' \item{counts}{The number of read-counts that support the reference allele.}
#' \item{size}{The total number of read-counts at a given SNP.}
#' }
#'
#' @source \doi{10.1038/srep44207}
#'
#' @references
#' \itemize{
#' \item{Shirasawa, Kenta, Masaru Tanaka, Yasuhiro Takahata, Daifu Ma, Qinghe Cao, Qingchang Liu, Hong Zhai, Sang-Soo Kwak, Jae Cheol Jeong, Ung-Han Yoon, Hyeong-Un Lee, Hideki Hirakawa, and Sahiko Isobe "A high-density SNP genetic map consisting of a complete set of homologous groups in autohexaploid sweetpotato (Ipomoea batatas)." \emph{Scientific Reports 7} (2017). \doi{10.1038/srep44207}}
#' \item{Gerard, D., Ferrão, L. F. V., Garcia, A. A. F., & Stephens, M. (2018). Genotyping Polyploids from Messy Sequencing Data. \emph{Genetics}, 210(3), 789-807. \doi{10.1534/genetics.118.301468}.}
#' }
#'
"snpdat"
Never export a dataset (via the @export tag).
The @format tag is useful for describing how the data are structured.
The @source tag is useful to describe the URL/papers/collection process for the data.
To use pre-computed data, you place all internal data in the “R/sysdata.rda” file.
usethis::use_data() will do this automatically if you use the internal = TRUE argument.
E.g. the following will put x and y in “R/sysdata.rda”
x <- c(1, 10, 100)
y <- data.frame(hello = c("a", "b", "c"), goodbye = 1:3)
usethis::use_data(x, y, internal = TRUE)
You can use internal data in a package as you normally would use an object that is loaded into memory.
Exercise: Create a function called fib() that takes as input n and returns the nth Fibonacci number. Recall that the sequence is
\[
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, \ldots
\]
where the next number is the sum of the previous two numbers. Put this function in a new R script file (called “fib.R”) and make sure your function is well documented.
Exercise: Save the first 1000 Fibonacci numbers as a vector for internal data. Then create a function called fib1000() that just looks up the nth Fibonacci number from this internal vector.
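For comparison with your own attempt, here is one possible sketch of both functions. It assumes the sequence is indexed from 1, so that fib(1) is 0 and fib(2) is 1; the exercise does not fix this, so your indexing may differ. The name fibvec for the stored vector is also just a choice made here.

```r
## fib(n): the nth Fibonacci number, assuming fib(1) = 0 and fib(2) = 1.
fib <- function(n) {
  stopifnot(length(n) == 1, n >= 1)
  if (n == 1) {
    return(0)
  }
  prev <- 0
  curr <- 1
  for (i in seq_len(n - 2)) {
    nxt <- prev + curr
    prev <- curr
    curr <- nxt
  }
  return(curr)
}

## Vector that could be stored as internal data.
fibvec <- vapply(1:1000, fib, numeric(1))

## fib1000(n): look up the nth Fibonacci number instead of recomputing it.
fib1000 <- function(n) {
  stopifnot(length(n) == 1, n >= 1, n <= 1000)
  return(fibvec[[n]])
}

fib(13)
fib1000(13)
```

In the package itself, you would create fibvec once in the console and store it with usethis::use_data(fibvec, internal = TRUE), rather than computing it in an R script under “R”. The values are doubles, so the larger Fibonacci numbers are only approximate.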
It is pretty standard to have a help page for your package called “<package>-package”. E.g., for our {forloop} package we would call it forloop-package. Then, if a user wanted to see more about the package, they could look at that help file via ?`forloop-package`.
This help file is also where I typically put function imports.
Here is my documentation for my {ldsep} package
#' Linkage Disequilibrium Shrinkage Estimation for Polyploids
#'
#' Estimate haplotypic or composite pairwise linkage disequilibrium
#' (LD) in polyploids, using either genotypes or genotype likelihoods. Support is
#' provided to estimate the popular measures of LD: the LD coefficient D,
#' the standardized LD coefficient D', and the Pearson correlation
#' coefficient r. All estimates are returned with corresponding
#' standard errors. These estimates and standard errors can then be used
#' for shrinkage estimation.
#'
#' @section Functions:
#'
#' The main functions are:
#' \describe{
#' \item{\code{\link{ldfast}()}}{Fast, moment-based, bias-corrected LD
#' LD estimates from marginal posterior distributions.}
#' \item{\code{\link{ldest}()}}{Estimates pairwise LD.}
#' \item{\code{\link{mldest}()}}{Iteratively apply \code{\link{ldest}()}
#' across many pairs of SNPs.}
#' \item{\code{\link{sldest}()}}{Iteratively apply \code{\link{ldest}()}
#' along a sliding window of fixed length.}
#' \item{\code{\link{plot.lddf}()}}{Plot method for the output of
#' \code{\link{mldest}()} and \code{\link{sldest}()}.}
#' \item{\code{\link{format_lddf}()}}{Format the output of
#' \code{\link{mldest}()} and \code{\link{sldest}()} into a matrix.}
#' \item{\code{\link{ldshrink}()}}{Shrink correlation estimates
#' using adaptive shrinkage (Stephens, 2017; Dey and Stephens, 2018).}
#' }
#'
#' @section Citation:
#' If you find the methods in this package useful, please run the following
#' in R for citation information: \code{citation("ldsep")}
#'
#'
#' @importFrom stats var
#' @importFrom foreach %dopar%
#' @useDynLib ldsep, .registration = TRUE
#' @importFrom Rcpp sourceCpp
#'
#' @docType package
#' @name ldsep-package
#' @aliases ldsep
#'
#' @author David Gerard
NULL
## NULL
I use the @section tag to make custom sections for (i) the important functions and (ii) how to cite the package. This is not required, but it is a good standard.
@section Name: Make sure you end the name of the section with a colon.
@docType package: Should be included to show that this is not a function.
@name ldsep-package: Makes it so that if a user types ?`ldsep-package`, then they will reach this help file.
@aliases ldsep: This makes it so that if a user types ?ldsep, then they will reach this help file as well.
Just include NULL below the documentation so that {roxygen2} knows to make a help file for it.