<- c(TRUE, TRUE, FALSE, TRUE) ## logical
x <- c(1L, 1L, 0L, 1L) ## integer
x <- c(1, 1, 0, 1) ## double
x <- c("1", "1", "0", "1") ## character x
Vectors
Learning Objectives
- Understanding vectors, the fundamental objects in R.
- This is mostly a long list of facts about vectors that you should be aware of.
- Chapter 5 of HOPR
- Chapter 3 from Advanced R
- The topic should be mostly review, but we will go a little deeper.
Vector Types
- Two types of vectors:
- Atomic: All elements of same type.
- List (“generic vectors”): Objects may be of different types.
- (low-key third type)
NULL
: Absence of a vector.
Atomic Vectors
Four basic types:
- Logical: Either
TRUE
orFALSE
- Integer:
- Exactly an integer. Assign them by adding
L
behind it (for “long integer”). -1L
,0L
,1L
,2L
,3L
, etc…
- Exactly an integer. Assign them by adding
- Double:
- Decimal numbers.
1
,1.0
,1.01
, etc…Inf
,-Inf
, andNaN
are also doubles.
- Character:
- Anything in quotes:
"1"
,"one"
,"1 won one"
, etc…
- Logical: Either
You create vectors with
c()
for “combine”There are no scalars in R. A “scalar” is just a vector length 1.
is.vector(TRUE)
[1] TRUE
Integers and doubles are together called “numerics”
You can determine the type with
typeof()
.<- c(TRUE, FALSE) x typeof(x)
[1] "logical"
<- c(0L, 1L) x typeof(x)
[1] "integer"
<- c(0, 1) x typeof(x)
[1] "double"
<- c("0", "1") x typeof(x)
[1] "character"
The special values,
Inf
,-Inf
, andNaN
are doublestypeof(c(Inf, -Inf, NaN))
[1] "double"
Determine the length of a vector using
length()
length(x)
[1] 2
Missing values are represented by
NA
.NA
is technically a logical value.typeof(NA)
[1] "logical"
This rarely matters because logicals get coerced to other types when needed.
typeof(c(1L, NA))
[1] "integer"
typeof(c(1, NA))
[1] "double"
typeof(c("1", NA))
[1] "character"
But if you need missing values of other types, you can use
NA_integer_ ## integer NA NA_real_ ## double NA NA_character_ ## character NA
Never use
==
when testing for missingness. It will returnNA
since it is always unknown if two unknowns are equal. Useis.na()
.<- c(NA, 1) x == NA x
[1] NA NA
is.na(x)
[1] TRUE FALSE
You can check the type with
is.logical()
,is.integer()
,is.double()
, andis.character()
.is.logical(TRUE)
[1] TRUE
is.integer(1L)
[1] TRUE
is.double(1)
[1] TRUE
is.character("1")
[1] TRUE
Attempting to combine vectors of different types coerces them to the same type. The order of preference is character > double > integer > logical.
typeof(c(1L, TRUE))
[1] "integer"
typeof(c(1, 1L))
[1] "double"
typeof(c("1", 1))
[1] "character"
Exercise (from Advanced R): Predict the output:
c(1, FALSE) c("a", 1) c(TRUE, 1L)
Exercise (from Advanced R): Explain these results:
1 == "1"
[1] TRUE
-1 < FALSE
[1] TRUE
"one" < 2
[1] FALSE
Attributes
Attributes are meta information applied to atomic vectors.
Many common objects (like matrices, arrays, factors, date-times) are just atomic vectors with special attributes.
You get and set attributes with
attr()
<- 1:3 a attr(a, "x") <- "abcdef" # sets x attribute of vector a to be "abcdef" attr(a, "x") # retrieve the x attribute of vector a
[1] "abcdef"
You can see all attributes of a vector with
attributes()
.attr(a, "y") <- 4:6 attributes(a)
$x [1] "abcdef" $y [1] 4 5 6
You can set many attributes at the same time with
structure()
.<- structure(1:3, b x = "abcdef", y = 4:6) attributes(b)
$x [1] "abcdef" $y [1] 4 5 6
Attributes are name-value pairs, and all of these attributes are associated with an object. Below, the vector
c(1, 2, 3)
points to attributesx
andy
that each have their own values.Most attributes are typically lost by most operations.
attributes(a[[1]])
NULL
attributes(sum(a))
NULL
Exception: Two attributes are not lost typically: names and dim.
Names
Names are a character vector the same length as the atomic vector. Each name corresponds to a single element.
You could set names using
attr()
, but you should not.<- 1:3 x attr(x, "names") <- c("a", "b", "c") attributes(x)
$names [1] "a" "b" "c"
Names are so special, that there are special ways to create them and view them
<- c(a = 1, b = 2, c = 3) x names(x)
[1] "a" "b" "c"
<- 1:3 x names(x) <- c("a", "b", "c") names(x)
[1] "a" "b" "c"
The proper way to think about names is like this:
But each name corresponds to a specific element, so Hadley does it like this:
Names stay with single bracket subsetting (not double bracket subsetting)
names(x[1])
[1] "a"
names(x[1:2])
[1] "a" "b"
names(x[[1]])
NULL
Names can be used for subsetting (more in Chapter 4)
"a"]] x[[
[1] 1
You can remove names with
unname()
.unname(x)
[1] 1 2 3
S3 Atomic Vectors
The class of an object is an important attribute that controls R’s S3 system for object oriented programming.
The class of an object will determine its behavior when you use that class in a generic function such as
print()
orsummary()
.- A generic function is a function that has different behavior based on the class of the input.
You can create your own S3 classes (chapter 13).
Here, we will talk about some S3 classes that come with R by default.
You can determine the class of object with
class()
, and you can set the class toNULL
byunclass()
.Factors, Dates, and POSIXct (date-times)
A factor is an integer vector with
- The
class
attributefactor
, and - A
levels
attribute describing the possible levels
<- factor(c("a", "b", "b", "a")) x x
[1] a b b a Levels: a b
typeof(x)
[1] "integer"
class(x)
[1] "factor"
attributes(x)
$levels [1] "a" "b" $class [1] "factor"
- The
R also does some stuff under the hood for encoding factors (i.e. has a lot of methods specifically for factors).
Factors are R’s way of storing categorical variables, and are useful when a variable only has a certain number of possible values.
Learn more about factors here.
A Date is a double vector with class attribute
Date
.<- Sys.Date() today typeof(today)
[1] "double"
attributes(today)
$class [1] "Date"
class(today)
[1] "Date"
Let’s look at the underlying double to today:
unclass(today)
[1] 20242
This is the number of days since January 1, 1970:
unclass(as.Date("1970-01-01"))
[1] 0
Date-time classes are called either
POSIXct
(Portable Operating System Interface in Unix, Calendar Time) orPOSIXlt
(Portable Operating System Interface in Unix, Local Time).POSIXct
shows up more often. It is a double representing the number of seconds since the beginning of 1970.<- Sys.time() now typeof(now)
[1] "double"
class(now)
[1] "POSIXct" "POSIXt"
unclass(now)
[1] 1.749e+09
POSIXlt
is a named list of vectors with elements representing seconds, minutes, hours, days of the month, months, years, weekdays, etc…<- as.POSIXlt(x = c("1980-10-10 01:11:01", ltvec "1970-01-11 10:15:22", "2010-05-30 20:01:18")) typeof(ltvec)
[1] "list"
unclass(ltvec)
$sec [1] 1 22 18 $min [1] 11 15 1 $hour [1] 1 10 20 $mday [1] 10 11 30 $mon [1] 9 0 4 $year [1] 80 70 110 $wday [1] 5 0 0 $yday [1] 283 10 149 $isdst [1] 1 0 1 $zone [1] "EDT" "EST" "EDT" $gmtoff [1] NA NA NA attr(,"tzone") [1] "" "EST" "EDT" attr(,"balanced") [1] TRUE
You mostly interact with these date-time objects through the
{lubridate}
package, but base R has their own interfaces (which I think are more difficult to use).Learn more about dates and date-times here.
Exercise (From Advanced R):
table()
will take as input a vector or vectors and count how many observations have each value. What sort of object doestable()
return? What is its type? What attributes does it have? How does the dimensionality change as you tabulate more variables?
Creating Empty Vectors
In many applications, you will want to create empty vectors or vectors filled with missing values.
Create an empty vector with
vector()
.vector(mode = "character", length = 0)
character(0)
vector(mode = "double", length = 0)
numeric(0)
vector(mode = "integer", length = 0)
integer(0)
vector(mode = "logical", length = 0)
logical(0)
Shorthand for this is
character()
character(0)
double()
numeric(0)
integer()
integer(0)
logical()
logical(0)
Empty vectors often show up in defaults that are returned when folks ask for something of length 0.
E.g., in if you are simulating something, you might return a vector of length 0 if they ask for 0 elements.
<- function(n) { f <- double(n) sout for (i in seq_len(n)) { <- simcode(...) ## put simulation code here sout[[i]] }return(sout) }
You often want to create an empty vector that you then fill in with values. I like to create this vector to be with missing values, so that I know I made a mistake if they are not all filled in.
<- 100 n <- rep(NA_character_, lenght.out = n) x <- rep(NA_integer_, lenght.out = n) x <- rep(NA_real_, lenght.out = n) x <- rep(NA, lenght.out = n) x
E.g. in a for-loop, you often fill in the elements of a vector. Let’s suppose we are evaluating the performance of the mean in a simulation study.
<- 1000 ## number of simulations nsim <- 10 ## sample size nsamp <- rep(NA_real_, length.out = nsim) mvec <- 0 true_mean for (i in seq_len(nsim)) { <- mean(rnorm(n = nsamp, mean = true_mean)) mvec[[i]] }mean((mvec - true_mean)^2) ## mean squared error
[1] 0.09765
If you are filling in the values of a matrix, you need to be able to create a matrix with missing values.
<- 100 n <- 3 p <- matrix(NA_character_, nrow = p, ncol = n) matval <- matrix(NA_real_, nrow = p, ncol = n) matval <- matrix(NA_integer_, nrow = p, ncol = n) matval <- matrix(NA, nrow = p, ncol = n) matval
Lists
Lists are like vectors except each element can be of any type.
You create lists with
list()
.<- list(a = 1:3, lobj log_val = TRUE, list(c = 10))
You can view a list with
str()
.str(lobj)
List of 3 $ a : int [1:3] 1 2 3 $ log_val: logi TRUE $ :List of 1 ..$ c: num 10
c()
will combine lists into a single list. If you usec()
with a list and a vector, then it will first coerce the vector into a list where each element is a list.<- list( l1 1:2, c("a", "b")) <- list(c(TRUE, FALSE)) l2 c(l1, l2)
[[1]] [1] 1 2 [[2]] [1] "a" "b" [[3]] [1] TRUE FALSE
c(l1, c("c", "d"))
[[1]] [1] 1 2 [[2]] [1] "a" "b" [[3]] [1] "c" [[4]] [1] "d"
as.list(c("c", "d")) ## this is what it does before combining
[[1]] [1] "c" [[2]] [1] "d"
typeof()
will return"list"
andis.list()
tests for a list.typeof(l1)
[1] "list"
is.list(l1)
[1] TRUE
Use
unlist()
to remove the list structure.l1
[[1]] [1] 1 2 [[2]] [1] "a" "b"
unlist(l1)
[1] "1" "2" "a" "b"
The
dim
attribute can be applied to lists<- list( lmat 1:2, 3:10, runif(4), c("Hello", "world")) dim(lmat) <- c(2, 2) lmat
[,1] [,2] [1,] integer,2 numeric,4 [2,] integer,8 character,2
1, 2]] lmat[[
[1] 0.1402 0.9930 0.2153 0.6578
Data Frames
Data Frames are lists where
- Each element is a vector.
- Each vector has the same length.
<- data.frame(a = 4:6, df b = c("A", "B", "C")) typeof(df)
[1] "list"
attributes(df)
$names [1] "a" "b" $class [1] "data.frame" $row.names [1] 1 2 3
Above, the “names” attribute are the columnames, and you can get them with
colnames()
colnames(df)
[1] "a" "b"
names(df)
[1] "a" "b"
The row.names are the row names, and you can obtain them with
row.names()
orrownames()
.row.names()
are specifically for data frames, whereasrownames()
was designed for extracting dimnames and was also altered to work with data frames.
row.names(df)
[1] "1" "2" "3"
rownames(df)
[1] "1" "2" "3"
Those row names are automatically generated, but you can set them with
rownames()
.rownames(df) <- c("h", "i", "j") df
a b h 4 A i 5 B j 6 C
tibbles
, from the package{tibble}
are tidyverse data frames. The main differences are:Tibbles do not automatically coerce data (such as from strings to factors). Data frames used to do this in older versions of R.
data.frame( x = c("a", "b", "c"), stringsAsFactors = FALSE) ## needed to be safe for older versions of R
x 1 a 2 b 3 c
::tibble(x = c("a", "b", "c")) tibble
# A tibble: 3 × 1 x <chr> 1 a 2 b 3 c
Tibbles do not change names if they happen to be non-syntactic (e.g. have spaces in them)
data.frame(`hello world` = c(1, 2, 3))
hello.world 1 1 2 2 3 3
::tibble(`hello world` = c(1, 2, 3)) tibble
# A tibble: 3 × 1 `hello world` <dbl> 1 1 2 2 3 3
Tibbles will only recycle vectors of length 1.
data.frame( x = c(1, 2, 3, 4), y = c(1, 2))
x y 1 1 1 2 2 2 3 3 1 4 4 2
::tibble( tibblex = c(1, 2, 3, 4), y = c(1, 2))
Error in `tibble::tibble()`: ! Tibble columns must have compatible sizes. • Size 4: Existing data. • Size 2: Column `y`. ℹ Only values of size one are recycled.
{tibbles}
do not reduce to vectors when you subset one column. Folks disagree on whether this is good or bad.<- data.frame(`hello world` = c(1, 2, 3)) df <- tibble::tibble(`hello world` = c(1, 2, 3)) tib attributes(df[, 1])
NULL
attributes(tib[, 1])
$names [1] "hello world" $row.names [1] 1 2 3 $class [1] "tbl_df" "tbl" "data.frame"
Data frames allow for row names, tibbles do not. Folks disagree on whether this is desirable (Hadley is extremely against it).
Tibbles print differently than data frames. Tibbles only print 10 rows and only the columns that will fit. But I actually prefer the data frame method better, because pretty doesn’t matter when you are doing data analysis, and it’s better to see all columns.
Exercise: Based on our discussion of making zero-length vectors, create a data frame with zero rows and columns
a
, andb
. Both should be double columns.Exercise: What does
data.frame()
do without any arguments?Exercise: Use the
row.names
argument ofdata.frame()
to create a data frame with 100 rows and no columns.
NULL
NULL
is its own data type, that always has length 0.typeof(NULL)
[1] "NULL"
length(NULL)
[1] 0
NULL
is used to represent an empty vector.c()
NULL
NULL
is often used as a default argument in a function for complicated arguments. The function operates one way unless a user specifies something for that argument. Look at?ashr::ash.workhorse
for multiple examples.<- function(x = NULL) { f if (is.null(x)) { ## do something else { } ## do something else } }
E.g., let’s create a function
wmean
that calculates a weighted mean if weights are provided, an the sample mean otherwise.<- function(x, w = NULL) { wmean if (is.null(w)) { <- rep(1 / length(x), length.out = length(x)) w else { } <- w / sum(w) w }return(sum(x * w)) } <- c(1, 2, 3) x wmean(x)
[1] 2
wmean(x, w = c(5, 1, 1))
[1] 1.429
There are two alternative strategies to this. First, use
missingArg()
to test if an argument is missing.<- function(x, w) { wmean if (missingArg(w)) { <- rep(1 / length(x), length.out = length(x)) w else { } <- w / sum(w) w }return(sum(x * w)) } <- c(1, 2, 3) x wmean(x)
[1] 2
wmean(x, w = c(5, 1, 1))
[1] 1.429
This works because of lazy evaluation (which we will learn about later).
I don’t like this because it is confusing to the user, who thinks
w
is a required argument.Second, you can include more complicated defaults.
<- function(x, w = rep(1, length(x))) { wmean <- w / sum(w) w return(sum(x * w)) } <- c(1, 2, 3) x wmean(x)
[1] 2
wmean(x, w = c(5, 1, 1))
[1] 1.429
I don’t like this because default arguments are evaluated inside the function, but user-provided arguments are evaluated outside the function (more on this later). This can lead to strange results.
NULL
is one of the ways R handle’s missingness. The others areNA
andNaN
.NULL
: An empty object. Can be thought of as a zero-length vector.NA
: A missing value. Can be used as an element of a vector.NaN
: Undefined numeric values, such as the output of0/0
.
New Functions
typeof()
: Determine the type of an object (character, double, integer, or logical).attr()
: Get or set an attribute.attributes()
: View all attributes.structure()
: Create an object with many attributes.names()
: Get or set names attributes.unname()
: Remove the names attribute.dim()
: Get or set dim attributes.class()
: Get or set class attributes.unclass()
: Remove the class attribute.