You are mostly used to procedural programming where you list out a sequence of steps that are carried out in succession.
mean_vec <- rep(NA_real_, length.out = length(mtcars))
names(mean_vec) <- names(mtcars)
for (i in seq_along(mtcars)) {
  mean_vec[[i]] <- mean(mtcars[[i]])
}
mean_vec
mpg cyl disp hp drat wt qsec vs
20.0906 6.1875 230.7219 146.6875 3.5966 3.2172 17.8487 0.4375
am gear carb
0.4062 3.6875 2.8125
You have also been exposed to functional programming where you compose functions with other functions.
sapply(mtcars, mean)
mpg cyl disp hp drat wt qsec vs
20.0906 6.1875 230.7219 146.6875 3.5966 3.2172 17.8487 0.4375
am gear carb
0.4062 3.6875 2.8125
purrr::map_dbl(mtcars, mean) ## tidyverse version
mpg cyl disp hp drat wt qsec vs
20.0906 6.1875 230.7219 146.6875 3.5966 3.2172 17.8487 0.4375
am gear carb
0.4062 3.6875 2.8125
Object oriented programming (OOP) is a different style of programming from what you are used to. It is centered around objects, which have data and functions attached to them and to their class.
R has three native object oriented programming systems (S3, S4, and RC for “reference classes”), and many other third-party packages have made their own object oriented systems ({R6} being the most popular).
These systems are listed in increasing order of complexity, with S3 being “baby” OOP, S4 being “YA” OOP, and RC and R6 being “big boy” OOP.
If you are extending {ggplot2} then you will learn about another OOP system specific to {ggplot2}: ggproto.
E.g.: To calculate the column means in S3 OOP, we would probably create a generic function for column means.
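A minimal sketch of such a generic, assuming a hypothetical name col_means() and a single method for data frames (neither is part of base R or of any package):

```r
## Hypothetical S3 generic: UseMethod() dispatches on the class of `x`
col_means <- function(x, ...) {
  UseMethod("col_means")
}

## Method that is used when `x` is a data frame
col_means.data.frame <- function(x, ...) {
  vapply(x, mean, numeric(1))
}

col_means(mtcars)
```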
mpg cyl disp hp drat wt qsec vs
20.0906 6.1875 230.7219 146.6875 3.5966 3.2172 17.8487 0.4375
am gear carb
0.4062 3.6875 2.8125
Because R programmers are not OOP programmers, you should be coding mostly in S3 and S4 when using OOP. We’ll only spend time on S3 for this class (the most popular one).
S3 and S4 use generic function OOP where the same function name is evaluated differently based on the class of the object.
E.g. this is what allows the output of summary() to differ between integer vectors and factors.
x <- sample(1:10, size = 100, replace = TRUE)
summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.00 4.00 4.66 7.00 10.00
y <- factor(x)
summary(y)
1 2 3 4 5 6 7 8 9 10
13 20 10 10 10 7 10 7 5 8
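The mechanism behind this can be sketched with a toy generic. The name describe() and its methods are made up for illustration; the point is that one function name runs different code depending on the class of its argument.

```r
## Hypothetical generic plus three methods
describe <- function(x, ...) UseMethod("describe")
describe.numeric <- function(x, ...) "a numeric vector"
describe.factor  <- function(x, ...) "a factor"
describe.default <- function(x, ...) "something else"

x <- c(3L, 1L, 2L)
describe(x)          ## integer vectors fall through to describe.numeric
describe(factor(x))  ## uses describe.factor
describe("a")        ## no character method, so describe.default is used
```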
R6 and RC use encapsulated OOP where objects are the center of everything, holding fields (data) and methods (functions) that operate on those values. These are closest to what you would be used to if you are coming from an OOP language. Try not to use them.
E.g. in R we apply a function, like mean(), to a vector, like x. An encapsulated object oriented programming system would instead have the function mean() attached to the vector x. That’s one difference between R and Python.
R
x <- c(19, 22, 31)
mean(x) ## apply mean to x
[1] 24
Python
import numpy as np
x = np.array([19, 22, 31])
x.mean() ## mean belongs to x
np.float64(24.0)
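For comparison, the encapsulated style is also available inside R. The sketch below uses base R's reference classes (RC) via setRefClass(). The class name ColMeans, its fields df and mean_vec, and its method col_means() are illustrative choices, not an existing API.

```r
library(methods)  ## setRefClass() lives here (usually attached already)

## Hypothetical RC class: data (fields) and functions (methods) live together
ColMeans <- setRefClass(
  "ColMeans",
  fields = list(df = "data.frame", mean_vec = "numeric"),
  methods = list(
    col_means = function() {
      ## `<<-` modifies the field of this object in place
      mean_vec <<- vapply(df, mean, numeric(1))
      invisible(.self)
    }
  )
)

obj <- ColMeans$new(df = mtcars)
obj$col_means()   ## the method belongs to the object, as in Python
obj$mean_vec
```

Note that obj is modified in place: after obj$col_means() the mean_vec field is filled in without reassigning obj.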
S3 allows you to use functions like print() and summary() and plot() on outputs of your functions. You can also define your own “generics.”
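As a sketch of the first point, suppose one of your functions returns an object of a made-up class temperature (the constructor new_temperature() and the class name are hypothetical). Writing a print.temperature() method is enough for print(), and for auto-printing at the console, to handle it:

```r
## Hypothetical constructor: a list with a class attribute
new_temperature <- function(celsius) {
  structure(list(celsius = celsius), class = "temperature")
}

## print method for the class; print methods conventionally
## return their input invisibly
print.temperature <- function(x, ...) {
  cat("Temperature:", x$celsius, "degrees C\n")
  invisible(x)
}

temp <- new_temperature(21.5)
print(temp)
```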
S4 is similar to S3 but is more formal and strict. S4 is important to understand if you want to use or contribute to Bioconductor.
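A small sketch of what "formal and strict" means in practice. The class Person and the generic greet() are invented for this example; setClass(), setGeneric(), setMethod(), and new() are the standard tools from the {methods} package.

```r
library(methods)

## Slots and their types are declared up front
setClass("Person", slots = c(name = "character", age = "numeric"))

## Generics and methods are registered explicitly
setGeneric("greet", function(obj, ...) standardGeneric("greet"))
setMethod("greet", "Person", function(obj, ...) {
  paste("Hello,", obj@name)
})

p <- new("Person", name = "Ada", age = 36)
greet(p)

## Slot types are checked: a numeric name is rejected with an error
bad <- tryCatch(
  new("Person", name = 1, age = 36),
  error = function(e) "rejected"
)
bad
```

In S3, by contrast, nothing stops you from attaching the class "Person" to an object with the wrong contents.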
OOP Vocabulary
Polymorphism: Use the same function name for different types of input, but have the function evaluate differently based on the types of input.
An object is a specific instance of a class. E.g. below, x is an object of class factor.
x <- factor(c(119, 22, 31))
class(x)
[1] "factor"
A function for a specific class is a method.
In R6, methods belong to objects, like the col_means() method for our R6 class above.
In S3 and S4, methods are specific versions of generics. Like in S3, print.factor() is the print method for factor objects.
A field is data that belongs to an object. In our R6 example, we had the df and mean_vec fields.
In S3, fields are called attributes.
In S4, fields are called slots.
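For instance, the data attached to an S3 factor can be inspected with base R's attributes() and attr():

```r
f <- factor(c("lo", "hi", "lo"))
attributes(f)       ## a list holding the levels and the class
attr(f, "levels")   ## the levels, sorted by default
unclass(f)          ## the underlying integer codes, with the class stripped
```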
Classes are defined in a hierarchy. So if a method does not exist for one class, it is searched for in the parent class. The child class is said to inherit the behavior of the parent class.
E.g. tibbles inherit the behavior of data frames.
class(tibble::tibble(a = 1))
[1] "tbl_df" "tbl" "data.frame"
The process of searching through an object's classes, in order, to find the method to call is called method dispatch.
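A minimal S3 sketch of inheritance and dispatch, using invented classes (dog, kitten, animal) and an invented generic speak():

```r
speak <- function(x, ...) UseMethod("speak")
speak.animal <- function(x, ...) "generic animal noise"
speak.dog    <- function(x, ...) "Woof"

## The class vector is ordered from most specific (child) to least (parent)
dog    <- structure(list(), class = c("dog", "animal"))
kitten <- structure(list(), class = c("kitten", "animal"))

speak(dog)     ## speak.dog exists, so it is used
speak(kitten)  ## no speak.kitten, so dispatch falls back to speak.animal
```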
{sloop}
The {sloop} package is an interface for exploring OOP systems.
sloop::otype() allows you to see if the system is S3, S4, R6, etc…
pb <- progress::progress_bar$new() ## progress bars for for-loops
sloop::otype(pb)
[1] "R6"
Base Types
S (the precursor to R) was first developed without an OOP system, so its only objects were “base types”. These don’t have basic OOP functionality like polymorphism, inheritance, etc.
R users often call base types “objects” even though they aren’t OOP objects.
x <- 1:10
sloop::otype(x)
[1] "base"
In R, an OO object has a class attribute and a base type does not.
x <- 1:10
attr(x, "class")
NULL
y <- factor(x)
attr(y, "class")
[1] "factor"
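Base R's is.object() performs the same check without inspecting the attribute by hand:

```r
x <- 1:10
y <- factor(x)
is.object(x)   ## FALSE: no class attribute, so a base type
is.object(y)   ## TRUE: has a class attribute, so an OO object
```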
If an object has no class attribute, class() returns its implicit class. For an integer vector like x, this matches the result of typeof(), though the two do not coincide for every base type.
class(x)
[1] "integer"
typeof(x)
[1] "integer"
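The two functions agree for x, but the implicit class is not always identical to the base type. A few base R cases where they differ:

```r
class(1)                 ## "numeric"
typeof(1)                ## "double"

m <- matrix(1:4, nrow = 2)
class(m)                 ## "matrix" "array" in R >= 4.0.0
typeof(m)                ## "integer"

class(sum)               ## "function"
typeof(sum)              ## "builtin"
```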
Every object, including OO objects, has a base type that can be seen with typeof().
typeof(y)
[1] "integer"
typeof(mtcars)
[1] "list"
typeof(USCounties) ## assumes USCounties from {Matrix} is loaded, e.g. via data(USCounties, package = "Matrix")
[1] "S4"
typeof(pb)
[1] "environment"
There are 25 base types. From Hadley’s list, the important ones are:
Vector: NULL, logical, integer, double, character, list
typeof(NULL)
[1] "NULL"
typeof(TRUE)
[1] "logical"
typeof(1L)
[1] "integer"
typeof(1)
[1] "double"
typeof("1")
[1] "character"
typeof(list(1))
[1] "list"
Functions: closure (regular R functions), special (internal R functions), builtin (“primitive” functions in the base namespace that were built using C)
typeof(mean)
[1] "closure"
typeof(`if`)
[1] "special"
typeof(sum)
[1] "builtin"
Environments: environment
typeof(rlang::global_env())
[1] "environment"
S4 types: S4
typeof(USCounties)
[1] "S4"
Language types (used in metaprogramming): symbol, language, pairlist, and expression.
typeof(quote(a))
[1] "symbol"
typeof(quote(a + 1))
[1] "language"
typeof(formals(mean))
[1] "pairlist"
typeof(expression(a))
[1] "expression"
Exercise: What are the (i) type, (ii) OOP system, and (iii) class of each of the following objects?
x <- lubridate::make_date(year = c(1990, 2022), month = c(1, 2), day = c(30, 22))
y <- matrix(NA_real_, nrow = 10, ncol = 2)
z <- tibble::tibble(a = 1:3)
aa <- lm(mpg ~ wt, data = mtcars)
bb <- t.test(mpg ~ am, data = mtcars)
cc <- rTensor::as.tensor(array(1:30, dim = c(2, 3, 5)))
Exercise: Why do we get different results from summary() with the following code?
a <- lm(mpg ~ wt, data = mtcars)
b <- t.test(mpg ~ am, data = mtcars)
summary(a)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.543 -2.365 -0.125 1.410 6.873
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.285 1.878 19.86 < 2e-16
wt -5.344 0.559 -9.56 1.3e-10
Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared: 0.753, Adjusted R-squared: 0.745
F-statistic: 91.4 on 1 and 30 DF, p-value: 1.29e-10
summary(b)
Length Class Mode
statistic 1 -none- numeric
parameter 1 -none- numeric
p.value 1 -none- numeric
conf.int 2 -none- numeric
estimate 2 -none- numeric
null.value 1 -none- numeric
stderr 1 -none- numeric
alternative 1 -none- character
method 1 -none- character
data.name 1 -none- character
Exercise: From the previous exercise, if we remove the class from a and b, what happens to the summary() call? What does this tell you about the summary() methods of the htest and lm classes?