Function Creation

Author

David Gerard

Published

June 3, 2025

Learning Objectives

Creating your own functions.
Chapter 2 of HOPR
Chapter 26 of RDS

Functions

All functions are of the form

name <- function(arg1, arg2 = default1, arg3 = default2) {
  ## Code using arg1, arg2, arg3, to create result
  return(result)
}

You can choose any name you want, but it should be informative.
You choose the names of the arguments arg1, arg2, etc…
- These are the inputs the user will use.
Arguments can have defaults by setting arg2 = default1, where default1 is the value arg2 takes when the user does not supply one. In the above example, arg1 has no default but arg2 and arg3 have defaults.
Your code creates some output which I call result above.
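You put the output in a return() call at the end of the function. If return() is omitted, R returns the value of the last evaluated expression, which is why some of the examples below have no explicit return().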
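
As a minimal sketch of the template above, the hypothetical function raise_power() below (its name and arguments are illustrative, not taken from the readings) has one argument without a default and one with a default.

```r
# Hypothetical example: `x` has no default, `power` defaults to 2.
raise_power <- function(x, power = 2) {
  result <- x^power
  return(result)
}

raise_power(3)             # uses the default power = 2
raise_power(3, power = 3)  # overrides the default
```

Calling raise_power() with no arguments produces an error, because x has no default and the body needs it.
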
Steps to creating a function:
1. figure out the logic in a simple case
2. name it something meaningful - usually a verb
3. list the inputs inside function(x,y,z)
4. place code for function in a {} block
5. test your function with some different inputs
6. add error-checking of inputs
Coding standards
1. Save as text file
2. Indent code
3. Limit width of code (80 columns?)
4. Limit the length of individual functions
5. Frequent comments
```
add_two <- function(a, b) {
  return(a + b)
}
add_two(2, 4)
```
```
[1] 6
```
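
Step 6 in the list of steps (error-checking of inputs) could be sketched by extending add_two(). The name add_two_checked() is hypothetical, and stopifnot() is only one of several ways to validate inputs.

```r
# Hypothetical variant of add_two() that validates its inputs first.
add_two_checked <- function(a, b) {
  # stopifnot() raises an error unless every condition is TRUE
  stopifnot(is.numeric(a), is.numeric(b))
  return(a + b)
}

add_two_checked(2, 4)
# add_two_checked("2", 4) stops immediately with an error that names
# the failed condition, rather than failing inside `a + b`.
```
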

The following example is from RDS.

df <- tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df$a <- (df$a - min(df$a, na.rm = TRUE)) / 
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / 
  (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / 
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / 
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))

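Note the copy-and-paste error in the df$b line above: its denominator subtracts min(df$a) where min(df$b) was intended. Mistakes like this are one reason to replace repeated code with a function. How many inputs does each line have?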

x <- df$a
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))

 [1] 0.78057 0.08637 0.32873 0.88008 0.03995 0.26108 0.00000 1.00000 0.75874
[10] 0.93993

# get rid of duplication
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])

 [1] 0.78057 0.08637 0.32873 0.88008 0.03995 0.26108 0.00000 1.00000 0.75874
[10] 0.93993

# make it into a function and test it
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, 10))

[1] 0.0 0.5 1.0

rescale01(c(-10, 0, 10))

[1] 0.0 0.5 1.0

rescale01(c(1, 2, 3, NA, 5))

[1] 0.00 0.25 0.50   NA 1.00

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)

Now, if we have a change in requirements, we only have to change it in one place. For instance, perhaps we want to handle columns that have Inf as one of the values.

x <- c(1:10, Inf)
rescale01(x)

 [1]   0   0   0   0   0   0   0   0   0   0 NaN

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)

 [1] 0.0000 0.1111 0.2222 0.3333 0.4444 0.5556 0.6667 0.7778 0.8889 1.0000
[11]    Inf
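
To see why this change works, it may help to compare the two range() calls directly. This is a small illustrative check and not part of the book's example.

```r
x <- c(1:10, Inf)

# Without finite = TRUE, the upper end of the range is Inf,
# so every finite value is divided by Inf and becomes 0.
range(x, na.rm = TRUE)

# With finite = TRUE, non-finite values are ignored when computing
# the range, so the denominator is 10 - 1 = 9.
range(x, na.rm = TRUE, finite = TRUE)
```
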

Dos and don’ts of function naming:
- pick either snake_case or camelCase but don’t use both
- meaningful names (preferably verbs)
- for a family of functions, start with the same word
- try not to overwrite common functions or variables
- use lots of comments in your code, particularly to explain the “why” of your code or to break up your code into sections using something like # load data --------------------
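
The naming advice above might look like the following in practice. The function names are hypothetical and chosen only to illustrate a shared prefix, snake_case, and a section comment.

```r
# summary helpers ------------------------------------------

# A small family of functions that all start with "summarize_".
summarize_center <- function(x) {
  # na.rm = TRUE because missing values should not make the result NA
  mean(x, na.rm = TRUE)
}

summarize_spread <- function(x) {
  sd(x, na.rm = TRUE)
}

# Names such as `mean` or `c` are avoided for our own objects,
# since they would mask commonly used base functions.
summarize_center(c(1, 2, 3, NA))
```
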
Exercise: Write a function that calculates the \(z\)-scores of a numeric vector. The \(z\)-score takes each value, subtracts the mean, then divides by the standard deviation. It measures how many standard deviations above (or below) the mean a value is.
Exercise: Write a function that takes a numeric vector as input and replaces all instances of -9 with NA.
Exercise: Write a function that takes a numeric vector and returns the coefficient of variation (the standard deviation divided by the mean).
Exercise: Write a function that takes as input a vector and returns the number of missing values.
Exercise (from RDS): Given a vector of birth dates, write a function to compute the age in years.
Exercise: Rewrite the function range(). Useful functions: min(), max().
Exercise: Write both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors. Useful functions: is.na(), sum(), logical operators.

Exercise: Read the source code for each of the following three functions, describe what they do, and then brainstorm better names.

f1 <- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}

f2 <- function(x) {
  if (length(x) <= 1) return(NULL)
  x[-length(x)]
}

f3 <- function(x, y) {
  rep(y, length.out = length(x))
}