Function Creation

Author

David Gerard

Published

June 3, 2025

Learning Objectives

  • Creating your own functions.
  • Chapter 2 of HOPR
  • Chapter 26 of RDS

Functions

  • All functions are of the form

    name <- function(arg1, arg2 = default1, arg3 = default2) {
      ## Code using arg1, arg2, arg3, to create result
      return(result)
    }
  • You can choose any name you want, but it should be informative.

  • You choose the names of the arguments arg1, arg2, etc…

    • These are the inputs the user will use.
  • Arguments can have defaults: writing arg2 = default1 in the definition makes default1 the value of arg2 whenever the caller does not supply one. In the above example, arg1 has no default but arg2 and arg3 have defaults.

  • Your code creates some output which I call result above.

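  • You put the output in a return() call at the end of the function. If there is no return() call, R returns the value of the last evaluated expression, which is why rescale01() below works without one.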
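
  • As a minimal sketch of the template above, here is a function with one required argument and one argument with a default. The name raise_to and its arguments are made up for illustration.

```r
# Hypothetical example: raise each element of x to a power.
# `power` defaults to 2, so callers may leave it out.
raise_to <- function(x, power = 2) {
  result <- x^power
  return(result)
}

raise_to(3)                 # uses the default power = 2, gives 9
raise_to(3, power = 3)      # overrides the default, gives 27
raise_to(power = 3, x = 2)  # named arguments can come in any order, gives 8
```

    Calling raise_to() with no arguments fails, because x has no default.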

  • Steps to creating a function:

    1. figure out the logic in a simple case
    2. name it something meaningful - usually a verb
    3. list the inputs inside function(x,y,z)
    4. place code for function in a {} block
    5. test your function with some different inputs
    6. add error-checking of inputs
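
    The last two steps can be sketched as follows. The function mean_square() is hypothetical, and stopifnot() is only one of several ways to check inputs in base R.

```r
# Hypothetical function: the mean of the squared values of x.
mean_square <- function(x, na.rm = FALSE) {
  # Step 6: check the inputs before doing any work
  stopifnot(is.numeric(x))
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  return(mean(x^2, na.rm = na.rm))
}

# Step 5: test the function with some different inputs
mean_square(c(1, 2, 3))                 # (1 + 4 + 9) / 3
mean_square(c(1, NA, 3), na.rm = TRUE)  # (1 + 9) / 2
try(mean_square("a"))                   # stops: is.numeric(x) is not TRUE
```

    Failing early with a clear message is usually easier to debug than letting a bad input produce a confusing error deeper in the code.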
  • Coding standards

    1. Save as text file
    2. Indent code
    3. Limit width of code (80 columns?)
    4. Limit the length of individual functions
    5. Frequent comments
    add_two <- function(a, b) {
      return(a + b)
    }
    add_two(2, 4)
    [1] 6
  • Example from our book follows.

    df <- tibble::tibble(
      a = rnorm(10),
      b = rnorm(10),
      c = rnorm(10),
      d = rnorm(10)
    )
    
    df$a <- (df$a - min(df$a, na.rm = TRUE)) / 
      (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
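    # Note the copy-and-paste error in the next statement:
    # the final min(df$a, ...) should be min(df$b, ...)
    df$b <- (df$b - min(df$b, na.rm = TRUE)) / 
      (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))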
    df$c <- (df$c - min(df$c, na.rm = TRUE)) / 
      (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
    df$d <- (df$d - min(df$d, na.rm = TRUE)) / 
      (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
  • How many inputs does each line have?

    x <- df$a
    (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
     [1] 0.78057 0.08637 0.32873 0.88008 0.03995 0.26108 0.00000 1.00000 0.75874
    [10] 0.93993
    # get rid of duplication
    rng <- range(x, na.rm = TRUE)
    (x - rng[1]) / (rng[2] - rng[1])
     [1] 0.78057 0.08637 0.32873 0.88008 0.03995 0.26108 0.00000 1.00000 0.75874
    [10] 0.93993
    # make it into a function and test it
    rescale01 <- function(x) {
      rng <- range(x, na.rm = TRUE)
      (x - rng[1]) / (rng[2] - rng[1])
    }
    rescale01(c(0, 5, 10))
    [1] 0.0 0.5 1.0
    rescale01(c(-10, 0, 10))
    [1] 0.0 0.5 1.0
    rescale01(c(1, 2, 3, NA, 5))
    [1] 0.00 0.25 0.50   NA 1.00
    df$a <- rescale01(df$a)
    df$b <- rescale01(df$b)
    df$c <- rescale01(df$c)
    df$d <- rescale01(df$d)
  • Now, if the requirements change, we only have to change the code in one place. For instance, perhaps we want to handle columns that have Inf as one of the values.

    x <- c(1:10, Inf)
    rescale01(x)
     [1]   0   0   0   0   0   0   0   0   0   0 NaN
    rescale01 <- function(x) {
      rng <- range(x, na.rm = TRUE, finite = TRUE)
      (x - rng[1]) / (rng[2] - rng[1])
    }
    rescale01(x)
     [1] 0.0000 0.1111 0.2222 0.3333 0.4444 0.5556 0.6667 0.7778 0.8889 1.0000
    [11]    Inf
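  • Once the logic lives in a function, the four near-identical assignment lines can be collapsed as well. A minimal sketch, assuming a base data frame df2 (a hypothetical stand-in for the tibble above) and using lapply() to call rescale01() on every column:

```r
# Same definition as above, repeated so this block runs on its own
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# df2 is a hypothetical base data frame standing in for the tibble
set.seed(1)
df2 <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# Assigning into df2[] replaces every column but keeps the data frame
df2[] <- lapply(df2, rescale01)
sapply(df2, range)  # each column now runs from 0 to 1
```

    This removes the chance of mistyping a column name when copying a line, at the cost of rescaling every column rather than a chosen few.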
  • Dos and don’ts of function naming and commenting:

    • pick either snake_case or camelCase but don’t use both
    • meaningful names (preferably verbs)
    • for a family of functions, start with the same word
    • try not to overwrite common functions or variables
    • use lots of comments in your code, particularly to explain the “why” of your code or to break up your code into sections using something like # load data --------------------
  • Exercise: Write a function that calculates the \(z\)-scores of a numeric vector. The \(z\)-score takes each value, subtracts the mean, then divides by the standard deviation. It measures how many standard deviations above (or below) the mean a value is.

  • Exercise: Write a function that takes a numeric vector as input and replaces all instances of -9 with NA.

  • Exercise: Write a function that takes a numeric vector and returns the coefficient of variation (the standard deviation divided by the mean).

  • Exercise: Write a function that takes as input a vector and returns the number of missing values.

  • Exercise (from RDS): Given a vector of birth dates, write a function to compute the age in years.

  • Exercise: Re-write the function range(). Useful functions: min(), max().

  • Exercise: Write both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors. Useful functions: is.na(), sum(), logical operators.

  • Exercise: Read the source code for each of the following three functions, describe what they do, and then brainstorm better names.

    f1 <- function(string, prefix) {
      substr(string, 1, nchar(prefix)) == prefix
    }
    
    f2 <- function(x) {
      if (length(x) <= 1) return(NULL)
      x[-length(x)]
    }
    
    f3 <- function(x, y) {
      rep(y, length.out = length(x))
    }