Why not base?

You need a way to iterate in R in a data-structure-informed way. What does that mean?

  • Iterate over elements of a list
  • Iterate over rows or columns of a 2-dimensional object
  • Iterate over sub data frames induced by one or more factors
  • Iterate over tuples formed from the i-th element of several vectors of equal length

All of this is absolutely possible with base R, using for() loops or the “apply” functions, such as apply(), [slvmt]apply(), and by(). I know, because I did so religiously for over 15 years. It works fine!

Why might you do otherwise? As an instructor, I encounter lots of useRs who sheepishly say they know they should be using the apply functions, but they don’t, because they’ve never quite figured them out or been able to form the habit.

I have a theory about why. The user interface of the “apply” functions is not as consistent as it could be, which slows down learning. The return objects frequently require further checking and massaging before they can be used downstream, which diminishes the payoff. In particular, there’s a tendency to return a vector (atomic or otherwise) or array, instead of a data frame, with the original factor levels appearing in a names attribute.

Regardless of your preference, hopefully the side-by-side examples below are a helpful tour of different ways to iterate. Even if you prefer purrr/tidyverse in some settings, you might want base methods when you don’t want any dependencies.

Why purrr?

purrr addresses some of the friction identified in the base functions for “split-apply-combine”:

  • The map() family of functions is highly internally consistent, making it easier to transfer expertise from one function to another.
  • Greater encouragement for type-safe simplification to atomic vector or data frame, producing output that is more ready for the next step.
  • Concise syntax for defining anonymous functions.
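As a minimal self-contained sketch of that last point: purrr’s formula shorthand turns `~ mean(.x)` into an anonymous function, with `.x` standing for the current element.

```r
library(purrr)

## ~ mean(.x) is shorthand for function(x) mean(x)
map_dbl(list(1:3, 4:6), ~ mean(.x))
#> [1] 2 5
```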

We load purrr now. Below we also use example lists from repurrrsive which are introduced and explored elsewhere.

library(purrr)
library(repurrrsive)

Why not plyr?

If you have never heard of the plyr package, you can skip this section! Personally, I’m a bit sad that it’s time to move on from plyr. But it is time.

Why? plyr is no longer under active development. The innovation is happening elsewhere, in purrr and the other packages in the tidyverse. Also, even I must admit that plyr can be rather slow. I wouldn’t recommend writing new code that requires plyr.

In terms of understanding or translating plyr code, here’s the minimum you need to know: The key feature of the main 3 * 4 = 12 functions in plyr is their explicit, memorable form.

XYply()

where X and Y specify the type of input and output, respectively. X and Y take values from here:

  • a = “array”
  • d = “data frame”
  • l = “list”
  • _ = “nothing” (not eligible as X re: input, for obvious reasons)

The most useful function proved to be ddply(): data frame in, data frame out. This inspired the creation of the dplyr package. Most tasks suitable for a*ply() and, especially, l*ply() are now handled by purrr’s map() family. The functions called for their side effects, as opposed to return value, are those named *_ply(); this is handled by purrr’s walk() functions. Marching through a data frame row by row is now a job for purrr’s pmap().

lapply() vs. purrr::map()

These are the core mapping functions of base and purrr, respectively. They are “list in, list out”. The main (only?) difference is access to purrr’s shortcuts for indexing by name or position and for creating anonymous functions.

base

lapply(got_chars[1:3],
       function(x) x[["name"]])

purrr

map(got_chars[1:3], "name")

This is also plyr::llply(), although it offers some other arguments, such as .progress and .parallel.

plyr::llply(got_chars[1:3], function(x) x[["name"]])
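The examples above require got_chars from repurrrsive. Here is the same comparison on a made-up stand-in list (the names are invented for illustration), which also shows off purrr’s extraction-by-name shortcut:

```r
library(purrr)

## a made-up stand-in for got_chars: a list of named lists
chars <- list(
  list(name = "Theon", culture = "Ironborn"),
  list(name = "Tyrion", culture = "Westerlands")
)

## base: spell out an anonymous function (returns the same list as below)
lapply(chars, function(x) x[["name"]])

## purrr: a string is shorthand for extraction by name
map(chars, "name")
#> [[1]]
#> [1] "Theon"
#> 
#> [[2]]
#> [1] "Tyrion"
```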

sapply() vs. ¯\_(ツ)_/¯

sapply() is a base function that attempts to apply a reasonable simplification to the output of lapply(). It’s handy for interactive use, but due to the unpredictability of its return value, it’s unwise to use it in programming. There is no equivalent in purrr or plyr.

Problem demonstration: The characters in Game of Thrones can have aliases. Some have several, some have one, some have none. Depending on which characters I work with, the same sapply() call can return an object of entirely different class, i.e. a list or a character vector. This can be a source of mysterious bugs in code meant to run non-interactively.

aliases1 <- sapply(got_chars[20:22], function(x) x[["aliases"]])
str(aliases1)
aliases2 <- sapply(got_chars[c(3, 22, 27)], function(x) x[["aliases"]])
str(aliases2)
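The instability is easy to reproduce without the Game of Thrones data. sapply() simplifies to a matrix when every result has the same length, but silently falls back to a list otherwise:

```r
## every result has length 2: sapply() simplifies to a matrix
is.matrix(sapply(list(1:2, 3:4), identity))
#> [1] TRUE

## ragged results: the "same" call now returns a list
is.list(sapply(list(1:2, 3:5), identity))
#> [1] TRUE
```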

With purrr, you would use map() to get a list back or map_chr() to get an atomic character vector. If you use map_chr() when you should not, you’ll get an informative error right away (shown below) and can adjust your approach accordingly.

map_chr(got_chars[2:4], "aliases")

vapply() vs. map_*()

Base vapply() requires you to specify a template for the return value and is described as a safer alternative to sapply(). The closest purrr functions are the type-specific mapping functions: map_lgl(), map_int(), map_dbl(), and map_chr() that are “list in, atomic vector out”. Here’s comparable use of vapply() and map_chr() to get some of the Game of Thrones characters’ names.

base

vapply(got_chars[1:3],
       function(x) x[["name"]],
       character(1))

purrr

map_chr(got_chars[1:3], "name")
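Here is the same pairing on a made-up, self-contained list (the names are invented for illustration): both calls guarantee a character vector, and both fail informatively if any element’s name is not a length-one character vector.

```r
library(purrr)

chars <- list(list(name = "Arya"), list(name = "Sansa"))

vapply(chars, function(x) x[["name"]], character(1))
#> [1] "Arya"  "Sansa"

map_chr(chars, "name")
#> [1] "Arya"  "Sansa"
```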

vapply() always simplifies

What’s not to love about vapply(), then? It suffers from the drop = TRUE vs. FALSE problem we have when requesting a single row or column from a 2-dimensional object, except that vapply() has no drop argument to control this behavior. It’s an example of the base functions being more difficult to program around. The template allows you to specify the form of each individual result, but there is no way to specify the form – such as the dimension – of the overall result.

I adapt this example from my real life, where I have vapply() inside a function and n is an argument to that function, i.e. it varies. Here I simply define n in the global environment prior to the vapply() call. Note how vapply() returns a 2-dimensional object in the first case and an atomic vector in the second. As it says in the docs: “Simplification is always done in vapply.” Believe it.

f <- function(x, n) rep(x, n)
n <- 3
vapply(c("a", "b"), f, character(n), n = n)
#>      a   b  
#> [1,] "a" "b"
#> [2,] "a" "b"
#> [3,] "a" "b"
n <- 1
vapply(c("a", "b"), f, character(n), n = n)
#>   a   b 
#> "a" "b"

¯\_(ツ)_/¯ vs. map_dfr()

The purrr::map_dfr() function is “list in, data frame out” and there is no true base equivalent. Given the centrality of data frames for analysis, it is handy to have a function to produce them, without resorting to do.call().

base

l <- lapply(got_chars[23:25],
            `[`, c("name", "playedBy"))
mat <- do.call(rbind, l)
(df <- as.data.frame(mat, stringsAsFactors = FALSE))

purrr

map_dfr(got_chars[23:25],
        `[`, c("name", "playedBy"))
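For a version that runs without repurrrsive, here is the same pattern on a made-up list (names and ids invented for illustration):

```r
library(purrr)

chars <- list(
  list(name = "Brienne", id = 1L),
  list(name = "Davos",   id = 2L)
)

df <- map_dfr(chars, `[`, c("name", "id"))
df$name
#> [1] "Brienne" "Davos"
df$id
#> [1] 1 2
```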

The base workflow above gets trickier if you’re extracting elements of disparate type. At that point, it may make more sense to use vapply() repeatedly. For comparability, we’ll show a similar approach using purrr’s type-specific mapping, which is also safer than relying on automatic type conversion from map_dfr().

base

data.frame(
  name = vapply(got_chars[23:25], `[[`,
                character(1), "name"),
  id = vapply(got_chars[23:25], `[[`,
              integer(1), "id"),
  stringsAsFactors = FALSE
)

purrr

tibble::tibble(
  name = map_chr(got_chars[23:25], "name"),
  id = map_int(got_chars[23:25], "id")
)

mapply() vs. map2(), pmap()

When you need to iterate over 2 or more vectors/lists in parallel, the base option is mapply(). Unlike the other apply functions, the first argument is FUN, the function to apply, and the multiple vector inputs are provided “loose” via ....

For exactly two vector inputs, purrr has map2(), with all the usual type-specific variants. For an arbitrary number of vector inputs, use purrr’s pmap() or its type-specific variants, with the inputs packaged in a list. A very handy special case is when the input is a data frame, in which case pmap_*() applies .f to each row.

base

nms <- vapply(got_chars[16:18],
              `[[`, character(1), "name")
birth <- vapply(got_chars[16:18],
                `[[`, character(1), "born")
mapply(function(x, y) paste(x, "was born", y),
       nms, birth)

purrr

nms <- got_chars[16:18] %>% 
  map_chr("name")
birth <- got_chars[16:18] %>% 
  map_chr("born")
map2_chr(nms, birth, ~ paste(.x, "was born", .y))

## and again, but with pmap()
df <- tibble::tibble(
  nms, 
  connector = "was born",
  birth
)
pmap_chr(df, paste)
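The row-wise special case is easiest to see with a tiny self-contained data frame (made-up names): pmap_*() calls .f once per row, passing the columns as named arguments.

```r
library(purrr)

df <- data.frame(
  x = c("Jon", "Sam"),
  y = c("Snow", "Hill"),
  stringsAsFactors = FALSE
)

## each row becomes a call like paste(x = "Jon", y = "Snow")
pmap_chr(df, paste)
#> [1] "Jon Snow" "Sam Hill"
```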

aggregate() vs. dplyr::summarize()

Consider a data frame, as opposed to a nested list. How do you split it into pieces, according to one or more factors, apply a function to the pieces, and combine the results?

Create a tiny excerpt of the Gapminder dataset that contains a bit of data for Canada and Germany. Load dplyr, now that we are more in the data frame world.

library(dplyr)
library(gapminder)
(mini_gap <- gapminder %>%
    filter(country %in% c("Canada", "Germany"), year > 2000) %>% 
    droplevels())
#> # A tibble: 4 x 6
#>   country continent  year lifeExp      pop gdpPercap
#>   <fctr>  <fctr>    <int>   <dbl>    <int>     <dbl>
#> 1 Canada  Americas   2002    79.8 31902268     33329
#> 2 Canada  Americas   2007    80.7 33390141     36319
#> 3 Germany Europe     2002    78.7 82350671     30036
#> 4 Germany Europe     2007    79.4 82400996     32170

Simple summary of a single variable for each country. Specifically, take the mean of life expectancy. In this case, the formula method of base aggregate() is quite nice, i.e. it returns a convenient data frame.

base

aggregate(lifeExp ~ country, mini_gap, mean)
#>   country lifeExp
#> 1  Canada 80.2115
#> 2 Germany 79.0380

tidyverse

mini_gap %>% 
  group_by(country) %>% 
  summarize(lifeExp = mean(lifeExp))
#> # A tibble: 2 x 2
#>   country lifeExp
#>   <fctr>    <dbl>
#> 1 Canada     80.2
#> 2 Germany    79.0

Simple summaries of two variables for each country. We take the mean of life expectancy and of GDP per capita.

base

## formula method
aggregate(cbind(lifeExp, gdpPercap) ~ country, mini_gap, mean)
#>   country lifeExp gdpPercap
#> 1  Canada 80.2115  34824.10
#> 2 Germany 79.0380  31103.09

## data.frame method
aggregate(mini_gap[c("lifeExp", "gdpPercap")], list(mini_gap$country), mean)
#>   Group.1 lifeExp gdpPercap
#> 1  Canada 80.2115  34824.10
#> 2 Germany 79.0380  31103.09

## tapply() is more general, but its output is less useful here (not a data frame)

## returns named vector
tapply(mini_gap$lifeExp, mini_gap$country, mean)
#>  Canada Germany 
#> 80.2115 79.0380

## returns list
tapply(mini_gap$lifeExp, mini_gap$country, mean, simplify = FALSE)
#> $Canada
#> [1] 80.2115
#> 
#> $Germany
#> [1] 79.038

tidyverse

mini_gap %>% 
  group_by(country) %>% 
  summarize_at(vars(lifeExp, gdpPercap), mean)
#> # A tibble: 2 x 3
#>   country lifeExp gdpPercap
#>   <fctr>    <dbl>     <dbl>
#> 1 Canada     80.2     34824
#> 2 Germany    79.0     31103

Bivariate summary of two variables for each country. We compute the correlation of life expectancy and year, for the full gapminder dataset now. On the base side, we can no longer use aggregate() or tapply() and need to graduate to by().

base

## by() with simplification (the default)
by_obj <- by(gapminder, gapminder$country, function(df) cor(df$lifeExp, df$year))
head(by_obj)
#> gapminder$country
#> Afghanistan     Albania     Algeria      Angola   Argentina   Australia 
#>   0.9735051   0.9542420   0.9925307   0.9422392   0.9977816   0.9897716

## by() without simplification
by_obj <- by(gapminder, gapminder$country, function(df) cor(df$lifeExp, df$year),
   simplify = FALSE)
head(by_obj)
#> $Afghanistan
#> [1] 0.9735051
#> 
#> $Albania
#> [1] 0.954242
#> 
#> $Algeria
#> [1] 0.9925307
#> 
#> $Angola
#> [1] 0.9422392
#> 
#> $Argentina
#> [1] 0.9977816
#> 
#> $Australia
#> [1] 0.9897716

tidyverse

gapminder %>% 
  group_by(country) %>% 
  summarize(cor = cor(lifeExp, year))
#> # A tibble: 142 x 2
#>    country       cor
#>    <fctr>      <dbl>
#>  1 Afghanistan 0.974
#>  2 Albania     0.954
#>  3 Algeria     0.993
#>  4 Angola      0.942
#>  5 Argentina   0.998
#>  6 Australia   0.990
#>  7 Austria     0.996
#>  8 Bahrain     0.983
#>  9 Bangladesh  0.995
#> 10 Belgium     0.997
#> # ... with 132 more rows

This is a good place to pause and glance back over the base vs tidyverse calls and results in this section. It shows that you can absolutely do these tasks either way, but the consistency of the code and return values is higher in the tidyverse.

by() vs. tidyr::nest()

Fit a linear model of life expectancy against year. On the tidyverse side, we now create a nested data frame, with one meta-row per country. Therefore we load tidyr to get nest(). The data needed for each country’s linear model is stored as a list-column of country-specific data frames.

base

by_obj <- by(gapminder,
             gapminder$country, 
             function(df) lm(lifeExp ~ year, data = df))
str(by_obj[1:2], max.level = 1)
#> List of 2
#>  $ Afghanistan:List of 12
#>   ..- attr(*, "class")= chr "lm"
#>  $ Albania    :List of 12
#>   ..- attr(*, "class")= chr "lm"
#>  - attr(*, "dim")= int 2
#>  - attr(*, "dimnames")=List of 1
by_obj[[1]]
#> 
#> Call:
#> lm(formula = lifeExp ~ year, data = df)
#> 
#> Coefficients:
#> (Intercept)         year  
#>   -507.5343       0.2753

tidyverse

library(tidyr)
nested_df <- gapminder %>% 
  group_by(country, continent) %>% 
  nest() %>% 
  mutate(fit = map(data, ~ lm(lifeExp ~ year, data = .x)))
str(nested_df$fit[1:2], max.level = 1)
#> List of 2
#>  $ :List of 12
#>   ..- attr(*, "class")= chr "lm"
#>  $ :List of 12
#>   ..- attr(*, "class")= chr "lm"
nested_df$fit[[1]]
#> 
#> Call:
#> lm(formula = lifeExp ~ year, data = .x)
#> 
#> Coefficients:
#> (Intercept)         year  
#>   -507.5343       0.2753

What if you want to inspect the fits for Oceania? On the base side, you’ve got some work to do because the country information lives only in the names and the continent information is not directly linked to the model fits at all. On the tidyverse side, where the fits live in a data frame that carries country and continent info, we can use our usual techniques for filtering rows based on the data.

base

o_countries <- as.character(unique(gapminder$country[gapminder$continent == "Oceania"]))
by_obj[names(by_obj) %in% o_countries]
#> $Australia
#> 
#> Call:
#> lm(formula = lifeExp ~ year, data = df)
#> 
#> Coefficients:
#> (Intercept)         year  
#>   -376.1163       0.2277  
#> 
#> 
#> $`New Zealand`
#> 
#> Call:
#> lm(formula = lifeExp ~ year, data = df)
#> 
#> Coefficients:
#> (Intercept)         year  
#>   -307.6996       0.1928

tidyverse

nested_df %>% 
  filter(continent == "Oceania") %>% 
  .$fit
#> [[1]]
#> 
#> Call:
#> lm(formula = lifeExp ~ year, data = .x)
#> 
#> Coefficients:
#> (Intercept)         year  
#>   -376.1163       0.2277  
#> 
#> 
#> [[2]]
#> 
#> Call:
#> lm(formula = lifeExp ~ year, data = .x)
#> 
#> Coefficients:
#> (Intercept)         year  
#>   -307.6996       0.1928

Let’s form a data frame with one row per country and variables for country, continent, estimated intercept, and estimated slope. I also want to guarantee that the country and continent factors have the same levels as they originally had in gapminder.

base

coefs <- lapply(by_obj, coef)
coefs <- do.call(rbind, coefs)
coefs <- data.frame(
  country = I(rownames(coefs)),
  coefs
)
coefs$continent <- gapminder$continent[match(coefs$country, gapminder$country)]
coefs$continent <- factor(coefs$continent, levels = levels(gapminder$continent))
coefs$country <- factor(coefs$country, levels = levels(gapminder$country))
head(coefs)
#>                 country X.Intercept.      year continent
#> Afghanistan Afghanistan    -507.5343 0.2753287      Asia
#> Albania         Albania    -594.0725 0.3346832    Europe
#> Algeria         Algeria   -1067.8590 0.5692797    Africa
#> Angola           Angola    -376.5048 0.2093399    Africa
#> Argentina     Argentina    -389.6063 0.2317084  Americas
#> Australia     Australia    -376.1163 0.2277238   Oceania

tidyverse

nested_df %>% 
  mutate(coefs = map(fit, coef),
         intercept = map_dbl(coefs, 1),
         slope = map_dbl(coefs, 2)) %>% 
  select(country, continent, intercept, slope)
#> # A tibble: 142 x 4
#>    country     continent intercept slope
#>    <fctr>      <fctr>        <dbl> <dbl>
#>  1 Afghanistan Asia          - 508 0.275
#>  2 Albania     Europe        - 594 0.335
#>  3 Algeria     Africa        -1068 0.569
#>  4 Angola      Africa        - 377 0.209
#>  5 Argentina   Americas      - 390 0.232
#>  6 Australia   Oceania       - 376 0.228
#>  7 Austria     Europe        - 406 0.242
#>  8 Bahrain     Asia          - 860 0.468
#>  9 Bangladesh  Asia          - 936 0.498
#> 10 Belgium     Europe        - 340 0.209
#> # ... with 132 more rows

This illustrates the post-processing that is often necessary in a base workflow. We need to restore the country info from the names of by_obj, use them to look up the continents in gapminder, and then restore the original factor levels for both country and continent. It illustrates the payoff for temporarily tolerating the data and fit list-columns in the nested data frame used in the tidyverse workflow. The country and continent factors remain intact and directly associated with the data and fits.

Creative Commons License