Sample from groups, n varies by group

A challenge tweeted by Hilary Parker, paraphrased:

How do you sample from groups, with a different sample size for each group?

Illustrated with the iris data.

Species = groups.
Sample from the 3 Species with 3 different sample sizes.

How this fits the template:

DRAW A SAMPLE for each PAIR OF (SPECIES DATA, SPECIES SAMPLE SIZE)
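
To make the "for each PAIR" idea concrete before any data frames are involved, here is a minimal sketch of pairwise iteration with purrr::map2(). The objects `pools` and `sizes` are hypothetical toy inputs, not part of the original challenge.

```r
library(purrr)

# Hypothetical toy inputs: two things to sample from, and one size for each
pools <- list(a = 1:10, b = letters)
sizes <- c(2, 4)

set.seed(1)
# map2() walks both inputs in parallel:
#   sample(pools[[1]], sizes[[1]]), then sample(pools[[2]], sizes[[2]])
draws <- map2(pools, sizes, sample)

lengths(draws)  # one draw per pair, each with its own size
```

The rest of the walkthrough does the same thing, except that the two inputs are columns of a data frame.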

How to prepare the data? I need a data frame with:

- One row per Species
- A variable of Species-specific sample sizes
- A variable of “Species data”, whatever that means.
  - Actually we know what that is: a variable of Species-specific data frames. A list-column!

We need a nested data frame.

suppressMessages(library(dplyr))
suppressMessages(library(purrr))
library(tidyr)
set.seed(4561)

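(nested_iris <- iris %>%
    group_by(Species) %>%   # prep for work by Species
    nest() %>%              # --> one row per Species
    ungroup() %>%           # no-op on older tidyr; needed with tidyr >= 1.0.0,
                            # where nest() keeps the grouping
    mutate(n = c(2, 5, 3))) # add sample sizes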
#> # A tibble: 3 x 3
#>   Species    data                  n
#>   <fct>      <list>            <dbl>
#> 1 setosa     <tibble [50 × 4]>     2
#> 2 versicolor <tibble [50 × 4]>     5
#> 3 virginica  <tibble [50 × 4]>     3
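
If the list-column still feels abstract, it may help to pull one element out and look at it. This is a small sketch that rebuilds the nested data frame under a different name (`nested`, chosen here only for illustration) so it can be run on its own.

```r
library(dplyr)
library(tidyr)

nested <- iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup()

# Each element of the list-column is an ordinary data frame for one Species
setosa_df <- nested$data[[which(nested$Species == "setosa")]]

dim(setosa_df)    # 50 rows, 4 columns: Species lives in the outer data frame
names(setosa_df)
```

Note that Species is not repeated inside the inner data frames. That is why the inner tibbles have 4 columns rather than iris's 5.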

Draw the samples.

purrr::map2() is good since we want to operate on 2 things (data = DATA FOR ONE SPECIES, n = SAMPLE SIZE).
We’ve already got data = DATA FOR ONE SPECIES and n = SAMPLE SIZE as variables in our data frame.
Drop them in as inputs 1 and 2 to dplyr::sample_n(tbl, size).
Accept whatever comes back as a new list-column in the data frame, i.e. use dplyr::mutate(). Be brave and deal with it.

(sampled_iris <- nested_iris %>%
  mutate(samp = map2(data, n, sample_n)))
#> # A tibble: 3 x 4
#>   Species    data                  n samp            
#>   <fct>      <list>            <dbl> <list>          
#> 1 setosa     <tibble [50 × 4]>     2 <tibble [2 × 4]>
#> 2 versicolor <tibble [50 × 4]>     5 <tibble [5 × 4]>
#> 3 virginica  <tibble [50 × 4]>     3 <tibble [3 × 4]>
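
A note for newer dplyr: as of dplyr 1.0.0, sample_n() is documented as superseded by slice_sample(). The sketch below is one way the same step might look with slice_sample(), assuming dplyr >= 1.0.0 is installed. Because the size is passed as the named argument `n`, an anonymous function is used instead of dropping the function in bare. The name `sampled` is just for this example.

```r
library(dplyr)
library(purrr)
library(tidyr)

set.seed(4561)
sampled <- iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%
  mutate(n = c(2, 5, 3)) %>%
  # .x is one Species' data frame, .y is that Species' sample size
  mutate(samp = map2(data, n, ~ slice_sample(.x, n = .y)))

map_int(sampled$samp, nrow)  # rows drawn per Species
```

The rows drawn will generally differ from those shown for sample_n(), even with the same seed, so treat the printed output above as specific to the original code.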

What came back? More Species-specific data frames.

We are in that uncomfortable intermediate state, with two list-columns: the original data and the sampled data, samp. Let’s get back to a normal data frame!

Keep only Species and samp variables.
Unnest, which essentially rowbinds the data frames in samp and replicates Species as necessary.

sampled_iris %>% 
  select(Species, samp) %>%
  unnest(samp)          # name the list-column; tidyr >= 1.0.0 warns if omitted
#> # A tibble: 10 x 5
#>    Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
#>    <fct>             <dbl>       <dbl>        <dbl>       <dbl>
#>  1 setosa              4.6         3.1          1.5         0.2
#>  2 setosa              4.5         2.3          1.3         0.3
#>  3 versicolor          5.6         2.5          3.9         1.1
#>  4 versicolor          6.1         2.9          4.7         1.4
#>  5 versicolor          6.1         2.8          4.7         1.2
#>  6 versicolor          6.6         2.9          4.6         1.3
#>  7 versicolor          5.7         2.6          3.5         1  
#>  8 virginica           6.3         2.7          4.9         1.8
#>  9 virginica           6.9         3.1          5.1         2.3
#> 10 virginica           6.7         3.1          5.6         2.4

Again, from the top, with no exposition:

iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%
  mutate(n = c(2, 5, 3)) %>%
  mutate(samp = map2(data, n, sample_n)) %>%
  select(Species, samp) %>%
  unnest(samp)
#> # A tibble: 10 x 5
#>    Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
#>    <fct>             <dbl>       <dbl>        <dbl>       <dbl>
#>  1 setosa              5.4         3.4          1.7         0.2
#>  2 setosa              5.5         3.5          1.3         0.2
#>  3 versicolor          6.6         2.9          4.6         1.3
#>  4 versicolor          6.9         3.1          4.9         1.5
#>  5 versicolor          5.8         2.7          3.9         1.2
#>  6 versicolor          6           2.7          5.1         1.6
#>  7 versicolor          6.2         2.9          4.3         1.3
#>  8 virginica           6.4         3.2          5.3         2.3
#>  9 virginica           6.5         3            5.5         1.8
#> 10 virginica           6.1         3            4.9         1.8
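
One thing the pipeline takes on faith is that c(2, 5, 3) lines up with the row order of the nested data frame. A possible variation, sketched below, is to key the sample sizes by Species name in a small lookup table and join it on. The `sizes` table is hypothetical and deliberately listed out of order, and Species is converted to character so the join key types match.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical lookup table: sample sizes keyed by Species name, not position
sizes <- tibble(
  Species = c("virginica", "setosa", "versicolor"),
  n       = c(3, 2, 5)
)

set.seed(4561)
sampled <- iris %>%
  mutate(Species = as.character(Species)) %>%  # match the key type in `sizes`
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%
  left_join(sizes, by = "Species") %>%         # attach n by name
  mutate(samp = map2(data, n, sample_n)) %>%
  select(Species, samp) %>%
  unnest(samp)

count(sampled, Species)  # realized sample size per Species
```

The final count() is a cheap check that each Species received the size intended for it.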

A base R solution, with some marginal comments:

split_iris <- split(iris, iris$Species) # why can't Species be found in iris?
                                        # where else would it be found?
str(split_iris)                         # split_iris ~= nested_iris[["data"]]
#> List of 3
#>  $ setosa    :'data.frame':  50 obs. of  5 variables:
#>   ..$ Sepal.Length: num [1:50] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>   ..$ Sepal.Width : num [1:50] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>   ..$ Petal.Length: num [1:50] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>   ..$ Petal.Width : num [1:50] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ versicolor:'data.frame':  50 obs. of  5 variables:
#>   ..$ Sepal.Length: num [1:50] 7 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 ...
#>   ..$ Sepal.Width : num [1:50] 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 ...
#>   ..$ Petal.Length: num [1:50] 4.7 4.5 4.9 4 4.6 4.5 4.7 3.3 4.6 3.9 ...
#>   ..$ Petal.Width : num [1:50] 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1 1.3 1.4 ...
#>   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
#>  $ virginica :'data.frame':  50 obs. of  5 variables:
#>   ..$ Sepal.Length: num [1:50] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 ...
#>   ..$ Sepal.Width : num [1:50] 3.3 2.7 3 2.9 3 3 2.5 2.9 2.5 3.6 ...
#>   ..$ Petal.Length: num [1:50] 6 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 ...
#>   ..$ Petal.Width : num [1:50] 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 ...
#>   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 3 3 3 3 3 3 3 3 3 3 ...
(n <- c(2, 5, 3))                       # Species data and n are only 'in sync'
#> [1] 2 5 3
                                        # due to my discipline / care
                                        # not locked safely into a data frame
(group_sizes <- vapply(split_iris, nrow, integer(1))) # also floating free
#>     setosa versicolor  virginica 
#>         50         50         50
(sampled_obs <- mapply(sample, group_sizes, n)) # I'm floating free too!
#> $setosa
#> [1] 47 32
#> 
#> $versicolor
#> [1] 15 50 13 30 21
#> 
#> $virginica
#> [1] 15 23 33
get_rows <- function(df, rows) df[rows, , drop = FALSE] # custom function
                                        # drop = FALSE required to avoid
                                        # nasty surprise in case of n = 1
(sampled_iris <-                        # god help you if you forget SIMPLIFY = FALSE
    mapply(get_rows, split_iris, sampled_obs, SIMPLIFY = FALSE))
#> $setosa
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 47          5.1         3.8          1.6         0.2  setosa
#> 32          5.4         3.4          1.5         0.4  setosa
#> 
#> $versicolor
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 65           5.6         2.9          3.6         1.3 versicolor
#> 100          5.7         2.8          4.1         1.3 versicolor
#> 63           6.0         2.2          4.0         1.0 versicolor
#> 80           5.7         2.6          3.5         1.0 versicolor
#> 71           5.9         3.2          4.8         1.8 versicolor
#> 
#> $virginica
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 115          5.8         2.8          5.1         2.4 virginica
#> 123          7.7         2.8          6.7         2.0 virginica
#> 133          6.4         2.8          5.6         2.2 virginica
do.call(rbind, sampled_iris)            # :( do.call()
#>                Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa.47               5.1         3.8          1.6         0.2
#> setosa.32               5.4         3.4          1.5         0.4
#> versicolor.65           5.6         2.9          3.6         1.3
#> versicolor.100          5.7         2.8          4.1         1.3
#> versicolor.63           6.0         2.2          4.0         1.0
#> versicolor.80           5.7         2.6          3.5         1.0
#> versicolor.71           5.9         3.2          4.8         1.8
#> virginica.115           5.8         2.8          5.1         2.4
#> virginica.123           7.7         2.8          6.7         2.0
#> virginica.133           6.4         2.8          5.6         2.2
#>                   Species
#> setosa.47          setosa
#> setosa.32          setosa
#> versicolor.65  versicolor
#> versicolor.100 versicolor
#> versicolor.63  versicolor
#> versicolor.80  versicolor
#> versicolor.71  versicolor
#> virginica.115   virginica
#> virginica.123   virginica
#> virginica.133   virginica
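
For comparison, here is a somewhat more defensive base R sketch. It is not the approach shown above, just one way to sidestep two of the pitfalls noted in the comments: Map() always returns a list, so there is no SIMPLIFY to remember, and the sample sizes are named so they can be aligned to the groups by name. The names `sampled` and `result` are chosen for this example only.

```r
set.seed(4561)
n <- c(setosa = 2, versicolor = 5, virginica = 3)  # sizes keyed by Species name
split_iris <- split(iris, iris$Species)

# Map() always returns a list, so no SIMPLIFY = FALSE is needed
sampled <- Map(
  function(df, size) df[sample(nrow(df), size), , drop = FALSE],
  split_iris,
  n[names(split_iris)]   # align sizes to groups by name
)

result <- do.call(rbind, sampled)
table(result$Species)    # realized sample size per Species
```

The data and the sizes are still separate objects, so this only reduces the bookkeeping; it does not remove it.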

IMO the base R solution requires much greater facility with R programming and data structures to get it right. It feels more like programming than data analysis.