Based on:

Benchmark: https://h2oai.github.io/db-benchmark/

This post introduces data.table's syntax and general form with simple examples. The general form of a query is DT[i, j, by]: take DT, subset or reorder rows using i, then compute j, grouped by by.

library(data.table)

DT <- data.table(id= c(1,1,1,1,2,2,1,2,1),
                 code = c("c", "b", "c", "c", "a", "a", "b", "a", "c"),
                 valA = c(1,8,3,6,5,4,7,2,9),
                 valB = c(12,11,10,13,19,15,16,14,18))
DT[, .(valA=sum(valA)), by=id]
##    id valA
## 1:  1   34
## 2:  2   11

For some data.table users, “dot-SD” is a bit of a mystery. But data.table creator Matt Dowle told me that it’s actually quite simple: Just think of it as a symbol representing “each group.” Let’s go through a couple of examples.
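
For instance, a minimal sketch using the DT defined above: .SD is the Subset of Data for each group, so the same expression is applied group by group.

DT[, .SD[which.max(valA)], by=id]                          # row with the largest valA within each id
DT[, lapply(.SD, sum), by=id, .SDcols=c("valA", "valB")]   # per-id sums of valA and valB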

These queries can be chained together just by adding another one on the end:

  `DT[...][...]`
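
For instance, using the DT defined above, we can aggregate and then filter the aggregated result in a single chained expression (a sketch; the intermediate table is never assigned to a name):

DT[, .(valA = sum(valA)), by=id][valA > 20]   # group-sum, then keep groups whose sum exceeds 20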
  

See data.table compared to dplyr on Stack Overflow and Quora.

Basic row subset

DT[2] # 2nd row
##    id code valA valB
## 1:  1    b    8   11
DT[2:3] # 2nd and 3rd rows
##    id code valA valB
## 1:  1    b    8   11
## 2:  1    c    3   10
w = 2:3; DT[w] # same
##    id code valA valB
## 1:  1    b    8   11
## 2:  1    c    3   10
DT[order(valA)]
##    id code valA valB
## 1:  1    c    1   12
## 2:  2    a    2   14
## 3:  1    c    3   10
## 4:  2    a    4   15
## 5:  2    a    5   19
## 6:  1    c    6   13
## 7:  1    b    7   16
## 8:  1    b    8   11
## 9:  1    c    9   18
DT[order(valA),] # same; the ',' is optional
##    id code valA valB
## 1:  1    c    1   12
## 2:  2    a    2   14
## 3:  1    c    3   10
## 4:  2    a    4   15
## 5:  2    a    5   19
## 6:  1    c    6   13
## 7:  1    b    7   16
## 8:  1    b    8   11
## 9:  1    c    9   18
DT[valB > 15] # all rows where DT$valB > 15
##    id code valA valB
## 1:  2    a    5   19
## 2:  1    b    7   16
## 3:  1    c    9   18
DT[valA > 7 & valB > 15] # compound logical expressions
##    id code valA valB
## 1:  1    c    9   18
DT[!2:4] # all rows other than 2:4
##    id code valA valB
## 1:  1    c    1   12
## 2:  2    a    5   19
## 3:  2    a    4   15
## 4:  1    b    7   16
## 5:  2    a    2   14
## 6:  1    c    9   18
DT[-(2:4)] # same
##    id code valA valB
## 1:  1    c    1   12
## 2:  2    a    5   19
## 3:  2    a    4   15
## 4:  1    b    7   16
## 5:  2    a    2   14
## 6:  1    c    9   18

select | compute columns

DT[, valA] # valA column (as vector)
## [1] 1 8 3 6 5 4 7 2 9
DT[, list(valA)] # valA column (as data.table)
##    valA
## 1:    1
## 2:    8
## 3:    3
## 4:    6
## 5:    5
## 6:    4
## 7:    7
## 8:    2
## 9:    9
DT[, .(valA)] # same; .() is an alias for list()
##    valA
## 1:    1
## 2:    8
## 3:    3
## 4:    6
## 5:    5
## 6:    4
## 7:    7
## 8:    2
## 9:    9
DT[, sum(valA)] # sum of column valA, returned as a vector
## [1] 45
DT[, .(sum(valA))] # same but return data.table
##    V1
## 1: 45
DT[, .(sv = sum(valA))] # same but name column "sv"
##    sv
## 1: 45
DT[, .(valA, valA*2)] # return 2 column data.table
##    valA V2
## 1:    1  2
## 2:    8 16
## 3:    3  6
## 4:    6 12
## 5:    5 10
## 6:    4  8
## 7:    7 14
## 8:    2  4
## 9:    9 18

subset rows and select | compute

DT[2:3, sum(valA)] # sum(valA) over rows 2 and 3
## [1] 11
DT[2:3, .(sum(valA))] # same, but return data.table
##    V1
## 1: 11
DT[2:3, .(sv = sum(valA))] # same, but name column "sv"
##    sv
## 1: 11
DT[2:5, cat(valA, "\n")] # just for j's side effect
## 8 3 6 5
## NULL

select columns the data.frame way

DT[, 2] # 2nd column, a data.table always
##    code
## 1:    c
## 2:    b
## 3:    c
## 4:    c
## 5:    a
## 6:    a
## 7:    b
## 8:    a
## 9:    c
colNum = 2
DT[, ..colNum] # same as DT[, 2]; the .. prefix looks up 'colNum' one level up (in the calling scope)
##    code
## 1:    c
## 2:    b
## 3:    c
## 4:    c
## 5:    a
## 6:    a
## 7:    b
## 8:    a
## 9:    c
DT[["valA"]] # same as DT[, valA] but with lower overhead
## [1] 1 8 3 6 5 4 7 2 9
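
The .. prefix also works with a character vector of column names, which is handy for selecting several columns programmatically. A short sketch using the DT above ('cols' is just an illustrative name):

cols = c("valA", "valB")
DT[, ..cols]                  # select the columns named in 'cols'
DT[, .SD, .SDcols = cols]     # equivalent selection via .SDcols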

grouping operations: j and by

DT[, sum(valA), by=valB] # appearance order of groups preserved
##    valB V1
## 1:   12  1
## 2:   11  8
## 3:   10  3
## 4:   13  6
## 5:   19  5
## 6:   15  4
## 7:   16  7
## 8:   14  2
## 9:   18  9
DT[, sum(valA), keyby=valB] # order the result by group
##    valB V1
## 1:   10  3
## 2:   11  8
## 3:   12  1
## 4:   13  6
## 5:   14  2
## 6:   15  4
## 7:   16  7
## 8:   18  9
## 9:   19  5
DT[, sum(valA), by=valB][order(valB)] # same by chaining expression together
##    valB V1
## 1:   10  3
## 2:   11  8
## 3:   12  1
## 4:   13  6
## 5:   14  2
## 6:   15  4
## 7:   16  7
## 8:   18  9
## 9:   19  5

fast ad hoc row subsets (subsets as joins)

DT["a", on="code"] # same as code == "a" but uses key (fast)
##    id code valA valB
## 1:  2    a    5   19
## 2:  2    a    4   15
## 3:  2    a    2   14
DT["a", on=.(code)] # same
##    id code valA valB
## 1:  2    a    5   19
## 2:  2    a    4   15
## 3:  2    a    2   14
DT[.("a"), on="code"] # same
##    id code valA valB
## 1:  2    a    5   19
## 2:  2    a    4   15
## 3:  2    a    2   14
DT[.("a"), on=.(code)] # same
##    id code valA valB
## 1:  2    a    5   19
## 2:  2    a    4   15
## 3:  2    a    2   14
DT[code=="a"] #same, == internally optimized
##    id code valA valB
## 1:  2    a    5   19
## 2:  2    a    4   15
## 3:  2    a    2   14
DT[code != "b" | valA != 3] # not yet optimized
##    id code valA valB
## 1:  1    c    1   12
## 2:  1    b    8   11
## 3:  1    c    3   10
## 4:  1    c    6   13
## 5:  2    a    5   19
## 6:  2    a    4   15
## 7:  1    b    7   16
## 8:  2    a    2   14
## 9:  1    c    9   18
DT[.("b", 8), on=c("code", "valA")] # same as DT[code == "b" & valA == 8]
##    id code valA valB
## 1:  1    b    8   11
DT[.("b", 8), on=.(code, valA)] # same
##    id code valA valB
## 1:  1    b    8   11
DT[.("b", 7:9), on=c("code", "valA")]             # no match returns NA
##    id code valA valB
## 1:  1    b    7   16
## 2:  1    b    8   11
## 3: NA    b    9   NA
DT[.("b", 7:9), on=.(code, valA), nomatch=0]      # no match row is not returned
##    id code valA valB
## 1:  1    b    7   16
## 2:  1    b    8   11
DT[.("b", 7:9), on=c("code", "valA"), roll=Inf]   # locf, previous row rolls forward
##    id code valA valB
## 1:  1    b    7   16
## 2:  1    b    8   11
## 3:  1    b    9   11
DT[.("b", 7:9), on=.(code, valA), roll=-Inf]      # nocb, next row rolls backward
##    id code valA valB
## 1:  1    b    7   16
## 2:  1    b    8   11
## 3: NA    b    9   NA
DT["b", sum(valA*valB), on="code"]   # same as DT[code=="b", sum(valA*valB)]
## [1] 200

all together

DT[code != "a", sum(valA), by = id] # get sum(valA) by id for rows where code != "a"
##    id V1
## 1:  1 34
DT[!"a", sum(valA), by=.EACHI, on="code"]         # same, but using subsets-as-joins
##    code V1
## 1:    c 19
DT[c("b","c"), sum(valA), by=.EACHI, on="code"]   # same
##    code V1
## 1:    b 15
## 2:    c 19
DT[c("b","c"), sum(valA), by=.EACHI, on=.(code)]  # same, using on=.()
##    code V1
## 1:    b 15
## 2:    c 19

joins as subsets

X = data.table(x=c("c", "b"), v=8:7, foo=c(4,2))
X
##    x v foo
## 1: c 8   4
## 2: b 7   2
DT <- data.table(id= c(1,1,1,1,2,2,1,2,1),
                 x = c("c", "b", "c", "c", "a", "a", "b", "a", "c"),
                 valA = c(1,8,3,6,5,4,7,2,9),
                 valB = c(12,11,10,13,19,15,16,14,18))
DT[X, on="x"] # right join
##    id x valA valB v foo
## 1:  1 c    1   12 8   4
## 2:  1 c    3   10 8   4
## 3:  1 c    6   13 8   4
## 4:  1 c    9   18 8   4
## 5:  1 b    8   11 7   2
## 6:  1 b    7   16 7   2
X[DT, on="x"] # left join
##    x  v foo id valA valB
## 1: c  8   4  1    1   12
## 2: b  7   2  1    8   11
## 3: c  8   4  1    3   10
## 4: c  8   4  1    6   13
## 5: a NA  NA  2    5   19
## 6: a NA  NA  2    4   15
## 7: b  7   2  1    7   16
## 8: a NA  NA  2    2   14
## 9: c  8   4  1    9   18
DT[X, on="x", nomatch=0] # inner join
##    id x valA valB v foo
## 1:  1 c    1   12 8   4
## 2:  1 c    3   10 8   4
## 3:  1 c    6   13 8   4
## 4:  1 c    9   18 8   4
## 5:  1 b    8   11 7   2
## 6:  1 b    7   16 7   2
DT[!X, on="x"] # not join
##    id x valA valB
## 1:  2 a    5   19
## 2:  2 a    4   15
## 3:  2 a    2   14
DT[X, on=c(valA="v")] # join DT$valA to X$v
##    id x valA valB i.x foo
## 1:  1 b    8   11   c   4
## 2:  1 b    7   16   b   2
DT[X, on="valA==v"] # same
##    id x valA valB i.x foo
## 1:  1 b    8   11   c   4
## 2:  1 b    7   16   b   2
# DT[X, on=.(y<=foo)] # non-equi join

# DT[X, on="y<=foo"] # same

# DT[X, on=c("y<=foo")] # same

# DT[X, on=.(y>=foo)] # non-equi join

# DT[X, on=.(x, y<=foo)] # non-equi join

# DT[X, .(x,y,x.y,v), on=.(x, y>=foo)]        # select x's join columns as well

DT[X, on="x", mult="first"] # first row of each group
##    id x valA valB v foo
## 1:  1 c    1   12 8   4
## 2:  1 b    8   11 7   2
DT[X, on="x", mult="last"] # last row of each group
##    id x valA valB v foo
## 1:  1 c    9   18 8   4
## 2:  1 b    7   16 7   2
DT[X, sum(v), by=.EACHI, on="x"]            # join and eval j for each row in i
##    x V1
## 1: c  8
## 2: b  7
DT[X, sum(v)*foo, by=.EACHI, on="x"]        # join inherited scope
##    x V1
## 1: c 32
## 2: b 14
DT[X, sum(v)*i.v, by=.EACHI, on="x"]        # 'i.v' refers to X's v column
##    x V1
## 1: c 64
## 2: b 49
#DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI] # non-equi join with by=.EACHI

setting keys

kDT = copy(DT) # copy DT to kDT to work with it

setkey(kDT, x) # set a 1-column key

setkeyv(kDT, "x") # same (v in setkeyv stands for vector)

v="x"

setkeyv(kDT, v) # same 

haskey(kDT) # TRUE
## [1] TRUE
key(kDT)
## [1] "x"

fast keyed subsets

kDT["a"]                                    # subset-as-join on *key* column 'x'
##    id x valA valB
## 1:  2 a    5   19
## 2:  2 a    4   15
## 3:  2 a    2   14
kDT["a", on="x"]                            # same, being explicit using 'on='
##    id x valA valB
## 1:  2 a    5   19
## 2:  2 a    4   15
## 3:  2 a    2   14

all together

kDT[!"a", sum(valA), by=.EACHI]                # get sum(valA) for each i != "a"
##    x V1
## 1: b 15
## 2: c 19

multi-column key

setkey(kDT,x,id)                             # 2-column key

setkeyv(kDT,c("x","id"))                     # same

fast keyed subsets on multi-column key

kDT["a"]                                    # join to 1st column of key
##    id x valA valB
## 1:  2 a    5   19
## 2:  2 a    4   15
## 3:  2 a    2   14
kDT["a", on="x"]                            # on= is optional but preferred
##    id x valA valB
## 1:  2 a    5   19
## 2:  2 a    4   15
## 3:  2 a    2   14
kDT[.("a")]                                 # same; .() is an alias for list()
##    id x valA valB
## 1:  2 a    5   19
## 2:  2 a    4   15
## 3:  2 a    2   14
kDT[list("a")]                              # same
##    id x valA valB
## 1:  2 a    5   19
## 2:  2 a    4   15
## 3:  2 a    2   14
kDT[.("a", 3)]                              # join to 2 columns
##    id x valA valB
## 1:  3 a   NA   NA
kDT[.("a", 3:6)]                            # join 4 rows (2 missing)
##    id x valA valB
## 1:  3 a   NA   NA
## 2:  4 a   NA   NA
## 3:  5 a   NA   NA
## 4:  6 a   NA   NA
kDT[.("a", 3:6), nomatch=0]                 # remove missing
## Empty data.table (0 rows and 4 cols): id,x,valA,valB
kDT[.("a", 3:6), roll=TRUE]                 # locf rolling join
##    id x valA valB
## 1:  3 a    2   14
## 2:  4 a    2   14
## 3:  5 a    2   14
## 4:  6 a    2   14
kDT[.("a", 3:6), roll=Inf]                  # same
##    id x valA valB
## 1:  3 a    2   14
## 2:  4 a    2   14
## 3:  5 a    2   14
## 4:  6 a    2   14
kDT[.("a", 3:6), roll=-Inf]                 # nocb rolling join
##    id x valA valB
## 1:  3 a   NA   NA
## 2:  4 a   NA   NA
## 3:  5 a   NA   NA
## 4:  6 a   NA   NA
kDT[!.("a")]                                # not join
##    id x valA valB
## 1:  1 b    8   11
## 2:  1 b    7   16
## 3:  1 c    1   12
## 4:  1 c    3   10
## 5:  1 c    6   13
## 6:  1 c    9   18
kDT[!"a"]                                   # same
##    id x valA valB
## 1:  1 b    8   11
## 2:  1 b    7   16
## 3:  1 c    1   12
## 4:  1 c    3   10
## 5:  1 c    6   13
## 6:  1 c    9   18

more on special symbols; see also ?"special-symbols"

DT[.N]                                      # last row
##    id x valA valB
## 1:  1 c    9   18
DT[, .N]                                    # total number of rows in DT
## [1] 9
DT[, .N, by=x]                              # number of rows in each group
##    x N
## 1: c 4
## 2: b 2
## 3: a 3
DT[, .SD, .SDcols=valA:valB]                # select columns 'valA' and 'valB'
##    valA valB
## 1:    1   12
## 2:    8   11
## 3:    3   10
## 4:    6   13
## 5:    5   19
## 6:    4   15
## 7:    7   16
## 8:    2   14
## 9:    9   18
DT[, .SD[1]]                                # first row; same as DT[1,]
##    id x valA valB
## 1:  1 c    1   12
DT[, .SD[c(1,3)]]                           # 1st and 3rd rows
##    id x valA valB
## 1:  1 c    1   12
## 2:  1 c    3   10
DT[, .SD[1:3]]                              # rows 1 through 3
##    id x valA valB
## 1:  1 c    1   12
## 2:  1 b    8   11
## 3:  1 c    3   10
DT[, .SD[1], by=x]                          # first row of each group
##    x id valA valB
## 1: c  1    1   12
## 2: b  1    8   11
## 3: a  2    5   19
DT[, c(.N, lapply(.SD, sum)), by=x]         # group size alongside sum
##    x N id valA valB
## 1: c 4  4   19   53
## 2: b 2  2   15   27
## 3: a 3  6   11   48
DT[, .I[1], by=x]                           # row number of first row of each group
##    x V1
## 1: c  1
## 2: b  2
## 3: a  5
DT[, grp := .GRP, by=x]                     # add a group counter column

#X[, DT[.BY, y, on="x"], by=x]               # join within group to use less ram

add/update/delete by reference (see ?":=")

print(DT[, z:=42L])                         # add new column by reference
##    id x valA valB grp  z
## 1:  1 c    1   12   1 42
## 2:  1 b    8   11   2 42
## 3:  1 c    3   10   1 42
## 4:  1 c    6   13   1 42
## 5:  2 a    5   19   3 42
## 6:  2 a    4   15   3 42
## 7:  1 b    7   16   2 42
## 8:  2 a    2   14   3 42
## 9:  1 c    9   18   1 42
print(DT[, z:=NULL])                        # remove column by reference
##    id x valA valB grp
## 1:  1 c    1   12   1
## 2:  1 b    8   11   2
## 3:  1 c    3   10   1
## 4:  1 c    6   13   1
## 5:  2 a    5   19   3
## 6:  2 a    4   15   3
## 7:  1 b    7   16   2
## 8:  2 a    2   14   3
## 9:  1 c    9   18   1
print(DT["a", v:=42L, on="x"])              # subassign to new column 'v' (NA where unmatched)
##    id x valA valB grp  v
## 1:  1 c    1   12   1 NA
## 2:  1 b    8   11   2 NA
## 3:  1 c    3   10   1 NA
## 4:  1 c    6   13   1 NA
## 5:  2 a    5   19   3 42
## 6:  2 a    4   15   3 42
## 7:  1 b    7   16   2 NA
## 8:  2 a    2   14   3 42
## 9:  1 c    9   18   1 NA
print(DT["b", v2:=84L, on="x"])             # subassign to new column (NA padded)
##    id x valA valB grp  v v2
## 1:  1 c    1   12   1 NA NA
## 2:  1 b    8   11   2 NA 84
## 3:  1 c    3   10   1 NA NA
## 4:  1 c    6   13   1 NA NA
## 5:  2 a    5   19   3 42 NA
## 6:  2 a    4   15   3 42 NA
## 7:  1 b    7   16   2 NA 84
## 8:  2 a    2   14   3 42 NA
## 9:  1 c    9   18   1 NA NA
DT[, m:=mean(v), by=x][]                    # add new column by reference by group
##    id x valA valB grp  v v2  m
## 1:  1 c    1   12   1 NA NA NA
## 2:  1 b    8   11   2 NA 84 NA
## 3:  1 c    3   10   1 NA NA NA
## 4:  1 c    6   13   1 NA NA NA
## 5:  2 a    5   19   3 42 NA 42
## 6:  2 a    4   15   3 42 NA 42
## 7:  1 b    7   16   2 NA 84 NA
## 8:  2 a    2   14   3 42 NA 42
## 9:  1 c    9   18   1 NA NA NA
                                            # postfix [] is shortcut to print()

advanced usage

#DT[, sum(v), by=.(y%%2)]                    # expressions in by 

#DT[, sum(v), by=.(bool = y%%2)]             # same with name for group expression

DT[, .SD[2], by=x]                          # get 2nd row of each group
##    x id valA valB grp  v v2  m
## 1: c  1    3   10   1 NA NA NA
## 2: b  1    7   16   2 NA 84 NA
## 3: a  2    4   15   3 42 NA 42
DT[, tail(.SD,2), by=x]                     # last 2 rows of each group
##    x id valA valB grp  v v2  m
## 1: c  1    6   13   1 NA NA NA
## 2: c  1    9   18   1 NA NA NA
## 3: b  1    8   11   2 NA 84 NA
## 4: b  1    7   16   2 NA 84 NA
## 5: a  2    4   15   3 42 NA 42
## 6: a  2    2   14   3 42 NA 42
DT[, lapply(.SD, sum), by=x]                # sum of all columns for each group
##    x id valA valB grp   v  v2   m
## 1: c  4   19   53   4  NA  NA  NA
## 2: b  2   15   27   4  NA 168  NA
## 3: a  6   11   48   9 126  NA 126
DT[, .SD[which.min(v)], by=x]               # nested query by group
##    x id valA valB grp  v v2  m
## 1: a  2    5   19   3 42 NA 42
#DT[, list(MySum=sum(v),
#          MyMin=min(v),
#          MyMax=max(v)),
#    by=.(x, y%%2)]                          # by 2 expressions

DT[, .(valA = .(valA), valB = .(valB)), by=x]           # list columns
##    x    valA        valB
## 1: c 1,3,6,9 12,10,13,18
## 2: b     8,7       11,16
## 3: a   5,4,2    19,15,14
DT[, .(seq = min(valA):max(valB)), by=x]          # j is not limited to just aggregations
##     x seq
##  1: c   1
##  2: c   2
##  3: c   3
##  4: c   4
##  5: c   5
##  6: c   6
##  7: c   7
##  8: c   8
##  9: c   9
## 10: c  10
## 11: c  11
## 12: c  12
## 13: c  13
## 14: c  14
## 15: c  15
## 16: c  16
## 17: c  17
## 18: c  18
## 19: b   7
## 20: b   8
## 21: b   9
## 22: b  10
## 23: b  11
## 24: b  12
## 25: b  13
## 26: b  14
## 27: b  15
## 28: b  16
## 29: a   2
## 30: a   3
## 31: a   4
## 32: a   5
## 33: a   6
## 34: a   7
## 35: a   8
## 36: a   9
## 37: a  10
## 38: a  11
## 39: a  12
## 40: a  13
## 41: a  14
## 42: a  15
## 43: a  16
## 44: a  17
## 45: a  18
## 46: a  19
##     x seq
DT[, sum(valA), by=x][V1<20]                   # compound query
##    x V1
## 1: c 19
## 2: b 15
## 3: a 11
DT[, sum(valA), by=x][order(-V1)]              # ordering results
##    x V1
## 1: c 19
## 2: b 15
## 3: a 11
DT[, c(.N, lapply(.SD,sum)), by=x]          # group size and sums by group
##    x N id valA valB grp   v  v2   m
## 1: c 4  4   19   53   4  NA  NA  NA
## 2: b 2  2   15   27   4  NA 168  NA
## 3: a 3  6   11   48   9 126  NA 126
DT[, {tmp <- mean(id);                       # anonymous lambda in 'j'; j any valid
      .(a = valA-tmp, b = valB-tmp)               #   expression where every element
      }, by=x]                              #   becomes a column in result
##    x a  b
## 1: c 0 11
## 2: c 2  9
## 3: c 5 12
## 4: c 8 17
## 5: b 7 10
## 6: b 6 15
## 7: a 3 17
## 8: a 2 13
## 9: a 0 12
pdf("new.pdf")

DT[, plot(valA, valB), by=x]                       # can also plot in 'j'
## Empty data.table (0 rows and 1 cols): x
dev.off()
## png 
##   2
# get max(y) and min of a set of columns for each consecutive run of 'v'
#DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b]

melt & dcast

dcast: see also reshape2 (https://cran.r-project.org/package=reshape2)

Description

dcast.data.table is a much faster version of reshape2::dcast, but for data.tables. More importantly, it is capable of handling very large data quite efficiently in terms of memory usage in comparison to reshape2::dcast.

The cast formula takes the form LHS ~ RHS, e.g. var1 + var2 ~ var3. The order of entries in the formula is essential. There are two special variables: . and ... . The . represents no variable, while ... represents all variables not otherwise mentioned in the formula; see Examples.

dcast also allows value.var columns of type list.

When the variable combinations in the formula do not identify a unique value for a cell, fun.aggregate must be specified; it defaults to length if left unspecified. The aggregating function should take a vector as input and return a single value (or a list of length one) as output. In cases where value.var is a list, the function should be able to handle a list input and return a single value or a list of length one as output.
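
A minimal sketch with a hypothetical toy table tmp (not one of the tables used elsewhere in this post): the duplicated (g, t) combination forces aggregation, which defaults to length unless fun.aggregate is given.

tmp <- data.table(g = c("a", "a", "b"), t = c(1, 1, 2), v = 1:3)
dcast(tmp, g ~ t)                                        # duplicates in (g, t): defaults to length
dcast(tmp, g ~ t, fun.aggregate = sum, value.var = "v")  # aggregate the duplicates with sum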

If the formula's LHS contains the same column more than once, e.g. dcast(DT, x + x ~ y), then the answer will have duplicate names. In those cases, the duplicate names are renamed using make.unique so that the key can be set without issues.
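
For example, a sketch with another hypothetical toy table tmp2; per the rule above, the duplicated g should come back renamed by make.unique (e.g. as g.1).

tmp2 <- data.table(g = c("a", "a", "b"), t = c(1, 2, 1), v = 1:3)
dcast(tmp2, g + g ~ t, value.var = "v")   # LHS repeats 'g'; duplicate name de-duplicated via make.unique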

Names for the columns that are being cast are generated from the (unique) values in each column mentioned in the formula RHS, in the same order, separated by an underscore (_).

names(ChickWeight) <- tolower(names(ChickWeight))

DT <- melt(as.data.table(ChickWeight), id = 2:4)
dcast(DT, time~variable, fun = mean)
##     time    weight
##  1:    0  41.06000
##  2:    2  49.22000
##  3:    4  59.95918
##  4:    6  74.30612
##  5:    8  91.24490
##  6:   10 107.83673
##  7:   12 129.24490
##  8:   14 143.81250
##  9:   16 168.08511
## 10:   18 190.19149
## 11:   20 209.71739
## 12:   21 218.68889
dcast(DT, diet~variable, fun = mean)
##    diet   weight
## 1:    1 102.6455
## 2:    2 122.6167
## 3:    3 142.9500
## 4:    4 135.2627
dcast(DT, diet+chick ~ time, drop = FALSE)
##      diet chick  0  2  4  6   8  10  12  14  16  18  20  21
##   1:    1    18 39 35 NA NA  NA  NA  NA  NA  NA  NA  NA  NA
##   2:    1    16 41 45 49 51  57  51  54  NA  NA  NA  NA  NA
##   3:    1    15 41 49 56 64  68  68  67  68  NA  NA  NA  NA
##   4:    1    13 41 48 53 60  65  67  71  70  71  81  91  96
##   5:    1     9 42 51 59 68  85  96  90  92  93 100 100  98
##  ---                                                       
## 196:    4    49 40 53 64 85 108 128 152 166 184 203 233 237
## 197:    4    46 40 52 62 82 101 120 144 156 173 210 231 238
## 198:    4    50 41 54 67 84 105 122 155 175 205 234 264 264
## 199:    4    42 42 49 63 84 103 126 160 174 204 234 269 281
## 200:    4    48 39 50 62 80 104 125 154 170 222 261 303 322
dcast(DT, diet + chick ~ time, drop = FALSE, fill = 0)
##      diet chick  0  2  4  6   8  10  12  14  16  18  20  21
##   1:    1    18 39 35  0  0   0   0   0   0   0   0   0   0
##   2:    1    16 41 45 49 51  57  51  54   0   0   0   0   0
##   3:    1    15 41 49 56 64  68  68  67  68   0   0   0   0
##   4:    1    13 41 48 53 60  65  67  71  70  71  81  91  96
##   5:    1     9 42 51 59 68  85  96  90  92  93 100 100  98
##  ---                                                       
## 196:    4    49 40 53 64 85 108 128 152 166 184 203 233 237
## 197:    4    46 40 52 62 82 101 120 144 156 173 210 231 238
## 198:    4    50 41 54 67 84 105 122 155 175 205 234 264 264
## 199:    4    42 42 49 63 84 103 126 160 174 204 234 269 281
## 200:    4    48 39 50 62 80 104 125 154 170 222 261 303 322

Using subset

dcast(DT, chick ~ time, fun = mean, subset = .(time < 10 & chick < 20))
##    chick  0  2   4   6   8
## 1:    18 39 35 NaN NaN NaN
## 2:    16 41 45  49  51  57
## 3:    15 41 49  56  64  68
## 4:    13 41 48  53  60  65
## 5:     9 42 51  59  68  85

The drop argument

DT <- data.table(v1 = c(1.1, 1.1, 1.1, 2.2, 2.2, 2.2),
                 v2 = factor(c(1L, 1L, 1L, 3L, 3L, 3L), levels=1:3),
                 v3 = factor(c(2L, 3L, 5L, 1L, 2L, 6L), levels=1:6),
                 v4 = c(3L, 2L, 2L, 5L, 4L, 3L))
DT
##     v1 v2 v3 v4
## 1: 1.1  1  2  3
## 2: 1.1  1  3  2
## 3: 1.1  1  5  2
## 4: 2.2  3  1  5
## 5: 2.2  3  2  4
## 6: 2.2  3  6  3

drop = TRUE (default) vs drop = FALSE

dcast(DT, v1 + v2 ~ v3) # default is drop = TRUE
## Using 'v4' as value column. Use 'value.var' to override
##     v1 v2  1 2  3  5  6
## 1: 1.1  1 NA 3  2  2 NA
## 2: 2.2  3  5 4 NA NA  3
dcast(DT, v1 + v2 ~ v3, drop = FALSE) # all missing combination of both LHS and RHS
## Using 'v4' as value column. Use 'value.var' to override
##     v1 v2  1  2  3  4  5  6
## 1: 1.1  1 NA  3  2 NA  2 NA
## 2: 1.1  2 NA NA NA NA NA NA
## 3: 1.1  3 NA NA NA NA NA NA
## 4: 2.2  1 NA NA NA NA NA NA
## 5: 2.2  2 NA NA NA NA NA NA
## 6: 2.2  3  5  4 NA NA NA  3
dcast(DT, v1 + v2 ~ v3, drop = c(FALSE, TRUE)) # all missing combination of only LHS
## Using 'v4' as value column. Use 'value.var' to override
##     v1 v2  1  2  3  5  6
## 1: 1.1  1 NA  3  2  2 NA
## 2: 1.1  2 NA NA NA NA NA
## 3: 1.1  3 NA NA NA NA NA
## 4: 2.2  1 NA NA NA NA NA
## 5: 2.2  2 NA NA NA NA NA
## 6: 2.2  3  5  4 NA NA  3
dcast(DT, v1 + v2 ~ v3, drop = c(TRUE, FALSE)) # all missing combination of only RHS
## Using 'v4' as value column. Use 'value.var' to override
##     v1 v2  1 2  3  4  5  6
## 1: 1.1  1 NA 3  2 NA  2 NA
## 2: 2.2  3  5 4 NA NA NA  3

using . and ...

DT <- data.table(v1 = rep(1:2, each = 6),
                 v2 = rep(rep(1:3, 2), each = 2),
                 v3 = rep(1:2, 6),
                 v4 = rnorm(6))
DT
##     v1 v2 v3        v4
##  1:  1  1  1 -1.165233
##  2:  1  1  2 -1.416931
##  3:  1  2  1 -1.084796
##  4:  1  2  2 -1.474635
##  5:  1  3  1 -1.388353
##  6:  1  3  2  1.320183
##  7:  2  1  1 -1.165233
##  8:  2  1  2 -1.416931
##  9:  2  2  1 -1.084796
## 10:  2  2  2 -1.474635
## 11:  2  3  1 -1.388353
## 12:  2  3  2  1.320183
dcast(DT, ... ~ v3, value.var = "v4") #same as v1 + v2 ~ v3, value.var = "v4"
##    v1 v2         1         2
## 1:  1  1 -1.165233 -1.416931
## 2:  1  2 -1.084796 -1.474635
## 3:  1  3 -1.388353  1.320183
## 4:  2  1 -1.165233 -1.416931
## 5:  2  2 -1.084796 -1.474635
## 6:  2  3 -1.388353  1.320183
dcast(DT, v1 + v2 + v3 ~ ., value.var = "v4")
##     v1 v2 v3         .
##  1:  1  1  1 -1.165233
##  2:  1  1  2 -1.416931
##  3:  1  2  1 -1.084796
##  4:  1  2  2 -1.474635
##  5:  1  3  1 -1.388353
##  6:  1  3  2  1.320183
##  7:  2  1  1 -1.165233
##  8:  2  1  2 -1.416931
##  9:  2  2  1 -1.084796
## 10:  2  2  2 -1.474635
## 11:  2  3  1 -1.388353
## 12:  2  3  2  1.320183

benchmark against reshape2's dcast

set.seed(45)
DT <- data.table(aa=sample(1e4, 1e6, TRUE),
      bb=sample(1e3, 1e6, TRUE),
      cc = sample(letters, 1e6, TRUE), dd=runif(1e6))
DT
##            aa  bb cc        dd
##       1: 8741 429  y 0.1985822
##       2: 4427 983  c 0.9916615
##       3: 8416 336  a 0.1117716
##       4: 6695 844  s 0.9301698
##       5: 3130 674  f 0.2739307
##      ---                      
##  999996: 1287 294  n 0.2819233
##  999997: 1913  22  x 0.3740439
##  999998: 6481 676  n 0.1717177
##  999999: 3669 576  k 0.5089813
## 1000000: 1795 703  v 0.5600036
system.time(dcast(DT, aa ~ cc, fun=sum)) # 0.12 seconds
## Using 'dd' as value column. Use 'value.var' to override
##    user  system elapsed 
##   0.567   0.003   0.314
system.time(dcast(DT, bb ~ cc, fun=mean)) # 0.04 seconds
## Using 'dd' as value column. Use 'value.var' to override
##    user  system elapsed 
##   0.243   0.000   0.148
# reshape2::dcast takes 31 seconds
system.time(dcast(DT, aa + bb ~ cc, fun=sum)) # 1.2 seconds
## Using 'dd' as value column. Use 'value.var' to override
##    user  system elapsed 
##   2.558   0.119   1.948
# NEW FEATURE - multiple value.var and multiple fun.aggregate
DT = data.table(x=sample(5,20,TRUE), y=sample(2,20,TRUE),
                z=sample(letters[1:2], 20,TRUE), d1 = runif(20), d2=1L)
DT
##     x y z        d1 d2
##  1: 1 1 b 0.1089563  1
##  2: 4 2 a 0.3218471  1
##  3: 5 1 a 0.3171631  1
##  4: 5 1 b 0.7866574  1
##  5: 1 2 b 0.4110979  1
##  6: 2 1 a 0.1820242  1
##  7: 5 2 b 0.9101196  1
##  8: 1 2 a 0.4628660  1
##  9: 4 1 b 0.5173982  1
## 10: 3 2 b 0.5939504  1
## 11: 1 2 b 0.9338177  1
## 12: 4 2 b 0.7506386  1
## 13: 1 1 b 0.3951595  1
## 14: 4 2 b 0.9816864  1
## 15: 5 2 a 0.8124613  1
## 16: 2 1 b 0.5121996  1
## 17: 3 1 a 0.1758779  1
## 18: 3 2 a 0.8867973  1
## 19: 4 1 a 0.6169361  1
## 20: 5 2 a 0.5777725  1
# multiple value.var
dcast(DT, x + y ~ z, fun=sum, value.var=c("d1","d2"))
##    x y      d1_a      d1_b d2_a d2_b
## 1: 1 1 0.0000000 0.5041157    0    2
## 2: 1 2 0.4628660 1.3449156    1    2
## 3: 2 1 0.1820242 0.5121996    1    1
## 4: 3 1 0.1758779 0.0000000    1    0
## 5: 3 2 0.8867973 0.5939504    1    1
## 6: 4 1 0.6169361 0.5173982    1    1
## 7: 4 2 0.3218471 1.7323249    1    2
## 8: 5 1 0.3171631 0.7866574    1    1
## 9: 5 2 1.3902338 0.9101196    2    1
# multiple fun.aggregate
dcast(DT, x + y ~ z, fun=list(sum, mean), value.var="d1")
##    x y  d1_sum_a  d1_sum_b d1_mean_a d1_mean_b
## 1: 1 1 0.0000000 0.5041157       NaN 0.2520579
## 2: 1 2 0.4628660 1.3449156 0.4628660 0.6724578
## 3: 2 1 0.1820242 0.5121996 0.1820242 0.5121996
## 4: 3 1 0.1758779 0.0000000 0.1758779       NaN
## 5: 3 2 0.8867973 0.5939504 0.8867973 0.5939504
## 6: 4 1 0.6169361 0.5173982 0.6169361 0.5173982
## 7: 4 2 0.3218471 1.7323249 0.3218471 0.8661625
## 8: 5 1 0.3171631 0.7866574 0.3171631 0.7866574
## 9: 5 2 1.3902338 0.9101196 0.6951169 0.9101196
# multiple fun.agg and value.var (all combinations)
dcast(DT, x + y ~ z, fun=list(sum, mean), value.var=c("d1", "d2"))
##    x y  d1_sum_a  d1_sum_b d2_sum_a d2_sum_b d1_mean_a d1_mean_b d2_mean_a
## 1: 1 1 0.0000000 0.5041157        0        2       NaN 0.2520579       NaN
## 2: 1 2 0.4628660 1.3449156        1        2 0.4628660 0.6724578         1
## 3: 2 1 0.1820242 0.5121996        1        1 0.1820242 0.5121996         1
## 4: 3 1 0.1758779 0.0000000        1        0 0.1758779       NaN         1
## 5: 3 2 0.8867973 0.5939504        1        1 0.8867973 0.5939504         1
## 6: 4 1 0.6169361 0.5173982        1        1 0.6169361 0.5173982         1
## 7: 4 2 0.3218471 1.7323249        1        2 0.3218471 0.8661625         1
## 8: 5 1 0.3171631 0.7866574        1        1 0.3171631 0.7866574         1
## 9: 5 2 1.3902338 0.9101196        2        1 0.6951169 0.9101196         1
##    d2_mean_b
## 1:         1
## 2:         1
## 3:         1
## 4:       NaN
## 5:         1
## 6:         1
## 7:         1
## 8:         1
## 9:         1
# multiple fun.agg and value.var (one-to-one)
dcast(DT, x + y ~ z, fun=list(sum, mean), value.var=list("d1", "d2"))
##    x y  d1_sum_a  d1_sum_b d2_mean_a d2_mean_b
## 1: 1 1 0.0000000 0.5041157       NaN         1
## 2: 1 2 0.4628660 1.3449156         1         1
## 3: 2 1 0.1820242 0.5121996         1         1
## 4: 3 1 0.1758779 0.0000000         1       NaN
## 5: 3 2 0.8867973 0.5939504         1         1
## 6: 4 1 0.6169361 0.5173982         1         1
## 7: 4 2 0.3218471 1.7323249         1         1
## 8: 5 1 0.3171631 0.7866574         1         1
## 9: 5 2 1.3902338 0.9101196         1         1

melt

s1 <- "family_id age_mother dob_child1 dob_child2 dob_child3
1         30 1998-11-26 2000-01-29         NA
2         27 1996-06-22         NA         NA
3         26 2002-07-11 2004-04-05 2007-09-02
4         32 2004-10-10 2009-08-27 2012-07-21
5         29 2000-12-05 2005-02-28         NA"
DT <- fread(s1)
DT
##    family_id age_mother dob_child1 dob_child2 dob_child3
## 1:         1         30 1998-11-26 2000-01-29       <NA>
## 2:         2         27 1996-06-22       <NA>       <NA>
## 3:         3         26 2002-07-11 2004-04-05 2007-09-02
## 4:         4         32 2004-10-10 2009-08-27 2012-07-21
## 5:         5         29 2000-12-05 2005-02-28       <NA>

Convert DT to long form where each dob is a separate observation

We could accomplish this using melt() by specifying the id.vars and measure.vars arguments as follows:

DT.m1 = melt(DT, id.vars = c("family_id", "age_mother"), measure.vars = c("dob_child1", "dob_child2", "dob_child3"))
DT.m1
##     family_id age_mother   variable      value
##  1:         1         30 dob_child1 1998-11-26
##  2:         2         27 dob_child1 1996-06-22
##  3:         3         26 dob_child1 2002-07-11
##  4:         4         32 dob_child1 2004-10-10
##  5:         5         29 dob_child1 2000-12-05
##  6:         1         30 dob_child2 2000-01-29
##  7:         2         27 dob_child2       <NA>
##  8:         3         26 dob_child2 2004-04-05
##  9:         4         32 dob_child2 2009-08-27
## 10:         5         29 dob_child2 2005-02-28
## 11:         1         30 dob_child3       <NA>
## 12:         2         27 dob_child3       <NA>
## 13:         3         26 dob_child3 2007-09-02
## 14:         4         32 dob_child3 2012-07-21
## 15:         5         29 dob_child3       <NA>
  • measure.vars specifies the set of columns we would like to collapse (or combine) together.

  • We can also specify column indices instead of names; see the sketch after this list.

  • By default, the variable column is of type factor. Set the variable.factor argument to FALSE if you'd like a character vector instead. The variable.factor argument is only available in melt from data.table and not in the reshape2 package.

  • By default, the molten columns are automatically named variable and value

  • melt preserves column attributes in result
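
A brief sketch of the two points above about column indices and variable.factor, using the family DT just read in (its dob columns sit at positions 3:5):

melt(DT, id.vars = 1:2, measure.vars = 3:5)                    # columns given by position
melt(DT, measure.vars = c("dob_child1", "dob_child2", "dob_child3"),
     variable.factor = FALSE)                                  # 'variable' returned as character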

Name the variable and value columns to child and dob respectively

DT.m1 = melt(DT, measure.vars = c("dob_child1", "dob_child2", "dob_child3"),
             variable.name = "child", value.name = "dob")
DT.m1
##     family_id age_mother      child        dob
##  1:         1         30 dob_child1 1998-11-26
##  2:         2         27 dob_child1 1996-06-22
##  3:         3         26 dob_child1 2002-07-11
##  4:         4         32 dob_child1 2004-10-10
##  5:         5         29 dob_child1 2000-12-05
##  6:         1         30 dob_child2 2000-01-29
##  7:         2         27 dob_child2       <NA>
##  8:         3         26 dob_child2 2004-04-05
##  9:         4         32 dob_child2 2009-08-27
## 10:         5         29 dob_child2 2005-02-28
## 11:         1         30 dob_child3       <NA>
## 12:         2         27 dob_child3       <NA>
## 13:         3         26 dob_child3 2007-09-02
## 14:         4         32 dob_child3 2012-07-21
## 15:         5         29 dob_child3       <NA>
  • By default, when one of id.vars or measure.vars is missing, the rest of the columns are automatically assigned to the missing argument (see the one-line sketch after this list).

  • When neither id.vars nor measure.vars is specified, then, as mentioned under ?melt, all columns that are not of type numeric, integer, or logical will be assigned to id.vars.

In addition, a warning message is issued highlighting the columns that are automatically considered to be id.vars.
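
For example, supplying only id.vars on the family DT lets melt assign the remaining dob columns to measure.vars automatically (a one-line sketch, not assigned to anything):

melt(DT, id.vars = c("family_id", "age_mother"))   # measure.vars guessed as the three dob columns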

dcast(DT.m1, family_id + age_mother ~ child, value.var = "dob")
##    family_id age_mother dob_child1 dob_child2 dob_child3
## 1:         1         30 1998-11-26 2000-01-29       <NA>
## 2:         2         27 1996-06-22       <NA>       <NA>
## 3:         3         26 2002-07-11 2004-04-05 2007-09-02
## 4:         4         32 2004-10-10 2009-08-27 2012-07-21
## 5:         5         29 2000-12-05 2005-02-28       <NA>
  • dcast uses the formula interface. The variables on the LHS of the formula represent the id vars and those on the RHS the measure vars.

  • value.var denotes the column whose values fill the cells while casting to wide format.

  • dcast also tries to preserve attributes in the result wherever possible.

Starting from DT.m1, how can we get the number of children in each family?

You can also pass a function to aggregate by in dcast with the argument fun.aggregate. This is particularly essential when the formula provided does not identify a single observation for each cell.

dcast(DT.m1, family_id ~ . ,fun.aggregate = function(x) sum(!is.na(x)), value.var = "dob")
##    family_id .
## 1:         1 2
## 2:         2 1
## 3:         3 3
## 4:         4 3
## 5:         5 2

Limitations in current melt/dcast approaches

However, there are situations we might run into where the desired operation cannot be expressed in a straightforward manner. For example, consider the data.table shown below:

s2 <- "family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1         30 1998-11-26 2000-01-29         NA             1             2            NA
2         27 1996-06-22         NA         NA             2            NA            NA
3         26 2002-07-11 2004-04-05 2007-09-02             2             2             1
4         32 2004-10-10 2009-08-27 2012-07-21             1             1             1
5         29 2000-12-05 2005-02-28         NA             2             1            NA"
DT <- fread(s2)
DT
##    family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1
## 1:         1         30 1998-11-26 2000-01-29       <NA>             1
## 2:         2         27 1996-06-22       <NA>       <NA>             2
## 3:         3         26 2002-07-11 2004-04-05 2007-09-02             2
## 4:         4         32 2004-10-10 2009-08-27 2012-07-21             1
## 5:         5         29 2000-12-05 2005-02-28       <NA>             2
##    gender_child2 gender_child3
## 1:             2            NA
## 2:            NA            NA
## 3:             2             1
## 4:             1             1
## 5:             1            NA
## 1 = female, 2 = male

And you’d like to combine (melt) all the dob columns together, and gender columns together. Using the current functionality, we can do something like this:

DT.m1 = melt(DT, id = c("family_id", "age_mother"))
## Warning in melt.data.table(DT, id = c("family_id", "age_mother")):
## 'measure.vars' [dob_child1, dob_child2, dob_child3, gender_child1, ...]
## are not all of the same type. By order of hierarchy, the molten data value
## column will be of type 'character'. All measure variables not of type
## 'character' will be coerced too. Check DETAILS in ?melt.data.table for more
## on coercion.
DT.m1
##     family_id age_mother      variable      value
##  1:         1         30    dob_child1 1998-11-26
##  2:         2         27    dob_child1 1996-06-22
##  3:         3         26    dob_child1 2002-07-11
##  4:         4         32    dob_child1 2004-10-10
##  5:         5         29    dob_child1 2000-12-05
##  6:         1         30    dob_child2 2000-01-29
##  7:         2         27    dob_child2       <NA>
##  8:         3         26    dob_child2 2004-04-05
##  9:         4         32    dob_child2 2009-08-27
## 10:         5         29    dob_child2 2005-02-28
## 11:         1         30    dob_child3       <NA>
## 12:         2         27    dob_child3       <NA>
## 13:         3         26    dob_child3 2007-09-02
## 14:         4         32    dob_child3 2012-07-21
## 15:         5         29    dob_child3       <NA>
## 16:         1         30 gender_child1          1
## 17:         2         27 gender_child1          2
## 18:         3         26 gender_child1          2
## 19:         4         32 gender_child1          1
## 20:         5         29 gender_child1          2
## 21:         1         30 gender_child2          2
## 22:         2         27 gender_child2       <NA>
## 23:         3         26 gender_child2          2
## 24:         4         32 gender_child2          1
## 25:         5         29 gender_child2          1
## 26:         1         30 gender_child3       <NA>
## 27:         2         27 gender_child3       <NA>
## 28:         3         26 gender_child3          1
## 29:         4         32 gender_child3          1
## 30:         5         29 gender_child3       <NA>
##     family_id age_mother      variable      value
DT.m1[, c("variable", "child") := tstrsplit(variable, "_", fixed = TRUE)]
DT.m1
##     family_id age_mother variable      value  child
##  1:         1         30      dob 1998-11-26 child1
##  2:         2         27      dob 1996-06-22 child1
##  3:         3         26      dob 2002-07-11 child1
##  4:         4         32      dob 2004-10-10 child1
##  5:         5         29      dob 2000-12-05 child1
##  6:         1         30      dob 2000-01-29 child2
##  7:         2         27      dob       <NA> child2
##  8:         3         26      dob 2004-04-05 child2
##  9:         4         32      dob 2009-08-27 child2
## 10:         5         29      dob 2005-02-28 child2
## 11:         1         30      dob       <NA> child3
## 12:         2         27      dob       <NA> child3
## 13:         3         26      dob 2007-09-02 child3
## 14:         4         32      dob 2012-07-21 child3
## 15:         5         29      dob       <NA> child3
## 16:         1         30   gender          1 child1
## 17:         2         27   gender          2 child1
## 18:         3         26   gender          2 child1
## 19:         4         32   gender          1 child1
## 20:         5         29   gender          2 child1
## 21:         1         30   gender          2 child2
## 22:         2         27   gender       <NA> child2
## 23:         3         26   gender          2 child2
## 24:         4         32   gender          1 child2
## 25:         5         29   gender          1 child2
## 26:         1         30   gender       <NA> child3
## 27:         2         27   gender       <NA> child3
## 28:         3         26   gender          1 child3
## 29:         4         32   gender          1 child3
## 30:         5         29   gender       <NA> child3
##     family_id age_mother variable      value  child
DT.c1 = dcast(DT.m1, family_id + age_mother + child ~ variable, value.var = "value")
DT.c1
##     family_id age_mother  child        dob gender
##  1:         1         30 child1 1998-11-26      1
##  2:         1         30 child2 2000-01-29      2
##  3:         1         30 child3       <NA>   <NA>
##  4:         2         27 child1 1996-06-22      2
##  5:         2         27 child2       <NA>   <NA>
##  6:         2         27 child3       <NA>   <NA>
##  7:         3         26 child1 2002-07-11      2
##  8:         3         26 child2 2004-04-05      2
##  9:         3         26 child3 2007-09-02      1
## 10:         4         32 child1 2004-10-10      1
## 11:         4         32 child2 2009-08-27      1
## 12:         4         32 child3 2012-07-21      1
## 13:         5         29 child1 2000-12-05      2
## 14:         5         29 child2 2005-02-28      1
## 15:         5         29 child3       <NA>   <NA>
  • What we wanted was to combine the dob columns together and the gender columns together, respectively. Instead we are combining everything into a single column first and then splitting it again; it's easy to see that this is quite roundabout (and inefficient).

As an analogy, imagine you have a closet with four shelves of clothes and you'd like to put the clothes from shelves 1 and 2 together (onto shelf 1), and those from shelves 3 and 4 together (onto shelf 3). What we are doing here is more or less combining all the clothes together, and then splitting them back onto shelves 1 and 3!

  • The columns to melt may be of different types, as in this case (character and integer types). By melting them all together, the columns are coerced in the result, as explained by the warning message above and as shown by the output of str(DT.c1), where gender has been converted to character type.

  • We are generating an additional column by splitting the variable column into two columns, whose purpose is quite cryptic. We do it because we need it for casting in the next step.

  • Finally, we cast the data set. But the issue is that casting is a much more computationally involved operation than melting. Specifically, it requires computing the order of the variables in the formula, and that's costly.

In fact, stats::reshape is capable of performing this operation in a very straightforward manner. It is an extremely useful and often underrated function. You should definitely give it a try!
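
For instance, here is a sketch of the same wide-to-long reshape using stats::reshape on the dob/gender DT above; the argument choices (idvar, the timevar name, the as.data.frame conversion) are illustrative, not the only way to call it.

long <- reshape(as.data.frame(DT),
                direction = "long",
                idvar     = c("family_id", "age_mother"),
                varying   = list(c("dob_child1", "dob_child2", "dob_child3"),
                                 c("gender_child1", "gender_child2", "gender_child3")),
                v.names   = c("dob", "gender"),
                timevar   = "child",
                times     = 1:3)   # one row per family and child, like DT.c1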

Enhanced (new) functionality

a) Enhanced melt

Since we'd like data.table to perform this operation in a straightforward and efficient manner using the same interface, we went ahead and implemented additional functionality where we can melt to multiple columns simultaneously.

  • melt multiple columns simultaneously

The idea is quite simple. We pass a list of columns to measure.vars, where each element of the list contains the columns that should be combined together.

DT
##    family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1
## 1:         1         30 1998-11-26 2000-01-29       <NA>             1
## 2:         2         27 1996-06-22       <NA>       <NA>             2
## 3:         3         26 2002-07-11 2004-04-05 2007-09-02             2
## 4:         4         32 2004-10-10 2009-08-27 2012-07-21             1
## 5:         5         29 2000-12-05 2005-02-28       <NA>             2
##    gender_child2 gender_child3
## 1:             2            NA
## 2:            NA            NA
## 3:             2             1
## 4:             1             1
## 5:             1            NA
colA = paste("dob_child", 1:3, sep = "")
colB = paste("gender_child", 1:3, sep = "")
DT.m2 = melt(DT, measure.vars = list(colA, colB), value.name = c("dob", "gender"))
DT.m2
##     family_id age_mother variable        dob gender
##  1:         1         30        1 1998-11-26      1
##  2:         2         27        1 1996-06-22      2
##  3:         3         26        1 2002-07-11      2
##  4:         4         32        1 2004-10-10      1
##  5:         5         29        1 2000-12-05      2
##  6:         1         30        2 2000-01-29      2
##  7:         2         27        2       <NA>     NA
##  8:         3         26        2 2004-04-05      2
##  9:         4         32        2 2009-08-27      1
## 10:         5         29        2 2005-02-28      1
## 11:         1         30        3       <NA>     NA
## 12:         2         27        3       <NA>     NA
## 13:         3         26        3 2007-09-02      1
## 14:         4         32        3 2012-07-21      1
## 15:         5         29        3       <NA>     NA
  • Using patterns()

Usually in these problems, the columns we’d like to melt can be distinguished by a common pattern. We can use the function patterns(), implemented for convenience, to provide regular expressions for the columns to be combined together. The above operation can be rewritten as:

DT.m2 = melt(DT, measure.vars = patterns("^dob", "^gender"), value.name = c("dob", "gender"))
DT.m2 
##     family_id age_mother variable        dob gender
##  1:         1         30        1 1998-11-26      1
##  2:         2         27        1 1996-06-22      2
##  3:         3         26        1 2002-07-11      2
##  4:         4         32        1 2004-10-10      1
##  5:         5         29        1 2000-12-05      2
##  6:         1         30        2 2000-01-29      2
##  7:         2         27        2       <NA>     NA
##  8:         3         26        2 2004-04-05      2
##  9:         4         32        2 2009-08-27      1
## 10:         5         29        2 2005-02-28      1
## 11:         1         30        3       <NA>     NA
## 12:         2         27        3       <NA>     NA
## 13:         3         26        3 2007-09-02      1
## 14:         4         32        3 2012-07-21      1
## 15:         5         29        3       <NA>     NA
  • We can remove the variable column if necessary (see the sketch after this list)

  • The functionality is implemented entirely in C, and is therefore both fast and memory efficient in addition to being straightforward
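
For example, if we did not need the child index any longer, we could drop it by reference (shown commented out, since DT.m2 is reused for casting below):

# DT.m2[, variable := NULL]   # would remove the child-index 'variable' column by reference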

b) Enhanced dcast

We can now melt into multiple columns simultaneously. Given the data set DT.m2 as shown above, how can we get back to the same wide format as the original data we started with?

If we use the current functionality of dcast, then we’d have to cast twice and bind the results together. But that’s once again verbose, not straightforward and is also inefficient.

  • Casting multiple value.vars simultaneously

We can now provide multiple value.var columns to dcast for data.tables directly so that the operations are taken care of internally and efficiently.

DT.c2 = dcast(DT.m2, family_id + age_mother ~ variable, value.var = c("dob", "gender"))
DT.c2
##    family_id age_mother      dob_1      dob_2      dob_3 gender_1 gender_2
## 1:         1         30 1998-11-26 2000-01-29       <NA>        1        2
## 2:         2         27 1996-06-22       <NA>       <NA>        2       NA
## 3:         3         26 2002-07-11 2004-04-05 2007-09-02        2        2
## 4:         4         32 2004-10-10 2009-08-27 2012-07-21        1        1
## 5:         5         29 2000-12-05 2005-02-28       <NA>        2        1
##    gender_3
## 1:       NA
## 2:       NA
## 3:        1
## 4:        1
## 5:       NA
  • Attributes are preserved in the result wherever possible

  • Everything is taken care of internally, and efficiently. In addition to being fast, it is also very memory efficient.