Based on:
“THE #RDATATABLEPACKAGE” by Arun Srinivasan
https://www.youtube.com/watch?v=Z0gZuh5xsuo&feature=youtu.be&t=12s
Benchmark: https://h2oai.github.io/db-benchmark/
library(data.table)
DT <- data.table(id= c(1,1,1,1,2,2,1,2,1),
code = c("c", "b", "c", "c", "a", "a", "b", "a", "c"),
valA = c(1,8,3,6,5,4,7,2,9),
valB = c(12,11,10,13,19,15,16,14,18))
DT[, .(valA=sum(valA)), by=id]
## id valA
## 1: 1 34
## 2: 2 11
For some data.table users, “dot-SD” is a bit of a mystery. But data.table creator Matt Dowle told me that it’s actually quite simple: Just think of it as a symbol representing “each group.” Let’s go through a couple of examples.
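As a quick sketch (reusing the DT defined above), .SD can be read as "the Subset of Data for the current group", i.e. all non-grouping columns of each group:

```r
library(data.table)

DT <- data.table(id   = c(1,1,1,1,2,2,1,2,1),
                 code = c("c","b","c","c","a","a","b","a","c"),
                 valA = c(1,8,3,6,5,4,7,2,9),
                 valB = c(12,11,10,13,19,15,16,14,18))

DT[, print(.SD), by = id]  # .SD is each group's rows, minus the grouping column
DT[, .SD[1],     by = id]  # hence .SD[1] is the first row of each group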
These queries can be chained together just by adding another one on the end:
`DT[...][...]`
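For instance (a sketch using the DT from above), the grouped sum can be filtered by a second, chained query:

```r
library(data.table)

DT <- data.table(id   = c(1,1,1,1,2,2,1,2,1),
                 valA = c(1,8,3,6,5,4,7,2,9))

# the first [...] aggregates; the chained [...] filters the aggregated result
DT[, .(valA = sum(valA)), by = id][valA > 20]
```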
See data.table compared to dplyr on Stack Overflow and Quora.
DT[2] # 2nd row
## id code valA valB
## 1: 1 b 8 11
DT[2:3] # 2nd and 3rd rows
## id code valA valB
## 1: 1 b 8 11
## 2: 1 c 3 10
w = 2:3; DT[w] # same
## id code valA valB
## 1: 1 b 8 11
## 2: 1 c 3 10
DT[order(valA)]
## id code valA valB
## 1: 1 c 1 12
## 2: 2 a 2 14
## 3: 1 c 3 10
## 4: 2 a 4 15
## 5: 2 a 5 19
## 6: 1 c 6 13
## 7: 1 b 7 16
## 8: 1 b 8 11
## 9: 1 c 9 18
DT[order(valA),] # same; the ',' is optional
## id code valA valB
## 1: 1 c 1 12
## 2: 2 a 2 14
## 3: 1 c 3 10
## 4: 2 a 4 15
## 5: 2 a 5 19
## 6: 1 c 6 13
## 7: 1 b 7 16
## 8: 1 b 8 11
## 9: 1 c 9 18
DT[valB > 15] # all rows where DT$valB > 15
## id code valA valB
## 1: 2 a 5 19
## 2: 1 b 7 16
## 3: 1 c 9 18
DT[valA > 7 & valB >15] # compound logical expressions
## id code valA valB
## 1: 1 c 9 18
DT[!2:4] # all rows other than 2:4
## id code valA valB
## 1: 1 c 1 12
## 2: 2 a 5 19
## 3: 2 a 4 15
## 4: 1 b 7 16
## 5: 2 a 2 14
## 6: 1 c 9 18
DT[-(2:4)] # same
## id code valA valB
## 1: 1 c 1 12
## 2: 2 a 5 19
## 3: 2 a 4 15
## 4: 1 b 7 16
## 5: 2 a 2 14
## 6: 1 c 9 18
DT[, valA] # valA column (as vector)
## [1] 1 8 3 6 5 4 7 2 9
DT[, list(valA)] # valA column (as data.table)
## valA
## 1: 1
## 2: 8
## 3: 3
## 4: 6
## 5: 5
## 6: 4
## 7: 7
## 8: 2
## 9: 9
DT[, .(valA)] # same; .() is an alias for list()
## valA
## 1: 1
## 2: 8
## 3: 3
## 4: 6
## 5: 5
## 6: 4
## 7: 7
## 8: 2
## 9: 9
DT[, sum(valA)] # sum of column valA, returned as vector
## [1] 45
DT[, .(sum(valA))] # same but return data.table
## V1
## 1: 45
DT[, .(sv = sum(valA))] # same but name column "sv"
## sv
## 1: 45
DT[, .(valA, valA*2)] # return 2 column data.table
## valA V2
## 1: 1 2
## 2: 8 16
## 3: 3 6
## 4: 6 12
## 5: 5 10
## 6: 4 8
## 7: 7 14
## 8: 2 4
## 9: 9 18
DT[2:3, sum(valA)] # sum(valA) over rows 2 and 3
## [1] 11
DT[2:3, .(sum(valA))] # same, but return data.table
## V1
## 1: 11
DT[2:3, .(sv = sum(valA))] # same, but name column "sv"
## sv
## 1: 11
DT[2:5, cat(valA, "\n")] # just for j's side effect
## 8 3 6 5
## NULL
DT[, 2] # 2nd column, a data.table always
## code
## 1: c
## 2: b
## 3: c
## 4: c
## 5: a
## 6: a
## 7: b
## 8: a
## 9: c
colNum = 2
DT[, ..colNum] # same as DT[, 2]; the ..prefix looks 'colNum' up one level (in the calling scope)
## code
## 1: c
## 2: b
## 3: c
## 4: c
## 5: a
## 6: a
## 7: b
## 8: a
## 9: c
DT[["valA"]] # same as DT[, valA] but lower overhead
## [1] 1 8 3 6 5 4 7 2 9
DT[, sum(valA), by=valB] # appearance order of groups preserved
## valB V1
## 1: 12 1
## 2: 11 8
## 3: 10 3
## 4: 13 6
## 5: 19 5
## 6: 15 4
## 7: 16 7
## 8: 14 2
## 9: 18 9
DT[, sum(valA), keyby=valB] # order the result by group
## valB V1
## 1: 10 3
## 2: 11 8
## 3: 12 1
## 4: 13 6
## 5: 14 2
## 6: 15 4
## 7: 16 7
## 8: 18 9
## 9: 19 5
DT[, sum(valA), by=valB][order(valB)] # same by chaining expression together
## valB V1
## 1: 10 3
## 2: 11 8
## 3: 12 1
## 4: 13 6
## 5: 14 2
## 6: 15 4
## 7: 16 7
## 8: 18 9
## 9: 19 5
DT["a", on="code"] # same as code == "a" but uses key (fast)
## id code valA valB
## 1: 2 a 5 19
## 2: 2 a 4 15
## 3: 2 a 2 14
DT["a", on=.(code)] # same
## id code valA valB
## 1: 2 a 5 19
## 2: 2 a 4 15
## 3: 2 a 2 14
DT[.("a"), on="code"] # same
## id code valA valB
## 1: 2 a 5 19
## 2: 2 a 4 15
## 3: 2 a 2 14
DT[.("a"), on=.(code)] # same
## id code valA valB
## 1: 2 a 5 19
## 2: 2 a 4 15
## 3: 2 a 2 14
DT[code=="a"] # same; == is internally optimized
## id code valA valB
## 1: 2 a 5 19
## 2: 2 a 4 15
## 3: 2 a 2 14
DT[code != "b" | valA != 3] # not yet optimized
## id code valA valB
## 1: 1 c 1 12
## 2: 1 b 8 11
## 3: 1 c 3 10
## 4: 1 c 6 13
## 5: 2 a 5 19
## 6: 2 a 4 15
## 7: 1 b 7 16
## 8: 2 a 2 14
## 9: 1 c 9 18
DT[.("b", 8), on=c("code", "valA")] # same as DT[code == "b" & valA == 8]
## id code valA valB
## 1: 1 b 8 11
DT[.("b", 8), on=.(code, valA)] # same
## id code valA valB
## 1: 1 b 8 11
DT[.("b", 7:9), on=c("code", "valA")] # no match returns NA
## id code valA valB
## 1: 1 b 7 16
## 2: 1 b 8 11
## 3: NA b 9 NA
DT[.("b", 7:9), on=.(code, valA), nomatch=0] # no match row is not returned
## id code valA valB
## 1: 1 b 7 16
## 2: 1 b 8 11
DT[.("b", 7:9), on=c("code", "valA"), roll=Inf] # locf, previous row rolls forward
## id code valA valB
## 1: 1 b 7 16
## 2: 1 b 8 11
## 3: 1 b 9 11
DT[.("b", 7:9), on=.(code, valA), roll=-Inf] # nocb, next row rolls backward
## id code valA valB
## 1: 1 b 7 16
## 2: 1 b 8 11
## 3: NA b 9 NA
DT["b", sum(valA*valB), on="code"] # same as DT[code=="b", sum(valA*valB)]
## [1] 200
DT[code != "a", sum(valA), by = id] # sum(valA) by id, over rows where code != "a"
## id V1
## 1: 1 34
DT[!"a", sum(valA), by=.EACHI, on="code"] # same, but using subsets-as-joins
## code V1
## 1: c 19
DT[c("b","c"), sum(valA), by=.EACHI, on="code"] # same
## code V1
## 1: b 15
## 2: c 19
DT[c("b","c"), sum(valA), by=.EACHI, on=.(code)] # same, using on=.()
## code V1
## 1: b 15
## 2: c 19
X = data.table(x=c("c", "b"), v=8:7, foo=c(4,2))
X
## x v foo
## 1: c 8 4
## 2: b 7 2
DT <- data.table(id= c(1,1,1,1,2,2,1,2,1),
x = c("c", "b", "c", "c", "a", "a", "b", "a", "c"),
valA = c(1,8,3,6,5,4,7,2,9),
valB = c(12,11,10,13,19,15,16,14,18))
DT[X, on="x"] # right join
## id x valA valB v foo
## 1: 1 c 1 12 8 4
## 2: 1 c 3 10 8 4
## 3: 1 c 6 13 8 4
## 4: 1 c 9 18 8 4
## 5: 1 b 8 11 7 2
## 6: 1 b 7 16 7 2
X[DT, on="x"] # left join
## x v foo id valA valB
## 1: c 8 4 1 1 12
## 2: b 7 2 1 8 11
## 3: c 8 4 1 3 10
## 4: c 8 4 1 6 13
## 5: a NA NA 2 5 19
## 6: a NA NA 2 4 15
## 7: b 7 2 1 7 16
## 8: a NA NA 2 2 14
## 9: c 8 4 1 9 18
DT[X, on="x", nomatch=0] # inner join
## id x valA valB v foo
## 1: 1 c 1 12 8 4
## 2: 1 c 3 10 8 4
## 3: 1 c 6 13 8 4
## 4: 1 c 9 18 8 4
## 5: 1 b 8 11 7 2
## 6: 1 b 7 16 7 2
DT[!X, on="x"] # not join
## id x valA valB
## 1: 2 a 5 19
## 2: 2 a 4 15
## 3: 2 a 2 14
DT[X, on=c(valA="v")] # join DT$valA to X$v
## id x valA valB i.x foo
## 1: 1 b 8 11 c 4
## 2: 1 b 7 16 b 2
DT[X, on="valA==v"] # same
## id x valA valB i.x foo
## 1: 1 b 8 11 c 4
## 2: 1 b 7 16 b 2
# DT[X, on=.(y<=foo)] # non-equi join
# DT[X, on="y<=foo"] # same
# DT[X, on=c("y<=foo")] # same
# DT[X, on=.(y>=foo)] # non-equi join
# DT[X, on=.(x, y<=foo)] # non-equi join
# DT[X, .(x,y,x.y,v), on=.(x, y>=foo)] # select x's join columns as well
DT[X, on="x", mult="first"] # first row of each group
## id x valA valB v foo
## 1: 1 c 1 12 8 4
## 2: 1 b 8 11 7 2
DT[X, on="x", mult="last"] # last row of each group
## id x valA valB v foo
## 1: 1 c 9 18 8 4
## 2: 1 b 7 16 7 2
DT[X, sum(v), by=.EACHI, on="x"] # join and eval j for each row in i
## x V1
## 1: c 8
## 2: b 7
DT[X, sum(v)*foo, by=.EACHI, on="x"] # join inherited scope
## x V1
## 1: c 32
## 2: b 14
DT[X, sum(v)*i.v, by=.EACHI, on="x"] # 'i.v' refers to X's v column
## x V1
## 1: c 64
## 2: b 49
#DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI] # non-equi join with by=.EACHI
kDT = copy(DT) # copy DT to kDT to work with it
setkey(kDT, x) # set a 1-column key
setkeyv(kDT, "x") # same (v in setkeyv stands for vector)
v="x"
setkeyv(kDT, v) # same
haskey(kDT) # TRUE
## [1] TRUE
key(kDT)
## [1] "x"
kDT["a"] # subset-as-join on *key* column 'x'
## id x valA valB
## 1: 2 a 5 19
## 2: 2 a 4 15
## 3: 2 a 2 14
kDT["a", on="x"] # same, being explicit using 'on='
## id x valA valB
## 1: 2 a 5 19
## 2: 2 a 4 15
## 3: 2 a 2 14
kDT[!"a", sum(valA), by=.EACHI] # get sum(valA) for each group where x != "a"
## x V1
## 1: b 15
## 2: c 19
setkey(kDT,x,id) # 2-column key
setkeyv(kDT,c("x","id")) # same
kDT["a"] # join to 1st column of key
## id x valA valB
## 1: 2 a 5 19
## 2: 2 a 4 15
## 3: 2 a 2 14
kDT["a", on="x"] # on= is optional but preferred
## id x valA valB
## 1: 2 a 5 19
## 2: 2 a 4 15
## 3: 2 a 2 14
kDT[.("a")] # same; .() is an alias for list()
## id x valA valB
## 1: 2 a 5 19
## 2: 2 a 4 15
## 3: 2 a 2 14
kDT[list("a")] # same
## id x valA valB
## 1: 2 a 5 19
## 2: 2 a 4 15
## 3: 2 a 2 14
kDT[.("a", 3)] # join to 2 columns
## id x valA valB
## 1: 3 a NA NA
kDT[.("a", 3:6)] # join 4 rows (2 missing)
## id x valA valB
## 1: 3 a NA NA
## 2: 4 a NA NA
## 3: 5 a NA NA
## 4: 6 a NA NA
kDT[.("a", 3:6), nomatch=0] # remove missing
## Empty data.table (0 rows and 4 cols): id,x,valA,valB
kDT[.("a", 3:6), roll=TRUE] # locf rolling join
## id x valA valB
## 1: 3 a 2 14
## 2: 4 a 2 14
## 3: 5 a 2 14
## 4: 6 a 2 14
kDT[.("a", 3:6), roll=Inf] # same
## id x valA valB
## 1: 3 a 2 14
## 2: 4 a 2 14
## 3: 5 a 2 14
## 4: 6 a 2 14
kDT[.("a", 3:6), roll=-Inf] # nocb rolling join
## id x valA valB
## 1: 3 a NA NA
## 2: 4 a NA NA
## 3: 5 a NA NA
## 4: 6 a NA NA
kDT[!.("a")] # not join
## id x valA valB
## 1: 1 b 8 11
## 2: 1 b 7 16
## 3: 1 c 1 12
## 4: 1 c 3 10
## 5: 1 c 6 13
## 6: 1 c 9 18
kDT[!"a"] # same
## id x valA valB
## 1: 1 b 8 11
## 2: 1 b 7 16
## 3: 1 c 1 12
## 4: 1 c 3 10
## 5: 1 c 6 13
## 6: 1 c 9 18
DT[.N] # last row
## id x valA valB
## 1: 1 c 9 18
DT[, .N] # total number of rows in DT
## [1] 9
DT[, .N, by=x] # number of rows in each group
## x N
## 1: c 4
## 2: b 2
## 3: a 3
DT[, .SD, .SDcols=valA:valB] # select columns valA through valB
## valA valB
## 1: 1 12
## 2: 8 11
## 3: 3 10
## 4: 6 13
## 5: 5 19
## 6: 4 15
## 7: 7 16
## 8: 2 14
## 9: 9 18
DT[, .SD[1]] # first row; same as DT[1,]
## id x valA valB
## 1: 1 c 1 12
DT[, .SD[c(1,3)]] # 1st and 3rd rows
## id x valA valB
## 1: 1 c 1 12
## 2: 1 c 3 10
DT[, .SD[1:3]] # rows 1 through 3
## id x valA valB
## 1: 1 c 1 12
## 2: 1 b 8 11
## 3: 1 c 3 10
DT[, .SD[1], by=x] # first row of each group
## x id valA valB
## 1: c 1 1 12
## 2: b 1 8 11
## 3: a 2 5 19
DT[, c(.N, lapply(.SD, sum)), by=x] # group size alongside sum
## x N id valA valB
## 1: c 4 4 19 53
## 2: b 2 2 15 27
## 3: a 3 6 11 48
DT[, .I[1], by=x] # row number of first row of each group
## x V1
## 1: c 1
## 2: b 2
## 3: a 5
DT[, grp := .GRP, by=x] # add a group counter column
#X[, DT[.BY, y, on="x"], by=x] # join within group to use less ram
print(DT[, z:=42L]) # add new column by reference
## id x valA valB grp z
## 1: 1 c 1 12 1 42
## 2: 1 b 8 11 2 42
## 3: 1 c 3 10 1 42
## 4: 1 c 6 13 1 42
## 5: 2 a 5 19 3 42
## 6: 2 a 4 15 3 42
## 7: 1 b 7 16 2 42
## 8: 2 a 2 14 3 42
## 9: 1 c 9 18 1 42
print(DT[, z:=NULL]) # remove column by reference
## id x valA valB grp
## 1: 1 c 1 12 1
## 2: 1 b 8 11 2
## 3: 1 c 3 10 1
## 4: 1 c 6 13 1
## 5: 2 a 5 19 3
## 6: 2 a 4 15 3
## 7: 1 b 7 16 2
## 8: 2 a 2 14 3
## 9: 1 c 9 18 1
print(DT["a", v:=42L, on="x"]) # subassign to column
## id x valA valB grp v
## 1: 1 c 1 12 1 NA
## 2: 1 b 8 11 2 NA
## 3: 1 c 3 10 1 NA
## 4: 1 c 6 13 1 NA
## 5: 2 a 5 19 3 42
## 6: 2 a 4 15 3 42
## 7: 1 b 7 16 2 NA
## 8: 2 a 2 14 3 42
## 9: 1 c 9 18 1 NA
print(DT["b", v2:=84L, on="x"]) # subassign to new column (NA padded)
## id x valA valB grp v v2
## 1: 1 c 1 12 1 NA NA
## 2: 1 b 8 11 2 NA 84
## 3: 1 c 3 10 1 NA NA
## 4: 1 c 6 13 1 NA NA
## 5: 2 a 5 19 3 42 NA
## 6: 2 a 4 15 3 42 NA
## 7: 1 b 7 16 2 NA 84
## 8: 2 a 2 14 3 42 NA
## 9: 1 c 9 18 1 NA NA
DT[, m:=mean(v), by=x][] # add new column by reference by group
## id x valA valB grp v v2 m
## 1: 1 c 1 12 1 NA NA NA
## 2: 1 b 8 11 2 NA 84 NA
## 3: 1 c 3 10 1 NA NA NA
## 4: 1 c 6 13 1 NA NA NA
## 5: 2 a 5 19 3 42 NA 42
## 6: 2 a 4 15 3 42 NA 42
## 7: 1 b 7 16 2 NA 84 NA
## 8: 2 a 2 14 3 42 NA 42
## 9: 1 c 9 18 1 NA NA NA
# postfix [] is shortcut to print()
#DT[, sum(v), by=.(y%%2)] # expressions in by
#DT[, sum(v), by=.(bool = y%%2)] # same with name for group expression
DT[, .SD[2], by=x] # get 2nd row of each group
## x id valA valB grp v v2 m
## 1: c 1 3 10 1 NA NA NA
## 2: b 1 7 16 2 NA 84 NA
## 3: a 2 4 15 3 42 NA 42
DT[, tail(.SD,2), by=x] # last 2 rows of each group
## x id valA valB grp v v2 m
## 1: c 1 6 13 1 NA NA NA
## 2: c 1 9 18 1 NA NA NA
## 3: b 1 8 11 2 NA 84 NA
## 4: b 1 7 16 2 NA 84 NA
## 5: a 2 4 15 3 42 NA 42
## 6: a 2 2 14 3 42 NA 42
DT[, lapply(.SD, sum), by=x] # sum of all columns for each group
## x id valA valB grp v v2 m
## 1: c 4 19 53 4 NA NA NA
## 2: b 2 15 27 4 NA 168 NA
## 3: a 6 11 48 9 126 NA 126
DT[, .SD[which.min(v)], by=x] # nested query by group
## x id valA valB grp v v2 m
## 1: a 2 5 19 3 42 NA 42
#DT[, list(MySum=sum(v),
# MyMin=min(v),
# MyMax=max(v)),
# by=.(x, y%%2)] # by 2 expressions
DT[, .(valA = .(valA), valB = .(valB)), by=x] # list columns
## x valA valB
## 1: c 1,3,6,9 12,10,13,18
## 2: b 8,7 11,16
## 3: a 5,4,2 19,15,14
DT[, .(seq = min(valA):max(valB)), by=x] # j is not limited to just aggregations
## x seq
## 1: c 1
## 2: c 2
## 3: c 3
## 4: c 4
## 5: c 5
## 6: c 6
## 7: c 7
## 8: c 8
## 9: c 9
## 10: c 10
## 11: c 11
## 12: c 12
## 13: c 13
## 14: c 14
## 15: c 15
## 16: c 16
## 17: c 17
## 18: c 18
## 19: b 7
## 20: b 8
## 21: b 9
## 22: b 10
## 23: b 11
## 24: b 12
## 25: b 13
## 26: b 14
## 27: b 15
## 28: b 16
## 29: a 2
## 30: a 3
## 31: a 4
## 32: a 5
## 33: a 6
## 34: a 7
## 35: a 8
## 36: a 9
## 37: a 10
## 38: a 11
## 39: a 12
## 40: a 13
## 41: a 14
## 42: a 15
## 43: a 16
## 44: a 17
## 45: a 18
## 46: a 19
## x seq
DT[, sum(valA), by=x][V1<20] # compound query
## x V1
## 1: c 19
## 2: b 15
## 3: a 11
DT[, sum(valA), by=x][order(-V1)] # ordering results
## x V1
## 1: c 19
## 2: b 15
## 3: a 11
DT[, c(.N, lapply(.SD,sum)), by=x] # group size and sums by group
## x N id valA valB grp v v2 m
## 1: c 4 4 19 53 4 NA NA NA
## 2: b 2 2 15 27 4 NA 168 NA
## 3: a 3 6 11 48 9 126 NA 126
DT[, {tmp <- mean(id); # anonymous lambda in 'j'; j any valid
.(a = valA-tmp, b = valB-tmp) # expression where every element
}, by=x] # becomes a column in result
## x a b
## 1: c 0 11
## 2: c 2 9
## 3: c 5 12
## 4: c 8 17
## 5: b 7 10
## 6: b 6 15
## 7: a 3 17
## 8: a 2 13
## 9: a 0 12
pdf("new.pdf")
DT[, plot(valA, valB), by=x] # can also plot in 'j'
## Empty data.table (0 rows and 1 cols): x
dev.off()
## png
## 2
# get max(y) and min of a set of columns for each consecutive run of 'v'
#DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b]
Description
dcast.data.table is a much faster version of reshape2::dcast, but for data.tables. More importantly, it is capable of handling very large data quite efficiently in terms of memory usage in comparison to reshape2::dcast.
The cast formula takes the form LHS ~ RHS, e.g. var1 + var2 ~ var3. The order of entries in the formula is essential. There are two special variables: . and ... . Here . represents no variable, while ... represents all variables not otherwise mentioned in the formula; see Examples.
dcast also allows value.var columns of type list.
When the variable combinations in the formula do not identify a unique value per cell, fun.aggregate must be specified; it defaults to length if left unspecified. The aggregating function should take a vector as input and return a single value (or a list of length one) as output. In cases where value.var is a list, the function should be able to handle a list input and return a single value or a list of length one as output.
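A minimal sketch of this behaviour on hypothetical toy data: when a cell maps to several rows, dcast falls back to length() (with a message), while supplying fun.aggregate makes the aggregation explicit:

```r
library(data.table)

DT <- data.table(x = c(1, 1, 1, 2),
                 y = c("a", "a", "b", "b"),
                 v = c(10, 20, 30, 40))

# two rows share x == 1, y == "a", so cells are not unique
dcast(DT, x ~ y, value.var = "v")                       # defaults to length(), with a message
dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum)  # explicit aggregation
```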
If the formula's LHS contains the same column more than once, e.g. dcast(DT, x + x ~ y), then the answer will have duplicate names. In those cases, the duplicate names are renamed using make.unique so that a key can be set without issues.
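As a small sketch of this renaming, on toy data (not from the original):

```r
library(data.table)

DT <- data.table(x = c(1, 2), y = c("a", "b"), v = c(10, 20))

# 'x' appears twice on the LHS; per the behaviour described above, dcast
# deduplicates the resulting id columns via make.unique(): "x", "x.1"
res <- dcast(DT, x + x ~ y, value.var = "v")
names(res)
```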
Names for columns that are being cast are generated in the same order (separated by an underscore, _) from the (unique) values in each column mentioned in the formula RHS.
names(ChickWeight) <- tolower(names(ChickWeight))
DT <- melt(as.data.table(ChickWeight), id = 2:4)
dcast(DT, time~variable, fun = mean)
## time weight
## 1: 0 41.06000
## 2: 2 49.22000
## 3: 4 59.95918
## 4: 6 74.30612
## 5: 8 91.24490
## 6: 10 107.83673
## 7: 12 129.24490
## 8: 14 143.81250
## 9: 16 168.08511
## 10: 18 190.19149
## 11: 20 209.71739
## 12: 21 218.68889
dcast(DT, diet~variable, fun = mean)
## diet weight
## 1: 1 102.6455
## 2: 2 122.6167
## 3: 3 142.9500
## 4: 4 135.2627
dcast(DT, diet+chick ~ time, drop = FALSE)
## diet chick 0 2 4 6 8 10 12 14 16 18 20 21
## 1: 1 18 39 35 NA NA NA NA NA NA NA NA NA NA
## 2: 1 16 41 45 49 51 57 51 54 NA NA NA NA NA
## 3: 1 15 41 49 56 64 68 68 67 68 NA NA NA NA
## 4: 1 13 41 48 53 60 65 67 71 70 71 81 91 96
## 5: 1 9 42 51 59 68 85 96 90 92 93 100 100 98
## ---
## 196: 4 49 40 53 64 85 108 128 152 166 184 203 233 237
## 197: 4 46 40 52 62 82 101 120 144 156 173 210 231 238
## 198: 4 50 41 54 67 84 105 122 155 175 205 234 264 264
## 199: 4 42 42 49 63 84 103 126 160 174 204 234 269 281
## 200: 4 48 39 50 62 80 104 125 154 170 222 261 303 322
dcast(DT, diet + chick ~ time, drop = FALSE, fill = 0)
## diet chick 0 2 4 6 8 10 12 14 16 18 20 21
## 1: 1 18 39 35 0 0 0 0 0 0 0 0 0 0
## 2: 1 16 41 45 49 51 57 51 54 0 0 0 0 0
## 3: 1 15 41 49 56 64 68 68 67 68 0 0 0 0
## 4: 1 13 41 48 53 60 65 67 71 70 71 81 91 96
## 5: 1 9 42 51 59 68 85 96 90 92 93 100 100 98
## ---
## 196: 4 49 40 53 64 85 108 128 152 166 184 203 233 237
## 197: 4 46 40 52 62 82 101 120 144 156 173 210 231 238
## 198: 4 50 41 54 67 84 105 122 155 175 205 234 264 264
## 199: 4 42 42 49 63 84 103 126 160 174 204 234 269 281
## 200: 4 48 39 50 62 80 104 125 154 170 222 261 303 322
Using subset
dcast(DT, chick ~ time, fun = mean, subset = .(time < 10 & chick < 20))
## chick 0 2 4 6 8
## 1: 18 39 35 NaN NaN NaN
## 2: 16 41 45 49 51 57
## 3: 15 41 49 56 64 68
## 4: 13 41 48 53 60 65
## 5: 9 42 51 59 68 85
The drop argument
DT <- data.table(v1 = c(1.1, 1.1, 1.1, 2.2, 2.2, 2.2),
v2 = factor(c(1L, 1L, 1L, 3L, 3L, 3L), levels=1:3),
v3 = factor(c(2L, 3L, 5L, 1L, 2L, 6L), levels=1:6),
v4 = c(3L, 2L, 2L, 5L, 4L, 3L))
DT
## v1 v2 v3 v4
## 1: 1.1 1 2 3
## 2: 1.1 1 3 2
## 3: 1.1 1 5 2
## 4: 2.2 3 1 5
## 5: 2.2 3 2 4
## 6: 2.2 3 6 3
drop=TRUE
dcast(DT, v1 + v2 ~ v3) # default is drop = TRUE
## Using 'v4' as value column. Use 'value.var' to override
## v1 v2 1 2 3 5 6
## 1: 1.1 1 NA 3 2 2 NA
## 2: 2.2 3 5 4 NA NA 3
dcast(DT, v1 + v2 ~ v3, drop = FALSE) # all missing combination of both LHS and RHS
## Using 'v4' as value column. Use 'value.var' to override
## v1 v2 1 2 3 4 5 6
## 1: 1.1 1 NA 3 2 NA 2 NA
## 2: 1.1 2 NA NA NA NA NA NA
## 3: 1.1 3 NA NA NA NA NA NA
## 4: 2.2 1 NA NA NA NA NA NA
## 5: 2.2 2 NA NA NA NA NA NA
## 6: 2.2 3 5 4 NA NA NA 3
dcast(DT, v1 + v2 ~ v3, drop = c(FALSE, TRUE)) # all missing combination of only LHS
## Using 'v4' as value column. Use 'value.var' to override
## v1 v2 1 2 3 5 6
## 1: 1.1 1 NA 3 2 2 NA
## 2: 1.1 2 NA NA NA NA NA
## 3: 1.1 3 NA NA NA NA NA
## 4: 2.2 1 NA NA NA NA NA
## 5: 2.2 2 NA NA NA NA NA
## 6: 2.2 3 5 4 NA NA 3
dcast(DT, v1 + v2 ~ v3, drop = c(TRUE, FALSE)) # all missing combination of only RHS
## Using 'v4' as value column. Use 'value.var' to override
## v1 v2 1 2 3 4 5 6
## 1: 1.1 1 NA 3 2 NA 2 NA
## 2: 2.2 3 5 4 NA NA NA 3
Using . and ...
DT <- data.table(v1 = rep(1:2, each = 6),
v2 = rep(rep(1:3, 2), each = 2),
v3 = rep(1:2, 6),
v4 = rnorm(6))
DT
## v1 v2 v3 v4
## 1: 1 1 1 -1.165233
## 2: 1 1 2 -1.416931
## 3: 1 2 1 -1.084796
## 4: 1 2 2 -1.474635
## 5: 1 3 1 -1.388353
## 6: 1 3 2 1.320183
## 7: 2 1 1 -1.165233
## 8: 2 1 2 -1.416931
## 9: 2 2 1 -1.084796
## 10: 2 2 2 -1.474635
## 11: 2 3 1 -1.388353
## 12: 2 3 2 1.320183
dcast(DT, ... ~ v3, value.var = "v4") #same as v1 + v2 ~ v3, value.var = "v4"
## v1 v2 1 2
## 1: 1 1 -1.165233 -1.416931
## 2: 1 2 -1.084796 -1.474635
## 3: 1 3 -1.388353 1.320183
## 4: 2 1 -1.165233 -1.416931
## 5: 2 2 -1.084796 -1.474635
## 6: 2 3 -1.388353 1.320183
dcast(DT, v1 + v2 + v3 ~ ., value.var = "v4")
## v1 v2 v3 .
## 1: 1 1 1 -1.165233
## 2: 1 1 2 -1.416931
## 3: 1 2 1 -1.084796
## 4: 1 2 2 -1.474635
## 5: 1 3 1 -1.388353
## 6: 1 3 2 1.320183
## 7: 2 1 1 -1.165233
## 8: 2 1 2 -1.416931
## 9: 2 2 1 -1.084796
## 10: 2 2 2 -1.474635
## 11: 2 3 1 -1.388353
## 12: 2 3 2 1.320183
Benchmark against reshape2's dcast, minimum of 3 runs
set.seed(45)
DT <- data.table(aa=sample(1e4, 1e6, TRUE),
bb=sample(1e3, 1e6, TRUE),
cc = sample(letters, 1e6, TRUE), dd=runif(1e6))
DT
## aa bb cc dd
## 1: 8741 429 y 0.1985822
## 2: 4427 983 c 0.9916615
## 3: 8416 336 a 0.1117716
## 4: 6695 844 s 0.9301698
## 5: 3130 674 f 0.2739307
## ---
## 999996: 1287 294 n 0.2819233
## 999997: 1913 22 x 0.3740439
## 999998: 6481 676 n 0.1717177
## 999999: 3669 576 k 0.5089813
## 1000000: 1795 703 v 0.5600036
system.time(dcast(DT, aa ~ cc, fun=sum)) # 0.12 seconds
## Using 'dd' as value column. Use 'value.var' to override
## user system elapsed
## 0.567 0.003 0.314
system.time(dcast(DT, bb ~ cc, fun=mean)) # 0.04 seconds
## Using 'dd' as value column. Use 'value.var' to override
## user system elapsed
## 0.243 0.000 0.148
# reshape2::dcast takes 31 seconds
system.time(dcast(DT, aa + bb ~ cc, fun=sum)) # 1.2 seconds
## Using 'dd' as value column. Use 'value.var' to override
## user system elapsed
## 2.558 0.119 1.948
# NEW FEATURE - multiple value.var and multiple fun.aggregate
DT = data.table(x=sample(5,20,TRUE), y=sample(2,20,TRUE),
z=sample(letters[1:2], 20,TRUE), d1 = runif(20), d2=1L)
DT
## x y z d1 d2
## 1: 1 1 b 0.1089563 1
## 2: 4 2 a 0.3218471 1
## 3: 5 1 a 0.3171631 1
## 4: 5 1 b 0.7866574 1
## 5: 1 2 b 0.4110979 1
## 6: 2 1 a 0.1820242 1
## 7: 5 2 b 0.9101196 1
## 8: 1 2 a 0.4628660 1
## 9: 4 1 b 0.5173982 1
## 10: 3 2 b 0.5939504 1
## 11: 1 2 b 0.9338177 1
## 12: 4 2 b 0.7506386 1
## 13: 1 1 b 0.3951595 1
## 14: 4 2 b 0.9816864 1
## 15: 5 2 a 0.8124613 1
## 16: 2 1 b 0.5121996 1
## 17: 3 1 a 0.1758779 1
## 18: 3 2 a 0.8867973 1
## 19: 4 1 a 0.6169361 1
## 20: 5 2 a 0.5777725 1
# multiple value.var
dcast(DT, x + y ~ z, fun=sum, value.var=c("d1","d2"))
## x y d1_a d1_b d2_a d2_b
## 1: 1 1 0.0000000 0.5041157 0 2
## 2: 1 2 0.4628660 1.3449156 1 2
## 3: 2 1 0.1820242 0.5121996 1 1
## 4: 3 1 0.1758779 0.0000000 1 0
## 5: 3 2 0.8867973 0.5939504 1 1
## 6: 4 1 0.6169361 0.5173982 1 1
## 7: 4 2 0.3218471 1.7323249 1 2
## 8: 5 1 0.3171631 0.7866574 1 1
## 9: 5 2 1.3902338 0.9101196 2 1
# multiple fun.aggregate
dcast(DT, x + y ~ z, fun=list(sum, mean), value.var="d1")
## x y d1_sum_a d1_sum_b d1_mean_a d1_mean_b
## 1: 1 1 0.0000000 0.5041157 NaN 0.2520579
## 2: 1 2 0.4628660 1.3449156 0.4628660 0.6724578
## 3: 2 1 0.1820242 0.5121996 0.1820242 0.5121996
## 4: 3 1 0.1758779 0.0000000 0.1758779 NaN
## 5: 3 2 0.8867973 0.5939504 0.8867973 0.5939504
## 6: 4 1 0.6169361 0.5173982 0.6169361 0.5173982
## 7: 4 2 0.3218471 1.7323249 0.3218471 0.8661625
## 8: 5 1 0.3171631 0.7866574 0.3171631 0.7866574
## 9: 5 2 1.3902338 0.9101196 0.6951169 0.9101196
# multiple fun.agg and value.var (all combinations)
dcast(DT, x + y ~ z, fun=list(sum, mean), value.var=c("d1", "d2"))
## x y d1_sum_a d1_sum_b d2_sum_a d2_sum_b d1_mean_a d1_mean_b d2_mean_a
## 1: 1 1 0.0000000 0.5041157 0 2 NaN 0.2520579 NaN
## 2: 1 2 0.4628660 1.3449156 1 2 0.4628660 0.6724578 1
## 3: 2 1 0.1820242 0.5121996 1 1 0.1820242 0.5121996 1
## 4: 3 1 0.1758779 0.0000000 1 0 0.1758779 NaN 1
## 5: 3 2 0.8867973 0.5939504 1 1 0.8867973 0.5939504 1
## 6: 4 1 0.6169361 0.5173982 1 1 0.6169361 0.5173982 1
## 7: 4 2 0.3218471 1.7323249 1 2 0.3218471 0.8661625 1
## 8: 5 1 0.3171631 0.7866574 1 1 0.3171631 0.7866574 1
## 9: 5 2 1.3902338 0.9101196 2 1 0.6951169 0.9101196 1
## d2_mean_b
## 1: 1
## 2: 1
## 3: 1
## 4: NaN
## 5: 1
## 6: 1
## 7: 1
## 8: 1
## 9: 1
# multiple fun.agg and value.var (one-to-one)
dcast(DT, x + y ~ z, fun=list(sum, mean), value.var=list("d1", "d2"))
## x y d1_sum_a d1_sum_b d2_mean_a d2_mean_b
## 1: 1 1 0.0000000 0.5041157 NaN 1
## 2: 1 2 0.4628660 1.3449156 1 1
## 3: 2 1 0.1820242 0.5121996 1 1
## 4: 3 1 0.1758779 0.0000000 1 NaN
## 5: 3 2 0.8867973 0.5939504 1 1
## 6: 4 1 0.6169361 0.5173982 1 1
## 7: 4 2 0.3218471 1.7323249 1 1
## 8: 5 1 0.3171631 0.7866574 1 1
## 9: 5 2 1.3902338 0.9101196 1 1
s1 <- "family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA"
DT <- fread(s1)
DT
## family_id age_mother dob_child1 dob_child2 dob_child3
## 1: 1 30 1998-11-26 2000-01-29 <NA>
## 2: 2 27 1996-06-22 <NA> <NA>
## 3: 3 26 2002-07-11 2004-04-05 2007-09-02
## 4: 4 32 2004-10-10 2009-08-27 2012-07-21
## 5: 5 29 2000-12-05 2005-02-28 <NA>
To convert DT to long form, where each dob is a separate observation, we can use melt() by specifying the id.vars and measure.vars arguments as follows:
DT.m1 = melt(DT, id.vars = c("family_id", "age_mother"), measure.vars = c("dob_child1", "dob_child2", "dob_child3"))
DT.m1
## family_id age_mother variable value
## 1: 1 30 dob_child1 1998-11-26
## 2: 2 27 dob_child1 1996-06-22
## 3: 3 26 dob_child1 2002-07-11
## 4: 4 32 dob_child1 2004-10-10
## 5: 5 29 dob_child1 2000-12-05
## 6: 1 30 dob_child2 2000-01-29
## 7: 2 27 dob_child2 <NA>
## 8: 3 26 dob_child2 2004-04-05
## 9: 4 32 dob_child2 2009-08-27
## 10: 5 29 dob_child2 2005-02-28
## 11: 1 30 dob_child3 <NA>
## 12: 2 27 dob_child3 <NA>
## 13: 3 26 dob_child3 2007-09-02
## 14: 4 32 dob_child3 2012-07-21
## 15: 5 29 dob_child3 <NA>
The measure.vars argument specifies the set of columns we would like to collapse (or combine) together. We can also specify column indices instead of names.
By default, the variable column is of type factor. Set the variable.factor argument to FALSE if you'd like a character vector instead. The variable.factor argument is only available in melt from data.table, not in the reshape2 package.
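A tiny sketch of the variable.factor switch, on toy data:

```r
library(data.table)

DT <- data.table(id = 1:2, a = c(1.5, 2.5), b = c(3.5, 4.5))

m1 <- melt(DT, id.vars = "id")                           # variable is a factor
m2 <- melt(DT, id.vars = "id", variable.factor = FALSE)  # variable is character
class(m1$variable)
class(m2$variable)
```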
By default, the molten columns are automatically named variable and value. melt preserves column attributes in the result. We can rename the variable and value columns to child and dob respectively:
DT.m1 = melt(DT, measure.vars = c("dob_child1", "dob_child2", "dob_child3"),
             variable.name = "child", value.name = "dob")
DT.m1
## family_id age_mother child dob
## 1: 1 30 dob_child1 1998-11-26
## 2: 2 27 dob_child1 1996-06-22
## 3: 3 26 dob_child1 2002-07-11
## 4: 4 32 dob_child1 2004-10-10
## 5: 5 29 dob_child1 2000-12-05
## 6: 1 30 dob_child2 2000-01-29
## 7: 2 27 dob_child2 <NA>
## 8: 3 26 dob_child2 2004-04-05
## 9: 4 32 dob_child2 2009-08-27
## 10: 5 29 dob_child2 2005-02-28
## 11: 1 30 dob_child3 <NA>
## 12: 2 27 dob_child3 <NA>
## 13: 3 26 dob_child3 2007-09-02
## 14: 4 32 dob_child3 2012-07-21
## 15: 5 29 dob_child3 <NA>
By default, when one of id.vars or measure.vars is missing, the remaining columns are automatically assigned to the missing argument.
When neither id.vars nor measure.vars is specified, then, as mentioned under ?melt, all columns that are not of type numeric, integer, or logical will be assigned to id.vars. In addition, a warning message is issued highlighting the columns that are automatically considered to be id.vars.
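For example (a toy sketch): with only id.vars given, the remaining columns are treated as measure.vars automatically:

```r
library(data.table)

DT <- data.table(fam = c("A", "B"), x = c(1, 2), y = c(3, 4))

# id.vars = "fam" => 'x' and 'y' are melted without listing them explicitly
melt(DT, id.vars = "fam")
```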
dcast(DT.m1, family_id + age_mother ~ child, value.var = "dob")
## family_id age_mother dob_child1 dob_child2 dob_child3
## 1: 1 30 1998-11-26 2000-01-29 <NA>
## 2: 2 27 1996-06-22 <NA> <NA>
## 3: 3 26 2002-07-11 2004-04-05 2007-09-02
## 4: 4 32 2004-10-10 2009-08-27 2012-07-21
## 5: 5 29 2000-12-05 2005-02-28 <NA>
dcast uses the formula interface: the variables on the LHS of the formula represent the id vars, and those on the RHS the measure vars. value.var denotes the column to be filled in while casting to wide format. dcast also tries to preserve attributes in the result wherever possible.
Starting from DT.m1, how can we get the number of children in each family? You can pass a function to aggregate by in dcast with the argument fun.aggregate. This is particularly essential when the formula provided does not identify a single observation for each cell.
dcast(DT.m1, family_id ~ ., fun.aggregate = function(x) sum(!is.na(x)), value.var = "dob")
## family_id .
## 1: 1 2
## 2: 2 1
## 3: 3 3
## 4: 4 3
## 5: 5 2
Limitations of the melt/dcast approaches
However, there are situations we might run into where the desired operation is not expressed in a straightforward manner. For example, consider the data.table shown below:
s2 <- "family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1 30 1998-11-26 2000-01-29 NA 1 2 NA
2 27 1996-06-22 NA NA 2 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
5 29 2000-12-05 2005-02-28 NA 2 1 NA"
DT <- fread(s2)
DT
## family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1
## 1: 1 30 1998-11-26 2000-01-29 <NA> 1
## 2: 2 27 1996-06-22 <NA> <NA> 2
## 3: 3 26 2002-07-11 2004-04-05 2007-09-02 2
## 4: 4 32 2004-10-10 2009-08-27 2012-07-21 1
## 5: 5 29 2000-12-05 2005-02-28 <NA> 2
## gender_child2 gender_child3
## 1: 2 NA
## 2: NA NA
## 3: 2 1
## 4: 1 1
## 5: 1 NA
## 1 = female, 2 = male
And you’d like to combine (melt) all the dob
columns together, and gender
columns together. Using the current functionality, we can do something like this:
DT.m1 = melt(DT, id = c("family_id", "age_mother"))
## Warning in melt.data.table(DT, id = c("family_id", "age_mother")):
## 'measure.vars' [dob_child1, dob_child2, dob_child3, gender_child1, ...]
## are not all of the same type. By order of hierarchy, the molten data value
## column will be of type 'character'. All measure variables not of type
## 'character' will be coerced too. Check DETAILS in ?melt.data.table for more
## on coercion.
DT.m1
## family_id age_mother variable value
## 1: 1 30 dob_child1 1998-11-26
## 2: 2 27 dob_child1 1996-06-22
## 3: 3 26 dob_child1 2002-07-11
## 4: 4 32 dob_child1 2004-10-10
## 5: 5 29 dob_child1 2000-12-05
## 6: 1 30 dob_child2 2000-01-29
## 7: 2 27 dob_child2 <NA>
## 8: 3 26 dob_child2 2004-04-05
## 9: 4 32 dob_child2 2009-08-27
## 10: 5 29 dob_child2 2005-02-28
## 11: 1 30 dob_child3 <NA>
## 12: 2 27 dob_child3 <NA>
## 13: 3 26 dob_child3 2007-09-02
## 14: 4 32 dob_child3 2012-07-21
## 15: 5 29 dob_child3 <NA>
## 16: 1 30 gender_child1 1
## 17: 2 27 gender_child1 2
## 18: 3 26 gender_child1 2
## 19: 4 32 gender_child1 1
## 20: 5 29 gender_child1 2
## 21: 1 30 gender_child2 2
## 22: 2 27 gender_child2 <NA>
## 23: 3 26 gender_child2 2
## 24: 4 32 gender_child2 1
## 25: 5 29 gender_child2 1
## 26: 1 30 gender_child3 <NA>
## 27: 2 27 gender_child3 <NA>
## 28: 3 26 gender_child3 1
## 29: 4 32 gender_child3 1
## 30: 5 29 gender_child3 <NA>
## family_id age_mother variable value
DT.m1[, c("variable", "child") := tstrsplit(variable, "_", fixed = TRUE)]
DT.m1
## family_id age_mother variable value child
## 1: 1 30 dob 1998-11-26 child1
## 2: 2 27 dob 1996-06-22 child1
## 3: 3 26 dob 2002-07-11 child1
## 4: 4 32 dob 2004-10-10 child1
## 5: 5 29 dob 2000-12-05 child1
## 6: 1 30 dob 2000-01-29 child2
## 7: 2 27 dob <NA> child2
## 8: 3 26 dob 2004-04-05 child2
## 9: 4 32 dob 2009-08-27 child2
## 10: 5 29 dob 2005-02-28 child2
## 11: 1 30 dob <NA> child3
## 12: 2 27 dob <NA> child3
## 13: 3 26 dob 2007-09-02 child3
## 14: 4 32 dob 2012-07-21 child3
## 15: 5 29 dob <NA> child3
## 16: 1 30 gender 1 child1
## 17: 2 27 gender 2 child1
## 18: 3 26 gender 2 child1
## 19: 4 32 gender 1 child1
## 20: 5 29 gender 2 child1
## 21: 1 30 gender 2 child2
## 22: 2 27 gender <NA> child2
## 23: 3 26 gender 2 child2
## 24: 4 32 gender 1 child2
## 25: 5 29 gender 1 child2
## 26: 1 30 gender <NA> child3
## 27: 2 27 gender <NA> child3
## 28: 3 26 gender 1 child3
## 29: 4 32 gender 1 child3
## 30: 5 29 gender <NA> child3
## family_id age_mother variable value child
DT.c1 = dcast(DT.m1, family_id + age_mother + child ~ variable, value.var = "value")
DT.c1
## family_id age_mother child dob gender
## 1: 1 30 child1 1998-11-26 1
## 2: 1 30 child2 2000-01-29 2
## 3: 1 30 child3 <NA> <NA>
## 4: 2 27 child1 1996-06-22 2
## 5: 2 27 child2 <NA> <NA>
## 6: 2 27 child3 <NA> <NA>
## 7: 3 26 child1 2002-07-11 2
## 8: 3 26 child2 2004-04-05 2
## 9: 3 26 child3 2007-09-02 1
## 10: 4 32 child1 2004-10-10 1
## 11: 4 32 child2 2009-08-27 1
## 12: 4 32 child3 2012-07-21 1
## 13: 5 29 child1 2000-12-05 2
## 14: 5 29 child2 2005-02-28 1
## 15: 5 29 child3 <NA> <NA>
dob and gender type columns together, respectively. Instead we are combining everything together and then splitting them again, which is quite roundabout (and inefficient). As an analogy, imagine you have a closet with four shelves of clothes and you'd like to put the clothes from shelves 1 and 2 together (in shelf 1), and from shelves 3 and 4 together (in shelf 3). What we are doing is more or less combining all the clothes together, and then splitting them back onto shelves 1 and 3!
The columns to melt may be of different types, as in this case (character and integer types). By melting them all together, the columns are coerced in the result, as explained by the warning message above and shown in the output of str(DT.c1), where gender has been converted to character type.
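A minimal sketch of this coercion, using a toy table (not the family data):

```r
library(data.table)

# toy table with one character and one integer measure column
x <- data.table(id = 1:2, a = c("p", "q"), b = c(10L, 20L))

# melting both together coerces 'value' to character (melt warns about this)
m <- suppressWarnings(melt(x, id.vars = "id", measure.vars = c("a", "b")))
class(m$value)  # "character"
```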
We are generating an additional column by splitting the variable column into two columns, whose purpose is quite cryptic. We do it only because we need it for casting in the next step.
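As a standalone illustration of that splitting step, tstrsplit() is a transposed strsplit(): it returns one vector per piece rather than one list element per input string. For example:

```r
library(data.table)

v <- c("dob_child1", "gender_child2", "dob_child3")
tstrsplit(v, "_", fixed = TRUE)
# [[1]] "dob" "gender" "dob"       -> becomes the new 'variable' column
# [[2]] "child1" "child2" "child3" -> becomes the new 'child' column
```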
Finally, we cast the data set. But the issue is that it's a much more computationally involved operation than melt. Specifically, it requires computing the order of the variables in the formula, and that's costly.
In fact, stats::reshape is capable of performing this operation in a very straightforward manner. It is an extremely useful and often underrated function. You should definitely give it a try!
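For reference, here is a sketch of the stats::reshape approach on a toy wide data frame shaped like the family data above (hypothetical values); varying groups the dob_* and gender_* columns, and v.names names the resulting long columns:

```r
# toy wide data frame shaped like the family data (hypothetical values)
DF <- data.frame(
  family_id     = 1:2,
  dob_child1    = c("1998-11-26", "1996-06-22"),
  dob_child2    = c("2000-01-29", NA),
  gender_child1 = c(1L, 2L),
  gender_child2 = c(2L, NA)
)

DF.long <- stats::reshape(
  DF,
  varying   = list(c("dob_child1", "dob_child2"),
                   c("gender_child1", "gender_child2")),
  v.names   = c("dob", "gender"),
  timevar   = "child",
  idvar     = "family_id",
  direction = "long"
)
DF.long  # one row per family_id x child, with dob and gender side by side
```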
Enhanced melt
Since we'd like data.table to perform this operation in a straightforward and efficient manner using the same interface, we went ahead and implemented additional functionality, where we can melt to multiple columns simultaneously.
melt multiple columns simultaneously
The idea is quite simple. We pass a list of columns to measure.vars, where each element of the list contains the columns that should be combined together.
DT
## family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1
## 1: 1 30 1998-11-26 2000-01-29 <NA> 1
## 2: 2 27 1996-06-22 <NA> <NA> 2
## 3: 3 26 2002-07-11 2004-04-05 2007-09-02 2
## 4: 4 32 2004-10-10 2009-08-27 2012-07-21 1
## 5: 5 29 2000-12-05 2005-02-28 <NA> 2
## gender_child2 gender_child3
## 1: 2 NA
## 2: NA NA
## 3: 2 1
## 4: 1 1
## 5: 1 NA
colA = paste("dob_child", 1:3, sep = "")
colB = paste("gender_child", 1:3, sep = "")
DT.m2 = melt(DT, measure.vars = list(colA, colB), value.name = c("dob", "gender"))
DT.m2
## family_id age_mother variable dob gender
## 1: 1 30 1 1998-11-26 1
## 2: 2 27 1 1996-06-22 2
## 3: 3 26 1 2002-07-11 2
## 4: 4 32 1 2004-10-10 1
## 5: 5 29 1 2000-12-05 2
## 6: 1 30 2 2000-01-29 2
## 7: 2 27 2 <NA> NA
## 8: 3 26 2 2004-04-05 2
## 9: 4 32 2 2009-08-27 1
## 10: 5 29 2 2005-02-28 1
## 11: 1 30 3 <NA> NA
## 12: 2 27 3 <NA> NA
## 13: 3 26 3 2007-09-02 1
## 14: 4 32 3 2012-07-21 1
## 15: 5 29 3 <NA> NA
patterns()
Usually in these problems, the columns we'd like to melt can be distinguished by a common pattern. We can use the function patterns(), implemented for convenience, to provide regular expressions for the columns to be combined together. The above operation can be rewritten as:
DT.m2 = melt(DT, measure.vars = patterns("^dob", "^gender"), value.name = c("dob", "gender"))
DT.m2
## family_id age_mother variable dob gender
## 1: 1 30 1 1998-11-26 1
## 2: 2 27 1 1996-06-22 2
## 3: 3 26 1 2002-07-11 2
## 4: 4 32 1 2004-10-10 1
## 5: 5 29 1 2000-12-05 2
## 6: 1 30 2 2000-01-29 2
## 7: 2 27 2 <NA> NA
## 8: 3 26 2 2004-04-05 2
## 9: 4 32 2 2009-08-27 1
## 10: 5 29 2 2005-02-28 1
## 11: 1 30 3 <NA> NA
## 12: 2 27 3 <NA> NA
## 13: 3 26 3 2007-09-02 1
## 14: 4 32 3 2012-07-21 1
## 15: 5 29 3 <NA> NA
We can remove the variable column if necessary.
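For instance, on a toy molten table, the variable column can be dropped by reference:

```r
library(data.table)

# toy molten table; drop the 'variable' column by reference (no copy)
m <- data.table(family_id = 1:2, variable = factor(1:2),
                dob = c("1998-11-26", "1996-06-22"))
m[, variable := NULL]
names(m)  # "family_id" "dob"
```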
The functionality is implemented entirely in C, and is therefore both fast and memory efficient, in addition to being straightforward.
Enhanced dcast
We can now melt into multiple columns simultaneously. Given the data set DT.m2 as shown above, how can we get back to the same format as the original data we started with?
If we use the current functionality of dcast, then we'd have to cast twice and bind the results together. But that's once again verbose, not straightforward, and inefficient.
Casting multiple value.vars simultaneously
We can now provide multiple value.var columns to dcast for data.tables directly, so that the operations are taken care of internally and efficiently.
DT.c2 = dcast(DT.m2, family_id + age_mother ~ variable, value.var = c("dob", "gender"))
DT.c2
## family_id age_mother dob_1 dob_2 dob_3 gender_1 gender_2
## 1: 1 30 1998-11-26 2000-01-29 <NA> 1 2
## 2: 2 27 1996-06-22 <NA> <NA> 2 NA
## 3: 3 26 2002-07-11 2004-04-05 2007-09-02 2 2
## 4: 4 32 2004-10-10 2009-08-27 2012-07-21 1 1
## 5: 5 29 2000-12-05 2005-02-28 <NA> 2 1
## gender_3
## 1: NA
## 2: NA
## 3: 1
## 4: 1
## 5: NA
Attributes are preserved in the result wherever possible.
Everything is taken care of internally and efficiently. In addition to being fast, it is also very memory efficient.
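As a quick sanity check on a toy table, melting to multiple columns with patterns() and then casting with multiple value.vars round-trips back to the original wide layout (up to column names):

```r
library(data.table)

wide <- data.table(id = 1:2,
                   a_1 = c(1, 2), a_2 = c(3, 4),
                   b_1 = c(5, 6), b_2 = c(7, 8))

# wide -> long: combine the a_* and b_* columns into 'a' and 'b'
long <- melt(wide, measure.vars = patterns("^a_", "^b_"),
             value.name = c("a", "b"))

# long -> wide: cast both value columns back at once
back <- dcast(long, id ~ variable, value.var = c("a", "b"))
back  # columns id, a_1, a_2, b_1, b_2 with the same values as 'wide'
```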