Transformation to DataFrames#

Split-apply-combine

using DataFrames

Grouping a data frame#

groupby

x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.660288
2220.00519032
3310.0942121
4420.458129
5110.26202
6220.849367
7310.522893
8420.24494
groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = 1
Rowidid2v
Int64Int64Float64
1110.660288
2110.26202

Last Group (2 rows): id = 4
Rowidid2v
Int64Int64Float64
1420.458129
2420.24494
groupby(x, [])

GroupedDataFrame with 1 group based on key:

First Group (8 rows):
Rowidid2v
Int64Int64Float64
1110.660288
2220.00519032
3310.0942121
4420.458129
5110.26202
6220.849367
7310.522893
8420.24494
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.660288
2110.26202

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.458129
2420.24494

get the parent DataFrame

parent(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.660288
2220.00519032
3310.0942121
4420.458129
5110.26202
6220.849367
7310.522893
8420.24494

back to the DataFrame, but in a different order of rows than the original

vcat(gx2...)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.660288
2110.26202
3220.00519032
4220.849367
5310.0942121
6310.522893
7420.458129
8420.24494

the same as above

DataFrame(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.660288
2110.26202
3220.00519032
4220.849367
5310.0942121
6310.522893
7420.458129
8420.24494

drop grouping columns when creating a data frame

DataFrame(gx2, keepkeys=false)
8×1 DataFrame
Rowv
Float64
10.660288
20.26202
30.00519032
40.849367
50.0942121
60.522893
70.458129
80.24494

vector of names of grouping variables

groupcols(gx2)
2-element Vector{Symbol}:
 :id
 :id2

and non-grouping variables

valuecols(gx2)
1-element Vector{Symbol}:
 :v

group indices in parent(gx2)

groupindices(gx2)
8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4
kgx2 = keys(gx2)
4-element DataFrames.GroupKeys{DataFrames.GroupedDataFrame{DataFrames.DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame like to a vector or to a dictionary. The second form accepts GroupKey, NamedTuple or a Tuple.

gx2

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.660288
2110.26202

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.458129
2420.24494
k = keys(gx2)[1]
GroupKey: (id = 1, id2 = 1)
ntk = NamedTuple(k)
(id = 1, id2 = 1)
tk = Tuple(k)
(1, 1)

the operations below produce the same result and are proformant

gx2[1], gx2[k], gx2[ntk], gx2[tk]
(2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.660288
   2 │     1      1  0.26202, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.660288
   2 │     1      1  0.26202, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.660288
   2 │     1      1  0.26202, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.660288
   2 │     1      1  0.26202)

handling missing values

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)
5×2 DataFrame
Rowidx
Int64?Int64
1missing1
252
313
434
5missing5

by default groups include missing values and their order is not guaranteed

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

but we can change it; now they are sorted

groupby(x, :id, sort=true, skipmissing=true)

GroupedDataFrame with 3 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (1 row): id = 5
Rowidx
Int64?Int64
152

and now they are in the order they appear in the source data frame

groupby(x, :id, sort=false)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

Last Group (1 row): id = 3
Rowidx
Int64?Int64
134

Performing transformations#

by group using combine, select, select!, transform, and transform!

using Statistics
using Chain

x = DataFrame(id=rand('a':'d', 100), v=rand(100))
100×2 DataFrame
75 rows omitted
Rowidv
CharFloat64
1d0.248966
2b0.677136
3d0.0390287
4d0.338539
5a0.914206
6c0.119065
7b0.572222
8d0.403218
9c0.907131
10d0.419591
11a0.438969
12b0.0641178
13b0.983015
89a0.0495181
90d0.106207
91a0.00476801
92c0.985394
93d0.122091
94d0.867029
95a0.616144
96d0.99123
97b0.13803
98b0.390887
99b0.372689
100b0.927418

apply a function to each group of a data frame combine keeps as many rows as are returned from the function

@chain x begin
    groupby(:id)
    combine(:v => mean)
end
4×2 DataFrame
Rowidv_mean
CharFloat64
1d0.461606
2b0.498758
3a0.529388
4c0.546276
x.id2 = axes(x, 1)
Base.OneTo(100)

Select and transform keep as many rows as are in the source data frame and in correct order. Additionally, transform keeps all columns from the source.

@chain x begin
    groupby(:id)
    transform(:v => mean)
end
100×4 DataFrame
75 rows omitted
Rowidvid2v_mean
CharFloat64Int64Float64
1d0.24896610.461606
2b0.67713620.498758
3d0.039028730.461606
4d0.33853940.461606
5a0.91420650.529388
6c0.11906560.546276
7b0.57222270.498758
8d0.40321880.461606
9c0.90713190.546276
10d0.419591100.461606
11a0.438969110.529388
12b0.0641178120.498758
13b0.983015130.498758
89a0.0495181890.529388
90d0.106207900.461606
91a0.00476801910.529388
92c0.985394920.546276
93d0.122091930.461606
94d0.867029940.461606
95a0.616144950.529388
96d0.99123960.461606
97b0.13803970.498758
98b0.390887980.498758
99b0.372689990.498758
100b0.9274181000.498758

note that combine reorders rows by group of GroupedDataFrame

@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end
100×3 DataFrame
75 rows omitted
Rowidid2v_mean
CharInt64Float64
1d10.461606
2d30.461606
3d40.461606
4d80.461606
5d100.461606
6d160.461606
7d220.461606
8d240.461606
9d320.461606
10d340.461606
11d370.461606
12d380.461606
13d400.461606
89c470.546276
90c480.546276
91c530.546276
92c540.546276
93c650.546276
94c710.546276
95c730.546276
96c740.546276
97c780.546276
98c860.546276
99c870.546276
100c920.546276

we give a custom name for the result column

@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end
4×2 DataFrame
Rowidres
CharFloat64
1d0.461606
2b0.498758
3a0.529388
4c0.546276

you can have multiple operations

@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end
4×4 DataFrame
Rowidres1res2n
CharFloat64Float64Int64
1d0.46160613.386629
2b0.49875812.468925
3a0.52938810.587820
4c0.54627614.203226

Additional notes:

  • select! and transform! perform operations in-place

  • The general syntax for transformation is source_columns => function => target_column

  • if you pass multiple columns to a function they are treated as positional arguments

  • ByRow and AsTable work exactly like discussed for operations on data frames in 05_columns.ipynb

  • you can automatically groupby again the result of combine, select etc. by passing ungroup=false keyword argument to them

  • similarly keepkeys keyword argument allows you to drop grouping columns from the resulting data frame

It is also allowed to pass a function to all these functions (also - as a special case, as a first argument). In this case the return value can be a table. In particular it allows for an easy dropping of groups if you return an empty table from the function.

If you pass a function you can use a do block syntax. In case of passing a function it gets a SubDataFrame as its argument.

Here is an example:

combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) ## drop groups with low number of rows
end
3×2 DataFrame
Rowidn
CharInt64
1d29
2b25
3c26

You can also produce multiple columns in a single operation:

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])
4×2 DataFrame
Rowidval
Int64Int64
111
212
323
424
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end
2×3 DataFrame
Rowidx1x2
Int64Int64Int64
1112
2234
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end
2×3 DataFrame
Rowidc1c2
Int64Int64Int64
1112
2234

It is easy to unnest the column into multiple columns,

df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])
select(df, :a => AsTable)
2×2 DataFrame
Rowpq
Int64Int64
112
234

automatic column names generated

df = DataFrame(a=[[1, 2], [3, 4]])
select(df, :a => AsTable)
2×2 DataFrame
Rowx1x2
Int64Int64
112
234

custom column names generated

select(df, :a => [:C1, :C2])
2×2 DataFrame
RowC1C2
Int64Int64
112
234

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)
100×3 DataFrame
75 rows omitted
Rowidx1x2
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
11111111
12212112
13313113
89989189
901090190
91191191
92292192
93393193
94494194
95595195
96696196
97797197
98898198
99999199
10010100200
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end
10×3 DataFrame
Rowidx1_minimumx2_minimum
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end
10×5 DataFrame
Rowidx1_minimumx2_minimumx1_maximumx2_maximum
Int64Int64Int64Int64Int64
11110191191
22210292192
33310393193
44410494194
55510595195
66610696196
77710797197
88810898198
99910999199
101010110100200

Aggregation of a data frame using mapcols#

x = DataFrame(rand(10, 10), :auto)
10×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.8573720.06043980.3255890.3386940.7665810.7948920.320410.06972530.9448040.845565
20.669090.5888140.01833790.800830.4966730.06241740.486740.4980840.1738670.604959
30.61920.2696990.07140280.08421040.06464430.8753360.1418550.2369860.3133790.0745637
40.8515080.08584740.8404330.194620.8781920.003134760.2038180.710060.9636750.384315
50.8655080.1869520.2602360.7308580.929060.8642170.9970370.9474470.7396230.413455
60.007444640.8037860.1943780.7945460.02736590.3103780.6223360.6497830.5321840.873814
70.125960.8140050.09305380.1859450.1700850.2147340.1338850.5978840.1868610.893142
80.7801130.7665130.3975150.4322410.4316760.5191520.6524280.04324910.2376440.796689
90.3385590.8009770.7904610.944810.1633920.1606520.4828060.7981110.5835620.523887
100.1884920.7297080.1099740.5662460.2603430.3574190.2828430.1685030.9054040.748549
mapcols(mean, x)
1×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5303250.5106740.3101380.50730.4188010.4162330.4324160.4719830.55810.615894

Mapping rows and columns using eachcol and eachrow#

map a function over each column and return a vector

map(mean, eachcol(x))
10-element Vector{Float64}:
 0.5303245428009665
 0.5106740318449894
 0.31013809099463696
 0.5073001344079734
 0.4188011876439576
 0.4162331753066006
 0.43241578776594264
 0.47198321172017604
 0.5581001809827705
 0.6158937739356688

an iteration returns a Pair with column name and values

foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))
x1: 0.5303245428009665
x2: 0.5106740318449894
x3: 0.31013809099463696
x4: 0.5073001344079734
x5: 0.4188011876439576
x6: 0.4162331753066006
x7: 0.43241578776594264
x8: 0.47198321172017604
x9: 0.5581001809827705
x10: 0.6158937739356688

now the returned value is DataFrameRow which works as a NamedTuple but is a view to a parent DataFrame

map(r -> r.x1 / r.x2, eachrow(x))
10-element Vector{Float64}:
 14.185545224261652
  1.1363353510787026
  2.295892834255423
  9.918850731427545
  4.629583189916868
  0.009261975035659207
  0.1547412524365301
  1.0177421879049602
  0.4226822659171988
  0.2583112239652292

it prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)
er.x1 ## you can access columns of a parent data frame directly
10-element Vector{Float64}:
 0.857371938863679
 0.6690899418911187
 0.6191999149726781
 0.8515080259405692
 0.8655075819602075
 0.007444641674264618
 0.12596020343974845
 0.7801126668947531
 0.33855880913447767
 0.18849170323816855

it prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)
10×10 DataFrameColumns
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.8573720.06043980.3255890.3386940.7665810.7948920.320410.06972530.9448040.845565
20.669090.5888140.01833790.800830.4966730.06241740.486740.4980840.1738670.604959
30.61920.2696990.07140280.08421040.06464430.8753360.1418550.2369860.3133790.0745637
40.8515080.08584740.8404330.194620.8781920.003134760.2038180.710060.9636750.384315
50.8655080.1869520.2602360.7308580.929060.8642170.9970370.9474470.7396230.413455
60.007444640.8037860.1943780.7945460.02736590.3103780.6223360.6497830.5321840.873814
70.125960.8140050.09305380.1859450.1700850.2147340.1338850.5978840.1868610.893142
80.7801130.7665130.3975150.4322410.4316760.5191520.6524280.04324910.2376440.796689
90.3385590.8009770.7904610.944810.1633920.1606520.4828060.7981110.5835620.523887
100.1884920.7297080.1099740.5662460.2603430.3574190.2828430.1685030.9054040.748549

you can access columns of a parent data frame directly

ec.x1
10-element Vector{Float64}:
 0.857371938863679
 0.6690899418911187
 0.6191999149726781
 0.8515080259405692
 0.8655075819602075
 0.007444641674264618
 0.12596020343974845
 0.7801126668947531
 0.33855880913447767
 0.18849170323816855

Transposing#

you can transpose a data frame using permutedims:

df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Rowx1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
df.names = ["a", "b", "c"]
3-element Vector{String}:
 "a"
 "b"
 "c"
permutedims(df, :names)
4×4 DataFrame
Rownamesabc
StringInt64Int64Int64
1x1123
2x2456
3x3789
4x4101112

This notebook was generated using Literate.jl.