Transformation to DataFrames#

Split-apply-combine

using DataFrames

Grouping a data frame#

groupby

x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.213093
2220.37718
3310.654804
4420.393161
5110.531108
6220.202536
7310.33627
8420.374237
groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = 1
Rowidid2v
Int64Int64Float64
1110.213093
2110.531108

Last Group (2 rows): id = 4
Rowidid2v
Int64Int64Float64
1420.393161
2420.374237
groupby(x, [])

GroupedDataFrame with 1 group based on key:

First Group (8 rows):
Rowidid2v
Int64Int64Float64
1110.213093
2220.37718
3310.654804
4420.393161
5110.531108
6220.202536
7310.33627
8420.374237
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.213093
2110.531108

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.393161
2420.374237

get the parent DataFrame

parent(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.213093
2220.37718
3310.654804
4420.393161
5110.531108
6220.202536
7310.33627
8420.374237

back to the DataFrame, but in a different order of rows than the original

vcat(gx2...)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.213093
2110.531108
3220.37718
4220.202536
5310.654804
6310.33627
7420.393161
8420.374237

the same as above

DataFrame(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.213093
2110.531108
3220.37718
4220.202536
5310.654804
6310.33627
7420.393161
8420.374237

drop grouping columns when creating a data frame

DataFrame(gx2, keepkeys=false)
8×1 DataFrame
Rowv
Float64
10.213093
20.531108
30.37718
40.202536
50.654804
60.33627
70.393161
80.374237

vector of names of grouping variables

groupcols(gx2)
2-element Vector{Symbol}:
 :id
 :id2

and non-grouping variables

valuecols(gx2)
1-element Vector{Symbol}:
 :v

group indices in parent(gx2)

groupindices(gx2)
8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4
kgx2 = keys(gx2)
4-element DataFrames.GroupKeys{DataFrames.GroupedDataFrame{DataFrames.DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame like to a vector or to a dictionary. The second form accepts GroupKey, NamedTuple or a Tuple.

gx2

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.213093
2110.531108

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.393161
2420.374237
k = keys(gx2)[1]
GroupKey: (id = 1, id2 = 1)
ntk = NamedTuple(k)
(id = 1, id2 = 1)
tk = Tuple(k)
(1, 1)

the operations below produce the same result and are proformant

gx2[1], gx2[k], gx2[ntk], gx2[tk]
(2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.213093
   2 │     1      1  0.531108, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.213093
   2 │     1      1  0.531108, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.213093
   2 │     1      1  0.531108, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.213093
   2 │     1      1  0.531108)

handling missing values

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)
5×2 DataFrame
Rowidx
Int64?Int64
1missing1
252
313
434
5missing5

by default groups include missing values and their order is not guaranteed

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

but we can change it; now they are sorted

groupby(x, :id, sort=true, skipmissing=true)

GroupedDataFrame with 3 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (1 row): id = 5
Rowidx
Int64?Int64
152

and now they are in the order they appear in the source data frame

groupby(x, :id, sort=false)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

Last Group (1 row): id = 3
Rowidx
Int64?Int64
134

Performing transformations#

by group using combine, select, select!, transform, and transform!

using Statistics
using Chain

x = DataFrame(id=rand('a':'d', 100), v=rand(100))
100×2 DataFrame
75 rows omitted
Rowidv
CharFloat64
1a0.801883
2d0.952816
3d0.0163985
4d0.628305
5d0.200988
6c0.679717
7b0.406815
8d0.0890824
9d0.647735
10b0.743218
11a0.979649
12d0.307343
13c0.29788
89a0.127252
90c0.123745
91c0.460812
92b0.818189
93b0.138276
94b0.676854
95a0.587076
96d0.613215
97d0.609037
98b0.823303
99a0.0850082
100a0.0878467

apply a function to each group of a data frame combine keeps as many rows as are returned from the function

@chain x begin
    groupby(:id)
    combine(:v => mean)
end
4×2 DataFrame
Rowidv_mean
CharFloat64
1a0.497656
2d0.457387
3c0.425741
4b0.568237
x.id2 = axes(x, 1)
Base.OneTo(100)

Select and transform keep as many rows as are in the source data frame and in correct order. Additionally, transform keeps all columns from the source.

@chain x begin
    groupby(:id)
    transform(:v => mean)
end
100×4 DataFrame
75 rows omitted
Rowidvid2v_mean
CharFloat64Int64Float64
1a0.80188310.497656
2d0.95281620.457387
3d0.016398530.457387
4d0.62830540.457387
5d0.20098850.457387
6c0.67971760.425741
7b0.40681570.568237
8d0.089082480.457387
9d0.64773590.457387
10b0.743218100.568237
11a0.979649110.497656
12d0.307343120.457387
13c0.29788130.425741
89a0.127252890.497656
90c0.123745900.425741
91c0.460812910.425741
92b0.818189920.568237
93b0.138276930.568237
94b0.676854940.568237
95a0.587076950.497656
96d0.613215960.457387
97d0.609037970.457387
98b0.823303980.568237
99a0.0850082990.497656
100a0.08784671000.497656

note that combine reorders rows by group of GroupedDataFrame

@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end
100×3 DataFrame
75 rows omitted
Rowidid2v_mean
CharInt64Float64
1a10.497656
2a110.497656
3a150.497656
4a190.497656
5a250.497656
6a340.497656
7a400.497656
8a490.497656
9a530.497656
10a550.497656
11a590.497656
12a760.497656
13a770.497656
89b440.568237
90b460.568237
91b580.568237
92b610.568237
93b620.568237
94b660.568237
95b700.568237
96b730.568237
97b920.568237
98b930.568237
99b940.568237
100b980.568237

we give a custom name for the result column

@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end
4×2 DataFrame
Rowidres
CharFloat64
1a0.497656
2d0.457387
3c0.425741
4b0.568237

you can have multiple operations

@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end
4×4 DataFrame
Rowidres1res2n
CharFloat64Float64Int64
1a0.49765610.450821
2d0.45738714.636432
3c0.42574111.069326
4b0.56823711.93321

Additional notes:

  • select! and transform! perform operations in-place

  • The general syntax for transformation is source_columns => function => target_column

  • if you pass multiple columns to a function they are treated as positional arguments

  • ByRow and AsTable work exactly like discussed for operations on data frames in 05_columns.ipynb

  • you can automatically groupby again the result of combine, select etc. by passing ungroup=false keyword argument to them

  • similarly keepkeys keyword argument allows you to drop grouping columns from the resulting data frame

It is also allowed to pass a function to all these functions (also - as a special case, as a first argument). In this case the return value can be a table. In particular it allows for an easy dropping of groups if you return an empty table from the function.

If you pass a function you can use a do block syntax. In case of passing a function it gets a SubDataFrame as its argument.

Here is an example:

combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) ## drop groups with low number of rows
end
2×2 DataFrame
Rowidn
CharInt64
1d32
2c26

You can also produce multiple columns in a single operation:

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])
4×2 DataFrame
Rowidval
Int64Int64
111
212
323
424
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end
2×3 DataFrame
Rowidx1x2
Int64Int64Int64
1112
2234
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end
2×3 DataFrame
Rowidc1c2
Int64Int64Int64
1112
2234

It is easy to unnest the column into multiple columns,

df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])
select(df, :a => AsTable)
2×2 DataFrame
Rowpq
Int64Int64
112
234

automatic column names generated

df = DataFrame(a=[[1, 2], [3, 4]])
select(df, :a => AsTable)
2×2 DataFrame
Rowx1x2
Int64Int64
112
234

custom column names generated

select(df, :a => [:C1, :C2])
2×2 DataFrame
RowC1C2
Int64Int64
112
234

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)
100×3 DataFrame
75 rows omitted
Rowidx1x2
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
11111111
12212112
13313113
89989189
901090190
91191191
92292192
93393193
94494194
95595195
96696196
97797197
98898198
99999199
10010100200
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end
10×3 DataFrame
Rowidx1_minimumx2_minimum
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end
10×5 DataFrame
Rowidx1_minimumx2_minimumx1_maximumx2_maximum
Int64Int64Int64Int64Int64
11110191191
22210292192
33310393193
44410494194
55510595195
66610696196
77710797197
88810898198
99910999199
101010110100200

Aggregation of a data frame using mapcols#

x = DataFrame(rand(10, 10), :auto)
10×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5431870.6549710.5649670.01931840.06556210.3334220.1466350.6967580.1337010.441637
20.02163680.2218180.8124180.9199890.4990020.2731370.2699780.2941270.1258620.83274
30.08582330.7294420.9662610.6826630.4764210.9031260.3797120.05583430.894870.280855
40.443830.6986130.8794490.4159380.4841920.4191590.2754730.04658550.423370.87747
50.0703360.9354930.304440.8013970.800480.4670630.7782720.4808270.2581810.310151
60.6754480.6636050.1942070.9249490.4396750.03353850.7425940.2114520.5283050.995182
70.4964980.9458060.1688430.1774350.02144210.4111330.9084370.4235620.4442850.0575748
80.391820.7467540.5831230.5260520.4061520.347290.6632820.5350990.06837990.502595
90.6899190.4805970.7806220.1537470.1605710.9410220.4729050.4765830.01536240.671748
100.3175210.318490.9509290.1537740.7335820.2941430.8286360.8255280.6309970.797777
mapcols(mean, x)
1×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.3736020.6395590.6205260.4775260.4087080.4423030.5465920.4046360.3523310.576773

Mapping rows and columns using eachcol and eachrow#

map a function over each column and return a vector

map(mean, eachcol(x))
10-element Vector{Float64}:
 0.37360182782733153
 0.6395588106842118
 0.6205259472123151
 0.4775262465128011
 0.4087079638457814
 0.4423033580345047
 0.5465923243787321
 0.40463569092326396
 0.3523313871012412
 0.5767728830531882

an iteration returns a Pair with column name and values

foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))
x1: 0.37360182782733153
x2: 0.6395588106842118
x3: 0.6205259472123151
x4: 0.4775262465128011
x5: 0.4087079638457814
x6: 0.4423033580345047
x7: 0.5465923243787321
x8: 0.40463569092326396
x9: 0.3523313871012412
x10: 0.5767728830531882

now the returned value is DataFrameRow which works as a NamedTuple but is a view to a parent DataFrame

map(r -> r.x1 / r.x2, eachrow(x))
10-element Vector{Float64}:
 0.8293307869115067
 0.0975428614803122
 0.11765595862937013
 0.6353016147079975
 0.07518609472932634
 1.0178462703245712
 0.5249474255772854
 0.5246970294853852
 1.4355460069910808
 0.9969559193720787

it prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)
er.x1 ## you can access columns of a parent data frame directly
10-element Vector{Float64}:
 0.5431872051020343
 0.02163675847737223
 0.08582325596595097
 0.4438299117560812
 0.07033603762854568
 0.6754479893590396
 0.496498351529858
 0.39181953559070604
 0.6899186955395462
 0.3175205373241805

it prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)
10×10 DataFrameColumns
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5431870.6549710.5649670.01931840.06556210.3334220.1466350.6967580.1337010.441637
20.02163680.2218180.8124180.9199890.4990020.2731370.2699780.2941270.1258620.83274
30.08582330.7294420.9662610.6826630.4764210.9031260.3797120.05583430.894870.280855
40.443830.6986130.8794490.4159380.4841920.4191590.2754730.04658550.423370.87747
50.0703360.9354930.304440.8013970.800480.4670630.7782720.4808270.2581810.310151
60.6754480.6636050.1942070.9249490.4396750.03353850.7425940.2114520.5283050.995182
70.4964980.9458060.1688430.1774350.02144210.4111330.9084370.4235620.4442850.0575748
80.391820.7467540.5831230.5260520.4061520.347290.6632820.5350990.06837990.502595
90.6899190.4805970.7806220.1537470.1605710.9410220.4729050.4765830.01536240.671748
100.3175210.318490.9509290.1537740.7335820.2941430.8286360.8255280.6309970.797777

you can access columns of a parent data frame directly

ec.x1
10-element Vector{Float64}:
 0.5431872051020343
 0.02163675847737223
 0.08582325596595097
 0.4438299117560812
 0.07033603762854568
 0.6754479893590396
 0.496498351529858
 0.39181953559070604
 0.6899186955395462
 0.3175205373241805

Transposing#

you can transpose a data frame using permutedims:

df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Rowx1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
df.names = ["a", "b", "c"]
3-element Vector{String}:
 "a"
 "b"
 "c"
permutedims(df, :names)
4×4 DataFrame
Rownamesabc
StringInt64Int64Int64
1x1123
2x2456
3x3789
4x4101112

This notebook was generated using Literate.jl.