Transformation to DataFrames#

Split-apply-combine

using DataFrames

Grouping a data frame#

groupby

x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.756392
2220.405107
3310.855122
4420.989225
5110.117164
6220.3996
7310.32268
8420.317763
groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = 1
Rowidid2v
Int64Int64Float64
1110.756392
2110.117164

Last Group (2 rows): id = 4
Rowidid2v
Int64Int64Float64
1420.989225
2420.317763
groupby(x, [])

GroupedDataFrame with 1 group based on key:

First Group (8 rows):
Rowidid2v
Int64Int64Float64
1110.756392
2220.405107
3310.855122
4420.989225
5110.117164
6220.3996
7310.32268
8420.317763
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.756392
2110.117164

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.989225
2420.317763

get the parent DataFrame

parent(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.756392
2220.405107
3310.855122
4420.989225
5110.117164
6220.3996
7310.32268
8420.317763

back to the DataFrame, but in a different order of rows than the original

vcat(gx2...)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.756392
2110.117164
3220.405107
4220.3996
5310.855122
6310.32268
7420.989225
8420.317763

the same as above

DataFrame(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.756392
2110.117164
3220.405107
4220.3996
5310.855122
6310.32268
7420.989225
8420.317763

drop grouping columns when creating a data frame

DataFrame(gx2, keepkeys=false)
8×1 DataFrame
Rowv
Float64
10.756392
20.117164
30.405107
40.3996
50.855122
60.32268
70.989225
80.317763

vector of names of grouping variables

groupcols(gx2)
2-element Vector{Symbol}:
 :id
 :id2

and non-grouping variables

valuecols(gx2)
1-element Vector{Symbol}:
 :v

group indices in parent(gx2)

groupindices(gx2)
8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4
kgx2 = keys(gx2)
4-element DataFrames.GroupKeys{DataFrames.GroupedDataFrame{DataFrames.DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame like to a vector or to a dictionary. The second form accepts GroupKey, NamedTuple or a Tuple.

gx2

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.756392
2110.117164

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.989225
2420.317763
k = keys(gx2)[1]
GroupKey: (id = 1, id2 = 1)
ntk = NamedTuple(k)
(id = 1, id2 = 1)
tk = Tuple(k)
(1, 1)

the operations below produce the same result and are proformant

gx2[1], gx2[k], gx2[ntk], gx2[tk]
(2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.756392
   2 │     1      1  0.117164, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.756392
   2 │     1      1  0.117164, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.756392
   2 │     1      1  0.117164, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.756392
   2 │     1      1  0.117164)

handling missing values

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)
5×2 DataFrame
Rowidx
Int64?Int64
1missing1
252
313
434
5missing5

by default groups include missing values and their order is not guaranteed

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

but we can change it; now they are sorted

groupby(x, :id, sort=true, skipmissing=true)

GroupedDataFrame with 3 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (1 row): id = 5
Rowidx
Int64?Int64
152

and now they are in the order they appear in the source data frame

groupby(x, :id, sort=false)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

Last Group (1 row): id = 3
Rowidx
Int64?Int64
134

Performing transformations#

by group using combine, select, select!, transform, and transform!

using Statistics
using Chain

x = DataFrame(id=rand('a':'d', 100), v=rand(100))
100×2 DataFrame
75 rows omitted
Rowidv
CharFloat64
1c0.164321
2a0.928849
3c0.177093
4c0.465703
5d0.270039
6c0.0326164
7a0.914675
8c0.420337
9d0.746542
10c0.609532
11c0.389072
12b0.80542
13a0.588681
89d0.99002
90c0.855962
91c0.425813
92a0.412515
93b0.158071
94c0.183782
95c0.347485
96c0.812393
97d0.574843
98a0.1684
99b0.118532
100d0.165367

apply a function to each group of a data frame combine keeps as many rows as are returned from the function

@chain x begin
    groupby(:id)
    combine(:v => mean)
end
4×2 DataFrame
Rowidv_mean
CharFloat64
1c0.443746
2a0.568155
3d0.515793
4b0.558574
x.id2 = axes(x, 1)
Base.OneTo(100)

Select and transform keep as many rows as are in the source data frame and in correct order. Additionally, transform keeps all columns from the source.

@chain x begin
    groupby(:id)
    transform(:v => mean)
end
100×4 DataFrame
75 rows omitted
Rowidvid2v_mean
CharFloat64Int64Float64
1c0.16432110.443746
2a0.92884920.568155
3c0.17709330.443746
4c0.46570340.443746
5d0.27003950.515793
6c0.032616460.443746
7a0.91467570.568155
8c0.42033780.443746
9d0.74654290.515793
10c0.609532100.443746
11c0.389072110.443746
12b0.80542120.558574
13a0.588681130.568155
89d0.99002890.515793
90c0.855962900.443746
91c0.425813910.443746
92a0.412515920.568155
93b0.158071930.558574
94c0.183782940.443746
95c0.347485950.443746
96c0.812393960.443746
97d0.574843970.515793
98a0.1684980.568155
99b0.118532990.558574
100d0.1653671000.515793

note that combine reorders rows by group of GroupedDataFrame

@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end
100×3 DataFrame
75 rows omitted
Rowidid2v_mean
CharInt64Float64
1c10.443746
2c30.443746
3c40.443746
4c60.443746
5c80.443746
6c100.443746
7c110.443746
8c190.443746
9c260.443746
10c300.443746
11c310.443746
12c340.443746
13c370.443746
89b140.558574
90b210.558574
91b270.558574
92b330.558574
93b360.558574
94b400.558574
95b660.558574
96b680.558574
97b720.558574
98b760.558574
99b930.558574
100b990.558574

we give a custom name for the result column

@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end
4×2 DataFrame
Rowidres
CharFloat64
1c0.443746
2a0.568155
3d0.515793
4b0.558574

you can have multiple operations

@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end
4×4 DataFrame
Rowidres1res2n
CharFloat64Float64Int64
1c0.44374616.418637
2a0.56815516.476529
3d0.51579310.831621
4b0.5585747.2614613

Additional notes:

  • select! and transform! perform operations in-place

  • The general syntax for transformation is source_columns => function => target_column

  • if you pass multiple columns to a function they are treated as positional arguments

  • ByRow and AsTable work exactly like discussed for operations on data frames in 05_columns.ipynb

  • you can automatically groupby again the result of combine, select etc. by passing ungroup=false keyword argument to them

  • similarly keepkeys keyword argument allows you to drop grouping columns from the resulting data frame

It is also allowed to pass a function to all these functions (also - as a special case, as a first argument). In this case the return value can be a table. In particular it allows for an easy dropping of groups if you return an empty table from the function.

If you pass a function you can use a do block syntax. In case of passing a function it gets a SubDataFrame as its argument.

Here is an example:

combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) ## drop groups with low number of rows
end
2×2 DataFrame
Rowidn
CharInt64
1c37
2a29

You can also produce multiple columns in a single operation:

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])
4×2 DataFrame
Rowidval
Int64Int64
111
212
323
424
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end
2×3 DataFrame
Rowidx1x2
Int64Int64Int64
1112
2234
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end
2×3 DataFrame
Rowidc1c2
Int64Int64Int64
1112
2234

It is easy to unnest the column into multiple columns,

df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])
select(df, :a => AsTable)
2×2 DataFrame
Rowpq
Int64Int64
112
234

automatic column names generated

df = DataFrame(a=[[1, 2], [3, 4]])
select(df, :a => AsTable)
2×2 DataFrame
Rowx1x2
Int64Int64
112
234

custom column names generated

select(df, :a => [:C1, :C2])
2×2 DataFrame
RowC1C2
Int64Int64
112
234

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)
100×3 DataFrame
75 rows omitted
Rowidx1x2
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
11111111
12212112
13313113
89989189
901090190
91191191
92292192
93393193
94494194
95595195
96696196
97797197
98898198
99999199
10010100200
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end
10×3 DataFrame
Rowidx1_minimumx2_minimum
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end
10×5 DataFrame
Rowidx1_minimumx2_minimumx1_maximumx2_maximum
Int64Int64Int64Int64Int64
11110191191
22210292192
33310393193
44410494194
55510595195
66610696196
77710797197
88810898198
99910999199
101010110100200

Aggregation of a data frame using mapcols#

x = DataFrame(rand(10, 10), :auto)
10×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4284130.2882730.3802490.09625240.875730.8922090.3418930.9678070.04251220.398241
20.6410050.8288170.2921580.6486560.5616290.1764640.6800280.4728140.06236930.355808
30.2461230.5354850.5016760.7678780.1586130.2188190.5174650.554680.8488980.508709
40.4817730.8835890.4841770.9768910.9531310.1310420.346620.429910.9891930.0319736
50.727870.7344140.2773080.655520.2970510.1848050.814330.3574840.2402090.132601
60.2469950.4984330.1583710.5166960.7350910.6290040.8121740.5147850.04299620.0967131
70.7957720.8899870.9138670.4025210.6256660.2799970.3379370.798280.08403320.227062
80.5333170.4372080.3617380.6100430.8854540.8424650.224490.5708670.6741150.695531
90.7610870.046450.7174710.8491070.9993130.2015690.3442650.8833350.9940710.364646
100.694550.03222910.08588670.1611960.4754940.6086150.8945330.8953580.5791550.0914591
mapcols(mean, x)
1×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.555690.5174890.417290.5684760.6567170.4164990.5313730.6445320.4557550.290274

Mapping rows and columns using eachcol and eachrow#

map a function over each column and return a vector

map(mean, eachcol(x))
10-element Vector{Float64}:
 0.5556904112031291
 0.5174885931860757
 0.4172901102513499
 0.5684760119865997
 0.6567171643899633
 0.41649895612771
 0.5313734330242246
 0.6445319618589554
 0.4557552139097991
 0.29027441029936235

an iteration returns a Pair with column name and values

foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))
x1: 0.5556904112031291
x2: 0.5174885931860757
x3: 0.4172901102513499
x4: 0.5684760119865997
x5: 0.6567171643899633
x6: 0.41649895612771
x7: 0.5313734330242246
x8: 0.6445319618589554
x9: 0.4557552139097991
x10: 0.29027441029936235

now the returned value is DataFrameRow which works as a NamedTuple but is a view to a parent DataFrame

map(r -> r.x1 / r.x2, eachrow(x))
10-element Vector{Float64}:
  1.4861381928287993
  0.7733965185752516
  0.45962608488578244
  0.5452456518417225
  0.9910894748966226
  0.49554198049022985
  0.8941381695084383
  1.2198258072343713
 16.385072165899587
 21.550374040782003

it prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)
er.x1 ## you can access columns of a parent data frame directly
10-element Vector{Float64}:
 0.42841287488654944
 0.6410045042984526
 0.24612297675874162
 0.48177314942789873
 0.7278698016799588
 0.24699461528955446
 0.7957717820207632
 0.533317366688105
 0.7610873308098037
 0.6945497101714644

it prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)
10×10 DataFrameColumns
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4284130.2882730.3802490.09625240.875730.8922090.3418930.9678070.04251220.398241
20.6410050.8288170.2921580.6486560.5616290.1764640.6800280.4728140.06236930.355808
30.2461230.5354850.5016760.7678780.1586130.2188190.5174650.554680.8488980.508709
40.4817730.8835890.4841770.9768910.9531310.1310420.346620.429910.9891930.0319736
50.727870.7344140.2773080.655520.2970510.1848050.814330.3574840.2402090.132601
60.2469950.4984330.1583710.5166960.7350910.6290040.8121740.5147850.04299620.0967131
70.7957720.8899870.9138670.4025210.6256660.2799970.3379370.798280.08403320.227062
80.5333170.4372080.3617380.6100430.8854540.8424650.224490.5708670.6741150.695531
90.7610870.046450.7174710.8491070.9993130.2015690.3442650.8833350.9940710.364646
100.694550.03222910.08588670.1611960.4754940.6086150.8945330.8953580.5791550.0914591

you can access columns of a parent data frame directly

ec.x1
10-element Vector{Float64}:
 0.42841287488654944
 0.6410045042984526
 0.24612297675874162
 0.48177314942789873
 0.7278698016799588
 0.24699461528955446
 0.7957717820207632
 0.533317366688105
 0.7610873308098037
 0.6945497101714644

Transposing#

you can transpose a data frame using permutedims:

df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Rowx1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
df.names = ["a", "b", "c"]
3-element Vector{String}:
 "a"
 "b"
 "c"
permutedims(df, :names)
4×4 DataFrame
Rownamesabc
StringInt64Int64Int64
1x1123
2x2456
3x3789
4x4101112

This notebook was generated using Literate.jl.