Transformation to DataFrames#

Split-apply-combine

using DataFrames

Grouping a data frame#

groupby

x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.973202
2220.046608
3310.252007
4420.628115
5110.110051
6220.544258
7310.891864
8420.24464
groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = 1
Rowidid2v
Int64Int64Float64
1110.973202
2110.110051

Last Group (2 rows): id = 4
Rowidid2v
Int64Int64Float64
1420.628115
2420.24464
groupby(x, [])

GroupedDataFrame with 1 group based on key:

First Group (8 rows):
Rowidid2v
Int64Int64Float64
1110.973202
2220.046608
3310.252007
4420.628115
5110.110051
6220.544258
7310.891864
8420.24464
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.973202
2110.110051

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.628115
2420.24464

get the parent DataFrame

parent(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.973202
2220.046608
3310.252007
4420.628115
5110.110051
6220.544258
7310.891864
8420.24464

back to the DataFrame, but in a different order of rows than the original

vcat(gx2...)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.973202
2110.110051
3220.046608
4220.544258
5310.252007
6310.891864
7420.628115
8420.24464

the same as above

DataFrame(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.973202
2110.110051
3220.046608
4220.544258
5310.252007
6310.891864
7420.628115
8420.24464

drop grouping columns when creating a data frame

DataFrame(gx2, keepkeys=false)
8×1 DataFrame
Rowv
Float64
10.973202
20.110051
30.046608
40.544258
50.252007
60.891864
70.628115
80.24464

vector of names of grouping variables

groupcols(gx2)
2-element Vector{Symbol}:
 :id
 :id2

and non-grouping variables

valuecols(gx2)
1-element Vector{Symbol}:
 :v

group indices in parent(gx2)

groupindices(gx2)
8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4
kgx2 = keys(gx2)
4-element DataFrames.GroupKeys{DataFrames.GroupedDataFrame{DataFrames.DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame like to a vector or to a dictionary. The second form accepts GroupKey, NamedTuple or a Tuple.

gx2

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.973202
2110.110051

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.628115
2420.24464
k = keys(gx2)[1]
GroupKey: (id = 1, id2 = 1)
ntk = NamedTuple(k)
(id = 1, id2 = 1)
tk = Tuple(k)
(1, 1)

the operations below produce the same result and are proformant

gx2[1], gx2[k], gx2[ntk], gx2[tk]
(2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.973202
   2 │     1      1  0.110051, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.973202
   2 │     1      1  0.110051, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.973202
   2 │     1      1  0.110051, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.973202
   2 │     1      1  0.110051)

handling missing values

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)
5×2 DataFrame
Rowidx
Int64?Int64
1missing1
252
313
434
5missing5

by default groups include missing values and their order is not guaranteed

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

but we can change it; now they are sorted

groupby(x, :id, sort=true, skipmissing=true)

GroupedDataFrame with 3 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (1 row): id = 5
Rowidx
Int64?Int64
152

and now they are in the order they appear in the source data frame

groupby(x, :id, sort=false)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

Last Group (1 row): id = 3
Rowidx
Int64?Int64
134

Performing transformations#

by group using combine, select, select!, transform, and transform!

using Statistics
using Chain

x = DataFrame(id=rand('a':'d', 100), v=rand(100))
100×2 DataFrame
75 rows omitted
Rowidv
CharFloat64
1c0.847674
2a0.894818
3d0.133443
4d0.744233
5d0.963738
6b0.74572
7c0.861151
8c0.699599
9a0.776682
10a0.383058
11c0.368769
12d0.712692
13b0.0873433
89d0.326667
90a0.92526
91b0.5512
92d0.538515
93a0.543957
94c0.404102
95c0.787083
96d0.931296
97a0.298847
98b0.445243
99c0.506366
100d0.264476

apply a function to each group of a data frame combine keeps as many rows as are returned from the function

@chain x begin
    groupby(:id)
    combine(:v => mean)
end
4×2 DataFrame
Rowidv_mean
CharFloat64
1c0.493781
2a0.507957
3d0.507968
4b0.443487
x.id2 = axes(x, 1)
Base.OneTo(100)

Select and transform keep as many rows as are in the source data frame and in correct order. Additionally, transform keeps all columns from the source.

@chain x begin
    groupby(:id)
    transform(:v => mean)
end
100×4 DataFrame
75 rows omitted
Rowidvid2v_mean
CharFloat64Int64Float64
1c0.84767410.493781
2a0.89481820.507957
3d0.13344330.507968
4d0.74423340.507968
5d0.96373850.507968
6b0.7457260.443487
7c0.86115170.493781
8c0.69959980.493781
9a0.77668290.507957
10a0.383058100.507957
11c0.368769110.493781
12d0.712692120.507968
13b0.0873433130.443487
89d0.326667890.507968
90a0.92526900.507957
91b0.5512910.443487
92d0.538515920.507968
93a0.543957930.507957
94c0.404102940.493781
95c0.787083950.493781
96d0.931296960.507968
97a0.298847970.507957
98b0.445243980.443487
99c0.506366990.493781
100d0.2644761000.507968

note that combine reorders rows by group of GroupedDataFrame

@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end
100×3 DataFrame
75 rows omitted
Rowidid2v_mean
CharInt64Float64
1c10.493781
2c70.493781
3c80.493781
4c110.493781
5c140.493781
6c190.493781
7c240.493781
8c260.493781
9c270.493781
10c300.493781
11c370.493781
12c390.493781
13c470.493781
89b460.443487
90b650.443487
91b660.443487
92b690.443487
93b720.443487
94b750.443487
95b770.443487
96b780.443487
97b830.443487
98b850.443487
99b910.443487
100b980.443487

we give a custom name for the result column

@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end
4×2 DataFrame
Rowidres
CharFloat64
1c0.493781
2a0.507957
3d0.507968
4b0.443487

you can have multiple operations

@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end
4×4 DataFrame
Rowidres1res2n
CharFloat64Float64Int64
1c0.49378114.813430
2a0.50795712.698925
3d0.50796812.699225
4b0.4434878.8697420

Additional notes:

  • select! and transform! perform operations in-place

  • The general syntax for transformation is source_columns => function => target_column

  • if you pass multiple columns to a function they are treated as positional arguments

  • ByRow and AsTable work exactly like discussed for operations on data frames in 05_columns.ipynb

  • you can automatically groupby again the result of combine, select etc. by passing ungroup=false keyword argument to them

  • similarly keepkeys keyword argument allows you to drop grouping columns from the resulting data frame

It is also allowed to pass a function to all these functions (also - as a special case, as a first argument). In this case the return value can be a table. In particular it allows for an easy dropping of groups if you return an empty table from the function.

If you pass a function you can use a do block syntax. In case of passing a function it gets a SubDataFrame as its argument.

Here is an example:

combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) ## drop groups with low number of rows
end
3×2 DataFrame
Rowidn
CharInt64
1c30
2a25
3d25

You can also produce multiple columns in a single operation:

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])
4×2 DataFrame
Rowidval
Int64Int64
111
212
323
424
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end
2×3 DataFrame
Rowidx1x2
Int64Int64Int64
1112
2234
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end
2×3 DataFrame
Rowidc1c2
Int64Int64Int64
1112
2234

It is easy to unnest the column into multiple columns,

df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])
select(df, :a => AsTable)
2×2 DataFrame
Rowpq
Int64Int64
112
234

automatic column names generated

df = DataFrame(a=[[1, 2], [3, 4]])
select(df, :a => AsTable)
2×2 DataFrame
Rowx1x2
Int64Int64
112
234

custom column names generated

select(df, :a => [:C1, :C2])
2×2 DataFrame
RowC1C2
Int64Int64
112
234

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)
100×3 DataFrame
75 rows omitted
Rowidx1x2
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
11111111
12212112
13313113
89989189
901090190
91191191
92292192
93393193
94494194
95595195
96696196
97797197
98898198
99999199
10010100200
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end
10×3 DataFrame
Rowidx1_minimumx2_minimum
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end
10×5 DataFrame
Rowidx1_minimumx2_minimumx1_maximumx2_maximum
Int64Int64Int64Int64Int64
11110191191
22210292192
33310393193
44410494194
55510595195
66610696196
77710797197
88810898198
99910999199
101010110100200

Aggregation of a data frame using mapcols#

x = DataFrame(rand(10, 10), :auto)
10×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.2631270.8771980.5845350.8995920.7540930.2885980.230890.2765770.1055920.976852
20.5934380.1132250.9204280.8236550.185510.2648290.8117130.1502740.2829630.288698
30.5206680.2773640.5046780.3899450.3006030.5686720.655540.8187690.1775980.666094
40.2489620.6670650.1912870.3946650.9259360.05439780.2878190.7657720.6800410.241801
50.4614360.5331720.3645650.3704680.3662760.2074010.04678880.4690720.6031090.983973
60.9219790.08273530.9170150.4083040.157750.8110740.5098420.05976660.3823580.307684
70.275640.9442870.7904530.9427990.04924210.9580790.264330.09495980.2321860.87194
80.09620070.541460.3700440.543880.8518270.009075290.5921960.06336860.1636620.71178
90.833280.6458290.6255910.04732150.8370150.3279450.9635450.3444260.007574660.210299
100.2327410.5191510.1809210.9910150.8966220.07614360.9306610.9044190.974580.930629
mapcols(mean, x)
1×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4447470.5201480.5449520.5811650.5324870.3566220.5293330.394740.3609660.618975

Mapping rows and columns using eachcol and eachrow#

map a function over each column and return a vector

map(mean, eachcol(x))
10-element Vector{Float64}:
 0.44474714682317557
 0.5201484905953546
 0.5449515714761632
 0.5811645813409286
 0.5324872656441735
 0.3566215707533047
 0.5293326134324855
 0.39474042721143554
 0.3609663947412353
 0.6189750547489952

an iteration returns a Pair with column name and values

foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))
x1: 0.44474714682317557
x2: 0.5201484905953546
x3: 0.5449515714761632
x4: 0.5811645813409286
x5: 0.5324872656441735
x6: 0.3566215707533047
x7: 0.5293326134324855
x8: 0.39474042721143554
x9: 0.3609663947412353
x10: 0.6189750547489952

now the returned value is DataFrameRow which works as a NamedTuple but is a view to a parent DataFrame

map(r -> r.x1 / r.x2, eachrow(x))
10-element Vector{Float64}:
  0.29996299302778767
  5.24124800610297
  1.877198550350215
  0.37321980323752285
  0.865454240024211
 11.143722651309448
  0.29190334065555323
  0.17766891870799123
  1.2902488983315856
  0.44831027267186824

it prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)
er.x1 ## you can access columns of a parent data frame directly
10-element Vector{Float64}:
 0.26312679794452243
 0.5934383993898801
 0.5206675282484069
 0.2489616895928357
 0.4614360974663768
 0.9219794376761827
 0.2756404170403999
 0.09620066883911571
 0.8332796762631839
 0.23274075577085218

it prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)
10×10 DataFrameColumns
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.2631270.8771980.5845350.8995920.7540930.2885980.230890.2765770.1055920.976852
20.5934380.1132250.9204280.8236550.185510.2648290.8117130.1502740.2829630.288698
30.5206680.2773640.5046780.3899450.3006030.5686720.655540.8187690.1775980.666094
40.2489620.6670650.1912870.3946650.9259360.05439780.2878190.7657720.6800410.241801
50.4614360.5331720.3645650.3704680.3662760.2074010.04678880.4690720.6031090.983973
60.9219790.08273530.9170150.4083040.157750.8110740.5098420.05976660.3823580.307684
70.275640.9442870.7904530.9427990.04924210.9580790.264330.09495980.2321860.87194
80.09620070.541460.3700440.543880.8518270.009075290.5921960.06336860.1636620.71178
90.833280.6458290.6255910.04732150.8370150.3279450.9635450.3444260.007574660.210299
100.2327410.5191510.1809210.9910150.8966220.07614360.9306610.9044190.974580.930629

you can access columns of a parent data frame directly

ec.x1
10-element Vector{Float64}:
 0.26312679794452243
 0.5934383993898801
 0.5206675282484069
 0.2489616895928357
 0.4614360974663768
 0.9219794376761827
 0.2756404170403999
 0.09620066883911571
 0.8332796762631839
 0.23274075577085218

Transposing#

you can transpose a data frame using permutedims:

df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Rowx1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
df.names = ["a", "b", "c"]
3-element Vector{String}:
 "a"
 "b"
 "c"
permutedims(df, :names)
4×4 DataFrame
Rownamesabc
StringInt64Int64Int64
1x1123
2x2456
3x3789
4x4101112

This notebook was generated using Literate.jl.