Transformation to DataFrames#

Split-apply-combine

using DataFrames

Grouping a data frame#

groupby

x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.274977
2220.0764892
3310.620511
4420.0601085
5110.162512
6220.318162
7310.407531
8420.851718
groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = 1
Rowidid2v
Int64Int64Float64
1110.274977
2110.162512

Last Group (2 rows): id = 4
Rowidid2v
Int64Int64Float64
1420.0601085
2420.851718
groupby(x, [])

GroupedDataFrame with 1 group based on key:

First Group (8 rows):
Rowidid2v
Int64Int64Float64
1110.274977
2220.0764892
3310.620511
4420.0601085
5110.162512
6220.318162
7310.407531
8420.851718
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.274977
2110.162512

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.0601085
2420.851718

get the parent DataFrame

parent(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.274977
2220.0764892
3310.620511
4420.0601085
5110.162512
6220.318162
7310.407531
8420.851718

back to the DataFrame, but in a different order of rows than the original

vcat(gx2...)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.274977
2110.162512
3220.0764892
4220.318162
5310.620511
6310.407531
7420.0601085
8420.851718

the same as above

DataFrame(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.274977
2110.162512
3220.0764892
4220.318162
5310.620511
6310.407531
7420.0601085
8420.851718

drop grouping columns when creating a data frame

DataFrame(gx2, keepkeys=false)
8×1 DataFrame
Rowv
Float64
10.274977
20.162512
30.0764892
40.318162
50.620511
60.407531
70.0601085
80.851718

vector of names of grouping variables

groupcols(gx2)
2-element Vector{Symbol}:
 :id
 :id2

and non-grouping variables

valuecols(gx2)
1-element Vector{Symbol}:
 :v

group indices in parent(gx2)

groupindices(gx2)
8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4
kgx2 = keys(gx2)
4-element DataFrames.GroupKeys{DataFrames.GroupedDataFrame{DataFrames.DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame like to a vector or to a dictionary. The second form accepts GroupKey, NamedTuple or a Tuple.

gx2

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.274977
2110.162512

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.0601085
2420.851718
k = keys(gx2)[1]
GroupKey: (id = 1, id2 = 1)
ntk = NamedTuple(k)
(id = 1, id2 = 1)
tk = Tuple(k)
(1, 1)

the operations below produce the same result and are proformant

gx2[1], gx2[k], gx2[ntk], gx2[tk]
(2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.274977
   2 │     1      1  0.162512, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.274977
   2 │     1      1  0.162512, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.274977
   2 │     1      1  0.162512, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.274977
   2 │     1      1  0.162512)

handling missing values

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)
5×2 DataFrame
Rowidx
Int64?Int64
1missing1
252
313
434
5missing5

by default groups include missing values and their order is not guaranteed

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

but we can change it; now they are sorted

groupby(x, :id, sort=true, skipmissing=true)

GroupedDataFrame with 3 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (1 row): id = 5
Rowidx
Int64?Int64
152

and now they are in the order they appear in the source data frame

groupby(x, :id, sort=false)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

Last Group (1 row): id = 3
Rowidx
Int64?Int64
134

Performing transformations#

by group using combine, select, select!, transform, and transform!

using Statistics
using Chain

x = DataFrame(id=rand('a':'d', 100), v=rand(100))
100×2 DataFrame
75 rows omitted
Rowidv
CharFloat64
1b0.453848
2c0.909895
3d0.703891
4b0.027346
5d0.248954
6d0.223325
7a0.632575
8c0.0887604
9a0.56193
10a0.874694
11c0.192252
12d0.611955
13a0.63133
89c0.582991
90d0.456848
91b0.892767
92c0.622906
93a0.492163
94d0.459594
95b0.923333
96d0.956767
97a0.344608
98d0.374078
99a0.597576
100b0.880164

apply a function to each group of a data frame combine keeps as many rows as are returned from the function

@chain x begin
    groupby(:id)
    combine(:v => mean)
end
4×2 DataFrame
Rowidv_mean
CharFloat64
1b0.549499
2c0.551201
3d0.501363
4a0.480635
x.id2 = axes(x, 1)
Base.OneTo(100)

Select and transform keep as many rows as are in the source data frame and in correct order. Additionally, transform keeps all columns from the source.

@chain x begin
    groupby(:id)
    transform(:v => mean)
end
100×4 DataFrame
75 rows omitted
Rowidvid2v_mean
CharFloat64Int64Float64
1b0.45384810.549499
2c0.90989520.551201
3d0.70389130.501363
4b0.02734640.549499
5d0.24895450.501363
6d0.22332560.501363
7a0.63257570.480635
8c0.088760480.551201
9a0.5619390.480635
10a0.874694100.480635
11c0.192252110.551201
12d0.611955120.501363
13a0.63133130.480635
89c0.582991890.551201
90d0.456848900.501363
91b0.892767910.549499
92c0.622906920.551201
93a0.492163930.480635
94d0.459594940.501363
95b0.923333950.549499
96d0.956767960.501363
97a0.344608970.480635
98d0.374078980.501363
99a0.597576990.480635
100b0.8801641000.549499

note that combine reorders rows by group of GroupedDataFrame

@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end
100×3 DataFrame
75 rows omitted
Rowidid2v_mean
CharInt64Float64
1b10.549499
2b40.549499
3b140.549499
4b170.549499
5b200.549499
6b240.549499
7b270.549499
8b290.549499
9b300.549499
10b330.549499
11b340.549499
12b390.549499
13b470.549499
89a410.480635
90a500.480635
91a510.480635
92a540.480635
93a610.480635
94a630.480635
95a760.480635
96a770.480635
97a790.480635
98a930.480635
99a970.480635
100a990.480635

we give a custom name for the result column

@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end
4×2 DataFrame
Rowidres
CharFloat64
1b0.549499
2c0.551201
3d0.501363
4a0.480635

you can have multiple operations

@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end
4×4 DataFrame
Rowidres1res2n
CharFloat64Float64Int64
1b0.54949913.18824
2c0.55120114.331226
3d0.50136315.040930
4a0.4806359.612720

Additional notes:

  • select! and transform! perform operations in-place

  • The general syntax for transformation is source_columns => function => target_column

  • if you pass multiple columns to a function they are treated as positional arguments

  • ByRow and AsTable work exactly like discussed for operations on data frames in 05_columns.ipynb

  • you can automatically groupby again the result of combine, select etc. by passing ungroup=false keyword argument to them

  • similarly keepkeys keyword argument allows you to drop grouping columns from the resulting data frame

It is also allowed to pass a function to all these functions (also - as a special case, as a first argument). In this case the return value can be a table. In particular it allows for an easy dropping of groups if you return an empty table from the function.

If you pass a function you can use a do block syntax. In case of passing a function it gets a SubDataFrame as its argument.

Here is an example:

combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) ## drop groups with low number of rows
end
2×2 DataFrame
Rowidn
CharInt64
1c26
2d30

You can also produce multiple columns in a single operation:

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])
4×2 DataFrame
Rowidval
Int64Int64
111
212
323
424
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end
2×3 DataFrame
Rowidx1x2
Int64Int64Int64
1112
2234
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end
2×3 DataFrame
Rowidc1c2
Int64Int64Int64
1112
2234

It is easy to unnest the column into multiple columns,

df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])
select(df, :a => AsTable)
2×2 DataFrame
Rowpq
Int64Int64
112
234

automatic column names generated

df = DataFrame(a=[[1, 2], [3, 4]])
select(df, :a => AsTable)
2×2 DataFrame
Rowx1x2
Int64Int64
112
234

custom column names generated

select(df, :a => [:C1, :C2])
2×2 DataFrame
RowC1C2
Int64Int64
112
234

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)
100×3 DataFrame
75 rows omitted
Rowidx1x2
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
11111111
12212112
13313113
89989189
901090190
91191191
92292192
93393193
94494194
95595195
96696196
97797197
98898198
99999199
10010100200
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end
10×3 DataFrame
Rowidx1_minimumx2_minimum
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end
10×5 DataFrame
Rowidx1_minimumx2_minimumx1_maximumx2_maximum
Int64Int64Int64Int64Int64
11110191191
22210292192
33310393193
44410494194
55510595195
66610696196
77710797197
88810898198
99910999199
101010110100200

Aggregation of a data frame using mapcols#

x = DataFrame(rand(10, 10), :auto)
10×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.1429330.6980270.8008330.2391320.1788740.02045030.8310440.3943130.1556430.821355
20.8716970.4864610.7333910.989540.7884070.9391660.2647160.9400420.4413650.546164
30.7291920.6293160.4503160.07344780.1736710.9009680.71310.3663680.4319750.0348618
40.8632210.2475430.7335630.009216370.922690.2003390.3472160.2265520.4696030.723935
50.2129530.5513120.8871630.2434850.4110230.7408940.9379030.3598570.8704350.763588
60.4501410.4437760.06893580.6201690.3100590.3529240.3440050.9544290.274180.506943
70.9780550.2135760.943040.3810250.4728790.9748180.7038780.9534610.9210670.676587
80.001934630.763280.8314390.6083850.1030240.9016160.681050.7489140.2678940.268814
90.4585910.6934980.1747740.6895810.09343550.6712970.0589330.09003850.1153050.779629
100.1721190.7361470.5719740.1218390.7540630.8404380.5807820.8357180.4947140.0780636
mapcols(mean, x)
1×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4880840.5462940.6195430.3975820.4208120.6542910.5462630.5869690.4442180.519994

Mapping rows and columns using eachcol and eachrow#

map a function over each column and return a vector

map(mean, eachcol(x))
10-element Vector{Float64}:
 0.4880837305347603
 0.5462936532258832
 0.6195428722469729
 0.3975818076667079
 0.42081249608994337
 0.6542910592139737
 0.546262570576392
 0.5869692073307355
 0.44421820059897693
 0.5199939474914792

an iteration returns a Pair with column name and values

foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))
x1: 0.4880837305347603
x2: 0.5462936532258832
x3: 0.6195428722469729
x4: 0.3975818076667079
x5: 0.42081249608994337
x6: 0.6542910592139737
x7: 0.546262570576392
x8: 0.5869692073307355
x9: 0.44421820059897693
x10: 0.5199939474914792

now the returned value is DataFrameRow which works as a NamedTuple but is a view to a parent DataFrame

map(r -> r.x1 / r.x2, eachrow(x))
10-element Vector{Float64}:
 0.20476684814042187
 1.7919161167443853
 1.1587060545644823
 3.4871599001661853
 0.38626626380992646
 1.0143419056246494
 4.579421293717041
 0.0025346325040194943
 0.6612724362340907
 0.2338102187801741

it prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)
er.x1 ## you can access columns of a parent data frame directly
10-element Vector{Float64}:
 0.14293288724939057
 0.8716972684308065
 0.7291924954982303
 0.8632214579604041
 0.21295310607107132
 0.4501409310895359
 0.9780545986593611
 0.0019346332871597038
 0.4585911654757868
 0.1721187616258557

it prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)
10×10 DataFrameColumns
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.1429330.6980270.8008330.2391320.1788740.02045030.8310440.3943130.1556430.821355
20.8716970.4864610.7333910.989540.7884070.9391660.2647160.9400420.4413650.546164
30.7291920.6293160.4503160.07344780.1736710.9009680.71310.3663680.4319750.0348618
40.8632210.2475430.7335630.009216370.922690.2003390.3472160.2265520.4696030.723935
50.2129530.5513120.8871630.2434850.4110230.7408940.9379030.3598570.8704350.763588
60.4501410.4437760.06893580.6201690.3100590.3529240.3440050.9544290.274180.506943
70.9780550.2135760.943040.3810250.4728790.9748180.7038780.9534610.9210670.676587
80.001934630.763280.8314390.6083850.1030240.9016160.681050.7489140.2678940.268814
90.4585910.6934980.1747740.6895810.09343550.6712970.0589330.09003850.1153050.779629
100.1721190.7361470.5719740.1218390.7540630.8404380.5807820.8357180.4947140.0780636

you can access columns of a parent data frame directly

ec.x1
10-element Vector{Float64}:
 0.14293288724939057
 0.8716972684308065
 0.7291924954982303
 0.8632214579604041
 0.21295310607107132
 0.4501409310895359
 0.9780545986593611
 0.0019346332871597038
 0.4585911654757868
 0.1721187616258557

Transposing#

you can transpose a data frame using permutedims:

df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Rowx1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
df.names = ["a", "b", "c"]
3-element Vector{String}:
 "a"
 "b"
 "c"
permutedims(df, :names)
4×4 DataFrame
Rownamesabc
StringInt64Int64Int64
1x1123
2x2456
3x3789
4x4101112

This notebook was generated using Literate.jl.