Transformation to DataFrames#

Split-apply-combine

using DataFrames

Grouping a data frame#

groupby

x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.0425907
2220.0509911
3310.572115
4420.701202
5110.892608
6220.0499722
7310.611035
8420.503039
groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = 1
Rowidid2v
Int64Int64Float64
1110.0425907
2110.892608

Last Group (2 rows): id = 4
Rowidid2v
Int64Int64Float64
1420.701202
2420.503039
groupby(x, [])

GroupedDataFrame with 1 group based on key:

First Group (8 rows):
Rowidid2v
Int64Int64Float64
1110.0425907
2220.0509911
3310.572115
4420.701202
5110.892608
6220.0499722
7310.611035
8420.503039
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.0425907
2110.892608

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.701202
2420.503039

get the parent DataFrame

parent(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.0425907
2220.0509911
3310.572115
4420.701202
5110.892608
6220.0499722
7310.611035
8420.503039

back to the DataFrame, but in a different order of rows than the original

vcat(gx2...)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.0425907
2110.892608
3220.0509911
4220.0499722
5310.572115
6310.611035
7420.701202
8420.503039

the same as above

DataFrame(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.0425907
2110.892608
3220.0509911
4220.0499722
5310.572115
6310.611035
7420.701202
8420.503039

drop grouping columns when creating a data frame

DataFrame(gx2, keepkeys=false)
8×1 DataFrame
Rowv
Float64
10.0425907
20.892608
30.0509911
40.0499722
50.572115
60.611035
70.701202
80.503039

vector of names of grouping variables

groupcols(gx2)
2-element Vector{Symbol}:
 :id
 :id2

and non-grouping variables

valuecols(gx2)
1-element Vector{Symbol}:
 :v

group indices in parent(gx2)

groupindices(gx2)
8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4
kgx2 = keys(gx2)
4-element DataFrames.GroupKeys{DataFrames.GroupedDataFrame{DataFrames.DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame like to a vector or to a dictionary. The second form accepts GroupKey, NamedTuple or a Tuple.

gx2

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.0425907
2110.892608

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.701202
2420.503039
k = keys(gx2)[1]
GroupKey: (id = 1, id2 = 1)
ntk = NamedTuple(k)
(id = 1, id2 = 1)
tk = Tuple(k)
(1, 1)

the operations below produce the same result and are proformant

gx2[1], gx2[k], gx2[ntk], gx2[tk]
(2×3 SubDataFrame
 Row  id     id2    v         
     │ Int64  Int64  Float64   
─────┼─────────────────────────
   1 │     1      1  0.0425907
   2 │     1      1  0.892608, 2×3 SubDataFrame
 Row  id     id2    v         
     │ Int64  Int64  Float64   
─────┼─────────────────────────
   1 │     1      1  0.0425907
   2 │     1      1  0.892608, 2×3 SubDataFrame
 Row  id     id2    v         
     │ Int64  Int64  Float64   
─────┼─────────────────────────
   1 │     1      1  0.0425907
   2 │     1      1  0.892608, 2×3 SubDataFrame
 Row  id     id2    v         
     │ Int64  Int64  Float64   
─────┼─────────────────────────
   1 │     1      1  0.0425907
   2 │     1      1  0.892608)

handling missing values

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)
5×2 DataFrame
Rowidx
Int64?Int64
1missing1
252
313
434
5missing5

by default groups include missing values and their order is not guaranteed

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

but we can change it; now they are sorted

groupby(x, :id, sort=true, skipmissing=true)

GroupedDataFrame with 3 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (1 row): id = 5
Rowidx
Int64?Int64
152

and now they are in the order they appear in the source data frame

groupby(x, :id, sort=false)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

Last Group (1 row): id = 3
Rowidx
Int64?Int64
134

Performing transformations#

by group using combine, select, select!, transform, and transform!

using Statistics
using Chain

x = DataFrame(id=rand('a':'d', 100), v=rand(100))
100×2 DataFrame
75 rows omitted
Rowidv
CharFloat64
1c0.976458
2c0.843507
3b0.0820769
4c0.854991
5a0.899427
6b0.396204
7a0.144418
8b0.605172
9b0.377043
10c0.756218
11b0.730404
12d0.968495
13a0.500889
89c0.807466
90a0.419191
91c0.661874
92d0.507822
93a0.6274
94c0.847145
95a0.237098
96b0.448951
97b0.0310993
98b0.137833
99d0.288475
100b0.0969202

apply a function to each group of a data frame combine keeps as many rows as are returned from the function

@chain x begin
    groupby(:id)
    combine(:v => mean)
end
4×2 DataFrame
Rowidv_mean
CharFloat64
1c0.687625
2b0.464012
3a0.494417
4d0.55287
x.id2 = axes(x, 1)
Base.OneTo(100)

Select and transform keep as many rows as are in the source data frame and in correct order. Additionally, transform keeps all columns from the source.

@chain x begin
    groupby(:id)
    transform(:v => mean)
end
100×4 DataFrame
75 rows omitted
Rowidvid2v_mean
CharFloat64Int64Float64
1c0.97645810.687625
2c0.84350720.687625
3b0.082076930.464012
4c0.85499140.687625
5a0.89942750.494417
6b0.39620460.464012
7a0.14441870.494417
8b0.60517280.464012
9b0.37704390.464012
10c0.756218100.687625
11b0.730404110.464012
12d0.968495120.55287
13a0.500889130.494417
89c0.807466890.687625
90a0.419191900.494417
91c0.661874910.687625
92d0.507822920.55287
93a0.6274930.494417
94c0.847145940.687625
95a0.237098950.494417
96b0.448951960.464012
97b0.0310993970.464012
98b0.137833980.464012
99d0.288475990.55287
100b0.09692021000.464012

note that combine reorders rows by group of GroupedDataFrame

@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end
100×3 DataFrame
75 rows omitted
Rowidid2v_mean
CharInt64Float64
1c10.687625
2c20.687625
3c40.687625
4c100.687625
5c160.687625
6c210.687625
7c220.687625
8c260.687625
9c280.687625
10c310.687625
11c360.687625
12c380.687625
13c440.687625
89d370.55287
90d390.55287
91d450.55287
92d500.55287
93d630.55287
94d650.55287
95d710.55287
96d740.55287
97d780.55287
98d880.55287
99d920.55287
100d990.55287

we give a custom name for the result column

@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end
4×2 DataFrame
Rowidres
CharFloat64
1c0.687625
2b0.464012
3a0.494417
4d0.55287

you can have multiple operations

@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end
4×4 DataFrame
Rowidres1res2n
CharFloat64Float64Int64
1c0.68762520.628730
2b0.46401212.064326
3a0.49441714.338129
4d0.552878.2930515

Additional notes:

  • select! and transform! perform operations in-place

  • The general syntax for transformation is source_columns => function => target_column

  • if you pass multiple columns to a function they are treated as positional arguments

  • ByRow and AsTable work exactly like discussed for operations on data frames in 05_columns.ipynb

  • you can automatically groupby again the result of combine, select etc. by passing ungroup=false keyword argument to them

  • similarly keepkeys keyword argument allows you to drop grouping columns from the resulting data frame

It is also allowed to pass a function to all these functions (also - as a special case, as a first argument). In this case the return value can be a table. In particular it allows for an easy dropping of groups if you return an empty table from the function.

If you pass a function you can use a do block syntax. In case of passing a function it gets a SubDataFrame as its argument.

Here is an example:

combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) ## drop groups with low number of rows
end
3×2 DataFrame
Rowidn
CharInt64
1c30
2b26
3a29

You can also produce multiple columns in a single operation:

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])
4×2 DataFrame
Rowidval
Int64Int64
111
212
323
424
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end
2×3 DataFrame
Rowidx1x2
Int64Int64Int64
1112
2234
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end
2×3 DataFrame
Rowidc1c2
Int64Int64Int64
1112
2234

It is easy to unnest the column into multiple columns,

df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])
select(df, :a => AsTable)
2×2 DataFrame
Rowpq
Int64Int64
112
234

automatic column names generated

df = DataFrame(a=[[1, 2], [3, 4]])
select(df, :a => AsTable)
2×2 DataFrame
Rowx1x2
Int64Int64
112
234

custom column names generated

select(df, :a => [:C1, :C2])
2×2 DataFrame
RowC1C2
Int64Int64
112
234

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)
100×3 DataFrame
75 rows omitted
Rowidx1x2
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
11111111
12212112
13313113
89989189
901090190
91191191
92292192
93393193
94494194
95595195
96696196
97797197
98898198
99999199
10010100200
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end
10×3 DataFrame
Rowidx1_minimumx2_minimum
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end
10×5 DataFrame
Rowidx1_minimumx2_minimumx1_maximumx2_maximum
Int64Int64Int64Int64Int64
11110191191
22210292192
33310393193
44410494194
55510595195
66610696196
77710797197
88810898198
99910999199
101010110100200

Aggregation of a data frame using mapcols#

x = DataFrame(rand(10, 10), :auto)
10×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.2681080.9641370.9216430.3905050.6682130.5647710.2891420.3464560.8361030.791965
20.4900960.3483360.8296210.9798890.3418510.9114950.315990.47770.07379970.786805
30.4416230.3377020.04799440.611830.1813130.560650.8989560.3782130.8449970.48045
40.983130.6098420.2688690.7712990.8175490.2452410.4398610.4556850.3709280.749379
50.2336960.2353160.2093660.1510250.4869470.02694580.04091150.273740.01198220.763845
60.725240.09935230.3780520.7146940.894910.1014370.07678530.8864870.8058250.58954
70.05130310.7805590.6251970.3096080.08605590.09542760.1274510.865440.4525620.0974757
80.3401240.775760.01943760.9854340.03751540.8010260.5285430.3337350.07165950.272976
90.6730340.8304610.9026820.970380.4328740.7826950.4001020.5310820.6704790.859868
100.9321440.4514640.4602090.1644510.5398350.7640050.8193390.6481050.7459790.879255
mapcols(mean, x)
1×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.513850.5432930.4663070.6049110.4487060.4853690.3937080.5196640.4884310.627156

Mapping rows and columns using eachcol and eachrow#

map a function over each column and return a vector

map(mean, eachcol(x))
10-element Vector{Float64}:
 0.5138499053530677
 0.5432929873490393
 0.4663071840543761
 0.6049114902111529
 0.4487063307522809
 0.48536940553131674
 0.39370812175433223
 0.519664191175476
 0.4884312090863836
 0.6271559063716318

an iteration returns a Pair with column name and values

foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))
x1: 0.5138499053530677
x2: 0.5432929873490393
x3: 0.4663071840543761
x4: 0.6049114902111529
x5: 0.4487063307522809
x6: 0.48536940553131674
x7: 0.39370812175433223
x8: 0.519664191175476
x9: 0.4884312090863836
x10: 0.6271559063716318

now the returned value is DataFrameRow which works as a NamedTuple but is a view to a parent DataFrame

map(r -> r.x1 / r.x2, eachrow(x))
10-element Vector{Float64}:
 0.2780809863660134
 1.4069622297968167
 1.3077289482952075
 1.6121065509237753
 0.9931156388639917
 7.299683165942576
 0.06572613008547387
 0.4384391019088276
 0.8104348130976737
 2.0647140078056307

it prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)
er.x1 ## you can access columns of a parent data frame directly
10-element Vector{Float64}:
 0.2681081593548079
 0.4900958196613917
 0.44162319542270223
 0.9831302949527768
 0.23369612446354215
 0.7252401254413869
 0.05130311782432795
 0.3401235342198399
 0.6730343878634002
 0.9321442943265011

it prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)
10×10 DataFrameColumns
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.2681080.9641370.9216430.3905050.6682130.5647710.2891420.3464560.8361030.791965
20.4900960.3483360.8296210.9798890.3418510.9114950.315990.47770.07379970.786805
30.4416230.3377020.04799440.611830.1813130.560650.8989560.3782130.8449970.48045
40.983130.6098420.2688690.7712990.8175490.2452410.4398610.4556850.3709280.749379
50.2336960.2353160.2093660.1510250.4869470.02694580.04091150.273740.01198220.763845
60.725240.09935230.3780520.7146940.894910.1014370.07678530.8864870.8058250.58954
70.05130310.7805590.6251970.3096080.08605590.09542760.1274510.865440.4525620.0974757
80.3401240.775760.01943760.9854340.03751540.8010260.5285430.3337350.07165950.272976
90.6730340.8304610.9026820.970380.4328740.7826950.4001020.5310820.6704790.859868
100.9321440.4514640.4602090.1644510.5398350.7640050.8193390.6481050.7459790.879255

you can access columns of a parent data frame directly

ec.x1
10-element Vector{Float64}:
 0.2681081593548079
 0.4900958196613917
 0.44162319542270223
 0.9831302949527768
 0.23369612446354215
 0.7252401254413869
 0.05130311782432795
 0.3401235342198399
 0.6730343878634002
 0.9321442943265011

Transposing#

you can transpose a data frame using permutedims:

df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Rowx1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
df.names = ["a", "b", "c"]
3-element Vector{String}:
 "a"
 "b"
 "c"
permutedims(df, :names)
4×4 DataFrame
Rownamesabc
StringInt64Int64Int64
1x1123
2x2456
3x3789
4x4101112

This notebook was generated using Literate.jl.