Transformation to DataFrames#

Split-apply-combine

using DataFrames

Grouping a data frame#

groupby

x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.847192
2220.863094
3310.945034
4420.850163
5110.401421
6220.544358
7310.971199
8420.954327
groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = 1
Rowidid2v
Int64Int64Float64
1110.847192
2110.401421

Last Group (2 rows): id = 4
Rowidid2v
Int64Int64Float64
1420.850163
2420.954327
groupby(x, [])

GroupedDataFrame with 1 group based on key:

First Group (8 rows):
Rowidid2v
Int64Int64Float64
1110.847192
2220.863094
3310.945034
4420.850163
5110.401421
6220.544358
7310.971199
8420.954327
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.847192
2110.401421

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.850163
2420.954327

get the parent DataFrame

parent(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.847192
2220.863094
3310.945034
4420.850163
5110.401421
6220.544358
7310.971199
8420.954327

back to the DataFrame, but in a different order of rows than the original

vcat(gx2...)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.847192
2110.401421
3220.863094
4220.544358
5310.945034
6310.971199
7420.850163
8420.954327

the same as above

DataFrame(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.847192
2110.401421
3220.863094
4220.544358
5310.945034
6310.971199
7420.850163
8420.954327

drop grouping columns when creating a data frame

DataFrame(gx2, keepkeys=false)
8×1 DataFrame
Rowv
Float64
10.847192
20.401421
30.863094
40.544358
50.945034
60.971199
70.850163
80.954327

vector of names of grouping variables

groupcols(gx2)
2-element Vector{Symbol}:
 :id
 :id2

and non-grouping variables

valuecols(gx2)
1-element Vector{Symbol}:
 :v

group indices in parent(gx2)

groupindices(gx2)
8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4
kgx2 = keys(gx2)
4-element DataFrames.GroupKeys{DataFrames.GroupedDataFrame{DataFrames.DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame like to a vector or to a dictionary. The second form accepts GroupKey, NamedTuple or a Tuple.

gx2

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.847192
2110.401421

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.850163
2420.954327
k = keys(gx2)[1]
GroupKey: (id = 1, id2 = 1)
ntk = NamedTuple(k)
(id = 1, id2 = 1)
tk = Tuple(k)
(1, 1)

the operations below produce the same result and are proformant

gx2[1], gx2[k], gx2[ntk], gx2[tk]
(2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.847192
   2 │     1      1  0.401421, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.847192
   2 │     1      1  0.401421, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.847192
   2 │     1      1  0.401421, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.847192
   2 │     1      1  0.401421)

handling missing values

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)
5×2 DataFrame
Rowidx
Int64?Int64
1missing1
252
313
434
5missing5

by default groups include missing values and their order is not guaranteed

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

but we can change it; now they are sorted

groupby(x, :id, sort=true, skipmissing=true)

GroupedDataFrame with 3 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (1 row): id = 5
Rowidx
Int64?Int64
152

and now they are in the order they appear in the source data frame

groupby(x, :id, sort=false)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

Last Group (1 row): id = 3
Rowidx
Int64?Int64
134

Performing transformations#

by group using combine, select, select!, transform, and transform!

using Statistics
using Chain

x = DataFrame(id=rand('a':'d', 100), v=rand(100))
100×2 DataFrame
75 rows omitted
Rowidv
CharFloat64
1a0.0701664
2a0.0182633
3c0.358291
4a0.529623
5a0.402455
6a0.333796
7a0.97962
8b0.637092
9a0.473495
10a0.77666
11d0.106041
12d0.381774
13c0.0512169
89a0.559751
90a0.591965
91a0.616339
92c0.836743
93c0.300096
94d0.751398
95a0.746588
96a0.379123
97c0.476596
98d0.390383
99d0.447578
100a0.0325232

apply a function to each group of a data frame combine keeps as many rows as are returned from the function

@chain x begin
    groupby(:id)
    combine(:v => mean)
end
4×2 DataFrame
Rowidv_mean
CharFloat64
1a0.441277
2c0.448135
3b0.530122
4d0.439243
x.id2 = axes(x, 1)
Base.OneTo(100)

Select and transform keep as many rows as are in the source data frame and in correct order. Additionally, transform keeps all columns from the source.

@chain x begin
    groupby(:id)
    transform(:v => mean)
end
100×4 DataFrame
75 rows omitted
Rowidvid2v_mean
CharFloat64Int64Float64
1a0.070166410.441277
2a0.018263320.441277
3c0.35829130.448135
4a0.52962340.441277
5a0.40245550.441277
6a0.33379660.441277
7a0.9796270.441277
8b0.63709280.530122
9a0.47349590.441277
10a0.77666100.441277
11d0.106041110.439243
12d0.381774120.439243
13c0.0512169130.448135
89a0.559751890.441277
90a0.591965900.441277
91a0.616339910.441277
92c0.836743920.448135
93c0.300096930.448135
94d0.751398940.439243
95a0.746588950.441277
96a0.379123960.441277
97c0.476596970.448135
98d0.390383980.439243
99d0.447578990.439243
100a0.03252321000.441277

note that combine reorders rows by group of GroupedDataFrame

@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end
100×3 DataFrame
75 rows omitted
Rowidid2v_mean
CharInt64Float64
1a10.441277
2a20.441277
3a40.441277
4a50.441277
5a60.441277
6a70.441277
7a90.441277
8a100.441277
9a160.441277
10a200.441277
11a250.441277
12a300.441277
13a320.441277
89d600.439243
90d680.439243
91d700.439243
92d730.439243
93d760.439243
94d780.439243
95d800.439243
96d820.439243
97d860.439243
98d940.439243
99d980.439243
100d990.439243

we give a custom name for the result column

@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end
4×2 DataFrame
Rowidres
CharFloat64
1a0.441277
2c0.448135
3b0.530122
4d0.439243

you can have multiple operations

@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end
4×4 DataFrame
Rowidres1res2n
CharFloat64Float64Int64
1a0.44127712.79729
2c0.4481358.9626920
3b0.53012212.722924
4d0.43924311.859627

Additional notes:

  • select! and transform! perform operations in-place

  • The general syntax for transformation is source_columns => function => target_column

  • if you pass multiple columns to a function they are treated as positional arguments

  • ByRow and AsTable work exactly like discussed for operations on data frames in 05_columns.ipynb

  • you can automatically groupby again the result of combine, select etc. by passing ungroup=false keyword argument to them

  • similarly keepkeys keyword argument allows you to drop grouping columns from the resulting data frame

It is also allowed to pass a function to all these functions (also - as a special case, as a first argument). In this case the return value can be a table. In particular it allows for an easy dropping of groups if you return an empty table from the function.

If you pass a function you can use a do block syntax. In case of passing a function it gets a SubDataFrame as its argument.

Here is an example:

combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) ## drop groups with low number of rows
end
2×2 DataFrame
Rowidn
CharInt64
1a29
2d27

You can also produce multiple columns in a single operation:

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])
4×2 DataFrame
Rowidval
Int64Int64
111
212
323
424
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end
2×3 DataFrame
Rowidx1x2
Int64Int64Int64
1112
2234
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end
2×3 DataFrame
Rowidc1c2
Int64Int64Int64
1112
2234

It is easy to unnest the column into multiple columns,

df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])
select(df, :a => AsTable)
2×2 DataFrame
Rowpq
Int64Int64
112
234

automatic column names generated

df = DataFrame(a=[[1, 2], [3, 4]])
select(df, :a => AsTable)
2×2 DataFrame
Rowx1x2
Int64Int64
112
234

custom column names generated

select(df, :a => [:C1, :C2])
2×2 DataFrame
RowC1C2
Int64Int64
112
234

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)
100×3 DataFrame
75 rows omitted
Rowidx1x2
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
11111111
12212112
13313113
89989189
901090190
91191191
92292192
93393193
94494194
95595195
96696196
97797197
98898198
99999199
10010100200
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end
10×3 DataFrame
Rowidx1_minimumx2_minimum
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end
10×5 DataFrame
Rowidx1_minimumx2_minimumx1_maximumx2_maximum
Int64Int64Int64Int64Int64
11110191191
22210292192
33310393193
44410494194
55510595195
66610696196
77710797197
88810898198
99910999199
101010110100200

Aggregation of a data frame using mapcols#

x = DataFrame(rand(10, 10), :auto)
10×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5812860.5001310.3734930.5470110.2159260.6927870.191660.3516910.1986530.215892
20.8981610.2291910.02468170.6229540.6674770.9393990.8742420.3288040.8894760.00458233
30.2404050.5918950.5803450.9728040.9301640.5595820.3247420.8500610.3668940.236819
40.1056530.8244390.7920410.6238640.7357570.9026210.5634830.4327710.114220.839069
50.1099150.2510320.9441310.8943290.1646850.7508560.5695740.04654950.7510140.522555
60.3799320.6619070.8138770.6742670.6044130.03217410.8281090.971110.03927870.402899
70.6252080.2981350.8331250.8628580.02689480.6643660.4735130.1631170.3306850.569004
80.6761980.1974210.1471040.6297170.2422180.6608850.6357910.2588080.01442460.764473
90.4244080.2487680.737090.3153250.443840.4294590.5786950.8443530.7297730.00955439
100.2261490.5210390.1722760.6879270.5927180.4212270.2303970.769070.2445680.298045
mapcols(mean, x)
1×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4267320.4323960.5418160.6831060.4624090.6053360.5270210.5016330.3678990.386289

Mapping rows and columns using eachcol and eachrow#

map a function over each column and return a vector

map(mean, eachcol(x))
10-element Vector{Float64}:
 0.4267316147456429
 0.4323957582948384
 0.5418162034323375
 0.6831056523551486
 0.46240935500358804
 0.6053356095651719
 0.5270207063246461
 0.5016333397790869
 0.3678985308320364
 0.3862893148013652

an iteration returns a Pair with column name and values

foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))
x1: 0.4267316147456429
x2: 0.4323957582948384
x3: 0.5418162034323375
x4: 0.6831056523551486
x5: 0.46240935500358804
x6: 0.6053356095651719
x7: 0.5270207063246461
x8: 0.5016333397790869
x9: 0.3678985308320364
x10: 0.3862893148013652

now the returned value is DataFrameRow which works as a NamedTuple but is a view to a parent DataFrame

map(r -> r.x1 / r.x2, eachrow(x))
10-element Vector{Float64}:
 1.1622681786852576
 3.9188323070256126
 0.40616089739099004
 0.12815152393556184
 0.43785232107549216
 0.5739959426386916
 2.0970673785796814
 3.4251564265695804
 1.7060360415449167
 0.43403506944925896

it prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)
er.x1 ## you can access columns of a parent data frame directly
10-element Vector{Float64}:
 0.5812864438977917
 0.8981614676806798
 0.24040469126869135
 0.10565305018063365
 0.1099149189485783
 0.37993167234429415
 0.625208287087893
 0.6761984301946938
 0.4244080056653826
 0.2261491801877904

it prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)
10×10 DataFrameColumns
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5812860.5001310.3734930.5470110.2159260.6927870.191660.3516910.1986530.215892
20.8981610.2291910.02468170.6229540.6674770.9393990.8742420.3288040.8894760.00458233
30.2404050.5918950.5803450.9728040.9301640.5595820.3247420.8500610.3668940.236819
40.1056530.8244390.7920410.6238640.7357570.9026210.5634830.4327710.114220.839069
50.1099150.2510320.9441310.8943290.1646850.7508560.5695740.04654950.7510140.522555
60.3799320.6619070.8138770.6742670.6044130.03217410.8281090.971110.03927870.402899
70.6252080.2981350.8331250.8628580.02689480.6643660.4735130.1631170.3306850.569004
80.6761980.1974210.1471040.6297170.2422180.6608850.6357910.2588080.01442460.764473
90.4244080.2487680.737090.3153250.443840.4294590.5786950.8443530.7297730.00955439
100.2261490.5210390.1722760.6879270.5927180.4212270.2303970.769070.2445680.298045

you can access columns of a parent data frame directly

ec.x1
10-element Vector{Float64}:
 0.5812864438977917
 0.8981614676806798
 0.24040469126869135
 0.10565305018063365
 0.1099149189485783
 0.37993167234429415
 0.625208287087893
 0.6761984301946938
 0.4244080056653826
 0.2261491801877904

Transposing#

you can transpose a data frame using permutedims:

df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Rowx1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
df.names = ["a", "b", "c"]
3-element Vector{String}:
 "a"
 "b"
 "c"
permutedims(df, :names)
4×4 DataFrame
Rownamesabc
StringInt64Int64Int64
1x1123
2x2456
3x3789
4x4101112

This notebook was generated using Literate.jl.