Transformation to DataFrames#

Split-apply-combine

using DataFrames

Grouping a data frame#

groupby

x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.274951
2220.00502915
3310.27654
4420.497871
5110.913103
6220.994002
7310.124519
8420.123377
groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = 1
Rowidid2v
Int64Int64Float64
1110.274951
2110.913103

Last Group (2 rows): id = 4
Rowidid2v
Int64Int64Float64
1420.497871
2420.123377
groupby(x, [])

GroupedDataFrame with 1 group based on key:

First Group (8 rows):
Rowidid2v
Int64Int64Float64
1110.274951
2220.00502915
3310.27654
4420.497871
5110.913103
6220.994002
7310.124519
8420.123377
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.274951
2110.913103

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.497871
2420.123377

get the parent DataFrame

parent(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.274951
2220.00502915
3310.27654
4420.497871
5110.913103
6220.994002
7310.124519
8420.123377

back to the DataFrame, but in a different order of rows than the original

vcat(gx2...)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.274951
2110.913103
3220.00502915
4220.994002
5310.27654
6310.124519
7420.497871
8420.123377

the same as above

DataFrame(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.274951
2110.913103
3220.00502915
4220.994002
5310.27654
6310.124519
7420.497871
8420.123377

drop grouping columns when creating a data frame

DataFrame(gx2, keepkeys=false)
8×1 DataFrame
Rowv
Float64
10.274951
20.913103
30.00502915
40.994002
50.27654
60.124519
70.497871
80.123377

vector of names of grouping variables

groupcols(gx2)
2-element Vector{Symbol}:
 :id
 :id2

and non-grouping variables

valuecols(gx2)
1-element Vector{Symbol}:
 :v

group indices in parent(gx2)

groupindices(gx2)
8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4
kgx2 = keys(gx2)
4-element DataFrames.GroupKeys{DataFrames.GroupedDataFrame{DataFrames.DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame like to a vector or to a dictionary. The second form accepts GroupKey, NamedTuple or a Tuple.

gx2

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.274951
2110.913103

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.497871
2420.123377
k = keys(gx2)[1]
GroupKey: (id = 1, id2 = 1)
ntk = NamedTuple(k)
(id = 1, id2 = 1)
tk = Tuple(k)
(1, 1)

the operations below produce the same result and are proformant

gx2[1], gx2[k], gx2[ntk], gx2[tk]
(2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.274951
   2 │     1      1  0.913103, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.274951
   2 │     1      1  0.913103, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.274951
   2 │     1      1  0.913103, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.274951
   2 │     1      1  0.913103)

handling missing values

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)
5×2 DataFrame
Rowidx
Int64?Int64
1missing1
252
313
434
5missing5

by default groups include missing values and their order is not guaranteed

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

but we can change it; now they are sorted

groupby(x, :id, sort=true, skipmissing=true)

GroupedDataFrame with 3 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (1 row): id = 5
Rowidx
Int64?Int64
152

and now they are in the order they appear in the source data frame

groupby(x, :id, sort=false)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

Last Group (1 row): id = 3
Rowidx
Int64?Int64
134

Performing transformations#

by group using combine, select, select!, transform, and transform!

using Statistics
using Chain

x = DataFrame(id=rand('a':'d', 100), v=rand(100))
100×2 DataFrame
75 rows omitted
Rowidv
CharFloat64
1c0.86718
2d0.00553355
3a0.792594
4d0.145016
5a0.723366
6c0.373478
7a0.156386
8a0.774747
9b0.319325
10b0.577269
11a0.179759
12b0.399449
13c0.0531153
89d0.314694
90b0.694085
91a0.148528
92d0.90592
93d0.364386
94d0.307045
95c0.803253
96b0.941827
97d0.4352
98c0.874392
99c0.0448102
100b0.814797

apply a function to each group of a data frame combine keeps as many rows as are returned from the function

@chain x begin
    groupby(:id)
    combine(:v => mean)
end
4×2 DataFrame
Rowidv_mean
CharFloat64
1c0.541363
2d0.449724
3a0.520584
4b0.552102
x.id2 = axes(x, 1)
Base.OneTo(100)

Select and transform keep as many rows as are in the source data frame and in correct order. Additionally, transform keeps all columns from the source.

@chain x begin
    groupby(:id)
    transform(:v => mean)
end
100×4 DataFrame
75 rows omitted
Rowidvid2v_mean
CharFloat64Int64Float64
1c0.8671810.541363
2d0.0055335520.449724
3a0.79259430.520584
4d0.14501640.449724
5a0.72336650.520584
6c0.37347860.541363
7a0.15638670.520584
8a0.77474780.520584
9b0.31932590.552102
10b0.577269100.552102
11a0.179759110.520584
12b0.399449120.552102
13c0.0531153130.541363
89d0.314694890.449724
90b0.694085900.552102
91a0.148528910.520584
92d0.90592920.449724
93d0.364386930.449724
94d0.307045940.449724
95c0.803253950.541363
96b0.941827960.552102
97d0.4352970.449724
98c0.874392980.541363
99c0.0448102990.541363
100b0.8147971000.552102

note that combine reorders rows by group of GroupedDataFrame

@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end
100×3 DataFrame
75 rows omitted
Rowidid2v_mean
CharInt64Float64
1c10.541363
2c60.541363
3c130.541363
4c210.541363
5c220.541363
6c250.541363
7c310.541363
8c330.541363
9c360.541363
10c500.541363
11c510.541363
12c540.541363
13c550.541363
89b680.552102
90b690.552102
91b700.552102
92b720.552102
93b760.552102
94b770.552102
95b800.552102
96b850.552102
97b870.552102
98b900.552102
99b960.552102
100b1000.552102

we give a custom name for the result column

@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end
4×2 DataFrame
Rowidres
CharFloat64
1c0.541363
2d0.449724
3a0.520584
4b0.552102

you can have multiple operations

@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end
4×4 DataFrame
Rowidres1res2n
CharFloat64Float64Int64
1c0.54136313.534125
2d0.44972410.793424
3a0.52058413.535226
4b0.55210213.802625

Additional notes:

  • select! and transform! perform operations in-place

  • The general syntax for transformation is source_columns => function => target_column

  • if you pass multiple columns to a function they are treated as positional arguments

  • ByRow and AsTable work exactly like discussed for operations on data frames in 05_columns.ipynb

  • you can automatically groupby again the result of combine, select etc. by passing ungroup=false keyword argument to them

  • similarly keepkeys keyword argument allows you to drop grouping columns from the resulting data frame

It is also allowed to pass a function to all these functions (also - as a special case, as a first argument). In this case the return value can be a table. In particular it allows for an easy dropping of groups if you return an empty table from the function.

If you pass a function you can use a do block syntax. In case of passing a function it gets a SubDataFrame as its argument.

Here is an example:

combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) ## drop groups with low number of rows
end
3×2 DataFrame
Rowidn
CharInt64
1c25
2a26
3b25

You can also produce multiple columns in a single operation:

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])
4×2 DataFrame
Rowidval
Int64Int64
111
212
323
424
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end
2×3 DataFrame
Rowidx1x2
Int64Int64Int64
1112
2234
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end
2×3 DataFrame
Rowidc1c2
Int64Int64Int64
1112
2234

It is easy to unnest the column into multiple columns,

df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])
select(df, :a => AsTable)
2×2 DataFrame
Rowpq
Int64Int64
112
234

automatic column names generated

df = DataFrame(a=[[1, 2], [3, 4]])
select(df, :a => AsTable)
2×2 DataFrame
Rowx1x2
Int64Int64
112
234

custom column names generated

select(df, :a => [:C1, :C2])
2×2 DataFrame
RowC1C2
Int64Int64
112
234

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)
100×3 DataFrame
75 rows omitted
Rowidx1x2
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
11111111
12212112
13313113
89989189
901090190
91191191
92292192
93393193
94494194
95595195
96696196
97797197
98898198
99999199
10010100200
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end
10×3 DataFrame
Rowidx1_minimumx2_minimum
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end
10×5 DataFrame
Rowidx1_minimumx2_minimumx1_maximumx2_maximum
Int64Int64Int64Int64Int64
11110191191
22210292192
33310393193
44410494194
55510595195
66610696196
77710797197
88810898198
99910999199
101010110100200

Aggregation of a data frame using mapcols#

x = DataFrame(rand(10, 10), :auto)
10×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.9693560.8851770.7610780.4960730.5815440.07939560.5314030.8640590.6072080.249705
20.8442990.6911450.7808360.6281890.3203690.6425880.9787630.8141360.4655930.220491
30.01027940.4448810.9736320.5746620.6646090.01827010.1731610.6514980.8614540.363842
40.8964740.6928010.08006440.2550910.3996230.3123420.2237070.636870.7541040.747419
50.0813520.4279610.06482440.873970.2300910.8390310.1066340.6488130.4787490.609716
60.2864770.3555910.4020540.07505660.4350710.4254020.6078390.4586550.3237060.0805402
70.2822610.6377020.6395530.9799430.3999840.8388330.5802710.8398420.05355650.0232844
80.8170890.09497480.05098110.1153480.8775740.06813370.1480460.297520.9581210.601349
90.4143550.7314440.7256870.1061480.3462540.7172150.8350490.9708820.2614050.913889
100.9781320.7731390.4691980.1602930.1760570.5817850.2893740.9879720.9030490.788043
mapcols(mean, x)
1×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5580070.5734820.4947910.4264780.4431180.4522990.4474250.7170250.5666950.459828

Mapping rows and columns using eachcol and eachrow#

map a function over each column and return a vector

map(mean, eachcol(x))
10-element Vector{Float64}:
 0.5580074671136023
 0.5734816749457339
 0.49479074789167593
 0.42647752642362546
 0.44311770773949916
 0.45229947965039824
 0.447424705841956
 0.7170247908651912
 0.5666947144849279
 0.4598278711412883

an iteration returns a Pair with column name and values

foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))
x1: 0.5580074671136023
x2: 0.5734816749457339
x3: 0.49479074789167593
x4: 0.42647752642362546
x5: 0.44311770773949916
x6: 0.45229947965039824
x7: 0.447424705841956
x8: 0.7170247908651912
x9: 0.5666947144849279
x10: 0.4598278711412883

now the returned value is DataFrameRow which works as a NamedTuple but is a view to a parent DataFrame

map(r -> r.x1 / r.x2, eachrow(x))
10-element Vector{Float64}:
 1.0950983038719277
 1.2215946823759614
 0.023105980733285687
 1.2939839832212443
 0.19009186137341755
 0.8056369667857306
 0.44262260834169254
 8.603218574055887
 0.5664889266820277
 1.2651444081680663

it prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)
er.x1 ## you can access columns of a parent data frame directly
10-element Vector{Float64}:
 0.9693560197606523
 0.8442993309808335
 0.010279419744060547
 0.8964735498352752
 0.08135197669469263
 0.28647738846072346
 0.28226121332184273
 0.8170886247220703
 0.4143549589180511
 0.9781321886978206

it prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)
10×10 DataFrameColumns
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.9693560.8851770.7610780.4960730.5815440.07939560.5314030.8640590.6072080.249705
20.8442990.6911450.7808360.6281890.3203690.6425880.9787630.8141360.4655930.220491
30.01027940.4448810.9736320.5746620.6646090.01827010.1731610.6514980.8614540.363842
40.8964740.6928010.08006440.2550910.3996230.3123420.2237070.636870.7541040.747419
50.0813520.4279610.06482440.873970.2300910.8390310.1066340.6488130.4787490.609716
60.2864770.3555910.4020540.07505660.4350710.4254020.6078390.4586550.3237060.0805402
70.2822610.6377020.6395530.9799430.3999840.8388330.5802710.8398420.05355650.0232844
80.8170890.09497480.05098110.1153480.8775740.06813370.1480460.297520.9581210.601349
90.4143550.7314440.7256870.1061480.3462540.7172150.8350490.9708820.2614050.913889
100.9781320.7731390.4691980.1602930.1760570.5817850.2893740.9879720.9030490.788043

you can access columns of a parent data frame directly

ec.x1
10-element Vector{Float64}:
 0.9693560197606523
 0.8442993309808335
 0.010279419744060547
 0.8964735498352752
 0.08135197669469263
 0.28647738846072346
 0.28226121332184273
 0.8170886247220703
 0.4143549589180511
 0.9781321886978206

Transposing#

you can transpose a data frame using permutedims:

df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Rowx1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
df.names = ["a", "b", "c"]
3-element Vector{String}:
 "a"
 "b"
 "c"
permutedims(df, :names)
4×4 DataFrame
Rownamesabc
StringInt64Int64Int64
1x1123
2x2456
3x3789
4x4101112

This notebook was generated using Literate.jl.