Transformation to DataFrames#

Split-apply-combine

using DataFrames

Grouping a data frame#

groupby

x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.173134
2220.0333481
3310.180963
4420.113274
5110.836035
6220.0155187
7310.340925
8420.468375
groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = 1
Rowidid2v
Int64Int64Float64
1110.173134
2110.836035

Last Group (2 rows): id = 4
Rowidid2v
Int64Int64Float64
1420.113274
2420.468375
groupby(x, [])

GroupedDataFrame with 1 group based on key:

First Group (8 rows):
Rowidid2v
Int64Int64Float64
1110.173134
2220.0333481
3310.180963
4420.113274
5110.836035
6220.0155187
7310.340925
8420.468375
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.173134
2110.836035

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.113274
2420.468375

get the parent DataFrame

parent(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.173134
2220.0333481
3310.180963
4420.113274
5110.836035
6220.0155187
7310.340925
8420.468375

back to the DataFrame, but in a different order of rows than the original

vcat(gx2...)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.173134
2110.836035
3220.0333481
4220.0155187
5310.180963
6310.340925
7420.113274
8420.468375

the same as above

DataFrame(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.173134
2110.836035
3220.0333481
4220.0155187
5310.180963
6310.340925
7420.113274
8420.468375

drop grouping columns when creating a data frame

DataFrame(gx2, keepkeys=false)
8×1 DataFrame
Rowv
Float64
10.173134
20.836035
30.0333481
40.0155187
50.180963
60.340925
70.113274
80.468375

vector of names of grouping variables

groupcols(gx2)
2-element Vector{Symbol}:
 :id
 :id2

and non-grouping variables

valuecols(gx2)
1-element Vector{Symbol}:
 :v

group indices in parent(gx2)

groupindices(gx2)
8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4
kgx2 = keys(gx2)
4-element DataFrames.GroupKeys{DataFrames.GroupedDataFrame{DataFrames.DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame like to a vector or to a dictionary. The second form accepts GroupKey, NamedTuple or a Tuple.

gx2

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.173134
2110.836035

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.113274
2420.468375
k = keys(gx2)[1]
GroupKey: (id = 1, id2 = 1)
ntk = NamedTuple(k)
(id = 1, id2 = 1)
tk = Tuple(k)
(1, 1)

the operations below produce the same result and are proformant

gx2[1], gx2[k], gx2[ntk], gx2[tk]
(2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.173134
   2 │     1      1  0.836035, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.173134
   2 │     1      1  0.836035, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.173134
   2 │     1      1  0.836035, 2×3 SubDataFrame
 Row  id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.173134
   2 │     1      1  0.836035)

handling missing values

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)
5×2 DataFrame
Rowidx
Int64?Int64
1missing1
252
313
434
5missing5

by default groups include missing values and their order is not guaranteed

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

but we can change it; now they are sorted

groupby(x, :id, sort=true, skipmissing=true)

GroupedDataFrame with 3 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (1 row): id = 5
Rowidx
Int64?Int64
152

and now they are in the order they appear in the source data frame

groupby(x, :id, sort=false)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

Last Group (1 row): id = 3
Rowidx
Int64?Int64
134

Performing transformations#

by group using combine, select, select!, transform, and transform!

using Statistics
using Chain

x = DataFrame(id=rand('a':'d', 100), v=rand(100))
100×2 DataFrame
75 rows omitted
Rowidv
CharFloat64
1d0.59953
2d0.963186
3b0.912677
4c0.46845
5b0.640382
6b0.340461
7c0.716181
8b0.748398
9c0.0640978
10b0.706182
11b0.161233
12c0.261802
13c0.15591
89b0.467717
90a0.128613
91d0.401218
92a0.0066471
93a0.609879
94b0.200713
95c0.666582
96a0.770652
97a0.297559
98c0.797879
99b0.350133
100b0.229405

apply a function to each group of a data frame combine keeps as many rows as are returned from the function

@chain x begin
    groupby(:id)
    combine(:v => mean)
end
4×2 DataFrame
Rowidv_mean
CharFloat64
1d0.479639
2b0.502398
3c0.548179
4a0.426175
x.id2 = axes(x, 1)
Base.OneTo(100)

Select and transform keep as many rows as are in the source data frame and in correct order. Additionally, transform keeps all columns from the source.

@chain x begin
    groupby(:id)
    transform(:v => mean)
end
100×4 DataFrame
75 rows omitted
Rowidvid2v_mean
CharFloat64Int64Float64
1d0.5995310.479639
2d0.96318620.479639
3b0.91267730.502398
4c0.4684540.548179
5b0.64038250.502398
6b0.34046160.502398
7c0.71618170.548179
8b0.74839880.502398
9c0.064097890.548179
10b0.706182100.502398
11b0.161233110.502398
12c0.261802120.548179
13c0.15591130.548179
89b0.467717890.502398
90a0.128613900.426175
91d0.401218910.479639
92a0.0066471920.426175
93a0.609879930.426175
94b0.200713940.502398
95c0.666582950.548179
96a0.770652960.426175
97a0.297559970.426175
98c0.797879980.548179
99b0.350133990.502398
100b0.2294051000.502398

note that combine reorders rows by group of GroupedDataFrame

@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end
100×3 DataFrame
75 rows omitted
Rowidid2v_mean
CharInt64Float64
1d10.479639
2d20.479639
3d150.479639
4d200.479639
5d300.479639
6d310.479639
7d320.479639
8d350.479639
9d430.479639
10d480.479639
11d490.479639
12d560.479639
13d580.479639
89a620.426175
90a650.426175
91a670.426175
92a710.426175
93a720.426175
94a750.426175
95a790.426175
96a900.426175
97a920.426175
98a930.426175
99a960.426175
100a970.426175

we give a custom name for the result column

@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end
4×2 DataFrame
Rowidres
CharFloat64
1d0.479639
2b0.502398
3c0.548179
4a0.426175

you can have multiple operations

@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end
4×4 DataFrame
Rowidres1res2n
CharFloat64Float64Int64
1d0.47963911.031723
2b0.50239816.579133
3c0.54817913.156324
4a0.4261758.523520

Additional notes:

  • select! and transform! perform operations in-place

  • The general syntax for transformation is source_columns => function => target_column

  • if you pass multiple columns to a function they are treated as positional arguments

  • ByRow and AsTable work exactly like discussed for operations on data frames in 05_columns.ipynb

  • you can automatically groupby again the result of combine, select etc. by passing ungroup=false keyword argument to them

  • similarly keepkeys keyword argument allows you to drop grouping columns from the resulting data frame

It is also allowed to pass a function to all these functions (also - as a special case, as a first argument). In this case the return value can be a table. In particular it allows for an easy dropping of groups if you return an empty table from the function.

If you pass a function you can use a do block syntax. In case of passing a function it gets a SubDataFrame as its argument.

Here is an example:

combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) ## drop groups with low number of rows
end
1×2 DataFrame
Rowidn
CharInt64
1b33

You can also produce multiple columns in a single operation:

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])
4×2 DataFrame
Rowidval
Int64Int64
111
212
323
424
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end
2×3 DataFrame
Rowidx1x2
Int64Int64Int64
1112
2234
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end
2×3 DataFrame
Rowidc1c2
Int64Int64Int64
1112
2234

It is easy to unnest the column into multiple columns,

df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])
select(df, :a => AsTable)
2×2 DataFrame
Rowpq
Int64Int64
112
234

automatic column names generated

df = DataFrame(a=[[1, 2], [3, 4]])
select(df, :a => AsTable)
2×2 DataFrame
Rowx1x2
Int64Int64
112
234

custom column names generated

select(df, :a => [:C1, :C2])
2×2 DataFrame
RowC1C2
Int64Int64
112
234

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)
100×3 DataFrame
75 rows omitted
Rowidx1x2
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
11111111
12212112
13313113
89989189
901090190
91191191
92292192
93393193
94494194
95595195
96696196
97797197
98898198
99999199
10010100200
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end
10×3 DataFrame
Rowidx1_minimumx2_minimum
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end
10×5 DataFrame
Rowidx1_minimumx2_minimumx1_maximumx2_maximum
Int64Int64Int64Int64Int64
11110191191
22210292192
33310393193
44410494194
55510595195
66610696196
77710797197
88810898198
99910999199
101010110100200

Aggregation of a data frame using mapcols#

x = DataFrame(rand(10, 10), :auto)
10×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4063160.9831920.3847710.925230.9397310.9556980.3668890.6230420.190060.846325
20.4613180.6309190.1373740.8312080.2501980.2643450.5183620.2622070.1226960.307633
30.3113360.2611780.9911510.9721520.5092530.660560.9073030.514720.4122270.256053
40.4197240.9035280.2977720.5158090.01057330.0002375850.9467710.4630930.9766760.978568
50.6438060.1523960.6364820.6506650.401730.5154770.3014230.3169340.08737560.139309
60.6055380.2535320.2534670.358480.2148930.2681260.2997080.830040.699160.711818
70.5032390.5005040.9141210.6971060.8017140.8766750.1905610.451160.3740460.0171316
80.6998780.2460490.9126620.2193990.8628770.5399720.5546070.8211120.2537410.277274
90.605240.8285920.6233950.7404170.2960860.07194410.1813720.3288450.6977480.983598
100.4041520.1186630.9184180.6597530.2421870.8508870.7375580.01180750.1369590.544532
mapcols(mean, x)
1×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5060550.4878550.6069610.6570220.4529240.5003920.5004550.4622960.3950690.506224

Mapping rows and columns using eachcol and eachrow#

map a function over each column and return a vector

map(mean, eachcol(x))
10-element Vector{Float64}:
 0.5060547920796004
 0.48785535814481973
 0.6069613476270621
 0.6570220327155232
 0.45292425664895275
 0.5003921008238743
 0.5004553926468702
 0.4622961689025299
 0.39506884836796574
 0.5062243237787427

an iteration returns a Pair with column name and values

foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))
x1: 0.5060547920796004
x2: 0.48785535814481973
x3: 0.6069613476270621
x4: 0.6570220327155232
x5: 0.45292425664895275
x6: 0.5003921008238743
x7: 0.5004553926468702
x8: 0.4622961689025299
x9: 0.39506884836796574
x10: 0.5062243237787427

now the returned value is DataFrameRow which works as a NamedTuple but is a view to a parent DataFrame

map(r -> r.x1 / r.x2, eachrow(x))
10-element Vector{Float64}:
 0.4132624389343559
 0.7311832529054909
 1.1920438782095928
 0.46453955353950066
 4.224551393630487
 2.388414434655677
 1.0054654848736022
 2.8444695221189074
 0.730443081489957
 3.4058671583882494

it prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)
er.x1 ## you can access columns of a parent data frame directly
10-element Vector{Float64}:
 0.4063163827820543
 0.4613176613354725
 0.31133551371983115
 0.4197243987932313
 0.6438063265650948
 0.6055384860738079
 0.5032393113313084
 0.6998783593017472
 0.6052395617941637
 0.40415191909929205

it prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)
10×10 DataFrameColumns
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4063160.9831920.3847710.925230.9397310.9556980.3668890.6230420.190060.846325
20.4613180.6309190.1373740.8312080.2501980.2643450.5183620.2622070.1226960.307633
30.3113360.2611780.9911510.9721520.5092530.660560.9073030.514720.4122270.256053
40.4197240.9035280.2977720.5158090.01057330.0002375850.9467710.4630930.9766760.978568
50.6438060.1523960.6364820.6506650.401730.5154770.3014230.3169340.08737560.139309
60.6055380.2535320.2534670.358480.2148930.2681260.2997080.830040.699160.711818
70.5032390.5005040.9141210.6971060.8017140.8766750.1905610.451160.3740460.0171316
80.6998780.2460490.9126620.2193990.8628770.5399720.5546070.8211120.2537410.277274
90.605240.8285920.6233950.7404170.2960860.07194410.1813720.3288450.6977480.983598
100.4041520.1186630.9184180.6597530.2421870.8508870.7375580.01180750.1369590.544532

you can access columns of a parent data frame directly

ec.x1
10-element Vector{Float64}:
 0.4063163827820543
 0.4613176613354725
 0.31133551371983115
 0.4197243987932313
 0.6438063265650948
 0.6055384860738079
 0.5032393113313084
 0.6998783593017472
 0.6052395617941637
 0.40415191909929205

Transposing#

you can transpose a data frame using permutedims:

df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Rowx1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
df.names = ["a", "b", "c"]
3-element Vector{String}:
 "a"
 "b"
 "c"
permutedims(df, :names)
4×4 DataFrame
Rownamesabc
StringInt64Int64Int64
1x1123
2x2456
3x3789
4x4101112

This notebook was generated using Literate.jl.