Split-apply-combine
using DataFramesGrouping a data frame¶
groupby
x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))groupby(x, :id)groupby(x, [])gx2 = groupby(x, [:id, :id2])get the parent DataFrame
parent(gx2)back to the DataFrame, but in a different order of rows than the original
vcat(gx2...)the same as above
DataFrame(gx2)drop grouping columns when creating a data frame
DataFrame(gx2, keepkeys=false)vector of names of grouping variables
groupcols(gx2)2-element Vector{Symbol}:
:id
:id2and non-grouping variables
valuecols(gx2)1-element Vector{Symbol}:
:vgroup indices in parent(gx2)
groupindices(gx2)8-element Vector{Union{Missing, Int64}}:
1
2
3
4
1
2
3
4kgx2 = keys(gx2)4-element DataFrames.GroupKeys{DataFrames.GroupedDataFrame{DataFrames.DataFrame}}:
GroupKey: (id = 1, id2 = 1)
GroupKey: (id = 2, id2 = 2)
GroupKey: (id = 3, id2 = 1)
GroupKey: (id = 4, id2 = 2)You can index into a GroupedDataFrame like to a vector or to a dictionary. The second form accepts GroupKey, NamedTuple or a Tuple.
gx2k = keys(gx2)[1]GroupKey: (id = 1, id2 = 1)ntk = NamedTuple(k)(id = 1, id2 = 1)tk = Tuple(k)(1, 1)the operations below produce the same result and are proformant
gx2[1], gx2[k], gx2[ntk], gx2[tk](2×3 SubDataFrame
Row │ id id2 v
│ Int64 Int64 Float64
─────┼─────────────────────────
1 │ 1 1 0.0985022
2 │ 1 1 0.807281, 2×3 SubDataFrame
Row │ id id2 v
│ Int64 Int64 Float64
─────┼─────────────────────────
1 │ 1 1 0.0985022
2 │ 1 1 0.807281, 2×3 SubDataFrame
Row │ id id2 v
│ Int64 Int64 Float64
─────┼─────────────────────────
1 │ 1 1 0.0985022
2 │ 1 1 0.807281, 2×3 SubDataFrame
Row │ id id2 v
│ Int64 Int64 Float64
─────┼─────────────────────────
1 │ 1 1 0.0985022
2 │ 1 1 0.807281)handling missing values
x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)by default groups include missing values and their order is not guaranteed
groupby(x, :id)but we can change it; now they are sorted
groupby(x, :id, sort=true, skipmissing=true)and now they are in the order they appear in the source data frame
groupby(x, :id, sort=false)Performing transformations¶
by group using combine, select, select!, transform, and transform!
using Statistics
using Chain
x = DataFrame(id=rand('a':'d', 100), v=rand(100))apply a function to each group of a data frame combine keeps as many rows as are returned from the function
@chain x begin
groupby(:id)
combine(:v => mean)
endx.id2 = axes(x, 1)Base.OneTo(100)Select and transform keep as many rows as are in the source data frame and in correct order. Additionally, transform keeps all columns from the source.
@chain x begin
groupby(:id)
transform(:v => mean)
endnote that combine reorders rows by group of GroupedDataFrame
@chain x begin
groupby(:id)
combine(:id2, :v => mean)
endwe give a custom name for the result column
@chain x begin
groupby(:id)
combine(:v => mean => :res)
endyou can have multiple operations
@chain x begin
groupby(:id)
combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
endAdditional notes:
select!andtransform!perform operations in-placeThe general syntax for transformation is
source_columns => function => target_columnif you pass multiple columns to a function they are treated as positional arguments
ByRowandAsTablework exactly like discussed for operations on data frames in 05_columns.ipynbyou can automatically groupby again the result of
combine,selectetc. by passingungroup=falsekeyword argument to themsimilarly
keepkeyskeyword argument allows you to drop grouping columns from the resulting data frame
It is also allowed to pass a function to all these functions (also - as a special case, as a first argument). In this case the return value can be a table. In particular it allows for an easy dropping of groups if you return an empty table from the function.
If you pass a function you can use a do block syntax. In case of passing a function it gets a SubDataFrame as its argument.
Here is an example:
combine(groupby(x, :id)) do sdf
n = nrow(sdf)
n < 25 ? DataFrame() : DataFrame(n=n) ## drop groups with low number of rows
endYou can also produce multiple columns in a single operation:
df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])@chain df begin
groupby(:id)
combine(:val => (x -> [x]) => AsTable)
end@chain df begin
groupby(:id)
combine(:val => (x -> [x]) => [:c1, :c2])
endIt is easy to unnest the column into multiple columns,
df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])
select(df, :a => AsTable)automatic column names generated
df = DataFrame(a=[[1, 2], [3, 4]])
select(df, :a => AsTable)custom column names generated
select(df, :a => [:C1, :C2])Finally, observe that one can conveniently apply multiple transformations using broadcasting:
df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)@chain df begin
groupby(:id)
combine([:x1, :x2] .=> minimum)
end@chain df begin
groupby(:id)
combine([:x1, :x2] .=> [minimum maximum])
endAggregation of a data frame using mapcols¶
x = DataFrame(rand(10, 10), :auto)mapcols(mean, x)Mapping rows and columns using eachcol and eachrow¶
map a function over each column and return a vector
map(mean, eachcol(x))10-element Vector{Float64}:
0.6370950733736179
0.5371450258748118
0.6148211343589259
0.6728720548889984
0.6795271988422863
0.5993090446173606
0.473456379743358
0.4968696652568423
0.6176732417197414
0.4346372588074159an iteration returns a Pair with column name and values
foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))x1: 0.6370950733736179
x2: 0.5371450258748118
x3: 0.6148211343589259
x4: 0.6728720548889984
x5: 0.6795271988422863
x6: 0.5993090446173606
x7: 0.473456379743358
x8: 0.4968696652568423
x9: 0.6176732417197414
x10: 0.4346372588074159
now the returned value is DataFrameRow which works as a NamedTuple but is a view to a parent DataFrame
map(r -> r.x1 / r.x2, eachrow(x))10-element Vector{Float64}:
0.8539026855869895
0.2654061535888251
1.0719528846523578
1.1879279852400852
1.3291632401407916
1.1148404958013007
5.426896192700847
2.016517028243219
1.1863027057204478
1.4532757622030301it prints like a data frame, only the caption is different so that you know the type of the object
er = eachrow(x)
er.x1 ## you can access columns of a parent data frame directly10-element Vector{Float64}:
0.711450140216487
0.24328795450984142
0.3446363034383769
0.9103082680210326
0.7968947359218227
0.4576005594843121
0.8709220353521566
0.7848443047708391
0.7316063734530001
0.5194000585683105it prints like a data frame, only the caption is different so that you know the type of the object
ec = eachcol(x)you can access columns of a parent data frame directly
ec.x110-element Vector{Float64}:
0.711450140216487
0.24328795450984142
0.3446363034383769
0.9103082680210326
0.7968947359218227
0.4576005594843121
0.8709220353521566
0.7848443047708391
0.7316063734530001
0.5194000585683105Transposing¶
you can transpose a data frame using permutedims:
df = DataFrame(reshape(1:12, 3, 4), :auto)df.names = ["a", "b", "c"]3-element Vector{String}:
"a"
"b"
"c"permutedims(df, :names)This notebook was generated using Literate.jl.