Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Split-apply-combine

using DataFrames

Grouping a data frame

groupby

x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))
Loading...
groupby(x, :id)
Loading...
groupby(x, [])
Loading...
gx2 = groupby(x, [:id, :id2])
Loading...

get the parent DataFrame

parent(gx2)
Loading...

back to the DataFrame, but in a different order of rows than the original

vcat(gx2...)
Loading...

the same as above

DataFrame(gx2)
Loading...

drop grouping columns when creating a data frame

DataFrame(gx2, keepkeys=false)
Loading...

vector of names of grouping variables

groupcols(gx2)
2-element Vector{Symbol}: :id :id2

and non-grouping variables

valuecols(gx2)
1-element Vector{Symbol}: :v

group indices in parent(gx2)

groupindices(gx2)
8-element Vector{Union{Missing, Int64}}: 1 2 3 4 1 2 3 4
kgx2 = keys(gx2)
4-element DataFrames.GroupKeys{DataFrames.GroupedDataFrame{DataFrames.DataFrame}}: GroupKey: (id = 1, id2 = 1) GroupKey: (id = 2, id2 = 2) GroupKey: (id = 3, id2 = 1) GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame like to a vector or to a dictionary. The second form accepts GroupKey, NamedTuple or a Tuple.

gx2
Loading...
k = keys(gx2)[1]
GroupKey: (id = 1, id2 = 1)
ntk = NamedTuple(k)
(id = 1, id2 = 1)
tk = Tuple(k)
(1, 1)

the operations below produce the same result and are proformant

gx2[1], gx2[k], gx2[ntk], gx2[tk]
(2×3 SubDataFrame Row id id2 v Int64 Int64 Float64 ─────┼───────────────────────── 1 │ 1 1 0.0985022 2 │ 1 1 0.807281, 2×3 SubDataFrame Row id id2 v Int64 Int64 Float64 ─────┼───────────────────────── 1 │ 1 1 0.0985022 2 │ 1 1 0.807281, 2×3 SubDataFrame Row id id2 v Int64 Int64 Float64 ─────┼───────────────────────── 1 │ 1 1 0.0985022 2 │ 1 1 0.807281, 2×3 SubDataFrame Row id id2 v Int64 Int64 Float64 ─────┼───────────────────────── 1 │ 1 1 0.0985022 2 │ 1 1 0.807281)

handling missing values

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)
Loading...

by default groups include missing values and their order is not guaranteed

groupby(x, :id)
Loading...

but we can change it; now they are sorted

groupby(x, :id, sort=true, skipmissing=true)
Loading...

and now they are in the order they appear in the source data frame

groupby(x, :id, sort=false)
Loading...

Performing transformations

by group using combine, select, select!, transform, and transform!

using Statistics
using Chain

x = DataFrame(id=rand('a':'d', 100), v=rand(100))
Loading...

apply a function to each group of a data frame combine keeps as many rows as are returned from the function

@chain x begin
    groupby(:id)
    combine(:v => mean)
end
Loading...
x.id2 = axes(x, 1)
Base.OneTo(100)

Select and transform keep as many rows as are in the source data frame and in correct order. Additionally, transform keeps all columns from the source.

@chain x begin
    groupby(:id)
    transform(:v => mean)
end
Loading...

note that combine reorders rows by group of GroupedDataFrame

@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end
Loading...

we give a custom name for the result column

@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end
Loading...

you can have multiple operations

@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end
Loading...

Additional notes:

  • select! and transform! perform operations in-place

  • The general syntax for transformation is source_columns => function => target_column

  • if you pass multiple columns to a function they are treated as positional arguments

  • ByRow and AsTable work exactly like discussed for operations on data frames in 05_columns.ipynb

  • you can automatically groupby again the result of combine, select etc. by passing ungroup=false keyword argument to them

  • similarly keepkeys keyword argument allows you to drop grouping columns from the resulting data frame

It is also allowed to pass a function to all these functions (also - as a special case, as a first argument). In this case the return value can be a table. In particular it allows for an easy dropping of groups if you return an empty table from the function.

If you pass a function you can use a do block syntax. In case of passing a function it gets a SubDataFrame as its argument.

Here is an example:

combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) ## drop groups with low number of rows
end
Loading...

You can also produce multiple columns in a single operation:

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])
Loading...
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end
Loading...
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end
Loading...

It is easy to unnest the column into multiple columns,

df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])
select(df, :a => AsTable)
Loading...

automatic column names generated

df = DataFrame(a=[[1, 2], [3, 4]])
select(df, :a => AsTable)
Loading...

custom column names generated

select(df, :a => [:C1, :C2])
Loading...

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)
Loading...
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end
Loading...
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end
Loading...

Aggregation of a data frame using mapcols

x = DataFrame(rand(10, 10), :auto)
Loading...
mapcols(mean, x)
Loading...

Mapping rows and columns using eachcol and eachrow

map a function over each column and return a vector

map(mean, eachcol(x))
10-element Vector{Float64}: 0.6370950733736179 0.5371450258748118 0.6148211343589259 0.6728720548889984 0.6795271988422863 0.5993090446173606 0.473456379743358 0.4968696652568423 0.6176732417197414 0.4346372588074159

an iteration returns a Pair with column name and values

foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))
x1: 0.6370950733736179
x2: 0.5371450258748118
x3: 0.6148211343589259
x4: 0.6728720548889984
x5: 0.6795271988422863
x6: 0.5993090446173606
x7: 0.473456379743358
x8: 0.4968696652568423
x9: 0.6176732417197414
x10: 0.4346372588074159

now the returned value is DataFrameRow which works as a NamedTuple but is a view to a parent DataFrame

map(r -> r.x1 / r.x2, eachrow(x))
10-element Vector{Float64}: 0.8539026855869895 0.2654061535888251 1.0719528846523578 1.1879279852400852 1.3291632401407916 1.1148404958013007 5.426896192700847 2.016517028243219 1.1863027057204478 1.4532757622030301

it prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)
er.x1 ## you can access columns of a parent data frame directly
10-element Vector{Float64}: 0.711450140216487 0.24328795450984142 0.3446363034383769 0.9103082680210326 0.7968947359218227 0.4576005594843121 0.8709220353521566 0.7848443047708391 0.7316063734530001 0.5194000585683105

it prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)
Loading...

you can access columns of a parent data frame directly

ec.x1
10-element Vector{Float64}: 0.711450140216487 0.24328795450984142 0.3446363034383769 0.9103082680210326 0.7968947359218227 0.4576005594843121 0.8709220353521566 0.7848443047708391 0.7316063734530001 0.5194000585683105

Transposing

you can transpose a data frame using permutedims:

df = DataFrame(reshape(1:12, 3, 4), :auto)
Loading...
df.names = ["a", "b", "c"]
3-element Vector{String}: "a" "b" "c"
permutedims(df, :names)
Loading...

This notebook was generated using Literate.jl.