Transformation to DataFrames#

Split-apply-combine

using DataFrames

Grouping a dat=a frame#

x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.0298167
2220.890129
3310.608735
4420.648561
5110.187759
6220.510651
7310.451765
8420.847968
groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = 1
Rowidid2v
Int64Int64Float64
1110.0298167
2110.187759

Last Group (2 rows): id = 4
Rowidid2v
Int64Int64Float64
1420.648561
2420.847968
groupby(x, [])

GroupedDataFrame with 1 group based on key:

First Group (8 rows):
Rowidid2v
Int64Int64Float64
1110.0298167
2220.890129
3310.608735
4420.648561
5110.187759
6220.510651
7310.451765
8420.847968
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.0298167
2110.187759

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.648561
2420.847968

get the parent DataFrame

parent(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.0298167
2220.890129
3310.608735
4420.648561
5110.187759
6220.510651
7310.451765
8420.847968

back to the DataFrame, but in a different order of rows than the original

vcat(gx2...)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.0298167
2110.187759
3220.890129
4220.510651
5310.608735
6310.451765
7420.648561
8420.847968

the same

DataFrame(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.0298167
2110.187759
3220.890129
4220.510651
5310.608735
6310.451765
7420.648561
8420.847968

drop grouping columns when creating a data frame

DataFrame(gx2, keepkeys=false)
8×1 DataFrame
Rowv
Float64
10.0298167
20.187759
30.890129
40.510651
50.608735
60.451765
70.648561
80.847968

vector of names of grouping variables

groupcols(gx2)
2-element Vector{Symbol}:
 :id
 :id2

and non-grouping variables

valuecols(gx2)
1-element Vector{Symbol}:
 :v

group indices in parent(gx2)

groupindices(gx2)
8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4
kgx2 = keys(gx2)
4-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame like to a vector or to a dictionary. The second form acceps GroupKey, NamedTuple or a Tuple

gx2

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.0298167
2110.187759

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.648561
2420.847968
k = keys(gx2)[1]
GroupKey: (id = 1, id2 = 1)
ntk = NamedTuple(k)
(id = 1, id2 = 1)
tk = Tuple(k)
(1, 1)

the operations below produce the same result and are fast

gx2[1]
2×3 SubDataFrame
Rowidid2v
Int64Int64Float64
1110.0298167
2110.187759
gx2[k]
2×3 SubDataFrame
Rowidid2v
Int64Int64Float64
1110.0298167
2110.187759
gx2[ntk]
2×3 SubDataFrame
Rowidid2v
Int64Int64Float64
1110.0298167
2110.187759
gx2[tk]
2×3 SubDataFrame
Rowidid2v
Int64Int64Float64
1110.0298167
2110.187759

handling missing values

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)
5×2 DataFrame
Rowidx
Int64?Int64
1missing1
252
313
434
5missing5

by default groups include mising values and their order is not guaranteed

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

but we can change it; now they are sorted

groupby(x, :id, sort=true, skipmissing=true)

GroupedDataFrame with 3 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (1 row): id = 5
Rowidx
Int64?Int64
152

and now they are in the order they appear in the source data frame

groupby(x, :id, sort=false)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

Last Group (1 row): id = 3
Rowidx
Int64?Int64
134

Performing transformations#

by group using combine, select, select!, transform, and transform!

using Statistics
using Chain

reduce the number of rows in the output

ENV["LINES"] = 15
15
x = DataFrame(id=rand('a':'d', 100), v=rand(100))
100×2 DataFrame
75 rows omitted
Rowidv
CharFloat64
1b0.0947633
2b0.444125
3c0.587247
4b0.330791
5b0.31695
6b7.83911e-5
7d0.728483
8b0.301966
9d0.577585
10b0.360327
11c0.803869
12c0.836187
13c0.633544
89c0.234815
90a0.0572505
91a0.421841
92a0.135045
93a0.702993
94d0.788392
95a0.759564
96c0.445145
97c0.756522
98b0.413141
99b0.617292
100d0.0817082

apply a function to each group of a data frame combine keeps as many rows as are returned from the function

@chain x begin
    groupby(:id)
    combine(:v => mean)
end
4×2 DataFrame
Rowidv_mean
CharFloat64
1b0.375926
2c0.538606
3d0.60607
4a0.484241
x.id2 = axes(x, 1)
Base.OneTo(100)

select and transform keep as many rows as are in the source data frame and in correct order additionally transform keeps all columns from the source

@chain x begin
    groupby(:id)
    transform(:v => mean)
end
100×4 DataFrame
75 rows omitted
Rowidvid2v_mean
CharFloat64Int64Float64
1b0.094763310.375926
2b0.44412520.375926
3c0.58724730.538606
4b0.33079140.375926
5b0.3169550.375926
6b7.83911e-560.375926
7d0.72848370.60607
8b0.30196680.375926
9d0.57758590.60607
10b0.360327100.375926
11c0.803869110.538606
12c0.836187120.538606
13c0.633544130.538606
89c0.234815890.538606
90a0.0572505900.484241
91a0.421841910.484241
92a0.135045920.484241
93a0.702993930.484241
94d0.788392940.60607
95a0.759564950.484241
96c0.445145960.538606
97c0.756522970.538606
98b0.413141980.375926
99b0.617292990.375926
100d0.08170821000.60607

note that combine reorders rows by group of GroupedDataFrame

@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end
100×3 DataFrame
75 rows omitted
Rowidid2v_mean
CharInt64Float64
1b10.375926
2b20.375926
3b40.375926
4b50.375926
5b60.375926
6b80.375926
7b100.375926
8b170.375926
9b220.375926
10b230.375926
11b250.375926
12b260.375926
13b270.375926
89a740.484241
90a750.484241
91a770.484241
92a780.484241
93a790.484241
94a800.484241
95a870.484241
96a900.484241
97a910.484241
98a920.484241
99a930.484241
100a950.484241

we give a custom name for the result column

@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end
4×2 DataFrame
Rowidres
CharFloat64
1b0.375926
2c0.538606
3d0.60607
4a0.484241

you can have multiple operations

@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end
4×4 DataFrame
Rowidres1res2n
CharFloat64Float64Int64
1b0.37592610.525928
2c0.53860614.003726
3d0.6060711.515319
4a0.48424113.074527

Additional notes:

  • select! and transform! perform operations in-place

  • The general syntax for transformation is source_columns => function => target_column

  • if you pass multiple columns to a function they are treated as positional arguments

  • ByRow and AsTable work exactly like discussed for operations on data frames in 05_columns.ipynb

  • you can automatically groupby again the result of combine, select etc. by passing ungroup=false keyword argument to them

  • similarly keepkeys keyword argument allows you to drop grouping columns from the resulting data frame

It is also allowed to pass a function to all these functions (also - as a special case, as a first argument). In this case the return value can be a table. In particular it allows for an easy dropping of groups if you return an empty table from the function.

If you pass a function you can use a do block syntax. In case of passing a function it gets a SubDataFrame as its argument.

Here is an example:

combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) ## drop groups with low number of rows
end
3×2 DataFrame
Rowidn
CharInt64
1b28
2c26
3a27

You can also produce multiple columns in a single operation, e.g.:

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])
4×2 DataFrame
Rowidval
Int64Int64
111
212
323
424
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end
2×3 DataFrame
Rowidx1x2
Int64Int64Int64
1112
2234
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end
2×3 DataFrame
Rowidc1c2
Int64Int64Int64
1112
2234

t is easy to unnest the column into multiple columns, e.g.

df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])
2×1 DataFrame
Rowa
NamedTup…
1(p = 1, q = 2)
2(p = 3, q = 4)
select(df, :a => AsTable)
2×2 DataFrame
Rowpq
Int64Int64
112
234
df = DataFrame(a=[[1, 2], [3, 4]])
2×1 DataFrame
Rowa
Array…
1[1, 2]
2[3, 4]

automatic column names generated

select(df, :a => AsTable)
2×2 DataFrame
Rowx1x2
Int64Int64
112
234

custom column names generated

select(df, :a => [:C1, :C2])
2×2 DataFrame
RowC1C2
Int64Int64
112
234

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)
100×3 DataFrame
75 rows omitted
Rowidx1x2
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
11111111
12212112
13313113
89989189
901090190
91191191
92292192
93393193
94494194
95595195
96696196
97797197
98898198
99999199
10010100200
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end
10×3 DataFrame
Rowidx1_minimumx2_minimum
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end
10×5 DataFrame
Rowidx1_minimumx2_minimumx1_maximumx2_maximum
Int64Int64Int64Int64Int64
11110191191
22210292192
33310393193
44410494194
55510595195
66610696196
77710797197
88810898198
99910999199
101010110100200

Aggregation of a data frame using mapcols#

x = DataFrame(rand(10, 10), :auto)
10×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.2394130.3726320.3918080.8817660.9264260.1729870.8860670.94450.991340.567602
20.1273320.5236840.2807440.6314590.1000090.4142130.4225490.4637210.02107590.569317
30.9102730.7497510.6655980.6614690.6567250.2943940.4850970.4116870.2883020.633039
40.7533250.6425350.2213390.3330830.7834310.2744990.4023390.6981890.7197780.198621
50.03870460.369160.7171330.389620.6580460.575610.03611610.184530.65390.490985
60.9396080.2328130.6325980.6250160.9885540.3642420.3841420.8377590.7753180.136857
70.986040.6463580.6768850.6619430.7982120.606850.7831940.727250.4455320.311484
80.8770280.4524340.6541840.5927950.1606480.8825150.3630150.8522140.7625080.863806
90.03067930.2613510.2927750.3398210.1435830.4738890.2291760.05220240.7636650.277183
100.80660.8183270.6008030.6342670.53110.6104160.8476060.7034340.6139010.974852
mapcols(mean, x)
1×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.57090.5069040.5133870.5751240.5746730.4669620.483930.5875490.6035320.502374

Mapping rows and columns using eachcol and eachrow#

map a function over each column and return a vector

map(mean, eachcol(x))
10-element Vector{Float64}:
 0.5709004496231695
 0.5069044704575527
 0.5133865376373459
 0.5751238253561982
 0.5746734925011687
 0.46696151605594977
 0.4839300017510227
 0.5875487958217412
 0.6035319143917144
 0.5023744992952286

an iteration returns a Pair with column name and values

foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))
x1: 0.5709004496231695
x2: 0.5069044704575527
x3: 0.5133865376373459
x4: 0.5751238253561982
x5: 0.5746734925011687
x6: 0.46696151605594977
x7: 0.4839300017510227
x8: 0.5875487958217412
x9: 0.6035319143917144
x10: 0.5023744992952286

now the returned value is DataFrameRow which works as a NamedTuple but is a view to a parent DataFrame

map(r -> r.x1 / r.x2, eachrow(x))
10-element Vector{Float64}:
 0.6424933971881048
 0.24314771856095865
 1.2141011230985737
 1.1724263056867696
 0.10484501684899052
 4.035893016397789
 1.5255319699420227
 1.9384648268062645
 0.11738731668127854
 0.985669840177133

it prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)

er.x1 # you can access columns of a parent data frame directly
10-element Vector{Float64}:
 0.23941342855307912
 0.1273324638742388
 0.9102731889046806
 0.7533254463886194
 0.038704570132973126
 0.9396078861391909
 0.9860398107614828
 0.8770280375662959
 0.03067929975280259
 0.8066003641583329

it prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)
10×10 DataFrameColumns
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.2394130.3726320.3918080.8817660.9264260.1729870.8860670.94450.991340.567602
20.1273320.5236840.2807440.6314590.1000090.4142130.4225490.4637210.02107590.569317
30.9102730.7497510.6655980.6614690.6567250.2943940.4850970.4116870.2883020.633039
40.7533250.6425350.2213390.3330830.7834310.2744990.4023390.6981890.7197780.198621
50.03870460.369160.7171330.389620.6580460.575610.03611610.184530.65390.490985
60.9396080.2328130.6325980.6250160.9885540.3642420.3841420.8377590.7753180.136857
70.986040.6463580.6768850.6619430.7982120.606850.7831940.727250.4455320.311484
80.8770280.4524340.6541840.5927950.1606480.8825150.3630150.8522140.7625080.863806
90.03067930.2613510.2927750.3398210.1435830.4738890.2291760.05220240.7636650.277183
100.80660.8183270.6008030.6342670.53110.6104160.8476060.7034340.6139010.974852

you can access columns of a parent data frame directly

ec.x1
10-element Vector{Float64}:
 0.23941342855307912
 0.1273324638742388
 0.9102731889046806
 0.7533254463886194
 0.038704570132973126
 0.9396078861391909
 0.9860398107614828
 0.8770280375662959
 0.03067929975280259
 0.8066003641583329

Transposing#

you can transpose a data frame using permutedims:

df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Rowx1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
df.names = ["a", "b", "c"]
3-element Vector{String}:
 "a"
 "b"
 "c"
permutedims(df, :names)
4×4 DataFrame
Rownamesabc
StringInt64Int64Int64
1x1123
2x2456
3x3789
4x4101112

revert the changes for line width

delete!(ENV, "LINES")
Base.EnvDict with 18 entries:
  "PATH"                         => "/usr/local/julia//bin:/usr/local/bin:/usr/…
  "HOSTNAME"                     => "c0e5116d4768"
  "LANG"                         => "C.UTF-8"
  "GPG_KEY"                      => "7169605F62C751356D054A26A821E680E5FA6305"
  "PYTHON_VERSION"               => "3.12.2"
  "PYTHON_PIP_VERSION"           => "24.0"
  "PYTHON_GET_PIP_URL"           => "https://github.com/pypa/get-pip/raw/dbf0c8…
  "PYTHON_GET_PIP_SHA256"        => "dfe9fd5c28dc98b5ac17979a953ea550cec37ae1b4…
  "JULIA_CI"                     => "true"
  "JULIA_NUM_THREADS"            => "auto"
  "JULIA_CONDAPKG_BACKEND"       => "Null"
  "JULIA_PATH"                   => "/usr/local/julia/"
  "JULIA_DEPOT_PATH"             => "/srv/juliapkg/"
  "HOME"                         => "/root"
  "JPY_PARENT_PID"               => "1"
  "OPENBLAS_MAIN_FREE"           => "1"
  "OPENBLAS_DEFAULT_NUM_THREADS" => "1"
  "COLUMNS"                      => "80"