Transformation to DataFrames#

Split-apply-combine

using DataFrames

Grouping a data frame#

groupby

x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.339102
2220.76979
3310.743467
4420.179302
5110.0578305
6220.043014
7310.126171
8420.00631068
groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = 1
Rowidid2v
Int64Int64Float64
1110.339102
2110.0578305

Last Group (2 rows): id = 4
Rowidid2v
Int64Int64Float64
1420.179302
2420.00631068
groupby(x, [])

GroupedDataFrame with 1 group based on key:

First Group (8 rows):
Rowidid2v
Int64Int64Float64
1110.339102
2220.76979
3310.743467
4420.179302
5110.0578305
6220.043014
7310.126171
8420.00631068
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.339102
2110.0578305

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.179302
2420.00631068

get the parent DataFrame

parent(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.339102
2220.76979
3310.743467
4420.179302
5110.0578305
6220.043014
7310.126171
8420.00631068

back to the DataFrame, but in a different order of rows than the original

vcat(gx2...)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.339102
2110.0578305
3220.76979
4220.043014
5310.743467
6310.126171
7420.179302
8420.00631068

the same as above

DataFrame(gx2)
8×3 DataFrame
Rowidid2v
Int64Int64Float64
1110.339102
2110.0578305
3220.76979
4220.043014
5310.743467
6310.126171
7420.179302
8420.00631068

drop grouping columns when creating a data frame

DataFrame(gx2, keepkeys=false)
8×1 DataFrame
Rowv
Float64
10.339102
20.0578305
30.76979
40.043014
50.743467
60.126171
70.179302
80.00631068

vector of names of grouping variables

groupcols(gx2)
2-element Vector{Symbol}:
 :id
 :id2

and non-grouping variables

valuecols(gx2)
1-element Vector{Symbol}:
 :v

group indices in parent(gx2)

groupindices(gx2)
8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4
kgx2 = keys(gx2)
4-element DataFrames.GroupKeys{DataFrames.GroupedDataFrame{DataFrames.DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame like to a vector or to a dictionary. The second form accepts GroupKey, NamedTuple or a Tuple.

gx2

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1
Rowidid2v
Int64Int64Float64
1110.339102
2110.0578305

Last Group (2 rows): id = 4, id2 = 2
Rowidid2v
Int64Int64Float64
1420.179302
2420.00631068
k = keys(gx2)[1]
GroupKey: (id = 1, id2 = 1)
ntk = NamedTuple(k)
(id = 1, id2 = 1)
tk = Tuple(k)
(1, 1)

the operations below produce the same result and are proformant

gx2[1], gx2[k], gx2[ntk], gx2[tk]
(2×3 SubDataFrame
 Row  id     id2    v         
     │ Int64  Int64  Float64   
─────┼─────────────────────────
   1 │     1      1  0.339102
   2 │     1      1  0.0578305, 2×3 SubDataFrame
 Row  id     id2    v         
     │ Int64  Int64  Float64   
─────┼─────────────────────────
   1 │     1      1  0.339102
   2 │     1      1  0.0578305, 2×3 SubDataFrame
 Row  id     id2    v         
     │ Int64  Int64  Float64   
─────┼─────────────────────────
   1 │     1      1  0.339102
   2 │     1      1  0.0578305, 2×3 SubDataFrame
 Row  id     id2    v         
     │ Int64  Int64  Float64   
─────┼─────────────────────────
   1 │     1      1  0.339102
   2 │     1      1  0.0578305)

handling missing values

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)
5×2 DataFrame
Rowidx
Int64?Int64
1missing1
252
313
434
5missing5

by default groups include missing values and their order is not guaranteed

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

but we can change it; now they are sorted

groupby(x, :id, sort=true, skipmissing=true)

GroupedDataFrame with 3 groups based on key: id

First Group (1 row): id = 1
Rowidx
Int64?Int64
113

Last Group (1 row): id = 5
Rowidx
Int64?Int64
152

and now they are in the order they appear in the source data frame

groupby(x, :id, sort=false)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = missing
Rowidx
Int64?Int64
1missing1
2missing5

Last Group (1 row): id = 3
Rowidx
Int64?Int64
134

Performing transformations#

by group using combine, select, select!, transform, and transform!

using Statistics
using Chain

x = DataFrame(id=rand('a':'d', 100), v=rand(100))
100×2 DataFrame
75 rows omitted
Rowidv
CharFloat64
1a0.370334
2b0.993104
3d0.635
4c0.616631
5b0.35559
6c0.124394
7b0.511263
8d0.874141
9b0.933005
10d0.588118
11a0.12122
12b0.822107
13c0.58134
89d0.54355
90d0.113034
91b0.817364
92c0.0342757
93b0.657528
94a0.877121
95b0.203659
96b0.241867
97c0.945331
98a0.0922752
99a0.928773
100b0.258577

apply a function to each group of a data frame combine keeps as many rows as are returned from the function

@chain x begin
    groupby(:id)
    combine(:v => mean)
end
4×2 DataFrame
Rowidv_mean
CharFloat64
1a0.424412
2b0.492894
3d0.47484
4c0.517263
x.id2 = axes(x, 1)
Base.OneTo(100)

Select and transform keep as many rows as are in the source data frame and in correct order. Additionally, transform keeps all columns from the source.

@chain x begin
    groupby(:id)
    transform(:v => mean)
end
100×4 DataFrame
75 rows omitted
Rowidvid2v_mean
CharFloat64Int64Float64
1a0.37033410.424412
2b0.99310420.492894
3d0.63530.47484
4c0.61663140.517263
5b0.3555950.492894
6c0.12439460.517263
7b0.51126370.492894
8d0.87414180.47484
9b0.93300590.492894
10d0.588118100.47484
11a0.12122110.424412
12b0.822107120.492894
13c0.58134130.517263
89d0.54355890.47484
90d0.113034900.47484
91b0.817364910.492894
92c0.0342757920.517263
93b0.657528930.492894
94a0.877121940.424412
95b0.203659950.492894
96b0.241867960.492894
97c0.945331970.517263
98a0.0922752980.424412
99a0.928773990.424412
100b0.2585771000.492894

note that combine reorders rows by group of GroupedDataFrame

@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end
100×3 DataFrame
75 rows omitted
Rowidid2v_mean
CharInt64Float64
1a10.424412
2a110.424412
3a170.424412
4a200.424412
5a220.424412
6a250.424412
7a300.424412
8a320.424412
9a410.424412
10a550.424412
11a610.424412
12a660.424412
13a670.424412
89c560.517263
90c570.517263
91c590.517263
92c620.517263
93c720.517263
94c740.517263
95c770.517263
96c790.517263
97c830.517263
98c840.517263
99c920.517263
100c970.517263

we give a custom name for the result column

@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end
4×2 DataFrame
Rowidres
CharFloat64
1a0.424412
2b0.492894
3d0.47484
4c0.517263

you can have multiple operations

@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end
4×4 DataFrame
Rowidres1res2n
CharFloat64Float64Int64
1a0.4244128.9126421
2b0.49289414.786830
3d0.4748411.396224
4c0.51726312.931625

Additional notes:

  • select! and transform! perform operations in-place

  • The general syntax for transformation is source_columns => function => target_column

  • if you pass multiple columns to a function they are treated as positional arguments

  • ByRow and AsTable work exactly like discussed for operations on data frames in 05_columns.ipynb

  • you can automatically groupby again the result of combine, select etc. by passing ungroup=false keyword argument to them

  • similarly keepkeys keyword argument allows you to drop grouping columns from the resulting data frame

It is also allowed to pass a function to all these functions (also - as a special case, as a first argument). In this case the return value can be a table. In particular it allows for an easy dropping of groups if you return an empty table from the function.

If you pass a function you can use a do block syntax. In case of passing a function it gets a SubDataFrame as its argument.

Here is an example:

combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) ## drop groups with low number of rows
end
2×2 DataFrame
Rowidn
CharInt64
1b30
2c25

You can also produce multiple columns in a single operation:

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])
4×2 DataFrame
Rowidval
Int64Int64
111
212
323
424
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end
2×3 DataFrame
Rowidx1x2
Int64Int64Int64
1112
2234
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end
2×3 DataFrame
Rowidc1c2
Int64Int64Int64
1112
2234

It is easy to unnest the column into multiple columns,

df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])
select(df, :a => AsTable)
2×2 DataFrame
Rowpq
Int64Int64
112
234

automatic column names generated

df = DataFrame(a=[[1, 2], [3, 4]])
select(df, :a => AsTable)
2×2 DataFrame
Rowx1x2
Int64Int64
112
234

custom column names generated

select(df, :a => [:C1, :C2])
2×2 DataFrame
RowC1C2
Int64Int64
112
234

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)
100×3 DataFrame
75 rows omitted
Rowidx1x2
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
11111111
12212112
13313113
89989189
901090190
91191191
92292192
93393193
94494194
95595195
96696196
97797197
98898198
99999199
10010100200
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end
10×3 DataFrame
Rowidx1_minimumx2_minimum
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end
10×5 DataFrame
Rowidx1_minimumx2_minimumx1_maximumx2_maximum
Int64Int64Int64Int64Int64
11110191191
22210292192
33310393193
44410494194
55510595195
66610696196
77710797197
88810898198
99910999199
101010110100200

Aggregation of a data frame using mapcols#

x = DataFrame(rand(10, 10), :auto)
10×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4641990.1975730.4716820.5985970.2033530.398530.7071390.6337370.9060730.948284
20.4126740.2514070.1145380.1467080.462440.07587740.5530530.08861150.9147860.998204
30.1734660.7247950.1292640.01000640.799540.07538510.9494510.265990.8667780.264724
40.4535790.0714890.04251210.643350.2648810.07227150.6167450.7368370.8047390.948438
50.118750.1062270.96220.5048960.570590.6440280.3720350.8721280.731280.510869
60.3255530.130650.8934740.6515750.673490.5855310.1919170.5025650.3592870.922355
70.9056370.4993350.3031880.2687080.6386030.1185150.3105340.2918680.01228520.637567
80.1020740.6220270.08340840.5009280.3425730.4921620.6743330.8138990.06320160.396585
90.2484940.01752270.06895990.1458550.4297780.2556790.1627850.9619140.4383820.110069
100.753590.3402630.8266920.7438530.305190.4003590.2086860.4006740.9962910.7375
mapcols(mean, x)
1×10 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.3958020.2961290.3895920.4214480.4690440.3118340.4746680.5568220.609310.647459

Mapping rows and columns using eachcol and eachrow#

map a function over each column and return a vector

map(mean, eachcol(x))
10-element Vector{Float64}:
 0.3958017176190256
 0.2961289269833874
 0.38959191218230754
 0.4214476688437806
 0.4690437545883105
 0.31183387085376063
 0.4746679779896309
 0.55682222654873
 0.6093101882415054
 0.6474594562251919

an iteration returns a Pair with column name and values

foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))
x1: 0.3958017176190256
x2: 0.2961289269833874
x3: 0.38959191218230754
x4: 0.4214476688437806
x5: 0.4690437545883105
x6: 0.31183387085376063
x7: 0.4746679779896309
x8: 0.55682222654873
x9: 0.6093101882415054
x10: 0.6474594562251919

now the returned value is DataFrameRow which works as a NamedTuple but is a view to a parent DataFrame

map(r -> r.x1 / r.x2, eachrow(x))
10-element Vector{Float64}:
  2.3495068850231715
  1.6414563044865957
  0.23933124317182303
  6.3447452579481505
  1.1178820382382946
  2.4917889463094918
  1.8136875370352246
  0.16409978501458122
 14.181299738280472
  2.214727543588411

it prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)
er.x1 ## you can access columns of a parent data frame directly
10-element Vector{Float64}:
 0.4641990970776172
 0.4126742463842531
 0.1734660417075179
 0.45357920179274114
 0.11874965606061971
 0.3255530246477467
 0.9056369528299589
 0.10207449347088948
 0.24849417821930797
 0.7535902839996041

it prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)
10×10 DataFrameColumns
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4641990.1975730.4716820.5985970.2033530.398530.7071390.6337370.9060730.948284
20.4126740.2514070.1145380.1467080.462440.07587740.5530530.08861150.9147860.998204
30.1734660.7247950.1292640.01000640.799540.07538510.9494510.265990.8667780.264724
40.4535790.0714890.04251210.643350.2648810.07227150.6167450.7368370.8047390.948438
50.118750.1062270.96220.5048960.570590.6440280.3720350.8721280.731280.510869
60.3255530.130650.8934740.6515750.673490.5855310.1919170.5025650.3592870.922355
70.9056370.4993350.3031880.2687080.6386030.1185150.3105340.2918680.01228520.637567
80.1020740.6220270.08340840.5009280.3425730.4921620.6743330.8138990.06320160.396585
90.2484940.01752270.06895990.1458550.4297780.2556790.1627850.9619140.4383820.110069
100.753590.3402630.8266920.7438530.305190.4003590.2086860.4006740.9962910.7375

you can access columns of a parent data frame directly

ec.x1
10-element Vector{Float64}:
 0.4641990970776172
 0.4126742463842531
 0.1734660417075179
 0.45357920179274114
 0.11874965606061971
 0.3255530246477467
 0.9056369528299589
 0.10207449347088948
 0.24849417821930797
 0.7535902839996041

Transposing#

you can transpose a data frame using permutedims:

df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Rowx1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
df.names = ["a", "b", "c"]
3-element Vector{String}:
 "a"
 "b"
 "c"
permutedims(df, :names)
4×4 DataFrame
Rownamesabc
StringInt64Int64Int64
1x1123
2x2456
3x3789
4x4101112

This notebook was generated using Literate.jl.