Transformation to DataFrames

Split-apply-combine

using DataFrames

Grouping a data frame#

groupby

x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))

8×3 DataFrame

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.847192
2	2	2	0.863094
3	3	1	0.945034
4	4	2	0.850163
5	1	1	0.401421
6	2	2	0.544358
7	3	1	0.971199
8	4	2	0.954327

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = 1

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.847192
2	1	1	0.401421

⋮

Last Group (2 rows): id = 4

Row	id	id2	v
	Int64	Int64	Float64
1	4	2	0.850163
2	4	2	0.954327

groupby(x, [])

GroupedDataFrame with 1 group based on key:

First Group (8 rows):

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.847192
2	2	2	0.863094
3	3	1	0.945034
4	4	2	0.850163
5	1	1	0.401421
6	2	2	0.544358
7	3	1	0.971199
8	4	2	0.954327

gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.847192
2	1	1	0.401421

⋮

Last Group (2 rows): id = 4, id2 = 2

Row	id	id2	v
	Int64	Int64	Float64
1	4	2	0.850163
2	4	2	0.954327

get the parent DataFrame

parent(gx2)

8×3 DataFrame

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.847192
2	2	2	0.863094
3	3	1	0.945034
4	4	2	0.850163
5	1	1	0.401421
6	2	2	0.544358
7	3	1	0.971199
8	4	2	0.954327

convert back to a DataFrame; note that the rows are now ordered by group, not as in the original

vcat(gx2...)

8×3 DataFrame

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.847192
2	1	1	0.401421
3	2	2	0.863094
4	2	2	0.544358
5	3	1	0.945034
6	3	1	0.971199
7	4	2	0.850163
8	4	2	0.954327

the same as above

DataFrame(gx2)

8×3 DataFrame

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.847192
2	1	1	0.401421
3	2	2	0.863094
4	2	2	0.544358
5	3	1	0.945034
6	3	1	0.971199
7	4	2	0.850163
8	4	2	0.954327
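The difference between parent and DataFrame on a grouped data frame can be checked on a small case. The sketch below uses hypothetical toy data (not the data frame x from above) and passes sort=true so that the group order is known to differ from the row order.

```julia
using DataFrames

# Hypothetical toy data, chosen so that group order differs from row order.
df = DataFrame(id=[2, 1, 2, 1], v=[1, 2, 3, 4])
gdf = groupby(df, :id, sort=true)

# parent returns the original data frame itself, with rows in source order.
same_object = parent(gdf) === df

# DataFrame(gdf) builds a new data frame with rows collected group by group.
flat = DataFrame(gdf)

# keepkeys=false drops the grouping column from the collected result.
no_keys = DataFrame(gdf, keepkeys=false)
```

Here flat lists both rows of group id = 1 before the rows of group id = 2, while df is left unchanged.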

drop grouping columns when creating a data frame

DataFrame(gx2, keepkeys=false)

8×1 DataFrame

Row	v
	Float64
1	0.847192
2	0.401421
3	0.863094
4	0.544358
5	0.945034
6	0.971199
7	0.850163
8	0.954327

vector of names of grouping variables

groupcols(gx2)

2-element Vector{Symbol}:
 :id
 :id2

and non-grouping variables

valuecols(gx2)

1-element Vector{Symbol}:
 :v

group indices in parent(gx2)

groupindices(gx2)

8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4

kgx2 = keys(gx2)

4-element DataFrames.GroupKeys{DataFrames.GroupedDataFrame{DataFrames.DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame as you would into a vector or a dictionary. The dictionary-style form accepts a GroupKey, a NamedTuple, or a Tuple.

gx2

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.847192
2	1	1	0.401421

⋮

Last Group (2 rows): id = 4, id2 = 2

Row	id	id2	v
	Int64	Int64	Float64
1	4	2	0.850163
2	4	2	0.954327

k = keys(gx2)[1]

GroupKey: (id = 1, id2 = 1)

ntk = NamedTuple(k)

(id = 1, id2 = 1)

tk = Tuple(k)

(1, 1)

the operations below produce the same result and are performant

gx2[1], gx2[k], gx2[ntk], gx2[tk]

(2×3 SubDataFrame
 Row │ id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.847192
   2 │     1      1  0.401421, 2×3 SubDataFrame
 Row │ id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.847192
   2 │     1      1  0.401421, 2×3 SubDataFrame
 Row │ id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.847192
   2 │     1      1  0.401421, 2×3 SubDataFrame
 Row │ id     id2    v        
     │ Int64  Int64  Float64  
─────┼────────────────────────
   1 │     1      1  0.847192
   2 │     1      1  0.401421)
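Dictionary-style access also makes it natural to test for a group before indexing and to iterate over key/group pairs. The following is a minimal sketch on hypothetical toy data; it assumes only haskey and pairs as defined by DataFrames.jl for GroupedDataFrame.

```julia
using DataFrames

# Hypothetical toy data with two grouping columns.
df = DataFrame(id=[1, 2, 1, 2], id2=[1, 2, 1, 2], v=[10.0, 20.0, 30.0, 40.0])
gdf = groupby(df, [:id, :id2])

# Lookup by Tuple and by NamedTuple selects the same group.
g_tuple = gdf[(2, 2)]
g_named = gdf[(id=2, id2=2)]

# haskey checks whether a group exists, so indexing with an absent key can be avoided.
has_22 = haskey(gdf, (2, 2))
has_12 = haskey(gdf, (1, 2))

# pairs(gdf) iterates over GroupKey => SubDataFrame pairs.
sums = Dict(NamedTuple(k) => sum(g.v) for (k, g) in pairs(gdf))
```

Converting each GroupKey to a NamedTuple gives ordinary dictionary keys that do not hold a reference to the grouped data frame.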

handling missing values

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)

5×2 DataFrame

Row	id	x
	Int64?	Int64
1	missing	1
2	5	2
3	1	3
4	3	4
5	missing	5

by default groups include missing values and their order is not guaranteed

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (1 row): id = 1

Row	id	x
	Int64?	Int64
1	1	3

⋮

Last Group (2 rows): id = missing

Row	id	x
	Int64?	Int64
1	missing	1
2	missing	5

but we can change it; now they are sorted

groupby(x, :id, sort=true, skipmissing=true)

GroupedDataFrame with 3 groups based on key: id

First Group (1 row): id = 1

Row	id	x
	Int64?	Int64
1	1	3

⋮

Last Group (1 row): id = 5

Row	id	x
	Int64?	Int64
1	5	2

and now they are in the order they appear in the source data frame

groupby(x, :id, sort=false)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = missing

Row	id	x
	Int64?	Int64
1	missing	1
2	missing	5

⋮

Last Group (1 row): id = 3

Row	id	x
	Int64?	Int64
1	3	4
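The effect of sort and skipmissing can also be inspected programmatically rather than by reading the printed groups. This sketch rebuilds the same small data frame under the name df and collects the group keys and the per-row group indices.

```julia
using DataFrames

df = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)

# Sorted groups with the missing key dropped.
g_sorted = groupby(df, :id, sort=true, skipmissing=true)
sorted_keys = [k.id for k in keys(g_sorted)]

# Rows that belong to no group (here the rows with a missing id) get index missing.
row_groups = groupindices(g_sorted)

# sort=false keeps groups in order of first appearance, including the missing group.
g_appear = groupby(df, :id, sort=false)
appear_keys = [k.id for k in keys(g_appear)]
```

Comparisons involving missing should use isequal, for example isequal(appear_keys, [missing, 5, 1, 3]), because == propagates missing.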

Performing transformations#

by group using combine, select, select!, transform, and transform!

using Statistics
using Chain

x = DataFrame(id=rand('a':'d', 100), v=rand(100))

100×2 DataFrame

75 rows omitted

Row	id	v
	Char	Float64
1	a	0.0701664
2	a	0.0182633
3	c	0.358291
4	a	0.529623
5	a	0.402455
6	a	0.333796
7	a	0.97962
8	b	0.637092
9	a	0.473495
10	a	0.77666
11	d	0.106041
12	d	0.381774
13	c	0.0512169
⋮	⋮	⋮
89	a	0.559751
90	a	0.591965
91	a	0.616339
92	c	0.836743
93	c	0.300096
94	d	0.751398
95	a	0.746588
96	a	0.379123
97	c	0.476596
98	d	0.390383
99	d	0.447578
100	a	0.0325232

apply a function to each group of a data frame; combine keeps as many rows as are returned from the function

@chain x begin
    groupby(:id)
    combine(:v => mean)
end

4×2 DataFrame

Row	id	v_mean
	Char	Float64
1	a	0.441277
2	c	0.448135
3	b	0.530122
4	d	0.439243

x.id2 = axes(x, 1)

Base.OneTo(100)

select and transform keep as many rows as there are in the source data frame, in their original order. Additionally, transform keeps all columns from the source.

@chain x begin
    groupby(:id)
    transform(:v => mean)
end

100×4 DataFrame

75 rows omitted

Row	id	v	id2	v_mean
	Char	Float64	Int64	Float64
1	a	0.0701664	1	0.441277
2	a	0.0182633	2	0.441277
3	c	0.358291	3	0.448135
4	a	0.529623	4	0.441277
5	a	0.402455	5	0.441277
6	a	0.333796	6	0.441277
7	a	0.97962	7	0.441277
8	b	0.637092	8	0.530122
9	a	0.473495	9	0.441277
10	a	0.77666	10	0.441277
11	d	0.106041	11	0.439243
12	d	0.381774	12	0.439243
13	c	0.0512169	13	0.448135
⋮	⋮	⋮	⋮	⋮
89	a	0.559751	89	0.441277
90	a	0.591965	90	0.441277
91	a	0.616339	91	0.441277
92	c	0.836743	92	0.448135
93	c	0.300096	93	0.448135
94	d	0.751398	94	0.439243
95	a	0.746588	95	0.441277
96	a	0.379123	96	0.441277
97	c	0.476596	97	0.448135
98	d	0.390383	98	0.439243
99	d	0.447578	99	0.439243
100	a	0.0325232	100	0.441277
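The row and column behaviour of the three functions can be compared side by side. The following sketch uses a hypothetical five-row data frame and plain function calls instead of @chain, so it depends only on DataFrames and Statistics.

```julia
using DataFrames, Statistics

# Hypothetical toy data: group 'a' has mean 4.0, group 'b' has mean 3.0.
df = DataFrame(id=['a', 'b', 'a', 'b', 'a'], v=[1.0, 2.0, 3.0, 4.0, 8.0])
gdf = groupby(df, :id)

c = combine(gdf, :v => mean)    # one row per group
s = select(gdf, :v => mean)     # one row per source row: grouping column plus the new column
t = transform(gdf, :v => mean)  # one row per source row: all source columns plus the new column
```

In s and t the group mean is repeated for every row of the group, and the rows stay in the order of df.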

note that combine orders the resulting rows by group of the GroupedDataFrame

@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end

100×3 DataFrame

75 rows omitted

Row	id	id2	v_mean
	Char	Int64	Float64
1	a	1	0.441277
2	a	2	0.441277
3	a	4	0.441277
4	a	5	0.441277
5	a	6	0.441277
6	a	7	0.441277
7	a	9	0.441277
8	a	10	0.441277
9	a	16	0.441277
10	a	20	0.441277
11	a	25	0.441277
12	a	30	0.441277
13	a	32	0.441277
⋮	⋮	⋮	⋮
89	d	60	0.439243
90	d	68	0.439243
91	d	70	0.439243
92	d	73	0.439243
93	d	76	0.439243
94	d	78	0.439243
95	d	80	0.439243
96	d	82	0.439243
97	d	86	0.439243
98	d	94	0.439243
99	d	98	0.439243
100	d	99	0.439243

we give a custom name for the result column

@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end

4×2 DataFrame

Row	id	res
	Char	Float64
1	a	0.441277
2	c	0.448135
3	b	0.530122
4	d	0.439243

you can have multiple operations

@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end

4×4 DataFrame

Row	id	res1	res2	n
	Char	Float64	Float64	Int64
1	a	0.441277	12.797	29
2	c	0.448135	8.96269	20
3	b	0.530122	12.7229	24
4	d	0.439243	11.8596	27

Additional notes:

select! and transform! perform the operations in place
the general syntax for a transformation is source_columns => function => target_column
if you pass multiple columns to a function, they are passed as positional arguments
ByRow and AsTable work exactly as discussed for operations on data frames in 05_columns.ipynb
you can keep the result of combine, select, etc. grouped by passing the ungroup=false keyword argument
similarly, the keepkeys keyword argument lets you drop the grouping columns from the resulting data frame
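Several of the notes above can be illustrated together. The sketch below uses hypothetical toy data and hypothetical target column names (dot, a_plus_b, a_mean) that do not come from the original notebook.

```julia
using DataFrames, Statistics

# Hypothetical toy data with two value columns.
df = DataFrame(id=[1, 1, 2, 2], a=[1.0, 2.0, 3.0, 4.0], b=[2.0, 4.0, 6.0, 9.0])
gdf = groupby(df, :id)

# Two source columns are passed to the function as two positional arguments.
r1 = combine(gdf, [:a, :b] => ((a, b) -> sum(a .* b)) => :dot)

# ByRow applies the function to each row instead of to whole group columns.
r2 = transform(gdf, [:a, :b] => ByRow(+) => :a_plus_b)

# ungroup=false returns a GroupedDataFrame grouped by the same keys.
r3 = combine(gdf, :a => mean => :a_mean, ungroup=false)

# keepkeys=false drops the grouping column from the result.
r4 = combine(gdf, :a => mean => :a_mean, keepkeys=false)
```

The anonymous function in r1 needs its own parentheses so that the => operators are parsed as source => function => target.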

You can also pass a function to all of these functions (as a special case, it may be passed as the first argument). In this case the return value can be a table. In particular, returning an empty table from the function is an easy way to drop a group.

If you pass a function as the first argument, you can use the do block syntax. A function passed this way receives a SubDataFrame as its argument.

Here is an example:

combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) ## drop groups with low number of rows
end

2×2 DataFrame

Row	id	n
	Char	Int64
1	a	29
2	d	27
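The do block is only syntax for passing the function as the first argument, and the function does not have to return a data frame. As a minimal sketch on hypothetical toy data, the function below returns a NamedTuple, which yields one row with several columns per group.

```julia
using DataFrames

# Hypothetical toy data.
df = DataFrame(id=[1, 1, 1, 2, 3, 3], v=[1, 2, 3, 4, 5, 6])
gdf = groupby(df, :id)

# The function is the first argument and receives each group as a SubDataFrame.
stats = combine(sdf -> (n=nrow(sdf), total=sum(sdf.v)), gdf)

# The same computation written with the do block syntax.
stats_do = combine(gdf) do sdf
    (n=nrow(sdf), total=sum(sdf.v))
end
```

Both calls give a data frame with the columns id, n and total.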

You can also produce multiple columns in a single operation:

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])

4×2 DataFrame

Row	id	val
	Int64	Int64
1	1	1
2	1	2
3	2	3
4	2	4

@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end

2×3 DataFrame

Row	id	x1	x2
	Int64	Int64	Int64
1	1	1	2
2	2	3	4

@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end

2×3 DataFrame

Row	id	c1	c2
	Int64	Int64	Int64
1	1	1	2
2	2	3	4
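When the function returns a NamedTuple, the AsTable target takes the column names from its fields instead of generating x1, x2. A small sketch, where the field names lo and hi are hypothetical:

```julia
using DataFrames

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])

# The NamedTuple fields become the names of the output columns.
res = combine(groupby(df, :id),
              :val => (x -> (lo=minimum(x), hi=maximum(x))) => AsTable)
```

This avoids having to keep a separate vector of target names in sync with the order of the returned values.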

It is easy to unnest a column into multiple columns:

df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])
select(df, :a => AsTable)

2×2 DataFrame

Row	p	q
	Int64	Int64
1	1	2
2	3	4

automatic column names generated

df = DataFrame(a=[[1, 2], [3, 4]])
select(df, :a => AsTable)

2×2 DataFrame

Row	x1	x2
	Int64	Int64
1	1	2
2	3	4

custom column names generated

select(df, :a => [:C1, :C2])

2×2 DataFrame

Row	C1	C2
	Int64	Int64
1	1	2
2	3	4

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)

100×3 DataFrame

75 rows omitted

Row	id	x1	x2
	Int64	Int64	Int64
1	1	1	101
2	2	2	102
3	3	3	103
4	4	4	104
5	5	5	105
6	6	6	106
7	7	7	107
8	8	8	108
9	9	9	109
10	10	10	110
11	1	11	111
12	2	12	112
13	3	13	113
⋮	⋮	⋮	⋮
89	9	89	189
90	10	90	190
91	1	91	191
92	2	92	192
93	3	93	193
94	4	94	194
95	5	95	195
96	6	96	196
97	7	97	197
98	8	98	198
99	9	99	199
100	10	100	200

@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end

10×3 DataFrame

Row	id	x1_minimum	x2_minimum
	Int64	Int64	Int64
1	1	1	101
2	2	2	102
3	3	3	103
4	4	4	104
5	5	5	105
6	6	6	106
7	7	7	107
8	8	8	108
9	9	9	109
10	10	10	110

@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end

10×5 DataFrame

Row	id	x1_minimum	x2_minimum	x1_maximum	x2_maximum
	Int64	Int64	Int64	Int64	Int64
1	1	1	101	91	191
2	2	2	102	92	192
3	3	3	103	93	193
4	4	4	104	94	194
5	5	5	105	95	195
6	6	6	106	96	196
7	7	7	107	97	197
8	8	8	108	98	198
9	9	9	109	99	199
10	10	10	110	100	200
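The broadcasted form works because .=> is ordinary Julia broadcasting over Pair: a column vector of names combined with a row vector of functions gives a matrix of pairs, and combine accepts that collection. The sketch below inspects the intermediate object on hypothetical toy data; the target names lo1 and lo2 are assumptions made for illustration.

```julia
using DataFrames

# Hypothetical toy data: ids alternate 1, 2, 1, 2, 1, 2.
df = DataFrame(id=repeat(1:2, 3), x1=1:6, x2=11:16)
gdf = groupby(df, :id)

# A 2-element vector broadcast against a 1×2 matrix gives a 2×2 matrix of pairs.
ops = [:x1, :x2] .=> [minimum maximum]
res = combine(gdf, ops)

# Broadcasting can also supply custom target names, one per source column.
named = combine(gdf, [:x1, :x2] .=> minimum .=> [:lo1, :lo2])
```

The result columns follow the column-major order of the matrix, which is why both minimum columns come before both maximum columns.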

Aggregation of a data frame using mapcols#

x = DataFrame(rand(10, 10), :auto)

10×10 DataFrame

Row	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	0.581286	0.500131	0.373493	0.547011	0.215926	0.692787	0.19166	0.351691	0.198653	0.215892
2	0.898161	0.229191	0.0246817	0.622954	0.667477	0.939399	0.874242	0.328804	0.889476	0.00458233
3	0.240405	0.591895	0.580345	0.972804	0.930164	0.559582	0.324742	0.850061	0.366894	0.236819
4	0.105653	0.824439	0.792041	0.623864	0.735757	0.902621	0.563483	0.432771	0.11422	0.839069
5	0.109915	0.251032	0.944131	0.894329	0.164685	0.750856	0.569574	0.0465495	0.751014	0.522555
6	0.379932	0.661907	0.813877	0.674267	0.604413	0.0321741	0.828109	0.97111	0.0392787	0.402899
7	0.625208	0.298135	0.833125	0.862858	0.0268948	0.664366	0.473513	0.163117	0.330685	0.569004
8	0.676198	0.197421	0.147104	0.629717	0.242218	0.660885	0.635791	0.258808	0.0144246	0.764473
9	0.424408	0.248768	0.73709	0.315325	0.44384	0.429459	0.578695	0.844353	0.729773	0.00955439
10	0.226149	0.521039	0.172276	0.687927	0.592718	0.421227	0.230397	0.76907	0.244568	0.298045

mapcols(mean, x)

1×10 DataFrame

Row	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	0.426732	0.432396	0.541816	0.683106	0.462409	0.605336	0.527021	0.501633	0.367899	0.386289
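mapcols is not limited to aggregation: the function may return either a scalar or a vector for each column. A minimal sketch on hypothetical toy data:

```julia
using DataFrames, Statistics

# Hypothetical toy data.
df = DataFrame(a=[1.0, 2.0, 3.0], b=[10.0, 20.0, 60.0])

# A scalar-valued function gives a data frame with one row.
means = mapcols(mean, df)

# A vector-valued function gives one transformed column per source column.
centered = mapcols(col -> col .- mean(col), df)
```

In both cases the column names of the source data frame are preserved.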

Mapping rows and columns using eachcol and eachrow#

map a function over each column and return a vector

map(mean, eachcol(x))

10-element Vector{Float64}:
 0.4267316147456429
 0.4323957582948384
 0.5418162034323375
 0.6831056523551486
 0.46240935500358804
 0.6053356095651719
 0.5270207063246461
 0.5016333397790869
 0.3678985308320364
 0.3862893148013652

an iteration returns a Pair with column name and values

foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))

x1: 0.4267316147456429
x2: 0.4323957582948384
x3: 0.5418162034323375
x4: 0.6831056523551486
x5: 0.46240935500358804
x6: 0.6053356095651719
x7: 0.5270207063246461
x8: 0.5016333397790869
x9: 0.3678985308320364
x10: 0.3862893148013652
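Instead of printing, the name/column pairs can be collected into a dictionary. The following is a small sketch on hypothetical toy data:

```julia
using DataFrames, Statistics

# Hypothetical toy data.
df = DataFrame(a=[1.0, 2.0, 3.0], b=[10.0, 20.0, 60.0])

# Each pair is column_name => column_vector, and the names are Symbols.
col_means = Dict(name => mean(col) for (name, col) in pairs(eachcol(df)))
```

Destructuring the pair as (name, col) reads more clearly than indexing with c[1] and c[2].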

now each element is a DataFrameRow, which works like a NamedTuple but is a view into the parent DataFrame

map(r -> r.x1 / r.x2, eachrow(x))

10-element Vector{Float64}:
 1.1622681786852576
 3.9188323070256126
 0.40616089739099004
 0.12815152393556184
 0.43785232107549216
 0.5739959426386916
 2.0970673785796814
 3.4251564265695804
 1.7060360415449167
 0.43403506944925896
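Because a DataFrameRow is a view, assigning to one of its fields writes through to the parent data frame, whereas converting it to a NamedTuple gives an independent copy. A minimal sketch on hypothetical toy data:

```julia
using DataFrames

# Hypothetical toy data.
df = DataFrame(x1=[1.0, 2.0], x2=[4.0, 8.0])

# Row-wise computation, as in the example above.
ratios = map(r -> r.x1 / r.x2, eachrow(df))

row = first(eachrow(df))      # a DataFrameRow viewing the first row of df
snapshot = NamedTuple(row)    # an independent copy of the current values

row.x1 = 100.0                # modifies df itself, not the snapshot
```

After the assignment df.x1[1] holds the new value, while snapshot.x1 still holds the old one.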

it prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)
er.x1 ## you can access columns of a parent data frame directly

10-element Vector{Float64}:
 0.5812864438977917
 0.8981614676806798
 0.24040469126869135
 0.10565305018063365
 0.1099149189485783
 0.37993167234429415
 0.625208287087893
 0.6761984301946938
 0.4244080056653826
 0.2261491801877904

it prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)

10×10 DataFrameColumns

Row	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	0.581286	0.500131	0.373493	0.547011	0.215926	0.692787	0.19166	0.351691	0.198653	0.215892
2	0.898161	0.229191	0.0246817	0.622954	0.667477	0.939399	0.874242	0.328804	0.889476	0.00458233
3	0.240405	0.591895	0.580345	0.972804	0.930164	0.559582	0.324742	0.850061	0.366894	0.236819
4	0.105653	0.824439	0.792041	0.623864	0.735757	0.902621	0.563483	0.432771	0.11422	0.839069
5	0.109915	0.251032	0.944131	0.894329	0.164685	0.750856	0.569574	0.0465495	0.751014	0.522555
6	0.379932	0.661907	0.813877	0.674267	0.604413	0.0321741	0.828109	0.97111	0.0392787	0.402899
7	0.625208	0.298135	0.833125	0.862858	0.0268948	0.664366	0.473513	0.163117	0.330685	0.569004
8	0.676198	0.197421	0.147104	0.629717	0.242218	0.660885	0.635791	0.258808	0.0144246	0.764473
9	0.424408	0.248768	0.73709	0.315325	0.44384	0.429459	0.578695	0.844353	0.729773	0.00955439
10	0.226149	0.521039	0.172276	0.687927	0.592718	0.421227	0.230397	0.76907	0.244568	0.298045

you can access columns of a parent data frame directly

ec.x1

10-element Vector{Float64}:
 0.5812864438977917
 0.8981614676806798
 0.24040469126869135
 0.10565305018063365
 0.1099149189485783
 0.37993167234429415
 0.625208287087893
 0.6761984301946938
 0.4244080056653826
 0.2261491801877904

Transposing#

you can transpose a data frame using permutedims:

df = DataFrame(reshape(1:12, 3, 4), :auto)

3×4 DataFrame

Row	x1	x2	x3	x4
	Int64	Int64	Int64	Int64
1	1	4	7	10
2	2	5	8	11
3	3	6	9	12

df.names = ["a", "b", "c"]

3-element Vector{String}:
 "a"
 "b"
 "c"

permutedims(df, :names)

4×4 DataFrame

Row	names	a	b	c
	String	Int64	Int64	Int64
1	x1	1	2	3
2	x2	4	5	6
3	x3	7	8	9
4	x4	10	11	12
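The column passed to permutedims supplies the new column names, and the old column names become the values of the first column. As a small sketch on hypothetical toy data, applying the same call twice restores the original layout.

```julia
using DataFrames

# Hypothetical toy data; the names column holds the future column names.
df = DataFrame(names=["a", "b"], x1=[1, 2], x2=[3, 4])

# Values of :names become column names; the old column names fill the first column.
t = permutedims(df, :names)

# Transposing again on the same column restores the original layout.
back = permutedims(t, :names)
```

The round trip is lossless here because all value columns share one element type; with mixed column types the transposed columns would be promoted to a common type.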

This notebook was generated using Literate.jl.
