Transformation to DataFrames

Split-apply-combine

using DataFrames

Grouping a data frame#

x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=rand(8))

8×3 DataFrame

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.0298167
2	2	2	0.890129
3	3	1	0.608735
4	4	2	0.648561
5	1	1	0.187759
6	2	2	0.510651
7	3	1	0.451765
8	4	2	0.847968

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = 1

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.0298167
2	1	1	0.187759

⋮

Last Group (2 rows): id = 4

Row	id	id2	v
	Int64	Int64	Float64
1	4	2	0.648561
2	4	2	0.847968

groupby(x, [])

GroupedDataFrame with 1 group based on key:

First Group (8 rows):

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.0298167
2	2	2	0.890129
3	3	1	0.608735
4	4	2	0.648561
5	1	1	0.187759
6	2	2	0.510651
7	3	1	0.451765
8	4	2	0.847968

gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.0298167
2	1	1	0.187759

⋮

Last Group (2 rows): id = 4, id2 = 2

Row	id	id2	v
	Int64	Int64	Float64
1	4	2	0.648561
2	4	2	0.847968

get the parent DataFrame

parent(gx2)

8×3 DataFrame

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.0298167
2	2	2	0.890129
3	3	1	0.608735
4	4	2	0.648561
5	1	1	0.187759
6	2	2	0.510651
7	3	1	0.451765
8	4	2	0.847968

convert back to a DataFrame; the rows are now ordered by group, not as in the original

vcat(gx2...)

8×3 DataFrame

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.0298167
2	1	1	0.187759
3	2	2	0.890129
4	2	2	0.510651
5	3	1	0.608735
6	3	1	0.451765
7	4	2	0.648561
8	4	2	0.847968

the same

DataFrame(gx2)

8×3 DataFrame

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.0298167
2	1	1	0.187759
3	2	2	0.890129
4	2	2	0.510651
5	3	1	0.608735
6	3	1	0.451765
7	4	2	0.648561
8	4	2	0.847968

drop grouping columns when creating a data frame

DataFrame(gx2, keepkeys=false)

8×1 DataFrame

Row	v
	Float64
1	0.0298167
2	0.187759
3	0.890129
4	0.510651
5	0.608735
6	0.451765
7	0.648561
8	0.847968
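The conversions above are easier to check on data without random values. The following is a minimal sketch; the four-row frame and the variable names are made up for illustration, and `sort=true` is passed only to make the group order predictable.

```julia
using DataFrames

# Small deterministic frame; sort=true fixes the group order to id = 1, 2.
df = DataFrame(id=[2, 1, 2, 1], v=[10, 20, 30, 40])
gd = groupby(df, :id, sort=true)

same_parent = parent(gd) === df              # the original object, not a copy
by_group = DataFrame(gd)                     # rows ordered by group
stacked = vcat(gd...)                        # same rows via splatting the groups
values_only = DataFrame(gd, keepkeys=false)  # grouping column dropped
```

Here `by_group` lists both rows of group 1 before the rows of group 2, while `parent(gd)` still has the rows in their source order.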

vector of names of grouping variables

groupcols(gx2)

2-element Vector{Symbol}:
 :id
 :id2

and non-grouping variables

valuecols(gx2)

1-element Vector{Symbol}:
 :v

group indices in parent(gx2)

groupindices(gx2)

8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4
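As a small sketch of how these accessors relate to each other, the example below uses a made-up deterministic frame (not the one from the text) and `sort=true` so that group numbers are predictable. It uses the entry of `groupindices` for a parent row to look up that row's group.

```julia
using DataFrames

df = DataFrame(id=[2, 1, 2, 1], v=[10, 20, 30, 40])
gd = groupby(df, :id, sort=true)   # group 1 holds id = 1, group 2 holds id = 2

gcols = groupcols(gd)    # grouping columns
vcols = valuecols(gd)    # all remaining columns
idx = groupindices(gd)   # group number of every row of parent(gd)

# The third parent row has id = 2, so it belongs to group 2.
third_row_group = gd[idx[3]]
```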

kgx2 = keys(gx2)

4-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame like into a vector or into a dictionary. The second form accepts a GroupKey, a NamedTuple or a Tuple.

gx2

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.0298167
2	1	1	0.187759

⋮

Last Group (2 rows): id = 4, id2 = 2

Row	id	id2	v
	Int64	Int64	Float64
1	4	2	0.648561
2	4	2	0.847968

k = keys(gx2)[1]

GroupKey: (id = 1, id2 = 1)

ntk = NamedTuple(k)

(id = 1, id2 = 1)

tk = Tuple(k)

(1, 1)

the operations below produce the same result and are fast

gx2[1]

2×3 SubDataFrame

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.0298167
2	1	1	0.187759

gx2[k]

2×3 SubDataFrame

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.0298167
2	1	1	0.187759

gx2[ntk]

2×3 SubDataFrame

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.0298167
2	1	1	0.187759

gx2[tk]

2×3 SubDataFrame

Row	id	id2	v
	Int64	Int64	Float64
1	1	1	0.0298167
2	1	1	0.187759
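The dictionary-like view also suggests checking for a group before indexing. The sketch below assumes a small made-up frame; `haskey` and `get` are used in their usual dictionary sense on the GroupedDataFrame, and the key `(3, 1)` is chosen only because no such group exists.

```julia
using DataFrames

df = DataFrame(id=[1, 2, 1, 2], id2=[1, 1, 2, 2], v=[10, 20, 30, 40])
gd = groupby(df, [:id, :id2], sort=true)

k = keys(gd)[1]              # GroupKey of the first group
by_position = gd[1]
by_key = gd[k]
by_namedtuple = gd[NamedTuple(k)]
by_tuple = gd[Tuple(k)]

has_group = haskey(gd, (2, 2))             # is there a group id = 2, id2 = 2?
absent_group = get(gd, (3, 1), nothing)    # fall back to a default if not
```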

handling missing values

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)

5×2 DataFrame

Row	id	x
	Int64?	Int64
1	missing	1
2	5	2
3	1	3
4	3	4
5	missing	5

by default groups include missing values and their order is not guaranteed

groupby(x, :id)

GroupedDataFrame with 4 groups based on key: id

First Group (1 row): id = 1

Row	id	x
	Int64?	Int64
1	1	3

⋮

Last Group (2 rows): id = missing

Row	id	x
	Int64?	Int64
1	missing	1
2	missing	5

but we can change it; now the groups are sorted and missing values are skipped

groupby(x, :id, sort=true, skipmissing=true)

GroupedDataFrame with 3 groups based on key: id

First Group (1 row): id = 1

Row	id	x
	Int64?	Int64
1	1	3

⋮

Last Group (1 row): id = 5

Row	id	x
	Int64?	Int64
1	5	2

and now they are in the order they appear in the source data frame

groupby(x, :id, sort=false)

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = missing

Row	id	x
	Int64?	Int64
1	missing	1
2	missing	5

⋮

Last Group (1 row): id = 3

Row	id	x
	Int64?	Int64
1	3	4
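To compare the three calls side by side, the group keys can be collected into plain vectors. This is a sketch on the same `id` values as above; the helper name `key_ids` is hypothetical.

```julia
using DataFrames

x = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)

# Hypothetical helper: the value of the :id key for every group, in group order.
key_ids(gd) = [k.id for k in keys(gd)]

n_default = length(groupby(x, :id))                               # missing forms a group
sorted_ids = key_ids(groupby(x, :id, sort=true, skipmissing=true))
source_ids = key_ids(groupby(x, :id, sort=false))                 # order of appearance
```

With `sort=true, skipmissing=true` the keys come out as 1, 3, 5; with `sort=false` they follow the source order, starting with `missing`.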

Performing transformations#

by group using combine, select, select!, transform, and transform!

using Statistics
using Chain

reduce the number of rows in the output

ENV["LINES"] = 15

x = DataFrame(id=rand('a':'d', 100), v=rand(100))

100×2 DataFrame

75 rows omitted

Row	id	v
	Char	Float64
1	b	0.0947633
2	b	0.444125
3	c	0.587247
4	b	0.330791
5	b	0.31695
6	b	7.83911e-5
7	d	0.728483
8	b	0.301966
9	d	0.577585
10	b	0.360327
11	c	0.803869
12	c	0.836187
13	c	0.633544
⋮	⋮	⋮
89	c	0.234815
90	a	0.0572505
91	a	0.421841
92	a	0.135045
93	a	0.702993
94	d	0.788392
95	a	0.759564
96	c	0.445145
97	c	0.756522
98	b	0.413141
99	b	0.617292
100	d	0.0817082

apply a function to each group of a data frame; combine keeps as many rows as are returned from the function

@chain x begin
    groupby(:id)
    combine(:v => mean)
end

4×2 DataFrame

Row	id	v_mean
	Char	Float64
1	b	0.375926
2	c	0.538606
3	d	0.60607
4	a	0.484241

x.id2 = axes(x, 1)

Base.OneTo(100)

select and transform keep as many rows as are in the source data frame and in the same order; additionally, transform keeps all columns from the source

@chain x begin
    groupby(:id)
    transform(:v => mean)
end

100×4 DataFrame

75 rows omitted

Row	id	v	id2	v_mean
	Char	Float64	Int64	Float64
1	b	0.0947633	1	0.375926
2	b	0.444125	2	0.375926
3	c	0.587247	3	0.538606
4	b	0.330791	4	0.375926
5	b	0.31695	5	0.375926
6	b	7.83911e-5	6	0.375926
7	d	0.728483	7	0.60607
8	b	0.301966	8	0.375926
9	d	0.577585	9	0.60607
10	b	0.360327	10	0.375926
11	c	0.803869	11	0.538606
12	c	0.836187	12	0.538606
13	c	0.633544	13	0.538606
⋮	⋮	⋮	⋮	⋮
89	c	0.234815	89	0.538606
90	a	0.0572505	90	0.484241
91	a	0.421841	91	0.484241
92	a	0.135045	92	0.484241
93	a	0.702993	93	0.484241
94	d	0.788392	94	0.60607
95	a	0.759564	95	0.484241
96	c	0.445145	96	0.538606
97	c	0.756522	97	0.538606
98	b	0.413141	98	0.375926
99	b	0.617292	99	0.375926
100	d	0.0817082	100	0.60607

note that combine orders rows by group of the GroupedDataFrame

@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end

100×3 DataFrame

75 rows omitted

Row	id	id2	v_mean
	Char	Int64	Float64
1	b	1	0.375926
2	b	2	0.375926
3	b	4	0.375926
4	b	5	0.375926
5	b	6	0.375926
6	b	8	0.375926
7	b	10	0.375926
8	b	17	0.375926
9	b	22	0.375926
10	b	23	0.375926
11	b	25	0.375926
12	b	26	0.375926
13	b	27	0.375926
⋮	⋮	⋮	⋮
89	a	74	0.484241
90	a	75	0.484241
91	a	77	0.484241
92	a	78	0.484241
93	a	79	0.484241
94	a	80	0.484241
95	a	87	0.484241
96	a	90	0.484241
97	a	91	0.484241
98	a	92	0.484241
99	a	93	0.484241
100	a	95	0.484241
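The differences between combine, select and transform are easier to see on a tiny frame. The sketch below uses made-up data and `sort=true` so that the group order ('a' before 'b') is predictable.

```julia
using DataFrames
using Statistics

df = DataFrame(id=['b', 'a', 'b', 'a'], v=[1.0, 2.0, 3.0, 4.0])
gd = groupby(df, :id, sort=true)

reduced = combine(gd, :v => mean)         # one row per group
kept = transform(gd, :v => mean)          # 4 rows, source order, all columns
selected = select(gd, :v => mean)         # 4 rows, source order, keys and result only
regrouped = combine(gd, :v, :v => mean)   # 4 rows, ordered by group
```

In `kept` the `v` column stays as 1.0, 2.0, 3.0, 4.0, while in `regrouped` it becomes 2.0, 4.0, 1.0, 3.0 because the rows of group 'a' come first.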

we give a custom name for the result column

@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end

4×2 DataFrame

Row	id	res
	Char	Float64
1	b	0.375926
2	c	0.538606
3	d	0.60607
4	a	0.484241

you can have multiple operations

@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end

4×4 DataFrame

Row	id	res1	res2	n
	Char	Float64	Float64	Int64
1	b	0.375926	10.5259	28
2	c	0.538606	14.0037	26
3	d	0.60607	11.5153	19
4	a	0.484241	13.0745	27

Additional notes:

select! and transform! perform operations in-place
The general syntax for a transformation is source_columns => function => target_column
If you pass multiple columns to a function, they are treated as positional arguments
ByRow and AsTable work exactly as discussed for operations on data frames in 05_columns.ipynb
You can keep the result of combine, select, etc. grouped by passing the ungroup=false keyword argument to them
Similarly, the keepkeys keyword argument allows you to drop grouping columns from the resulting data frame
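The notes above can be illustrated with one short sketch. The frame and the column names `dot`, `ab` and `total` are made up for this example.

```julia
using DataFrames

df = DataFrame(g=[1, 1, 2, 2], a=[1, 2, 3, 4], b=[10, 20, 30, 40])
gd = groupby(df, :g, sort=true)

# Two source columns arrive as two positional arguments.
dots = combine(gd, [:a, :b] => ((a, b) -> sum(a .* b)) => :dot)

# ByRow applies the function to each row instead of to whole columns.
rowwise = transform(gd, [:a, :b] => ByRow(+) => :ab)

# ungroup=false returns a GroupedDataFrame instead of a DataFrame.
still_grouped = combine(gd, :a => sum => :total, ungroup=false)

# keepkeys=false drops the grouping column from the result.
no_keys = combine(gd, :a => sum => :total, keepkeys=false)
```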

You can also pass a function to all of these functions (as a special case, it can be the first argument). In this case the return value can be a table. In particular, this makes it easy to drop groups: return an empty table from the function.

If you pass a function, you can use the do block syntax. The function receives a SubDataFrame as its argument.

Here is an example:

combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) # drop groups with fewer than 25 rows
end

3×2 DataFrame

Row	id	n
	Char	Int64
1	b	28
2	c	26
3	a	27
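Because the group sizes above depend on random data, here is the same pattern as a sketch on a fixed frame. The threshold `min_rows` and the column names are hypothetical.

```julia
using DataFrames

df = DataFrame(g=[1, 1, 1, 2, 3, 3], v=1:6)
min_rows = 2   # hypothetical threshold: groups with fewer rows are dropped

kept = combine(groupby(df, :g, sort=true)) do sdf
    n = nrow(sdf)
    # An empty DataFrame contributes no rows, so the group disappears.
    n < min_rows ? DataFrame() : DataFrame(n=n, total=sum(sdf.v))
end
```

Group 2 has a single row, so only groups 1 and 3 remain in `kept`.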

You can also produce multiple columns in a single operation, e.g.:

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])

4×2 DataFrame

Row	id	val
	Int64	Int64
1	1	1
2	1	2
3	2	3
4	2	4

@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end

2×3 DataFrame

Row	id	x1	x2
	Int64	Int64	Int64
1	1	1	2
2	2	3	4

@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end

2×3 DataFrame

Row	id	c1	c2
	Int64	Int64	Int64
1	1	1	2
2	2	3	4

It is easy to unnest a column into multiple columns, e.g.

df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])

2×1 DataFrame

Row	a
	NamedTup…
1	(p = 1, q = 2)
2	(p = 3, q = 4)

select(df, :a => AsTable)

2×2 DataFrame

Row	p	q
	Int64	Int64
1	1	2
2	3	4

df = DataFrame(a=[[1, 2], [3, 4]])

2×1 DataFrame

Row	a
	Array…
1	[1, 2]
2	[3, 4]

automatic column names generated

select(df, :a => AsTable)

2×2 DataFrame

Row	x1	x2
	Int64	Int64
1	1	2
2	3	4

custom column names generated

select(df, :a => [:C1, :C2])

2×2 DataFrame

Row	C1	C2
	Int64	Int64
1	1	2
2	3	4
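The same mechanism can create several columns from a computation rather than from an already nested column. The sketch below assumes a made-up column `x` and splits each value into hypothetical `whole` and `frac` parts, using ByRow with a NamedTuple result.

```julia
using DataFrames

df = DataFrame(x=[1.5, 2.25])

# ByRow returns one NamedTuple per row; AsTable turns its fields into columns.
parts = transform(df, :x => ByRow(v -> (whole=floor(Int, v), frac=v - floor(v))) => AsTable)
```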

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)

100×3 DataFrame

75 rows omitted

Row	id	x1	x2
	Int64	Int64	Int64
1	1	1	101
2	2	2	102
3	3	3	103
4	4	4	104
5	5	5	105
6	6	6	106
7	7	7	107
8	8	8	108
9	9	9	109
10	10	10	110
11	1	11	111
12	2	12	112
13	3	13	113
⋮	⋮	⋮	⋮
89	9	89	189
90	10	90	190
91	1	91	191
92	2	92	192
93	3	93	193
94	4	94	194
95	5	95	195
96	6	96	196
97	7	97	197
98	8	98	198
99	9	99	199
100	10	100	200

@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end

10×3 DataFrame

Row	id	x1_minimum	x2_minimum
	Int64	Int64	Int64
1	1	1	101
2	2	2	102
3	3	3	103
4	4	4	104
5	5	5	105
6	6	6	106
7	7	7	107
8	8	8	108
9	9	9	109
10	10	10	110

@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end

10×5 DataFrame

Row	id	x1_minimum	x2_minimum	x1_maximum	x2_maximum
	Int64	Int64	Int64	Int64	Int64
1	1	1	101	91	191
2	2	2	102	92	192
3	3	3	103	93	193
4	4	4	104	94	194
5	5	5	105	95	195
6	6	6	106	96	196
7	7	7	107	97	197
8	8	8	108	98	198
9	9	9	109	99	199
10	10	10	110	100	200
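It can help to look at what the broadcast produces before it is passed to combine. In this sketch (made-up frame, hypothetical target names `lo1` and `lo2`) a column vector of names is broadcast against a row of functions, giving a 2×2 matrix of Pairs that is consumed column by column.

```julia
using DataFrames

df = DataFrame(id=[1, 1, 2, 2], x1=[1, 2, 3, 4], x2=[5, 6, 7, 8])
gd = groupby(df, :id, sort=true)

ops = [:x1, :x2] .=> [minimum maximum]   # 2×2 Matrix of Pairs
wide = combine(gd, ops)

# Broadcasting a second => assigns custom target names.
renamed = combine(gd, [:x1, :x2] .=> minimum .=> [:lo1, :lo2])
```

The order of the result columns follows the column-major order of `ops`: both minimum columns first, then both maximum columns.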

Aggregation of a data frame using mapcols#

x = DataFrame(rand(10, 10), :auto)

10×10 DataFrame

Row	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	0.239413	0.372632	0.391808	0.881766	0.926426	0.172987	0.886067	0.9445	0.99134	0.567602
2	0.127332	0.523684	0.280744	0.631459	0.100009	0.414213	0.422549	0.463721	0.0210759	0.569317
3	0.910273	0.749751	0.665598	0.661469	0.656725	0.294394	0.485097	0.411687	0.288302	0.633039
4	0.753325	0.642535	0.221339	0.333083	0.783431	0.274499	0.402339	0.698189	0.719778	0.198621
5	0.0387046	0.36916	0.717133	0.38962	0.658046	0.57561	0.0361161	0.18453	0.6539	0.490985
6	0.939608	0.232813	0.632598	0.625016	0.988554	0.364242	0.384142	0.837759	0.775318	0.136857
7	0.98604	0.646358	0.676885	0.661943	0.798212	0.60685	0.783194	0.72725	0.445532	0.311484
8	0.877028	0.452434	0.654184	0.592795	0.160648	0.882515	0.363015	0.852214	0.762508	0.863806
9	0.0306793	0.261351	0.292775	0.339821	0.143583	0.473889	0.229176	0.0522024	0.763665	0.277183
10	0.8066	0.818327	0.600803	0.634267	0.5311	0.610416	0.847606	0.703434	0.613901	0.974852

mapcols(mean, x)

1×10 DataFrame

Row	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	0.5709	0.506904	0.513387	0.575124	0.574673	0.466962	0.48393	0.587549	0.603532	0.502374
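As a sketch on fixed data: mapcols applies the function to every column and collects the results into a new data frame, so a reducing function gives one row while an element-wise function keeps the row count. The frame below is made up.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3], b=[4, 5, 6])

sums = mapcols(sum, df)                    # 1×2: one value per column
squared = mapcols(col -> col .^ 2, df)     # 3×2: same shape as df
```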

Mapping rows and columns using eachcol and eachrow#

map a function over each column and return a vector

map(mean, eachcol(x))

10-element Vector{Float64}:
 0.5709004496231695
 0.5069044704575527
 0.5133865376373459
 0.5751238253561982
 0.5746734925011687
 0.46696151605594977
 0.4839300017510227
 0.5875487958217412
 0.6035319143917144
 0.5023744992952286

iterating over pairs(eachcol(x)) yields a Pair of the column name and the column values

foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))

x1: 0.5709004496231695
x2: 0.5069044704575527
x3: 0.5133865376373459
x4: 0.5751238253561982
x5: 0.5746734925011687
x6: 0.46696151605594977
x7: 0.4839300017510227
x8: 0.5875487958217412
x9: 0.6035319143917144
x10: 0.5023744992952286
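Instead of printing, the name and column pairs can be collected into a lookup table. This is a sketch on made-up data; the variable names are hypothetical.

```julia
using DataFrames
using Statistics

df = DataFrame(a=[1.0, 2.0, 3.0], b=[4.0, 6.0, 8.0])

col_means = map(mean, eachcol(df))   # plain vector, column names are lost

# pairs(eachcol(df)) yields column name (a Symbol) => column vector.
means_by_name = Dict(name => mean(col) for (name, col) in pairs(eachcol(df)))
```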

now each element is a DataFrameRow, which works like a NamedTuple but is a view into the parent DataFrame

map(r -> r.x1 / r.x2, eachrow(x))

10-element Vector{Float64}:
 0.6424933971881048
 0.24314771856095865
 1.2141011230985737
 1.1724263056867696
 0.10484501684899052
 4.035893016397789
 1.5255319699420227
 1.9384648268062645
 0.11738731668127854
 0.985669840177133
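That a DataFrameRow is a view has a practical consequence: writing to it changes the parent data frame. The sketch below uses a made-up two-row frame and takes a NamedTuple snapshot first to show the difference.

```julia
using DataFrames

df = DataFrame(a=[1, 2], b=[3, 4])

r = first(eachrow(df))      # DataFrameRow: a view of the first row of df
snapshot = NamedTuple(r)    # independent copy of the current values

r.a = 100                   # writes through to df
```

After the assignment `df.a[1]` is 100, while `snapshot.a` still holds the old value 1.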

it prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)

er.x1 # you can access columns of a parent data frame directly

10-element Vector{Float64}:
 0.23941342855307912
 0.1273324638742388
 0.9102731889046806
 0.7533254463886194
 0.038704570132973126
 0.9396078861391909
 0.9860398107614828
 0.8770280375662959
 0.03067929975280259
 0.8066003641583329

it prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)

10×10 DataFrameColumns

Row	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	0.239413	0.372632	0.391808	0.881766	0.926426	0.172987	0.886067	0.9445	0.99134	0.567602
2	0.127332	0.523684	0.280744	0.631459	0.100009	0.414213	0.422549	0.463721	0.0210759	0.569317
3	0.910273	0.749751	0.665598	0.661469	0.656725	0.294394	0.485097	0.411687	0.288302	0.633039
4	0.753325	0.642535	0.221339	0.333083	0.783431	0.274499	0.402339	0.698189	0.719778	0.198621
5	0.0387046	0.36916	0.717133	0.38962	0.658046	0.57561	0.0361161	0.18453	0.6539	0.490985
6	0.939608	0.232813	0.632598	0.625016	0.988554	0.364242	0.384142	0.837759	0.775318	0.136857
7	0.98604	0.646358	0.676885	0.661943	0.798212	0.60685	0.783194	0.72725	0.445532	0.311484
8	0.877028	0.452434	0.654184	0.592795	0.160648	0.882515	0.363015	0.852214	0.762508	0.863806
9	0.0306793	0.261351	0.292775	0.339821	0.143583	0.473889	0.229176	0.0522024	0.763665	0.277183
10	0.8066	0.818327	0.600803	0.634267	0.5311	0.610416	0.847606	0.703434	0.613901	0.974852

you can access columns of a parent data frame directly

ec.x1

10-element Vector{Float64}:
 0.23941342855307912
 0.1273324638742388
 0.9102731889046806
 0.7533254463886194
 0.038704570132973126
 0.9396078861391909
 0.9860398107614828
 0.8770280375662959
 0.03067929975280259
 0.8066003641583329

Transposing#

you can transpose a data frame using permutedims:

df = DataFrame(reshape(1:12, 3, 4), :auto)

3×4 DataFrame

Row	x1	x2	x3	x4
	Int64	Int64	Int64	Int64
1	1	4	7	10
2	2	5	8	11
3	3	6	9	12

df.names = ["a", "b", "c"]

3-element Vector{String}:
 "a"
 "b"
 "c"

permutedims(df, :names)

4×4 DataFrame

Row	names	a	b	c
	String	Int64	Int64	Int64
1	x1	1	2	3
2	x2	4	5	6
3	x3	7	8	9
4	x4	10	11	12
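A quick way to convince yourself of what permutedims does is to apply it twice. In this sketch the made-up frame has the `names` column first, so transposing twice is expected to give back an equal data frame.

```julia
using DataFrames

df = DataFrame(names=["a", "b"], x1=[1, 2], x2=[3, 4])

t = permutedims(df, :names)      # values of :names become column names
back = permutedims(t, :names)    # transposing again restores the layout
```

In `t` the `names` column holds the former column names "x1" and "x2", and the columns `a` and `b` hold the former rows.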

revert the change to the number of rows shown in the output

delete!(ENV, "LINES")

Base.EnvDict with 18 entries:
  "PATH"                         => "/usr/local/julia//bin:/usr/local/bin:/usr/…
  "HOSTNAME"                     => "c0e5116d4768"
  "LANG"                         => "C.UTF-8"
  "GPG_KEY"                      => "7169605F62C751356D054A26A821E680E5FA6305"
  "PYTHON_VERSION"               => "3.12.2"
  "PYTHON_PIP_VERSION"           => "24.0"
  "PYTHON_GET_PIP_URL"           => "https://github.com/pypa/get-pip/raw/dbf0c8…
  "PYTHON_GET_PIP_SHA256"        => "dfe9fd5c28dc98b5ac17979a953ea550cec37ae1b4…
  "JULIA_CI"                     => "true"
  "JULIA_NUM_THREADS"            => "auto"
  "JULIA_CONDAPKG_BACKEND"       => "Null"
  "JULIA_PATH"                   => "/usr/local/julia/"
  "JULIA_DEPOT_PATH"             => "/srv/juliapkg/"
  "HOME"                         => "/root"
  "JPY_PARENT_PID"               => "1"
  "OPENBLAS_MAIN_FREE"           => "1"
  "OPENBLAS_DEFAULT_NUM_THREADS" => "1"
  "COLUMNS"                      => "80"
