Reshaping DataFrames

using DataFrames

Wide to long

x = DataFrame(id=[1, 2, 3, 4], id2=[1, 1, 2, 2], M1=[11, 12, 13, 14], M2=[111, 112, 113, 114])

4×4 DataFrame

Row	id	id2	M1	M2
	Int64	Int64	Int64	Int64
1	1	1	11	111
2	2	1	12	112
3	3	2	13	113
4	4	2	14	114

pass the measure variables first and then the id-variable

stack(x, [:M1, :M2], :id)

8×3 DataFrame

Row	id	variable	value
	Int64	String	Int64
1	1	M1	11
2	2	M1	12
3	3	M1	13
4	4	M1	14
5	1	M2	111
6	2	M2	112
7	3	M2	113
8	4	M2	114

add the view=true keyword argument to make a view; in that case the columns of the resulting data frame share memory with the columns of the source data frame, so the operation is potentially unsafe. Optionally, you can rename the variable and value columns.

stack(x, ["M1", "M2"], "id", variable_name="key", value_name="observed")

8×3 DataFrame

Row	id	key	observed
	Int64	String	Int64
1	1	M1	11
2	2	M1	12
3	3	M1	13
4	4	M1	14
5	1	M2	111
6	2	M2	112
7	3	M2	113
8	4	M2	114
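The view=true keyword mentioned above can be illustrated with a minimal sketch. The names sv and sc are made up for this example. It assumes that the stacked view keeps references to the source columns, so an in-place change to the source shows up in the view but not in an ordinary stacked copy.

```julia
using DataFrames

x = DataFrame(id=[1, 2, 3, 4], id2=[1, 1, 2, 2],
              M1=[11, 12, 13, 14], M2=[111, 112, 113, 114])

sv = stack(x, [:M1, :M2], :id, view=true)  # view: columns share memory with x
sc = stack(x, [:M1, :M2], :id)             # default: columns are freshly allocated

same_before = sv == sc                     # both hold the same values at this point

x.M1[1] = -1                               # mutate the source column in place

view_sees_change = sv.value[1] == -1       # the view reflects the new value
copy_unchanged = sc.value[1] == 11         # the copy keeps the old value
```

This is why the view is described as potentially unsafe: later changes to the source data frame silently change the stacked result as well.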

if the id-variables argument is omitted in stack, all columns not selected as measure variables are assumed to be the id-variables

stack(x, Not([:id, :id2]))

8×4 DataFrame

Row	id	id2	variable	value
	Int64	Int64	String	Int64
1	1	1	M1	11
2	2	1	M1	12
3	3	2	M1	13
4	4	2	M1	14
5	1	1	M2	111
6	2	1	M2	112
7	3	2	M2	113
8	4	2	M2	114

you can use column indices instead of symbols

stack(x, Not([1, 2]))

8×4 DataFrame

Row	id	id2	variable	value
	Int64	Int64	String	Int64
1	1	1	M1	11
2	2	1	M1	12
3	3	2	M1	13
4	4	2	M1	14
5	1	1	M2	111
6	2	1	M2	112
7	3	2	M2	113
8	4	2	M2	114
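Besides names and indices, the measure variables can presumably be given with the other column selectors that DataFrames.jl accepts when indexing columns. The sketch below assumes that a Regex and a Between selector are accepted in this position; the variable names are made up for the example.

```julia
using DataFrames

x = DataFrame(id=[1, 2, 3, 4], id2=[1, 1, 2, 2],
              M1=[11, 12, 13, 14], M2=[111, 112, 113, 114])

by_names = stack(x, [:M1, :M2])            # explicit column names
by_regex = stack(x, r"^M")                 # every column whose name starts with "M"
by_range = stack(x, Between(:M1, :M2))     # the contiguous block of columns from M1 to M2

# all three calls select the same measure variables
all_equal = by_names == by_regex == by_range
```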

if stack is not passed any measure variables, floating-point columns are selected as measures by default (here a1 and a2; the integer column id stays an id-variable)

x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'], a1=rand(3), a2=rand(3))
stack(x)

6×4 DataFrame

Row	id	id2	variable	value
	Int64	Char	String	Float64
1	1	a	a1	0.370929
2	1	b	a1	0.335389
3	1	c	a1	0.41355
4	1	a	a2	0.805724
5	1	b	a2	0.805891
6	1	c	a2	0.30079
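In the output above the Int64 column id is not stacked, which suggests that the default picks floating-point columns rather than every numeric column. A small check, using a made-up data frame with one integer and one floating-point column besides the id:

```julia
using DataFrames

df = DataFrame(id=[1, 2], n=[10, 20], f=[0.5, 1.5])

long = stack(df)                     # no measure variables given

stacked = unique(long.variable)      # names of the columns that were stacked
kept = names(long)                   # columns of the long data frame
```

Here only f ends up in the variable column, while id and n are kept as id-variables. To stack an integer column such as n, list it explicitly, for example stack(df, [:n, :f]).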

here all columns are treated as measures:

stack(DataFrame(rand(3, 2), :auto))

6×2 DataFrame

Row	variable	value
	String	Float64
1	x1	0.820408
2	x1	0.328456
3	x1	0.518557
4	x2	0.422249
5	x2	0.526879
6	x2	0.208435

duplicate values in the id column (here, key) are silently accepted

df = DataFrame(rand(3, 2), :auto)
df.key = [1, 1, 1]
mdf = stack(df)

6×3 DataFrame

Row	key	variable	value
	Int64	String	Float64
1	1	x1	0.230161
2	1	x1	0.896468
3	1	x1	0.744152
4	1	x2	0.495718
5	1	x2	0.917839
6	1	x2	0.874574

Long to wide

x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'], a1=rand(3), a2=rand(3))

3×4 DataFrame

Row	id	id2	a1	a2
	Int64	Char	Float64	Float64
1	1	a	0.502672	0.800395
2	1	b	0.815703	0.831141
3	1	c	0.776919	0.473717

y = stack(x)

6×4 DataFrame

Row	id	id2	variable	value
	Int64	Char	String	Float64
1	1	a	a1	0.502672
2	1	b	a1	0.815703
3	1	c	a1	0.776919
4	1	a	a2	0.800395
5	1	b	a2	0.831141
6	1	c	a2	0.473717

standard unstack with a specified key

unstack(y, :id2, :variable, :value)

3×3 DataFrame

Row	id2	a1	a2
	Char	Float64?	Float64?
1	a	0.502672	0.800395
2	b	0.815703	0.831141
3	c	0.776919	0.473717

all other columns are treated as keys

unstack(y, :variable, :value)

3×4 DataFrame

Row	id	id2	a1	a2
	Int64	Char	Float64?	Float64?
1	1	a	0.502672	0.800395
2	1	b	0.815703	0.831141
3	1	c	0.776919	0.473717

all columns other than those named :variable and :value are treated as keys

unstack(y)

3×4 DataFrame

Row	id	id2	a1	a2
	Int64	Char	Float64?	Float64?
1	1	a	0.502672	0.800395
2	1	b	0.815703	0.831141
3	1	c	0.776919	0.473717
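Since unstack reverses stack, a round trip should give back the original values. The sketch below uses fixed numbers instead of rand so that the comparison is reproducible; disallowmissing is used to turn the Float64? columns produced by unstack back into Float64.

```julia
using DataFrames

x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'],
              a1=[0.1, 0.2, 0.3], a2=[1.1, 1.2, 1.3])

y = stack(x)                          # wide to long
wide = unstack(y)                     # long to wide again

same_values = wide == x               # values match, although the eltypes differ
restored = disallowmissing(wide)      # drop the Missing part of the column types
same_eltype = eltype(restored.a1) == Float64
```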

you can rename the unstacked columns

unstack(y, renamecols=n -> string("unstacked_", n))

3×4 DataFrame

Row	id	id2	unstacked_a1	unstacked_a2
	Int64	Char	Float64?	Float64?
1	1	a	0.502672	0.800395
2	1	b	0.815703	0.831141
3	1	c	0.776919	0.473717

df = stack(DataFrame(rand(3, 2), :auto))

6×2 DataFrame

Row	variable	value
	String	Float64
1	x1	0.607313
2	x1	0.85144
3	x1	0.0322449
4	x2	0.386384
5	x2	0.365752
6	x2	0.0791589

unstack fails when no key column is present, because several rows then map to the same variable and count as duplicates

try
    unstack(df, :variable, :value)
catch e
    show(e)
end

ArgumentError("Duplicate entries in unstack at row 2 for key () and variable x1. Pass `combine` keyword argument to specify how they should be handled.")
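The error message points to two possible ways out, sketched below on a small made-up data frame. The first passes the combine keyword named in the message (this assumes a DataFrames.jl version recent enough to have it). The second adds a key column so that each row is uniquely identified; the column name occurrence is hypothetical.

```julia
using DataFrames

df = DataFrame(variable=["x1", "x1", "x2"], value=[1, 2, 4])

# option 1: aggregate the duplicates, here by summing them
summed = unstack(df, :variable, :value, combine=sum)

# option 2: add a key column that tells the duplicates apart
df.occurrence = [1, 2, 1]             # hypothetical helper column
wide = unstack(df, :occurrence, :variable, :value)
```

With combine=sum the result has a single row holding the per-variable sums. With the extra key column there is one row per occurrence, and the combination that does not exist (second occurrence of x2) is filled with missing.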

unstack fills missing combinations with missing, but you can change this default with the fill keyword argument.

df = DataFrame(key=[1, 1, 2], variable=["a", "b", "a"], value=1:3)

3×3 DataFrame

Row	key	variable	value
	Int64	String	Int64
1	1	a	1
2	1	b	2
3	2	a	3

unstack(df, :variable, :value)

2×3 DataFrame

Row	key	a	b
	Int64	Int64?	Int64?
1	1	1	2
2	2	3	missing

unstack(df, :variable, :value, fill=0)

2×3 DataFrame

Row	key	a	b
	Int64	Int64	Int64
1	1	1	2
2	2	3	0

This notebook was generated using Literate.jl.
