Reshaping DataFrames

Reshaping DataFrames#

using DataFrames

Wide to long#

x = DataFrame(id=[1, 2, 3, 4], id2=[1, 1, 2, 2], M1=[11, 12, 13, 14], M2=[111, 112, 113, 114])

4×4 DataFrame

Row	id	id2	M1	M2
	Int64	Int64	Int64	Int64
1	1	1	11	111
2	2	1	12	112
3	3	2	13	113
4	4	2	14	114

first pass measure variables and then id-variable

stack(x, [:M1, :M2], :id)

8×3 DataFrame

Row	id	variable	value
	Int64	String	Int64
1	1	M1	11
2	2	M1	12
3	3	M1	13
4	4	M1	14
5	1	M2	111
6	2	M2	112
7	3	M2	113
8	4	M2	114

add view=true keyword argument to make a view; in that case columns of the resulting data frame share memory with columns of the source data frame, so the operation is potentially unsafe optionally you can rename columns

stack(x, ["M1", "M2"], "id", variable_name="key", value_name="observed")

8×3 DataFrame

Row	id	key	observed
	Int64	String	Int64
1	1	M1	11
2	2	M1	12
3	3	M1	13
4	4	M1	14
5	1	M2	111
6	2	M2	112
7	3	M2	113
8	4	M2	114

if second argument is omitted in stack , all other columns are assumed to be the id-variables

stack(x, Not([:id, :id2]))

8×4 DataFrame

Row	id	id2	variable	value
	Int64	Int64	String	Int64
1	1	1	M1	11
2	2	1	M1	12
3	3	2	M1	13
4	4	2	M1	14
5	1	1	M2	111
6	2	1	M2	112
7	3	2	M2	113
8	4	2	M2	114

you can use index instead of symbol

stack(x, Not([1, 2]))

8×4 DataFrame

Row	id	id2	variable	value
	Int64	Int64	String	Int64
1	1	1	M1	11
2	2	1	M1	12
3	3	2	M1	13
4	4	2	M1	14
5	1	1	M2	111
6	2	1	M2	112
7	3	2	M2	113
8	4	2	M2	114

x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'], a1=rand(3), a2=rand(3))

3×4 DataFrame

Row	id	id2	a1	a2
	Int64	Char	Float64	Float64
1	1	a	0.288109	0.790831
2	1	b	0.60829	0.0670148
3	1	c	0.157403	0.538962

if stack is not passed any measure variables by default numeric variables are selected as measures

stack(x)

6×4 DataFrame

Row	id	id2	variable	value
	Int64	Char	String	Float64
1	1	a	a1	0.288109
2	1	b	a1	0.60829
3	1	c	a1	0.157403
4	1	a	a2	0.790831
5	1	b	a2	0.0670148
6	1	c	a2	0.538962

here all columns are treated as measures:

stack(DataFrame(rand(3, 2), :auto))

6×2 DataFrame

Row	variable	value
	String	Float64
1	x1	0.157915
2	x1	0.358505
3	x1	0.045899
4	x2	0.865111
5	x2	0.127051
6	x2	0.952189

df = DataFrame(rand(3, 2), :auto)
df.key = [1, 1, 1]
mdf = stack(df) ## duplicates in key are silently accepted

6×3 DataFrame

Row	key	variable	value
	Int64	String	Float64
1	1	x1	0.712677
2	1	x1	0.143423
3	1	x1	0.873908
4	1	x2	0.92694
5	1	x2	0.141397
6	1	x2	0.421874

Long to wide#

x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'], a1=rand(3), a2=rand(3))

3×4 DataFrame

Row	id	id2	a1	a2
	Int64	Char	Float64	Float64
1	1	a	0.145994	0.63789
2	1	b	0.703018	0.134889
3	1	c	0.27519	0.619725

y = stack(x)

6×4 DataFrame

Row	id	id2	variable	value
	Int64	Char	String	Float64
1	1	a	a1	0.145994
2	1	b	a1	0.703018
3	1	c	a1	0.27519
4	1	a	a2	0.63789
5	1	b	a2	0.134889
6	1	c	a2	0.619725

stndard unstack with a specified key

unstack(y, :id2, :variable, :value)

3×3 DataFrame

Row	id2	a1	a2
	Char	Float64?	Float64?
1	a	0.145994	0.63789
2	b	0.703018	0.134889
3	c	0.27519	0.619725

all other columns are treated as keys

unstack(y, :variable, :value)

3×4 DataFrame

Row	id	id2	a1	a2
	Int64	Char	Float64?	Float64?
1	1	a	0.145994	0.63789
2	1	b	0.703018	0.134889
3	1	c	0.27519	0.619725

all columns other than named :variable and :value are treated as keys

unstack(y)

3×4 DataFrame

Row	id	id2	a1	a2
	Int64	Char	Float64?	Float64?
1	1	a	0.145994	0.63789
2	1	b	0.703018	0.134889
3	1	c	0.27519	0.619725

you can rename the unstacked columns

unstack(y, renamecols=n -> string("unstacked_", n))

3×4 DataFrame

Row	id	id2	unstacked_a1	unstacked_a2
	Int64	Char	Float64?	Float64?
1	1	a	0.145994	0.63789
2	1	b	0.703018	0.134889
3	1	c	0.27519	0.619725

df = stack(DataFrame(rand(3, 2), :auto))

6×2 DataFrame

Row	variable	value
	String	Float64
1	x1	0.733841
2	x1	0.115879
3	x1	0.133163
4	x2	0.96931
5	x2	0.766917
6	x2	0.713736

unable to unstack when no key column is present

try
    unstack(df, :variable, :value)
catch e
    show(e)
end

ArgumentError("Duplicate entries in unstack at row 2 for key () and variable x1. Pass `combine` keyword argument to specify how they should be handled.")

unstack fills missing combinations with missing, but you can change this default with fill keyword argument.

df = DataFrame(key=[1, 1, 2], variable=["a", "b", "a"], value=1:3)

3×3 DataFrame

Row	key	variable	value
	Int64	String	Int64
1	1	a	1
2	1	b	2
3	2	a	3

unstack(df, :variable, :value)

2×3 DataFrame

Row	key	a	b
	Int64	Int64?	Int64?
1	1	1	2
2	2	3	missing

unstack(df, :variable, :value, fill=0)

2×3 DataFrame

Row	key	a	b
	Int64	Int64	Int64
1	1	1	2
2	2	3	0

Reshaping DataFrames

Contents

Reshaping DataFrames#

Wide to long#

Long to wide#