Reshaping DataFrames

Reshaping DataFrames#

using DataFrames

Wide to long#

x = DataFrame(id=[1, 2, 3, 4], id2=[1, 1, 2, 2], M1=[11, 12, 13, 14], M2=[111, 112, 113, 114])
4×4 DataFrame
Rowidid2M1M2
Int64Int64Int64Int64
11111111
22112112
33213113
44214114

first pass measure variables and then id-variable

stack(x, [:M1, :M2], :id)
8×3 DataFrame
Rowidvariablevalue
Int64StringInt64
11M111
22M112
33M113
44M114
51M2111
62M2112
73M2113
84M2114

add view=true keyword argument to make a view; in that case columns of the resulting data frame share memory with columns of the source data frame, so the operation is potentially unsafe optionally you can rename columns

stack(x, ["M1", "M2"], "id", variable_name="key", value_name="observed")
8×3 DataFrame
Rowidkeyobserved
Int64StringInt64
11M111
22M112
33M113
44M114
51M2111
62M2112
73M2113
84M2114

if second argument is omitted in stack , all other columns are assumed to be the id-variables

stack(x, Not([:id, :id2]))
8×4 DataFrame
Rowidid2variablevalue
Int64Int64StringInt64
111M111
221M112
332M113
442M114
511M2111
621M2112
732M2113
842M2114

you can use index instead of symbol

stack(x, Not([1, 2]))
8×4 DataFrame
Rowidid2variablevalue
Int64Int64StringInt64
111M111
221M112
332M113
442M114
511M2111
621M2112
732M2113
842M2114
x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'], a1=rand(3), a2=rand(3))
3×4 DataFrame
Rowidid2a1a2
Int64CharFloat64Float64
11a0.2881090.790831
21b0.608290.0670148
31c0.1574030.538962

if stack is not passed any measure variables by default numeric variables are selected as measures

stack(x)
6×4 DataFrame
Rowidid2variablevalue
Int64CharStringFloat64
11aa10.288109
21ba10.60829
31ca10.157403
41aa20.790831
51ba20.0670148
61ca20.538962

here all columns are treated as measures:

stack(DataFrame(rand(3, 2), :auto))
6×2 DataFrame
Rowvariablevalue
StringFloat64
1x10.157915
2x10.358505
3x10.045899
4x20.865111
5x20.127051
6x20.952189
df = DataFrame(rand(3, 2), :auto)
df.key = [1, 1, 1]
mdf = stack(df) ## duplicates in key are silently accepted
6×3 DataFrame
Rowkeyvariablevalue
Int64StringFloat64
11x10.712677
21x10.143423
31x10.873908
41x20.92694
51x20.141397
61x20.421874

Long to wide#

x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'], a1=rand(3), a2=rand(3))
3×4 DataFrame
Rowidid2a1a2
Int64CharFloat64Float64
11a0.1459940.63789
21b0.7030180.134889
31c0.275190.619725
y = stack(x)
6×4 DataFrame
Rowidid2variablevalue
Int64CharStringFloat64
11aa10.145994
21ba10.703018
31ca10.27519
41aa20.63789
51ba20.134889
61ca20.619725

stndard unstack with a specified key

unstack(y, :id2, :variable, :value)
3×3 DataFrame
Rowid2a1a2
CharFloat64?Float64?
1a0.1459940.63789
2b0.7030180.134889
3c0.275190.619725

all other columns are treated as keys

unstack(y, :variable, :value)
3×4 DataFrame
Rowidid2a1a2
Int64CharFloat64?Float64?
11a0.1459940.63789
21b0.7030180.134889
31c0.275190.619725

all columns other than named :variable and :value are treated as keys

unstack(y)
3×4 DataFrame
Rowidid2a1a2
Int64CharFloat64?Float64?
11a0.1459940.63789
21b0.7030180.134889
31c0.275190.619725

you can rename the unstacked columns

unstack(y, renamecols=n -> string("unstacked_", n))
3×4 DataFrame
Rowidid2unstacked_a1unstacked_a2
Int64CharFloat64?Float64?
11a0.1459940.63789
21b0.7030180.134889
31c0.275190.619725
df = stack(DataFrame(rand(3, 2), :auto))
6×2 DataFrame
Rowvariablevalue
StringFloat64
1x10.733841
2x10.115879
3x10.133163
4x20.96931
5x20.766917
6x20.713736

unable to unstack when no key column is present

try
    unstack(df, :variable, :value)
catch e
    show(e)
end
ArgumentError("Duplicate entries in unstack at row 2 for key () and variable x1. Pass `combine` keyword argument to specify how they should be handled.")

unstack fills missing combinations with missing, but you can change this default with fill keyword argument.

df = DataFrame(key=[1, 1, 2], variable=["a", "b", "a"], value=1:3)
3×3 DataFrame
Rowkeyvariablevalue
Int64StringInt64
11a1
21b2
32a3
unstack(df, :variable, :value)
2×3 DataFrame
Rowkeyab
Int64Int64?Int64?
1112
223missing
unstack(df, :variable, :value, fill=0)
2×3 DataFrame
Rowkeyab
Int64Int64Int64
1112
2230