Reshaping DataFrames#
using DataFrames
Wide to long#
x = DataFrame(id=[1, 2, 3, 4], id2=[1, 1, 2, 2], M1=[11, 12, 13, 14], M2=[111, 112, 113, 114])
Row | id | id2 | M1 | M2 |
---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | |
1 | 1 | 1 | 11 | 111 |
2 | 2 | 1 | 12 | 112 |
3 | 3 | 2 | 13 | 113 |
4 | 4 | 2 | 14 | 114 |
first pass measure variables and then id-variable
stack(x, [:M1, :M2], :id)
Row | id | variable | value |
---|---|---|---|
Int64 | String | Int64 | |
1 | 1 | M1 | 11 |
2 | 2 | M1 | 12 |
3 | 3 | M1 | 13 |
4 | 4 | M1 | 14 |
5 | 1 | M2 | 111 |
6 | 2 | M2 | 112 |
7 | 3 | M2 | 113 |
8 | 4 | M2 | 114 |
add view=true
keyword argument to make a view; in that case columns of the resulting data frame share memory with columns of the source data frame, so the operation is potentially unsafe.
Optionally, you can rename columns.
stack(x, ["M1", "M2"], "id", variable_name="key", value_name="observed")
Row | id | key | observed |
---|---|---|---|
Int64 | String | Int64 | |
1 | 1 | M1 | 11 |
2 | 2 | M1 | 12 |
3 | 3 | M1 | 13 |
4 | 4 | M1 | 14 |
5 | 1 | M2 | 111 |
6 | 2 | M2 | 112 |
7 | 3 | M2 | 113 |
8 | 4 | M2 | 114 |
if second argument is omitted in stack
, all other columns are assumed to be the id-variables
stack(x, Not([:id, :id2]))
Row | id | id2 | variable | value |
---|---|---|---|---|
Int64 | Int64 | String | Int64 | |
1 | 1 | 1 | M1 | 11 |
2 | 2 | 1 | M1 | 12 |
3 | 3 | 2 | M1 | 13 |
4 | 4 | 2 | M1 | 14 |
5 | 1 | 1 | M2 | 111 |
6 | 2 | 1 | M2 | 112 |
7 | 3 | 2 | M2 | 113 |
8 | 4 | 2 | M2 | 114 |
you can use index instead of symbol
stack(x, Not([1, 2]))
Row | id | id2 | variable | value |
---|---|---|---|---|
Int64 | Int64 | String | Int64 | |
1 | 1 | 1 | M1 | 11 |
2 | 2 | 1 | M1 | 12 |
3 | 3 | 2 | M1 | 13 |
4 | 4 | 2 | M1 | 14 |
5 | 1 | 1 | M2 | 111 |
6 | 2 | 1 | M2 | 112 |
7 | 3 | 2 | M2 | 113 |
8 | 4 | 2 | M2 | 114 |
if stack
is not passed any measure variables by default numeric variables are selected as measures
x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'], a1=rand(3), a2=rand(3))
stack(x)
Row | id | id2 | variable | value |
---|---|---|---|---|
Int64 | Char | String | Float64 | |
1 | 1 | a | a1 | 0.773682 |
2 | 1 | b | a1 | 0.415153 |
3 | 1 | c | a1 | 0.581536 |
4 | 1 | a | a2 | 0.765763 |
5 | 1 | b | a2 | 0.492113 |
6 | 1 | c | a2 | 0.656704 |
here all columns are treated as measures:
stack(DataFrame(rand(3, 2), :auto))
Row | variable | value |
---|---|---|
String | Float64 | |
1 | x1 | 0.752894 |
2 | x1 | 0.495428 |
3 | x1 | 0.156994 |
4 | x2 | 0.30824 |
5 | x2 | 0.406108 |
6 | x2 | 0.144493 |
duplicates in key are silently accepted
df = DataFrame(rand(3, 2), :auto)
df.key = [1, 1, 1]
mdf = stack(df)
Row | key | variable | value |
---|---|---|---|
Int64 | String | Float64 | |
1 | 1 | x1 | 0.439815 |
2 | 1 | x1 | 0.83465 |
3 | 1 | x1 | 0.428553 |
4 | 1 | x2 | 0.00949398 |
5 | 1 | x2 | 0.709612 |
6 | 1 | x2 | 0.963289 |
Long to wide#
x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'], a1=rand(3), a2=rand(3))
Row | id | id2 | a1 | a2 |
---|---|---|---|---|
Int64 | Char | Float64 | Float64 | |
1 | 1 | a | 0.433598 | 0.782422 |
2 | 1 | b | 0.0673019 | 0.807716 |
3 | 1 | c | 0.18555 | 0.0390196 |
y = stack(x)
Row | id | id2 | variable | value |
---|---|---|---|---|
Int64 | Char | String | Float64 | |
1 | 1 | a | a1 | 0.433598 |
2 | 1 | b | a1 | 0.0673019 |
3 | 1 | c | a1 | 0.18555 |
4 | 1 | a | a2 | 0.782422 |
5 | 1 | b | a2 | 0.807716 |
6 | 1 | c | a2 | 0.0390196 |
standard unstack with a specified key
unstack(y, :id2, :variable, :value)
Row | id2 | a1 | a2 |
---|---|---|---|
Char | Float64? | Float64? | |
1 | a | 0.433598 | 0.782422 |
2 | b | 0.0673019 | 0.807716 |
3 | c | 0.18555 | 0.0390196 |
all other columns are treated as keys
unstack(y, :variable, :value)
Row | id | id2 | a1 | a2 |
---|---|---|---|---|
Int64 | Char | Float64? | Float64? | |
1 | 1 | a | 0.433598 | 0.782422 |
2 | 1 | b | 0.0673019 | 0.807716 |
3 | 1 | c | 0.18555 | 0.0390196 |
all columns other than named :variable and :value are treated as keys
unstack(y)
Row | id | id2 | a1 | a2 |
---|---|---|---|---|
Int64 | Char | Float64? | Float64? | |
1 | 1 | a | 0.433598 | 0.782422 |
2 | 1 | b | 0.0673019 | 0.807716 |
3 | 1 | c | 0.18555 | 0.0390196 |
you can rename the unstacked columns
unstack(y, renamecols=n -> string("unstacked_", n))
Row | id | id2 | unstacked_a1 | unstacked_a2 |
---|---|---|---|---|
Int64 | Char | Float64? | Float64? | |
1 | 1 | a | 0.433598 | 0.782422 |
2 | 1 | b | 0.0673019 | 0.807716 |
3 | 1 | c | 0.18555 | 0.0390196 |
df = stack(DataFrame(rand(3, 2), :auto))
Row | variable | value |
---|---|---|
String | Float64 | |
1 | x1 | 0.0847903 |
2 | x1 | 0.508263 |
3 | x1 | 0.043781 |
4 | x2 | 0.22944 |
5 | x2 | 0.180225 |
6 | x2 | 0.316169 |
unable to unstack when no key column is present
try
unstack(df, :variable, :value)
catch e
show(e)
end
ArgumentError("Duplicate entries in unstack at row 2 for key () and variable x1. Pass `combine` keyword argument to specify how they should be handled.")
unstack
fills missing combinations with missing
, but you can change this default with fill
keyword argument.
df = DataFrame(key=[1, 1, 2], variable=["a", "b", "a"], value=1:3)
Row | key | variable | value |
---|---|---|---|
Int64 | String | Int64 | |
1 | 1 | a | 1 |
2 | 1 | b | 2 |
3 | 2 | a | 3 |
unstack(df, :variable, :value)
Row | key | a | b |
---|---|---|---|
Int64 | Int64? | Int64? | |
1 | 1 | 1 | 2 |
2 | 2 | 3 | missing |
unstack(df, :variable, :value, fill=0)
Row | key | a | b |
---|---|---|---|
Int64 | Int64 | Int64 | |
1 | 1 | 1 | 2 |
2 | 2 | 3 | 0 |
This notebook was generated using Literate.jl.