Reshaping DataFrames#
using DataFrames
Wide to long#
x = DataFrame(id=[1, 2, 3, 4], id2=[1, 1, 2, 2], M1=[11, 12, 13, 14], M2=[111, 112, 113, 114])
Row | id | id2 | M1 | M2 |
---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | |
1 | 1 | 1 | 11 | 111 |
2 | 2 | 1 | 12 | 112 |
3 | 3 | 2 | 13 | 113 |
4 | 4 | 2 | 14 | 114 |
first pass measure variables and then id-variable
stack(x, [:M1, :M2], :id)
Row | id | variable | value |
---|---|---|---|
Int64 | String | Int64 | |
1 | 1 | M1 | 11 |
2 | 2 | M1 | 12 |
3 | 3 | M1 | 13 |
4 | 4 | M1 | 14 |
5 | 1 | M2 | 111 |
6 | 2 | M2 | 112 |
7 | 3 | M2 | 113 |
8 | 4 | M2 | 114 |
add view=true
keyword argument to make a view; in that case columns of the resulting data frame share memory with columns of the source data frame, so the operation is potentially unsafe
optionally you can rename columns
stack(x, ["M1", "M2"], "id", variable_name="key", value_name="observed")
Row | id | key | observed |
---|---|---|---|
Int64 | String | Int64 | |
1 | 1 | M1 | 11 |
2 | 2 | M1 | 12 |
3 | 3 | M1 | 13 |
4 | 4 | M1 | 14 |
5 | 1 | M2 | 111 |
6 | 2 | M2 | 112 |
7 | 3 | M2 | 113 |
8 | 4 | M2 | 114 |
if second argument is omitted in stack
, all other columns are assumed to be the id-variables
stack(x, Not([:id, :id2]))
Row | id | id2 | variable | value |
---|---|---|---|---|
Int64 | Int64 | String | Int64 | |
1 | 1 | 1 | M1 | 11 |
2 | 2 | 1 | M1 | 12 |
3 | 3 | 2 | M1 | 13 |
4 | 4 | 2 | M1 | 14 |
5 | 1 | 1 | M2 | 111 |
6 | 2 | 1 | M2 | 112 |
7 | 3 | 2 | M2 | 113 |
8 | 4 | 2 | M2 | 114 |
you can use index instead of symbol
stack(x, Not([1, 2]))
Row | id | id2 | variable | value |
---|---|---|---|---|
Int64 | Int64 | String | Int64 | |
1 | 1 | 1 | M1 | 11 |
2 | 2 | 1 | M1 | 12 |
3 | 3 | 2 | M1 | 13 |
4 | 4 | 2 | M1 | 14 |
5 | 1 | 1 | M2 | 111 |
6 | 2 | 1 | M2 | 112 |
7 | 3 | 2 | M2 | 113 |
8 | 4 | 2 | M2 | 114 |
x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'], a1=rand(3), a2=rand(3))
Row | id | id2 | a1 | a2 |
---|---|---|---|---|
Int64 | Char | Float64 | Float64 | |
1 | 1 | a | 0.288109 | 0.790831 |
2 | 1 | b | 0.60829 | 0.0670148 |
3 | 1 | c | 0.157403 | 0.538962 |
if stack
is not passed any measure variables by default numeric variables are selected as measures
stack(x)
Row | id | id2 | variable | value |
---|---|---|---|---|
Int64 | Char | String | Float64 | |
1 | 1 | a | a1 | 0.288109 |
2 | 1 | b | a1 | 0.60829 |
3 | 1 | c | a1 | 0.157403 |
4 | 1 | a | a2 | 0.790831 |
5 | 1 | b | a2 | 0.0670148 |
6 | 1 | c | a2 | 0.538962 |
here all columns are treated as measures:
stack(DataFrame(rand(3, 2), :auto))
Row | variable | value |
---|---|---|
String | Float64 | |
1 | x1 | 0.157915 |
2 | x1 | 0.358505 |
3 | x1 | 0.045899 |
4 | x2 | 0.865111 |
5 | x2 | 0.127051 |
6 | x2 | 0.952189 |
df = DataFrame(rand(3, 2), :auto)
df.key = [1, 1, 1]
mdf = stack(df) ## duplicates in key are silently accepted
Row | key | variable | value |
---|---|---|---|
Int64 | String | Float64 | |
1 | 1 | x1 | 0.712677 |
2 | 1 | x1 | 0.143423 |
3 | 1 | x1 | 0.873908 |
4 | 1 | x2 | 0.92694 |
5 | 1 | x2 | 0.141397 |
6 | 1 | x2 | 0.421874 |
Long to wide#
x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'], a1=rand(3), a2=rand(3))
Row | id | id2 | a1 | a2 |
---|---|---|---|---|
Int64 | Char | Float64 | Float64 | |
1 | 1 | a | 0.145994 | 0.63789 |
2 | 1 | b | 0.703018 | 0.134889 |
3 | 1 | c | 0.27519 | 0.619725 |
y = stack(x)
Row | id | id2 | variable | value |
---|---|---|---|---|
Int64 | Char | String | Float64 | |
1 | 1 | a | a1 | 0.145994 |
2 | 1 | b | a1 | 0.703018 |
3 | 1 | c | a1 | 0.27519 |
4 | 1 | a | a2 | 0.63789 |
5 | 1 | b | a2 | 0.134889 |
6 | 1 | c | a2 | 0.619725 |
stndard unstack with a specified key
unstack(y, :id2, :variable, :value)
Row | id2 | a1 | a2 |
---|---|---|---|
Char | Float64? | Float64? | |
1 | a | 0.145994 | 0.63789 |
2 | b | 0.703018 | 0.134889 |
3 | c | 0.27519 | 0.619725 |
all other columns are treated as keys
unstack(y, :variable, :value)
Row | id | id2 | a1 | a2 |
---|---|---|---|---|
Int64 | Char | Float64? | Float64? | |
1 | 1 | a | 0.145994 | 0.63789 |
2 | 1 | b | 0.703018 | 0.134889 |
3 | 1 | c | 0.27519 | 0.619725 |
all columns other than named :variable and :value are treated as keys
unstack(y)
Row | id | id2 | a1 | a2 |
---|---|---|---|---|
Int64 | Char | Float64? | Float64? | |
1 | 1 | a | 0.145994 | 0.63789 |
2 | 1 | b | 0.703018 | 0.134889 |
3 | 1 | c | 0.27519 | 0.619725 |
you can rename the unstacked columns
unstack(y, renamecols=n -> string("unstacked_", n))
Row | id | id2 | unstacked_a1 | unstacked_a2 |
---|---|---|---|---|
Int64 | Char | Float64? | Float64? | |
1 | 1 | a | 0.145994 | 0.63789 |
2 | 1 | b | 0.703018 | 0.134889 |
3 | 1 | c | 0.27519 | 0.619725 |
df = stack(DataFrame(rand(3, 2), :auto))
Row | variable | value |
---|---|---|
String | Float64 | |
1 | x1 | 0.733841 |
2 | x1 | 0.115879 |
3 | x1 | 0.133163 |
4 | x2 | 0.96931 |
5 | x2 | 0.766917 |
6 | x2 | 0.713736 |
unable to unstack when no key column is present
try
unstack(df, :variable, :value)
catch e
show(e)
end
ArgumentError("Duplicate entries in unstack at row 2 for key () and variable x1. Pass `combine` keyword argument to specify how they should be handled.")
unstack
fills missing combinations with missing
, but you can change this default with fill
keyword argument.
df = DataFrame(key=[1, 1, 2], variable=["a", "b", "a"], value=1:3)
Row | key | variable | value |
---|---|---|---|
Int64 | String | Int64 | |
1 | 1 | a | 1 |
2 | 1 | b | 2 |
3 | 2 | a | 3 |
unstack(df, :variable, :value)
Row | key | a | b |
---|---|---|---|
Int64 | Int64? | Int64? | |
1 | 1 | 1 | 2 |
2 | 2 | 3 | missing |
unstack(df, :variable, :value, fill=0)
Row | key | a | b |
---|---|---|---|
Int64 | Int64 | Int64 | |
1 | 1 | 1 | 2 |
2 | 2 | 3 | 0 |