{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Reshaping DataFrames"
],
"metadata": {}
},
{
"outputs": [],
"cell_type": "code",
"source": [
"using DataFrames"
],
"metadata": {},
"execution_count": 1
},
{
"cell_type": "markdown",
"source": [
"## Wide to long"
],
"metadata": {}
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m4×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m M1 \u001b[0m\u001b[1m M2 \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\n─────┼────────────────────────────\n 1 │ 1 1 11 111\n 2 │ 2 1 12 112\n 3 │ 3 2 13 113\n 4 │ 4 2 14 114",
"text/html": [
"
1 | 1 | 1 | 11 | 111 |
2 | 2 | 1 | 12 | 112 |
3 | 3 | 2 | 13 | 113 |
4 | 4 | 2 | 14 | 114 |
"
]
},
"metadata": {},
"execution_count": 2
}
],
"cell_type": "code",
"source": [
"x = DataFrame(id=[1, 2, 3, 4], id2=[1, 1, 2, 2], M1=[11, 12, 13, 14], M2=[111, 112, 113, 114])"
],
"metadata": {},
"execution_count": 2
},
{
"cell_type": "markdown",
"source": [
"first pass measure variables and then id-variable"
],
"metadata": {}
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m8×3 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Int64 \u001b[0m\n─────┼────────────────────────\n 1 │ 1 M1 11\n 2 │ 2 M1 12\n 3 │ 3 M1 13\n 4 │ 4 M1 14\n 5 │ 1 M2 111\n 6 │ 2 M2 112\n 7 │ 3 M2 113\n 8 │ 4 M2 114",
"text/html": [
"1 | 1 | M1 | 11 |
2 | 2 | M1 | 12 |
3 | 3 | M1 | 13 |
4 | 4 | M1 | 14 |
5 | 1 | M2 | 111 |
6 | 2 | M2 | 112 |
7 | 3 | M2 | 113 |
8 | 4 | M2 | 114 |
"
]
},
"metadata": {},
"execution_count": 3
}
],
"cell_type": "code",
"source": [
"stack(x, [:M1, :M2], :id)"
],
"metadata": {},
"execution_count": 3
},
{
"cell_type": "markdown",
"source": [
"add `view=true` keyword argument to make a view; in that case columns of the resulting data frame share memory with columns of the source data frame, so the operation is potentially unsafe.\n",
"Optionally, you can rename columns."
],
"metadata": {}
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m8×3 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m key \u001b[0m\u001b[1m observed \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Int64 \u001b[0m\n─────┼─────────────────────────\n 1 │ 1 M1 11\n 2 │ 2 M1 12\n 3 │ 3 M1 13\n 4 │ 4 M1 14\n 5 │ 1 M2 111\n 6 │ 2 M2 112\n 7 │ 3 M2 113\n 8 │ 4 M2 114",
"text/html": [
"1 | 1 | M1 | 11 |
2 | 2 | M1 | 12 |
3 | 3 | M1 | 13 |
4 | 4 | M1 | 14 |
5 | 1 | M2 | 111 |
6 | 2 | M2 | 112 |
7 | 3 | M2 | 113 |
8 | 4 | M2 | 114 |
"
]
},
"metadata": {},
"execution_count": 4
}
],
"cell_type": "code",
"source": [
"stack(x, [\"M1\", \"M2\"], \"id\", variable_name=\"key\", value_name=\"observed\")"
],
"metadata": {},
"execution_count": 4
},
{
"cell_type": "markdown",
"source": [
"if second argument is omitted in `stack` , all other columns are assumed to be the id-variables"
],
"metadata": {}
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m8×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Int64 \u001b[0m\n─────┼───────────────────────────────\n 1 │ 1 1 M1 11\n 2 │ 2 1 M1 12\n 3 │ 3 2 M1 13\n 4 │ 4 2 M1 14\n 5 │ 1 1 M2 111\n 6 │ 2 1 M2 112\n 7 │ 3 2 M2 113\n 8 │ 4 2 M2 114",
"text/html": [
"1 | 1 | 1 | M1 | 11 |
2 | 2 | 1 | M1 | 12 |
3 | 3 | 2 | M1 | 13 |
4 | 4 | 2 | M1 | 14 |
5 | 1 | 1 | M2 | 111 |
6 | 2 | 1 | M2 | 112 |
7 | 3 | 2 | M2 | 113 |
8 | 4 | 2 | M2 | 114 |
"
]
},
"metadata": {},
"execution_count": 5
}
],
"cell_type": "code",
"source": [
"stack(x, Not([:id, :id2]))"
],
"metadata": {},
"execution_count": 5
},
{
"cell_type": "markdown",
"source": [
"you can use index instead of symbol"
],
"metadata": {}
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m8×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Int64 \u001b[0m\n─────┼───────────────────────────────\n 1 │ 1 1 M1 11\n 2 │ 2 1 M1 12\n 3 │ 3 2 M1 13\n 4 │ 4 2 M1 14\n 5 │ 1 1 M2 111\n 6 │ 2 1 M2 112\n 7 │ 3 2 M2 113\n 8 │ 4 2 M2 114",
"text/html": [
"1 | 1 | 1 | M1 | 11 |
2 | 2 | 1 | M1 | 12 |
3 | 3 | 2 | M1 | 13 |
4 | 4 | 2 | M1 | 14 |
5 | 1 | 1 | M2 | 111 |
6 | 2 | 1 | M2 | 112 |
7 | 3 | 2 | M2 | 113 |
8 | 4 | 2 | M2 | 114 |
"
]
},
"metadata": {},
"execution_count": 6
}
],
"cell_type": "code",
"source": [
"stack(x, Not([1, 2]))"
],
"metadata": {},
"execution_count": 6
},
{
"cell_type": "markdown",
"source": [
"if `stack` is not passed any measure variables by default numeric variables are selected as measures"
],
"metadata": {}
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m6×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Char \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Float64 \u001b[0m\n─────┼──────────────────────────────────\n 1 │ 1 a a1 0.645389\n 2 │ 1 b a1 0.0893361\n 3 │ 1 c a1 0.140096\n 4 │ 1 a a2 0.241577\n 5 │ 1 b a2 0.0454197\n 6 │ 1 c a2 0.879643",
"text/html": [
"1 | 1 | a | a1 | 0.645389 |
2 | 1 | b | a1 | 0.0893361 |
3 | 1 | c | a1 | 0.140096 |
4 | 1 | a | a2 | 0.241577 |
5 | 1 | b | a2 | 0.0454197 |
6 | 1 | c | a2 | 0.879643 |
"
]
},
"metadata": {},
"execution_count": 7
}
],
"cell_type": "code",
"source": [
"x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'], a1=rand(3), a2=rand(3))\n",
"stack(x)"
],
"metadata": {},
"execution_count": 7
},
{
"cell_type": "markdown",
"source": [
"here all columns are treated as measures:"
],
"metadata": {}
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m6×2 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m String \u001b[0m\u001b[90m Float64 \u001b[0m\n─────┼────────────────────\n 1 │ x1 0.117252\n 2 │ x1 0.347188\n 3 │ x1 0.821351\n 4 │ x2 0.694786\n 5 │ x2 0.686708\n 6 │ x2 0.1938",
"text/html": [
"1 | x1 | 0.117252 |
2 | x1 | 0.347188 |
3 | x1 | 0.821351 |
4 | x2 | 0.694786 |
5 | x2 | 0.686708 |
6 | x2 | 0.1938 |
"
]
},
"metadata": {},
"execution_count": 8
}
],
"cell_type": "code",
"source": [
"stack(DataFrame(rand(3, 2), :auto))"
],
"metadata": {},
"execution_count": 8
},
{
"cell_type": "markdown",
"source": [
"duplicates in key are silently accepted"
],
"metadata": {}
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m6×3 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m key \u001b[0m\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Float64 \u001b[0m\n─────┼────────────────────────────\n 1 │ 1 x1 0.793118\n 2 │ 1 x1 0.0725136\n 3 │ 1 x1 0.546648\n 4 │ 1 x2 0.938334\n 5 │ 1 x2 0.727929\n 6 │ 1 x2 0.603351",
"text/html": [
"1 | 1 | x1 | 0.793118 |
2 | 1 | x1 | 0.0725136 |
3 | 1 | x1 | 0.546648 |
4 | 1 | x2 | 0.938334 |
5 | 1 | x2 | 0.727929 |
6 | 1 | x2 | 0.603351 |
"
]
},
"metadata": {},
"execution_count": 9
}
],
"cell_type": "code",
"source": [
"df = DataFrame(rand(3, 2), :auto)\n",
"df.key = [1, 1, 1]\n",
"mdf = stack(df)"
],
"metadata": {},
"execution_count": 9
},
{
"cell_type": "markdown",
"source": [
"## Long to wide"
],
"metadata": {}
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m3×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m a1 \u001b[0m\u001b[1m a2 \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Char \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\n─────┼─────────────────────────────────\n 1 │ 1 a 0.579926 0.457018\n 2 │ 1 b 0.820986 0.797507\n 3 │ 1 c 0.202164 0.67697",
"text/html": [
"1 | 1 | a | 0.579926 | 0.457018 |
2 | 1 | b | 0.820986 | 0.797507 |
3 | 1 | c | 0.202164 | 0.67697 |
"
]
},
"metadata": {},
"execution_count": 10
}
],
"cell_type": "code",
"source": [
"x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'], a1=rand(3), a2=rand(3))"
],
"metadata": {},
"execution_count": 10
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m6×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Char \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Float64 \u001b[0m\n─────┼─────────────────────────────────\n 1 │ 1 a a1 0.579926\n 2 │ 1 b a1 0.820986\n 3 │ 1 c a1 0.202164\n 4 │ 1 a a2 0.457018\n 5 │ 1 b a2 0.797507\n 6 │ 1 c a2 0.67697",
"text/html": [
"1 | 1 | a | a1 | 0.579926 |
2 | 1 | b | a1 | 0.820986 |
3 | 1 | c | a1 | 0.202164 |
4 | 1 | a | a2 | 0.457018 |
5 | 1 | b | a2 | 0.797507 |
6 | 1 | c | a2 | 0.67697 |
"
]
},
"metadata": {},
"execution_count": 11
}
],
"cell_type": "code",
"source": [
"y = stack(x)"
],
"metadata": {},
"execution_count": 11
},
{
"cell_type": "markdown",
"source": [
"standard unstack with a specified key"
],
"metadata": {}
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m3×3 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id2 \u001b[0m\u001b[1m a1 \u001b[0m\u001b[1m a2 \u001b[0m\n │\u001b[90m Char \u001b[0m\u001b[90m Float64? \u001b[0m\u001b[90m Float64? \u001b[0m\n─────┼──────────────────────────\n 1 │ a 0.579926 0.457018\n 2 │ b 0.820986 0.797507\n 3 │ c 0.202164 0.67697",
"text/html": [
"1 | a | 0.579926 | 0.457018 |
2 | b | 0.820986 | 0.797507 |
3 | c | 0.202164 | 0.67697 |
"
]
},
"metadata": {},
"execution_count": 12
}
],
"cell_type": "code",
"source": [
"unstack(y, :id2, :variable, :value)"
],
"metadata": {},
"execution_count": 12
},
{
"cell_type": "markdown",
"source": [
"all other columns are treated as keys"
],
"metadata": {}
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m3×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m a1 \u001b[0m\u001b[1m a2 \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Char \u001b[0m\u001b[90m Float64? \u001b[0m\u001b[90m Float64? \u001b[0m\n─────┼─────────────────────────────────\n 1 │ 1 a 0.579926 0.457018\n 2 │ 1 b 0.820986 0.797507\n 3 │ 1 c 0.202164 0.67697",
"text/html": [
"1 | 1 | a | 0.579926 | 0.457018 |
2 | 1 | b | 0.820986 | 0.797507 |
3 | 1 | c | 0.202164 | 0.67697 |
"
]
},
"metadata": {},
"execution_count": 13
}
],
"cell_type": "code",
"source": [
"unstack(y, :variable, :value)"
],
"metadata": {},
"execution_count": 13
},
{
"cell_type": "markdown",
"source": [
"all columns other than named :variable and :value are treated as keys"
],
"metadata": {}
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m3×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m a1 \u001b[0m\u001b[1m a2 \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Char \u001b[0m\u001b[90m Float64? \u001b[0m\u001b[90m Float64? \u001b[0m\n─────┼─────────────────────────────────\n 1 │ 1 a 0.579926 0.457018\n 2 │ 1 b 0.820986 0.797507\n 3 │ 1 c 0.202164 0.67697",
"text/html": [
"1 | 1 | a | 0.579926 | 0.457018 |
2 | 1 | b | 0.820986 | 0.797507 |
3 | 1 | c | 0.202164 | 0.67697 |
"
]
},
"metadata": {},
"execution_count": 14
}
],
"cell_type": "code",
"source": [
"unstack(y)"
],
"metadata": {},
"execution_count": 14
},
{
"cell_type": "markdown",
"source": [
"you can rename the unstacked columns"
],
"metadata": {}
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m3×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m unstacked_a1 \u001b[0m\u001b[1m unstacked_a2 \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Char \u001b[0m\u001b[90m Float64? \u001b[0m\u001b[90m Float64? \u001b[0m\n─────┼─────────────────────────────────────────\n 1 │ 1 a 0.579926 0.457018\n 2 │ 1 b 0.820986 0.797507\n 3 │ 1 c 0.202164 0.67697",
"text/html": [
"1 | 1 | a | 0.579926 | 0.457018 |
2 | 1 | b | 0.820986 | 0.797507 |
3 | 1 | c | 0.202164 | 0.67697 |
"
]
},
"metadata": {},
"execution_count": 15
}
],
"cell_type": "code",
"source": [
"unstack(y, renamecols=n -> string(\"unstacked_\", n))"
],
"metadata": {},
"execution_count": 15
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m6×2 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m String \u001b[0m\u001b[90m Float64 \u001b[0m\n─────┼─────────────────────\n 1 │ x1 0.0932271\n 2 │ x1 0.210632\n 3 │ x1 0.461863\n 4 │ x2 0.681666\n 5 │ x2 0.104251\n 6 │ x2 0.0205399",
"text/html": [
"1 | x1 | 0.0932271 |
2 | x1 | 0.210632 |
3 | x1 | 0.461863 |
4 | x2 | 0.681666 |
5 | x2 | 0.104251 |
6 | x2 | 0.0205399 |
"
]
},
"metadata": {},
"execution_count": 16
}
],
"cell_type": "code",
"source": [
"df = stack(DataFrame(rand(3, 2), :auto))"
],
"metadata": {},
"execution_count": 16
},
{
"cell_type": "markdown",
"source": [
"unable to unstack when no key column is present"
],
"metadata": {}
},
{
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ArgumentError(\"Duplicate entries in unstack at row 2 for key () and variable x1. Pass `combine` keyword argument to specify how they should be handled.\")"
]
}
],
"cell_type": "code",
"source": [
"try\n",
" unstack(df, :variable, :value)\n",
"catch e\n",
" show(e)\n",
"end"
],
"metadata": {},
"execution_count": 17
},
{
"cell_type": "markdown",
"source": [
"`unstack` fills missing combinations with `missing`, but you can change this default with `fill` keyword argument."
],
"metadata": {}
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m3×3 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m key \u001b[0m\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Int64 \u001b[0m\n─────┼────────────────────────\n 1 │ 1 a 1\n 2 │ 1 b 2\n 3 │ 2 a 3",
"text/html": [
""
]
},
"metadata": {},
"execution_count": 18
}
],
"cell_type": "code",
"source": [
"df = DataFrame(key=[1, 1, 2], variable=[\"a\", \"b\", \"a\"], value=1:3)"
],
"metadata": {},
"execution_count": 18
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m2×3 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m key \u001b[0m\u001b[1m a \u001b[0m\u001b[1m b \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Int64? \u001b[0m\u001b[90m Int64? \u001b[0m\n─────┼────────────────────────\n 1 │ 1 1 2\n 2 │ 2 3 \u001b[90m missing \u001b[0m",
"text/html": [
""
]
},
"metadata": {},
"execution_count": 19
}
],
"cell_type": "code",
"source": [
"unstack(df, :variable, :value)"
],
"metadata": {},
"execution_count": 19
},
{
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "\u001b[1m2×3 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m key \u001b[0m\u001b[1m a \u001b[0m\u001b[1m b \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\n─────┼─────────────────────\n 1 │ 1 1 2\n 2 │ 2 3 0",
"text/html": [
""
]
},
"metadata": {},
"execution_count": 20
}
],
"cell_type": "code",
"source": [
"unstack(df, :variable, :value, fill=0)"
],
"metadata": {},
"execution_count": 20
},
{
"cell_type": "markdown",
"source": [
"---\n",
"\n",
"*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*"
],
"metadata": {}
}
],
"nbformat_minor": 3,
"metadata": {
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.10.5"
},
"kernelspec": {
"name": "julia-1.10",
"display_name": "Julia 1.10.5",
"language": "julia"
}
},
"nbformat": 4
}