{ "cells": [ { "cell_type": "markdown", "source": [ "# Reshaping DataFrames" ], "metadata": {} }, { "outputs": [], "cell_type": "code", "source": [ "using DataFrames" ], "metadata": {}, "execution_count": 1 }, { "cell_type": "markdown", "source": [ "## Wide to long" ], "metadata": {} }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m4×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m M1 \u001b[0m\u001b[1m M2 \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\n─────┼────────────────────────────\n 1 │ 1 1 11 111\n 2 │ 2 1 12 112\n 3 │ 3 2 13 113\n 4 │ 4 2 14 114", "text/html": [ "
4×4 DataFrame
Rowidid2M1M2
Int64Int64Int64Int64
11111111
22112112
33213113
44214114
" ] }, "metadata": {}, "execution_count": 2 } ], "cell_type": "code", "source": [ "x = DataFrame(id=[1, 2, 3, 4], id2=[1, 1, 2, 2], M1=[11, 12, 13, 14], M2=[111, 112, 113, 114])" ], "metadata": {}, "execution_count": 2 }, { "cell_type": "markdown", "source": [ "first pass measure variables and then id-variable" ], "metadata": {} }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m8×3 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Int64 \u001b[0m\n─────┼────────────────────────\n 1 │ 1 M1 11\n 2 │ 2 M1 12\n 3 │ 3 M1 13\n 4 │ 4 M1 14\n 5 │ 1 M2 111\n 6 │ 2 M2 112\n 7 │ 3 M2 113\n 8 │ 4 M2 114", "text/html": [ "
8×3 DataFrame
Rowidvariablevalue
Int64StringInt64
11M111
22M112
33M113
44M114
51M2111
62M2112
73M2113
84M2114
" ] }, "metadata": {}, "execution_count": 3 } ], "cell_type": "code", "source": [ "stack(x, [:M1, :M2], :id)" ], "metadata": {}, "execution_count": 3 }, { "cell_type": "markdown", "source": [ "add `view=true` keyword argument to make a view; in that case columns of the resulting data frame share memory with columns of the source data frame, so the operation is potentially unsafe.\n", "Optionally, you can rename columns." ], "metadata": {} }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m8×3 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m key \u001b[0m\u001b[1m observed \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Int64 \u001b[0m\n─────┼─────────────────────────\n 1 │ 1 M1 11\n 2 │ 2 M1 12\n 3 │ 3 M1 13\n 4 │ 4 M1 14\n 5 │ 1 M2 111\n 6 │ 2 M2 112\n 7 │ 3 M2 113\n 8 │ 4 M2 114", "text/html": [ "
8×3 DataFrame
Rowidkeyobserved
Int64StringInt64
11M111
22M112
33M113
44M114
51M2111
62M2112
73M2113
84M2114
" ] }, "metadata": {}, "execution_count": 4 } ], "cell_type": "code", "source": [ "stack(x, [\"M1\", \"M2\"], \"id\", variable_name=\"key\", value_name=\"observed\")" ], "metadata": {}, "execution_count": 4 }, { "cell_type": "markdown", "source": [ "if second argument is omitted in `stack` , all other columns are assumed to be the id-variables" ], "metadata": {} }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m8×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Int64 \u001b[0m\n─────┼───────────────────────────────\n 1 │ 1 1 M1 11\n 2 │ 2 1 M1 12\n 3 │ 3 2 M1 13\n 4 │ 4 2 M1 14\n 5 │ 1 1 M2 111\n 6 │ 2 1 M2 112\n 7 │ 3 2 M2 113\n 8 │ 4 2 M2 114", "text/html": [ "
8×4 DataFrame
Rowidid2variablevalue
Int64Int64StringInt64
111M111
221M112
332M113
442M114
511M2111
621M2112
732M2113
842M2114
" ] }, "metadata": {}, "execution_count": 5 } ], "cell_type": "code", "source": [ "stack(x, Not([:id, :id2]))" ], "metadata": {}, "execution_count": 5 }, { "cell_type": "markdown", "source": [ "you can use index instead of symbol" ], "metadata": {} }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m8×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Int64 \u001b[0m\n─────┼───────────────────────────────\n 1 │ 1 1 M1 11\n 2 │ 2 1 M1 12\n 3 │ 3 2 M1 13\n 4 │ 4 2 M1 14\n 5 │ 1 1 M2 111\n 6 │ 2 1 M2 112\n 7 │ 3 2 M2 113\n 8 │ 4 2 M2 114", "text/html": [ "
8×4 DataFrame
Rowidid2variablevalue
Int64Int64StringInt64
111M111
221M112
332M113
442M114
511M2111
621M2112
732M2113
842M2114
" ] }, "metadata": {}, "execution_count": 6 } ], "cell_type": "code", "source": [ "stack(x, Not([1, 2]))" ], "metadata": {}, "execution_count": 6 }, { "cell_type": "markdown", "source": [ "if `stack` is not passed any measure variables by default numeric variables are selected as measures" ], "metadata": {} }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m6×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Char \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Float64 \u001b[0m\n─────┼──────────────────────────────────\n 1 │ 1 a a1 0.645389\n 2 │ 1 b a1 0.0893361\n 3 │ 1 c a1 0.140096\n 4 │ 1 a a2 0.241577\n 5 │ 1 b a2 0.0454197\n 6 │ 1 c a2 0.879643", "text/html": [ "
6×4 DataFrame
Rowidid2variablevalue
Int64CharStringFloat64
11aa10.645389
21ba10.0893361
31ca10.140096
41aa20.241577
51ba20.0454197
61ca20.879643
" ] }, "metadata": {}, "execution_count": 7 } ], "cell_type": "code", "source": [ "x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'], a1=rand(3), a2=rand(3))\n", "stack(x)" ], "metadata": {}, "execution_count": 7 }, { "cell_type": "markdown", "source": [ "here all columns are treated as measures:" ], "metadata": {} }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m6×2 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m String \u001b[0m\u001b[90m Float64 \u001b[0m\n─────┼────────────────────\n 1 │ x1 0.117252\n 2 │ x1 0.347188\n 3 │ x1 0.821351\n 4 │ x2 0.694786\n 5 │ x2 0.686708\n 6 │ x2 0.1938", "text/html": [ "
6×2 DataFrame
Rowvariablevalue
StringFloat64
1x10.117252
2x10.347188
3x10.821351
4x20.694786
5x20.686708
6x20.1938
" ] }, "metadata": {}, "execution_count": 8 } ], "cell_type": "code", "source": [ "stack(DataFrame(rand(3, 2), :auto))" ], "metadata": {}, "execution_count": 8 }, { "cell_type": "markdown", "source": [ "duplicates in key are silently accepted" ], "metadata": {} }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m6×3 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m key \u001b[0m\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Float64 \u001b[0m\n─────┼────────────────────────────\n 1 │ 1 x1 0.793118\n 2 │ 1 x1 0.0725136\n 3 │ 1 x1 0.546648\n 4 │ 1 x2 0.938334\n 5 │ 1 x2 0.727929\n 6 │ 1 x2 0.603351", "text/html": [ "
6×3 DataFrame
Rowkeyvariablevalue
Int64StringFloat64
11x10.793118
21x10.0725136
31x10.546648
41x20.938334
51x20.727929
61x20.603351
" ] }, "metadata": {}, "execution_count": 9 } ], "cell_type": "code", "source": [ "df = DataFrame(rand(3, 2), :auto)\n", "df.key = [1, 1, 1]\n", "mdf = stack(df)" ], "metadata": {}, "execution_count": 9 }, { "cell_type": "markdown", "source": [ "## Long to wide" ], "metadata": {} }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m3×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m a1 \u001b[0m\u001b[1m a2 \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Char \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\n─────┼─────────────────────────────────\n 1 │ 1 a 0.579926 0.457018\n 2 │ 1 b 0.820986 0.797507\n 3 │ 1 c 0.202164 0.67697", "text/html": [ "
3×4 DataFrame
Rowidid2a1a2
Int64CharFloat64Float64
11a0.5799260.457018
21b0.8209860.797507
31c0.2021640.67697
" ] }, "metadata": {}, "execution_count": 10 } ], "cell_type": "code", "source": [ "x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'], a1=rand(3), a2=rand(3))" ], "metadata": {}, "execution_count": 10 }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m6×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Char \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Float64 \u001b[0m\n─────┼─────────────────────────────────\n 1 │ 1 a a1 0.579926\n 2 │ 1 b a1 0.820986\n 3 │ 1 c a1 0.202164\n 4 │ 1 a a2 0.457018\n 5 │ 1 b a2 0.797507\n 6 │ 1 c a2 0.67697", "text/html": [ "
6×4 DataFrame
Rowidid2variablevalue
Int64CharStringFloat64
11aa10.579926
21ba10.820986
31ca10.202164
41aa20.457018
51ba20.797507
61ca20.67697
" ] }, "metadata": {}, "execution_count": 11 } ], "cell_type": "code", "source": [ "y = stack(x)" ], "metadata": {}, "execution_count": 11 }, { "cell_type": "markdown", "source": [ "standard unstack with a specified key" ], "metadata": {} }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m3×3 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id2 \u001b[0m\u001b[1m a1 \u001b[0m\u001b[1m a2 \u001b[0m\n │\u001b[90m Char \u001b[0m\u001b[90m Float64? \u001b[0m\u001b[90m Float64? \u001b[0m\n─────┼──────────────────────────\n 1 │ a 0.579926 0.457018\n 2 │ b 0.820986 0.797507\n 3 │ c 0.202164 0.67697", "text/html": [ "
3×3 DataFrame
Rowid2a1a2
CharFloat64?Float64?
1a0.5799260.457018
2b0.8209860.797507
3c0.2021640.67697
" ] }, "metadata": {}, "execution_count": 12 } ], "cell_type": "code", "source": [ "unstack(y, :id2, :variable, :value)" ], "metadata": {}, "execution_count": 12 }, { "cell_type": "markdown", "source": [ "all other columns are treated as keys" ], "metadata": {} }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m3×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m a1 \u001b[0m\u001b[1m a2 \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Char \u001b[0m\u001b[90m Float64? \u001b[0m\u001b[90m Float64? \u001b[0m\n─────┼─────────────────────────────────\n 1 │ 1 a 0.579926 0.457018\n 2 │ 1 b 0.820986 0.797507\n 3 │ 1 c 0.202164 0.67697", "text/html": [ "
3×4 DataFrame
Rowidid2a1a2
Int64CharFloat64?Float64?
11a0.5799260.457018
21b0.8209860.797507
31c0.2021640.67697
" ] }, "metadata": {}, "execution_count": 13 } ], "cell_type": "code", "source": [ "unstack(y, :variable, :value)" ], "metadata": {}, "execution_count": 13 }, { "cell_type": "markdown", "source": [ "all columns other than named :variable and :value are treated as keys" ], "metadata": {} }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m3×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m a1 \u001b[0m\u001b[1m a2 \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Char \u001b[0m\u001b[90m Float64? \u001b[0m\u001b[90m Float64? \u001b[0m\n─────┼─────────────────────────────────\n 1 │ 1 a 0.579926 0.457018\n 2 │ 1 b 0.820986 0.797507\n 3 │ 1 c 0.202164 0.67697", "text/html": [ "
3×4 DataFrame
Rowidid2a1a2
Int64CharFloat64?Float64?
11a0.5799260.457018
21b0.8209860.797507
31c0.2021640.67697
" ] }, "metadata": {}, "execution_count": 14 } ], "cell_type": "code", "source": [ "unstack(y)" ], "metadata": {}, "execution_count": 14 }, { "cell_type": "markdown", "source": [ "you can rename the unstacked columns" ], "metadata": {} }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m3×4 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m id \u001b[0m\u001b[1m id2 \u001b[0m\u001b[1m unstacked_a1 \u001b[0m\u001b[1m unstacked_a2 \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Char \u001b[0m\u001b[90m Float64? \u001b[0m\u001b[90m Float64? \u001b[0m\n─────┼─────────────────────────────────────────\n 1 │ 1 a 0.579926 0.457018\n 2 │ 1 b 0.820986 0.797507\n 3 │ 1 c 0.202164 0.67697", "text/html": [ "
3×4 DataFrame
Rowidid2unstacked_a1unstacked_a2
Int64CharFloat64?Float64?
11a0.5799260.457018
21b0.8209860.797507
31c0.2021640.67697
" ] }, "metadata": {}, "execution_count": 15 } ], "cell_type": "code", "source": [ "unstack(y, renamecols=n -> string(\"unstacked_\", n))" ], "metadata": {}, "execution_count": 15 }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m6×2 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m String \u001b[0m\u001b[90m Float64 \u001b[0m\n─────┼─────────────────────\n 1 │ x1 0.0932271\n 2 │ x1 0.210632\n 3 │ x1 0.461863\n 4 │ x2 0.681666\n 5 │ x2 0.104251\n 6 │ x2 0.0205399", "text/html": [ "
6×2 DataFrame
Rowvariablevalue
StringFloat64
1x10.0932271
2x10.210632
3x10.461863
4x20.681666
5x20.104251
6x20.0205399
" ] }, "metadata": {}, "execution_count": 16 } ], "cell_type": "code", "source": [ "df = stack(DataFrame(rand(3, 2), :auto))" ], "metadata": {}, "execution_count": 16 }, { "cell_type": "markdown", "source": [ "unable to unstack when no key column is present" ], "metadata": {} }, { "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ArgumentError(\"Duplicate entries in unstack at row 2 for key () and variable x1. Pass `combine` keyword argument to specify how they should be handled.\")" ] } ], "cell_type": "code", "source": [ "try\n", " unstack(df, :variable, :value)\n", "catch e\n", " show(e)\n", "end" ], "metadata": {}, "execution_count": 17 }, { "cell_type": "markdown", "source": [ "`unstack` fills missing combinations with `missing`, but you can change this default with `fill` keyword argument." ], "metadata": {} }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m3×3 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m key \u001b[0m\u001b[1m variable \u001b[0m\u001b[1m value \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Int64 \u001b[0m\n─────┼────────────────────────\n 1 │ 1 a 1\n 2 │ 1 b 2\n 3 │ 2 a 3", "text/html": [ "
3×3 DataFrame
Rowkeyvariablevalue
Int64StringInt64
11a1
21b2
32a3
" ] }, "metadata": {}, "execution_count": 18 } ], "cell_type": "code", "source": [ "df = DataFrame(key=[1, 1, 2], variable=[\"a\", \"b\", \"a\"], value=1:3)" ], "metadata": {}, "execution_count": 18 }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m2×3 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m key \u001b[0m\u001b[1m a \u001b[0m\u001b[1m b \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Int64? \u001b[0m\u001b[90m Int64? \u001b[0m\n─────┼────────────────────────\n 1 │ 1 1 2\n 2 │ 2 3 \u001b[90m missing \u001b[0m", "text/html": [ "
2×3 DataFrame
Rowkeyab
Int64Int64?Int64?
1112
223missing
" ] }, "metadata": {}, "execution_count": 19 } ], "cell_type": "code", "source": [ "unstack(df, :variable, :value)" ], "metadata": {}, "execution_count": 19 }, { "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "\u001b[1m2×3 DataFrame\u001b[0m\n\u001b[1m Row \u001b[0m│\u001b[1m key \u001b[0m\u001b[1m a \u001b[0m\u001b[1m b \u001b[0m\n │\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\n─────┼─────────────────────\n 1 │ 1 1 2\n 2 │ 2 3 0", "text/html": [ "
2×3 DataFrame
Rowkeyab
Int64Int64Int64
1112
2230
" ] }, "metadata": {}, "execution_count": 20 } ], "cell_type": "code", "source": [ "unstack(df, :variable, :value, fill=0)" ], "metadata": {}, "execution_count": 20 }, { "cell_type": "markdown", "source": [ "---\n", "\n", "*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*" ], "metadata": {} } ], "nbformat_minor": 3, "metadata": { "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.10.5" }, "kernelspec": { "name": "julia-1.10", "display_name": "Julia 1.10.5", "language": "julia" } }, "nbformat": 4 }