Manipulating rows of DataFrame#

Selecting rows#

using DataFrames
using Statistics
using Random
Random.seed!(1);
df = DataFrame(rand(4, 5), :auto)
4×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.04917180.6918570.8403840.1985210.802561
20.1190790.7675180.890770.008197860.661425
30.3932710.0872530.1382270.5920410.347513
40.02409430.8557180.3477370.8010550.778149

using : as row selector will copy columns

df[:, :]
4×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.04917180.6918570.8403840.1985210.802561
20.1190790.7675180.890770.008197860.661425
30.3932710.0872530.1382270.5920410.347513
40.02409430.8557180.3477370.8010550.778149

this is the same as

copy(df)
4×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.04917180.6918570.8403840.1985210.802561
20.1190790.7675180.890770.008197860.661425
30.3932710.0872530.1382270.5920410.347513
40.02409430.8557180.3477370.8010550.778149

you can get a subset of rows of a data frame without copying using view to get a SubDataFrame

sdf = view(df, 1:3, 1:3)
3×3 SubDataFrame
Rowx1x2x3
Float64Float64Float64
10.04917180.6918570.840384
20.1190790.7675180.89077
30.3932710.0872530.138227

you still have a detailed reference to the parent

parent(sdf), parentindices(sdf)
(4×5 DataFrame
 Row  x1         x2        x3        x4          x5       
     │ Float64    Float64   Float64   Float64     Float64  
─────┼─────────────────────────────────────────────────────
   1 │ 0.0491718  0.691857  0.840384  0.198521    0.802561
   2 │ 0.119079   0.767518  0.89077   0.00819786  0.661425
   3 │ 0.393271   0.087253  0.138227  0.592041    0.347513
   4 │ 0.0240943  0.855718  0.347737  0.801055    0.778149, (1:3, 1:3))

selecting a single row returns a DataFrameRow object which is also a view

dfr = df[3, :]
DataFrameRow (5 columns)
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
30.3932710.0872530.1382270.5920410.347513
parent(dfr), parentindices(dfr), rownumber(dfr)
(4×5 DataFrame
 Row  x1         x2        x3        x4          x5       
     │ Float64    Float64   Float64   Float64     Float64  
─────┼─────────────────────────────────────────────────────
   1 │ 0.0491718  0.691857  0.840384  0.198521    0.802561
   2 │ 0.119079   0.767518  0.89077   0.00819786  0.661425
   3 │ 0.393271   0.087253  0.138227  0.592041    0.347513
   4 │ 0.0240943  0.855718  0.347737  0.801055    0.778149, (3, Base.OneTo(5)), 3)

let us add a column to a data frame by assigning a scalar broadcasting

df[!, :Z] .= 1
4-element Vector{Int64}:
 1
 1
 1
 1
df
4×6 DataFrame
Rowx1x2x3x4x5Z
Float64Float64Float64Float64Float64Int64
10.04917180.6918570.8403840.1985210.8025611
20.1190790.7675180.890770.008197860.6614251
30.3932710.0872530.1382270.5920410.3475131
40.02409430.8557180.3477370.8010550.7781491

Earlier we used : for column selection in a view (SubDataFrame and DataFrameRow). In this case a view will have all columns of the parent after the parent is mutated.

dfr
DataFrameRow (6 columns)
Rowx1x2x3x4x5Z
Float64Float64Float64Float64Float64Int64
30.3932710.0872530.1382270.5920410.3475131
parent(dfr), parentindices(dfr), rownumber(dfr)
(4×6 DataFrame
 Row  x1         x2        x3        x4          x5        Z     
     │ Float64    Float64   Float64   Float64     Float64   Int64 
─────┼────────────────────────────────────────────────────────────
   1 │ 0.0491718  0.691857  0.840384  0.198521    0.802561      1
   2 │ 0.119079   0.767518  0.89077   0.00819786  0.661425      1
   3 │ 0.393271   0.087253  0.138227  0.592041    0.347513      1
   4 │ 0.0240943  0.855718  0.347737  0.801055    0.778149      1, (3, Base.OneTo(6)), 3)

Note that parent and parentindices refer to the true source of data for a DataFrameRow and rownumber refers to row number in the direct object that was used to create DataFrameRow

df = DataFrame(a=1:4)
4×1 DataFrame
Rowa
Int64
11
22
33
44
dfv = view(df, [3, 2], :)
2×1 SubDataFrame
Rowa
Int64
13
22
dfr = dfv[1, :]
DataFrameRow (1 columns)
Rowa
Int64
33
parent(dfr), parentindices(dfr), rownumber(dfr)
(4×1 DataFrame
 Row  a     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4, (3, Base.OneTo(1)), 1)

Reordering rows#

We create some random data frame (and hope that x.x is not sorted :), which is quite likely with 12 rows)

x = DataFrame(id=1:12, x=rand(12), y=[zeros(6); ones(6)])
12×3 DataFrame
Rowidxy
Int64Float64Float64
110.8303340.0
220.5731320.0
330.1766250.0
440.1149350.0
550.78640.0
660.8925980.0
770.4520151.0
880.2068731.0
990.2865821.0
10100.9189161.0
11110.9910711.0
12120.7968311.0

check if a DataFrame or a subset of its columns is sorted

issorted(x), issorted(x, :x)
(true, false)

we sort x in place

sort!(x, :x)
12×3 DataFrame
Rowidxy
Int64Float64Float64
140.1149350.0
230.1766250.0
380.2068731.0
490.2865821.0
570.4520151.0
620.5731320.0
750.78640.0
8120.7968311.0
910.8303340.0
1060.8925980.0
11100.9189161.0
12110.9910711.0

now we create a new DataFrame

y = sort(x, :id)
12×3 DataFrame
Rowidxy
Int64Float64Float64
110.8303340.0
220.5731320.0
330.1766250.0
440.1149350.0
550.78640.0
660.8925980.0
770.4520151.0
880.2068731.0
990.2865821.0
10100.9189161.0
11110.9910711.0
12120.7968311.0

here we sort by two columns, first is decreasing, second is increasing

sort(x, [:y, :x], rev=[true, false])
12×3 DataFrame
Rowidxy
Int64Float64Float64
180.2068731.0
290.2865821.0
370.4520151.0
4120.7968311.0
5100.9189161.0
6110.9910711.0
740.1149350.0
830.1766250.0
920.5731320.0
1050.78640.0
1110.8303340.0
1260.8925980.0
sort(x, [order(:y, rev=true), :x]) ## the same as above
12×3 DataFrame
Rowidxy
Int64Float64Float64
180.2068731.0
290.2865821.0
370.4520151.0
4120.7968311.0
5100.9189161.0
6110.9910711.0
740.1149350.0
830.1766250.0
920.5731320.0
1050.78640.0
1110.8303340.0
1260.8925980.0

this is how you can shuffle rows

x[shuffle(1:10), :]
10×3 DataFrame
Rowidxy
Int64Float64Float64
180.2068731.0
2120.7968311.0
320.5731320.0
410.8303340.0
550.78640.0
690.2865821.0
760.8925980.0
840.1149350.0
930.1766250.0
1070.4520151.0

it is also easy to swap rows using broadcasted assignment

sort!(x, :id)
x[[1, 10], :] .= x[[10, 1], :]
x
12×3 DataFrame
Rowidxy
Int64Float64Float64
1100.9189161.0
220.5731320.0
330.1766250.0
440.1149350.0
550.78640.0
660.8925980.0
770.4520151.0
880.2068731.0
990.2865821.0
1010.8303340.0
11110.9910711.0
12120.7968311.0

Merging/adding rows#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.02944980.2183660.3384020.1408550.306016
20.2714360.529310.05261950.40.843511
30.323890.386240.1888940.3219680.896884

merge by rows - data frames must have the same column names; the same is vcat

[x; x]
6×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.02944980.2183660.3384020.1408550.306016
20.2714360.529310.05261950.40.843511
30.323890.386240.1888940.3219680.896884
40.02944980.2183660.3384020.1408550.306016
50.2714360.529310.05261950.40.843511
60.323890.386240.1888940.3219680.896884

you can efficiently vcat a vector of DataFrames using reduce

reduce(vcat, [x, x, x])
9×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.02944980.2183660.3384020.1408550.306016
20.2714360.529310.05261950.40.843511
30.323890.386240.1888940.3219680.896884
40.02944980.2183660.3384020.1408550.306016
50.2714360.529310.05261950.40.843511
60.323890.386240.1888940.3219680.896884
70.02944980.2183660.3384020.1408550.306016
80.2714360.529310.05261950.40.843511
90.323890.386240.1888940.3219680.896884

get y with other order of names

y = x[:, reverse(names(x))]
3×5 DataFrame
Rowx5x4x3x2x1
Float64Float64Float64Float64Float64
10.3060160.1408550.3384020.2183660.0294498
20.8435110.40.05261950.529310.271436
30.8968840.3219680.1888940.386240.32389

vcat is still possible as it does column name matching

vcat(x, y)
6×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.02944980.2183660.3384020.1408550.306016
20.2714360.529310.05261950.40.843511
30.323890.386240.1888940.3219680.896884
40.02944980.2183660.3384020.1408550.306016
50.2714360.529310.05261950.40.843511
60.323890.386240.1888940.3219680.896884

but column names must still match

try
    vcat(x, y[:, 1:3])
catch e
    show(e)
end
ArgumentError("column(s) x1 and x2 are missing from argument(s) 2")

unless you pass :intersect, :union or specific column names as keyword argument cols

vcat(x, y[:, 1:3], cols=:intersect)
6×3 DataFrame
Rowx3x4x5
Float64Float64Float64
10.3384020.1408550.306016
20.05261950.40.843511
30.1888940.3219680.896884
40.3384020.1408550.306016
50.05261950.40.843511
60.1888940.3219680.896884
vcat(x, y[:, 1:3], cols=:union)
6×5 DataFrame
Rowx1x2x3x4x5
Float64?Float64?Float64Float64Float64
10.02944980.2183660.3384020.1408550.306016
20.2714360.529310.05261950.40.843511
30.323890.386240.1888940.3219680.896884
4missingmissing0.3384020.1408550.306016
5missingmissing0.05261950.40.843511
6missingmissing0.1888940.3219680.896884
vcat(x, y[:, 1:3], cols=[:x1, :x5])
6×2 DataFrame
Rowx1x5
Float64?Float64
10.02944980.306016
20.2714360.843511
30.323890.896884
4missing0.306016
5missing0.843511
6missing0.896884

append! modifies x in place

append!(x, x)
6×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.02944980.2183660.3384020.1408550.306016
20.2714360.529310.05261950.40.843511
30.323890.386240.1888940.3219680.896884
40.02944980.2183660.3384020.1408550.306016
50.2714360.529310.05261950.40.843511
60.323890.386240.1888940.3219680.896884

here column names must match exactly unless cols keyword argument is passed

append!(x, y)
9×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.02944980.2183660.3384020.1408550.306016
20.2714360.529310.05261950.40.843511
30.323890.386240.1888940.3219680.896884
40.02944980.2183660.3384020.1408550.306016
50.2714360.529310.05261950.40.843511
60.323890.386240.1888940.3219680.896884
70.02944980.2183660.3384020.1408550.306016
80.2714360.529310.05261950.40.843511
90.323890.386240.1888940.3219680.896884

standard repeat function works on rows; also inner and outer keyword arguments are accepted

repeat(x, 2)
18×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.02944980.2183660.3384020.1408550.306016
20.2714360.529310.05261950.40.843511
30.323890.386240.1888940.3219680.896884
40.02944980.2183660.3384020.1408550.306016
50.2714360.529310.05261950.40.843511
60.323890.386240.1888940.3219680.896884
70.02944980.2183660.3384020.1408550.306016
80.2714360.529310.05261950.40.843511
90.323890.386240.1888940.3219680.896884
100.02944980.2183660.3384020.1408550.306016
110.2714360.529310.05261950.40.843511
120.323890.386240.1888940.3219680.896884
130.02944980.2183660.3384020.1408550.306016
140.2714360.529310.05261950.40.843511
150.323890.386240.1888940.3219680.896884
160.02944980.2183660.3384020.1408550.306016
170.2714360.529310.05261950.40.843511
180.323890.386240.1888940.3219680.896884

push! adds one row to x at the end; one must pass a correct number of values unless cols keyword argument is passed

push!(x, 1:5)
x
10×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.02944980.2183660.3384020.1408550.306016
20.2714360.529310.05261950.40.843511
30.323890.386240.1888940.3219680.896884
40.02944980.2183660.3384020.1408550.306016
50.2714360.529310.05261950.40.843511
60.323890.386240.1888940.3219680.896884
70.02944980.2183660.3384020.1408550.306016
80.2714360.529310.05261950.40.843511
90.323890.386240.1888940.3219680.896884
101.02.03.04.05.0

push! also works with dictionaries

push!(x, Dict(:x1 => 11, :x2 => 12, :x3 => 13, :x4 => 14, :x5 => 15))
x
11×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.02944980.2183660.3384020.1408550.306016
20.2714360.529310.05261950.40.843511
30.323890.386240.1888940.3219680.896884
40.02944980.2183660.3384020.1408550.306016
50.2714360.529310.05261950.40.843511
60.323890.386240.1888940.3219680.896884
70.02944980.2183660.3384020.1408550.306016
80.2714360.529310.05261950.40.843511
90.323890.386240.1888940.3219680.896884
101.02.03.04.05.0
1111.012.013.014.015.0

and NamedTuples via name matching

push!(x, (x2=2, x1=1, x4=4, x3=3, x5=5))
12×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.02944980.2183660.3384020.1408550.306016
20.2714360.529310.05261950.40.843511
30.323890.386240.1888940.3219680.896884
40.02944980.2183660.3384020.1408550.306016
50.2714360.529310.05261950.40.843511
60.323890.386240.1888940.3219680.896884
70.02944980.2183660.3384020.1408550.306016
80.2714360.529310.05261950.40.843511
90.323890.386240.1888940.3219680.896884
101.02.03.04.05.0
1111.012.013.014.015.0
121.02.03.04.05.0

and DataFrameRow also via name matching

push!(x, x[1, :])
13×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.02944980.2183660.3384020.1408550.306016
20.2714360.529310.05261950.40.843511
30.323890.386240.1888940.3219680.896884
40.02944980.2183660.3384020.1408550.306016
50.2714360.529310.05261950.40.843511
60.323890.386240.1888940.3219680.896884
70.02944980.2183660.3384020.1408550.306016
80.2714360.529310.05261950.40.843511
90.323890.386240.1888940.3219680.896884
101.02.03.04.05.0
1111.012.013.014.015.0
121.02.03.04.05.0
130.02944980.2183660.3384020.1408550.306016

Please consult the documentation of push!, append! and vcat for allowed values of cols keyword argument. This keyword argument governs the way these functions perform column matching of passed arguments. Also append! and push! support a promote keyword argument that decides if column type promotion is allowed.

Let us here just give a quick example of how heterogeneous data can be stored in the data frame using these functionalities:

source = [(a=1, b=2), (a=missing, b=10, c=20), (b="s", c=1, d=1)]
3-element Vector{NamedTuple}:
 (a = 1, b = 2)
 (a = missing, b = 10, c = 20)
 (b = "s", c = 1, d = 1)
df = DataFrame()
0×0 DataFrame
for row in source
    push!(df, row, cols=:union) ## if cols is :union then promote is true by default
end
df
3×4 DataFrame
Rowabcd
Int64?AnyInt64?Int64?
112missingmissing
2missing1020missing
3missings11

and we see that push! dynamically added columns as needed and updated their element types

Subsetting/removing rows#

x = DataFrame(id=1:10, val='a':'j')
10×2 DataFrame
Rowidval
Int64Char
11a
22b
33c
44d
55e
66f
77g
88h
99i
1010j

by using indexing

x[1:2, :]
2×2 DataFrame
Rowidval
Int64Char
11a
22b

a single row selection creates a DataFrameRow

x[1, :]
DataFrameRow (2 columns)
Rowidval
Int64Char
11a

while this is a DataFrame

x[1:1, :]
1×2 DataFrame
Rowidval
Int64Char
11a

this is a view

view(x, 1:2, :)
2×2 SubDataFrame
Rowidval
Int64Char
11a
22b

selects columns 1 and 2

view(x, :, 1:2)
10×2 SubDataFrame
Rowidval
Int64Char
11a
22b
33c
44d
55e
66f
77g
88h
99i
1010j

indexing by a Bool array, exact length match is required

x[repeat([true, false], 5), :]
5×2 DataFrame
Rowidval
Int64Char
11a
23c
35e
47g
59i

alternatively we can also create a view

view(x, repeat([true, false], 5), :)
5×2 SubDataFrame
Rowidval
Int64Char
11a
23c
35e
47g
59i

we can delete one row in place

deleteat!(x, 7)
9×2 DataFrame
Rowidval
Int64Char
11a
22b
33c
44d
55e
66f
78h
89i
910j

or a collection of rows, also in place

deleteat!(x, 6:7)
7×2 DataFrame
Rowidval
Int64Char
11a
22b
33c
44d
55e
69i
710j

you can also create a new DataFrame when deleting rows using Not indexing

x[Not(1:2), :]
5×2 DataFrame
Rowidval
Int64Char
13c
24d
35e
49i
510j
x
7×2 DataFrame
Rowidval
Int64Char
11a
22b
33c
44d
55e
69i
710j

now we move to row filtering

x = DataFrame([1:4, 2:5, 3:6], :auto)
4×3 DataFrame
Rowx1x2x3
Int64Int64Int64
1123
2234
3345
4456

create a new DataFrame where filtering function operates on DataFrameRow

filter(r -> r.x1 > 2.5, x)
2×3 DataFrame
Rowx1x2x3
Int64Int64Int64
1345
2456

the same but as a view

filter(r -> r.x1 > 2.5, x, view=true)
2×3 SubDataFrame
Rowx1x2x3
Int64Int64Int64
1345
2456

or

filter(:x1 => >(2.5), x)
2×3 DataFrame
Rowx1x2x3
Int64Int64Int64
1345
2456

in place modification of x, using the do-block syntax for a more complex transformation

filter!(x) do r
    if r.x1 > 2.5
        return r.x2 < 4.5
    end
    r.x3 < 3.5
end
2×3 DataFrame
Rowx1x2x3
Int64Int64Int64
1123
2345

A common operation is selection of rows for which a value in a column is contained in a given set. Here are a few ways in which you can achieve this.

df = DataFrame(x=1:12, y=mod1.(1:12, 4))
12×2 DataFrame
Rowxy
Int64Int64
111
222
333
444
551
662
773
884
991
10102
11113
12124

We select rows for which column y has value 1 or 4.

filter(row -> row.y in [1, 4], df)
6×2 DataFrame
Rowxy
Int64Int64
111
244
351
484
591
6124
filter(:y => in([1, 4]), df)
6×2 DataFrame
Rowxy
Int64Int64
111
244
351
484
591
6124
df[in.(df.y, Ref([1, 4])), :]
6×2 DataFrame
Rowxy
Int64Int64
111
244
351
484
591
6124

DataFrames.jl also provides a subset function that works on whole columns and allows for multiple conditions:

x = DataFrame([1:4, 2:5, 3:6], :auto)
4×3 DataFrame
Rowx1x2x3
Int64Int64Int64
1123
2234
3345
4456
subset(x, :x1 => x -> x .< mean(x), :x2 => ByRow(<(2.5)))
1×3 DataFrame
Rowx1x2x3
Int64Int64Int64
1123

Similarly an in-place subset! function is provided.

Deduplicating#

x = DataFrame(A=[1, 2], B=["x", "y"])
append!(x, x)
x.C = 1:4
x
4×3 DataFrame
RowABC
Int64StringInt64
11x1
22y2
31x3
42y4

get first unique rows for given index

unique(x, [1, 2])
2×3 DataFrame
RowABC
Int64StringInt64
11x1
22y2

now we look at whole rows

unique(x)
4×3 DataFrame
RowABC
Int64StringInt64
11x1
22y2
31x3
42y4

get indicators of non-unique rows

nonunique(x, :A)
4-element Vector{Bool}:
 0
 0
 1
 1

modify x in place

unique!(x, :B)
2×3 DataFrame
RowABC
Int64StringInt64
11x1
22y2

Extracting one row from a DataFrame into standard collections#

x = DataFrame(x=[1, missing, 2], y=["a", "b", missing], z=[true, false, true])
3×3 DataFrame
Rowxyz
Int64?String?Bool
11atrue
2missingbfalse
32missingtrue
cols = [:y, :z]
2-element Vector{Symbol}:
 :y
 :z

you can convert it to a Vector or an Array

Vector(x[1, cols])
2-element Vector{Any}:
     "a"
 true

the same as

Array(x[1, cols])
2-element Vector{Any}:
     "a"
 true

get a vector of vectors

[Vector(x[i, cols]) for i in axes(x, 1)]
3-element Vector{Vector{Any}}:
 ["a", true]
 ["b", false]
 [missing, true]

it is easy to convert a DataFrameRow into a NamedTuple

copy(x[1, cols])
@NamedTuple{y::Union{Missing, String}, z::Bool}(("a", true))

or a Tuple

Tuple(x[1, cols])
("a", true)

Working with a collection of rows of a data frame#

You can use eachrow to get a vector-like collection of DataFrameRows

df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Rowx1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
er_df = eachrow(df)
3×4 DataFrameRows
Rowx1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
er_df[1]
DataFrameRow (4 columns)
Rowx1x2x3x4
Int64Int64Int64Int64
114710
last(er_df)
DataFrameRow (4 columns)
Rowx1x2x3x4
Int64Int64Int64Int64
336912
er_df[end]
DataFrameRow (4 columns)
Rowx1x2x3x4
Int64Int64Int64Int64
336912

As DataFrameRows objects keeps connection to the parent data frame you can get the columns of the parent using getproperty

er_df.x1
3-element Vector{Int64}:
 1
 2
 3

Flattening a data frame#

Occasionally you have a data frame whose one column is a vector of collections. You can expand (flatten) such a column using the flatten function

df = DataFrame(a='a':'c', b=[[1, 2, 3], [4, 5], 6])
3×2 DataFrame
Rowab
CharAny
1a[1, 2, 3]
2b[4, 5]
3c6
flatten(df, :b)
6×2 DataFrame
Rowab
CharInt64
1a1
2a2
3a3
4b4
5b5
6c6

Only one row#

only from Julia Base is also supported in DataFrames.jl and succeeds if the data frame in question has only one row, in which case it is returned.

df = DataFrame(a=1)
1×1 DataFrame
Rowa
Int64
11
only(df)
DataFrameRow (1 columns)
Rowa
Int64
11
df2 = repeat(df, 2)
2×1 DataFrame
Rowa
Int64
11
21

Errors

try
    only(df2)
catch e
    show(e)
end
ArgumentError("data frame must contain exactly 1 row, got 2")

This notebook was generated using Literate.jl.