Manipulating rows of DataFrame#

Selecting rows#

using DataFrames
using Statistics
using Random
Random.seed!(1);
df = DataFrame(rand(4, 5), :auto)

4×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0491718	0.691857	0.840384	0.198521	0.802561
2	0.119079	0.767518	0.89077	0.00819786	0.661425
3	0.393271	0.087253	0.138227	0.592041	0.347513
4	0.0240943	0.855718	0.347737	0.801055	0.778149

using : as row selector will copy columns

df[:, :]

4×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0491718	0.691857	0.840384	0.198521	0.802561
2	0.119079	0.767518	0.89077	0.00819786	0.661425
3	0.393271	0.087253	0.138227	0.592041	0.347513
4	0.0240943	0.855718	0.347737	0.801055	0.778149

this is the same as

copy(df)

4×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0491718	0.691857	0.840384	0.198521	0.802561
2	0.119079	0.767518	0.89077	0.00819786	0.661425
3	0.393271	0.087253	0.138227	0.592041	0.347513
4	0.0240943	0.855718	0.347737	0.801055	0.778149
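As a minimal sketch of what copying means here, the snippet below uses a small hypothetical data frame `d` (not the `df` above) and checks that the columns of `d[:, :]` are new vectors, while `d[!, :a]` returns the stored column itself.

```julia
using DataFrames

d = DataFrame(a=[1, 2, 3], b=[4.0, 5.0, 6.0])  # hypothetical data, for illustration only
d_copy = d[:, :]            # new data frame with freshly allocated columns
d_copy.a[1] = 100           # mutating the copy ...
d.a[1]                      # ... leaves the original value (1) in place
d_copy.a === d.a            # false: distinct column vectors
d[!, :a] === d.a            # true: `!` returns the stored column without copying
```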

you can get a subset of rows of a data frame without copying using view to get a SubDataFrame

sdf = view(df, 1:3, 1:3)

3×3 SubDataFrame

Row	x1	x2	x3
	Float64	Float64	Float64
1	0.0491718	0.691857	0.840384
2	0.119079	0.767518	0.89077
3	0.393271	0.087253	0.138227

you still have a detailed reference to the parent

parent(sdf), parentindices(sdf)

(4×5 DataFrame
 Row │ x1         x2        x3        x4          x5       
     │ Float64    Float64   Float64   Float64     Float64  
─────┼─────────────────────────────────────────────────────
   1 │ 0.0491718  0.691857  0.840384  0.198521    0.802561
   2 │ 0.119079   0.767518  0.89077   0.00819786  0.661425
   3 │ 0.393271   0.087253  0.138227  0.592041    0.347513
   4 │ 0.0240943  0.855718  0.347737  0.801055    0.778149, (1:3, 1:3))
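Because a view does not copy data, writing through it changes the parent. A short sketch, assuming a hypothetical data frame `d`:

```julia
using DataFrames

d = DataFrame(a=1:4, b=5:8)   # hypothetical data; the ranges are stored as vectors
v = view(d, 2:3, :)           # SubDataFrame over rows 2 and 3 of d
v[1, :a] = 100                # row 1 of the view is row 2 of the parent
d.a                           # now [1, 100, 3, 4]
parentindices(v)[1]           # 2:3, the parent rows the view refers to
```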

selecting a single row returns a DataFrameRow object which is also a view

dfr = df[3, :]

DataFrameRow (5 columns)

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
3	0.393271	0.087253	0.138227	0.592041	0.347513

parent(dfr), parentindices(dfr), rownumber(dfr)

(4×5 DataFrame
 Row │ x1         x2        x3        x4          x5       
     │ Float64    Float64   Float64   Float64     Float64  
─────┼─────────────────────────────────────────────────────
   1 │ 0.0491718  0.691857  0.840384  0.198521    0.802561
   2 │ 0.119079   0.767518  0.89077   0.00819786  0.661425
   3 │ 0.393271   0.087253  0.138227  0.592041    0.347513
   4 │ 0.0240943  0.855718  0.347737  0.801055    0.778149, (3, Base.OneTo(5)), 3)

let us add a column to a data frame by broadcasting assignment of a scalar

df[!, :Z] .= 1

4-element Vector{Int64}:
 1
 1
 1
 1

df

4×6 DataFrame

Row	x1	x2	x3	x4	x5	Z
	Float64	Float64	Float64	Float64	Float64	Int64
1	0.0491718	0.691857	0.840384	0.198521	0.802561	1
2	0.119079	0.767518	0.89077	0.00819786	0.661425	1
3	0.393271	0.087253	0.138227	0.592041	0.347513	1
4	0.0240943	0.855718	0.347737	0.801055	0.778149	1
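The row selector matters in broadcasting assignment to an existing column. The sketch below, on a hypothetical data frame `d`, suggests the difference: `:` writes into the existing column and keeps its element type, while `!` replaces the column with a newly allocated one.

```julia
using DataFrames

d = DataFrame(a=[1, 2, 3])    # hypothetical data
d[!, :b] .= 0                 # creates a new Int64 column b
d[:, :b] .= 2.0               # in place: values become 2, element type stays Int64
eltype(d.b)                   # Int64
d[!, :b] .= 2.5               # replaces the column: element type is now Float64
eltype(d.b)                   # Float64
```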

Earlier we used : for column selection when creating the view dfr. When : is the column selector of a view (SubDataFrame or DataFrameRow), the view always has all columns of the parent, including columns added after the view was created.

dfr

DataFrameRow (6 columns)

Row	x1	x2	x3	x4	x5	Z
	Float64	Float64	Float64	Float64	Float64	Int64
3	0.393271	0.087253	0.138227	0.592041	0.347513	1

parent(dfr), parentindices(dfr), rownumber(dfr)

(4×6 DataFrame
 Row │ x1         x2        x3        x4          x5        Z     
     │ Float64    Float64   Float64   Float64     Float64   Int64 
─────┼────────────────────────────────────────────────────────────
   1 │ 0.0491718  0.691857  0.840384  0.198521    0.802561      1
   2 │ 0.119079   0.767518  0.89077   0.00819786  0.661425      1
   3 │ 0.393271   0.087253  0.138227  0.592041    0.347513      1
   4 │ 0.0240943  0.855718  0.347737  0.801055    0.778149      1, (3, Base.OneTo(6)), 3)

Note that parent and parentindices refer to the true source of data for a DataFrameRow, while rownumber refers to the row number in the direct object that was used to create the DataFrameRow.

df = DataFrame(a=1:4)

4×1 DataFrame

Row	a
	Int64
1	1
2	2
3	3
4	4

dfv = view(df, [3, 2], :)

2×1 SubDataFrame

Row	a
	Int64
1	3
2	2

dfr = dfv[1, :]

DataFrameRow (1 columns)

Row	a
	Int64
3	3

parent(dfr), parentindices(dfr), rownumber(dfr)

(4×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4, (3, Base.OneTo(1)), 1)

Reordering rows#

We create some random data frame (and hope that x.x is not sorted :), which is quite likely with 12 rows)

x = DataFrame(id=1:12, x=rand(12), y=[zeros(6); ones(6)])

12×3 DataFrame

Row	id	x	y
	Int64	Float64	Float64
1	1	0.830334	0.0
2	2	0.573132	0.0
3	3	0.176625	0.0
4	4	0.114935	0.0
5	5	0.7864	0.0
6	6	0.892598	0.0
7	7	0.452015	1.0
8	8	0.206873	1.0
9	9	0.286582	1.0
10	10	0.918916	1.0
11	11	0.991071	1.0
12	12	0.796831	1.0

check if a DataFrame or a subset of its columns is sorted

issorted(x), issorted(x, :x)

(true, false)

we sort x in place

sort!(x, :x)

12×3 DataFrame

Row	id	x	y
	Int64	Float64	Float64
1	4	0.114935	0.0
2	3	0.176625	0.0
3	8	0.206873	1.0
4	9	0.286582	1.0
5	7	0.452015	1.0
6	2	0.573132	0.0
7	5	0.7864	0.0
8	12	0.796831	1.0
9	1	0.830334	0.0
10	6	0.892598	0.0
11	10	0.918916	1.0
12	11	0.991071	1.0

now we create a new DataFrame

y = sort(x, :id)

12×3 DataFrame

Row	id	x	y
	Int64	Float64	Float64
1	1	0.830334	0.0
2	2	0.573132	0.0
3	3	0.176625	0.0
4	4	0.114935	0.0
5	5	0.7864	0.0
6	6	0.892598	0.0
7	7	0.452015	1.0
8	8	0.206873	1.0
9	9	0.286582	1.0
10	10	0.918916	1.0
11	11	0.991071	1.0
12	12	0.796831	1.0

here we sort by two columns, first is decreasing, second is increasing

sort(x, [:y, :x], rev=[true, false])

12×3 DataFrame

Row	id	x	y
	Int64	Float64	Float64
1	8	0.206873	1.0
2	9	0.286582	1.0
3	7	0.452015	1.0
4	12	0.796831	1.0
5	10	0.918916	1.0
6	11	0.991071	1.0
7	4	0.114935	0.0
8	3	0.176625	0.0
9	2	0.573132	0.0
10	5	0.7864	0.0
11	1	0.830334	0.0
12	6	0.892598	0.0

sort(x, [order(:y, rev=true), :x]) ## the same as above

12×3 DataFrame

Row	id	x	y
	Int64	Float64	Float64
1	8	0.206873	1.0
2	9	0.286582	1.0
3	7	0.452015	1.0
4	12	0.796831	1.0
5	10	0.918916	1.0
6	11	0.991071	1.0
7	4	0.114935	0.0
8	3	0.176625	0.0
9	2	0.573132	0.0
10	5	0.7864	0.0
11	1	0.830334	0.0
12	6	0.892598	0.0
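If you need the sorting permutation rather than the sorted data frame, sortperm accepts the same column specifications. A small sketch on hypothetical data, also using the by keyword argument of order to sort by absolute value:

```julia
using DataFrames

d = DataFrame(id=1:4, v=[-3, 1, -2, 4])   # hypothetical data
p = sortperm(d, order(:v, by=abs))        # row permutation that sorts by |v|
d[p, :] == sort(d, order(:v, by=abs))     # true: indexing with p reproduces the sort
p                                         # [2, 3, 1, 4]
```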

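this is how you can shuffle rows; note that shuffle(1:10) picks and permutes only the first 10 of the 12 rows, so use shuffle(1:nrow(x)) to shuffle all of them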

x[shuffle(1:10), :]

10×3 DataFrame

Row	id	x	y
	Int64	Float64	Float64
1	8	0.206873	1.0
2	12	0.796831	1.0
3	2	0.573132	0.0
4	1	0.830334	0.0
5	5	0.7864	0.0
6	9	0.286582	1.0
7	6	0.892598	0.0
8	4	0.114935	0.0
9	3	0.176625	0.0
10	7	0.452015	1.0
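To shuffle every row, the permutation has to cover all row indices. A minimal sketch, assuming a hypothetical data frame `d` and an explicitly seeded RNG (an illustrative choice, not something the text above prescribes):

```julia
using DataFrames, Random

d = DataFrame(id=1:12)                     # hypothetical data
rng = MersenneTwister(1)                   # explicit RNG, for reproducibility
s = d[shuffle(rng, 1:nrow(d)), :]          # permutation of all row indices
nrow(s) == nrow(d)                         # true: no rows are dropped
sort(s.id) == d.id                         # true: same rows, different order
```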

it is also easy to swap rows using broadcasted assignment

sort!(x, :id)
x[[1, 10], :] .= x[[10, 1], :]
x

12×3 DataFrame

Row	id	x	y
	Int64	Float64	Float64
1	10	0.918916	1.0
2	2	0.573132	0.0
3	3	0.176625	0.0
4	4	0.114935	0.0
5	5	0.7864	0.0
6	6	0.892598	0.0
7	7	0.452015	1.0
8	8	0.206873	1.0
9	9	0.286582	1.0
10	1	0.830334	0.0
11	11	0.991071	1.0
12	12	0.796831	1.0

Merging/adding rows#

x = DataFrame(rand(3, 5), :auto)

3×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0294498	0.218366	0.338402	0.140855	0.306016
2	0.271436	0.52931	0.0526195	0.4	0.843511
3	0.32389	0.38624	0.188894	0.321968	0.896884

concatenate data frames by rows; by default the data frames must have the same column names; this is the same as vcat

[x; x]

6×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0294498	0.218366	0.338402	0.140855	0.306016
2	0.271436	0.52931	0.0526195	0.4	0.843511
3	0.32389	0.38624	0.188894	0.321968	0.896884
4	0.0294498	0.218366	0.338402	0.140855	0.306016
5	0.271436	0.52931	0.0526195	0.4	0.843511
6	0.32389	0.38624	0.188894	0.321968	0.896884

you can efficiently vcat a vector of DataFrames using reduce

reduce(vcat, [x, x, x])

9×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0294498	0.218366	0.338402	0.140855	0.306016
2	0.271436	0.52931	0.0526195	0.4	0.843511
3	0.32389	0.38624	0.188894	0.321968	0.896884
4	0.0294498	0.218366	0.338402	0.140855	0.306016
5	0.271436	0.52931	0.0526195	0.4	0.843511
6	0.32389	0.38624	0.188894	0.321968	0.896884
7	0.0294498	0.218366	0.338402	0.140855	0.306016
8	0.271436	0.52931	0.0526195	0.4	0.843511
9	0.32389	0.38624	0.188894	0.321968	0.896884

create y with the columns of x in reverse order

y = x[:, reverse(names(x))]

3×5 DataFrame

Row	x5	x4	x3	x2	x1
	Float64	Float64	Float64	Float64	Float64
1	0.306016	0.140855	0.338402	0.218366	0.0294498
2	0.843511	0.4	0.0526195	0.52931	0.271436
3	0.896884	0.321968	0.188894	0.38624	0.32389

vcat is still possible as it does column name matching

vcat(x, y)

6×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0294498	0.218366	0.338402	0.140855	0.306016
2	0.271436	0.52931	0.0526195	0.4	0.843511
3	0.32389	0.38624	0.188894	0.321968	0.896884
4	0.0294498	0.218366	0.338402	0.140855	0.306016
5	0.271436	0.52931	0.0526195	0.4	0.843511
6	0.32389	0.38624	0.188894	0.321968	0.896884

but column names must still match

try
    vcat(x, y[:, 1:3])
catch e
    show(e)
end

ArgumentError("column(s) x1 and x2 are missing from argument(s) 2")

unless you pass :intersect, :union or specific column names as keyword argument cols

vcat(x, y[:, 1:3], cols=:intersect)

6×3 DataFrame

Row	x3	x4	x5
	Float64	Float64	Float64
1	0.338402	0.140855	0.306016
2	0.0526195	0.4	0.843511
3	0.188894	0.321968	0.896884
4	0.338402	0.140855	0.306016
5	0.0526195	0.4	0.843511
6	0.188894	0.321968	0.896884

vcat(x, y[:, 1:3], cols=:union)

6×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64?	Float64?	Float64	Float64	Float64
1	0.0294498	0.218366	0.338402	0.140855	0.306016
2	0.271436	0.52931	0.0526195	0.4	0.843511
3	0.32389	0.38624	0.188894	0.321968	0.896884
4	missing	missing	0.338402	0.140855	0.306016
5	missing	missing	0.0526195	0.4	0.843511
6	missing	missing	0.188894	0.321968	0.896884

vcat(x, y[:, 1:3], cols=[:x1, :x5])

6×2 DataFrame

Row	x1	x5
	Float64?	Float64
1	0.0294498	0.306016
2	0.271436	0.843511
3	0.32389	0.896884
4	missing	0.306016
5	missing	0.843511
6	missing	0.896884
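The effect of cols is easier to see on two small data frames that share only one column. The data frames `a` and `b` below are hypothetical:

```julia
using DataFrames

a = DataFrame(id=[1, 2], x=[0.1, 0.2])   # hypothetical data
b = DataFrame(id=[3], y=["z"])           # shares only the id column with a
u = vcat(a, b, cols=:union)              # all columns, gaps filled with missing
i = vcat(a, b, cols=:intersect)          # only the shared columns
names(u)                                 # ["id", "x", "y"]
names(i)                                 # ["id"]
```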

append! modifies x in place

append!(x, x)

6×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0294498	0.218366	0.338402	0.140855	0.306016
2	0.271436	0.52931	0.0526195	0.4	0.843511
3	0.32389	0.38624	0.188894	0.321968	0.896884
4	0.0294498	0.218366	0.338402	0.140855	0.306016
5	0.271436	0.52931	0.0526195	0.4	0.843511
6	0.32389	0.38624	0.188894	0.321968	0.896884

here the column names must match (their order may differ, as in y) unless the cols keyword argument is passed

append!(x, y)

9×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0294498	0.218366	0.338402	0.140855	0.306016
2	0.271436	0.52931	0.0526195	0.4	0.843511
3	0.32389	0.38624	0.188894	0.321968	0.896884
4	0.0294498	0.218366	0.338402	0.140855	0.306016
5	0.271436	0.52931	0.0526195	0.4	0.843511
6	0.32389	0.38624	0.188894	0.321968	0.896884
7	0.0294498	0.218366	0.338402	0.140855	0.306016
8	0.271436	0.52931	0.0526195	0.4	0.843511
9	0.32389	0.38624	0.188894	0.321968	0.896884
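A short sketch of how cols changes what append! accepts, on a hypothetical data frame `d`. With cols=:subset a column absent from the appended data frame is filled with missing, which relies on column type promotion being enabled by default for that setting:

```julia
using DataFrames

d = DataFrame(a=[1, 2], b=[1.5, 2.5])          # hypothetical data
append!(d, DataFrame(b=[3.5], a=[3]))          # same names in a different order: accepted
append!(d, DataFrame(a=[4]), cols=:subset)     # b is absent, so missing is stored
d.a                                            # [1, 2, 3, 4]
eltype(d.b)                                    # Union{Missing, Float64}
```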

standard repeat function works on rows; also inner and outer keyword arguments are accepted

repeat(x, 2)

18×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0294498	0.218366	0.338402	0.140855	0.306016
2	0.271436	0.52931	0.0526195	0.4	0.843511
3	0.32389	0.38624	0.188894	0.321968	0.896884
4	0.0294498	0.218366	0.338402	0.140855	0.306016
5	0.271436	0.52931	0.0526195	0.4	0.843511
6	0.32389	0.38624	0.188894	0.321968	0.896884
7	0.0294498	0.218366	0.338402	0.140855	0.306016
8	0.271436	0.52931	0.0526195	0.4	0.843511
9	0.32389	0.38624	0.188894	0.321968	0.896884
10	0.0294498	0.218366	0.338402	0.140855	0.306016
11	0.271436	0.52931	0.0526195	0.4	0.843511
12	0.32389	0.38624	0.188894	0.321968	0.896884
13	0.0294498	0.218366	0.338402	0.140855	0.306016
14	0.271436	0.52931	0.0526195	0.4	0.843511
15	0.32389	0.38624	0.188894	0.321968	0.896884
16	0.0294498	0.218366	0.338402	0.140855	0.306016
17	0.271436	0.52931	0.0526195	0.4	0.843511
18	0.32389	0.38624	0.188894	0.321968	0.896884

push! adds one row to x at the end; one must pass a correct number of values unless cols keyword argument is passed

push!(x, 1:5)
x

10×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0294498	0.218366	0.338402	0.140855	0.306016
2	0.271436	0.52931	0.0526195	0.4	0.843511
3	0.32389	0.38624	0.188894	0.321968	0.896884
4	0.0294498	0.218366	0.338402	0.140855	0.306016
5	0.271436	0.52931	0.0526195	0.4	0.843511
6	0.32389	0.38624	0.188894	0.321968	0.896884
7	0.0294498	0.218366	0.338402	0.140855	0.306016
8	0.271436	0.52931	0.0526195	0.4	0.843511
9	0.32389	0.38624	0.188894	0.321968	0.896884
10	1.0	2.0	3.0	4.0	5.0

push! also works with dictionaries

push!(x, Dict(:x1 => 11, :x2 => 12, :x3 => 13, :x4 => 14, :x5 => 15))
x

11×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0294498	0.218366	0.338402	0.140855	0.306016
2	0.271436	0.52931	0.0526195	0.4	0.843511
3	0.32389	0.38624	0.188894	0.321968	0.896884
4	0.0294498	0.218366	0.338402	0.140855	0.306016
5	0.271436	0.52931	0.0526195	0.4	0.843511
6	0.32389	0.38624	0.188894	0.321968	0.896884
7	0.0294498	0.218366	0.338402	0.140855	0.306016
8	0.271436	0.52931	0.0526195	0.4	0.843511
9	0.32389	0.38624	0.188894	0.321968	0.896884
10	1.0	2.0	3.0	4.0	5.0
11	11.0	12.0	13.0	14.0	15.0

and NamedTuples via name matching

push!(x, (x2=2, x1=1, x4=4, x3=3, x5=5))

12×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0294498	0.218366	0.338402	0.140855	0.306016
2	0.271436	0.52931	0.0526195	0.4	0.843511
3	0.32389	0.38624	0.188894	0.321968	0.896884
4	0.0294498	0.218366	0.338402	0.140855	0.306016
5	0.271436	0.52931	0.0526195	0.4	0.843511
6	0.32389	0.38624	0.188894	0.321968	0.896884
7	0.0294498	0.218366	0.338402	0.140855	0.306016
8	0.271436	0.52931	0.0526195	0.4	0.843511
9	0.32389	0.38624	0.188894	0.321968	0.896884
10	1.0	2.0	3.0	4.0	5.0
11	11.0	12.0	13.0	14.0	15.0
12	1.0	2.0	3.0	4.0	5.0

and DataFrameRow also via name matching

push!(x, x[1, :])

13×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0294498	0.218366	0.338402	0.140855	0.306016
2	0.271436	0.52931	0.0526195	0.4	0.843511
3	0.32389	0.38624	0.188894	0.321968	0.896884
4	0.0294498	0.218366	0.338402	0.140855	0.306016
5	0.271436	0.52931	0.0526195	0.4	0.843511
6	0.32389	0.38624	0.188894	0.321968	0.896884
7	0.0294498	0.218366	0.338402	0.140855	0.306016
8	0.271436	0.52931	0.0526195	0.4	0.843511
9	0.32389	0.38624	0.188894	0.321968	0.896884
10	1.0	2.0	3.0	4.0	5.0
11	11.0	12.0	13.0	14.0	15.0
12	1.0	2.0	3.0	4.0	5.0
13	0.0294498	0.218366	0.338402	0.140855	0.306016

Please consult the documentation of push!, append! and vcat for allowed values of cols keyword argument. This keyword argument governs the way these functions perform column matching of passed arguments. Also append! and push! support a promote keyword argument that decides if column type promotion is allowed.

Let us here just give a quick example of how heterogeneous data can be stored in the data frame using these functionalities:

source = [(a=1, b=2), (a=missing, b=10, c=20), (b="s", c=1, d=1)]

3-element Vector{NamedTuple}:
 (a = 1, b = 2)
 (a = missing, b = 10, c = 20)
 (b = "s", c = 1, d = 1)

df = DataFrame()

0×0 DataFrame

for row in source
    push!(df, row, cols=:union) ## if cols is :union then promote is true by default
end

df

3×4 DataFrame

Row	a	b	c	d
	Int64?	Any	Int64?	Int64?
1	1	2	missing	missing
2	missing	10	20	missing
3	missing	s	1	1

and we see that push! dynamically added columns as needed and updated their element types

Subsetting/removing rows#

x = DataFrame(id=1:10, val='a':'j')

10×2 DataFrame

Row	id	val
	Int64	Char
1	1	a
2	2	b
3	3	c
4	4	d
5	5	e
6	6	f
7	7	g
8	8	h
9	9	i
10	10	j

by using indexing

x[1:2, :]

2×2 DataFrame

Row	id	val
	Int64	Char
1	1	a
2	2	b

a single row selection creates a DataFrameRow

x[1, :]

DataFrameRow (2 columns)

Row	id	val
	Int64	Char
1	1	a

while this is a DataFrame

x[1:1, :]

1×2 DataFrame

Row	id	val
	Int64	Char
1	1	a

this is a view

view(x, 1:2, :)

2×2 SubDataFrame

Row	id	val
	Int64	Char
1	1	a
2	2	b

selects columns 1 and 2

view(x, :, 1:2)

10×2 SubDataFrame

Row	id	val
	Int64	Char
1	1	a
2	2	b
3	3	c
4	4	d
5	5	e
6	6	f
7	7	g
8	8	h
9	9	i
10	10	j

indexing by a Bool array, exact length match is required

x[repeat([true, false], 5), :]

5×2 DataFrame

Row	id	val
	Int64	Char
1	1	a
2	3	c
3	5	e
4	7	g
5	9	i

alternatively we can also create a view

view(x, repeat([true, false], 5), :)

5×2 SubDataFrame

Row	id	val
	Int64	Char
1	1	a
2	3	c
3	5	e
4	7	g
5	9	i

we can delete one row in place

deleteat!(x, 7)

9×2 DataFrame

Row	id	val
	Int64	Char
1	1	a
2	2	b
3	3	c
4	4	d
5	5	e
6	6	f
7	8	h
8	9	i
9	10	j

or a collection of rows, also in place

deleteat!(x, 6:7)

7×2 DataFrame

Row	id	val
	Int64	Char
1	1	a
2	2	b
3	3	c
4	4	d
5	5	e
6	9	i
7	10	j

you can also create a new DataFrame when deleting rows using Not indexing

x[Not(1:2), :]

5×2 DataFrame

Row	id	val
	Int64	Char
1	3	c
2	4	d
3	5	e
4	9	i
5	10	j

and x itself is left unchanged

x

7×2 DataFrame

Row	id	val
	Int64	Char
1	1	a
2	2	b
3	3	c
4	4	d
5	5	e
6	9	i
7	10	j
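The two approaches can be compared side by side. In this sketch on a hypothetical data frame `d`, Not indexing builds a new data frame, while deleteat! with a Bool mask (one entry per row) removes rows in place:

```julia
using DataFrames

d = DataFrame(id=1:6, val='a':'f')   # hypothetical data
without = d[Not([1, 3]), :]          # new data frame; d is untouched
nrow(d)                              # still 6
deleteat!(d, d.id .> 4)              # in place: drops the rows where the mask is true
d.id                                 # [1, 2, 3, 4]
```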

now we move to row filtering

x = DataFrame([1:4, 2:5, 3:6], :auto)

4×3 DataFrame

Row	x1	x2	x3
	Int64	Int64	Int64
1	1	2	3
2	2	3	4
3	3	4	5
4	4	5	6

create a new DataFrame where filtering function operates on DataFrameRow

filter(r -> r.x1 > 2.5, x)

2×3 DataFrame

Row	x1	x2	x3
	Int64	Int64	Int64
1	3	4	5
2	4	5	6

the same but as a view

filter(r -> r.x1 > 2.5, x, view=true)

2×3 SubDataFrame

Row	x1	x2	x3
	Int64	Int64	Int64
1	3	4	5
2	4	5	6

or

filter(:x1 => >(2.5), x)

2×3 DataFrame

Row	x1	x2	x3
	Int64	Int64	Int64
1	3	4	5
2	4	5	6

in-place modification of x, using the do-block syntax for a more complex condition

filter!(x) do r
    if r.x1 > 2.5
        return r.x2 < 4.5
    end
    r.x3 < 3.5
end

2×3 DataFrame

Row	x1	x2	x3
	Int64	Int64	Int64
1	1	2	3
2	3	4	5
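A condition on several columns can also be written in the `cols => function` form, in which case the function receives one positional argument per column. A sketch on a hypothetical data frame `d`:

```julia
using DataFrames

d = DataFrame(x1=1:4, x2=2:5, x3=3:6)                 # hypothetical data
r = filter([:x1, :x3] => (a, c) -> a + c > 6, d)      # a and c are values from one row
r.x1                                                  # [3, 4]
```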

A common operation is selection of rows for which a value in a column is contained in a given set. Here are a few ways in which you can achieve this.

df = DataFrame(x=1:12, y=mod1.(1:12, 4))

12×2 DataFrame

Row	x	y
	Int64	Int64
1	1	1
2	2	2
3	3	3
4	4	4
5	5	1
6	6	2
7	7	3
8	8	4
9	9	1
10	10	2
11	11	3
12	12	4

We select rows for which column y has value 1 or 4.

filter(row -> row.y in [1, 4], df)

6×2 DataFrame

Row	x	y
	Int64	Int64
1	1	1
2	4	4
3	5	1
4	8	4
5	9	1
6	12	4

filter(:y => in([1, 4]), df)

6×2 DataFrame

Row	x	y
	Int64	Int64
1	1	1
2	4	4
3	5	1
4	8	4
5	9	1
6	12	4

df[in.(df.y, Ref([1, 4])), :]

6×2 DataFrame

Row	x	y
	Int64	Int64
1	1	1
2	4	4
3	5	1
4	8	4
5	9	1
6	12	4
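For larger lookup collections a Set may be a better fit than a vector, since membership tests do not scan the whole collection. The sketch below is only an illustration (`wanted` is a hypothetical name) and shows that the same predicate works with both filter and subset:

```julia
using DataFrames

d = DataFrame(x=1:12, y=mod1.(1:12, 4))
wanted = Set([1, 4])                          # hypothetical lookup set
r1 = filter(:y => in(wanted), d)              # predicate applied to each value of y
r2 = subset(d, :y => ByRow(in(wanted)))       # subset works on columns, hence ByRow
r1 == r2                                      # true
r1.x                                          # [1, 4, 5, 8, 9, 12]
```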

DataFrames.jl also provides a subset function that works on whole columns and allows for multiple conditions:

x = DataFrame([1:4, 2:5, 3:6], :auto)

4×3 DataFrame

Row	x1	x2	x3
	Int64	Int64	Int64
1	1	2	3
2	2	3	4
3	3	4	5
4	4	5	6

subset(x, :x1 => x -> x .< mean(x), :x2 => ByRow(<(2.5)))

1×3 DataFrame

Row	x1	x2	x3
	Int64	Int64	Int64
1	1	2	3

Similarly an in-place subset! function is provided.
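When a condition can return missing, subset throws an error unless skipmissing=true is passed, in which case missing is treated as false. A minimal sketch on hypothetical data:

```julia
using DataFrames

d = DataFrame(a=[1, missing, 3])                      # hypothetical data
r = subset(d, :a => ByRow(>(1)), skipmissing=true)    # missing > 1 is missing, so row 2 is dropped
r.a                                                   # a single value, 3
```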

Deduplicating#

x = DataFrame(A=[1, 2], B=["x", "y"])
append!(x, x)
x.C = 1:4
x

4×3 DataFrame

Row	A	B	C
	Int64	String	Int64
1	1	x	1
2	2	y	2
3	1	x	3
4	2	y	4

keep the first row for each unique combination of values in the given columns (here columns 1 and 2)

unique(x, [1, 2])

2×3 DataFrame

Row	A	B	C
	Int64	String	Int64
1	1	x	1
2	2	y	2

now we look at whole rows

unique(x)

4×3 DataFrame

Row	A	B	C
	Int64	String	Int64
1	1	x	1
2	2	y	2
3	1	x	3
4	2	y	4

get indicators of non-unique rows

nonunique(x, :A)

4-element Vector{Bool}:
 0
 0
 1
 1

modify x in place

unique!(x, :B)

2×3 DataFrame

Row	A	B	C
	Int64	String	Int64
1	1	x	1
2	2	y	2
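The relation between nonunique and unique can be sketched as follows: negating the indicator vector and using it as a row mask selects the same rows that unique keeps. The data frame `d` repeats the example data under a different name:

```julia
using DataFrames

d = DataFrame(A=[1, 2, 1, 2], B=["x", "y", "x", "y"], C=1:4)
mask = nonunique(d, [:A, :B])            # true for rows that repeat an earlier row
d[.!mask, :] == unique(d, [:A, :B])      # true: both keep rows 1 and 2
```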

Extracting one row from a DataFrame into standard collections#

x = DataFrame(x=[1, missing, 2], y=["a", "b", missing], z=[true, false, true])

3×3 DataFrame

Row	x	y	z
	Int64?	String?	Bool
1	1	a	true
2	missing	b	false
3	2	missing	true

cols = [:y, :z]

2-element Vector{Symbol}:
 :y
 :z

you can convert a DataFrameRow to a Vector or an Array

Vector(x[1, cols])

2-element Vector{Any}:
     "a"
 true

the same as

Array(x[1, cols])

2-element Vector{Any}:
     "a"
 true

get a vector of vectors

[Vector(x[i, cols]) for i in axes(x, 1)]

3-element Vector{Vector{Any}}:
 ["a", true]
 ["b", false]
 [missing, true]

it is easy to convert a DataFrameRow into a NamedTuple

copy(x[1, cols])

@NamedTuple{y::Union{Missing, String}, z::Bool}(("a", true))

or a Tuple

Tuple(x[1, cols])

("a", true)
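If the column names should be kept, a NamedTuple or a Dict may be more convenient than a Vector or a Tuple. A small sketch on hypothetical data:

```julia
using DataFrames

d = DataFrame(x=[1, 2], y=["a", "b"])   # hypothetical data
r = d[1, :]                             # DataFrameRow
nt = NamedTuple(r)                      # (x = 1, y = "a")
dict = Dict(pairs(r))                   # maps column names (Symbols) to values
dict[:y]                                # "a"
```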

Working with a collection of rows of a data frame#

You can use eachrow to get a vector-like collection of DataFrameRows

df = DataFrame(reshape(1:12, 3, 4), :auto)

3×4 DataFrame

Row	x1	x2	x3	x4
	Int64	Int64	Int64	Int64
1	1	4	7	10
2	2	5	8	11
3	3	6	9	12

er_df = eachrow(df)

3×4 DataFrameRows

Row	x1	x2	x3	x4
	Int64	Int64	Int64	Int64
1	1	4	7	10
2	2	5	8	11
3	3	6	9	12

er_df[1]

DataFrameRow (4 columns)

Row	x1	x2	x3	x4
	Int64	Int64	Int64	Int64
1	1	4	7	10

last(er_df)

DataFrameRow (4 columns)

Row	x1	x2	x3	x4
	Int64	Int64	Int64	Int64
3	3	6	9	12

er_df[end]

DataFrameRow (4 columns)

Row	x1	x2	x3	x4
	Int64	Int64	Int64	Int64
3	3	6	9	12

As a DataFrameRows object keeps a connection to the parent data frame, you can get the columns of the parent using getproperty

er_df.x1

3-element Vector{Int64}:
 1
 2
 3
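A typical use of eachrow is a loop or comprehension over rows. The sketch below rebuilds the example data under the hypothetical name `d`. Since every DataFrameRow is a view, assigning to a field writes into the parent data frame:

```julia
using DataFrames

d = DataFrame(reshape(1:12, 3, 4), :auto)
totals = [r.x1 + r.x4 for r in eachrow(d)]   # [11, 13, 15]
for r in eachrow(d)
    r.x2 = 0                                 # writes through to column x2 of d
end
d.x2                                         # [0, 0, 0]
```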

Flattening a data frame#

Occasionally you have a data frame in which one column is a vector of collections. You can expand (flatten) such a column using the flatten function.

df = DataFrame(a='a':'c', b=[[1, 2, 3], [4, 5], 6])

3×2 DataFrame

Row	a	b
	Char	Any
1	a	[1, 2, 3]
2	b	[4, 5]
3	c	6

flatten(df, :b)

6×2 DataFrame

Row	a	b
	Char	Int64
1	a	1
2	a	2
3	a	3
4	b	4
5	b	5
6	c	6
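flatten also accepts several columns at once, provided that in every row the collections in those columns have the same length. A sketch on hypothetical data:

```julia
using DataFrames

d = DataFrame(a=[1, 2], b=[[1, 2], [3, 4]], c=[[5, 6], [7, 8]])   # hypothetical data
f = flatten(d, [:b, :c])      # b and c are expanded together, row by row
f.a                           # [1, 1, 2, 2]
f.c                           # [5, 6, 7, 8]
```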

Only one row#

only from Julia Base is also supported in DataFrames.jl. It succeeds if the data frame in question has exactly one row, in which case that row is returned as a DataFrameRow.

df = DataFrame(a=1)

1×1 DataFrame

Row	a
	Int64
1	1

only(df)

DataFrameRow (1 columns)

Row	a
	Int64
1	1

df2 = repeat(df, 2)

2×1 DataFrame

Row	a
	Int64
1	1
2	1

if the data frame does not have exactly one row, only throws an error

try
    only(df2)
catch e
    show(e)
end

ArgumentError("data frame must contain exactly 1 row, got 2")
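When an error is not wanted, the row count can be checked first. The helper `single_row_or_nothing` below is hypothetical and not part of DataFrames.jl; it is only a sketch of one way to guard the call:

```julia
using DataFrames

# hypothetical helper, not part of DataFrames.jl
single_row_or_nothing(d) = nrow(d) == 1 ? only(d) : nothing

r = single_row_or_nothing(DataFrame(a=1))              # a DataFrameRow
r.a                                                    # 1
single_row_or_nothing(DataFrame(a=[1, 1])) === nothing # true: two rows, no error
```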

This notebook was generated using Literate.jl.
