Selecting rows
1 | 0.0491718 | 0.691857 | 0.840384 | 0.198521 | 0.802561 |
2 | 0.119079 | 0.767518 | 0.89077 | 0.00819786 | 0.661425 |
3 | 0.393271 | 0.087253 | 0.138227 | 0.592041 | 0.347513 |
4 | 0.0240943 | 0.855718 | 0.347737 | 0.801055 | 0.778149 |
using `:` as the row selector will copy columns
1 | 0.0491718 | 0.691857 | 0.840384 | 0.198521 | 0.802561 |
2 | 0.119079 | 0.767518 | 0.89077 | 0.00819786 | 0.661425 |
3 | 0.393271 | 0.087253 | 0.138227 | 0.592041 | 0.347513 |
4 | 0.0240943 | 0.855718 | 0.347737 | 0.801055 | 0.778149 |
this is the same as
1 | 0.0491718 | 0.691857 | 0.840384 | 0.198521 | 0.802561 |
2 | 0.119079 | 0.767518 | 0.89077 | 0.00819786 | 0.661425 |
3 | 0.393271 | 0.087253 | 0.138227 | 0.592041 | 0.347513 |
4 | 0.0240943 | 0.855718 | 0.347737 | 0.801055 | 0.778149 |
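The copying behaviour described above can be sketched as follows. This is a minimal illustration assuming DataFrames.jl 1.x; the name `df` and the helper variables are hypothetical and not taken from the original.

```julia
using DataFrames

# hypothetical 4x5 data frame; `:auto` generates the column names x1, ..., x5
df = DataFrame(rand(4, 5), :auto)

df_colon = df[:, :]   # `:` as the row selector allocates new column vectors
df_copy = copy(df)    # explicit copy with the same effect

# both results hold the same values as `df` ...
same_values = df_colon == df && df_copy == df
# ... but neither shares a column vector with it
shares_memory = df_colon.x1 === df.x1 || df_copy.x1 === df.x1
```

Because the columns are copied, mutating `df_colon` or `df_copy` afterwards leaves `df` untouched.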
you can get a subset of rows of a data frame without copying by using `view` to get a `SubDataFrame`
1 | 0.0491718 | 0.691857 | 0.840384 |
2 | 0.119079 | 0.767518 | 0.89077 |
3 | 0.393271 | 0.087253 | 0.138227 |
you still have a detailed reference to the parent
(4×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────
1 │ 0.0491718 0.691857 0.840384 0.198521 0.802561
2 │ 0.119079 0.767518 0.89077 0.00819786 0.661425
3 │ 0.393271 0.087253 0.138227 0.592041 0.347513
4 │ 0.0240943 0.855718 0.347737 0.801055 0.778149, (1:3, 1:3))
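A minimal sketch of creating such a view and inspecting its parent, assuming DataFrames.jl 1.x. The names `df` and `sdf` are illustrative only.

```julia
using DataFrames

df = DataFrame(rand(4, 5), :auto)   # hypothetical parent data frame

# rows 1-3 and columns 1-3 of `df`, without copying any data
sdf = view(df, 1:3, 1:3)

is_view = sdf isa SubDataFrame
same_parent = parent(sdf) === df    # the view keeps a reference to its parent
idx = parentindices(sdf)            # row and column indices into the parent

# writing through the view changes the parent
sdf[1, :x1] = 100.0
```

Since no data is copied, the assignment through `sdf` is visible in `df.x1[1]`.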
selecting a single row returns a `DataFrameRow` object which is also a view
3 | 0.393271 | 0.087253 | 0.138227 | 0.592041 | 0.347513 |
(4×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────
1 │ 0.0491718 0.691857 0.840384 0.198521 0.802561
2 │ 0.119079 0.767518 0.89077 0.00819786 0.661425
3 │ 0.393271 0.087253 0.138227 0.592041 0.347513
4 │ 0.0240943 0.855718 0.347737 0.801055 0.778149, (3, Base.OneTo(5)), 3)
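The single-row case can be sketched like this, assuming DataFrames.jl 1.x. The names `df` and `dfr` are hypothetical.

```julia
using DataFrames

df = DataFrame(rand(4, 5), :auto)   # hypothetical parent data frame

dfr = df[3, :]                      # one row, all columns

is_row = dfr isa DataFrameRow
same_parent = parent(dfr) === df
idx = parentindices(dfr)            # (row in parent, columns in parent)
rn = rownumber(dfr)

# a DataFrameRow is a view, so assignment writes into the parent
dfr.x1 = -1.0
```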
let us add a column to a data frame by assigning a scalar with broadcasting
4-element Vector{Int64}:
1
1
1
1
1 | 0.0491718 | 0.691857 | 0.840384 | 0.198521 | 0.802561 | 1 |
2 | 0.119079 | 0.767518 | 0.89077 | 0.00819786 | 0.661425 | 1 |
3 | 0.393271 | 0.087253 | 0.138227 | 0.592041 | 0.347513 | 1 |
4 | 0.0240943 | 0.855718 | 0.347737 | 0.801055 | 0.778149 | 1 |
Earlier we used `:` for column selection in a view (`SubDataFrame` and `DataFrameRow`). In this case the view will have all columns of the parent after the parent is mutated.
3 | 0.393271 | 0.087253 | 0.138227 | 0.592041 | 0.347513 | 1 |
(4×6 DataFrame
Row │ x1 x2 x3 x4 x5 Z
│ Float64 Float64 Float64 Float64 Float64 Int64
─────┼────────────────────────────────────────────────────────────
1 │ 0.0491718 0.691857 0.840384 0.198521 0.802561 1
2 │ 0.119079 0.767518 0.89077 0.00819786 0.661425 1
3 │ 0.393271 0.087253 0.138227 0.592041 0.347513 1
4 │ 0.0240943 0.855718 0.347737 0.801055 0.778149 1, (3, Base.OneTo(6)), 3)
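A minimal sketch of this behaviour, assuming DataFrames.jl 1.x. The column name `Z` matches the output shown above; `df`, `dfr` and `sdf` are illustrative names.

```julia
using DataFrames

df = DataFrame(rand(4, 5), :auto)

# both views use `:` as the column selector
dfr = df[3, :]
sdf = view(df, 1:2, :)

# broadcast the scalar 1 into a new column named Z
df[!, :Z] .= 1

# the views created earlier now expose the new column as well
row_names = names(dfr)
sdf_cols = ncol(sdf)
```

A view created with an explicit column selector such as `1:3` would keep showing only those columns.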
Note that `parent` and `parentindices` refer to the true source of data for a `DataFrameRow`, and `rownumber` refers to the row number in the direct object that was used to create the `DataFrameRow`.
(4×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4, (3, Base.OneTo(1)), 1)
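One way to reproduce a result like the one above is to take a row from an intermediate view. This is a sketch assuming DataFrames.jl 1.x; the intermediate view `dfv` and its row selection `[3, 2]` are assumptions chosen to match the printed output.

```julia
using DataFrames

df = DataFrame(a=1:4)

dfv = view(df, [3, 2], :)   # hypothetical view: rows 3 and 2 of `df`
dfr = dfv[1, :]             # first row of the view, which is row 3 of `df`

same_parent = parent(dfr) === df   # the underlying data frame, not `dfv`
idx = parentindices(dfr)           # indices into `df`
rn = rownumber(dfr)                # position within `dfv`
```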
Reordering rows
We create a random data frame and hope that `x.x` is not sorted, which is quite likely with 12 rows.
1 | 1 | 0.830334 | 0.0 |
2 | 2 | 0.573132 | 0.0 |
3 | 3 | 0.176625 | 0.0 |
4 | 4 | 0.114935 | 0.0 |
5 | 5 | 0.7864 | 0.0 |
6 | 6 | 0.892598 | 0.0 |
7 | 7 | 0.452015 | 1.0 |
8 | 8 | 0.206873 | 1.0 |
9 | 9 | 0.286582 | 1.0 |
10 | 10 | 0.918916 | 1.0 |
11 | 11 | 0.991071 | 1.0 |
12 | 12 | 0.796831 | 1.0 |
check if a DataFrame or a subset of its columns is sorted
we sort x in place
1 | 4 | 0.114935 | 0.0 |
2 | 3 | 0.176625 | 0.0 |
3 | 8 | 0.206873 | 1.0 |
4 | 9 | 0.286582 | 1.0 |
5 | 7 | 0.452015 | 1.0 |
6 | 2 | 0.573132 | 0.0 |
7 | 5 | 0.7864 | 0.0 |
8 | 12 | 0.796831 | 1.0 |
9 | 1 | 0.830334 | 0.0 |
10 | 6 | 0.892598 | 0.0 |
11 | 10 | 0.918916 | 1.0 |
12 | 11 | 0.991071 | 1.0 |
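The check and the in-place sort can be sketched as follows, assuming DataFrames.jl 1.x. Fixed values replace the random ones so the result is reproducible; the column name `id` is an assumption.

```julia
using DataFrames

x = DataFrame(id=1:6,
              x=[0.8, 0.5, 0.1, 0.7, 0.9, 0.4],
              y=[0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

all_sorted = issorted(x)        # compares rows over all columns, starting with id
x_sorted = issorted(x, :x)      # looks only at column x

sort!(x, :x)                    # reorders the rows of `x` in place

x_sorted_after = issorted(x, :x)
```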
now we create a new DataFrame
1 | 1 | 0.830334 | 0.0 |
2 | 2 | 0.573132 | 0.0 |
3 | 3 | 0.176625 | 0.0 |
4 | 4 | 0.114935 | 0.0 |
5 | 5 | 0.7864 | 0.0 |
6 | 6 | 0.892598 | 0.0 |
7 | 7 | 0.452015 | 1.0 |
8 | 8 | 0.206873 | 1.0 |
9 | 9 | 0.286582 | 1.0 |
10 | 10 | 0.918916 | 1.0 |
11 | 11 | 0.991071 | 1.0 |
12 | 12 | 0.796831 | 1.0 |
here we sort by two columns, first is decreasing, second is increasing
1 | 8 | 0.206873 | 1.0 |
2 | 9 | 0.286582 | 1.0 |
3 | 7 | 0.452015 | 1.0 |
4 | 12 | 0.796831 | 1.0 |
5 | 10 | 0.918916 | 1.0 |
6 | 11 | 0.991071 | 1.0 |
7 | 4 | 0.114935 | 0.0 |
8 | 3 | 0.176625 | 0.0 |
9 | 2 | 0.573132 | 0.0 |
10 | 5 | 0.7864 | 0.0 |
11 | 1 | 0.830334 | 0.0 |
12 | 6 | 0.892598 | 0.0 |
1 | 8 | 0.206873 | 1.0 |
2 | 9 | 0.286582 | 1.0 |
3 | 7 | 0.452015 | 1.0 |
4 | 12 | 0.796831 | 1.0 |
5 | 10 | 0.918916 | 1.0 |
6 | 11 | 0.991071 | 1.0 |
7 | 4 | 0.114935 | 0.0 |
8 | 3 | 0.176625 | 0.0 |
9 | 2 | 0.573132 | 0.0 |
10 | 5 | 0.7864 | 0.0 |
11 | 1 | 0.830334 | 0.0 |
12 | 6 | 0.892598 | 0.0 |
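Two equivalent ways of expressing such a mixed-direction sort are sketched below, assuming DataFrames.jl 1.x. The data and the column name `id` are illustrative.

```julia
using DataFrames

x = DataFrame(id=1:6,
              x=[0.8, 0.5, 0.1, 0.7, 0.9, 0.4],
              y=[0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# y decreasing, then x increasing, with one `rev` flag per column
a = sort(x, [:y, :x], rev=[true, false])

# the same ordering, with the direction attached to the column via `order`
b = sort(x, [order(:y, rev=true), :x])

same = a == b
```

`sort` returns a new data frame, so `x` keeps its original row order here.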
this is how you can shuffle rows
1 | 8 | 0.206873 | 1.0 |
2 | 12 | 0.796831 | 1.0 |
3 | 2 | 0.573132 | 0.0 |
4 | 1 | 0.830334 | 0.0 |
5 | 5 | 0.7864 | 0.0 |
6 | 9 | 0.286582 | 1.0 |
7 | 6 | 0.892598 | 0.0 |
8 | 4 | 0.114935 | 0.0 |
9 | 3 | 0.176625 | 0.0 |
10 | 7 | 0.452015 | 1.0 |
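Shuffling can be done by indexing rows with a random permutation. This is a sketch assuming DataFrames.jl 1.x and the `Random` standard library. It permutes all rows, whereas the output above has ten rows, which suggests that only the first ten rows were permuted there.

```julia
using DataFrames
using Random

x = DataFrame(id=1:12, x=rand(12))   # hypothetical data

# index the rows with a random permutation of 1:nrow(x)
shuffled = x[shuffle(1:nrow(x)), :]
```

Row indexing with a vector creates a new data frame, so `x` itself keeps its order.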
it is also easy to swap rows using broadcasted assignment
1 | 10 | 0.918916 | 1.0 |
2 | 2 | 0.573132 | 0.0 |
3 | 3 | 0.176625 | 0.0 |
4 | 4 | 0.114935 | 0.0 |
5 | 5 | 0.7864 | 0.0 |
6 | 6 | 0.892598 | 0.0 |
7 | 7 | 0.452015 | 1.0 |
8 | 8 | 0.206873 | 1.0 |
9 | 9 | 0.286582 | 1.0 |
10 | 1 | 0.830334 | 0.0 |
11 | 11 | 0.991071 | 1.0 |
12 | 12 | 0.796831 | 1.0 |
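A minimal sketch of the swap, assuming DataFrames.jl 1.x. The data and column names are illustrative.

```julia
using DataFrames

x = DataFrame(id=1:12, y=[zeros(6); ones(6)])

# the right-hand side is materialized as a new data frame first,
# then broadcast into rows 1 and 10 of `x`
x[[1, 10], :] .= x[[10, 1], :]
```

Because `x[[10, 1], :]` copies the two rows before the assignment starts, the swap does not overwrite a value it still needs.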
Merging/adding rows
1 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
2 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
3 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
merge by rows: the data frames must have the same column names; `vcat` does the same
1 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
2 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
3 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
4 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
5 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
6 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
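Both spellings of vertical concatenation are sketched below, assuming DataFrames.jl 1.x. The small data frame `x` is hypothetical.

```julia
using DataFrames

x = DataFrame(a=1:3, b=[0.1, 0.2, 0.3])

stacked = [x; x]          # concatenation syntax
stacked2 = vcat(x, x)     # equivalent function call

same = stacked == stacked2
```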
you can efficiently `vcat` a vector of `DataFrame`s using `reduce`
1 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
2 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
3 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
4 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
5 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
6 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
7 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
8 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
9 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
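A minimal sketch of the `reduce` form, assuming DataFrames.jl 1.x. The data is illustrative.

```julia
using DataFrames

x = DataFrame(a=1:3, b=[0.1, 0.2, 0.3])

# concatenate every data frame in the vector in one call
parts = [x, x, x]
combined = reduce(vcat, parts)
```

This avoids splatting a long collection into `vcat(parts...)`.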
create `y` with the column names in a different order
1 | 0.306016 | 0.140855 | 0.338402 | 0.218366 | 0.0294498 |
2 | 0.843511 | 0.4 | 0.0526195 | 0.52931 | 0.271436 |
3 | 0.896884 | 0.321968 | 0.188894 | 0.38624 | 0.32389 |
`vcat` is still possible as it does column name matching
1 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
2 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
3 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
4 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
5 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
6 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
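The name matching can be sketched as follows, assuming DataFrames.jl 1.x. The data is illustrative.

```julia
using DataFrames

x = DataFrame(a=1:2, b=3:4, c=5:6)

# the same columns in reverse order: c, b, a
y = x[:, reverse(names(x))]

# columns are matched by name; the result follows the order of the first argument
v = vcat(x, y)
```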
but column names must still match
ArgumentError("column(s) x1 and x2 are missing from argument(s) 2")
unless you pass `:intersect`, `:union` or specific column names as the keyword argument `cols`
1 | 0.338402 | 0.140855 | 0.306016 |
2 | 0.0526195 | 0.4 | 0.843511 |
3 | 0.188894 | 0.321968 | 0.896884 |
4 | 0.338402 | 0.140855 | 0.306016 |
5 | 0.0526195 | 0.4 | 0.843511 |
6 | 0.188894 | 0.321968 | 0.896884 |
1 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
2 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
3 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
4 | missing | missing | 0.338402 | 0.140855 | 0.306016 |
5 | missing | missing | 0.0526195 | 0.4 | 0.843511 |
6 | missing | missing | 0.188894 | 0.321968 | 0.896884 |
1 | 0.0294498 | 0.306016 |
2 | 0.271436 | 0.843511 |
3 | 0.32389 | 0.896884 |
4 | missing | 0.306016 |
5 | missing | 0.843511 |
6 | missing | 0.896884 |
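The failing call and the three `cols` variants can be sketched like this, assuming DataFrames.jl 1.x. The data frames are hypothetical; `y` deliberately lacks column `a`.

```julia
using DataFrames

x = DataFrame(a=1:2, b=3:4, c=5:6)
y = x[:, [:c, :b]]          # column a is missing here

# without `cols` the mismatch is reported as an ArgumentError
err = try
    vcat(x, y)
    nothing
catch e
    e
end

i = vcat(x, y, cols=:intersect)   # keep only the columns present in both
u = vcat(x, y, cols=:union)       # keep all columns, filling gaps with missing
s = vcat(x, y, cols=[:a, :c])     # keep exactly the listed columns
```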
`append!` modifies `x` in place
1 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
2 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
3 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
4 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
5 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
6 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
here column names must match exactly unless the `cols` keyword argument is passed
1 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
2 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
3 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
4 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
5 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
6 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
7 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
8 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
9 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
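A minimal sketch of `append!`, assuming DataFrames.jl 1.x. The data frames are illustrative.

```julia
using DataFrames

x = DataFrame(a=1:2, b=3:4)
same_order = DataFrame(a=[10], b=[30])
other_order = DataFrame(b=[40], a=[20])   # same names, different order

append!(x, same_order)    # `x` grows in place
append!(x, other_order)   # values are matched to columns by name
```

With the default `cols` setting the set of column names has to be the same in both data frames, as the text above says.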
the standard `repeat` function works on rows; the `inner` and `outer` keyword arguments are also accepted
1 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
2 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
3 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
4 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
5 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
6 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
7 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
8 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
9 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
10 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
11 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
12 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
13 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
14 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
15 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
16 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
17 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
18 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
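The positional and keyword forms of `repeat` are sketched below, assuming DataFrames.jl 1.x. The data is illustrative.

```julia
using DataFrames

x = DataFrame(a=1:2, b=["p", "q"])

twice = repeat(x, 2)                  # whole data frame repeated: rows 1, 2, 1, 2
inner2 = repeat(x, inner=2)           # each row repeated: rows 1, 1, 2, 2
both = repeat(x, inner=2, outer=2)    # inner repetition first, then the block twice
```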
`push!` adds one row to `x` at the end; one must pass the correct number of values unless the `cols` keyword argument is passed
1 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
2 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
3 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
4 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
5 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
6 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
7 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
8 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
9 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
10 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
`push!` also works with dictionaries
1 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
2 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
3 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
4 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
5 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
6 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
7 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
8 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
9 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
10 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
11 | 11.0 | 12.0 | 13.0 | 14.0 | 15.0 |
and `NamedTuple`s via name matching
1 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
2 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
3 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
4 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
5 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
6 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
7 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
8 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
9 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
10 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
11 | 11.0 | 12.0 | 13.0 | 14.0 | 15.0 |
12 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
and `DataFrameRow` also via name matching
1 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
2 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
3 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
4 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
5 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
6 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
7 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
8 | 0.271436 | 0.52931 | 0.0526195 | 0.4 | 0.843511 |
9 | 0.32389 | 0.38624 | 0.188894 | 0.321968 | 0.896884 |
10 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
11 | 11.0 | 12.0 | 13.0 | 14.0 | 15.0 |
12 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
13 | 0.0294498 | 0.218366 | 0.338402 | 0.140855 | 0.306016 |
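The four kinds of row that `push!` accepts in the steps above can be sketched together, assuming DataFrames.jl 1.x. The two-column data frame is a hypothetical stand-in.

```julia
using DataFrames

x = DataFrame(a=[1.0], b=[2.0])

push!(x, [3, 4])                     # values matched by position
push!(x, Dict(:a => 5, :b => 6))     # dictionary keys matched by column name
push!(x, (b=8, a=7))                 # NamedTuple fields matched by name, order is free
push!(x, x[1, :])                    # DataFrameRow matched by name
```

The integer values are converted to the `Float64` element type of the existing columns.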
Please consult the documentation of `push!`, `append!` and `vcat` for the allowed values of the `cols` keyword argument. This keyword argument governs how these functions match the columns of the passed arguments. Also, `append!` and `push!` support a `promote` keyword argument that decides whether column type promotion is allowed.
Let us here just give a quick example of how heterogeneous data can be stored in the data frame using these functionalities:
3-element Vector{NamedTuple}:
(a = 1, b = 2)
(a = missing, b = 10, c = 20)
(b = "s", c = 1, d = 1)
1 | 1 | 2 | missing | missing |
2 | missing | 10 | 20 | missing |
3 | missing | s | 1 | 1 |
and we see that `push!` dynamically added columns as needed and updated their element types
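A result like the one above can be obtained by pushing each `NamedTuple` with `cols=:union`. This is a sketch assuming DataFrames.jl 1.x; the variable names `source` and `df` are illustrative.

```julia
using DataFrames

source = [(a=1, b=2), (a=missing, b=10, c=20), (b="s", c=1, d=1)]

df = DataFrame()
for row in source
    # `cols=:union` adds columns that are new and fills absent ones with missing
    push!(df, row, cols=:union)
end
```

Column `b` ends up holding both integers and a string, so its element type is widened along the way.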
Subsetting/removing rows
1 | 1 | a |
2 | 2 | b |
3 | 3 | c |
4 | 4 | d |
5 | 5 | e |
6 | 6 | f |
7 | 7 | g |
8 | 8 | h |
9 | 9 | i |
10 | 10 | j |
by using indexing
a single row selection creates a DataFrameRow
while this is a DataFrame
this is a view
selects columns 1 and 2
1 | 1 | a |
2 | 2 | b |
3 | 3 | c |
4 | 4 | d |
5 | 5 | e |
6 | 6 | f |
7 | 7 | g |
8 | 8 | h |
9 | 9 | i |
10 | 10 | j |
indexing by a Bool array, exact length match is required
alternatively we can also create a view
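The indexing variants listed above can be sketched as follows, assuming DataFrames.jl 1.x. The column names `id` and `val` are assumptions, and `mask` is an illustrative name.

```julia
using DataFrames

x = DataFrame(id=1:10, val='a':'j')

two_rows = x[1:2, :]          # new DataFrame with rows 1 and 2
one_row = x[1, :]             # DataFrameRow
one_row_df = x[1:1, :]        # DataFrame with a single row
row_view = view(x, 1:2, :)    # SubDataFrame, no copying
col_view = view(x, :, 1:2)    # view of columns 1 and 2

# a Bool mask must have exactly one entry per row
mask = repeat([true, false], 5)
picked = x[mask, :]
picked_view = view(x, mask, :)
```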
we can delete one row in place
1 | 1 | a |
2 | 2 | b |
3 | 3 | c |
4 | 4 | d |
5 | 5 | e |
6 | 6 | f |
7 | 8 | h |
8 | 9 | i |
9 | 10 | j |
or a collection of rows, also in place
you can also create a new DataFrame when deleting rows using Not indexing
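Row deletion can be sketched like this. It assumes DataFrames.jl 1.3 or later, where `deleteat!` is defined for data frames; the column names are assumptions.

```julia
using DataFrames

x = DataFrame(id=1:10, val='a':'j')

deleteat!(x, 7)       # drop row 7 in place
deleteat!(x, 6:7)     # drop rows 6 and 7 of the already shortened `x`

# `Not` selects everything except the given rows and returns a new DataFrame
rest = x[Not(1:2), :]
```

Note that the second `deleteat!` uses row positions after the first deletion, so it removes the rows with `id` 6 and 8.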
now we move to row filtering
create a new DataFrame where the filtering function operates on a `DataFrameRow`
the same but as a view
or
in-place modification of `x`, using the `do`-block syntax for a more complex transformation
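The filtering variants can be sketched as follows, assuming DataFrames.jl 1.x. The data and the conditions are illustrative.

```julia
using DataFrames

x = DataFrame(a=1:4, b=2:5)

f1 = filter(r -> r.a > 2, x)               # predicate receives a DataFrameRow
f2 = filter(r -> r.a > 2, x, view=true)    # same rows, returned as a view
f3 = filter(:a => >(2), x)                 # predicate receives values of column a

# in place, with a do-block for a longer condition
filter!(x) do r
    r.a > 1 && iseven(r.b)
end
```

The `column => predicate` form is usually faster than the row-wise form because it does not construct a `DataFrameRow` for every row.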
A common operation is selection of rows for which a value in a column is contained in a given set. Here are a few ways in which you can achieve this.
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
4 | 4 | 4 |
5 | 5 | 1 |
6 | 6 | 2 |
7 | 7 | 3 |
8 | 8 | 4 |
9 | 9 | 1 |
10 | 10 | 2 |
11 | 11 | 3 |
12 | 12 | 4 |
We select rows for which column `y` has value `1` or `4`.
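Three ways of writing this selection are sketched below, assuming DataFrames.jl 1.x. The column name `x` and the construction of `df` are assumptions chosen to match the table above.

```julia
using DataFrames

df = DataFrame(x=1:12, y=mod1.(1:12, 4))

r1 = filter(row -> row.y in [1, 4], df)     # row-wise predicate
r2 = filter(:y => in([1, 4]), df)           # predicate on the values of column y
r3 = df[in.(df.y, Ref([1, 4])), :]          # Bool mask built with broadcasting
```

`Ref` keeps the set `[1, 4]` from being broadcast over element by element.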
DataFrames.jl also provides a subset function that works on whole columns and allows for multiple conditions:
Similarly, an in-place `subset!` function is provided.
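A minimal sketch of `subset` and `subset!`, assuming DataFrames.jl 1.x. The data and the conditions are illustrative.

```julia
using DataFrames

df = DataFrame(x=1:12, y=mod1.(1:12, 4))

# each function receives a whole column and returns a Bool vector;
# a row is kept only if all conditions hold
s1 = subset(df, :x => (v -> v .< 5), :y => (v -> v .== 1))

# `ByRow` lifts an element-wise predicate to a whole column
s2 = subset(df, :x => ByRow(<(5)), :y => ByRow(==(1)))

# in-place variant
subset!(df, :y => ByRow(in([1, 4])))
```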