Selecting rows¶
using DataFrames
using Statistics
using Random
Random.seed!(1);
df = DataFrame(rand(4, 5), :auto)using : as row selector will copy columns
df[:, :]this is the same as
copy(df)you can get a subset of rows of a data frame without copying using view to get a SubDataFrame
sdf = view(df, 1:3, 1:3)you still have a detailed reference to the parent
parent(sdf), parentindices(sdf)(4×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────
1 │ 0.0491718 0.691857 0.840384 0.198521 0.802561
2 │ 0.119079 0.767518 0.89077 0.00819786 0.661425
3 │ 0.393271 0.087253 0.138227 0.592041 0.347513
4 │ 0.0240943 0.855718 0.347737 0.801055 0.778149, (1:3, 1:3))selecting a single row returns a DataFrameRow object which is also a view
dfr = df[3, :]parent(dfr), parentindices(dfr), rownumber(dfr)(4×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────
1 │ 0.0491718 0.691857 0.840384 0.198521 0.802561
2 │ 0.119079 0.767518 0.89077 0.00819786 0.661425
3 │ 0.393271 0.087253 0.138227 0.592041 0.347513
4 │ 0.0240943 0.855718 0.347737 0.801055 0.778149, (3, Base.OneTo(5)), 3)let us add a column to a data frame by assigning a scalar broadcasting
df[!, :Z] .= 14-element Vector{Int64}:
1
1
1
1dfEarlier we used : for column selection in a view (SubDataFrame and DataFrameRow). In this case a view will have all columns of the parent after the parent is mutated.
dfrparent(dfr), parentindices(dfr), rownumber(dfr)(4×6 DataFrame
Row │ x1 x2 x3 x4 x5 Z
│ Float64 Float64 Float64 Float64 Float64 Int64
─────┼────────────────────────────────────────────────────────────
1 │ 0.0491718 0.691857 0.840384 0.198521 0.802561 1
2 │ 0.119079 0.767518 0.89077 0.00819786 0.661425 1
3 │ 0.393271 0.087253 0.138227 0.592041 0.347513 1
4 │ 0.0240943 0.855718 0.347737 0.801055 0.778149 1, (3, Base.OneTo(6)), 3)Note that parent and parentindices refer to the true source of data for a DataFrameRow and rownumber refers to row number in the direct object that was used to create DataFrameRow
df = DataFrame(a=1:4)dfv = view(df, [3, 2], :)dfr = dfv[1, :]parent(dfr), parentindices(dfr), rownumber(dfr)(4×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4, (3, Base.OneTo(1)), 1)Reordering rows¶
We create some random data frame (and hope that x.x is not sorted :), which is quite likely with 12 rows)
x = DataFrame(id=1:12, x=rand(12), y=[zeros(6); ones(6)])check if a DataFrame or a subset of its columns is sorted
issorted(x), issorted(x, :x)(true, false)we sort x in place
sort!(x, :x)now we create a new DataFrame
y = sort(x, :id)here we sort by two columns, first is decreasing, second is increasing
sort(x, [:y, :x], rev=[true, false])sort(x, [order(:y, rev=true), :x]) ## the same as abovethis is how you can shuffle rows
x[shuffle(1:10), :]it is also easy to swap rows using broadcasted assignment
sort!(x, :id)
x[[1, 10], :] .= x[[10, 1], :]
xMerging/adding rows¶
x = DataFrame(rand(3, 5), :auto)merge by rows - data frames must have the same column names; the same is vcat
[x; x]you can efficiently vcat a vector of DataFrames using reduce
reduce(vcat, [x, x, x])get y with other order of names
y = x[:, reverse(names(x))]vcat is still possible as it does column name matching
vcat(x, y)but column names must still match
try
vcat(x, y[:, 1:3])
catch e
show(e)
endArgumentError("column(s) x1 and x2 are missing from argument(s) 2")unless you pass :intersect, :union or specific column names as keyword argument cols
vcat(x, y[:, 1:3], cols=:intersect)vcat(x, y[:, 1:3], cols=:union)vcat(x, y[:, 1:3], cols=[:x1, :x5])append! modifies x in place
append!(x, x)here column names must match exactly unless cols keyword argument is passed
append!(x, y)standard repeat function works on rows; also inner and outer keyword arguments are accepted
repeat(x, 2)push! adds one row to x at the end; one must pass a correct number of values unless cols keyword argument is passed
push!(x, 1:5)
xpush! also works with dictionaries
push!(x, Dict(:x1 => 11, :x2 => 12, :x3 => 13, :x4 => 14, :x5 => 15))
xand NamedTuples via name matching
push!(x, (x2=2, x1=1, x4=4, x3=3, x5=5))and DataFrameRow also via name matching
push!(x, x[1, :])Please consult the documentation of push!, append! and vcat for allowed values of cols keyword argument.
This keyword argument governs the way these functions perform column matching of passed arguments. Also append! and push! support a promote keyword argument that decides if column type promotion is allowed.
Let us here just give a quick example of how heterogeneous data can be stored in the data frame using these functionalities:
source = [(a=1, b=2), (a=missing, b=10, c=20), (b="s", c=1, d=1)]3-element Vector{NamedTuple}:
(a = 1, b = 2)
(a = missing, b = 10, c = 20)
(b = "s", c = 1, d = 1)df = DataFrame()for row in source
push!(df, row, cols=:union) ## if cols is :union then promote is true by default
enddfand we see that push! dynamically added columns as needed and updated their element types
Subsetting/removing rows¶
x = DataFrame(id=1:10, val='a':'j')by using indexing
x[1:2, :]a single row selection creates a DataFrameRow
x[1, :]while this is a DataFrame
x[1:1, :]this is a view
view(x, 1:2, :)selects columns 1 and 2
view(x, :, 1:2)indexing by a Bool array, exact length match is required
x[repeat([true, false], 5), :]alternatively we can also create a view
view(x, repeat([true, false], 5), :)we can delete one row in place
deleteat!(x, 7)or a collection of rows, also in place
deleteat!(x, 6:7)you can also create a new DataFrame when deleting rows using Not indexing
x[Not(1:2), :]xnow we move to row filtering
x = DataFrame([1:4, 2:5, 3:6], :auto)create a new DataFrame where filtering function operates on DataFrameRow
filter(r -> r.x1 > 2.5, x)the same but as a view
filter(r -> r.x1 > 2.5, x, view=true)or
filter(:x1 => >(2.5), x)in place modification of x, using the do-block syntax for a more complex transformation
filter!(x) do r
if r.x1 > 2.5
return r.x2 < 4.5
end
r.x3 < 3.5
endA common operation is selection of rows for which a value in a column is contained in a given set. Here are a few ways in which you can achieve this.
df = DataFrame(x=1:12, y=mod1.(1:12, 4))We select rows for which column y has value 1 or 4.
filter(row -> row.y in [1, 4], df)filter(:y => in([1, 4]), df)df[in.(df.y, Ref([1, 4])), :]DataFrames.jl also provides a subset function that works on whole columns and allows for multiple conditions:
x = DataFrame([1:4, 2:5, 3:6], :auto)subset(x, :x1 => x -> x .< mean(x), :x2 => ByRow(<(2.5)))Similarly an in-place subset! function is provided.
Deduplicating¶
x = DataFrame(A=[1, 2], B=["x", "y"])
append!(x, x)
x.C = 1:4
xget first unique rows for given index
unique(x, [1, 2])now we look at whole rows
unique(x)get indicators of non-unique rows
nonunique(x, :A)4-element Vector{Bool}:
0
0
1
1modify x in place
unique!(x, :B)Extracting one row from a DataFrame into standard collections¶
x = DataFrame(x=[1, missing, 2], y=["a", "b", missing], z=[true, false, true])cols = [:y, :z]2-element Vector{Symbol}:
:y
:zyou can convert it to a Vector or an Array
Vector(x[1, cols])2-element Vector{Any}:
"a"
truethe same as
Array(x[1, cols])2-element Vector{Any}:
"a"
trueget a vector of vectors
[Vector(x[i, cols]) for i in axes(x, 1)]3-element Vector{Vector{Any}}:
["a", true]
["b", false]
[missing, true]it is easy to convert a DataFrameRow into a NamedTuple
copy(x[1, cols])@NamedTuple{y::Union{Missing, String}, z::Bool}(("a", true))or a Tuple
Tuple(x[1, cols])("a", true)Working with a collection of rows of a data frame¶
You can use eachrow to get a vector-like collection of DataFrameRows
df = DataFrame(reshape(1:12, 3, 4), :auto)er_df = eachrow(df)er_df[1]last(er_df)er_df[end]As DataFrameRows objects keeps connection to the parent data frame you can get the columns of the parent using getproperty
er_df.x13-element Vector{Int64}:
1
2
3Flattening a data frame¶
Occasionally you have a data frame whose one column is a vector of collections. You can expand (flatten) such a column using the flatten function
df = DataFrame(a='a':'c', b=[[1, 2, 3], [4, 5], 6])flatten(df, :b)Only one row¶
only from Julia Base is also supported in DataFrames.jl and succeeds if the data frame in question has only one row, in which case it is returned.
df = DataFrame(a=1)only(df)df2 = repeat(df, 2)Errors
try
only(df2)
catch e
show(e)
endArgumentError("data frame must contain exactly 1 row, got 2")This notebook was generated using Literate.jl.