using DataFrames

Let's start by creating a DataFrame object, x, so that we can learn how to get information on that data frame.

x = DataFrame(A=[1, 2], B=[1.0, missing], C=["a", "b"])

The standard size function works to get the dimensions of the DataFrame,

size(x), size(x, 1), size(x, 2)
((2, 3), 2, 3)

as well as nrow and ncol, familiar from R.

nrow(x), ncol(x)
(2, 3)

describe gives basic summary statistics of the data in your DataFrame (see the help of describe for how to customize the statistics shown).

describe(x)

You can limit the columns shown by describe using the cols keyword argument:

describe(x, cols=1:2)

names returns the names of all columns as strings:
names(x)
3-element Vector{String}:
 "A"
 "B"
 "C"

You can also get the names of columns with a given element type (eltype):

names(x, String)
1-element Vector{String}:
 "C"

Use propertynames to get a vector of Symbols:

propertynames(x)
3-element Vector{Symbol}:
 :A
 :B
 :C

Broadcasting eltype over eachcol(x) returns the element types of the columns:

eltype.(eachcol(x))
3-element Vector{Type}:
 Int64
 Union{Missing, Float64}
 String

Here we create a larger DataFrame
y = DataFrame(rand(1:10, 1000, 10), :auto)

and then we can use first to peek into its first few rows

first(y, 5)

and last to see its bottom rows.

last(y, 3)

Using first and last without a number of rows returns the first/last DataFrameRow of the DataFrame:

first(y)

last(y)

Displaying large data frames¶
Create a wide and tall data frame:
df = DataFrame(rand(100, 100), :auto)

We can see that 92 of its columns were not printed. Also, only its first 30 rows are shown. You can easily change this behavior by changing the values of ENV["LINES"] and ENV["COLUMNS"].
withenv("LINES" => 10, "COLUMNS" => 200) do
    show(df)
end

100×100 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 ⋯
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Floa ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.139319 0.183546 0.540498 0.688563 0.510201 0.55724 0.662957 0.16551 0.731465 0.413509 0.848207 0.370414 0.290568 0.367343 0.028806 0.583496 0.625629 0.862849 0.52102 0.97 ⋯
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
81 columns and 99 rows omitted

Most elementary get and set operations¶
Given the DataFrame x we have created earlier, here are various ways to grab one of its columns as a Vector.
x = DataFrame(A=[1, 2], B=[1.0, missing], C=["a", "b"])

All of the following get the vector stored in our DataFrame without copying it:

x.A, x[!, 1], x[!, :A]
([1, 2], [1, 2], [1, 2])

The same using string indexing:

x."A", x[!, "A"]
([1, 2], [1, 2])

Note that this creates a copy:

x[:, 1]
2-element Vector{Int64}:
 1
 2

x[:, 1] === x[:, 1]
false

To grab one row as a DataFrame, we can index as follows.

x[1:1, :]

This produces a DataFrameRow, which is treated as a 1-dimensional object similar to a NamedTuple:

x[1, :]

We can grab a single cell or element with the same syntax used to grab an element of an array.

x[1, 1]
1

Or we can grab a new DataFrame that is a subset of rows and columns:

x[1:2, 1:2]

You can also use a Regex to select columns, and Not from InvertedIndices.jl to select both rows and columns:

x[Not(1), r"A"]

! indicates that the underlying columns are not copied:

x[!, Not(1)]

: means that the columns will get copied:

x[:, Not(1)]

A scalar can be assigned to a range of a data frame using broadcasting:
x[1:2, 1:2] .= 1
x

A vector whose length equals the number of assigned rows can be assigned using broadcasting:

x[1:2, 1:2] .= [1, 2]
x

Another data frame of matching size and column names can also be assigned, again using broadcasting:

x[1:2, 1:2] .= DataFrame([5 6; 7 8], [:A, :B])
x

Caution
With the df[!, :col] and df.col syntax you get direct (non-copying) access to a column of a data frame.
This is potentially unsafe, as you can easily corrupt data in the df data frame if you resize, sort, etc. the column obtained in this way.
Therefore such access should be used with caution.
Similarly, df[!, cols], when cols is a collection of columns, produces a new data frame that holds the same (not copied) columns as the source df data frame. Modifying the data frame obtained via df[!, cols] might therefore cause problems with the consistency of df.
The df[:, :col] and df[:, cols] syntaxes always copy columns so they are safe to use (and should generally be preferred except for performance or memory critical use cases).
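The difference between the two access styles can be seen in a minimal sketch. The data frame scores and its columns are hypothetical names introduced only for this illustration; the sketch assumes nothing beyond the indexing rules described in the caution above.

```julia
using DataFrames

# `scores` is a hypothetical data frame used only for this illustration
scores = DataFrame(id=[3, 1, 2], name=["c", "a", "b"])

alias = scores[!, :id]    # the vector stored in `scores` (no copy)
copied = scores[:, :id]   # a newly allocated copy

alias === scores.id       # true: same object
copied === scores.id      # false: different object

# Sorting the copy leaves the data frame untouched.
sort!(copied)
scores.id == [3, 1, 2]    # true

# Sorting the alias reorders :id in place but not :name,
# so the rows of `scores` no longer line up.
sort!(alias)
scores.id == [1, 2, 3]            # true
scores.name == ["c", "a", "b"]    # true: :name was not reordered
```

After the last step the data frame still displays without error, because all columns have equal length, yet each id is now paired with the wrong name. This is the kind of silent corruption the caution warns about.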
Here are examples of how Cols and Between can be used to select columns of a data frame.
x = DataFrame(rand(4, 5), :auto)

x[:, Between(:x2, :x4)]

x[:, Cols("x1", Between("x2", "x4"))]

Views¶
You can simply create a view of a DataFrame (it is more efficient than creating a materialized selection). Here are the possible return value options.
@view x[1:2, 1]
2-element view(::Vector{Float64}, 1:2) with eltype Float64:
 0.694839744622125
 0.8105602542247192

@view x[1, 1]
0-dimensional view(::Vector{Float64}, 1) with eltype Float64:
0.694839744622125

A DataFrameRow, the same as for x[1, 1:2] without a view:

@view x[1, 1:2]

A SubDataFrame:

@view x[1:2, 1:2]

Adding new columns to a data frame¶
df = DataFrame()

Using setproperty! (element assignment):

x = [1, 2, 3]
df.a = x
df

No copy is performed (the column shares the same memory address as x):

df.a === x
true

Using setindex!:

df[!, :b] = x
df[:, :c] = x
df

With !, no copy is performed:

df.b === x
true

With :, the vector is copied, so ! and : have different effects:

df.c === x
false

Element-wise (broadcasting) assignment:

df[!, :d] .= x
df[:, :e] .= x
df

Both copy, so in this case ! and : have the same effect:

df.d === x, df.e === x
(false, false)

Note that in our data frame columns :a and :b store the vector x itself (not a copy):

df.a === df.b === x
true

This can lead to silent errors. For example, the following code leads to a bug (note that calling pairs on eachcol(df) creates an iterator of (column name, column) pairs):
try
    for (n, c) in pairs(eachcol(df))
        println("$n: ", pop!(c))
    end
catch e
    show(e)
end

a: 3
b: 2
c: 3
d: 3
e: 3

Note that for column :b we printed 2, because 3 had already been removed from it when we called pop! on column :a.
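The mechanism behind this can be isolated in a small sketch. The names v and demo are hypothetical and chosen so that they do not overwrite the x and df used in this section; the sketch relies only on the assignment rules shown above.

```julia
using DataFrames

v = [1, 2, 3]
demo = DataFrame()
demo.a = v           # stores v itself
demo[!, :b] = v      # stores v itself again
demo[:, :c] = v      # stores a copy of v

# Columns :a and :b are one and the same vector, so a single pop!
# shortens both of them, while the copied column :c is unaffected.
pop!(demo.a)

shared_lengths = (length(demo.a), length(demo.b), length(demo.c))  # (2, 2, 3)
```

Checking demo.a === demo.b before mutating a column in place is one way to spot this kind of sharing.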
Such mistakes sometimes happen. Because of this DataFrames.jl performs consistency checks before doing an expensive operation (most notably before showing a data frame).
try
    show(df)
catch e
    show(e)
end

AssertionError("Data frame is corrupt: length of column :c (2) does not match length of column 1 (1). The column vector has likely been resized unintentionally (either directly or because it is shared with another data frame).")

We can investigate the columns to find out what happened:
collect(pairs(eachcol(df)))
5-element Vector{Pair{Symbol, AbstractVector}}:
 :a => [1]
 :b => [1]
 :c => [1, 2]
 :d => [1, 2]
 :e => [1, 2]

The output confirms that the data frame df got corrupted.
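One way to avoid this kind of corruption is to make sure that every column owns its own vector. The sketch below is only an illustration under that assumption; the names v and safe are hypothetical, and it uses only the assignment forms shown earlier in this section plus Base's copy.

```julia
using DataFrames

v = [1, 2, 3]
safe = DataFrame()
safe.a = copy(v)     # explicit copy before storing
safe[:, :b] = v      # `:` copies on assignment
safe[!, :c] .= v     # broadcasting assignment allocates a new column

# No two columns share memory, and none of them is v itself.
safe.a === v, safe.b === v, safe.c === v    # (false, false, false)

# The loop that corrupted df above now shortens every column exactly once.
for col in eachcol(safe)
    pop!(col)
end

size(safe)    # (2, 3): all columns still have equal length
length(v)     # 3: the source vector is untouched
```

Resizing columns in place is still best avoided, but with independent columns a mistake like the one above at least does not leave columns of different lengths.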
DataFrames.jl supports a complete set of getindex, getproperty, setindex!, setproperty!, view, broadcasting, and broadcasting assignment operations. The details are explained here: http://
Comparisons¶
using DataFrames

df = DataFrame(rand(2, 3), :auto)

df2 = copy(df)

== compares column names and contents:

df == df2
true

Create a minimally different data frame and use isapprox for comparison:

df3 = df2 .+ eps()

df == df3
false

isapprox(df, df3)
true

isapprox(df, df3, atol=eps() / 2)
false

Missing values are handled as in Julia Base:

df = DataFrame(a=missing)

The equality test returns missing.

df == df
missing

Is it the same object?

df === df
true

isequal(df, df)
true

This notebook was generated using Literate.jl.