Constructors and conversion#
using DataFrames
using Random
Constructors#
In this section, you’ll see many ways to create a DataFrame using the DataFrame() constructor.
First, we could create an empty DataFrame.
DataFrame()
Or we could call the constructor using keyword arguments to add columns to the DataFrame.
DataFrame(A=1:3, B=rand(3), C=randstring.([3, 3, 3]), fixed=1)
Row | A | B | C | fixed |
---|---|---|---|---|
Int64 | Float64 | String | Int64 | |
1 | 1 | 0.931343 | 7d7 | 1 |
2 | 2 | 0.83357 | smC | 1 |
3 | 3 | 0.459508 | oS6 | 1 |
Note in column :fixed that scalars are broadcast automatically.
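As a small sketch of this broadcasting rule (the column names id and flag are made up for illustration), a scalar is repeated to match the length of the vector columns, while a call with only scalars yields a single row:

```julia
using DataFrames

# A scalar passed alongside a vector is repeated to the vector's length.
df = DataFrame(id=1:3, flag=true)
@show size(df)
@show df.flag

# With only scalars there is nothing to broadcast against, so we get one row.
df_single = DataFrame(id=1, flag=true)
@show nrow(df_single)
```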
We can create a DataFrame from a dictionary, in which case keys from the dictionary will be sorted to create the DataFrame columns.
x = Dict("A" => [1, 2], "B" => [true, false], "C" => ['a', 'b'], "fixed" => Ref([1, 1]))
DataFrame(x)
Row | A | B | C | fixed |
---|---|---|---|---|
Int64 | Bool | Char | Array… | |
1 | 1 | true | a | [1, 1] |
2 | 2 | false | b | [1, 1] |
This time we used Ref to protect the vector from being treated as a column, which forces it to be broadcast into every row of the :fixed column (note that the [1, 1] vector is aliased in each row).
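Aliasing means that every row of the column refers to one and the same vector. A minimal sketch of the consequence, reusing the dictionary form from above with fewer columns:

```julia
using DataFrames

d = Dict("A" => [1, 2], "fixed" => Ref([1, 1]))
df = DataFrame(d)

# Both cells hold the very same vector object, not two copies.
@show df.fixed[1] === df.fixed[2]

# Mutating the vector through one row is therefore visible in the other row.
push!(df.fixed[1], 99)
@show df.fixed[2]
```

If independent vectors per row are needed, one option is to build the column explicitly, for example with a comprehension that creates a fresh vector for each row.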
Rather than explicitly creating a dictionary first, as above, we could pass DataFrame arguments with the syntax of dictionary key-value pairs.
Note that in this case we use Symbols to denote the column names, and the arguments are not sorted. For example, the symbol :A produces A, the name of the first column here:
DataFrame(:A => [1, 2], :B => [true, false], :C => ['a', 'b'])
Row | A | B | C |
---|---|---|---|
Int64 | Bool | Char | |
1 | 1 | true | a |
2 | 2 | false | b |
Although using Symbols rather than strings to denote column names is generally preferred (as it is faster), DataFrames.jl also accepts strings as column names, so this works too:
DataFrame("A" => [1, 2], "B" => [true, false], "C" => ['a', 'b'])
Row | A | B | C |
---|---|---|---|
Int64 | Bool | Char | |
1 | 1 | true | a |
2 | 2 | false | b |
You can also pass a vector of pairs, which is useful if it is constructed programmatically:
DataFrame([:A => [1, 2], :B => [true, false], :C => ['a', 'b'], :fixed => "const"])
Row | A | B | C | fixed |
---|---|---|---|---|
Int64 | Bool | Char | String | |
1 | 1 | true | a | const |
2 | 2 | false | b | const |
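To illustrate the programmatic case, here is a sketch in which the pairs are generated by a comprehension. The list colnames is hypothetical; in practice the names might come from a file header or a configuration.

```julia
using DataFrames

# Hypothetical column names used only for this illustration.
colnames = [:a, :b, :c]

# Build one name => column pair per name; column i is filled with the value i.
cols = [name => fill(i, 2) for (i, name) in enumerate(colnames)]

df = DataFrame(cols)
@show names(df)
@show df.b
```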
Here we create a DataFrame from a vector of vectors, and each vector becomes a column.
DataFrame([rand(3) for i in 1:3], :auto)
Row | x1 | x2 | x3 |
---|---|---|---|
Float64 | Float64 | Float64 | |
1 | 0.459857 | 0.379615 | 0.809887 |
2 | 0.877466 | 0.227315 | 0.204452 |
3 | 0.918579 | 0.993363 | 0.349277 |
DataFrame([rand(3) for i in 1:3], [:x1, :x2, :x3])
Row | x1 | x2 | x3 |
---|---|---|---|
Float64 | Float64 | Float64 | |
1 | 0.858108 | 0.119499 | 0.353078 |
2 | 0.75442 | 0.540551 | 0.0184252 |
3 | 0.415757 | 0.344261 | 0.0770455 |
DataFrame([rand(3) for i in 1:3], ["x1", "x2", "x3"])
Row | x1 | x2 | x3 |
---|---|---|---|
Float64 | Float64 | Float64 | |
1 | 0.408632 | 0.368194 | 0.651141 |
2 | 0.425464 | 0.0690646 | 0.999539 |
3 | 0.692819 | 0.934759 | 0.074886 |
As you can see, you either pass a vector of column names as the second argument or pass :auto, in which case column names are generated automatically.
In particular, passing a vector of scalars to the DataFrame constructor is not allowed.
try
DataFrame([1, 2, 3])
catch e
show(e)
end
ArgumentError("'Vector{Int64}' iterates 'Int64' values, which doesn't satisfy the Tables.jl `AbstractRow` interface")
Instead, use a transposed vector if you have a vector of single values. This effectively passes a two-dimensional array to the constructor, which is supported in the same way as the vector-of-vectors case.
DataFrame(permutedims([1, 2, 3]), :auto)
Row | x1 | x2 | x3 |
---|---|---|---|
Int64 | Int64 | Int64 | |
1 | 1 | 2 | 3 |
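The two ways of turning a plain vector into a DataFrame give different shapes. A short sketch contrasting them (the names tall and wide are chosen here for illustration):

```julia
using DataFrames

v = [1, 2, 3]

# One column with three rows: give the vector a column name.
tall = DataFrame(x=v)

# One row with three columns: turn the vector into a 1×3 matrix first.
wide = DataFrame(permutedims(v), :auto)

@show size(tall)
@show size(wide)
@show names(wide)
```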
You can also pass a vector of NamedTuples to construct a DataFrame:
v = [(a=1, b=2), (a=3, b=4)]
DataFrame(v)
Row | a | b |
---|---|---|
Int64 | Int64 | |
1 | 1 | 2 |
2 | 3 | 4 |
Alternatively, you can pass a NamedTuple of vectors:
n = (a=1:3, b=11:13)
DataFrame(n)
Row | a | b |
---|---|---|
Int64 | Int64 | |
1 | 1 | 11 |
2 | 2 | 12 |
3 | 3 | 13 |
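Both layouts can also be recovered from an existing DataFrame. The following is a sketch that assumes the Tables.jl functions rowtable and columntable, reached through DataFrames.Tables so that no additional package has to be loaded explicitly:

```julia
using DataFrames

df = DataFrame(a=1:3, b=11:13)

rows = DataFrames.Tables.rowtable(df)      # vector of NamedTuples, one per row
cols = DataFrames.Tables.columntable(df)   # NamedTuple of vectors, one per column

@show rows[1]
@show cols.b

# Both representations can be passed straight back to the constructor.
@show DataFrame(rows) == df
@show DataFrame(cols) == df
```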
Here we create a DataFrame from a matrix,
DataFrame(rand(3, 4), :auto)
Row | x1 | x2 | x3 | x4 |
---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | |
1 | 0.573658 | 0.691278 | 0.392694 | 0.152718 |
2 | 0.0506909 | 0.672677 | 0.957157 | 0.949021 |
3 | 0.429574 | 0.798447 | 0.760203 | 0.835569 |
and here we do the same but also pass column names.
DataFrame(rand(3, 4), Symbol.('a':'d'))
Row | a | b | c | d |
---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | |
1 | 0.47566 | 0.00975125 | 0.383663 | 0.0221255 |
2 | 0.022451 | 0.27392 | 0.280785 | 0.859176 |
3 | 0.0721469 | 0.571895 | 0.330737 | 0.736976 |
or
DataFrame(rand(3, 4), string.('a':'d'))
Row | a | b | c | d |
---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | |
1 | 0.0388881 | 0.83767 | 0.260397 | 0.0666654 |
2 | 0.898531 | 0.352449 | 0.978764 | 0.834269 |
3 | 0.0283187 | 0.597955 | 0.228103 | 0.955276 |
This is how you can create a data frame with no rows, but with predefined columns and their types:
DataFrame(A=Int[], B=Float64[], C=String[])
Row | A | B | C |
---|---|---|---|
Int64 | Float64 | String |
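A typical reason to predefine the columns is to fill the data frame row by row afterwards. A minimal sketch, assuming rows are appended with push!:

```julia
using DataFrames

df = DataFrame(A=Int[], B=Float64[], C=String[])

# Rows can be given positionally as a tuple or by name as a NamedTuple.
push!(df, (1, 2.0, "x"))
push!(df, (A=2, B=3.5, C="y"))

@show nrow(df)
@show eltype.(eachcol(df))
```

The column element types stay as declared, so a value that cannot be converted to the declared type is rejected rather than silently changing the column.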
Finally, we can create a DataFrame by copying an existing DataFrame.
Note that copy also copies the vectors.
x = DataFrame(a=1:2, b='a':'b')
y = copy(x)
(x === y), isequal(x, y), (x.a == y.a), (x.a === y.a)
(false, true, true, false)
Calling DataFrame on a DataFrame object works like copy.
x = DataFrame(a=1:2, b='a':'b')
y = DataFrame(x)
(x === y), isequal(x, y), (x.a == y.a), (x.a === y.a)
(false, true, true, false)
You can avoid copying the columns of a data frame (when possible) by passing the copycols=false keyword argument:
x = DataFrame(a=1:2, b='a':'b')
y = DataFrame(x, copycols=false)
(x === y), isequal(x, y), (x.a == y.a), (x.a === y.a)
(false, true, true, true)
The same rule applies to other constructors:
a = [1, 2, 3]
df1 = DataFrame(a=a)
df2 = DataFrame(a=a, copycols=false)
df1.a === a, df2.a === a
(false, true)
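The practical consequence of copycols=false is that later changes to the source vector show up in the data frame. A short sketch (the names df_copy and df_alias are illustrative):

```julia
using DataFrames

a = [1, 2, 3]
df_copy = DataFrame(a=a)                    # column is a copy of `a`
df_alias = DataFrame(a=a, copycols=false)   # column is `a` itself

a[1] = 100   # modify the source vector

@show df_copy.a[1]    # unaffected by the change
@show df_alias.a[1]   # reflects the change
```

Skipping the copy saves memory and time for large columns, at the price of this shared state.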
You can create a similar uninitialized DataFrame based on an original one:
x = DataFrame(a=1, b=1.0)
Row | a | b |
---|---|---|
Int64 | Float64 | |
1 | 1 | 1.0 |
similar(x)
Row | a | b |
---|---|---|
Int64 | Float64 | |
1 | 139694967276384 | 0.0 |
The number of rows in the new DataFrame can be passed as a second argument:
similar(x, 0)
Row | a | b |
---|---|---|
Int64 | Float64 |
similar(x, 2)
Row | a | b |
---|---|---|
Int64 | Float64 | |
1 | 139695495337856 | 6.90187e-310 |
2 | 139695495350848 | 6.90187e-310 |
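Because the cells are uninitialized, the values printed above are arbitrary and will differ between runs. A sketch of the usual pattern, which assigns every cell before reading it (filling with zeros is just an example):

```julia
using DataFrames

x = DataFrame(a=1, b=1.0)
y = similar(x, 3)

# The cells of `y` hold arbitrary values at this point, so assign before reading.
fill!(y.a, 0)
fill!(y.b, 0.0)

println(y)
```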
You can also create a new DataFrame from a SubDataFrame or a DataFrameRow (both are discussed in detail later in the tutorial; in particular, although a DataFrameRow is considered a 1-dimensional object similar to a NamedTuple, it gets converted to a 1-row DataFrame for convenience):
x = DataFrame(a=1, b=1.0)
sdf = view(x, [1, 1], :)
Row | a | b |
---|---|---|
Int64 | Float64 | |
1 | 1 | 1.0 |
2 | 1 | 1.0 |
typeof(sdf)
DataFrames.SubDataFrame{DataFrames.DataFrame, DataFrames.Index, Vector{Int64}}
DataFrame(sdf)
Row | a | b |
---|---|---|
Int64 | Float64 | |
1 | 1 | 1.0 |
2 | 1 | 1.0 |
dfr = x[1, :]
Row | a | b |
---|---|---|
Int64 | Float64 | |
1 | 1 | 1.0 |
DataFrame(dfr)
Row | a | b |
---|---|---|
Int64 | Float64 | |
1 | 1 | 1.0 |
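The difference between the view and the DataFrame created from it becomes visible once the parent is modified. A minimal sketch with a slightly larger parent than above:

```julia
using DataFrames

x = DataFrame(a=[1, 2, 3], b=[1.0, 2.0, 3.0])
sdf = view(x, 2:3, :)   # a SubDataFrame: a window into `x`
y = DataFrame(sdf)      # a new DataFrame with its own columns

x[2, :a] = 20           # modify the parent

@show sdf[1, :a]        # the view sees the change
@show y[1, :a]          # the materialized copy does not
```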
Conversion to a matrix#
Let’s start by creating a DataFrame with two rows and two columns.
x = DataFrame(x=1:2, y=["A", "B"])
Row | x | y |
---|---|---|
Int64 | String | |
1 | 1 | A |
2 | 2 | B |
We can create a matrix by passing this DataFrame to Matrix or Array.
Matrix(x)
2×2 Matrix{Any}:
1 "A"
2 "B"
Array(x)
2×2 Matrix{Any}:
1 "A"
2 "B"
This would work even if the DataFrame had some missings:
x = DataFrame(x=1:2, y=[missing, "B"])
Row | x | y |
---|---|---|
Int64 | String? | |
1 | 1 | missing |
2 | 2 | B |
Matrix(x)
2×2 Matrix{Any}:
1 missing
2 "B"
In the two previous matrix examples, Julia created matrices with elements of type Any. We can see more clearly that the element type of the matrix is inferred when we pass, for example, a DataFrame of integers to Matrix, creating a 2D Array of Int64s:
x = DataFrame(x=1:2, y=3:4)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 3 |
2 | 2 | 4 |
Matrix(x)
2×2 Matrix{Int64}:
1 3
2 4
In this next example, Julia correctly identifies that Union is needed to express the type of the resulting Matrix (which contains missings).
x = DataFrame(x=1:2, y=[missing, 4])
Row | x | y |
---|---|---|
Int64 | Int64? | |
1 | 1 | missing |
2 | 2 | 4 |
Matrix(x)
2×2 Matrix{Union{Missing, Int64}}:
1 missing
2 4
Note that we can’t force a conversion of missing values to Ints!
try
Matrix{Int}(x)
catch e
show(e)
end
ArgumentError("cannot convert a DataFrame containing missing values to Matrix{Int64} (found for column y)")
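If an integer matrix is needed anyway, the missing values have to be dealt with first. Two possible approaches are sketched here: dropping incomplete rows with dropmissing, or substituting a placeholder with coalesce (the placeholder 0 is an arbitrary choice for illustration).

```julia
using DataFrames

x = DataFrame(x=1:2, y=[missing, 4])

# Option 1: drop the rows that contain missing values.
m1 = Matrix{Int}(dropmissing(x))

# Option 2: replace each missing with a placeholder value.
m2 = Matrix{Int}(coalesce.(x, 0))

@show m1
@show m2
```

Which option is appropriate depends on whether losing rows or introducing artificial values is less harmful for the analysis at hand.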
Iterating data frame by rows or columns#
Sometimes it is useful to create a wrapper around a DataFrame that produces its rows or columns.
For iterating columns you can use the eachcol function.
x = DataFrame(x=1:2, y=["A", "B"])
ec = eachcol(x)
Row | x | y |
---|---|---|
Int64 | String | |
1 | 1 | A |
2 | 2 | B |
A DataFrameColumns object behaves like a vector (note, though, that it is not an AbstractVector):
ec isa AbstractVector
false
ec[1]
2-element Vector{Int64}:
1
2
but you can also index into it using column names:
ec["x"]
2-element Vector{Int64}:
1
2
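Besides indexing, the object can be iterated directly. The sketch below also uses pairs to obtain each column together with its name; the dictionary coltypes is just an illustrative way to collect the result.

```julia
using DataFrames

df = DataFrame(x=1:2, y=["A", "B"])

# Plain iteration yields the column vectors in order.
for col in eachcol(df)
    println(eltype(col), ": ", col)
end

# pairs additionally yields the column name (a Symbol) with each column.
coltypes = Dict(name => eltype(col) for (name, col) in pairs(eachcol(df)))
@show coltypes
```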
Similarly, eachrow creates a DataFrameRows object that is a vector of its rows:
er = eachrow(x)
Row | x | y |
---|---|---|
Int64 | String | |
1 | 1 | A |
2 | 2 | B |
DataFrameRows is an AbstractVector:
er isa AbstractVector
true
er[end]
Row | x | y |
---|---|---|
Int64 | String | |
2 | 2 | B |
Note that a data frame, as well as DataFrameColumns and DataFrameRows objects, is not type stable (they do not know the types of their columns). This is useful to avoid compilation cost if you have very wide data frames with heterogeneous column types.
However, often (especially if a data frame is narrow) it is useful to create a lazy iterator that produces NamedTuples for each row of the DataFrame. Its key benefit is that it is type stable, so it is useful when you want to perform some operations quickly on a small subset of columns of a DataFrame (this strategy is often used internally by the DataFrames.jl package):
nti = Tables.namedtupleiterator(x)
Tables.NamedTupleIterator{Tables.Schema{(:x, :y), Tuple{Int64, String}}, Tables.RowIterator{@NamedTuple{x::Vector{Int64}, y::Vector{String}}}}(Tables.RowIterator{@NamedTuple{x::Vector{Int64}, y::Vector{String}}}((x = [1, 2], y = ["A", "B"]), 2))
for row in enumerate(nti)
@show row
end
row = (1, (x = 1, y = "A"))
row = (2, (x = 2, y = "B"))
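A common way to benefit from this type stability is to pass the iterator to a function, so that the loop is compiled for the concrete row type. The function sum_x below is a hypothetical example written for this sketch, not part of DataFrames.jl:

```julia
using DataFrames

# A small kernel; when `rows` is a NamedTuple iterator the compiler
# knows the type of `row.x` inside the loop.
function sum_x(rows)
    total = 0
    for row in rows
        total += row.x
    end
    return total
end

df = DataFrame(x=1:3, y=["A", "B", "C"])

@show sum_x(DataFrames.Tables.namedtupleiterator(df))
@show sum_x(eachrow(df))   # same result, computed without column type information
```

Both calls return the same value; the difference lies only in how much type information is available to the compiler, which matters for long loops.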
As with the previous options, you can easily convert a NamedTupleIterator back to a DataFrame.
DataFrame(nti)
Row | x | y |
---|---|---|
Int64 | String | |
1 | 1 | A |
2 | 2 | B |
Handling of duplicate column names#
We can pass the makeunique keyword argument to allow duplicate names (they get deduplicated):
df = DataFrame(:a => 1, :a => 2, :a_1 => 3; makeunique=true)
Row | a | a_2 | a_1 |
---|---|---|---|
Int64 | Int64 | Int64 | |
1 | 1 | 2 | 3 |
Otherwise, duplicates are not allowed.
try
df = DataFrame(:a => 1, :a => 2, :a_1 => 3)
catch e
show(e)
end
ArgumentError("Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.")
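In the makeunique example above the second :a became a_2 rather than a_1 because the name a_1 was already taken. When there is no such clash, the suffixes simply count up from 1, as this short sketch shows:

```julia
using DataFrames

# Three columns with the same requested name and no pre-existing suffixed names.
df = DataFrame(:x => 1, :x => 2, :x => 3; makeunique=true)
@show names(df)
```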
Observe that currently nothing is not printed when displaying a DataFrame in Jupyter Notebook:
df = DataFrame(x=[1, nothing], y=[nothing, "a"], z=[missing, "c"])
Row | x | y | z |
---|---|---|---|
Union… | Union… | String? | |
1 | 1 | | missing |
2 | | a | c |
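Finally, you can use the empty and empty! functions to remove all rows from a data frame. empty returns a new data frame with no rows and leaves the original unchanged, while empty! removes the rows in place: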
empty(df)
df
Row | x | y | z |
---|---|---|---|
Union… | Union… | String? | |
1 | 1 | | missing |
2 | | a | c |
empty!(df)
df
Row | x | y | z |
---|---|---|---|
Union… | Union… | String? |
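The difference between the two functions can be checked directly by counting rows. A minimal sketch on a small data frame created for this purpose:

```julia
using DataFrames

df = DataFrame(x=[1, 2], y=["a", "b"])

# empty returns a new data frame with the same columns and no rows.
e = empty(df)
@show nrow(e), nrow(df)

# empty! removes the rows from `df` itself; the columns and their types remain.
empty!(df)
@show nrow(df)
@show names(df)
```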
This notebook was generated using Literate.jl.