Constructors and conversion


using DataFrames
using Random

Constructors

In this section, you’ll see many ways to create a DataFrame using the DataFrame() constructor.

First, we could create an empty DataFrame:

DataFrame()

0×0 DataFrame

Or we could call the constructor using keyword arguments to add columns to the DataFrame.

DataFrame(A=1:3, B=rand(3), C=randstring.([3, 3, 3]), fixed=1)

3×4 DataFrame

Row	A	B	C	fixed
	Int64	Float64	String	Int64
1	1	0.508101	lqi	1
2	2	0.301414	gY8	1
3	3	0.209166	BiS	1

Note in column :fixed that scalars are broadcast automatically. We can also create a DataFrame from a dictionary, in which case the dictionary keys are sorted to determine the order of the DataFrame columns.

x = Dict("A" => [1, 2], "B" => [true, false], "C" => ['a', 'b'], "fixed" => Ref([1, 1]))
DataFrame(x)

2×4 DataFrame

Row	A	B	C	fixed
	Int64	Bool	Char	Array…
1	1	true	a	[1, 1]
2	2	false	b	[1, 1]

This time we used Ref to protect a vector from being treated as a column, which forces it to be broadcast into every row of the :fixed column (note that the same [1, 1] vector is aliased in each row).
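The aliasing mentioned above can be checked directly. The following is a minimal sketch (the variable names v and same_object are only illustrative) showing that both rows of :fixed refer to one and the same vector, so a mutation made through one row is visible through the other:

```julia
using DataFrames

v = [1, 1]
df = DataFrame(:A => [1, 2], :fixed => Ref(v))

# Both rows of :fixed hold the very same vector object (identity, not just equality)
same_object = df.fixed[1] === df.fixed[2]

# Mutating the vector through row 1 is therefore visible through row 2
push!(df.fixed[1], 99)
df.fixed[2]
```

If independent vectors are needed in each row, one option is to build the column explicitly, for example with a comprehension such as [[1, 1] for _ in 1:2].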

Rather than explicitly creating a dictionary first, as above, we could pass DataFrame arguments with the syntax of dictionary key-value pairs.

Note that in this case, we use Symbols to denote the column names and arguments are not sorted. For example, :A, the symbol, produces A, the name of the first column here:

DataFrame(:A => [1, 2], :B => [true, false], :C => ['a', 'b'])

2×3 DataFrame

Row	A	B	C
	Int64	Bool	Char
1	1	true	a
2	2	false	b

Although using Symbols rather than strings to denote column names is generally preferred (as it is faster), DataFrames.jl also accepts strings as column names, so this works too:

DataFrame("A" => [1, 2], "B" => [true, false], "C" => ['a', 'b'])

2×3 DataFrame

Row	A	B	C
	Int64	Bool	Char
1	1	true	a
2	2	false	b

You can also pass a vector of pairs, which is useful if it is constructed programmatically:

DataFrame([:A => [1, 2], :B => [true, false], :C => ['a', 'b'], :fixed => "const"])

2×4 DataFrame

Row	A	B	C	fixed
	Int64	Bool	Char	String
1	1	true	a	const
2	2	false	b	const
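As an illustration of the programmatic case, here is a small sketch that generates the pairs in a comprehension. The column names x1 to x4 and the variable cols are made up for this example:

```julia
using DataFrames

# Build one name => column pair per column; Symbol("x", i) yields :x1, :x2, ...
cols = [Symbol("x", i) => fill(i, 3) for i in 1:4]

df = DataFrame(cols)
names(df)
```

Each pair contributes one column, in the order in which the pairs appear in the vector.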

Here we create a DataFrame from a vector of vectors, and each vector becomes a column.

DataFrame([rand(3) for i in 1:3], :auto)

3×3 DataFrame

Row	x1	x2	x3
	Float64	Float64	Float64
1	0.933312	0.696247	0.0251472
2	0.755231	0.927249	0.328722
3	0.0744421	0.955966	0.575201

DataFrame([rand(3) for i in 1:3], [:x1, :x2, :x3])

3×3 DataFrame

Row	x1	x2	x3
	Float64	Float64	Float64
1	0.798755	0.750187	0.759962
2	0.00900714	0.468435	0.0989276
3	0.508239	0.9966	0.331985

DataFrame([rand(3) for i in 1:3], ["x1", "x2", "x3"])

3×3 DataFrame

Row	x1	x2	x3
	Float64	Float64	Float64
1	0.321822	0.560639	0.283254
2	0.957754	0.862181	0.529742
3	0.883377	0.0875476	0.555152

As you can see, you either pass a vector of column names as the second argument or pass :auto, in which case column names are generated automatically. In particular, passing a vector of scalars to the DataFrame constructor is not allowed.

try
    DataFrame([1, 2, 3])
catch e
    show(e)
end

ArgumentError("'Vector{Int64}' iterates 'Int64' values, which doesn't satisfy the Tables.jl `AbstractRow` interface")

Instead, use a transposed vector if you have a vector of single values (this way you effectively pass a two-dimensional array to the constructor, which is supported in the same way as the vector of vectors case).

DataFrame(permutedims([1, 2, 3]), :auto)

1×3 DataFrame

Row	x1	x2	x3
	Int64	Int64	Int64
1	1	2	3
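To make the difference in shape explicit, the sketch below contrasts the transposed vector (one row, three columns) with the keyword constructor shown earlier (three rows, one column). The names wide and tall are only illustrative:

```julia
using DataFrames

v = [1, 2, 3]

# permutedims(v) is a 1×3 matrix, so each value becomes its own column
wide = DataFrame(permutedims(v), :auto)

# Passing the vector as a keyword argument makes it a single column instead
tall = DataFrame(a=v)

size(wide), size(tall)
```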

You can also pass a vector of NamedTuples to construct a DataFrame:

v = [(a=1, b=2), (a=3, b=4)]
DataFrame(v)

2×2 DataFrame

Row	a	b
	Int64	Int64
1	1	2
2	3	4

Alternatively you can pass a NamedTuple of vectors:

n = (a=1:3, b=11:13)
DataFrame(n)

3×2 DataFrame

Row	a	b
	Int64	Int64
1	1	11
2	2	12
3	3	13

Here we create a DataFrame from a matrix,

DataFrame(rand(3, 4), :auto)

3×4 DataFrame

Row	x1	x2	x3	x4
	Float64	Float64	Float64	Float64
1	0.0658299	0.343641	0.327629	0.814063
2	0.172508	0.337955	0.927201	0.821862
3	0.649522	0.52167	0.224757	0.44181

and here we do the same but also pass column names.

DataFrame(rand(3, 4), Symbol.('a':'d'))

3×4 DataFrame

Row	a	b	c	d
	Float64	Float64	Float64	Float64
1	0.0801007	0.651558	0.777528	0.518961
2	0.409794	0.389744	0.441976	0.834553
3	0.524643	0.0686172	0.246575	0.927093

or

DataFrame(rand(3, 4), string.('a':'d'))

3×4 DataFrame

Row	a	b	c	d
	Float64	Float64	Float64	Float64
1	0.791842	0.704386	0.336266	0.0157616
2	0.593104	0.0682098	0.787272	0.662576
3	0.401622	0.439904	0.242669	0.552314

This is how you can create a data frame with no rows, but with predefined columns and their types:

DataFrame(A=Int[], B=Float64[], C=String[])

0×3 DataFrame

Row	A	B	C
	Int64	Float64	String
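A typical reason to predefine columns like this is to fill the data frame row by row afterwards. A minimal sketch using push! (the row values are arbitrary):

```julia
using DataFrames

df = DataFrame(A=Int[], B=Float64[], C=String[])

# A Tuple is matched to the columns by position
push!(df, (1, 2.5, "a"))

# A NamedTuple is matched to the columns by name
push!(df, (A=2, B=3.5, C="b"))

df
```

Because the column element types were fixed up front, they stay Int, Float64 and String as rows are added.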

Finally, we can create a DataFrame by copying an existing DataFrame. Note that copy also copies the vectors.

x = DataFrame(a=1:2, b='a':'b')
y = copy(x)
(x === y), isequal(x, y), (x.a == y.a), (x.a === y.a)

(false, true, true, false)

Calling DataFrame on a DataFrame object works like copy.

x = DataFrame(a=1:2, b='a':'b')
y = DataFrame(x)
(x === y), isequal(x, y), (x.a == y.a), (x.a === y.a)

(false, true, true, false)

You can avoid copying the columns of a data frame (when possible) by passing the copycols=false keyword argument:

x = DataFrame(a=1:2, b='a':'b')
y = DataFrame(x, copycols=false)
(x === y), isequal(x, y), (x.a == y.a), (x.a === y.a)

(false, true, true, true)

The same rule applies to other constructors:

a = [1, 2, 3]
df1 = DataFrame(a=a)
df2 = DataFrame(a=a, copycols=false)
df1.a === a, df2.a === a

(false, true)
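The practical consequence of copycols=false is that the data frame and the source vector share memory. The following sketch (the names df_copy and df_alias are illustrative) shows that a later change to the source vector is visible only in the aliased data frame:

```julia
using DataFrames

a = [1, 2, 3]
df_copy = DataFrame(a=a)                    # column is a copy of a
df_alias = DataFrame(a=a, copycols=false)   # column is a itself

# Mutate the source vector after both data frames were built
a[1] = 100

df_copy.a[1], df_alias.a[1]
```

Sharing memory avoids an allocation, but it also means that the data frame can change behind your back, so copying is the safer default.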

You can create a similar uninitialized DataFrame based on an original one:

x = DataFrame(a=1, b=1.0)

1×2 DataFrame

Row	a	b
	Int64	Float64
1	1	1.0

similar(x)

1×2 DataFrame

Row	a	b
	Int64	Float64
1	139696676461712	6.90187e-310

The number of rows in the new DataFrame can be passed as a second argument:

similar(x, 0)

0×2 DataFrame

Row	a	b
	Int64	Float64

similar(x, 2)

2×2 DataFrame

Row	a	b
	Int64	Float64
1	4294967297	5.0e-324
2	139693811302400	5.0e-324
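Since the values produced by similar are arbitrary, they should be overwritten before use. A short sketch, assuming we simply want to zero-fill the new columns (the variable y is illustrative):

```julia
using DataFrames

x = DataFrame(a=1, b=1.0)
y = similar(x, 3)

# Column names and element types are taken from x; the contents are garbage
# until we overwrite them
fill!(y.a, 0)
fill!(y.b, 0.0)

y
```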

You can also create a new DataFrame from a SubDataFrame or a DataFrameRow (both are discussed in detail later in the tutorial; in particular, although a DataFrameRow is considered a 1-dimensional object similar to a NamedTuple, it gets converted to a 1-row DataFrame for convenience):

x = DataFrame(a=1, b=1.0)
sdf = view(x, [1, 1], :)

2×2 SubDataFrame

Row	a	b
	Int64	Float64
1	1	1.0
2	1	1.0

typeof(sdf)

SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}

DataFrame(sdf)

2×2 DataFrame

Row	a	b
	Int64	Float64
1	1	1.0
2	1	1.0

dfr = x[1, :]

DataFrameRow (2 columns)

Row	a	b
	Int64	Float64
1	1	1.0

DataFrame(dfr)

1×2 DataFrame

Row	a	b
	Int64	Float64
1	1	1.0

Conversion to a matrix

Let’s start by creating a DataFrame with two rows and two columns.

x = DataFrame(x=1:2, y=["A", "B"])

2×2 DataFrame

Row	x	y
	Int64	String
1	1	A
2	2	B

We can create a matrix by passing this DataFrame to Matrix or Array.

Matrix(x)

2×2 Matrix{Any}:
 1  "A"
 2  "B"

Array(x)

2×2 Matrix{Any}:
 1  "A"
 2  "B"

This would work even if the DataFrame had some missings:

x = DataFrame(x=1:2, y=[missing, "B"])

2×2 DataFrame

Row	x	y
	Int64	String?
1	1	missing
2	2	B

Matrix(x)

2×2 Matrix{Any}:
 1  missing
 2  "B"

In the two previous matrix examples, Julia created matrices with elements of type Any. We can see more clearly that the type of matrix is inferred when we pass, for example, a DataFrame of integers to Matrix, creating a 2D Array of Int64s:

x = DataFrame(x=1:2, y=3:4)

2×2 DataFrame

Row	x	y
	Int64	Int64
1	1	3
2	2	4

Matrix(x)

2×2 Matrix{Int64}:
 1  3
 2  4

In this next example, Julia correctly identifies that Union is needed to express the type of the resulting Matrix (which contains missings).

x = DataFrame(x=1:2, y=[missing, 4])

2×2 DataFrame

Row	x	y
	Int64	Int64?
1	1	missing
2	2	4

Matrix(x)

2×2 Matrix{Union{Missing, Int64}}:
 1   missing
 2  4

Note that we can’t force a conversion of missing values to Ints!

try
    Matrix{Int}(x)
catch e
    show(e)
end

ArgumentError("cannot convert a DataFrame containing missing values to Matrix{Int64} (found for column y)")
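If a Matrix{Int} is needed anyway, the missing values have to be dealt with first. Two possible approaches are sketched below; which one is appropriate (and the replacement value 0) depends on your data and is only an assumption here:

```julia
using DataFrames

x = DataFrame(x=1:2, y=[missing, 4])

# Option 1: drop the rows that contain missing values, then convert
m_dropped = Matrix{Int}(dropmissing(x))

# Option 2: convert first, then replace every missing with a chosen value
m_filled = coalesce.(Matrix(x), 0)

m_dropped, m_filled
```

The first option yields a 1×2 matrix because only the second row is complete, while the second option keeps both rows.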

Conversion to `NamedTuple` related tabular structures

First, define a data frame:

x = DataFrame(x=1:2, y=["A", "B"])

2×2 DataFrame

Row	x	y
	Int64	String
1	1	A
2	2	B

Now we convert a DataFrame into a NamedTuple of vectors

ct = Tables.columntable(x)

(x = [1, 2], y = ["A", "B"])

Next we convert it into a vector of NamedTuples

rt = Tables.rowtable(x)

2-element Vector{@NamedTuple{x::Int64, y::String}}:
 (x = 1, y = "A")
 (x = 2, y = "B")

We can perform the conversions back to a DataFrame using a standard constructor call:

DataFrame(ct)

2×2 DataFrame

Row	x	y
	Int64	String
1	1	A
2	2	B

DataFrame(rt)

2×2 DataFrame

Row	x	y
	Int64	String
1	1	A
2	2	B

Iterating data frame by rows or columns

Sometimes it is useful to create a wrapper around a DataFrame that produces its rows or columns. For iterating columns you can use the eachcol function.

ec = eachcol(x)

2×2 DataFrameColumns

Row	x	y
	Int64	String
1	1	A
2	2	B

A DataFrameColumns object behaves like a vector (note, though, that it is not an AbstractVector):

ec isa AbstractVector

false

ec[1]

2-element Vector{Int64}:
 1
 2

but you can also index into it using column names:

ec["x"]

2-element Vector{Int64}:
 1
 2
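When iterating over columns it is often handy to have the column name available as well. One way to do this is to call pairs on the DataFrameColumns object, which yields name => column pairs with Symbol names. A small sketch (the variable coltypes is illustrative) that collects the element type of every column:

```julia
using DataFrames

df = DataFrame(x=1:2, y=["A", "B"])

# pairs(eachcol(df)) iterates column name => column vector
coltypes = Dict(name => eltype(col) for (name, col) in pairs(eachcol(df)))
```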

Similarly, eachrow creates a DataFrameRows object that is a vector of its rows:

er = eachrow(x)

2×2 DataFrameRows

Row	x	y
	Int64	String
1	1	A
2	2	B

DataFrameRows is an AbstractVector

er isa AbstractVector

true

er[end]

DataFrameRow (2 columns)

Row	x	y
	Int64	String
2	2	B

Note that data frames, as well as DataFrameColumns and DataFrameRows objects, are not type stable (they do not know the types of their columns). This is useful to avoid compilation cost if you have very wide data frames with heterogeneous column types.

However, often (especially if a data frame is narrow) it is useful to create a lazy iterator that produces a NamedTuple for each row of the DataFrame. Its key benefit is that it is type stable, so it is useful when you want to perform operations quickly on a small subset of columns of a DataFrame (this strategy is often used internally by the DataFrames.jl package):

nti = Tables.namedtupleiterator(x)

Tables.NamedTupleIterator{Tables.Schema{(:x, :y), Tuple{Int64, String}}, Tables.RowIterator{@NamedTuple{x::Vector{Int64}, y::Vector{String}}}}(Tables.RowIterator{@NamedTuple{x::Vector{Int64}, y::Vector{String}}}((x = [1, 2], y = ["A", "B"]), 2))

for row in enumerate(nti)
    @show row
end

row = (1, (x = 1, y = "A"))
row = (2, (x = 2, y = "B"))

Similarly to the previous options, you can easily convert a NamedTupleIterator back to a DataFrame:

DataFrame(nti)

2×2 DataFrame

Row	x	y
	Int64	String
1	1	A
2	2	B
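To benefit from the type stability of the NamedTuple iterator, the row loop is usually placed inside a function that receives the iterator as an argument (a function barrier). The sketch below uses a made-up function weighted_sum and assumes, as in the cells above, that Tables is accessible after using DataFrames:

```julia
using DataFrames

df = DataFrame(x=1:3, y=[0.5, 1.5, 2.5])

# Inside this function the type of `rows` is known to the compiler, so with a
# NamedTuple iterator the accesses row.x and row.y have concrete types
function weighted_sum(rows)
    total = 0.0
    for row in rows
        total += row.x * row.y
    end
    return total
end

total = weighted_sum(Tables.namedtupleiterator(df))
```

The same function also accepts eachrow(df) and returns the same value; the difference is that the column types are then not known at compile time, which tends to matter only for long loops.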

Handling of duplicate column names

We can pass the makeunique keyword argument to allow passing duplicate names (they get deduplicated)

df = DataFrame(:a => 1, :a => 2, :a_1 => 3; makeunique=true)

1×3 DataFrame

Row	a	a_2	a_1
	Int64	Int64	Int64
1	1	2	3

Otherwise, duplicates are not allowed.

try
    df = DataFrame(:a => 1, :a => 2, :a_1 => 3)
catch e
    show(e)
end

ArgumentError("Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.")
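The deduplicated names are formed by appending a numeric suffix, skipping suffixes that would collide with a name that is already present. A short sketch comparing two cases (the variable names are illustrative):

```julia
using DataFrames

# Three columns all called :a get the suffixes _1 and _2
plain = DataFrame(:a => 1, :a => 2, :a => 3; makeunique=true)

# Here :a_1 is already taken by an explicit column, so the duplicate becomes :a_2
taken = DataFrame(:a => 1, :a => 2, :a_1 => 3; makeunique=true)

names(plain), names(taken)
```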

Observe that currently nothing values are not printed (the cell is left empty) when displaying a DataFrame in Jupyter Notebook:

df = DataFrame(x=[1, nothing], y=[nothing, "a"], z=[missing, "c"])

2×3 DataFrame

Row	x	y	z
	Union…	Union…	String?
1	1		missing
2		a	c

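Finally, you can use the empty and empty! functions to remove all rows from a data frame. empty returns a new data frame with no rows and leaves the original unchanged, while empty! removes the rows in place: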

empty(df)
df

2×3 DataFrame

Row	x	y	z
	Union…	Union…	String?
1	1		missing
2		a	c

empty!(df)
df

0×3 DataFrame

Row	x	y	z
	Union…	Union…	String?
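The difference between the two functions can be verified by checking row counts. A minimal sketch with a fresh data frame (the names df2 and e are illustrative):

```julia
using DataFrames

df2 = DataFrame(x=1:3, y=["a", "b", "c"])

# empty returns a new 0-row data frame with the same columns; df2 is untouched
e = empty(df2)
rows_after_empty = (nrow(e), nrow(df2))

# empty! removes the rows of df2 itself
empty!(df2)
rows_after_empty_bang = nrow(df2)
```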
