Possible pitfalls

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame

x = DataFrame(rand(3, 5), :auto)

x and y are not the same object

y = copy(x)
x === y

false

x and y are not the same object

y = DataFrame(x)
x === y

false

the columns are also not the same

any(x[!, i] === y[!, i] for i in 1:ncol(x))

false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y

false

But the columns are the same

all(x[!, i] === y[!, i] for i in 1:ncol(x))

true

the same applies when creating data frames using keyword argument syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)

different object

y === df.y

false

range is converted to a vector

typeof(x), typeof(df.x)

(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]

false

you can avoid copying by passing the copycols=false keyword argument to functions that support it.

df = DataFrame(x=x, y=y, copycols=false)

now it is the same

y === df.y

true

not the same object

select(df, :y)[!, 1] === y

false

the same object

select(df, :y, copycols=false)[!, 1] === y

true
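
Sharing columns has a practical consequence: an in-place change to a shared column is visible through both data frames. The following is a minimal sketch of that effect; the names src, y_copy and y_alias are hypothetical and not part of the example above.

```julia
using DataFrames

src = DataFrame(a=[1, 2, 3])

y_copy = DataFrame(src)                   # columns are copied
y_alias = DataFrame(src, copycols=false)  # columns are shared with src

src.a[1] = 100   # mutate the stored column vector in place

y_copy.a[1]      # unchanged, because y_copy owns its own vector
y_alias.a[1]     # changed, because y_alias.a is the same vector as src.a
```

If the source must stay independent of the result, keep the default copycols=true; copycols=false is worth using mainly when the extra allocation matters.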

Do not modify the parent of `GroupedDataFrame` or view

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## g is wrong now: it is only a view of x and was not updated

s = view(x, 5:6, :)

deleteat!(x, 3:6) ## delete! is deprecated in favor of deleteat! since DataFrames.jl 1.3

This is an error

s ## Throws a BoundsError, because the view refers to rows that no longer exist
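
One way to avoid both problems is to take a copy of the rows instead of a view, and to regroup after the parent has changed. The sketch below assumes DataFrames.jl 1.3 or later (for deleteat!) and reuses the same data as above under the hypothetical name parent_df.

```julia
using DataFrames

parent_df = DataFrame(id=repeat([1, 2], outer=3), x=1:6)

# Indexing with a row range and : copies the rows, so s does not depend on parent_df
s = parent_df[5:6, :]

# Removing rows from the parent no longer affects s
deleteat!(parent_df, 3:6)
s

# Group only after all modifications of the parent are done
g = groupby(parent_df, :id)
```

The copy costs memory, so views remain the better choice when the parent is known not to change while they are in use.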

Single column selection for a `DataFrame`

Selecting a single column of a DataFrame with ! or with getproperty syntax (for example x.a) returns an alias of the stored column, while selecting it with : returns a copy.

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)

x[1, 1] = 100
display(x)
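
Instead of reading the result off the printed tables, the aliasing can also be checked directly with ===. This is a small sketch with a hypothetical data frame d that mirrors the columns b and c above.

```julia
using DataFrames

d = DataFrame(a=1:3)
d.b = d[!, 1]   # alias: stores the very same vector as column a
d.c = d[:, 1]   # copy: stores a new vector with equal contents

d.a === d.b     # the two columns are the same object
d.a === d.c     # the two columns are different objects

d[1, 1] = 100   # write through column a
d.b[1]          # follows the change, since b aliases a
d.c[1]          # keeps the old value
```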

When iterating rows of a data frame

Use eachrow to avoid compilation cost in wide tables,
but use Tables.namedtupleiterator for fast execution in tall tables.
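
The difference comes from the kind of row each iterator yields. As a small sketch (the data frame small is a hypothetical example, not part of the benchmark), eachrow produces DataFrameRow objects, whose column types are not visible to the compiler, while Tables.namedtupleiterator produces NamedTuple values with concrete field types, which have to be compiled for the specific set of columns:

```julia
using DataFrames

small = DataFrame(a=1:3, b=[0.5, 1.5, 2.5])

r1 = first(eachrow(small))                   # a DataFrameRow, a view into small
r2 = first(Tables.namedtupleiterator(small)) # a NamedTuple with fields a and b

typeof(r1)
typeof(r2)

# Both rows hold the same values, so reductions agree
sum(r1) == sum(r2)
```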

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)

@time map(sum, eachrow(df2));

  2.671266 seconds (60.12 M allocations: 1.056 GiB, 17.79% gc time, 4.75% compilation time)

@time map(sum, eachrow(df2));

  2.135510 seconds (59.99 M allocations: 1.050 GiB, 5.03% gc time)

@time map(sum, Tables.namedtupleiterator(df2));

  0.335504 seconds (1.22 M allocations: 66.446 MiB, 98.46% compilation time)

@time map(sum, Tables.namedtupleiterator(df2));

  0.015974 seconds (21 allocations: 7.631 MiB, 62.70% gc time)

As you can see, iterating a type-stable container is much faster this time. Still, you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum

  0.548352 seconds (656.77 k allocations: 40.399 MiB, 98.85% compilation time: 93% of which was recompilation)

1000000-element Vector{Float64}:
 4.318684907890558
 6.748187752135945
 5.236656073591117
 4.836177394337353
 5.503487771763385
 6.288434302675904
 4.531503546777908
 5.239222200121802
 5.430613876426879
 3.956228037422769
 ⋮
 3.697216270324479
 5.675469665988822
 5.834480926774667
 4.977867909858582
 4.6420907225709644
 5.832588116621575
 4.1195992586097026
 4.354514786767316
 4.994858611372211

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum

  0.005886 seconds (121 allocations: 7.634 MiB)

1000000-element Vector{Float64}:
 4.318684907890558
 6.748187752135945
 5.236656073591117
 4.836177394337353
 5.503487771763385
 6.288434302675904
 4.531503546777908
 5.239222200121802
 5.430613876426879
 3.956228037422769
 ⋮
 3.697216270324479
 5.675469665988822
 5.834480926774667
 4.977867909858582
 4.6420907225709644
 5.832588116621575
 4.1195992586097026
 4.354514786767316
 4.994858611372211

This notebook was generated using Literate.jl.
