using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame

x = DataFrame(rand(3, 5), :auto)

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in 1:ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in 1:ncol(x))
true
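
The practical consequence of sharing columns is that a write through one data frame is visible in the other. A minimal sketch, using the hypothetical names `src`, `shared`, and `independent` (they are not part of the original example):

```julia
using DataFrames

src = DataFrame(a=[1, 2, 3])
shared = DataFrame(src, copycols=false)   # columns alias those of src
independent = DataFrame(src)              # columns are copied (the default)

shared.a[1] = 100                         # writes into the vector that src also holds

first_in_src = src.a[1]                   # sees the change
first_in_independent = independent.a[1]   # unaffected
```

So `copycols=false` saves memory, but any later in-place change to a column affects every data frame that holds it.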

The same applies when creating data frames using keyword argument syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

You can avoid copying by passing the copycols=false keyword argument to functions that accept it.

df = DataFrame(x=x, y=y, copycols=false)

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of a GroupedDataFrame or a view

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## g is now wrong: it is only a view of x, and its grouping was computed before the change
s = view(x, 5:6, :)
deleteat!(x, 3:6)

Accessing the view now is an error

s ## throws a BoundsError, because rows 5:6 no longer exist in x
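
One way to sidestep both problems is to group or slice a copy rather than the parent itself. This is a sketch under that assumption, not the only possible approach; the names `df`, `g_new`, and the two counters are introduced here for illustration:

```julia
using DataFrames

df = DataFrame(id=repeat([1, 2], outer=3), x=1:6)

# Group a copy, so later edits to df cannot invalidate the grouping.
g = groupby(copy(df), :id)
df[1:3, :id] = [2, 2, 2]
rows_in_group_1 = nrow(g[(id=1,)])        # g still describes the data as it was grouped

# Regroup to pick up the modified ids.
g_new = groupby(df, :id)
rows_in_new_group_1 = nrow(g_new[(id=1,)])

# Indexing rows with a range and `:` copies them, unlike view.
s = df[5:6, :]
deleteat!(df, 3:6)                        # g_new is stale from here on; s is unaffected
```

The copy costs memory, so for large tables it may be preferable to simply recreate the GroupedDataFrame or view after every change to the parent.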

Single column selection for a DataFrame

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
x[1, 1] = 100
display(x)
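
The effect of the assignment can also be checked programmatically. The following sketch repeats the setup with only the alias and one copy, and stores the checks in hypothetical variables:

```julia
using DataFrames

x = DataFrame(a=1:3)
x.b = x[!, 1]   # alias: the same vector as column a
x.c = x[:, 1]   # copy: an independent vector

x[1, 1] = 100   # modify column a in place

b_is_alias = x.b === x.a   # the alias is the very same object
b_first = x.b[1]           # follows the change made to a
c_first = x.c[1]           # keeps the original value
```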

When iterating rows of a data frame

  • use eachrow to avoid compilation cost in wide tables,

  • but Tables.namedtupleiterator for fast execution in tall tables
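
For the wide case, a minimal sketch follows. The table size and the names `wide` and `row_sums` are assumptions chosen for illustration:

```julia
using DataFrames

# A hypothetical wide table: few rows, many columns.
wide = DataFrame(rand(10, 1_000), :auto)

# eachrow yields DataFrameRow objects. Their type does not encode the column
# names and element types, so no code specialized on all 1000 columns is compiled.
row_sums = map(sum, eachrow(wide))
```

Tables.namedtupleiterator would instead build a NamedTuple type with one field per column, which is what makes compilation expensive for wide tables.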

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)
@time map(sum, eachrow(df2));
  2.862006 seconds (60.12 M allocations: 1.056 GiB, 17.86% gc time, 3.80% compilation time)
@time map(sum, eachrow(df2));
  1.839702 seconds (59.99 M allocations: 1.050 GiB, 6.62% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.294215 seconds (1.22 M allocations: 66.375 MiB, 4.60% gc time, 92.71% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.007064 seconds (21 allocations: 7.631 MiB)

As you can see, this time it is much faster to iterate a type-stable container. Still, you might want to use the select syntax, which is optimized for such reductions.

The first call includes compilation time:

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.453513 seconds (652.53 k allocations: 40.209 MiB, 97.41% compilation time: 93% of which was recompilation)
1000000-element Vector{Float64}:
 6.088121300114665
 4.583506989773668
 6.436452085802293
 4.422063403705745
 5.706114411432138
 6.406715319509457
 4.045501772163187
 4.011497015157744
 4.614048166137602
 4.53977448643021
 ⋮
 5.23525353022991
 3.819440822231344
 2.5706625452467433
 5.209653001107025
 4.997139019016276
 5.407115447513615
 5.793244439757203
 5.3173389590187545
 5.904771256612289

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.011121 seconds (123 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
 6.088121300114665
 4.583506989773668
 6.436452085802293
 4.422063403705745
 5.706114411432138
 6.406715319509457
 4.045501772163187
 4.011497015157744
 4.614048166137602
 4.53977448643021
 ⋮
 5.23525353022991
 3.819440822231344
 2.5706625452467433
 5.209653001107025
 4.997139019016276
 5.407115447513615
 5.793244439757203
 5.3173389590187545
 5.904771256612289
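
All three approaches compute the same row sums and differ only in speed. A small sketch on a hypothetical two-row table (the names `small`, `r1`, `r2`, and `r3` are introduced here) makes this easy to verify by hand:

```julia
using DataFrames

small = DataFrame(a=[1.0, 2.0], b=[10.0, 20.0])

r1 = map(sum, eachrow(small))                                # DataFrameRow iteration
r2 = map(sum, Tables.namedtupleiterator(small))              # type-stable NamedTuple iteration
r3 = select(small, AsTable(:) => ByRow(sum) => "sum").sum    # select with ByRow
```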

This notebook was generated using Literate.jl.