using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame

x = DataFrame(rand(3, 5), :auto)
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in 1:ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in 1:ncol(x))
true
the same when creating data frames using kwarg syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using the copycols=false keyword argument in functions that accept it.
df = DataFrame(x=x, y=y, copycols=false)
now it is the same
y === df.y
true
not the same object
select(df, :y)[!, 1] === y
false
the same object
select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
s = view(x, 5:6, :)
deleteat!(x, 3:6) ## deleteat! replaces delete!, which is deprecated in recent DataFrames.jl releases
This is an error
s ## Will throw BoundsError

Single column selection for a DataFrame

Single column selection for a DataFrame creates aliases with the ! and getproperty syntax and copies with :
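The aliasing claim above can also be checked directly with ===, which tests object identity. This is a minimal sketch, assuming a small throwaway data frame named tmp (a hypothetical name that is not used elsewhere in this notebook):

```julia
using DataFrames

## tmp is a hypothetical example table, separate from the x used in this notebook
tmp = DataFrame(a=[1, 2, 3])

alias_bang = tmp[!, :a] === tmp.a   ## ! and getproperty both return the stored column itself
copy_colon = tmp[:, :a] === tmp.a   ## : allocates a new vector, so the identity check fails

## mutating the alias is visible in the data frame; mutating the copy is not
v = tmp[!, :a]
v[1] = 100
c = tmp[:, :a]
c[2] = -1

(alias_bang, copy_colon, tmp.a)     ## (true, false, [100, 2, 3])
```

Only the write through v reaches tmp, because c is an independent vector.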
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
x[1, 1] = 100
display(x)

When iterating rows of a data frame

use eachrow to avoid compilation cost in wide tables, but Tables.namedtupleiterator for fast execution in tall tables
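This notebook times only the tall case. The following is a rough sketch of how the wide case might be compared. The name df_wide and its size (2 rows, 500 columns) are arbitrary assumptions chosen for illustration. Timings depend on the machine and package versions, so no output is shown.

```julia
using DataFrames

## df_wide is a hypothetical wide table: few rows, many columns
df_wide = DataFrame(rand(2, 500), :auto)

## eachrow yields DataFrameRow objects, whose type does not depend on the set of columns
@time r1 = map(sum, eachrow(df_wide));

## Tables.namedtupleiterator yields NamedTuples with one field per column,
## so the first call must compile code specialized for a 500-field type
@time r2 = map(sum, Tables.namedtupleiterator(df_wide));

r1 ≈ r2   ## both approaches compute the same row sums
```

Comparing the first-call times of the two @time lines is the relevant measurement here, since the cost in question is compilation rather than steady-state execution.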
The table below is tall:
df2 = DataFrame(rand(10^6, 10), :auto)
@time map(sum, eachrow(df2));
  2.237700 seconds (60.13 M allocations: 1.057 GiB, 15.06% gc time, 4.78% compilation time)
@time map(sum, eachrow(df2));
  1.858801 seconds (59.99 M allocations: 1.050 GiB, 5.72% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.207093 seconds (416.41 k allocations: 28.386 MiB, 4.99% gc time, 90.07% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.009769 seconds (21 allocations: 7.634 MiB)
As you can see, this time it is much faster to iterate a type-stable container.
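One way to see what "type stable" means here is to inspect the elements the two iterators produce. This is a minimal sketch, assuming a small hypothetical table named small:

```julia
using DataFrames

## small is a hypothetical example table, not part of the benchmark above
small = DataFrame(a=[1, 2], b=[1.5, 2.5])

row = first(eachrow(small))                    ## a DataFrameRow
nt = first(Tables.namedtupleiterator(small))   ## a NamedTuple

## the NamedTuple type records each column's name and element type,
## so the compiler can specialize code that processes it
nt_is_concrete = isconcretetype(typeof(nt))
nt_field_types = fieldtypes(typeof(nt))

## the DataFrameRow type does not encode the column element types
row_type = typeof(row)

(nt isa NamedTuple, row isa DataFrameRow, nt_field_types)
```

Because the column types are part of the NamedTuple type, a function such as sum can be compiled once for that exact layout, which fits the faster second run measured above.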
Still, you might want to use the select syntax, which is optimized for such reductions.
This includes compilation time:
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.012942 seconds (8.37 k allocations: 8.053 MiB, 44.37% compilation time)
1000000-element Vector{Float64}:
5.681282780543965
5.193902645781667
4.850866529726622
6.859539372323131
5.275363155091211
5.88117324094082
5.542401814790153
5.777287414295434
4.268532664681015
5.073472018109767
⋮
3.560891506759153
5.811414785408425
5.920903736768583
4.46180920185985
6.201544892680714
3.751239120607656
5.715157708343561
3.857121093632021
5.367968857394535
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.007216 seconds (121 allocations: 7.637 MiB)
1000000-element Vector{Float64}:
5.681282780543965
5.193902645781667
4.850866529726622
6.859539372323131
5.275363155091211
5.88117324094082
5.542401814790153
5.777287414295434
4.268532664681015
5.073472018109767
⋮
3.560891506759153
5.811414785408425
5.920903736768583
4.46180920185985
6.201544892680714
3.751239120607656
5.715157708343561
3.857121093632021
5.367968857394535

This notebook was generated using Literate.jl.