Possible pitfalls#
using DataFrames
using BenchmarkTools
Know what is copied when creating a DataFrame#
x = DataFrame(rand(3, 5), :auto)
Row | x1 | x2 | x3 | x4 | x5 |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.102373 | 0.715619 | 0.402129 | 0.663427 | 0.79705 |
2 | 0.62435 | 0.402815 | 0.0745343 | 0.972842 | 0.811803 |
3 | 0.421527 | 0.980137 | 0.0678962 | 0.937666 | 0.789266 |
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in ncol(x))
true
the same when creating data frames using kwarg
syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using copycols=false keyword argument in functions.
df = DataFrame(x=x, y=y, copycols=false)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
now it is the same
y === df.y
true
not the same object
select(df, :y)[!, 1] === y
false
the same object
select(df, :y, copycols=false)[!, 1] === y
true
Do not modify the parent of GroupedDataFrame
or view#
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
GroupedDataFrame with 2 groups based on key: id
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 3 |
3 | 1 | 5 |
⋮
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 2 |
2 | 2 | 4 |
3 | 2 | 6 |
s = view(x, 5:6, :)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 1 | 5 |
2 | 2 | 6 |
delete!(x, 3:6)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 2 |
This is an error
s ## Will return BoundsError
Single column selection for a DataFrame
#
Single column selection for a DataFrame
creates aliases with ! and getproperty
syntax and copies with :
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 1 | 1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
x[1, 1] = 100
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 100 | 100 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
When iterating rows of a data frame#
use
eachrow
to avoid compilation cost in wide tables,but
Tables.namedtupleiterator
for fast execution in tall tables
The table below is tall:
df2 = DataFrame(rand(10^6, 10), :auto)
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
---|---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.766072 | 0.958625 | 0.832073 | 0.0254966 | 0.847788 | 0.314068 | 0.162517 | 0.264803 | 0.100111 | 0.0251814 |
2 | 0.369232 | 0.698069 | 0.816405 | 0.581132 | 0.907493 | 0.128747 | 0.603082 | 0.550548 | 0.943366 | 0.657636 |
3 | 0.37886 | 0.688995 | 0.318009 | 0.954047 | 0.534389 | 0.803777 | 0.481564 | 0.0241329 | 0.341135 | 0.660792 |
4 | 0.365775 | 0.176099 | 0.790104 | 0.685668 | 0.426212 | 0.164362 | 0.903828 | 0.271555 | 0.64879 | 0.901741 |
5 | 0.747757 | 0.465838 | 0.522257 | 0.794617 | 0.166702 | 0.771483 | 0.801306 | 0.138768 | 0.685883 | 0.230728 |
6 | 0.590686 | 0.234883 | 0.403878 | 0.611071 | 0.388746 | 0.281314 | 0.944573 | 0.729575 | 0.295263 | 0.875675 |
7 | 0.0693997 | 0.441801 | 0.747364 | 0.75684 | 0.829098 | 0.172671 | 0.797226 | 0.0123324 | 0.0525178 | 0.981163 |
8 | 0.682463 | 0.719511 | 0.873292 | 0.138598 | 0.22788 | 0.704053 | 0.596968 | 0.123704 | 0.929795 | 0.213875 |
9 | 0.759339 | 0.626111 | 0.622908 | 0.520235 | 0.637596 | 0.658228 | 0.0669593 | 0.917496 | 0.729016 | 0.827624 |
10 | 0.830865 | 0.658525 | 0.136859 | 0.327927 | 0.72403 | 0.181743 | 0.617011 | 0.658946 | 0.945421 | 0.543078 |
11 | 0.45224 | 0.950252 | 0.524913 | 0.636485 | 0.896756 | 0.526336 | 0.375906 | 0.155847 | 0.109402 | 0.700583 |
12 | 0.706518 | 0.548551 | 0.442289 | 0.637226 | 0.925868 | 0.679232 | 0.670576 | 0.434754 | 0.321455 | 0.0329374 |
13 | 0.397241 | 0.836678 | 0.212289 | 0.994819 | 0.379538 | 0.0108845 | 0.198273 | 0.65096 | 0.0799598 | 0.37601 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
999989 | 0.404831 | 0.405713 | 0.229537 | 0.525418 | 0.451297 | 0.943304 | 0.481654 | 0.0567747 | 0.178495 | 0.0693898 |
999990 | 0.349499 | 0.388355 | 0.881326 | 0.0624467 | 0.487766 | 0.69541 | 0.74954 | 0.623448 | 0.201285 | 0.807722 |
999991 | 0.418072 | 0.331013 | 0.651961 | 0.229347 | 0.18391 | 0.399454 | 0.0548266 | 0.216161 | 0.973492 | 0.0173772 |
999992 | 0.238997 | 0.0435125 | 0.0930842 | 0.268393 | 0.802257 | 0.257628 | 0.918723 | 0.23682 | 0.332129 | 0.936689 |
999993 | 0.438404 | 0.385233 | 0.336612 | 0.677567 | 0.171174 | 0.253147 | 0.093383 | 0.46 | 0.74405 | 0.948417 |
999994 | 0.787164 | 0.713589 | 0.858848 | 0.615152 | 0.712799 | 0.890109 | 0.915295 | 0.510148 | 0.615832 | 0.626102 |
999995 | 0.354889 | 0.692322 | 0.739987 | 0.489757 | 0.588558 | 0.163898 | 0.118598 | 0.70944 | 0.905823 | 0.0660051 |
999996 | 0.858621 | 0.068783 | 0.807774 | 0.301436 | 0.176622 | 0.943327 | 0.35847 | 0.0714865 | 0.684629 | 0.593146 |
999997 | 0.168081 | 0.772569 | 0.0557187 | 0.272444 | 0.868914 | 0.689224 | 0.440164 | 0.185854 | 0.849483 | 0.0965424 |
999998 | 0.805764 | 0.0757029 | 0.452915 | 0.579394 | 0.433996 | 0.992273 | 0.436371 | 0.128252 | 0.676505 | 0.743283 |
999999 | 0.235839 | 0.799897 | 0.583111 | 0.401753 | 0.703255 | 0.931781 | 0.642978 | 0.297659 | 0.475144 | 0.0156555 |
1000000 | 0.471203 | 0.00555274 | 0.718645 | 0.540562 | 0.519271 | 0.264605 | 0.659582 | 0.499696 | 0.315777 | 0.956476 |
@time map(sum, eachrow(df2));
2.248705 seconds (60.04 M allocations: 1.053 GiB, 10.63% gc time, 2.38% compilation time)
@time map(sum, eachrow(df2));
2.415540 seconds (59.99 M allocations: 1.050 GiB, 3.73% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
0.295434 seconds (371.24 k allocations: 27.132 MiB, 4.34% gc time, 91.02% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
0.014394 seconds (25 allocations: 7.631 MiB)
as you can see - this time it is much faster to iterate a type stable container
still you might want to use the select
syntax, which is optimized for such reductions:
this includes compilation time
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.013925 seconds (6.67 k allocations: 7.969 MiB, 51.19% compilation time)
1000000-element Vector{Float64}:
4.2967365440983585
6.255710306108731
5.185700498816111
5.334134450728015
5.325339163319068
5.355664387891267
4.860413103161113
5.210139354465969
6.365514164469623
5.62440555470397
⋮
4.128232610040115
4.50798787807542
7.245036793462261
4.829276873210631
4.864294862813554
4.398993642137623
5.324455470285535
5.0870718591125135
4.951369427152281
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.007440 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
4.2967365440983585
6.255710306108731
5.185700498816111
5.334134450728015
5.325339163319068
5.355664387891267
4.860413103161113
5.210139354465969
6.365514164469623
5.62440555470397
⋮
4.128232610040115
4.50798787807542
7.245036793462261
4.829276873210631
4.864294862813554
4.398993642137623
5.324455470285535
5.0870718591125135
4.951369427152281
This notebook was generated using Literate.jl.