Possible pitfalls#
using DataFrames
using BenchmarkTools
Know what is copied when creating a DataFrame#
x = DataFrame(rand(3, 5), :auto)
Row | x1 | x2 | x3 | x4 | x5 |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.64218 | 0.518333 | 0.912597 | 0.737362 | 0.369005 |
2 | 0.31615 | 0.0129298 | 0.830193 | 0.790456 | 0.872891 |
3 | 0.321944 | 0.858216 | 0.433624 | 0.00642884 | 0.584734 |
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in ncol(x))
true
the same when creating data frames using kwarg
syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using copycols=false keyword argument in functions.
df = DataFrame(x=x, y=y, copycols=false)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
now it is the same
y === df.y
true
not the same object
select(df, :y)[!, 1] === y
false
the same object
select(df, :y, copycols=false)[!, 1] === y
true
Do not modify the parent of GroupedDataFrame
or view#
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
GroupedDataFrame with 2 groups based on key: id
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 3 |
3 | 1 | 5 |
⋮
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 2 |
2 | 2 | 4 |
3 | 2 | 6 |
s = view(x, 5:6, :)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 1 | 5 |
2 | 2 | 6 |
delete!(x, 3:6)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 2 |
This is an error
s ## Will return BoundsError
Single column selection for a DataFrame
#
Single column selection for a DataFrame
creates aliases with ! and getproperty
syntax and copies with :
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 1 | 1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
x[1, 1] = 100
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 100 | 100 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
When iterating rows of a data frame#
use
eachrow
to avoid compilation cost in wide tables,but
Tables.namedtupleiterator
for fast execution in tall tables
The table below is tall:
df2 = DataFrame(rand(10^6, 10), :auto)
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
---|---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.514603 | 0.584329 | 0.0665491 | 0.530295 | 0.86387 | 0.898966 | 0.679293 | 0.721792 | 0.730427 | 0.764711 |
2 | 0.267495 | 0.0713982 | 0.672922 | 0.310102 | 0.710557 | 0.159766 | 0.648488 | 0.957066 | 0.784073 | 0.760727 |
3 | 0.272191 | 0.41045 | 0.106284 | 0.333831 | 0.685483 | 0.970379 | 0.255603 | 0.876819 | 0.713776 | 0.152785 |
4 | 0.0528972 | 0.824871 | 0.480898 | 0.13952 | 0.137032 | 0.570055 | 0.910393 | 0.371392 | 0.249605 | 0.00861442 |
5 | 0.298223 | 0.660267 | 0.856617 | 0.0392045 | 0.792595 | 0.347979 | 0.000670216 | 0.444046 | 0.734485 | 0.47244 |
6 | 0.470917 | 0.957735 | 0.49079 | 0.116445 | 0.70665 | 0.712987 | 0.581061 | 0.338392 | 0.727909 | 0.259861 |
7 | 0.307946 | 0.992683 | 0.259293 | 0.741856 | 0.161519 | 0.823901 | 0.361185 | 0.422894 | 0.462686 | 0.092596 |
8 | 0.718977 | 0.598485 | 0.218965 | 0.409626 | 0.555054 | 0.0475205 | 0.00540351 | 0.188369 | 0.682426 | 0.959554 |
9 | 0.933012 | 0.807318 | 0.551271 | 0.34622 | 0.937425 | 0.624077 | 0.240691 | 0.14738 | 0.510568 | 0.427122 |
10 | 0.409493 | 0.443627 | 0.614149 | 0.279265 | 0.818656 | 0.383236 | 0.855691 | 0.361436 | 0.862913 | 0.273861 |
11 | 0.232408 | 0.142214 | 0.604112 | 0.513238 | 0.582627 | 0.0209442 | 0.265426 | 0.253028 | 0.865552 | 0.516647 |
12 | 0.0416786 | 0.479475 | 0.535593 | 0.114542 | 0.910311 | 0.508257 | 0.585025 | 0.466144 | 0.655532 | 0.989518 |
13 | 0.776002 | 0.349074 | 0.163478 | 0.744065 | 0.188351 | 0.638256 | 0.535164 | 0.998956 | 0.560299 | 0.578827 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
999989 | 0.575849 | 0.947785 | 0.594998 | 0.0366146 | 0.440273 | 0.0496197 | 0.745546 | 0.632301 | 0.28064 | 0.349605 |
999990 | 0.155688 | 0.712417 | 0.503775 | 0.140053 | 0.152792 | 0.185115 | 0.189817 | 0.101924 | 0.0710076 | 0.280905 |
999991 | 0.00378607 | 0.528154 | 0.794535 | 0.134944 | 0.41213 | 0.249464 | 0.777389 | 0.670145 | 0.791562 | 0.833196 |
999992 | 0.568585 | 0.0819683 | 0.449248 | 0.477519 | 0.841495 | 0.181821 | 0.630556 | 0.679918 | 0.551774 | 0.425615 |
999993 | 0.748835 | 0.363711 | 0.969691 | 0.07941 | 0.170153 | 0.755486 | 0.397959 | 0.542357 | 0.449842 | 0.385593 |
999994 | 0.74982 | 0.264108 | 0.7711 | 0.343475 | 0.841114 | 0.509809 | 0.238477 | 0.40012 | 0.114931 | 0.362719 |
999995 | 0.166444 | 0.677702 | 0.781527 | 0.797404 | 0.809076 | 0.042696 | 0.0685178 | 0.146515 | 0.882035 | 0.605978 |
999996 | 0.99203 | 0.579089 | 0.637306 | 0.288347 | 0.400514 | 0.289615 | 0.549606 | 0.318763 | 0.352045 | 0.88931 |
999997 | 0.911368 | 0.186061 | 0.624658 | 0.0400239 | 0.404521 | 0.170685 | 0.461204 | 0.36386 | 0.454862 | 0.669423 |
999998 | 0.910057 | 0.820059 | 0.659252 | 0.0567977 | 0.0453175 | 0.863239 | 0.973717 | 0.26367 | 0.670758 | 0.383683 |
999999 | 0.00533842 | 0.257444 | 0.164686 | 0.51996 | 0.590139 | 0.21177 | 0.907831 | 0.0961115 | 0.960155 | 0.775406 |
1000000 | 0.344445 | 0.193189 | 0.263284 | 0.264805 | 0.281536 | 0.669053 | 0.881398 | 0.187907 | 0.959585 | 0.695789 |
@time map(sum, eachrow(df2));
2.783457 seconds (60.11 M allocations: 1.056 GiB, 24.50% gc time, 3.74% compilation time)
@time map(sum, eachrow(df2));
1.939167 seconds (59.99 M allocations: 1.050 GiB, 5.80% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
0.313881 seconds (1.31 M allocations: 72.141 MiB, 97.98% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
0.019452 seconds (25 allocations: 7.631 MiB, 68.01% gc time)
as you can see - this time it is much faster to iterate a type stable container
still you might want to use the select
syntax, which is optimized for such reductions:
this includes compilation time
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.576455 seconds (765.68 k allocations: 46.162 MiB, 98.88% compilation time: 93% of which was recompilation)
1000000-element Vector{Float64}:
6.354835449763846
5.3425944089008865
4.777601004121772
3.7452773291459938
4.646526552655023
5.362747006556143
4.626559527928046
4.384378236075445
5.525083232942826
5.3023269182995865
⋮
4.888499780349042
4.863038043321743
4.595675110383267
4.977894137096483
5.296626151007581
4.286665043814575
5.646550460409576
4.488840250193496
4.740990746940874
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.004556 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
6.354835449763846
5.3425944089008865
4.777601004121772
3.7452773291459938
4.646526552655023
5.362747006556143
4.626559527928046
4.384378236075445
5.525083232942826
5.3023269182995865
⋮
4.888499780349042
4.863038043321743
4.595675110383267
4.977894137096483
5.296626151007581
4.286665043814575
5.646550460409576
4.488840250193496
4.740990746940874
This notebook was generated using Literate.jl.