Possible pitfalls#
using DataFrames
using BenchmarkTools
Know what is copied when creating a DataFrame#
x = DataFrame(rand(3, 5), :auto)
Row | x1 | x2 | x3 | x4 | x5 |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.0107158 | 0.322789 | 0.912201 | 0.361051 | 0.091482 |
2 | 0.996489 | 0.763475 | 0.596776 | 0.519464 | 0.810729 |
3 | 0.143136 | 0.656134 | 0.385184 | 0.596483 | 0.595928 |
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in ncol(x))
true
the same when creating data frames using kwarg
syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using copycols=false keyword argument in functions.
df = DataFrame(x=x, y=y, copycols=false)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
now it is the same
y === df.y
true
not the same object
select(df, :y)[!, 1] === y
false
the same object
select(df, :y, copycols=false)[!, 1] === y
true
Do not modify the parent of GroupedDataFrame
or view#
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
GroupedDataFrame with 2 groups based on key: id
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 3 |
3 | 1 | 5 |
⋮
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 2 |
2 | 2 | 4 |
3 | 2 | 6 |
s = view(x, 5:6, :)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 1 | 5 |
2 | 2 | 6 |
delete!(x, 3:6)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 2 |
This is an error
s ## Will return BoundsError
Single column selection for a DataFrame
#
Single column selection for a DataFrame
creates aliases with ! and getproperty
syntax and copies with :
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 1 | 1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
x[1, 1] = 100
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 100 | 100 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
When iterating rows of a data frame#
use
eachrow
to avoid compilation cost in wide tables,but
Tables.namedtupleiterator
for fast execution in tall tables
The table below is tall:
df2 = DataFrame(rand(10^6, 10), :auto)
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
---|---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.832265 | 0.131674 | 0.508225 | 0.447058 | 0.725331 | 0.0961234 | 0.218112 | 0.286077 | 0.105018 | 0.2172 |
2 | 0.81276 | 0.856391 | 0.421479 | 0.581901 | 0.900061 | 0.916382 | 0.0645517 | 0.893773 | 0.206452 | 0.520251 |
3 | 0.326325 | 0.499484 | 0.190477 | 0.713592 | 0.835194 | 0.899216 | 0.216704 | 0.19115 | 0.574479 | 0.137535 |
4 | 0.58597 | 0.791057 | 0.780497 | 0.213826 | 0.419666 | 0.0386514 | 0.813186 | 0.684702 | 0.945724 | 0.909644 |
5 | 0.924245 | 0.453775 | 0.160163 | 0.185441 | 0.613308 | 0.630058 | 0.0349581 | 0.242295 | 0.550329 | 0.444379 |
6 | 0.56838 | 0.671034 | 0.460085 | 0.174114 | 0.444761 | 0.0711773 | 0.639508 | 0.440083 | 0.608258 | 0.856337 |
7 | 0.295128 | 0.992051 | 0.132123 | 0.283321 | 0.538977 | 0.0318384 | 0.339727 | 0.102358 | 0.553367 | 0.890664 |
8 | 0.391832 | 0.584179 | 0.151672 | 0.104387 | 0.388192 | 0.700328 | 0.995755 | 0.307668 | 0.510221 | 0.364263 |
9 | 0.0383472 | 0.605881 | 0.0972578 | 0.850324 | 0.207511 | 0.181542 | 0.834932 | 0.739994 | 0.229699 | 0.565145 |
10 | 0.137732 | 0.221662 | 0.269603 | 0.260454 | 0.116514 | 0.545798 | 0.971564 | 0.695842 | 0.609764 | 0.997268 |
11 | 0.387003 | 0.874427 | 0.312095 | 0.694009 | 0.549808 | 0.815002 | 0.383385 | 0.739481 | 0.634661 | 0.605096 |
12 | 0.992679 | 0.935156 | 0.338056 | 0.362112 | 0.0555973 | 0.584112 | 0.360882 | 0.423658 | 0.601974 | 0.180581 |
13 | 0.871635 | 0.607497 | 0.526908 | 0.850567 | 0.405141 | 0.569324 | 0.81289 | 0.731156 | 0.754438 | 0.500285 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
999989 | 0.512012 | 0.583735 | 0.811174 | 0.626186 | 0.566022 | 0.290865 | 0.767167 | 0.39364 | 0.961983 | 0.963371 |
999990 | 0.942763 | 0.0343473 | 0.0703265 | 0.0107793 | 0.945252 | 0.181098 | 0.295494 | 0.901809 | 0.722218 | 0.133737 |
999991 | 0.885334 | 0.92445 | 0.696873 | 0.122386 | 0.843756 | 0.127146 | 0.47006 | 0.916132 | 0.986527 | 0.844013 |
999992 | 0.76433 | 0.964798 | 0.466808 | 0.0565495 | 0.542518 | 0.641157 | 0.976515 | 0.775186 | 0.829134 | 0.342463 |
999993 | 0.66927 | 0.55809 | 0.275991 | 0.555236 | 0.807204 | 0.449871 | 0.864521 | 0.44005 | 0.407486 | 0.0877313 |
999994 | 0.619486 | 0.324191 | 0.412416 | 0.0173235 | 0.586624 | 0.315169 | 0.0604456 | 0.699634 | 0.0465044 | 0.790883 |
999995 | 0.224703 | 0.347606 | 0.389763 | 0.0573255 | 0.780379 | 0.634856 | 0.534964 | 0.40202 | 0.998112 | 0.916573 |
999996 | 0.750259 | 0.356131 | 0.417182 | 0.71138 | 0.845825 | 0.587552 | 0.78217 | 0.721694 | 0.806511 | 0.80849 |
999997 | 0.979656 | 0.33232 | 0.44473 | 0.675533 | 0.937913 | 0.302433 | 0.668751 | 0.693913 | 0.0689218 | 0.765755 |
999998 | 0.81213 | 0.371985 | 0.841895 | 0.146453 | 0.831588 | 0.877518 | 0.598396 | 0.166529 | 0.0312706 | 0.784023 |
999999 | 0.636062 | 0.240501 | 0.0467885 | 0.668953 | 0.299394 | 0.0476516 | 0.453527 | 0.482403 | 0.0214226 | 0.484848 |
1000000 | 0.0644691 | 0.460597 | 0.989858 | 0.720458 | 0.796665 | 0.192304 | 0.857411 | 0.322194 | 0.228155 | 0.430382 |
@time map(sum, eachrow(df2));
2.460167 seconds (60.04 M allocations: 1.053 GiB, 12.48% gc time, 2.24% compilation time)
@time map(sum, eachrow(df2));
2.205419 seconds (59.99 M allocations: 1.050 GiB, 5.75% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
0.167873 seconds (370.37 k allocations: 27.104 MiB, 93.80% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
0.018365 seconds (25 allocations: 7.631 MiB, 47.51% gc time)
as you can see - this time it is much faster to iterate a type stable container
still you might want to use the select
syntax, which is optimized for such reductions:
this includes compilation time
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.009417 seconds (6.67 k allocations: 7.971 MiB, 44.94% compilation time)
1000000-element Vector{Float64}:
3.567084026690375
6.17400049412799
4.584157424151704
6.182924491214273
4.2389509124219185
4.933736477186374
4.159554933520796
4.498496951245511
4.350633848199879
4.826199194033561
⋮
6.359458205497677
5.115449497624578
3.8726771753912517
5.286301698291334
6.787193316962482
5.869925333426452
5.461787443903272
3.3815507835547747
5.06249293709858
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.005082 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
3.567084026690375
6.17400049412799
4.584157424151704
6.182924491214273
4.2389509124219185
4.933736477186374
4.159554933520796
4.498496951245511
4.350633848199879
4.826199194033561
⋮
6.359458205497677
5.115449497624578
3.8726771753912517
5.286301698291334
6.787193316962482
5.869925333426452
5.461787443903272
3.3815507835547747
5.06249293709858
This notebook was generated using Literate.jl.