Possible pitfalls#
using DataFrames
using BenchmarkTools
Know what is copied when creating a DataFrame#
x = DataFrame(rand(3, 5), :auto)
Row | x1 | x2 | x3 | x4 | x5 |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.560891 | 0.566409 | 0.604743 | 0.606931 | 0.761472 |
2 | 0.255647 | 0.636852 | 0.45487 | 0.199889 | 0.898372 |
3 | 0.300935 | 0.310988 | 0.673843 | 0.512502 | 0.585906 |
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in ncol(x))
true
the same when creating data frames using kwarg
syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using copycols=false keyword argument in functions.
df = DataFrame(x=x, y=y, copycols=false)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
now it is the same
y === df.y
true
not the same object
select(df, :y)[!, 1] === y
false
the same object
select(df, :y, copycols=false)[!, 1] === y
true
Do not modify the parent of GroupedDataFrame
or view#
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
GroupedDataFrame with 2 groups based on key: id
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 3 |
3 | 1 | 5 |
⋮
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 2 |
2 | 2 | 4 |
3 | 2 | 6 |
s = view(x, 5:6, :)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 1 | 5 |
2 | 2 | 6 |
delete!(x, 3:6)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 2 |
This is an error
s ## Will return BoundsError
Single column selection for a DataFrame
#
Single column selection for a DataFrame
creates aliases with ! and getproperty
syntax and copies with :
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 1 | 1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
x[1, 1] = 100
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 100 | 100 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
When iterating rows of a data frame#
use
eachrow
to avoid compilation cost in wide tables,but
Tables.namedtupleiterator
for fast execution in tall tables
The table below is tall:
df2 = DataFrame(rand(10^6, 10), :auto)
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
---|---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.173259 | 0.868429 | 0.199462 | 0.144989 | 0.140205 | 0.246803 | 0.246632 | 0.942353 | 0.926668 | 0.911074 |
2 | 0.123954 | 0.996035 | 0.484935 | 0.392219 | 0.203026 | 0.459265 | 0.177535 | 0.343015 | 0.484788 | 0.790101 |
3 | 0.154781 | 0.186644 | 0.799731 | 0.997506 | 0.955073 | 0.445179 | 0.619125 | 0.0335754 | 0.607219 | 0.865513 |
4 | 0.816031 | 0.868959 | 0.770971 | 0.581648 | 0.913118 | 0.241069 | 0.185114 | 0.959924 | 0.329531 | 0.144141 |
5 | 0.814599 | 0.801111 | 0.956055 | 0.248121 | 0.241139 | 0.150823 | 0.404526 | 0.322152 | 0.165737 | 0.185206 |
6 | 0.582423 | 0.119731 | 0.607285 | 0.0600893 | 0.827207 | 0.467929 | 0.855228 | 0.136935 | 0.866942 | 0.495018 |
7 | 0.592345 | 0.28729 | 0.530249 | 0.671263 | 0.387001 | 0.478387 | 0.270147 | 0.00583683 | 0.169445 | 0.92732 |
8 | 0.566305 | 0.466944 | 0.433918 | 0.509695 | 0.125568 | 0.540185 | 0.0345033 | 0.215133 | 0.247376 | 0.742053 |
9 | 0.711011 | 0.156912 | 0.966428 | 0.373472 | 0.779452 | 0.171457 | 0.9646 | 0.205437 | 0.0636527 | 0.92508 |
10 | 0.4733 | 0.639631 | 0.758602 | 0.342212 | 0.446266 | 0.940815 | 0.354516 | 0.846175 | 0.255437 | 0.04738 |
11 | 0.36807 | 0.944925 | 0.904134 | 0.648487 | 0.894707 | 0.423242 | 0.925946 | 0.727968 | 0.658193 | 0.688444 |
12 | 0.604438 | 0.472574 | 0.969352 | 0.831801 | 0.454789 | 0.811133 | 0.718755 | 0.733665 | 0.3699 | 0.438884 |
13 | 0.883801 | 0.398876 | 0.944276 | 0.466476 | 0.102224 | 0.77286 | 0.733994 | 0.652745 | 0.819884 | 0.883364 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
999989 | 0.890034 | 0.0641943 | 0.499185 | 0.196725 | 0.420027 | 0.0912085 | 0.00682904 | 0.53108 | 0.637768 | 0.533981 |
999990 | 0.810706 | 0.403768 | 0.0845722 | 0.664577 | 0.521997 | 0.580567 | 0.415557 | 0.702156 | 0.204175 | 0.965622 |
999991 | 0.456027 | 0.77377 | 0.598199 | 0.122355 | 0.448871 | 0.792457 | 0.827398 | 0.148164 | 0.28031 | 0.158536 |
999992 | 0.846266 | 0.679415 | 0.277502 | 0.94522 | 0.908634 | 0.0268256 | 0.573818 | 0.794065 | 0.396007 | 0.0417267 |
999993 | 0.77744 | 0.399099 | 0.127973 | 0.304421 | 0.710879 | 0.635667 | 0.324141 | 0.728154 | 0.0727041 | 0.606665 |
999994 | 0.756423 | 0.272729 | 0.13272 | 0.0142115 | 0.614726 | 0.115551 | 0.619445 | 0.438747 | 0.460739 | 0.663496 |
999995 | 0.287939 | 0.881226 | 0.350966 | 0.540559 | 0.70547 | 0.544379 | 0.0550066 | 0.71417 | 0.134671 | 0.81261 |
999996 | 0.849627 | 0.789437 | 0.127928 | 0.507435 | 0.319254 | 0.978655 | 0.00634836 | 0.803388 | 0.783744 | 0.67073 |
999997 | 0.118957 | 0.254107 | 0.854678 | 0.871965 | 0.747361 | 0.911129 | 0.152809 | 0.388875 | 0.538347 | 0.296838 |
999998 | 0.859681 | 0.547856 | 0.52048 | 0.0714623 | 0.849722 | 0.690608 | 0.347115 | 0.8185 | 0.311047 | 0.280407 |
999999 | 0.69592 | 0.131087 | 0.547444 | 0.261241 | 0.741731 | 0.796596 | 0.223215 | 0.750396 | 0.61934 | 0.495444 |
1000000 | 0.317247 | 0.632787 | 0.264964 | 0.94906 | 0.877184 | 0.632165 | 0.254215 | 0.0337684 | 0.0369622 | 0.905203 |
@time map(sum, eachrow(df2));
3.542418 seconds (60.11 M allocations: 1.056 GiB, 9.67% gc time, 3.42% compilation time)
@time map(sum, eachrow(df2));
3.169916 seconds (59.99 M allocations: 1.050 GiB, 4.13% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
0.249135 seconds (361.13 k allocations: 26.568 MiB, 4.86% gc time, 85.77% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
0.012652 seconds (20 allocations: 7.630 MiB)
as you can see - this time it is much faster to iterate a type stable container
still you might want to use the select
syntax, which is optimized for such reductions:
this includes compilation time
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.011790 seconds (6.67 k allocations: 7.969 MiB, 50.48% compilation time)
1000000-element Vector{Float64}:
4.799871994364215
4.454871793224532
5.664348582520293
5.810507372086148
4.289469346132967
5.0187854176762725
4.3192849190286875
3.8816819363205237
5.317501511565184
5.104333118795934
⋮
5.489477995118583
4.687143261221065
4.088786137198144
5.026998053622702
5.8365470064870415
5.135065469480588
5.29687884751163
5.262414871594562
4.90355610068893
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.006313 seconds (123 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
4.799871994364215
4.454871793224532
5.664348582520293
5.810507372086148
4.289469346132967
5.0187854176762725
4.3192849190286875
3.8816819363205237
5.317501511565184
5.104333118795934
⋮
5.489477995118583
4.687143261221065
4.088786137198144
5.026998053622702
5.8365470064870415
5.135065469480588
5.29687884751163
5.262414871594562
4.90355610068893
This notebook was generated using Literate.jl.