Possible pitfalls#
using DataFrames
using BenchmarkTools
Know what is copied when creating a DataFrame#
x = DataFrame(rand(3, 5), :auto)
| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | Float64 | |
| 1 | 0.735717 | 0.839505 | 0.0154341 | 0.675586 | 0.881035 |
| 2 | 0.42152 | 0.680851 | 0.395228 | 0.727355 | 0.828121 |
| 3 | 0.650958 | 0.800487 | 0.252657 | 0.585549 | 0.112034 |
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in ncol(x))
true
the same when creating data frames using kwarg syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
| Row | x | y |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using copycols=false keyword argument in functions.
df = DataFrame(x=x, y=y, copycols=false)
| Row | x | y |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
now it is the same
y === df.y
true
not the same object
select(df, :y)[!, 1] === y
false
the same object
select(df, :y, copycols=false)[!, 1] === y
true
Do not modify the parent of GroupedDataFrame or view#
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
GroupedDataFrame with 2 groups based on key: id
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 2 | 1 |
| 2 | 2 | 3 |
| 3 | 1 | 5 |
⋮
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 2 | 2 |
| 2 | 2 | 4 |
| 3 | 2 | 6 |
s = view(x, 5:6, :)
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 5 |
| 2 | 2 | 6 |
delete!(x, 3:6)
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 2 | 1 |
| 2 | 2 | 2 |
This is an error
s ## Will return BoundsError
Single column selection for a DataFrame#
Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | Int64 | |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |
x[1, 1] = 100
display(x)
| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | Int64 | |
| 1 | 100 | 100 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |
When iterating rows of a data frame#
use
eachrowto avoid compilation cost in wide tables,but
Tables.namedtupleiteratorfor fast execution in tall tables
The table below is tall:
df2 = DataFrame(rand(10^6, 10), :auto)
| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
| 1 | 0.662533 | 0.0791648 | 0.476085 | 0.137527 | 0.780112 | 0.556826 | 0.29654 | 0.0462489 | 0.580725 | 0.459269 |
| 2 | 0.641919 | 0.219975 | 0.262794 | 0.391827 | 0.116199 | 0.397834 | 0.894128 | 0.178664 | 0.84901 | 0.97465 |
| 3 | 0.633747 | 0.404336 | 0.422398 | 0.523208 | 0.839493 | 0.994638 | 0.938289 | 0.109516 | 0.35902 | 0.153182 |
| 4 | 0.878339 | 0.600868 | 0.341619 | 0.536543 | 0.362645 | 0.796499 | 0.88459 | 0.849896 | 0.293786 | 0.657048 |
| 5 | 0.106658 | 0.150221 | 0.91712 | 0.809954 | 0.617408 | 0.316943 | 0.958499 | 0.409879 | 0.489289 | 0.719617 |
| 6 | 0.59303 | 0.275082 | 0.659807 | 0.987099 | 0.919333 | 0.0796179 | 0.496563 | 0.714089 | 0.260569 | 0.889253 |
| 7 | 0.556824 | 0.114174 | 0.876116 | 0.833703 | 0.583783 | 0.358228 | 0.47351 | 0.891788 | 0.932091 | 0.851929 |
| 8 | 0.324652 | 0.69072 | 0.086412 | 0.690169 | 0.713064 | 0.22064 | 0.597105 | 0.319834 | 0.191315 | 0.122917 |
| 9 | 0.397104 | 0.963838 | 0.336063 | 0.429932 | 0.799338 | 0.543028 | 0.716992 | 0.427673 | 0.572117 | 0.849775 |
| 10 | 0.830195 | 0.716072 | 0.704308 | 0.363882 | 0.0986513 | 0.659677 | 0.596932 | 0.998389 | 0.236294 | 0.615016 |
| 11 | 0.328347 | 0.667366 | 0.0899753 | 0.868009 | 0.579315 | 0.633655 | 0.931564 | 0.815405 | 0.526418 | 0.981884 |
| 12 | 0.881655 | 0.453405 | 0.154194 | 0.566519 | 0.352452 | 0.985721 | 0.723605 | 0.827787 | 0.880202 | 0.767686 |
| 13 | 0.636864 | 0.673021 | 0.249024 | 0.710393 | 0.475797 | 0.128731 | 0.359529 | 0.315217 | 0.723981 | 0.812359 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 999989 | 0.975986 | 0.476689 | 0.0514201 | 0.989957 | 0.883476 | 0.357384 | 0.243844 | 0.916654 | 0.496581 | 0.0234581 |
| 999990 | 0.0708377 | 0.110511 | 0.594757 | 0.679725 | 0.334746 | 0.211194 | 0.680969 | 0.165843 | 0.257112 | 0.459214 |
| 999991 | 0.579021 | 0.539474 | 0.200064 | 0.594268 | 0.462375 | 0.27947 | 0.229069 | 0.024284 | 0.37251 | 0.296529 |
| 999992 | 0.191944 | 0.0983965 | 0.252409 | 0.793691 | 0.375082 | 0.973486 | 0.845244 | 0.0734637 | 0.777887 | 0.722518 |
| 999993 | 0.542728 | 0.92741 | 0.0118351 | 0.110601 | 0.527383 | 0.96193 | 0.958437 | 0.634349 | 0.523464 | 0.171206 |
| 999994 | 0.423901 | 0.768264 | 0.813564 | 0.647029 | 0.776164 | 0.744056 | 0.304229 | 0.2424 | 0.905494 | 0.639531 |
| 999995 | 0.641797 | 0.871426 | 0.213085 | 0.805663 | 0.341525 | 0.524431 | 0.393226 | 0.0950247 | 0.158032 | 0.357649 |
| 999996 | 0.153543 | 0.311477 | 0.508389 | 0.424192 | 0.512524 | 0.0338346 | 0.620033 | 0.749842 | 0.29848 | 0.116878 |
| 999997 | 0.494436 | 0.126412 | 0.295339 | 0.472308 | 0.110807 | 0.639409 | 0.10246 | 0.0628602 | 0.502538 | 0.0265686 |
| 999998 | 0.418533 | 0.756071 | 0.936742 | 0.142512 | 0.231282 | 0.509376 | 0.686532 | 0.602506 | 0.361023 | 0.378734 |
| 999999 | 0.0709701 | 0.838043 | 0.807257 | 0.0562733 | 0.775084 | 0.251166 | 0.895086 | 0.38131 | 0.652535 | 0.614597 |
| 1000000 | 0.259408 | 0.747359 | 0.616164 | 0.749795 | 0.622107 | 0.0136075 | 0.959057 | 0.192052 | 0.958715 | 0.629737 |
@time map(sum, eachrow(df2));
2.248203 seconds (60.12 M allocations: 1.056 GiB, 17.66% gc time, 4.00% compilation time)
@time map(sum, eachrow(df2));
1.672558 seconds (59.99 M allocations: 1.050 GiB, 5.47% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
0.271730 seconds (1.22 M allocations: 66.497 MiB, 4.36% gc time, 93.31% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
0.007500 seconds (21 allocations: 7.631 MiB)
as you can see - this time it is much faster to iterate a type stable container
still you might want to use the select syntax, which is optimized for such reductions:
this includes compilation time
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.434296 seconds (712.75 k allocations: 43.180 MiB, 97.27% compilation time: 94% of which was recompilation)
1000000-element Vector{Float64}:
4.075031145560174
4.9269995567112765
5.37782807808806
6.201831198796953
5.49558611192314
5.874443898489834
6.472146151299726
3.956827075489279
6.035858171073009
5.819416611082717
⋮
5.104121438607423
5.369343756593169
6.2646325534129454
4.4018586929999035
3.729193960552614
2.833138329204522
5.023312467271604
5.3423217101664395
5.748001588947343
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.011397 seconds (123 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
4.075031145560174
4.9269995567112765
5.37782807808806
6.201831198796953
5.49558611192314
5.874443898489834
6.472146151299726
3.956827075489279
6.035858171073009
5.819416611082717
⋮
5.104121438607423
5.369343756593169
6.2646325534129454
4.4018586929999035
3.729193960552614
2.833138329204522
5.023312467271604
5.3423217101664395
5.748001588947343
This notebook was generated using Literate.jl.