Possible pitfalls#
using DataFrames
using BenchmarkTools
Know what is copied when creating a DataFrame#
x = DataFrame(rand(3, 5), :auto)
| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | Float64 | |
| 1 | 0.2113 | 0.308308 | 0.0566407 | 0.828248 | 0.0765941 |
| 2 | 0.794645 | 0.998069 | 0.770412 | 0.653407 | 0.168957 |
| 3 | 0.139427 | 0.348479 | 0.123609 | 0.199557 | 0.735428 |
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in ncol(x))
true
the same when creating data frames using kwarg syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
| Row | x | y |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using copycols=false keyword argument in functions.
df = DataFrame(x=x, y=y, copycols=false)
| Row | x | y |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
now it is the same
y === df.y
true
not the same object
select(df, :y)[!, 1] === y
false
the same object
select(df, :y, copycols=false)[!, 1] === y
true
Do not modify the parent of GroupedDataFrame or view#
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
GroupedDataFrame with 2 groups based on key: id
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 2 | 1 |
| 2 | 2 | 3 |
| 3 | 1 | 5 |
⋮
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 2 | 2 |
| 2 | 2 | 4 |
| 3 | 2 | 6 |
s = view(x, 5:6, :)
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 5 |
| 2 | 2 | 6 |
delete!(x, 3:6)
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 2 | 1 |
| 2 | 2 | 2 |
This is an error
s ## Will return BoundsError
Single column selection for a DataFrame#
Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | Int64 | |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |
x[1, 1] = 100
display(x)
| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | Int64 | |
| 1 | 100 | 100 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |
When iterating rows of a data frame#
use
eachrowto avoid compilation cost in wide tables,but
Tables.namedtupleiteratorfor fast execution in tall tables
The table below is tall:
df2 = DataFrame(rand(10^6, 10), :auto)
| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
| 1 | 0.126173 | 0.91391 | 0.230886 | 0.46031 | 0.0800606 | 0.0696025 | 0.0561202 | 0.893639 | 0.894401 | 0.718733 |
| 2 | 0.660918 | 0.978677 | 0.611323 | 0.21381 | 0.328033 | 0.956103 | 0.587796 | 0.127799 | 0.406534 | 0.986028 |
| 3 | 0.75506 | 0.710277 | 0.653572 | 0.946084 | 0.628871 | 0.727277 | 0.635549 | 0.56663 | 0.397588 | 0.641642 |
| 4 | 0.969984 | 0.67806 | 0.992892 | 0.894194 | 0.393372 | 0.0337288 | 0.150722 | 0.0681556 | 0.379299 | 0.678804 |
| 5 | 0.266534 | 0.83572 | 0.304604 | 0.000771709 | 0.649048 | 0.279221 | 0.204479 | 0.598983 | 0.674805 | 0.521239 |
| 6 | 0.859502 | 0.371216 | 0.626017 | 0.658222 | 0.534773 | 0.481119 | 0.0269219 | 0.677936 | 0.116461 | 0.00703477 |
| 7 | 0.956322 | 0.439785 | 0.328789 | 0.618854 | 0.104963 | 0.216793 | 0.536987 | 0.248422 | 0.898859 | 0.0577803 |
| 8 | 0.285074 | 0.965578 | 0.34113 | 0.146301 | 0.279613 | 0.441333 | 0.0284473 | 0.658741 | 0.115377 | 0.482112 |
| 9 | 0.0183711 | 0.122434 | 0.374319 | 0.341599 | 0.107395 | 0.980639 | 0.736708 | 0.806391 | 0.748125 | 0.71007 |
| 10 | 0.380976 | 0.60499 | 0.83708 | 0.285092 | 0.420096 | 0.666504 | 0.24988 | 0.211287 | 0.245034 | 0.654536 |
| 11 | 0.62405 | 0.652209 | 0.760929 | 0.606618 | 0.111383 | 0.10651 | 0.0904231 | 0.666953 | 0.858432 | 0.778396 |
| 12 | 0.389306 | 0.950827 | 0.869131 | 0.204924 | 0.830092 | 0.716451 | 0.299398 | 0.45055 | 0.811171 | 0.486584 |
| 13 | 0.303644 | 0.411425 | 0.945022 | 0.838854 | 0.485448 | 0.0375779 | 0.234965 | 0.767782 | 0.835408 | 0.882997 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 999989 | 0.573598 | 0.60517 | 0.208363 | 0.113562 | 0.375677 | 0.407589 | 0.097919 | 0.510495 | 0.965661 | 0.687687 |
| 999990 | 0.0825845 | 0.274469 | 0.0438893 | 0.8678 | 0.100887 | 0.715827 | 0.729274 | 0.631956 | 0.492734 | 0.269479 |
| 999991 | 0.686946 | 0.593147 | 0.872832 | 0.353351 | 0.337731 | 0.23053 | 0.16554 | 0.364014 | 0.0924762 | 0.380706 |
| 999992 | 0.600088 | 0.0194345 | 0.361781 | 0.170843 | 0.373847 | 0.0926208 | 0.432099 | 0.629715 | 0.728548 | 0.753648 |
| 999993 | 0.84924 | 0.645248 | 0.0407621 | 0.404281 | 0.383247 | 0.228372 | 0.74692 | 0.535874 | 0.96791 | 0.59859 |
| 999994 | 0.874322 | 0.241312 | 0.723002 | 0.208205 | 0.545769 | 0.684953 | 0.905342 | 0.34449 | 0.699187 | 0.377778 |
| 999995 | 0.178979 | 0.243952 | 0.888587 | 0.963289 | 0.366592 | 0.68756 | 0.33175 | 0.103789 | 0.378261 | 0.350398 |
| 999996 | 0.845058 | 0.48191 | 0.408775 | 0.904383 | 0.846516 | 0.493856 | 0.0344712 | 0.581395 | 0.383492 | 0.159446 |
| 999997 | 0.710376 | 0.815444 | 0.21895 | 0.313038 | 0.668372 | 0.593656 | 0.494818 | 0.140116 | 0.668364 | 0.598582 |
| 999998 | 0.939286 | 0.564092 | 0.403675 | 0.924856 | 0.440322 | 0.011495 | 0.849363 | 0.93145 | 0.932504 | 0.249018 |
| 999999 | 0.582941 | 0.672818 | 0.848791 | 0.315112 | 0.636934 | 0.437168 | 0.324979 | 0.729984 | 0.202597 | 0.554824 |
| 1000000 | 0.11236 | 0.674445 | 0.411437 | 0.175847 | 0.249222 | 0.774572 | 0.901069 | 0.401976 | 0.122709 | 0.554017 |
@time map(sum, eachrow(df2));
2.468679 seconds (60.12 M allocations: 1.056 GiB, 21.56% gc time, 4.90% compilation time)
@time map(sum, eachrow(df2));
1.765101 seconds (59.99 M allocations: 1.050 GiB, 6.20% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
0.339066 seconds (1.22 M allocations: 66.508 MiB, 4.27% gc time, 94.07% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
0.005196 seconds (21 allocations: 7.631 MiB)
as you can see - this time it is much faster to iterate a type stable container
still you might want to use the select syntax, which is optimized for such reductions:
this includes compilation time
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.552321 seconds (711.57 k allocations: 43.126 MiB, 98.61% compilation time: 94% of which was recompilation)
1000000-element Vector{Float64}:
4.443835677199177
5.857019558194336
6.662549653640113
5.2392115580311875
4.335405396831737
4.359202661825034
4.407556880885516
3.7437040312173777
4.946051256166412
4.5554762944531015
⋮
4.162624520064019
5.400444138071531
5.6043599690171435
4.4931574135086505
5.139301007663957
5.221714755064596
6.2460613897101105
5.306147866526345
4.377654217174633
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.006508 seconds (123 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
4.443835677199177
5.857019558194336
6.662549653640113
5.2392115580311875
4.335405396831737
4.359202661825034
4.407556880885516
3.7437040312173777
4.946051256166412
4.5554762944531015
⋮
4.162624520064019
5.400444138071531
5.6043599690171435
4.4931574135086505
5.139301007663957
5.221714755064596
6.2460613897101105
5.306147866526345
4.377654217174633
This notebook was generated using Literate.jl.