Possible pitfalls#
using DataFrames
using BenchmarkTools
Know what is copied when creating a DataFrame#
x = DataFrame(rand(3, 5), :auto)
Row | x1 | x2 | x3 | x4 | x5 |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.424737 | 0.782748 | 0.681777 | 0.572985 | 0.735879 |
2 | 0.428584 | 0.600395 | 0.539097 | 0.995162 | 0.757722 |
3 | 0.960558 | 0.167239 | 0.926269 | 0.00101525 | 0.797032 |
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in ncol(x))
true
the same when creating data frames using kwarg
syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using copycols=false keyword argument in functions.
df = DataFrame(x=x, y=y, copycols=false)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
now it is the same
y === df.y
true
not the same object
select(df, :y)[!, 1] === y
false
the same object
select(df, :y, copycols=false)[!, 1] === y
true
Do not modify the parent of GroupedDataFrame
or view#
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
GroupedDataFrame with 2 groups based on key: id
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 3 |
3 | 1 | 5 |
⋮
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 2 |
2 | 2 | 4 |
3 | 2 | 6 |
s = view(x, 5:6, :)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 1 | 5 |
2 | 2 | 6 |
delete!(x, 3:6)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 2 |
This is an error
s ## Will return BoundsError
Single column selection for a DataFrame
#
Single column selection for a DataFrame
creates aliases with ! and getproperty
syntax and copies with :
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 1 | 1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
x[1, 1] = 100
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 100 | 100 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
When iterating rows of a data frame#
use
eachrow
to avoid compilation cost in wide tables,but
Tables.namedtupleiterator
for fast execution in tall tables
The table below is tall:
df2 = DataFrame(rand(10^6, 10), :auto)
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
---|---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.0926801 | 0.51047 | 0.0182201 | 0.282308 | 0.633224 | 0.218633 | 0.923353 | 0.577041 | 0.673746 | 0.59492 |
2 | 0.674586 | 0.442703 | 0.360177 | 0.977093 | 0.232776 | 0.743179 | 0.959031 | 0.474704 | 0.362828 | 0.299904 |
3 | 0.853672 | 0.219866 | 0.767345 | 0.803446 | 0.753585 | 0.278893 | 0.139847 | 0.19652 | 0.583675 | 0.447995 |
4 | 0.30419 | 0.528587 | 0.419393 | 0.872883 | 0.256235 | 0.0927021 | 0.733771 | 0.773304 | 0.35754 | 0.112215 |
5 | 0.00703228 | 0.563562 | 0.94804 | 0.543083 | 0.452466 | 0.60112 | 0.534152 | 0.169756 | 0.972281 | 0.511011 |
6 | 0.0205538 | 0.433061 | 0.184369 | 0.367608 | 0.533516 | 0.542916 | 0.00705235 | 0.470152 | 0.19433 | 0.00416603 |
7 | 0.688379 | 0.425821 | 0.790728 | 0.829473 | 0.374633 | 0.785347 | 0.0298765 | 0.939468 | 0.0746554 | 0.679708 |
8 | 0.34046 | 0.30185 | 0.537306 | 0.350657 | 0.882454 | 0.460346 | 0.780001 | 0.551189 | 0.728183 | 0.550581 |
9 | 0.0243611 | 0.286375 | 0.35574 | 0.622435 | 0.754108 | 0.91973 | 0.316334 | 0.0269321 | 0.26608 | 0.231496 |
10 | 0.373269 | 0.508587 | 0.345888 | 0.190497 | 0.754962 | 0.303547 | 0.660626 | 0.192616 | 0.385012 | 0.216885 |
11 | 0.720611 | 0.359379 | 0.788825 | 0.395271 | 0.973575 | 0.132564 | 0.723773 | 0.829181 | 0.590105 | 0.359539 |
12 | 0.311856 | 0.474905 | 0.310713 | 0.769761 | 0.533877 | 0.758357 | 0.661514 | 0.17868 | 0.963539 | 0.22141 |
13 | 0.117566 | 0.787011 | 0.271336 | 0.99351 | 0.470142 | 0.191422 | 0.936034 | 0.825118 | 0.864033 | 0.42115 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
999989 | 0.0962606 | 0.598586 | 0.732444 | 0.489372 | 0.241066 | 0.512523 | 0.527585 | 0.711133 | 0.923477 | 0.425483 |
999990 | 0.159721 | 0.802838 | 0.804383 | 0.295411 | 0.348738 | 0.0687883 | 0.0468217 | 0.883975 | 0.376613 | 0.0351025 |
999991 | 0.901358 | 0.734939 | 0.714723 | 0.60859 | 0.111339 | 0.603886 | 0.705214 | 0.0593379 | 0.212619 | 0.24508 |
999992 | 0.281053 | 0.771366 | 0.543346 | 0.869548 | 0.168224 | 0.141557 | 0.409522 | 0.760517 | 0.170061 | 0.693612 |
999993 | 0.0964515 | 0.550841 | 0.801289 | 0.351199 | 0.229217 | 0.501073 | 0.482301 | 0.502982 | 0.878444 | 0.671248 |
999994 | 0.453908 | 0.252602 | 0.513762 | 0.449409 | 0.604462 | 0.5066 | 0.765549 | 0.91116 | 0.643185 | 0.986821 |
999995 | 0.0667829 | 0.0153803 | 0.256034 | 0.330013 | 0.0346192 | 0.595964 | 0.844599 | 0.832714 | 0.979674 | 0.359995 |
999996 | 0.655738 | 0.571018 | 0.871706 | 0.514961 | 0.99357 | 0.580691 | 0.762547 | 0.307967 | 0.337729 | 0.940187 |
999997 | 0.118139 | 0.471918 | 0.857863 | 0.334845 | 0.754409 | 0.955712 | 0.305699 | 0.131203 | 0.0845749 | 0.235298 |
999998 | 0.821392 | 0.709429 | 0.950366 | 0.792938 | 0.204405 | 0.459165 | 0.924076 | 0.53253 | 0.127452 | 0.165834 |
999999 | 0.86303 | 0.135856 | 0.483179 | 0.765809 | 0.0529708 | 0.496276 | 0.206796 | 0.673471 | 0.583007 | 0.803619 |
1000000 | 0.630663 | 0.773273 | 0.144543 | 0.136788 | 0.978063 | 0.0167045 | 0.869717 | 0.0857667 | 0.478027 | 0.32786 |
@time map(sum, eachrow(df2));
2.178086 seconds (60.08 M allocations: 1.055 GiB, 15.60% gc time, 3.73% compilation time)
@time map(sum, eachrow(df2));
1.862223 seconds (59.99 M allocations: 1.050 GiB, 6.29% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
0.233865 seconds (428.99 k allocations: 30.089 MiB, 4.03% gc time, 91.72% compilation time: 27% of which was recompilation)
@time map(sum, Tables.namedtupleiterator(df2));
0.009578 seconds (25 allocations: 7.631 MiB)
as you can see - this time it is much faster to iterate a type stable container
still you might want to use the select
syntax, which is optimized for such reductions:
this includes compilation time
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.164873 seconds (684.31 k allocations: 41.595 MiB, 96.42% compilation time: 97% of which was recompilation)
1000000-element Vector{Float64}:
4.524594656468376
5.526979192486358
5.044843910022011
4.450819836455767
5.302503012392829
2.7577227261336805
5.618088460709214
5.483027953951166
3.8035897887299397
3.93188928765376
⋮
4.808806298265378
5.065044624374165
6.087458020724531
4.3157769856357
6.536114733186674
4.249661766586929
5.687587649537533
5.064013304031212
4.44140391850883
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.005210 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
4.524594656468376
5.526979192486358
5.044843910022011
4.450819836455767
5.302503012392829
2.7577227261336805
5.618088460709214
5.483027953951166
3.8035897887299397
3.93188928765376
⋮
4.808806298265378
5.065044624374165
6.087458020724531
4.3157769856357
6.536114733186674
4.249661766586929
5.687587649537533
5.064013304031212
4.44140391850883
This notebook was generated using Literate.jl.