Possible pitfalls#
using DataFrames
using BenchmarkTools
Know what is copied when creating a DataFrame#
x = DataFrame(rand(3, 5), :auto)
Row | x1 | x2 | x3 | x4 | x5 |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.947396 | 0.594914 | 0.318333 | 0.291359 | 0.591581 |
2 | 0.27621 | 0.856675 | 0.318142 | 0.795397 | 0.151377 |
3 | 0.78505 | 0.0548061 | 0.864647 | 0.708265 | 0.775993 |
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in ncol(x))
true
the same when creating data frames using kwarg
syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using copycols=false keyword argument in functions.
df = DataFrame(x=x, y=y, copycols=false)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
now it is the same
y === df.y
true
not the same object
select(df, :y)[!, 1] === y
false
the same object
select(df, :y, copycols=false)[!, 1] === y
true
Do not modify the parent of GroupedDataFrame
or view#
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
GroupedDataFrame with 2 groups based on key: id
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 3 |
3 | 1 | 5 |
⋮
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 2 |
2 | 2 | 4 |
3 | 2 | 6 |
s = view(x, 5:6, :)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 1 | 5 |
2 | 2 | 6 |
delete!(x, 3:6)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 2 |
This is an error
s ## Will return BoundsError
Single column selection for a DataFrame
#
Single column selection for a DataFrame
creates aliases with ! and getproperty
syntax and copies with :
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 1 | 1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
x[1, 1] = 100
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 100 | 100 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
When iterating rows of a data frame#
use
eachrow
to avoid compilation cost in wide tables,but
Tables.namedtupleiterator
for fast execution in tall tables
The table below is tall:
df2 = DataFrame(rand(10^6, 10), :auto)
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
---|---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.538165 | 0.477619 | 0.63865 | 0.114941 | 0.591869 | 0.919326 | 0.460326 | 0.441607 | 0.0579152 | 0.00357446 |
2 | 0.533138 | 0.395225 | 0.936605 | 0.0296993 | 0.239957 | 0.31099 | 0.293299 | 0.950694 | 0.102088 | 0.362718 |
3 | 0.128026 | 0.037075 | 0.884437 | 0.150983 | 0.480751 | 0.819375 | 0.488905 | 0.0527264 | 0.08525 | 0.918672 |
4 | 0.0638011 | 0.676545 | 0.107159 | 0.344157 | 0.911609 | 0.852503 | 0.579901 | 0.129051 | 0.665324 | 0.439404 |
5 | 0.409903 | 0.117916 | 0.241907 | 0.676004 | 0.697972 | 0.169596 | 0.720508 | 0.653289 | 0.678077 | 0.266621 |
6 | 0.518344 | 0.960537 | 0.651609 | 0.67432 | 0.979009 | 0.873823 | 0.128961 | 0.0356952 | 0.139544 | 0.247548 |
7 | 0.717526 | 0.773986 | 0.564901 | 0.594871 | 0.60695 | 0.269118 | 0.809437 | 0.725244 | 0.383868 | 0.606193 |
8 | 0.437243 | 0.984836 | 0.500519 | 0.514212 | 0.130419 | 0.852367 | 0.511447 | 0.0222195 | 0.944033 | 0.438669 |
9 | 0.0227649 | 0.115036 | 0.0683505 | 0.0503512 | 0.943526 | 0.623187 | 0.0234799 | 0.461865 | 0.348735 | 0.690593 |
10 | 0.455792 | 0.527437 | 0.187127 | 0.679676 | 0.464027 | 0.414834 | 0.27163 | 0.717938 | 0.84506 | 0.394637 |
11 | 0.143232 | 0.592707 | 0.461296 | 0.0288561 | 0.433593 | 0.703271 | 0.286978 | 0.670572 | 0.570357 | 0.0609912 |
12 | 0.470649 | 0.324332 | 0.932704 | 0.548668 | 0.815837 | 0.877708 | 0.688376 | 0.445633 | 0.28391 | 0.827211 |
13 | 0.320294 | 0.38712 | 0.651804 | 0.278253 | 0.985029 | 0.777575 | 0.350404 | 0.403067 | 0.385916 | 0.495512 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
999989 | 0.343766 | 0.349236 | 0.531518 | 0.591489 | 0.612046 | 0.0491963 | 0.328509 | 0.655394 | 0.605836 | 0.524536 |
999990 | 0.0673405 | 0.30125 | 0.954371 | 0.765194 | 0.646108 | 0.694336 | 0.358838 | 0.803799 | 0.0972333 | 0.119382 |
999991 | 0.519115 | 0.654394 | 0.84288 | 0.858733 | 0.541975 | 0.964638 | 0.921998 | 0.536797 | 0.101776 | 0.83639 |
999992 | 0.986049 | 0.888041 | 0.585754 | 0.817333 | 0.0598934 | 0.88705 | 0.916273 | 0.890539 | 0.639275 | 0.625951 |
999993 | 0.969635 | 0.271582 | 0.972013 | 0.988876 | 0.455172 | 0.888254 | 0.60208 | 0.431273 | 0.359801 | 0.607461 |
999994 | 0.617444 | 0.079036 | 0.67513 | 0.484139 | 0.00749603 | 0.383036 | 0.38363 | 0.0849654 | 0.510573 | 0.0714929 |
999995 | 0.710299 | 0.783795 | 0.977731 | 0.885664 | 0.731486 | 0.728469 | 0.406494 | 0.77531 | 0.520003 | 0.891723 |
999996 | 0.37826 | 0.579551 | 0.370413 | 0.286594 | 0.309514 | 0.685915 | 0.421191 | 0.224232 | 0.0486228 | 0.0447236 |
999997 | 0.0815508 | 0.767083 | 0.779161 | 0.728052 | 0.474848 | 0.475733 | 0.0454733 | 0.861135 | 0.162566 | 0.7554 |
999998 | 0.241736 | 0.215685 | 0.685861 | 0.400015 | 0.570262 | 0.0111432 | 0.69192 | 0.29754 | 0.521174 | 0.314071 |
999999 | 0.680218 | 0.933159 | 0.599354 | 0.116367 | 0.914086 | 0.136051 | 0.4714 | 0.171243 | 0.315983 | 0.879676 |
1000000 | 0.212074 | 0.266076 | 0.825468 | 0.923946 | 0.721343 | 0.081667 | 0.0335297 | 0.534697 | 0.53835 | 0.00548066 |
@time map(sum, eachrow(df2));
2.705622 seconds (60.11 M allocations: 1.056 GiB, 25.15% gc time, 3.68% compilation time)
@time map(sum, eachrow(df2));
1.862482 seconds (59.99 M allocations: 1.050 GiB, 5.09% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
0.348988 seconds (1.32 M allocations: 72.991 MiB, 2.33% gc time, 98.14% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
0.006220 seconds (25 allocations: 7.631 MiB)
as you can see - this time it is much faster to iterate a type stable container
still you might want to use the select
syntax, which is optimized for such reductions:
this includes compilation time
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.632661 seconds (754.52 k allocations: 46.014 MiB, 99.11% compilation time: 96% of which was recompilation)
1000000-element Vector{Float64}:
4.243993540615169
4.154414077448179
4.046201057064972
4.769453432081732
4.631793508413152
5.2093905747165135
6.052092977953647
5.335964653974474
3.3478892521388897
4.958159796081296
⋮
7.296159776774205
6.546147719145592
3.296942512182397
7.410973094568378
3.3490167652546234
5.1310010949088385
3.949406516653557
5.217535862311809
4.142632091909143
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.004660 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
4.243993540615169
4.154414077448179
4.046201057064972
4.769453432081732
4.631793508413152
5.2093905747165135
6.052092977953647
5.335964653974474
3.3478892521388897
4.958159796081296
⋮
7.296159776774205
6.546147719145592
3.296942512182397
7.410973094568378
3.3490167652546234
5.1310010949088385
3.949406516653557
5.217535862311809
4.142632091909143
This notebook was generated using Literate.jl.