Possible pitfalls#
using DataFrames
using BenchmarkTools
Know what is copied when creating a DataFrame#
x = DataFrame(rand(3, 5), :auto)
Row | x1 | x2 | x3 | x4 | x5 |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.529334 | 0.0597551 | 0.235795 | 0.501942 | 0.958109 |
2 | 0.484044 | 0.047142 | 0.114432 | 0.0418198 | 0.913857 |
3 | 0.500429 | 0.156764 | 0.0733438 | 0.843276 | 0.986249 |
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in ncol(x))
true
the same when creating data frames using kwarg
syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using copycols=false keyword argument in functions.
df = DataFrame(x=x, y=y, copycols=false)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
now it is the same
y === df.y
true
not the same object
select(df, :y)[!, 1] === y
false
the same object
select(df, :y, copycols=false)[!, 1] === y
true
Do not modify the parent of GroupedDataFrame
or view#
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
GroupedDataFrame with 2 groups based on key: id
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 3 |
3 | 1 | 5 |
⋮
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 2 |
2 | 2 | 4 |
3 | 2 | 6 |
s = view(x, 5:6, :)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 1 | 5 |
2 | 2 | 6 |
delete!(x, 3:6)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 2 |
This is an error
s ## Will return BoundsError
Single column selection for a DataFrame
#
Single column selection for a DataFrame
creates aliases with ! and getproperty
syntax and copies with :
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 1 | 1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
x[1, 1] = 100
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 100 | 100 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
When iterating rows of a data frame#
use
eachrow
to avoid compilation cost in wide tables,but
Tables.namedtupleiterator
for fast execution in tall tables
The table below is tall:
df2 = DataFrame(rand(10^6, 10), :auto)
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
---|---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.812425 | 0.953875 | 0.289671 | 0.854766 | 0.509818 | 0.140464 | 0.08607 | 0.520695 | 0.205697 | 0.321225 |
2 | 0.208584 | 0.000966154 | 0.902822 | 0.423168 | 0.473373 | 0.157054 | 0.793769 | 0.957795 | 0.800943 | 0.664851 |
3 | 0.002694 | 0.326925 | 0.0368252 | 0.187928 | 0.126627 | 0.776543 | 0.402102 | 0.469324 | 0.0728587 | 0.844602 |
4 | 0.569028 | 0.31596 | 0.848604 | 0.812409 | 0.45489 | 0.86572 | 0.345673 | 0.236728 | 0.224588 | 0.646519 |
5 | 0.0631041 | 0.203245 | 0.701698 | 0.303583 | 0.674797 | 0.597116 | 0.606777 | 0.451958 | 0.39108 | 0.00379739 |
6 | 0.247021 | 0.809399 | 0.760111 | 0.598913 | 0.272918 | 0.597262 | 0.49311 | 0.520876 | 0.447351 | 0.292064 |
7 | 0.267973 | 0.211999 | 0.771751 | 0.639811 | 0.784383 | 0.121173 | 0.563557 | 0.549457 | 0.0545117 | 0.745306 |
8 | 0.108655 | 0.110102 | 0.953703 | 0.168199 | 0.976653 | 0.0276927 | 0.643222 | 0.963599 | 0.304311 | 0.678014 |
9 | 0.693502 | 0.54823 | 0.20262 | 0.210266 | 0.561789 | 0.900743 | 0.528027 | 0.615125 | 0.313564 | 0.557392 |
10 | 0.661105 | 0.661711 | 0.820527 | 0.440406 | 0.434706 | 0.429731 | 0.0608958 | 0.846089 | 0.422275 | 0.17791 |
11 | 0.234678 | 0.914611 | 0.715813 | 0.029249 | 0.257056 | 0.122383 | 0.262263 | 0.00461947 | 0.146143 | 0.0903481 |
12 | 0.193291 | 0.342759 | 0.914605 | 0.418826 | 0.135818 | 0.484762 | 0.191179 | 0.188863 | 0.364738 | 0.21938 |
13 | 0.242351 | 0.327346 | 0.448075 | 0.560953 | 0.592612 | 0.0503523 | 0.893915 | 0.835376 | 0.64798 | 0.246143 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
999989 | 0.661472 | 0.635782 | 0.792354 | 0.761077 | 0.0308713 | 0.099554 | 0.240259 | 0.35399 | 0.314355 | 0.686009 |
999990 | 0.0775762 | 0.845558 | 0.742922 | 0.464236 | 0.689336 | 0.548313 | 0.807805 | 0.671055 | 0.0915516 | 0.833302 |
999991 | 0.0976372 | 0.361425 | 0.714378 | 0.341477 | 0.0911392 | 0.403061 | 0.723864 | 0.777392 | 0.595327 | 0.25862 |
999992 | 0.501212 | 0.272561 | 0.0973808 | 0.331344 | 0.0614938 | 0.519877 | 0.839399 | 0.765083 | 0.688672 | 0.150719 |
999993 | 0.273739 | 0.401793 | 0.344902 | 0.693997 | 0.595919 | 0.83713 | 0.931485 | 0.0467357 | 0.24671 | 0.151542 |
999994 | 0.832169 | 0.267376 | 0.925687 | 0.418644 | 0.0413268 | 0.791828 | 0.732267 | 0.197472 | 0.778455 | 0.226904 |
999995 | 0.0660683 | 0.428659 | 0.340804 | 0.747841 | 0.279143 | 0.182118 | 0.698698 | 0.570114 | 0.918868 | 0.789216 |
999996 | 0.466766 | 0.64338 | 0.773518 | 0.738444 | 0.963 | 0.521375 | 0.100483 | 0.656539 | 0.82001 | 0.324571 |
999997 | 0.907574 | 0.59836 | 0.888056 | 0.378057 | 0.606351 | 0.0249071 | 0.0768992 | 0.531575 | 0.621098 | 0.440302 |
999998 | 0.612807 | 0.416059 | 0.728195 | 0.271598 | 0.571033 | 0.268268 | 0.230246 | 0.610069 | 0.474585 | 0.931925 |
999999 | 0.763149 | 0.923241 | 0.428408 | 0.99299 | 0.377282 | 0.43588 | 0.980952 | 0.00561584 | 0.65738 | 0.105073 |
1000000 | 0.645671 | 0.461747 | 0.28093 | 0.71416 | 0.38365 | 0.786039 | 0.162344 | 0.541735 | 0.13926 | 0.988997 |
@time map(sum, eachrow(df2));
1.934170 seconds (60.11 M allocations: 1.056 GiB, 6.28% gc time, 6.29% compilation time)
@time map(sum, eachrow(df2));
2.244732 seconds (59.99 M allocations: 1.050 GiB, 21.20% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
0.358052 seconds (1.32 M allocations: 73.135 MiB, 5.26% gc time, 98.21% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
0.006158 seconds (25 allocations: 7.631 MiB)
as you can see - this time it is much faster to iterate a type stable container
still you might want to use the select
syntax, which is optimized for such reductions:
this includes compilation time
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.586411 seconds (662.47 k allocations: 41.440 MiB, 98.79% compilation time: 94% of which was recompilation)
1000000-element Vector{Float64}:
4.694706502313453
5.383324647550716
3.2464293812878067
5.320117297804017
3.9971556695645343
5.039026681750683
4.709921099884204
4.93415120764944
5.131257550693444
4.955355792546958
⋮
4.227741340484659
4.523953769215137
5.2121280936199845
5.021528576928276
6.008086254445894
5.073179235758298
5.1147852166724865
5.669970336464331
5.1045325666422725
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.005626 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
4.694706502313453
5.383324647550716
3.2464293812878067
5.320117297804017
3.9971556695645343
5.039026681750683
4.709921099884204
4.93415120764944
5.131257550693444
4.955355792546958
⋮
4.227741340484659
4.523953769215137
5.2121280936199845
5.021528576928276
6.008086254445894
5.073179235758298
5.1147852166724865
5.669970336464331
5.1045325666422725
This notebook was generated using Literate.jl.