Possible pitfalls#
using DataFrames
using BenchmarkTools
Know what is copied when creating a DataFrame#
x = DataFrame(rand(3, 5), :auto)
Row | x1 | x2 | x3 | x4 | x5 |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.627966 | 0.780746 | 0.216054 | 0.49819 | 0.740574 |
2 | 0.982644 | 0.863759 | 0.521577 | 0.791425 | 0.916146 |
3 | 0.931923 | 0.66291 | 0.42917 | 0.778559 | 0.110755 |
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in ncol(x))
true
the same when creating data frames using kwarg
syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using copycols=false keyword argument in functions.
df = DataFrame(x=x, y=y, copycols=false)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
now it is the same
y === df.y
true
not the same object
select(df, :y)[!, 1] === y
false
the same object
select(df, :y, copycols=false)[!, 1] === y
true
Do not modify the parent of GroupedDataFrame
or view#
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
GroupedDataFrame with 2 groups based on key: id
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 3 |
3 | 1 | 5 |
⋮
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 2 |
2 | 2 | 4 |
3 | 2 | 6 |
s = view(x, 5:6, :)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 1 | 5 |
2 | 2 | 6 |
delete!(x, 3:6)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 2 |
This is an error
s ## Will return BoundsError
Single column selection for a DataFrame
#
Single column selection for a DataFrame
creates aliases with ! and getproperty
syntax and copies with :
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 1 | 1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
x[1, 1] = 100
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 100 | 100 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
When iterating rows of a data frame#
use
eachrow
to avoid compilation cost in wide tables,but
Tables.namedtupleiterator
for fast execution in tall tables
The table below is tall:
df2 = DataFrame(rand(10^6, 10), :auto)
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
---|---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.827493 | 0.522678 | 0.30587 | 0.697431 | 0.356574 | 0.147348 | 0.165574 | 0.35231 | 0.735541 | 0.734174 |
2 | 0.334083 | 0.721048 | 0.678431 | 0.338055 | 0.162462 | 0.96368 | 0.98222 | 0.310456 | 0.788059 | 0.808003 |
3 | 0.249476 | 0.185305 | 0.386422 | 0.670001 | 0.43029 | 0.300558 | 0.567526 | 0.604122 | 0.993032 | 0.412015 |
4 | 0.331262 | 0.241808 | 0.131926 | 0.47897 | 0.939951 | 0.866249 | 0.713522 | 0.85524 | 0.62698 | 0.94087 |
5 | 0.328309 | 0.913404 | 0.997025 | 0.631137 | 0.0583395 | 0.535069 | 0.115982 | 0.261661 | 0.771639 | 0.38268 |
6 | 0.49489 | 0.935854 | 0.849281 | 0.0651715 | 0.702173 | 0.973227 | 0.698805 | 0.911043 | 0.344635 | 0.0538166 |
7 | 0.639202 | 0.11087 | 0.380368 | 0.99173 | 0.732873 | 0.637735 | 0.130003 | 0.74666 | 0.825874 | 0.635628 |
8 | 0.594615 | 0.952221 | 0.118798 | 0.890403 | 0.731167 | 0.129171 | 0.190358 | 0.306405 | 0.598677 | 0.659036 |
9 | 0.326836 | 0.867387 | 0.779446 | 0.0514019 | 0.668467 | 0.0895685 | 0.825931 | 0.313593 | 0.506275 | 0.18824 |
10 | 0.451944 | 0.450682 | 0.695083 | 0.279971 | 0.518991 | 0.481737 | 0.510053 | 0.235263 | 0.755214 | 0.512064 |
11 | 0.391553 | 0.422148 | 0.679026 | 0.815403 | 0.120333 | 0.932611 | 0.784385 | 0.168465 | 0.662867 | 0.567092 |
12 | 0.394268 | 0.510768 | 0.868931 | 0.401659 | 0.701443 | 0.701251 | 0.214795 | 0.632434 | 0.392702 | 0.716703 |
13 | 0.574541 | 0.509977 | 0.733312 | 0.357177 | 0.204279 | 0.7421 | 0.837905 | 0.222128 | 0.960075 | 0.405046 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
999989 | 0.0175485 | 0.480487 | 0.892262 | 0.130995 | 0.670485 | 0.00171493 | 0.120122 | 0.299505 | 0.951317 | 0.964039 |
999990 | 0.891715 | 0.647522 | 0.561725 | 0.833824 | 0.299696 | 0.981391 | 0.747893 | 0.205204 | 0.602296 | 0.577729 |
999991 | 0.716421 | 0.939543 | 0.298614 | 0.0477076 | 0.410623 | 0.211471 | 0.536963 | 0.0797999 | 0.541015 | 0.343898 |
999992 | 0.36239 | 0.76989 | 0.351988 | 0.0889811 | 0.689272 | 0.0768747 | 0.860183 | 0.0105461 | 0.769641 | 0.511566 |
999993 | 0.875264 | 0.91345 | 0.0497442 | 0.132398 | 0.501189 | 0.974346 | 0.733187 | 0.0699565 | 0.0915653 | 0.492524 |
999994 | 0.548949 | 0.127222 | 0.188089 | 0.486114 | 0.163262 | 0.518948 | 0.600134 | 0.23425 | 0.120563 | 0.484254 |
999995 | 0.00522245 | 0.332911 | 0.332718 | 0.166368 | 0.905102 | 0.660736 | 0.15789 | 0.70676 | 0.379854 | 0.407764 |
999996 | 0.612489 | 0.967301 | 0.379855 | 0.147552 | 0.662538 | 0.847811 | 0.954871 | 0.116529 | 0.907558 | 0.337808 |
999997 | 0.774333 | 0.5798 | 0.745301 | 0.7291 | 0.490831 | 0.730082 | 0.964367 | 0.830893 | 0.309956 | 0.728045 |
999998 | 0.25082 | 0.490147 | 0.227523 | 0.871503 | 0.946821 | 0.19416 | 0.198351 | 0.913674 | 0.50724 | 0.0972831 |
999999 | 0.060027 | 0.375635 | 0.558574 | 0.296796 | 0.297386 | 0.747957 | 0.0859935 | 0.996066 | 0.351147 | 0.0207747 |
1000000 | 0.040967 | 0.316655 | 0.0913896 | 0.469002 | 0.44254 | 0.0260206 | 0.739715 | 0.796013 | 0.837032 | 0.0939526 |
@time map(sum, eachrow(df2));
2.677138 seconds (60.11 M allocations: 1.056 GiB, 27.30% gc time, 3.70% compilation time)
@time map(sum, eachrow(df2));
1.773191 seconds (59.99 M allocations: 1.050 GiB, 5.91% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
0.348400 seconds (1.32 M allocations: 72.984 MiB, 4.03% gc time, 94.13% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
0.006175 seconds (25 allocations: 7.631 MiB)
as you can see - this time it is much faster to iterate a type stable container
still you might want to use the select
syntax, which is optimized for such reductions:
this includes compilation time
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.554322 seconds (756.88 k allocations: 46.117 MiB, 98.88% compilation time: 95% of which was recompilation)
1000000-element Vector{Float64}:
4.844991690000428
6.086497806255862
4.798747007881859
6.126779894498693
4.9952461619646815
6.028895513503592
5.83094213599793
5.170850614032696
4.617145262159551
4.891002752222908
⋮
4.4913317950278175
4.833623718382274
3.471786429775383
4.055326374105319
5.9343112062272665
6.882706645838512
4.697521607782131
3.790353847411509
3.8532860025938884
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.004271 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
4.844991690000428
6.086497806255862
4.798747007881859
6.126779894498693
4.9952461619646815
6.028895513503592
5.83094213599793
5.170850614032696
4.617145262159551
4.891002752222908
⋮
4.4913317950278175
4.833623718382274
3.471786429775383
4.055326374105319
5.9343112062272665
6.882706645838512
4.697521607782131
3.790353847411509
3.8532860025938884
This notebook was generated using Literate.jl.