Possible pitfalls#
using DataFrames
using BenchmarkTools
Know what is copied when creating a DataFrame#
x = DataFrame(rand(3, 5), :auto)
Row | x1 | x2 | x3 | x4 | x5 |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.199274 | 0.640435 | 0.0898202 | 0.368331 | 0.545752 |
2 | 0.643667 | 0.720819 | 0.529834 | 0.573098 | 0.667158 |
3 | 0.607682 | 0.471876 | 0.998542 | 0.710718 | 0.518653 |
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in ncol(x))
true
the same when creating data frames using kwarg
syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using copycols=false keyword argument in functions.
df = DataFrame(x=x, y=y, copycols=false)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
now it is the same
y === df.y
true
not the same object
select(df, :y)[!, 1] === y
false
the same object
select(df, :y, copycols=false)[!, 1] === y
true
Do not modify the parent of GroupedDataFrame
or view#
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
GroupedDataFrame with 2 groups based on key: id
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 3 |
3 | 1 | 5 |
⋮
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 2 |
2 | 2 | 4 |
3 | 2 | 6 |
s = view(x, 5:6, :)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 1 | 5 |
2 | 2 | 6 |
delete!(x, 3:6)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 2 |
This is an error
s ## Will return BoundsError
Single column selection for a DataFrame
#
Single column selection for a DataFrame
creates aliases with ! and getproperty
syntax and copies with :
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 1 | 1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
x[1, 1] = 100
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 100 | 100 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
When iterating rows of a data frame#
use
eachrow
to avoid compilation cost in wide tables,but
Tables.namedtupleiterator
for fast execution in tall tables
The table below is tall:
df2 = DataFrame(rand(10^6, 10), :auto)
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
---|---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.950021 | 0.493981 | 0.841407 | 0.346694 | 0.432968 | 0.722143 | 0.708738 | 0.248226 | 0.981028 | 0.595189 |
2 | 0.510748 | 0.147537 | 0.413281 | 0.977355 | 0.101247 | 0.807325 | 0.800398 | 0.78776 | 0.0911609 | 0.710667 |
3 | 0.0741849 | 0.79479 | 0.274569 | 0.897213 | 0.298997 | 0.784028 | 0.448836 | 0.65234 | 0.468197 | 0.77791 |
4 | 0.648084 | 0.734042 | 0.580993 | 0.185687 | 0.250022 | 0.50913 | 0.671657 | 0.168917 | 0.256192 | 0.233248 |
5 | 0.223446 | 0.492996 | 0.792187 | 0.852423 | 0.81848 | 0.864788 | 0.0991924 | 0.322516 | 0.854278 | 0.999516 |
6 | 0.67553 | 0.436438 | 0.0680783 | 0.18672 | 0.659893 | 0.562186 | 0.743273 | 0.465054 | 0.306696 | 0.297225 |
7 | 0.243176 | 0.494016 | 0.812506 | 0.873523 | 0.721469 | 0.928953 | 0.0549756 | 0.1267 | 0.216617 | 0.192617 |
8 | 0.797813 | 0.435235 | 0.94572 | 0.417295 | 0.653296 | 0.657442 | 0.982349 | 0.563264 | 0.0534572 | 0.77055 |
9 | 0.110185 | 0.803692 | 0.530527 | 0.888656 | 0.219044 | 0.348585 | 0.546133 | 0.601138 | 0.873774 | 0.10204 |
10 | 0.842785 | 0.242754 | 0.0112607 | 0.999743 | 0.237779 | 0.000416126 | 0.0839501 | 0.335453 | 0.721537 | 0.857107 |
11 | 0.809588 | 0.771386 | 0.761612 | 0.073205 | 0.0317563 | 0.454098 | 0.868367 | 0.812988 | 0.269912 | 0.954401 |
12 | 0.701799 | 0.385671 | 0.835105 | 0.212708 | 0.788724 | 0.371308 | 0.0329072 | 0.723001 | 0.408519 | 0.476572 |
13 | 0.432829 | 0.689016 | 0.0764351 | 0.548859 | 0.863457 | 0.619049 | 0.057559 | 0.122305 | 0.229387 | 0.0674775 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
999989 | 0.533383 | 0.855802 | 0.687371 | 0.708839 | 0.744857 | 0.966548 | 0.0824002 | 0.733476 | 0.763916 | 0.335373 |
999990 | 0.691378 | 0.445439 | 0.80811 | 0.365229 | 0.119191 | 0.330685 | 0.223937 | 0.660535 | 0.370578 | 0.472301 |
999991 | 0.251729 | 0.17505 | 0.993557 | 0.358303 | 0.0209346 | 0.980397 | 0.350851 | 0.681646 | 0.46114 | 0.145045 |
999992 | 0.355679 | 0.271324 | 0.508806 | 0.960068 | 0.251722 | 0.752553 | 0.639803 | 0.724803 | 0.359463 | 0.433033 |
999993 | 0.749344 | 0.038648 | 0.539718 | 0.541698 | 0.891388 | 0.274497 | 0.390514 | 0.955952 | 0.164919 | 0.366685 |
999994 | 0.304581 | 0.401609 | 0.675038 | 0.506986 | 0.94345 | 0.61174 | 0.534913 | 0.60762 | 0.189458 | 0.229565 |
999995 | 0.550062 | 0.365073 | 0.916468 | 0.557559 | 0.27932 | 0.85866 | 0.517763 | 0.301164 | 0.693753 | 0.642684 |
999996 | 0.60784 | 0.610899 | 0.647092 | 0.713368 | 0.297371 | 0.678995 | 0.588585 | 0.396604 | 0.687318 | 0.766718 |
999997 | 0.24466 | 0.0849482 | 0.597542 | 0.90537 | 0.6473 | 0.664269 | 0.706461 | 0.0204798 | 0.626498 | 0.23971 |
999998 | 0.591317 | 0.742489 | 0.134039 | 0.318329 | 0.382009 | 0.242253 | 0.679399 | 0.835081 | 0.0353084 | 0.201794 |
999999 | 0.969696 | 0.646781 | 0.457161 | 0.902402 | 0.823323 | 0.607858 | 0.254191 | 0.319389 | 5.85587e-5 | 0.656893 |
1000000 | 0.143247 | 0.00617306 | 0.673317 | 0.613773 | 0.795874 | 0.489389 | 0.732493 | 0.422551 | 0.186894 | 0.717166 |
@time map(sum, eachrow(df2));
2.025296 seconds (60.05 M allocations: 1.053 GiB, 10.68% gc time, 2.94% compilation time)
@time map(sum, eachrow(df2));
1.717192 seconds (59.99 M allocations: 1.050 GiB, 4.65% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
0.204156 seconds (378.09 k allocations: 27.491 MiB, 95.25% compilation time: 13% of which was recompilation)
@time map(sum, Tables.namedtupleiterator(df2));
0.016708 seconds (25 allocations: 7.631 MiB, 37.84% gc time)
as you can see - this time it is much faster to iterate a type stable container
still you might want to use the select
syntax, which is optimized for such reductions:
this includes compilation time
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.058912 seconds (94.63 k allocations: 12.292 MiB, 91.79% compilation time: 86% of which was recompilation)
1000000-element Vector{Float64}:
6.320395885756943
5.347479514465469
5.4710650902659665
4.237972304596929
6.3198238435150635
4.4010928042350494
4.664552055585735
6.2764215672263
5.023772748469487
4.332785122603654
⋮
5.257255312875229
4.913362971837867
5.004960265880328
5.682505134117076
5.994789175889104
4.737239123215181
4.162017573952469
5.637752149817912
4.780876130310359
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.004646 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
6.320395885756943
5.347479514465469
5.4710650902659665
4.237972304596929
6.3198238435150635
4.4010928042350494
4.664552055585735
6.2764215672263
5.023772748469487
4.332785122603654
⋮
5.257255312875229
4.913362971837867
5.004960265880328
5.682505134117076
5.994789175889104
4.737239123215181
4.162017573952469
5.637752149817912
4.780876130310359
This notebook was generated using Literate.jl.