Possible pitfalls#
using DataFrames
using BenchmarkTools
Know what is copied when creating a DataFrame#
x = DataFrame(rand(3, 5), :auto)
| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | Float64 | |
| 1 | 0.213634 | 0.976066 | 0.764957 | 0.23689 | 0.586611 |
| 2 | 0.257904 | 0.727199 | 0.125993 | 0.175773 | 0.556361 |
| 3 | 0.567658 | 0.580706 | 0.227893 | 0.628414 | 0.325745 |
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in ncol(x))
true
the same when creating data frames using kwarg syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
| Row | x | y |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using copycols=false keyword argument in functions.
df = DataFrame(x=x, y=y, copycols=false)
| Row | x | y |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
now it is the same
y === df.y
true
not the same object
select(df, :y)[!, 1] === y
false
the same object
select(df, :y, copycols=false)[!, 1] === y
true
Do not modify the parent of GroupedDataFrame or view#
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
GroupedDataFrame with 2 groups based on key: id
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 2 | 1 |
| 2 | 2 | 3 |
| 3 | 1 | 5 |
⋮
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 2 | 2 |
| 2 | 2 | 4 |
| 3 | 2 | 6 |
s = view(x, 5:6, :)
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 5 |
| 2 | 2 | 6 |
delete!(x, 3:6)
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 2 | 1 |
| 2 | 2 | 2 |
This is an error
s ## Will return BoundsError
Single column selection for a DataFrame#
Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | Int64 | |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |
x[1, 1] = 100
display(x)
| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | Int64 | |
| 1 | 100 | 100 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |
When iterating rows of a data frame#
use
eachrowto avoid compilation cost in wide tables,but
Tables.namedtupleiteratorfor fast execution in tall tables
The table below is tall:
df2 = DataFrame(rand(10^6, 10), :auto)
| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
| 1 | 0.22214 | 0.194725 | 0.787694 | 0.79179 | 0.577703 | 0.148028 | 0.090797 | 0.518032 | 0.430776 | 0.820852 |
| 2 | 0.797788 | 0.482724 | 0.104816 | 0.98981 | 0.280418 | 0.973437 | 0.171954 | 0.66921 | 0.489958 | 0.907005 |
| 3 | 0.830344 | 0.166873 | 0.0119676 | 0.38869 | 0.112336 | 0.675906 | 0.554439 | 0.888163 | 0.881613 | 0.348705 |
| 4 | 0.0130943 | 0.702295 | 0.417558 | 0.824496 | 0.271467 | 0.293717 | 0.655974 | 0.78709 | 0.66586 | 0.592221 |
| 5 | 0.993136 | 0.262433 | 0.132242 | 0.850283 | 0.521659 | 0.187733 | 0.334311 | 0.859878 | 0.534288 | 0.40397 |
| 6 | 0.500378 | 0.22558 | 0.647151 | 0.315375 | 0.647435 | 0.307202 | 0.382302 | 0.370105 | 0.425099 | 0.642584 |
| 7 | 0.429387 | 0.960168 | 0.128584 | 0.644386 | 0.539359 | 0.230073 | 0.822344 | 0.187343 | 0.768617 | 0.834613 |
| 8 | 0.494793 | 0.0896164 | 0.0811292 | 0.32082 | 0.12624 | 0.50482 | 0.545125 | 0.556587 | 0.822893 | 0.1666 |
| 9 | 0.776773 | 0.896758 | 0.119333 | 0.687507 | 0.92007 | 0.929793 | 0.0857724 | 0.595789 | 0.850245 | 0.861515 |
| 10 | 0.918536 | 0.966107 | 0.274968 | 0.801825 | 0.697714 | 0.295892 | 0.733026 | 0.654443 | 0.474951 | 0.162668 |
| 11 | 0.682397 | 0.590592 | 0.269285 | 0.845926 | 0.733575 | 0.296264 | 0.716101 | 0.660762 | 0.210495 | 0.244092 |
| 12 | 0.692736 | 0.363891 | 0.196404 | 0.479814 | 0.0263628 | 0.857153 | 0.770087 | 0.184756 | 0.944424 | 0.294371 |
| 13 | 0.00104153 | 0.799787 | 0.322529 | 0.274998 | 0.334355 | 0.00502477 | 0.543046 | 0.0934528 | 0.652469 | 0.502395 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 999989 | 0.38893 | 0.488278 | 0.0385544 | 0.495639 | 0.241208 | 0.580943 | 0.478893 | 0.49156 | 0.0338644 | 0.474437 |
| 999990 | 0.273409 | 0.729579 | 0.672658 | 0.306717 | 0.141573 | 0.619027 | 0.934958 | 0.498666 | 0.992331 | 0.96703 |
| 999991 | 0.914382 | 0.977408 | 0.0466257 | 0.353187 | 0.721385 | 0.220179 | 0.39309 | 0.243458 | 0.474306 | 0.423111 |
| 999992 | 0.0925393 | 0.0492193 | 0.497661 | 0.242698 | 0.201455 | 0.650363 | 0.908122 | 0.428147 | 0.428148 | 0.210207 |
| 999993 | 0.764645 | 0.948247 | 0.104478 | 0.823475 | 0.451709 | 0.332793 | 0.736104 | 0.991225 | 0.81367 | 0.507504 |
| 999994 | 0.227822 | 0.368746 | 0.951459 | 0.568322 | 0.3037 | 0.487387 | 0.152092 | 0.00259459 | 0.699173 | 0.49836 |
| 999995 | 0.862975 | 0.145 | 0.519696 | 0.812599 | 0.65848 | 0.649833 | 0.44456 | 0.630177 | 0.792391 | 0.168679 |
| 999996 | 0.313742 | 0.269196 | 0.0392782 | 0.90347 | 0.961205 | 0.743785 | 0.265006 | 0.513281 | 0.494936 | 0.768261 |
| 999997 | 0.347113 | 0.51485 | 0.293376 | 0.885858 | 0.947911 | 0.697525 | 0.369732 | 0.535832 | 0.211851 | 0.660482 |
| 999998 | 0.357896 | 0.573182 | 0.00617545 | 0.632811 | 0.579779 | 0.960782 | 0.888766 | 0.496995 | 0.384513 | 0.180331 |
| 999999 | 0.355143 | 0.69701 | 0.755443 | 0.18112 | 0.732943 | 0.569134 | 0.849911 | 0.443431 | 0.532987 | 0.716891 |
| 1000000 | 0.509224 | 0.903955 | 0.359278 | 0.178076 | 0.385685 | 0.0988106 | 0.885855 | 0.360228 | 0.446265 | 0.77072 |
@time map(sum, eachrow(df2));
2.454178 seconds (60.12 M allocations: 1.056 GiB, 17.63% gc time, 4.70% compilation time)
@time map(sum, eachrow(df2));
1.833568 seconds (59.99 M allocations: 1.050 GiB, 6.03% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
0.303801 seconds (1.22 M allocations: 66.451 MiB, 98.32% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
0.004991 seconds (21 allocations: 7.631 MiB)
as you can see - this time it is much faster to iterate a type stable container
still you might want to use the select syntax, which is optimized for such reductions:
this includes compilation time
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.483571 seconds (652.93 k allocations: 40.217 MiB, 2.73% gc time, 95.90% compilation time: 95% of which was recompilation)
1000000-element Vector{Float64}:
4.582538267133092
5.86712127030766
4.85903797859344
5.223773013475111
5.07993464728864
4.4632116807439814
5.544873933254496
3.7086242892505434
6.723556847831017
5.98012950282477
⋮
3.70856059429051
6.473850788533407
4.2596556715720215
5.684389196675713
5.272159975659073
5.46453064264186
5.061230203019331
5.834012167127027
4.898097576410764
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.006329 seconds (123 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
4.582538267133092
5.86712127030766
4.85903797859344
5.223773013475111
5.07993464728864
4.4632116807439814
5.544873933254496
3.7086242892505434
6.723556847831017
5.98012950282477
⋮
3.70856059429051
6.473850788533407
4.2596556715720215
5.684389196675713
5.272159975659073
5.46453064264186
5.061230203019331
5.834012167127027
4.898097576410764
This notebook was generated using Literate.jl.