Possible pitfalls#

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.5293340.05975510.2357950.5019420.958109
20.4840440.0471420.1144320.04181980.913857
30.5004290.1567640.07334380.8432760.986249

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for a DataFrame#

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost in wide tables,

  • but Tables.namedtupleiterator for fast execution in tall tables

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.8124250.9538750.2896710.8547660.5098180.1404640.086070.5206950.2056970.321225
20.2085840.0009661540.9028220.4231680.4733730.1570540.7937690.9577950.8009430.664851
30.0026940.3269250.03682520.1879280.1266270.7765430.4021020.4693240.07285870.844602
40.5690280.315960.8486040.8124090.454890.865720.3456730.2367280.2245880.646519
50.06310410.2032450.7016980.3035830.6747970.5971160.6067770.4519580.391080.00379739
60.2470210.8093990.7601110.5989130.2729180.5972620.493110.5208760.4473510.292064
70.2679730.2119990.7717510.6398110.7843830.1211730.5635570.5494570.05451170.745306
80.1086550.1101020.9537030.1681990.9766530.02769270.6432220.9635990.3043110.678014
90.6935020.548230.202620.2102660.5617890.9007430.5280270.6151250.3135640.557392
100.6611050.6617110.8205270.4404060.4347060.4297310.06089580.8460890.4222750.17791
110.2346780.9146110.7158130.0292490.2570560.1223830.2622630.004619470.1461430.0903481
120.1932910.3427590.9146050.4188260.1358180.4847620.1911790.1888630.3647380.21938
130.2423510.3273460.4480750.5609530.5926120.05035230.8939150.8353760.647980.246143
9999890.6614720.6357820.7923540.7610770.03087130.0995540.2402590.353990.3143550.686009
9999900.07757620.8455580.7429220.4642360.6893360.5483130.8078050.6710550.09155160.833302
9999910.09763720.3614250.7143780.3414770.09113920.4030610.7238640.7773920.5953270.25862
9999920.5012120.2725610.09738080.3313440.06149380.5198770.8393990.7650830.6886720.150719
9999930.2737390.4017930.3449020.6939970.5959190.837130.9314850.04673570.246710.151542
9999940.8321690.2673760.9256870.4186440.04132680.7918280.7322670.1974720.7784550.226904
9999950.06606830.4286590.3408040.7478410.2791430.1821180.6986980.5701140.9188680.789216
9999960.4667660.643380.7735180.7384440.9630.5213750.1004830.6565390.820010.324571
9999970.9075740.598360.8880560.3780570.6063510.02490710.07689920.5315750.6210980.440302
9999980.6128070.4160590.7281950.2715980.5710330.2682680.2302460.6100690.4745850.931925
9999990.7631490.9232410.4284080.992990.3772820.435880.9809520.005615840.657380.105073
10000000.6456710.4617470.280930.714160.383650.7860390.1623440.5417350.139260.988997
@time map(sum, eachrow(df2));
  1.934170 seconds (60.11 M allocations: 1.056 GiB, 6.28% gc time, 6.29% compilation time)
@time map(sum, eachrow(df2));
  2.244732 seconds (59.99 M allocations: 1.050 GiB, 21.20% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.358052 seconds (1.32 M allocations: 73.135 MiB, 5.26% gc time, 98.21% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.006158 seconds (25 allocations: 7.631 MiB)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.586411 seconds (662.47 k allocations: 41.440 MiB, 98.79% compilation time: 94% of which was recompilation)
1000000-element Vector{Float64}:
 4.694706502313453
 5.383324647550716
 3.2464293812878067
 5.320117297804017
 3.9971556695645343
 5.039026681750683
 4.709921099884204
 4.93415120764944
 5.131257550693444
 4.955355792546958
 ⋮
 4.227741340484659
 4.523953769215137
 5.2121280936199845
 5.021528576928276
 6.008086254445894
 5.073179235758298
 5.1147852166724865
 5.669970336464331
 5.1045325666422725

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.005626 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
 4.694706502313453
 5.383324647550716
 3.2464293812878067
 5.320117297804017
 3.9971556695645343
 5.039026681750683
 4.709921099884204
 4.93415120764944
 5.131257550693444
 4.955355792546958
 ⋮
 4.227741340484659
 4.523953769215137
 5.2121280936199845
 5.021528576928276
 6.008086254445894
 5.073179235758298
 5.1147852166724865
 5.669970336464331
 5.1045325666422725

This notebook was generated using Literate.jl.