Possible pitfalls#

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.21130.3083080.05664070.8282480.0765941
20.7946450.9980690.7704120.6534070.168957
30.1394270.3484790.1236090.1995570.735428

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for a DataFrame#

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost in wide tables,

  • but Tables.namedtupleiterator for fast execution in tall tables

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.1261730.913910.2308860.460310.08006060.06960250.05612020.8936390.8944010.718733
20.6609180.9786770.6113230.213810.3280330.9561030.5877960.1277990.4065340.986028
30.755060.7102770.6535720.9460840.6288710.7272770.6355490.566630.3975880.641642
40.9699840.678060.9928920.8941940.3933720.03372880.1507220.06815560.3792990.678804
50.2665340.835720.3046040.0007717090.6490480.2792210.2044790.5989830.6748050.521239
60.8595020.3712160.6260170.6582220.5347730.4811190.02692190.6779360.1164610.00703477
70.9563220.4397850.3287890.6188540.1049630.2167930.5369870.2484220.8988590.0577803
80.2850740.9655780.341130.1463010.2796130.4413330.02844730.6587410.1153770.482112
90.01837110.1224340.3743190.3415990.1073950.9806390.7367080.8063910.7481250.71007
100.3809760.604990.837080.2850920.4200960.6665040.249880.2112870.2450340.654536
110.624050.6522090.7609290.6066180.1113830.106510.09042310.6669530.8584320.778396
120.3893060.9508270.8691310.2049240.8300920.7164510.2993980.450550.8111710.486584
130.3036440.4114250.9450220.8388540.4854480.03757790.2349650.7677820.8354080.882997
9999890.5735980.605170.2083630.1135620.3756770.4075890.0979190.5104950.9656610.687687
9999900.08258450.2744690.04388930.86780.1008870.7158270.7292740.6319560.4927340.269479
9999910.6869460.5931470.8728320.3533510.3377310.230530.165540.3640140.09247620.380706
9999920.6000880.01943450.3617810.1708430.3738470.09262080.4320990.6297150.7285480.753648
9999930.849240.6452480.04076210.4042810.3832470.2283720.746920.5358740.967910.59859
9999940.8743220.2413120.7230020.2082050.5457690.6849530.9053420.344490.6991870.377778
9999950.1789790.2439520.8885870.9632890.3665920.687560.331750.1037890.3782610.350398
9999960.8450580.481910.4087750.9043830.8465160.4938560.03447120.5813950.3834920.159446
9999970.7103760.8154440.218950.3130380.6683720.5936560.4948180.1401160.6683640.598582
9999980.9392860.5640920.4036750.9248560.4403220.0114950.8493630.931450.9325040.249018
9999990.5829410.6728180.8487910.3151120.6369340.4371680.3249790.7299840.2025970.554824
10000000.112360.6744450.4114370.1758470.2492220.7745720.9010690.4019760.1227090.554017
@time map(sum, eachrow(df2));
  2.468679 seconds (60.12 M allocations: 1.056 GiB, 21.56% gc time, 4.90% compilation time)
@time map(sum, eachrow(df2));
  1.765101 seconds (59.99 M allocations: 1.050 GiB, 6.20% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.339066 seconds (1.22 M allocations: 66.508 MiB, 4.27% gc time, 94.07% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.005196 seconds (21 allocations: 7.631 MiB)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.552321 seconds (711.57 k allocations: 43.126 MiB, 98.61% compilation time: 94% of which was recompilation)
1000000-element Vector{Float64}:
 4.443835677199177
 5.857019558194336
 6.662549653640113
 5.2392115580311875
 4.335405396831737
 4.359202661825034
 4.407556880885516
 3.7437040312173777
 4.946051256166412
 4.5554762944531015
 ⋮
 4.162624520064019
 5.400444138071531
 5.6043599690171435
 4.4931574135086505
 5.139301007663957
 5.221714755064596
 6.2460613897101105
 5.306147866526345
 4.377654217174633

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.006508 seconds (123 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
 4.443835677199177
 5.857019558194336
 6.662549653640113
 5.2392115580311875
 4.335405396831737
 4.359202661825034
 4.407556880885516
 3.7437040312173777
 4.946051256166412
 4.5554762944531015
 ⋮
 4.162624520064019
 5.400444138071531
 5.6043599690171435
 4.4931574135086505
 5.139301007663957
 5.221714755064596
 6.2460613897101105
 5.306147866526345
 4.377654217174633

This notebook was generated using Literate.jl.