Possible pitfalls#

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.7357170.8395050.01543410.6755860.881035
20.421520.6808510.3952280.7273550.828121
30.6509580.8004870.2526570.5855490.112034

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for a DataFrame#

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost in wide tables,

  • but Tables.namedtupleiterator for fast execution in tall tables

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.6625330.07916480.4760850.1375270.7801120.5568260.296540.04624890.5807250.459269
20.6419190.2199750.2627940.3918270.1161990.3978340.8941280.1786640.849010.97465
30.6337470.4043360.4223980.5232080.8394930.9946380.9382890.1095160.359020.153182
40.8783390.6008680.3416190.5365430.3626450.7964990.884590.8498960.2937860.657048
50.1066580.1502210.917120.8099540.6174080.3169430.9584990.4098790.4892890.719617
60.593030.2750820.6598070.9870990.9193330.07961790.4965630.7140890.2605690.889253
70.5568240.1141740.8761160.8337030.5837830.3582280.473510.8917880.9320910.851929
80.3246520.690720.0864120.6901690.7130640.220640.5971050.3198340.1913150.122917
90.3971040.9638380.3360630.4299320.7993380.5430280.7169920.4276730.5721170.849775
100.8301950.7160720.7043080.3638820.09865130.6596770.5969320.9983890.2362940.615016
110.3283470.6673660.08997530.8680090.5793150.6336550.9315640.8154050.5264180.981884
120.8816550.4534050.1541940.5665190.3524520.9857210.7236050.8277870.8802020.767686
130.6368640.6730210.2490240.7103930.4757970.1287310.3595290.3152170.7239810.812359
9999890.9759860.4766890.05142010.9899570.8834760.3573840.2438440.9166540.4965810.0234581
9999900.07083770.1105110.5947570.6797250.3347460.2111940.6809690.1658430.2571120.459214
9999910.5790210.5394740.2000640.5942680.4623750.279470.2290690.0242840.372510.296529
9999920.1919440.09839650.2524090.7936910.3750820.9734860.8452440.07346370.7778870.722518
9999930.5427280.927410.01183510.1106010.5273830.961930.9584370.6343490.5234640.171206
9999940.4239010.7682640.8135640.6470290.7761640.7440560.3042290.24240.9054940.639531
9999950.6417970.8714260.2130850.8056630.3415250.5244310.3932260.09502470.1580320.357649
9999960.1535430.3114770.5083890.4241920.5125240.03383460.6200330.7498420.298480.116878
9999970.4944360.1264120.2953390.4723080.1108070.6394090.102460.06286020.5025380.0265686
9999980.4185330.7560710.9367420.1425120.2312820.5093760.6865320.6025060.3610230.378734
9999990.07097010.8380430.8072570.05627330.7750840.2511660.8950860.381310.6525350.614597
10000000.2594080.7473590.6161640.7497950.6221070.01360750.9590570.1920520.9587150.629737
@time map(sum, eachrow(df2));
  2.248203 seconds (60.12 M allocations: 1.056 GiB, 17.66% gc time, 4.00% compilation time)
@time map(sum, eachrow(df2));
  1.672558 seconds (59.99 M allocations: 1.050 GiB, 5.47% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.271730 seconds (1.22 M allocations: 66.497 MiB, 4.36% gc time, 93.31% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.007500 seconds (21 allocations: 7.631 MiB)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.434296 seconds (712.75 k allocations: 43.180 MiB, 97.27% compilation time: 94% of which was recompilation)
1000000-element Vector{Float64}:
 4.075031145560174
 4.9269995567112765
 5.37782807808806
 6.201831198796953
 5.49558611192314
 5.874443898489834
 6.472146151299726
 3.956827075489279
 6.035858171073009
 5.819416611082717
 ⋮
 5.104121438607423
 5.369343756593169
 6.2646325534129454
 4.4018586929999035
 3.729193960552614
 2.833138329204522
 5.023312467271604
 5.3423217101664395
 5.748001588947343

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.011397 seconds (123 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
 4.075031145560174
 4.9269995567112765
 5.37782807808806
 6.201831198796953
 5.49558611192314
 5.874443898489834
 6.472146151299726
 3.956827075489279
 6.035858171073009
 5.819416611082717
 ⋮
 5.104121438607423
 5.369343756593169
 6.2646325534129454
 4.4018586929999035
 3.729193960552614
 2.833138329204522
 5.023312467271604
 5.3423217101664395
 5.748001588947343

This notebook was generated using Literate.jl.