Possible pitfalls#

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.5608910.5664090.6047430.6069310.761472
20.2556470.6368520.454870.1998890.898372
30.3009350.3109880.6738430.5125020.585906

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for a DataFrame#

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost in wide tables,

  • but Tables.namedtupleiterator for fast execution in tall tables

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.1732590.8684290.1994620.1449890.1402050.2468030.2466320.9423530.9266680.911074
20.1239540.9960350.4849350.3922190.2030260.4592650.1775350.3430150.4847880.790101
30.1547810.1866440.7997310.9975060.9550730.4451790.6191250.03357540.6072190.865513
40.8160310.8689590.7709710.5816480.9131180.2410690.1851140.9599240.3295310.144141
50.8145990.8011110.9560550.2481210.2411390.1508230.4045260.3221520.1657370.185206
60.5824230.1197310.6072850.06008930.8272070.4679290.8552280.1369350.8669420.495018
70.5923450.287290.5302490.6712630.3870010.4783870.2701470.005836830.1694450.92732
80.5663050.4669440.4339180.5096950.1255680.5401850.03450330.2151330.2473760.742053
90.7110110.1569120.9664280.3734720.7794520.1714570.96460.2054370.06365270.92508
100.47330.6396310.7586020.3422120.4462660.9408150.3545160.8461750.2554370.04738
110.368070.9449250.9041340.6484870.8947070.4232420.9259460.7279680.6581930.688444
120.6044380.4725740.9693520.8318010.4547890.8111330.7187550.7336650.36990.438884
130.8838010.3988760.9442760.4664760.1022240.772860.7339940.6527450.8198840.883364
9999890.8900340.06419430.4991850.1967250.4200270.09120850.006829040.531080.6377680.533981
9999900.8107060.4037680.08457220.6645770.5219970.5805670.4155570.7021560.2041750.965622
9999910.4560270.773770.5981990.1223550.4488710.7924570.8273980.1481640.280310.158536
9999920.8462660.6794150.2775020.945220.9086340.02682560.5738180.7940650.3960070.0417267
9999930.777440.3990990.1279730.3044210.7108790.6356670.3241410.7281540.07270410.606665
9999940.7564230.2727290.132720.01421150.6147260.1155510.6194450.4387470.4607390.663496
9999950.2879390.8812260.3509660.5405590.705470.5443790.05500660.714170.1346710.81261
9999960.8496270.7894370.1279280.5074350.3192540.9786550.006348360.8033880.7837440.67073
9999970.1189570.2541070.8546780.8719650.7473610.9111290.1528090.3888750.5383470.296838
9999980.8596810.5478560.520480.07146230.8497220.6906080.3471150.81850.3110470.280407
9999990.695920.1310870.5474440.2612410.7417310.7965960.2232150.7503960.619340.495444
10000000.3172470.6327870.2649640.949060.8771840.6321650.2542150.03376840.03696220.905203
@time map(sum, eachrow(df2));
  3.542418 seconds (60.11 M allocations: 1.056 GiB, 9.67% gc time, 3.42% compilation time)
@time map(sum, eachrow(df2));
  3.169916 seconds (59.99 M allocations: 1.050 GiB, 4.13% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.249135 seconds (361.13 k allocations: 26.568 MiB, 4.86% gc time, 85.77% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.012652 seconds (20 allocations: 7.630 MiB)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.011790 seconds (6.67 k allocations: 7.969 MiB, 50.48% compilation time)
1000000-element Vector{Float64}:
 4.799871994364215
 4.454871793224532
 5.664348582520293
 5.810507372086148
 4.289469346132967
 5.0187854176762725
 4.3192849190286875
 3.8816819363205237
 5.317501511565184
 5.104333118795934
 ⋮
 5.489477995118583
 4.687143261221065
 4.088786137198144
 5.026998053622702
 5.8365470064870415
 5.135065469480588
 5.29687884751163
 5.262414871594562
 4.90355610068893

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.006313 seconds (123 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
 4.799871994364215
 4.454871793224532
 5.664348582520293
 5.810507372086148
 4.289469346132967
 5.0187854176762725
 4.3192849190286875
 3.8816819363205237
 5.317501511565184
 5.104333118795934
 ⋮
 5.489477995118583
 4.687143261221065
 4.088786137198144
 5.026998053622702
 5.8365470064870415
 5.135065469480588
 5.29687884751163
 5.262414871594562
 4.90355610068893

This notebook was generated using Literate.jl.