Possible pitfalls#

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.6279660.7807460.2160540.498190.740574
20.9826440.8637590.5215770.7914250.916146
30.9319230.662910.429170.7785590.110755

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for a DataFrame#

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost in wide tables,

  • but Tables.namedtupleiterator for fast execution in tall tables

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.8274930.5226780.305870.6974310.3565740.1473480.1655740.352310.7355410.734174
20.3340830.7210480.6784310.3380550.1624620.963680.982220.3104560.7880590.808003
30.2494760.1853050.3864220.6700010.430290.3005580.5675260.6041220.9930320.412015
40.3312620.2418080.1319260.478970.9399510.8662490.7135220.855240.626980.94087
50.3283090.9134040.9970250.6311370.05833950.5350690.1159820.2616610.7716390.38268
60.494890.9358540.8492810.06517150.7021730.9732270.6988050.9110430.3446350.0538166
70.6392020.110870.3803680.991730.7328730.6377350.1300030.746660.8258740.635628
80.5946150.9522210.1187980.8904030.7311670.1291710.1903580.3064050.5986770.659036
90.3268360.8673870.7794460.05140190.6684670.08956850.8259310.3135930.5062750.18824
100.4519440.4506820.6950830.2799710.5189910.4817370.5100530.2352630.7552140.512064
110.3915530.4221480.6790260.8154030.1203330.9326110.7843850.1684650.6628670.567092
120.3942680.5107680.8689310.4016590.7014430.7012510.2147950.6324340.3927020.716703
130.5745410.5099770.7333120.3571770.2042790.74210.8379050.2221280.9600750.405046
9999890.01754850.4804870.8922620.1309950.6704850.001714930.1201220.2995050.9513170.964039
9999900.8917150.6475220.5617250.8338240.2996960.9813910.7478930.2052040.6022960.577729
9999910.7164210.9395430.2986140.04770760.4106230.2114710.5369630.07979990.5410150.343898
9999920.362390.769890.3519880.08898110.6892720.07687470.8601830.01054610.7696410.511566
9999930.8752640.913450.04974420.1323980.5011890.9743460.7331870.06995650.09156530.492524
9999940.5489490.1272220.1880890.4861140.1632620.5189480.6001340.234250.1205630.484254
9999950.005222450.3329110.3327180.1663680.9051020.6607360.157890.706760.3798540.407764
9999960.6124890.9673010.3798550.1475520.6625380.8478110.9548710.1165290.9075580.337808
9999970.7743330.57980.7453010.72910.4908310.7300820.9643670.8308930.3099560.728045
9999980.250820.4901470.2275230.8715030.9468210.194160.1983510.9136740.507240.0972831
9999990.0600270.3756350.5585740.2967960.2973860.7479570.08599350.9960660.3511470.0207747
10000000.0409670.3166550.09138960.4690020.442540.02602060.7397150.7960130.8370320.0939526
@time map(sum, eachrow(df2));
  2.677138 seconds (60.11 M allocations: 1.056 GiB, 27.30% gc time, 3.70% compilation time)
@time map(sum, eachrow(df2));
  1.773191 seconds (59.99 M allocations: 1.050 GiB, 5.91% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.348400 seconds (1.32 M allocations: 72.984 MiB, 4.03% gc time, 94.13% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.006175 seconds (25 allocations: 7.631 MiB)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.554322 seconds (756.88 k allocations: 46.117 MiB, 98.88% compilation time: 95% of which was recompilation)
1000000-element Vector{Float64}:
 4.844991690000428
 6.086497806255862
 4.798747007881859
 6.126779894498693
 4.9952461619646815
 6.028895513503592
 5.83094213599793
 5.170850614032696
 4.617145262159551
 4.891002752222908
 ⋮
 4.4913317950278175
 4.833623718382274
 3.471786429775383
 4.055326374105319
 5.9343112062272665
 6.882706645838512
 4.697521607782131
 3.790353847411509
 3.8532860025938884

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.004271 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
 4.844991690000428
 6.086497806255862
 4.798747007881859
 6.126779894498693
 4.9952461619646815
 6.028895513503592
 5.83094213599793
 5.170850614032696
 4.617145262159551
 4.891002752222908
 ⋮
 4.4913317950278175
 4.833623718382274
 3.471786429775383
 4.055326374105319
 5.9343112062272665
 6.882706645838512
 4.697521607782131
 3.790353847411509
 3.8532860025938884

This notebook was generated using Literate.jl.