Possible pitfalls#

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.4247370.7827480.6817770.5729850.735879
20.4285840.6003950.5390970.9951620.757722
30.9605580.1672390.9262690.001015250.797032

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for a DataFrame#

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost in wide tables,

  • but Tables.namedtupleiterator for fast execution in tall tables

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.09268010.510470.01822010.2823080.6332240.2186330.9233530.5770410.6737460.59492
20.6745860.4427030.3601770.9770930.2327760.7431790.9590310.4747040.3628280.299904
30.8536720.2198660.7673450.8034460.7535850.2788930.1398470.196520.5836750.447995
40.304190.5285870.4193930.8728830.2562350.09270210.7337710.7733040.357540.112215
50.007032280.5635620.948040.5430830.4524660.601120.5341520.1697560.9722810.511011
60.02055380.4330610.1843690.3676080.5335160.5429160.007052350.4701520.194330.00416603
70.6883790.4258210.7907280.8294730.3746330.7853470.02987650.9394680.07465540.679708
80.340460.301850.5373060.3506570.8824540.4603460.7800010.5511890.7281830.550581
90.02436110.2863750.355740.6224350.7541080.919730.3163340.02693210.266080.231496
100.3732690.5085870.3458880.1904970.7549620.3035470.6606260.1926160.3850120.216885
110.7206110.3593790.7888250.3952710.9735750.1325640.7237730.8291810.5901050.359539
120.3118560.4749050.3107130.7697610.5338770.7583570.6615140.178680.9635390.22141
130.1175660.7870110.2713360.993510.4701420.1914220.9360340.8251180.8640330.42115
9999890.09626060.5985860.7324440.4893720.2410660.5125230.5275850.7111330.9234770.425483
9999900.1597210.8028380.8043830.2954110.3487380.06878830.04682170.8839750.3766130.0351025
9999910.9013580.7349390.7147230.608590.1113390.6038860.7052140.05933790.2126190.24508
9999920.2810530.7713660.5433460.8695480.1682240.1415570.4095220.7605170.1700610.693612
9999930.09645150.5508410.8012890.3511990.2292170.5010730.4823010.5029820.8784440.671248
9999940.4539080.2526020.5137620.4494090.6044620.50660.7655490.911160.6431850.986821
9999950.06678290.01538030.2560340.3300130.03461920.5959640.8445990.8327140.9796740.359995
9999960.6557380.5710180.8717060.5149610.993570.5806910.7625470.3079670.3377290.940187
9999970.1181390.4719180.8578630.3348450.7544090.9557120.3056990.1312030.08457490.235298
9999980.8213920.7094290.9503660.7929380.2044050.4591650.9240760.532530.1274520.165834
9999990.863030.1358560.4831790.7658090.05297080.4962760.2067960.6734710.5830070.803619
10000000.6306630.7732730.1445430.1367880.9780630.01670450.8697170.08576670.4780270.32786
@time map(sum, eachrow(df2));
  2.178086 seconds (60.08 M allocations: 1.055 GiB, 15.60% gc time, 3.73% compilation time)
@time map(sum, eachrow(df2));
  1.862223 seconds (59.99 M allocations: 1.050 GiB, 6.29% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.233865 seconds (428.99 k allocations: 30.089 MiB, 4.03% gc time, 91.72% compilation time: 27% of which was recompilation)
@time map(sum, Tables.namedtupleiterator(df2));
  0.009578 seconds (25 allocations: 7.631 MiB)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.164873 seconds (684.31 k allocations: 41.595 MiB, 96.42% compilation time: 97% of which was recompilation)
1000000-element Vector{Float64}:
 4.524594656468376
 5.526979192486358
 5.044843910022011
 4.450819836455767
 5.302503012392829
 2.7577227261336805
 5.618088460709214
 5.483027953951166
 3.8035897887299397
 3.93188928765376
 ⋮
 4.808806298265378
 5.065044624374165
 6.087458020724531
 4.3157769856357
 6.536114733186674
 4.249661766586929
 5.687587649537533
 5.064013304031212
 4.44140391850883

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.005210 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
 4.524594656468376
 5.526979192486358
 5.044843910022011
 4.450819836455767
 5.302503012392829
 2.7577227261336805
 5.618088460709214
 5.483027953951166
 3.8035897887299397
 3.93188928765376
 ⋮
 4.808806298265378
 5.065044624374165
 6.087458020724531
 4.3157769856357
 6.536114733186674
 4.249661766586929
 5.687587649537533
 5.064013304031212
 4.44140391850883

This notebook was generated using Literate.jl.