Possible pitfalls#

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.1023730.7156190.4021290.6634270.79705
20.624350.4028150.07453430.9728420.811803
30.4215270.9801370.06789620.9376660.789266

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for a DataFrame#

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost in wide tables,

  • but Tables.namedtupleiterator for fast execution in tall tables

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.7660720.9586250.8320730.02549660.8477880.3140680.1625170.2648030.1001110.0251814
20.3692320.6980690.8164050.5811320.9074930.1287470.6030820.5505480.9433660.657636
30.378860.6889950.3180090.9540470.5343890.8037770.4815640.02413290.3411350.660792
40.3657750.1760990.7901040.6856680.4262120.1643620.9038280.2715550.648790.901741
50.7477570.4658380.5222570.7946170.1667020.7714830.8013060.1387680.6858830.230728
60.5906860.2348830.4038780.6110710.3887460.2813140.9445730.7295750.2952630.875675
70.06939970.4418010.7473640.756840.8290980.1726710.7972260.01233240.05251780.981163
80.6824630.7195110.8732920.1385980.227880.7040530.5969680.1237040.9297950.213875
90.7593390.6261110.6229080.5202350.6375960.6582280.06695930.9174960.7290160.827624
100.8308650.6585250.1368590.3279270.724030.1817430.6170110.6589460.9454210.543078
110.452240.9502520.5249130.6364850.8967560.5263360.3759060.1558470.1094020.700583
120.7065180.5485510.4422890.6372260.9258680.6792320.6705760.4347540.3214550.0329374
130.3972410.8366780.2122890.9948190.3795380.01088450.1982730.650960.07995980.37601
9999890.4048310.4057130.2295370.5254180.4512970.9433040.4816540.05677470.1784950.0693898
9999900.3494990.3883550.8813260.06244670.4877660.695410.749540.6234480.2012850.807722
9999910.4180720.3310130.6519610.2293470.183910.3994540.05482660.2161610.9734920.0173772
9999920.2389970.04351250.09308420.2683930.8022570.2576280.9187230.236820.3321290.936689
9999930.4384040.3852330.3366120.6775670.1711740.2531470.0933830.460.744050.948417
9999940.7871640.7135890.8588480.6151520.7127990.8901090.9152950.5101480.6158320.626102
9999950.3548890.6923220.7399870.4897570.5885580.1638980.1185980.709440.9058230.0660051
9999960.8586210.0687830.8077740.3014360.1766220.9433270.358470.07148650.6846290.593146
9999970.1680810.7725690.05571870.2724440.8689140.6892240.4401640.1858540.8494830.0965424
9999980.8057640.07570290.4529150.5793940.4339960.9922730.4363710.1282520.6765050.743283
9999990.2358390.7998970.5831110.4017530.7032550.9317810.6429780.2976590.4751440.0156555
10000000.4712030.005552740.7186450.5405620.5192710.2646050.6595820.4996960.3157770.956476
@time map(sum, eachrow(df2));
  2.248705 seconds (60.04 M allocations: 1.053 GiB, 10.63% gc time, 2.38% compilation time)
@time map(sum, eachrow(df2));
  2.415540 seconds (59.99 M allocations: 1.050 GiB, 3.73% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.295434 seconds (371.24 k allocations: 27.132 MiB, 4.34% gc time, 91.02% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.014394 seconds (25 allocations: 7.631 MiB)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.013925 seconds (6.67 k allocations: 7.969 MiB, 51.19% compilation time)
1000000-element Vector{Float64}:
 4.2967365440983585
 6.255710306108731
 5.185700498816111
 5.334134450728015
 5.325339163319068
 5.355664387891267
 4.860413103161113
 5.210139354465969
 6.365514164469623
 5.62440555470397
 ⋮
 4.128232610040115
 4.50798787807542
 7.245036793462261
 4.829276873210631
 4.864294862813554
 4.398993642137623
 5.324455470285535
 5.0870718591125135
 4.951369427152281

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.007440 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
 4.2967365440983585
 6.255710306108731
 5.185700498816111
 5.334134450728015
 5.325339163319068
 5.355664387891267
 4.860413103161113
 5.210139354465969
 6.365514164469623
 5.62440555470397
 ⋮
 4.128232610040115
 4.50798787807542
 7.245036793462261
 4.829276873210631
 4.864294862813554
 4.398993642137623
 5.324455470285535
 5.0870718591125135
 4.951369427152281

This notebook was generated using Literate.jl.