Possible pitfalls#

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.1992740.6404350.08982020.3683310.545752
20.6436670.7208190.5298340.5730980.667158
30.6076820.4718760.9985420.7107180.518653

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for a DataFrame#

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost in wide tables,

  • but Tables.namedtupleiterator for fast execution in tall tables

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.9500210.4939810.8414070.3466940.4329680.7221430.7087380.2482260.9810280.595189
20.5107480.1475370.4132810.9773550.1012470.8073250.8003980.787760.09116090.710667
30.07418490.794790.2745690.8972130.2989970.7840280.4488360.652340.4681970.77791
40.6480840.7340420.5809930.1856870.2500220.509130.6716570.1689170.2561920.233248
50.2234460.4929960.7921870.8524230.818480.8647880.09919240.3225160.8542780.999516
60.675530.4364380.06807830.186720.6598930.5621860.7432730.4650540.3066960.297225
70.2431760.4940160.8125060.8735230.7214690.9289530.05497560.12670.2166170.192617
80.7978130.4352350.945720.4172950.6532960.6574420.9823490.5632640.05345720.77055
90.1101850.8036920.5305270.8886560.2190440.3485850.5461330.6011380.8737740.10204
100.8427850.2427540.01126070.9997430.2377790.0004161260.08395010.3354530.7215370.857107
110.8095880.7713860.7616120.0732050.03175630.4540980.8683670.8129880.2699120.954401
120.7017990.3856710.8351050.2127080.7887240.3713080.03290720.7230010.4085190.476572
130.4328290.6890160.07643510.5488590.8634570.6190490.0575590.1223050.2293870.0674775
9999890.5333830.8558020.6873710.7088390.7448570.9665480.08240020.7334760.7639160.335373
9999900.6913780.4454390.808110.3652290.1191910.3306850.2239370.6605350.3705780.472301
9999910.2517290.175050.9935570.3583030.02093460.9803970.3508510.6816460.461140.145045
9999920.3556790.2713240.5088060.9600680.2517220.7525530.6398030.7248030.3594630.433033
9999930.7493440.0386480.5397180.5416980.8913880.2744970.3905140.9559520.1649190.366685
9999940.3045810.4016090.6750380.5069860.943450.611740.5349130.607620.1894580.229565
9999950.5500620.3650730.9164680.5575590.279320.858660.5177630.3011640.6937530.642684
9999960.607840.6108990.6470920.7133680.2973710.6789950.5885850.3966040.6873180.766718
9999970.244660.08494820.5975420.905370.64730.6642690.7064610.02047980.6264980.23971
9999980.5913170.7424890.1340390.3183290.3820090.2422530.6793990.8350810.03530840.201794
9999990.9696960.6467810.4571610.9024020.8233230.6078580.2541910.3193895.85587e-50.656893
10000000.1432470.006173060.6733170.6137730.7958740.4893890.7324930.4225510.1868940.717166
@time map(sum, eachrow(df2));
  2.025296 seconds (60.05 M allocations: 1.053 GiB, 10.68% gc time, 2.94% compilation time)
@time map(sum, eachrow(df2));
  1.717192 seconds (59.99 M allocations: 1.050 GiB, 4.65% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.204156 seconds (378.09 k allocations: 27.491 MiB, 95.25% compilation time: 13% of which was recompilation)
@time map(sum, Tables.namedtupleiterator(df2));
  0.016708 seconds (25 allocations: 7.631 MiB, 37.84% gc time)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.058912 seconds (94.63 k allocations: 12.292 MiB, 91.79% compilation time: 86% of which was recompilation)
1000000-element Vector{Float64}:
 6.320395885756943
 5.347479514465469
 5.4710650902659665
 4.237972304596929
 6.3198238435150635
 4.4010928042350494
 4.664552055585735
 6.2764215672263
 5.023772748469487
 4.332785122603654
 ⋮
 5.257255312875229
 4.913362971837867
 5.004960265880328
 5.682505134117076
 5.994789175889104
 4.737239123215181
 4.162017573952469
 5.637752149817912
 4.780876130310359

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.004646 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
 6.320395885756943
 5.347479514465469
 5.4710650902659665
 4.237972304596929
 6.3198238435150635
 4.4010928042350494
 4.664552055585735
 6.2764215672263
 5.023772748469487
 4.332785122603654
 ⋮
 5.257255312875229
 4.913362971837867
 5.004960265880328
 5.682505134117076
 5.994789175889104
 4.737239123215181
 4.162017573952469
 5.637752149817912
 4.780876130310359

This notebook was generated using Literate.jl.