Possible pitfalls#

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.9473960.5949140.3183330.2913590.591581
20.276210.8566750.3181420.7953970.151377
30.785050.05480610.8646470.7082650.775993

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for a DataFrame#

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost in wide tables,

  • but Tables.namedtupleiterator for fast execution in tall tables

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5381650.4776190.638650.1149410.5918690.9193260.4603260.4416070.05791520.00357446
20.5331380.3952250.9366050.02969930.2399570.310990.2932990.9506940.1020880.362718
30.1280260.0370750.8844370.1509830.4807510.8193750.4889050.05272640.085250.918672
40.06380110.6765450.1071590.3441570.9116090.8525030.5799010.1290510.6653240.439404
50.4099030.1179160.2419070.6760040.6979720.1695960.7205080.6532890.6780770.266621
60.5183440.9605370.6516090.674320.9790090.8738230.1289610.03569520.1395440.247548
70.7175260.7739860.5649010.5948710.606950.2691180.8094370.7252440.3838680.606193
80.4372430.9848360.5005190.5142120.1304190.8523670.5114470.02221950.9440330.438669
90.02276490.1150360.06835050.05035120.9435260.6231870.02347990.4618650.3487350.690593
100.4557920.5274370.1871270.6796760.4640270.4148340.271630.7179380.845060.394637
110.1432320.5927070.4612960.02885610.4335930.7032710.2869780.6705720.5703570.0609912
120.4706490.3243320.9327040.5486680.8158370.8777080.6883760.4456330.283910.827211
130.3202940.387120.6518040.2782530.9850290.7775750.3504040.4030670.3859160.495512
9999890.3437660.3492360.5315180.5914890.6120460.04919630.3285090.6553940.6058360.524536
9999900.06734050.301250.9543710.7651940.6461080.6943360.3588380.8037990.09723330.119382
9999910.5191150.6543940.842880.8587330.5419750.9646380.9219980.5367970.1017760.83639
9999920.9860490.8880410.5857540.8173330.05989340.887050.9162730.8905390.6392750.625951
9999930.9696350.2715820.9720130.9888760.4551720.8882540.602080.4312730.3598010.607461
9999940.6174440.0790360.675130.4841390.007496030.3830360.383630.08496540.5105730.0714929
9999950.7102990.7837950.9777310.8856640.7314860.7284690.4064940.775310.5200030.891723
9999960.378260.5795510.3704130.2865940.3095140.6859150.4211910.2242320.04862280.0447236
9999970.08155080.7670830.7791610.7280520.4748480.4757330.04547330.8611350.1625660.7554
9999980.2417360.2156850.6858610.4000150.5702620.01114320.691920.297540.5211740.314071
9999990.6802180.9331590.5993540.1163670.9140860.1360510.47140.1712430.3159830.879676
10000000.2120740.2660760.8254680.9239460.7213430.0816670.03352970.5346970.538350.00548066
@time map(sum, eachrow(df2));
  2.705622 seconds (60.11 M allocations: 1.056 GiB, 25.15% gc time, 3.68% compilation time)
@time map(sum, eachrow(df2));
  1.862482 seconds (59.99 M allocations: 1.050 GiB, 5.09% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.348988 seconds (1.32 M allocations: 72.991 MiB, 2.33% gc time, 98.14% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.006220 seconds (25 allocations: 7.631 MiB)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.632661 seconds (754.52 k allocations: 46.014 MiB, 99.11% compilation time: 96% of which was recompilation)
1000000-element Vector{Float64}:
 4.243993540615169
 4.154414077448179
 4.046201057064972
 4.769453432081732
 4.631793508413152
 5.2093905747165135
 6.052092977953647
 5.335964653974474
 3.3478892521388897
 4.958159796081296
 ⋮
 7.296159776774205
 6.546147719145592
 3.296942512182397
 7.410973094568378
 3.3490167652546234
 5.1310010949088385
 3.949406516653557
 5.217535862311809
 4.142632091909143

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.004660 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
 4.243993540615169
 4.154414077448179
 4.046201057064972
 4.769453432081732
 4.631793508413152
 5.2093905747165135
 6.052092977953647
 5.335964653974474
 3.3478892521388897
 4.958159796081296
 ⋮
 7.296159776774205
 6.546147719145592
 3.296942512182397
 7.410973094568378
 3.3490167652546234
 5.1310010949088385
 3.949406516653557
 5.217535862311809
 4.142632091909143

This notebook was generated using Literate.jl.