Possible pitfalls#

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.642180.5183330.9125970.7373620.369005
20.316150.01292980.8301930.7904560.872891
30.3219440.8582160.4336240.006428840.584734

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for a DataFrame#

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost in wide tables,

  • but Tables.namedtupleiterator for fast execution in tall tables

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5146030.5843290.06654910.5302950.863870.8989660.6792930.7217920.7304270.764711
20.2674950.07139820.6729220.3101020.7105570.1597660.6484880.9570660.7840730.760727
30.2721910.410450.1062840.3338310.6854830.9703790.2556030.8768190.7137760.152785
40.05289720.8248710.4808980.139520.1370320.5700550.9103930.3713920.2496050.00861442
50.2982230.6602670.8566170.03920450.7925950.3479790.0006702160.4440460.7344850.47244
60.4709170.9577350.490790.1164450.706650.7129870.5810610.3383920.7279090.259861
70.3079460.9926830.2592930.7418560.1615190.8239010.3611850.4228940.4626860.092596
80.7189770.5984850.2189650.4096260.5550540.04752050.005403510.1883690.6824260.959554
90.9330120.8073180.5512710.346220.9374250.6240770.2406910.147380.5105680.427122
100.4094930.4436270.6141490.2792650.8186560.3832360.8556910.3614360.8629130.273861
110.2324080.1422140.6041120.5132380.5826270.02094420.2654260.2530280.8655520.516647
120.04167860.4794750.5355930.1145420.9103110.5082570.5850250.4661440.6555320.989518
130.7760020.3490740.1634780.7440650.1883510.6382560.5351640.9989560.5602990.578827
9999890.5758490.9477850.5949980.03661460.4402730.04961970.7455460.6323010.280640.349605
9999900.1556880.7124170.5037750.1400530.1527920.1851150.1898170.1019240.07100760.280905
9999910.003786070.5281540.7945350.1349440.412130.2494640.7773890.6701450.7915620.833196
9999920.5685850.08196830.4492480.4775190.8414950.1818210.6305560.6799180.5517740.425615
9999930.7488350.3637110.9696910.079410.1701530.7554860.3979590.5423570.4498420.385593
9999940.749820.2641080.77110.3434750.8411140.5098090.2384770.400120.1149310.362719
9999950.1664440.6777020.7815270.7974040.8090760.0426960.06851780.1465150.8820350.605978
9999960.992030.5790890.6373060.2883470.4005140.2896150.5496060.3187630.3520450.88931
9999970.9113680.1860610.6246580.04002390.4045210.1706850.4612040.363860.4548620.669423
9999980.9100570.8200590.6592520.05679770.04531750.8632390.9737170.263670.6707580.383683
9999990.005338420.2574440.1646860.519960.5901390.211770.9078310.09611150.9601550.775406
10000000.3444450.1931890.2632840.2648050.2815360.6690530.8813980.1879070.9595850.695789
@time map(sum, eachrow(df2));
  2.783457 seconds (60.11 M allocations: 1.056 GiB, 24.50% gc time, 3.74% compilation time)
@time map(sum, eachrow(df2));
  1.939167 seconds (59.99 M allocations: 1.050 GiB, 5.80% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.313881 seconds (1.31 M allocations: 72.141 MiB, 97.98% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.019452 seconds (25 allocations: 7.631 MiB, 68.01% gc time)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.576455 seconds (765.68 k allocations: 46.162 MiB, 98.88% compilation time: 93% of which was recompilation)
1000000-element Vector{Float64}:
 6.354835449763846
 5.3425944089008865
 4.777601004121772
 3.7452773291459938
 4.646526552655023
 5.362747006556143
 4.626559527928046
 4.384378236075445
 5.525083232942826
 5.3023269182995865
 ⋮
 4.888499780349042
 4.863038043321743
 4.595675110383267
 4.977894137096483
 5.296626151007581
 4.286665043814575
 5.646550460409576
 4.488840250193496
 4.740990746940874

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.004556 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
 6.354835449763846
 5.3425944089008865
 4.777601004121772
 3.7452773291459938
 4.646526552655023
 5.362747006556143
 4.626559527928046
 4.384378236075445
 5.525083232942826
 5.3023269182995865
 ⋮
 4.888499780349042
 4.863038043321743
 4.595675110383267
 4.977894137096483
 5.296626151007581
 4.286665043814575
 5.646550460409576
 4.488840250193496
 4.740990746940874

This notebook was generated using Literate.jl.