Possible pitfalls#

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.9044760.8294230.05036740.7120110.452017
20.5310430.1334020.01598360.1307810.312669
30.7742130.2600220.7795130.7860030.105047

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for a DataFrame#

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost in wide tables,

  • but Tables.namedtupleiterator for fast execution in tall tables

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4783010.125160.4430860.5110080.8813680.7236460.8919430.7153380.2269880.05517
20.5372970.8848420.7619330.8375680.1425770.7921990.6024140.8997250.5949770.0328992
30.5450460.1469890.7674740.7502660.3538070.9013020.4931690.2897290.0379670.601937
40.4682040.8600190.4803990.4339610.361440.09876040.8698730.6906030.01299940.858868
50.9456260.7371680.8973330.7219630.7476430.3788980.3936680.9652490.8326530.368617
60.7950770.242540.4613010.0718430.5264770.8858150.3923220.8048840.9677990.484455
70.1443680.07468260.2965730.09468430.2336910.2071050.03225090.4461240.3161050.225936
80.2468770.1623750.8296480.6697060.2551610.2929260.2557230.009016150.5044610.0955296
90.6013750.3968150.6309750.1784450.3730650.03975250.6907130.5040590.5315990.243714
100.6981360.8810480.1161880.4865610.3026370.04387330.5039670.7023560.2735150.910538
110.01188330.8568120.5341040.09558720.1910180.3050490.6898650.5676680.3206840.557577
120.1918610.845470.2393790.7608550.3625010.871760.7293570.6525850.762120.226549
130.1726590.4901960.2869270.3034850.008234020.1207360.4035590.7555530.1499110.605745
9999890.256870.6860480.1418950.301860.7493360.0382520.6880740.6620570.6558620.650975
9999900.07554920.7919970.6153030.2341430.8749810.2576860.8333520.786470.01094490.341014
9999910.4655040.06218550.8416890.5896760.7459050.3591240.7241530.4805260.4698310.730895
9999920.6430360.931910.9927770.9833570.4597790.9388820.7763530.4240430.2346420.0562271
9999930.1600020.1471430.4962550.9834570.2762560.07258540.1964380.2100390.2413760.978434
9999940.9515020.3931410.4498060.2319520.2751610.5310190.8454760.4005110.2932220.234725
9999950.8639460.4682460.8039380.5882950.4481150.4669230.5234160.4928610.7168580.131195
9999960.8148750.2762210.464190.3159310.4139150.2294850.8893760.8141280.08116510.0324598
9999970.22210.6699160.8615960.871890.8529590.3253190.0282560.540340.4952460.832195
9999980.1835010.6941830.2369990.4170950.08328530.3647910.269450.4202740.5432520.0790998
9999990.3485970.07989180.5008070.09393040.7194590.1428780.3029410.3910260.1765120.981048
10000000.1716750.2619390.7241410.7960640.3577610.7921980.293890.2699180.6472730.575534
@time map(sum, eachrow(df2));
  2.084860 seconds (60.13 M allocations: 1.057 GiB, 8.12% gc time, 5.40% compilation time)
@time map(sum, eachrow(df2));
  1.929150 seconds (59.99 M allocations: 1.050 GiB, 8.39% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.356282 seconds (1.29 M allocations: 70.545 MiB, 5.93% gc time, 97.51% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.008565 seconds (26 allocations: 7.631 MiB)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.604399 seconds (999.89 k allocations: 57.228 MiB, 97.41% compilation time: 95% of which was recompilation)
1000000-element Vector{Float64}:
 5.0520095448766185
 6.086430758355689
 4.887685872073017
 5.135126449042814
 6.988818506958208
 5.632513763521528
 2.071519953916969
 3.321422331055035
 4.190512286064625
 4.918819423163244
 ⋮
 6.441006218261728
 3.761986370440841
 4.606515314644607
 5.503791814623299
 4.331745999643407
 5.69981683021577
 3.29192976063087
 3.7370911066160897
 4.890392635140402

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.014927 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
 5.0520095448766185
 6.086430758355689
 4.887685872073017
 5.135126449042814
 6.988818506958208
 5.632513763521528
 2.071519953916969
 3.321422331055035
 4.190512286064625
 4.918819423163244
 ⋮
 6.441006218261728
 3.761986370440841
 4.606515314644607
 5.503791814623299
 4.331745999643407
 5.69981683021577
 3.29192976063087
 3.7370911066160897
 4.890392635140402

This notebook was generated using Literate.jl.