Possible pitfalls#

using DataFrames

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.5875160.7189810.622050.7910860.505925
20.9006730.07751410.2104940.155410.0177345
30.4361330.8664120.6036960.3602060.424278

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for DataFrame creates aliases with ! and getproperty syntax and copies with :#

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost (wide tables),

  • but Tables.namedtupleiterator for fast execution (tall tables)

this table is wide

df1 = DataFrame([rand([1:2, 'a':'b', false:true, 1.0:2.0]) for i in 1:900], :auto)
2×900 DataFrame
800 columns omitted
Rowx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20x21x22x23x24x25x26x27x28x29x30x31x32x33x34x35x36x37x38x39x40x41x42x43x44x45x46x47x48x49x50x51x52x53x54x55x56x57x58x59x60x61x62x63x64x65x66x67x68x69x70x71x72x73x74x75x76x77x78x79x80x81x82x83x84x85x86x87x88x89x90x91x92x93x94x95x96x97x98x99x100
BoolCharBoolInt64Int64Int64Float64CharInt64Int64CharFloat64Float64Int64CharInt64CharInt64Int64Float64CharFloat64BoolCharInt64BoolInt64Float64Float64CharFloat64BoolInt64BoolCharInt64Int64Int64CharInt64Float64Int64BoolInt64CharFloat64BoolBoolBoolFloat64CharFloat64CharCharCharBoolFloat64CharCharCharFloat64CharInt64BoolBoolCharBoolFloat64Float64Int64Int64CharFloat64CharBoolInt64Float64Float64Float64Int64Int64CharInt64BoolInt64CharCharFloat64Int64BoolCharFloat64Int64Int64CharBoolFloat64Float64Int64Float64
1falseafalse1111.0a11a1.01.01a1a111.0a1.0falsea1false11.01.0a1.0false1falsea111a11.01false1a1.0falsefalsefalse1.0a1.0aaafalse1.0aaa1.0a1falsefalseafalse1.01.011a1.0afalse11.01.01.011a1false1aa1.01falsea1.011afalse1.01.011.0
2truebtrue2222.0b22b2.02.02b2b222.0b2.0trueb2true22.02.0b2.0true2trueb222b22.02true2b2.0truetruetrue2.0b2.0bbbtrue2.0bbb2.0b2truetruebtrue2.02.022b2.0btrue22.02.02.022b2true2bb2.02trueb2.022btrue2.02.022.0
@time collect(eachrow(df1));
  0.041341 seconds (52.71 k allocations: 3.614 MiB, 99.91% compilation time)
@time collect(Tables.namedtupleiterator(df1));
  7.539398 seconds (885.91 k allocations: 70.446 MiB, 99.68% compilation time)

as you can see the time to compile Tables.namedtupleiterator is very large in this case, and it would get much worse if the table was wider (that is why we include this tip in pitfalls notebook)

the table below is tall

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.7914520.2972860.0492070.889730.3097810.1247270.5269940.3767580.9820590.201495
20.2203640.7013630.376040.8534630.2925510.1608330.4036480.8061450.6868960.319294
30.3759850.09465380.4012640.2598180.4262530.9923510.3512960.8970250.6425450.36698
40.1905540.7999140.8144850.6642850.2432090.5380270.2810220.7165380.7352610.0711319
50.7940520.7075240.9828550.6798560.338460.3774050.3937510.781780.449720.205827
60.3881760.6416990.7933410.4557150.244830.5195130.6971980.3715010.7076920.252749
70.4321960.04782980.6913030.9529950.3526970.3595710.5975460.1510980.4137310.00542862
80.7961890.02845230.1029140.5709330.2070630.2124770.5989740.3910490.2906730.416863
90.5143410.5742740.009553440.04261170.8740770.1980390.5041210.8486450.6077430.464585
100.6781130.6012790.307890.0376170.1983170.09060970.07809640.3398570.06572250.0773766
110.2168090.918020.6636820.8131370.9886590.01237060.3229140.796040.3825190.227306
120.9529480.7977730.9197030.9837420.9760590.8186070.290840.6575540.9212090.362167
130.3708190.7770840.3431260.4574760.4119530.5060550.8296120.8944460.1138230.804729
9999890.598120.3416710.7568170.1296770.5580780.1192350.6629590.6198810.9025430.332156
9999900.7957650.5285710.7114050.05771240.9309840.5518590.7036040.2690380.5578760.433196
9999910.5796240.2336880.2153930.3046010.8511720.2043150.09412040.2947330.7458450.504573
9999920.4950980.1659490.4627640.837110.115770.7690740.104810.6307650.1545070.641631
9999930.6955560.3180690.3985150.7409370.8582540.2357510.1846720.03918090.4195160.183319
9999940.1657540.6773660.1340860.6445060.5949770.05335970.6163340.3918970.1613370.967639
9999950.1684860.8690690.07277930.04182420.2260540.01796840.6145550.9110160.2568360.658037
9999960.08737450.775490.2248580.1769930.3873840.1005990.3334150.2834540.2928320.447891
9999970.7928550.4714620.235740.182910.2353450.8019130.5637710.5581610.5081290.595255
9999980.827070.8130670.04962930.816870.05834340.8252180.633720.9029620.8236720.417822
9999990.8058740.6650290.3769350.8494470.6724710.2466540.5684490.5062830.07389840.650809
10000000.3579410.2791640.2860210.2290090.5692810.3484930.5923930.9225440.8256040.48944
@time map(sum, eachrow(df2));
  2.127519 seconds (60.08 M allocations: 1.056 GiB, 6.17% gc time, 3.69% compilation time)
@time map(sum, eachrow(df2));
  2.119773 seconds (59.99 M allocations: 1.050 GiB, 6.50% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.195608 seconds (200.06 k allocations: 21.156 MiB, 24.98% gc time, 96.98% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.005788 seconds (22 allocations: 7.631 MiB)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.392389 seconds (509.34 k allocations: 41.875 MiB, 98.52% compilation time: 93% of which was recompilation)
1000000-element Vector{Float64}:
 4.549489608990553
 4.820598260856793
 4.808170226496026
 5.054427780801031
 5.711228368654866
 5.0724123056568216
 4.00439573440345
 3.61558727488398
 4.637989455369663
 2.474878422732406
 ⋮
 4.377478339806004
 4.073770679842652
 4.40725544523935
 3.836624901881449
 3.110289532802179
 4.945542466292135
 6.168372468024979
 5.415848010904047
 4.899889662604407

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.004722 seconds (125 allocations: 7.635 MiB)
1000000-element Vector{Float64}:
 4.549489608990553
 4.820598260856793
 4.808170226496026
 5.054427780801031
 5.711228368654866
 5.0724123056568216
 4.00439573440345
 3.61558727488398
 4.637989455369663
 2.474878422732406
 ⋮
 4.377478339806004
 4.073770679842652
 4.40725544523935
 3.836624901881449
 3.110289532802179
 4.945542466292135
 6.168372468024979
 5.415848010904047
 4.899889662604407

This notebook was generated using Literate.jl.