Possible pitfalls#

using DataFrames

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.2271260.5881480.9006430.5471680.984877
20.7384480.839430.4068080.8784790.0165944
30.4985460.7633840.912980.05154850.49099

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y);

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same

select(df, :y)[!, 1] === y
false

the same

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for DataFrame creates aliases with ! and getproperty syntax and copies with :#

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost (wide tables),

  • but Tables.namedtupleiterator for fast execution (tall tables)

this table is wide

df1 = DataFrame([rand([1:2, 'a':'b', false:true, 1.0:2.0]) for i in 1:900], :auto)
2×900 DataFrame
800 columns omitted
Rowx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20x21x22x23x24x25x26x27x28x29x30x31x32x33x34x35x36x37x38x39x40x41x42x43x44x45x46x47x48x49x50x51x52x53x54x55x56x57x58x59x60x61x62x63x64x65x66x67x68x69x70x71x72x73x74x75x76x77x78x79x80x81x82x83x84x85x86x87x88x89x90x91x92x93x94x95x96x97x98x99x100
CharCharFloat64Float64CharFloat64Float64Int64BoolCharCharInt64CharCharBoolInt64Int64Float64Int64BoolBoolCharInt64Float64Int64CharCharBoolFloat64Int64Float64CharBoolCharFloat64Float64Float64CharFloat64CharFloat64CharBoolFloat64Int64CharInt64Float64Int64Int64Int64Int64CharBoolCharFloat64Float64BoolInt64BoolCharCharCharBoolCharCharBoolBoolInt64Float64Float64CharCharInt64CharCharCharBoolFloat64BoolFloat64Int64Int64Int64Int64CharFloat64BoolFloat64BoolBoolBoolFloat64BoolInt64Float64Int64Float64Float64Int64
1aa1.01.0a1.01.01falseaa1aafalse111.01falsefalsea11.01aafalse1.011.0afalsea1.01.01.0a1.0a1.0afalse1.01a11.01111afalsea1.01.0false1falseaaafalseaafalsefalse11.01.0aa1aaafalse1.0false1.01111a1.0false1.0falsefalsefalse1.0false11.011.01.01
2bb2.02.0b2.02.02truebb2bbtrue222.02truetrueb22.02bbtrue2.022.0btrueb2.02.02.0b2.0b2.0btrue2.02b22.02222btrueb2.02.0true2truebbbtruebbtruetrue22.02.0bb2bbbtrue2.0true2.02222b2.0true2.0truetruetrue2.0true22.022.02.02
@time collect(eachrow(df1))
  0.037020 seconds (56.39 k allocations: 3.886 MiB, 99.93% compilation time)
2-element Vector{DataFrameRow}:
 DataFrameRow
 Row  x1    x2    x3       x4       x5    x6       x7       x8     x9     x10 ⋯
     │ Char  Char  Float64  Float64  Char  Float64  Float64  Int64  Bool   Cha ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ a     a         1.0      1.0  a         1.0      1.0      1  false  a   ⋯
                                                             891 columns omitted
 DataFrameRow
 Row  x1    x2    x3       x4       x5    x6       x7       x8     x9    x10  ⋯
     │ Char  Char  Float64  Float64  Char  Float64  Float64  Int64  Bool  Char ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   2 │ b     b         2.0      2.0  b         2.0      2.0      2  true  b    ⋯
                                                             890 columns omitted
@time collect(Tables.namedtupleiterator(df1));
  7.351881 seconds (3.16 M allocations: 220.459 MiB, 1.31% gc time, 98.92% compilation time)

as you can see the time to compile Tables.namedtupleiterator is very large in this case, and it would get much worse if the table was wider (that is why we include this tip in pitfalls notebook)

the table below is tall

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.203050.6125790.9606990.04294610.5610740.5137870.950250.4470740.8056220.411043
20.3317020.7485920.8177330.7402080.6445010.9804370.2824870.9599240.4990660.275427
30.1714490.8851850.08731290.2750970.5977410.9330360.3872120.09617850.3933550.111773
40.7892240.03659540.7137690.362510.8362410.8121450.2232750.9074990.3985390.713236
50.0318420.3603870.7115450.05234170.6205480.3872970.7143860.3463330.6801570.781664
60.4292330.4575540.03144190.08638360.3017790.5864680.2256920.1097250.6114310.424604
70.9806130.9695150.584660.5083790.4903740.1982980.2696830.1422880.9701610.664537
80.3275540.7656490.1177950.3541970.8775370.0986520.3291720.1872730.5603010.635005
90.07151690.8653720.4370030.3962830.1686190.677490.4419580.4855870.8369270.266633
100.5451020.500350.7527330.2394690.2454740.9586420.09986640.05716160.9297640.150887
110.0743790.6768760.2712640.9411660.1576890.6365240.01018530.9721590.1524520.761652
120.4500310.8883440.6821720.8548430.884520.6647810.9479890.8106520.1550240.436813
130.7266750.8055760.1505090.1342650.5708270.3495130.8647710.1252710.4766250.830557
9999890.3268950.3969920.02512060.002761340.2902150.7264580.5608680.7600780.7931870.624001
9999900.1921860.6324190.5909820.2305390.7212770.5983670.2416420.7730440.8012290.0218375
9999910.1775510.1375560.4400970.41340.5972430.1235750.3903030.8570510.801930.299181
9999920.05947140.05446680.7101890.2234650.4699360.8980970.8599020.05573660.8840440.180341
9999930.9737610.1732310.193240.6398980.8390090.5908670.1572150.6494330.4131170.599076
9999940.1878710.9870620.7965420.2909890.3103570.01807430.1644090.9313430.05119880.31726
9999950.6668390.1270950.5323490.6616070.9940680.04761310.5932230.5871350.05481640.093763
9999960.3047120.9861630.4097870.2542530.6232070.5818120.6819080.586660.5142060.00413694
9999970.3069470.7364630.3721720.3489050.2740120.007846420.7899870.7588160.538370.73107
9999980.5696640.05719120.34770.7748520.1684780.5758190.2655570.5620070.5471830.445897
9999990.4024990.7504170.9627040.5637780.1474570.2003370.4795640.8886910.628550.0769906
10000000.08318380.2433650.6271540.4692650.04202560.4523270.6903670.9991210.9687190.243514
@time map(sum, eachrow(df2))
  1.674922 seconds (60.08 M allocations: 1.056 GiB, 5.52% gc time, 4.54% compilation time)
1000000-element Vector{Float64}:
 5.508125164172659
 6.280076471054171
 3.938338661817624
 5.793033557407247
 4.686500468265144
 3.264311324258349
 5.778508927749816
 4.253135118411479
 4.647388180990119
 4.479449037206201
 4.65434505685093
 6.775168752263508
 5.034589343875677
 ⋮
 4.506575801584777
 4.803521462121904
 4.237887531871456
 4.3956490758136155
 5.228847615756573
 4.055104832677104
 4.358507829206297
 4.946845026374001
 4.864588441093377
 4.3143479012392465
 5.100987884997755
 4.819042473984633
@time map(sum, eachrow(df2))
  1.541101 seconds (59.99 M allocations: 1.050 GiB, 2.33% gc time)
1000000-element Vector{Float64}:
 5.508125164172659
 6.280076471054171
 3.938338661817624
 5.793033557407247
 4.686500468265144
 3.264311324258349
 5.778508927749816
 4.253135118411479
 4.647388180990119
 4.479449037206201
 4.65434505685093
 6.775168752263508
 5.034589343875677
 ⋮
 4.506575801584777
 4.803521462121904
 4.237887531871456
 4.3956490758136155
 5.228847615756573
 4.055104832677104
 4.358507829206297
 4.946845026374001
 4.864588441093377
 4.3143479012392465
 5.100987884997755
 4.819042473984633
@time map(sum, Tables.namedtupleiterator(df2))
  0.146836 seconds (200.16 k allocations: 21.167 MiB, 95.14% compilation time)
1000000-element Vector{Float64}:
 5.508125164172659
 6.280076471054171
 3.938338661817624
 5.793033557407247
 4.686500468265144
 3.264311324258349
 5.778508927749816
 4.253135118411479
 4.647388180990119
 4.479449037206201
 4.65434505685093
 6.775168752263508
 5.034589343875677
 ⋮
 4.506575801584777
 4.803521462121904
 4.237887531871456
 4.3956490758136155
 5.228847615756573
 4.055104832677104
 4.358507829206297
 4.946845026374001
 4.864588441093377
 4.3143479012392465
 5.100987884997755
 4.819042473984633
@time map(sum, Tables.namedtupleiterator(df2))
  0.007023 seconds (17 allocations: 7.630 MiB)
1000000-element Vector{Float64}:
 5.508125164172659
 6.280076471054171
 3.938338661817624
 5.793033557407247
 4.686500468265144
 3.264311324258349
 5.778508927749816
 4.253135118411479
 4.647388180990119
 4.479449037206201
 4.65434505685093
 6.775168752263508
 5.034589343875677
 ⋮
 4.506575801584777
 4.803521462121904
 4.237887531871456
 4.3956490758136155
 5.228847615756573
 4.055104832677104
 4.358507829206297
 4.946845026374001
 4.864588441093377
 4.3143479012392465
 5.100987884997755
 4.819042473984633

as you can see - this time it is much faster to iterate a type stable container

still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.035839 seconds (7.72 k allocations: 8.147 MiB, 83.72% compilation time)
1000000-element Vector{Float64}:
 5.508125164172659
 6.280076471054171
 3.938338661817624
 5.793033557407247
 4.686500468265144
 3.264311324258349
 5.778508927749816
 4.253135118411479
 4.647388180990119
 4.479449037206201
 4.65434505685093
 6.775168752263508
 5.034589343875677
 ⋮
 4.506575801584777
 4.803521462121904
 4.237887531871456
 4.3956490758136155
 5.228847615756573
 4.055104832677104
 4.358507829206297
 4.946845026374001
 4.864588441093377
 4.3143479012392465
 5.100987884997755
 4.819042473984633

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.004841 seconds (121 allocations: 7.635 MiB)
1000000-element Vector{Float64}:
 5.508125164172659
 6.280076471054171
 3.938338661817624
 5.793033557407247
 4.686500468265144
 3.264311324258349
 5.778508927749816
 4.253135118411479
 4.647388180990119
 4.479449037206201
 4.65434505685093
 6.775168752263508
 5.034589343875677
 ⋮
 4.506575801584777
 4.803521462121904
 4.237887531871456
 4.3956490758136155
 5.228847615756573
 4.055104832677104
 4.358507829206297
 4.946845026374001
 4.864588441093377
 4.3143479012392465
 5.100987884997755
 4.819042473984633