Possible pitfalls#

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.2136340.9760660.7649570.236890.586611
20.2579040.7271990.1259930.1757730.556361
30.5676580.5807060.2278930.6284140.325745

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for a DataFrame#

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost in wide tables,

  • but Tables.namedtupleiterator for fast execution in tall tables

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.222140.1947250.7876940.791790.5777030.1480280.0907970.5180320.4307760.820852
20.7977880.4827240.1048160.989810.2804180.9734370.1719540.669210.4899580.907005
30.8303440.1668730.01196760.388690.1123360.6759060.5544390.8881630.8816130.348705
40.01309430.7022950.4175580.8244960.2714670.2937170.6559740.787090.665860.592221
50.9931360.2624330.1322420.8502830.5216590.1877330.3343110.8598780.5342880.40397
60.5003780.225580.6471510.3153750.6474350.3072020.3823020.3701050.4250990.642584
70.4293870.9601680.1285840.6443860.5393590.2300730.8223440.1873430.7686170.834613
80.4947930.08961640.08112920.320820.126240.504820.5451250.5565870.8228930.1666
90.7767730.8967580.1193330.6875070.920070.9297930.08577240.5957890.8502450.861515
100.9185360.9661070.2749680.8018250.6977140.2958920.7330260.6544430.4749510.162668
110.6823970.5905920.2692850.8459260.7335750.2962640.7161010.6607620.2104950.244092
120.6927360.3638910.1964040.4798140.02636280.8571530.7700870.1847560.9444240.294371
130.001041530.7997870.3225290.2749980.3343550.005024770.5430460.09345280.6524690.502395
9999890.388930.4882780.03855440.4956390.2412080.5809430.4788930.491560.03386440.474437
9999900.2734090.7295790.6726580.3067170.1415730.6190270.9349580.4986660.9923310.96703
9999910.9143820.9774080.04662570.3531870.7213850.2201790.393090.2434580.4743060.423111
9999920.09253930.04921930.4976610.2426980.2014550.6503630.9081220.4281470.4281480.210207
9999930.7646450.9482470.1044780.8234750.4517090.3327930.7361040.9912250.813670.507504
9999940.2278220.3687460.9514590.5683220.30370.4873870.1520920.002594590.6991730.49836
9999950.8629750.1450.5196960.8125990.658480.6498330.444560.6301770.7923910.168679
9999960.3137420.2691960.03927820.903470.9612050.7437850.2650060.5132810.4949360.768261
9999970.3471130.514850.2933760.8858580.9479110.6975250.3697320.5358320.2118510.660482
9999980.3578960.5731820.006175450.6328110.5797790.9607820.8887660.4969950.3845130.180331
9999990.3551430.697010.7554430.181120.7329430.5691340.8499110.4434310.5329870.716891
10000000.5092240.9039550.3592780.1780760.3856850.09881060.8858550.3602280.4462650.77072
@time map(sum, eachrow(df2));
  2.454178 seconds (60.12 M allocations: 1.056 GiB, 17.63% gc time, 4.70% compilation time)
@time map(sum, eachrow(df2));
  1.833568 seconds (59.99 M allocations: 1.050 GiB, 6.03% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.303801 seconds (1.22 M allocations: 66.451 MiB, 98.32% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.004991 seconds (21 allocations: 7.631 MiB)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.483571 seconds (652.93 k allocations: 40.217 MiB, 2.73% gc time, 95.90% compilation time: 95% of which was recompilation)
1000000-element Vector{Float64}:
 4.582538267133092
 5.86712127030766
 4.85903797859344
 5.223773013475111
 5.07993464728864
 4.4632116807439814
 5.544873933254496
 3.7086242892505434
 6.723556847831017
 5.98012950282477
 ⋮
 3.70856059429051
 6.473850788533407
 4.2596556715720215
 5.684389196675713
 5.272159975659073
 5.46453064264186
 5.061230203019331
 5.834012167127027
 4.898097576410764

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.006329 seconds (123 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
 4.582538267133092
 5.86712127030766
 4.85903797859344
 5.223773013475111
 5.07993464728864
 4.4632116807439814
 5.544873933254496
 3.7086242892505434
 6.723556847831017
 5.98012950282477
 ⋮
 3.70856059429051
 6.473850788533407
 4.2596556715720215
 5.684389196675713
 5.272159975659073
 5.46453064264186
 5.061230203019331
 5.834012167127027
 4.898097576410764

This notebook was generated using Literate.jl.