Possible pitfalls#

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.01071580.3227890.9122010.3610510.091482
20.9964890.7634750.5967760.5194640.810729
30.1431360.6561340.3851840.5964830.595928

x and y are not the same object

y = copy(x)
x === y
false

x and y are not the same object

y = DataFrame(x)
x === y
false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))
false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y
false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))
true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

different object

y === df.y
false

range is converted to a vector

typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]
false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)
3×2 DataFrame
Rowxy
Int64Int64
111
222
333

now it is the same

y === df.y
true

not the same object

select(df, :y)[!, 1] === y
false

the same object

select(df, :y, copycols=false)[!, 1] === y
true

Do not modify the parent of GroupedDataFrame or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2
Rowidx
Int64Int64
121
223
315

Last Group (3 rows): id = 2
Rowidx
Int64Int64
122
224
326
s = view(x, 5:6, :)
2×2 SubDataFrame
Rowidx
Int64Int64
115
226
delete!(x, 3:6)
2×2 DataFrame
Rowidx
Int64Int64
121
222

This is an error

s ## Will return BoundsError

Single column selection for a DataFrame#

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
111111
222222
333333
x[1, 1] = 100
display(x)
3×5 DataFrame
Rowabcde
Int64Int64Int64Int64Int64
1100100111
222222
333333

When iterating rows of a data frame#

  • use eachrow to avoid compilation cost in wide tables,

  • but Tables.namedtupleiterator for fast execution in tall tables

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)
1000000×10 DataFrame
999975 rows omitted
Rowx1x2x3x4x5x6x7x8x9x10
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.8322650.1316740.5082250.4470580.7253310.09612340.2181120.2860770.1050180.2172
20.812760.8563910.4214790.5819010.9000610.9163820.06455170.8937730.2064520.520251
30.3263250.4994840.1904770.7135920.8351940.8992160.2167040.191150.5744790.137535
40.585970.7910570.7804970.2138260.4196660.03865140.8131860.6847020.9457240.909644
50.9242450.4537750.1601630.1854410.6133080.6300580.03495810.2422950.5503290.444379
60.568380.6710340.4600850.1741140.4447610.07117730.6395080.4400830.6082580.856337
70.2951280.9920510.1321230.2833210.5389770.03183840.3397270.1023580.5533670.890664
80.3918320.5841790.1516720.1043870.3881920.7003280.9957550.3076680.5102210.364263
90.03834720.6058810.09725780.8503240.2075110.1815420.8349320.7399940.2296990.565145
100.1377320.2216620.2696030.2604540.1165140.5457980.9715640.6958420.6097640.997268
110.3870030.8744270.3120950.6940090.5498080.8150020.3833850.7394810.6346610.605096
120.9926790.9351560.3380560.3621120.05559730.5841120.3608820.4236580.6019740.180581
130.8716350.6074970.5269080.8505670.4051410.5693240.812890.7311560.7544380.500285
9999890.5120120.5837350.8111740.6261860.5660220.2908650.7671670.393640.9619830.963371
9999900.9427630.03434730.07032650.01077930.9452520.1810980.2954940.9018090.7222180.133737
9999910.8853340.924450.6968730.1223860.8437560.1271460.470060.9161320.9865270.844013
9999920.764330.9647980.4668080.05654950.5425180.6411570.9765150.7751860.8291340.342463
9999930.669270.558090.2759910.5552360.8072040.4498710.8645210.440050.4074860.0877313
9999940.6194860.3241910.4124160.01732350.5866240.3151690.06044560.6996340.04650440.790883
9999950.2247030.3476060.3897630.05732550.7803790.6348560.5349640.402020.9981120.916573
9999960.7502590.3561310.4171820.711380.8458250.5875520.782170.7216940.8065110.80849
9999970.9796560.332320.444730.6755330.9379130.3024330.6687510.6939130.06892180.765755
9999980.812130.3719850.8418950.1464530.8315880.8775180.5983960.1665290.03127060.784023
9999990.6360620.2405010.04678850.6689530.2993940.04765160.4535270.4824030.02142260.484848
10000000.06446910.4605970.9898580.7204580.7966650.1923040.8574110.3221940.2281550.430382
@time map(sum, eachrow(df2));
  2.460167 seconds (60.04 M allocations: 1.053 GiB, 12.48% gc time, 2.24% compilation time)
@time map(sum, eachrow(df2));
  2.205419 seconds (59.99 M allocations: 1.050 GiB, 5.75% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.167873 seconds (370.37 k allocations: 27.104 MiB, 93.80% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
  0.018365 seconds (25 allocations: 7.631 MiB, 47.51% gc time)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.009417 seconds (6.67 k allocations: 7.971 MiB, 44.94% compilation time)
1000000-element Vector{Float64}:
 3.567084026690375
 6.17400049412799
 4.584157424151704
 6.182924491214273
 4.2389509124219185
 4.933736477186374
 4.159554933520796
 4.498496951245511
 4.350633848199879
 4.826199194033561
 ⋮
 6.359458205497677
 5.115449497624578
 3.8726771753912517
 5.286301698291334
 6.787193316962482
 5.869925333426452
 5.461787443903272
 3.3815507835547747
 5.06249293709858

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
  0.005082 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
 3.567084026690375
 6.17400049412799
 4.584157424151704
 6.182924491214273
 4.2389509124219185
 4.933736477186374
 4.159554933520796
 4.498496951245511
 4.350633848199879
 4.826199194033561
 ⋮
 6.359458205497677
 5.115449497624578
 3.8726771753912517
 5.286301698291334
 6.787193316962482
 5.869925333426452
 5.461787443903272
 3.3815507835547747
 5.06249293709858

This notebook was generated using Literate.jl.