Possible pitfalls#
using DataFrames
Know what is copied when creating a DataFrame#
x = DataFrame(rand(3, 5), :auto)
Row | x1 | x2 | x3 | x4 | x5 |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.227126 | 0.588148 | 0.900643 | 0.547168 | 0.984877 |
2 | 0.738448 | 0.83943 | 0.406808 | 0.878479 | 0.0165944 |
3 | 0.498546 | 0.763384 | 0.91298 | 0.0515485 | 0.49099 |
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in ncol(x))
true
the same when creating data frames using kwarg syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y);
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using copycols=false keyword argument in functions.
df = DataFrame(x=x, y=y, copycols=false)
Row | x | y |
---|---|---|
Int64 | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
now it is the same
y === df.y
true
not the same
select(df, :y)[!, 1] === y
false
the same
select(df, :y, copycols=false)[!, 1] === y
true
Do not modify the parent of GroupedDataFrame
or view#
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
GroupedDataFrame with 2 groups based on key: id
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 3 |
3 | 1 | 5 |
⋮
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 2 |
2 | 2 | 4 |
3 | 2 | 6 |
s = view(x, 5:6, :)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 1 | 5 |
2 | 2 | 6 |
delete!(x, 3:6)
Row | id | x |
---|---|---|
Int64 | Int64 | |
1 | 2 | 1 |
2 | 2 | 2 |
This is an error
s ## Will return BoundsError
Single column selection for DataFrame
creates aliases with ! and getproperty
syntax and copies with :#
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 1 | 1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
x[1, 1] = 100
display(x)
Row | a | b | c | d | e |
---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 100 | 100 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 | 3 |
When iterating rows of a data frame#
use
eachrow
to avoid compilation cost (wide tables),but
Tables.namedtupleiterator
for fast execution (tall tables)
this table is wide
df1 = DataFrame([rand([1:2, 'a':'b', false:true, 1.0:2.0]) for i in 1:900], :auto)
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 | x16 | x17 | x18 | x19 | x20 | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 | x29 | x30 | x31 | x32 | x33 | x34 | x35 | x36 | x37 | x38 | x39 | x40 | x41 | x42 | x43 | x44 | x45 | x46 | x47 | x48 | x49 | x50 | x51 | x52 | x53 | x54 | x55 | x56 | x57 | x58 | x59 | x60 | x61 | x62 | x63 | x64 | x65 | x66 | x67 | x68 | x69 | x70 | x71 | x72 | x73 | x74 | x75 | x76 | x77 | x78 | x79 | x80 | x81 | x82 | x83 | x84 | x85 | x86 | x87 | x88 | x89 | x90 | x91 | x92 | x93 | x94 | x95 | x96 | x97 | x98 | x99 | x100 | ⋯ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Char | Char | Float64 | Float64 | Char | Float64 | Float64 | Int64 | Bool | Char | Char | Int64 | Char | Char | Bool | Int64 | Int64 | Float64 | Int64 | Bool | Bool | Char | Int64 | Float64 | Int64 | Char | Char | Bool | Float64 | Int64 | Float64 | Char | Bool | Char | Float64 | Float64 | Float64 | Char | Float64 | Char | Float64 | Char | Bool | Float64 | Int64 | Char | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 | Char | Bool | Char | Float64 | Float64 | Bool | Int64 | Bool | Char | Char | Char | Bool | Char | Char | Bool | Bool | Int64 | Float64 | Float64 | Char | Char | Int64 | Char | Char | Char | Bool | Float64 | Bool | Float64 | Int64 | Int64 | Int64 | Int64 | Char | Float64 | Bool | Float64 | Bool | Bool | Bool | Float64 | Bool | Int64 | Float64 | Int64 | Float64 | Float64 | Int64 | ⋯ | |
1 | a | a | 1.0 | 1.0 | a | 1.0 | 1.0 | 1 | false | a | a | 1 | a | a | false | 1 | 1 | 1.0 | 1 | false | false | a | 1 | 1.0 | 1 | a | a | false | 1.0 | 1 | 1.0 | a | false | a | 1.0 | 1.0 | 1.0 | a | 1.0 | a | 1.0 | a | false | 1.0 | 1 | a | 1 | 1.0 | 1 | 1 | 1 | 1 | a | false | a | 1.0 | 1.0 | false | 1 | false | a | a | a | false | a | a | false | false | 1 | 1.0 | 1.0 | a | a | 1 | a | a | a | false | 1.0 | false | 1.0 | 1 | 1 | 1 | 1 | a | 1.0 | false | 1.0 | false | false | false | 1.0 | false | 1 | 1.0 | 1 | 1.0 | 1.0 | 1 | ⋯ |
2 | b | b | 2.0 | 2.0 | b | 2.0 | 2.0 | 2 | true | b | b | 2 | b | b | true | 2 | 2 | 2.0 | 2 | true | true | b | 2 | 2.0 | 2 | b | b | true | 2.0 | 2 | 2.0 | b | true | b | 2.0 | 2.0 | 2.0 | b | 2.0 | b | 2.0 | b | true | 2.0 | 2 | b | 2 | 2.0 | 2 | 2 | 2 | 2 | b | true | b | 2.0 | 2.0 | true | 2 | true | b | b | b | true | b | b | true | true | 2 | 2.0 | 2.0 | b | b | 2 | b | b | b | true | 2.0 | true | 2.0 | 2 | 2 | 2 | 2 | b | 2.0 | true | 2.0 | true | true | true | 2.0 | true | 2 | 2.0 | 2 | 2.0 | 2.0 | 2 | ⋯ |
@time collect(eachrow(df1))
0.037020 seconds (56.39 k allocations: 3.886 MiB, 99.93% compilation time)
2-element Vector{DataFrameRow}:
DataFrameRow
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ⋯
│ Char Char Float64 Float64 Char Float64 Float64 Int64 Bool Cha ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ a a 1.0 1.0 a 1.0 1.0 1 false a ⋯
891 columns omitted
DataFrameRow
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ⋯
│ Char Char Float64 Float64 Char Float64 Float64 Int64 Bool Char ⋯
─────┼──────────────────────────────────────────────────────────────────────────
2 │ b b 2.0 2.0 b 2.0 2.0 2 true b ⋯
890 columns omitted
@time collect(Tables.namedtupleiterator(df1));
7.351881 seconds (3.16 M allocations: 220.459 MiB, 1.31% gc time, 98.92% compilation time)
as you can see the time to compile Tables.namedtupleiterator
is very large in this case, and it would get much worse if the table was wider (that is why we include this tip in pitfalls notebook)
the table below is tall
df2 = DataFrame(rand(10^6, 10), :auto)
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
---|---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 0.20305 | 0.612579 | 0.960699 | 0.0429461 | 0.561074 | 0.513787 | 0.95025 | 0.447074 | 0.805622 | 0.411043 |
2 | 0.331702 | 0.748592 | 0.817733 | 0.740208 | 0.644501 | 0.980437 | 0.282487 | 0.959924 | 0.499066 | 0.275427 |
3 | 0.171449 | 0.885185 | 0.0873129 | 0.275097 | 0.597741 | 0.933036 | 0.387212 | 0.0961785 | 0.393355 | 0.111773 |
4 | 0.789224 | 0.0365954 | 0.713769 | 0.36251 | 0.836241 | 0.812145 | 0.223275 | 0.907499 | 0.398539 | 0.713236 |
5 | 0.031842 | 0.360387 | 0.711545 | 0.0523417 | 0.620548 | 0.387297 | 0.714386 | 0.346333 | 0.680157 | 0.781664 |
6 | 0.429233 | 0.457554 | 0.0314419 | 0.0863836 | 0.301779 | 0.586468 | 0.225692 | 0.109725 | 0.611431 | 0.424604 |
7 | 0.980613 | 0.969515 | 0.58466 | 0.508379 | 0.490374 | 0.198298 | 0.269683 | 0.142288 | 0.970161 | 0.664537 |
8 | 0.327554 | 0.765649 | 0.117795 | 0.354197 | 0.877537 | 0.098652 | 0.329172 | 0.187273 | 0.560301 | 0.635005 |
9 | 0.0715169 | 0.865372 | 0.437003 | 0.396283 | 0.168619 | 0.67749 | 0.441958 | 0.485587 | 0.836927 | 0.266633 |
10 | 0.545102 | 0.50035 | 0.752733 | 0.239469 | 0.245474 | 0.958642 | 0.0998664 | 0.0571616 | 0.929764 | 0.150887 |
11 | 0.074379 | 0.676876 | 0.271264 | 0.941166 | 0.157689 | 0.636524 | 0.0101853 | 0.972159 | 0.152452 | 0.761652 |
12 | 0.450031 | 0.888344 | 0.682172 | 0.854843 | 0.88452 | 0.664781 | 0.947989 | 0.810652 | 0.155024 | 0.436813 |
13 | 0.726675 | 0.805576 | 0.150509 | 0.134265 | 0.570827 | 0.349513 | 0.864771 | 0.125271 | 0.476625 | 0.830557 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
999989 | 0.326895 | 0.396992 | 0.0251206 | 0.00276134 | 0.290215 | 0.726458 | 0.560868 | 0.760078 | 0.793187 | 0.624001 |
999990 | 0.192186 | 0.632419 | 0.590982 | 0.230539 | 0.721277 | 0.598367 | 0.241642 | 0.773044 | 0.801229 | 0.0218375 |
999991 | 0.177551 | 0.137556 | 0.440097 | 0.4134 | 0.597243 | 0.123575 | 0.390303 | 0.857051 | 0.80193 | 0.299181 |
999992 | 0.0594714 | 0.0544668 | 0.710189 | 0.223465 | 0.469936 | 0.898097 | 0.859902 | 0.0557366 | 0.884044 | 0.180341 |
999993 | 0.973761 | 0.173231 | 0.19324 | 0.639898 | 0.839009 | 0.590867 | 0.157215 | 0.649433 | 0.413117 | 0.599076 |
999994 | 0.187871 | 0.987062 | 0.796542 | 0.290989 | 0.310357 | 0.0180743 | 0.164409 | 0.931343 | 0.0511988 | 0.31726 |
999995 | 0.666839 | 0.127095 | 0.532349 | 0.661607 | 0.994068 | 0.0476131 | 0.593223 | 0.587135 | 0.0548164 | 0.093763 |
999996 | 0.304712 | 0.986163 | 0.409787 | 0.254253 | 0.623207 | 0.581812 | 0.681908 | 0.58666 | 0.514206 | 0.00413694 |
999997 | 0.306947 | 0.736463 | 0.372172 | 0.348905 | 0.274012 | 0.00784642 | 0.789987 | 0.758816 | 0.53837 | 0.73107 |
999998 | 0.569664 | 0.0571912 | 0.3477 | 0.774852 | 0.168478 | 0.575819 | 0.265557 | 0.562007 | 0.547183 | 0.445897 |
999999 | 0.402499 | 0.750417 | 0.962704 | 0.563778 | 0.147457 | 0.200337 | 0.479564 | 0.888691 | 0.62855 | 0.0769906 |
1000000 | 0.0831838 | 0.243365 | 0.627154 | 0.469265 | 0.0420256 | 0.452327 | 0.690367 | 0.999121 | 0.968719 | 0.243514 |
@time map(sum, eachrow(df2))
1.674922 seconds (60.08 M allocations: 1.056 GiB, 5.52% gc time, 4.54% compilation time)
1000000-element Vector{Float64}:
5.508125164172659
6.280076471054171
3.938338661817624
5.793033557407247
4.686500468265144
3.264311324258349
5.778508927749816
4.253135118411479
4.647388180990119
4.479449037206201
4.65434505685093
6.775168752263508
5.034589343875677
⋮
4.506575801584777
4.803521462121904
4.237887531871456
4.3956490758136155
5.228847615756573
4.055104832677104
4.358507829206297
4.946845026374001
4.864588441093377
4.3143479012392465
5.100987884997755
4.819042473984633
@time map(sum, eachrow(df2))
1.541101 seconds (59.99 M allocations: 1.050 GiB, 2.33% gc time)
1000000-element Vector{Float64}:
5.508125164172659
6.280076471054171
3.938338661817624
5.793033557407247
4.686500468265144
3.264311324258349
5.778508927749816
4.253135118411479
4.647388180990119
4.479449037206201
4.65434505685093
6.775168752263508
5.034589343875677
⋮
4.506575801584777
4.803521462121904
4.237887531871456
4.3956490758136155
5.228847615756573
4.055104832677104
4.358507829206297
4.946845026374001
4.864588441093377
4.3143479012392465
5.100987884997755
4.819042473984633
@time map(sum, Tables.namedtupleiterator(df2))
0.146836 seconds (200.16 k allocations: 21.167 MiB, 95.14% compilation time)
1000000-element Vector{Float64}:
5.508125164172659
6.280076471054171
3.938338661817624
5.793033557407247
4.686500468265144
3.264311324258349
5.778508927749816
4.253135118411479
4.647388180990119
4.479449037206201
4.65434505685093
6.775168752263508
5.034589343875677
⋮
4.506575801584777
4.803521462121904
4.237887531871456
4.3956490758136155
5.228847615756573
4.055104832677104
4.358507829206297
4.946845026374001
4.864588441093377
4.3143479012392465
5.100987884997755
4.819042473984633
@time map(sum, Tables.namedtupleiterator(df2))
0.007023 seconds (17 allocations: 7.630 MiB)
1000000-element Vector{Float64}:
5.508125164172659
6.280076471054171
3.938338661817624
5.793033557407247
4.686500468265144
3.264311324258349
5.778508927749816
4.253135118411479
4.647388180990119
4.479449037206201
4.65434505685093
6.775168752263508
5.034589343875677
⋮
4.506575801584777
4.803521462121904
4.237887531871456
4.3956490758136155
5.228847615756573
4.055104832677104
4.358507829206297
4.946845026374001
4.864588441093377
4.3143479012392465
5.100987884997755
4.819042473984633
as you can see - this time it is much faster to iterate a type stable container
still you might want to use the select
syntax, which is optimized for such reductions:
this includes compilation time
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.035839 seconds (7.72 k allocations: 8.147 MiB, 83.72% compilation time)
1000000-element Vector{Float64}:
5.508125164172659
6.280076471054171
3.938338661817624
5.793033557407247
4.686500468265144
3.264311324258349
5.778508927749816
4.253135118411479
4.647388180990119
4.479449037206201
4.65434505685093
6.775168752263508
5.034589343875677
⋮
4.506575801584777
4.803521462121904
4.237887531871456
4.3956490758136155
5.228847615756573
4.055104832677104
4.358507829206297
4.946845026374001
4.864588441093377
4.3143479012392465
5.100987884997755
4.819042473984633
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.004841 seconds (121 allocations: 7.635 MiB)
1000000-element Vector{Float64}:
5.508125164172659
6.280076471054171
3.938338661817624
5.793033557407247
4.686500468265144
3.264311324258349
5.778508927749816
4.253135118411479
4.647388180990119
4.479449037206201
4.65434505685093
6.775168752263508
5.034589343875677
⋮
4.506575801584777
4.803521462121904
4.237887531871456
4.3956490758136155
5.228847615756573
4.055104832677104
4.358507829206297
4.946845026374001
4.864588441093377
4.3143479012392465
5.100987884997755
4.819042473984633