Possible pitfalls#
using DataFrames
using BenchmarkTools
Know what is copied when creating a DataFrame#
x = DataFrame(rand(3, 5), :auto)
| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | Float64 | |
| 1 | 0.904476 | 0.829423 | 0.0503674 | 0.712011 | 0.452017 |
| 2 | 0.531043 | 0.133402 | 0.0159836 | 0.130781 | 0.312669 |
| 3 | 0.774213 | 0.260022 | 0.779513 | 0.786003 | 0.105047 |
x and y are not the same object
y = copy(x)
x === y
false
x and y are not the same object
y = DataFrame(x)
x === y
false
the columns are also not the same
any(x[!, i] === y[!, i] for i in ncol(x))
false
x and y are not the same object
y = DataFrame(x, copycols=false)
x === y
false
But the columns are the same
all(x[!, i] === y[!, i] for i in ncol(x))
true
the same when creating data frames using kwarg syntax
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)
| Row | x | y |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
different object
y === df.y
false
range is converted to a vector
typeof(x), typeof(df.x)
(UnitRange{Int64}, Vector{Int64})
slicing rows always creates a copy
y === df[:, :y]
false
you can avoid copying by using copycols=false keyword argument in functions.
df = DataFrame(x=x, y=y, copycols=false)
| Row | x | y |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
now it is the same
y === df.y
true
not the same object
select(df, :y)[!, 1] === y
false
the same object
select(df, :y, copycols=false)[!, 1] === y
true
Do not modify the parent of GroupedDataFrame or view#
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)
x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view
GroupedDataFrame with 2 groups based on key: id
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 2 | 1 |
| 2 | 2 | 3 |
| 3 | 1 | 5 |
⋮
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 2 | 2 |
| 2 | 2 | 4 |
| 3 | 2 | 6 |
s = view(x, 5:6, :)
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 5 |
| 2 | 2 | 6 |
delete!(x, 3:6)
| Row | id | x |
|---|---|---|
| Int64 | Int64 | |
| 1 | 2 | 1 |
| 2 | 2 | 2 |
This is an error
s ## Will return BoundsError
Single column selection for a DataFrame#
Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)
| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | Int64 | |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |
x[1, 1] = 100
display(x)
| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | Int64 | |
| 1 | 100 | 100 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |
When iterating rows of a data frame#
use
eachrowto avoid compilation cost in wide tables,but
Tables.namedtupleiteratorfor fast execution in tall tables
The table below is tall:
df2 = DataFrame(rand(10^6, 10), :auto)
| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
| 1 | 0.478301 | 0.12516 | 0.443086 | 0.511008 | 0.881368 | 0.723646 | 0.891943 | 0.715338 | 0.226988 | 0.05517 |
| 2 | 0.537297 | 0.884842 | 0.761933 | 0.837568 | 0.142577 | 0.792199 | 0.602414 | 0.899725 | 0.594977 | 0.0328992 |
| 3 | 0.545046 | 0.146989 | 0.767474 | 0.750266 | 0.353807 | 0.901302 | 0.493169 | 0.289729 | 0.037967 | 0.601937 |
| 4 | 0.468204 | 0.860019 | 0.480399 | 0.433961 | 0.36144 | 0.0987604 | 0.869873 | 0.690603 | 0.0129994 | 0.858868 |
| 5 | 0.945626 | 0.737168 | 0.897333 | 0.721963 | 0.747643 | 0.378898 | 0.393668 | 0.965249 | 0.832653 | 0.368617 |
| 6 | 0.795077 | 0.24254 | 0.461301 | 0.071843 | 0.526477 | 0.885815 | 0.392322 | 0.804884 | 0.967799 | 0.484455 |
| 7 | 0.144368 | 0.0746826 | 0.296573 | 0.0946843 | 0.233691 | 0.207105 | 0.0322509 | 0.446124 | 0.316105 | 0.225936 |
| 8 | 0.246877 | 0.162375 | 0.829648 | 0.669706 | 0.255161 | 0.292926 | 0.255723 | 0.00901615 | 0.504461 | 0.0955296 |
| 9 | 0.601375 | 0.396815 | 0.630975 | 0.178445 | 0.373065 | 0.0397525 | 0.690713 | 0.504059 | 0.531599 | 0.243714 |
| 10 | 0.698136 | 0.881048 | 0.116188 | 0.486561 | 0.302637 | 0.0438733 | 0.503967 | 0.702356 | 0.273515 | 0.910538 |
| 11 | 0.0118833 | 0.856812 | 0.534104 | 0.0955872 | 0.191018 | 0.305049 | 0.689865 | 0.567668 | 0.320684 | 0.557577 |
| 12 | 0.191861 | 0.84547 | 0.239379 | 0.760855 | 0.362501 | 0.87176 | 0.729357 | 0.652585 | 0.76212 | 0.226549 |
| 13 | 0.172659 | 0.490196 | 0.286927 | 0.303485 | 0.00823402 | 0.120736 | 0.403559 | 0.755553 | 0.149911 | 0.605745 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 999989 | 0.25687 | 0.686048 | 0.141895 | 0.30186 | 0.749336 | 0.038252 | 0.688074 | 0.662057 | 0.655862 | 0.650975 |
| 999990 | 0.0755492 | 0.791997 | 0.615303 | 0.234143 | 0.874981 | 0.257686 | 0.833352 | 0.78647 | 0.0109449 | 0.341014 |
| 999991 | 0.465504 | 0.0621855 | 0.841689 | 0.589676 | 0.745905 | 0.359124 | 0.724153 | 0.480526 | 0.469831 | 0.730895 |
| 999992 | 0.643036 | 0.93191 | 0.992777 | 0.983357 | 0.459779 | 0.938882 | 0.776353 | 0.424043 | 0.234642 | 0.0562271 |
| 999993 | 0.160002 | 0.147143 | 0.496255 | 0.983457 | 0.276256 | 0.0725854 | 0.196438 | 0.210039 | 0.241376 | 0.978434 |
| 999994 | 0.951502 | 0.393141 | 0.449806 | 0.231952 | 0.275161 | 0.531019 | 0.845476 | 0.400511 | 0.293222 | 0.234725 |
| 999995 | 0.863946 | 0.468246 | 0.803938 | 0.588295 | 0.448115 | 0.466923 | 0.523416 | 0.492861 | 0.716858 | 0.131195 |
| 999996 | 0.814875 | 0.276221 | 0.46419 | 0.315931 | 0.413915 | 0.229485 | 0.889376 | 0.814128 | 0.0811651 | 0.0324598 |
| 999997 | 0.2221 | 0.669916 | 0.861596 | 0.87189 | 0.852959 | 0.325319 | 0.028256 | 0.54034 | 0.495246 | 0.832195 |
| 999998 | 0.183501 | 0.694183 | 0.236999 | 0.417095 | 0.0832853 | 0.364791 | 0.26945 | 0.420274 | 0.543252 | 0.0790998 |
| 999999 | 0.348597 | 0.0798918 | 0.500807 | 0.0939304 | 0.719459 | 0.142878 | 0.302941 | 0.391026 | 0.176512 | 0.981048 |
| 1000000 | 0.171675 | 0.261939 | 0.724141 | 0.796064 | 0.357761 | 0.792198 | 0.29389 | 0.269918 | 0.647273 | 0.575534 |
@time map(sum, eachrow(df2));
2.084860 seconds (60.13 M allocations: 1.057 GiB, 8.12% gc time, 5.40% compilation time)
@time map(sum, eachrow(df2));
1.929150 seconds (59.99 M allocations: 1.050 GiB, 8.39% gc time)
@time map(sum, Tables.namedtupleiterator(df2));
0.356282 seconds (1.29 M allocations: 70.545 MiB, 5.93% gc time, 97.51% compilation time)
@time map(sum, Tables.namedtupleiterator(df2));
0.008565 seconds (26 allocations: 7.631 MiB)
as you can see - this time it is much faster to iterate a type stable container
still you might want to use the select syntax, which is optimized for such reductions:
this includes compilation time
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.604399 seconds (999.89 k allocations: 57.228 MiB, 97.41% compilation time: 95% of which was recompilation)
1000000-element Vector{Float64}:
5.0520095448766185
6.086430758355689
4.887685872073017
5.135126449042814
6.988818506958208
5.632513763521528
2.071519953916969
3.321422331055035
4.190512286064625
4.918819423163244
⋮
6.441006218261728
3.761986370440841
4.606515314644607
5.503791814623299
4.331745999643407
5.69981683021577
3.29192976063087
3.7370911066160897
4.890392635140402
Do it again
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum
0.014927 seconds (127 allocations: 7.634 MiB)
1000000-element Vector{Float64}:
5.0520095448766185
6.086430758355689
4.887685872073017
5.135126449042814
6.988818506958208
5.632513763521528
2.071519953916969
3.321422331055035
4.190512286064625
4.918819423163244
⋮
6.441006218261728
3.761986370440841
4.606515314644607
5.503791814623299
4.331745999643407
5.69981683021577
3.29192976063087
3.7370911066160897
4.890392635140402
This notebook was generated using Literate.jl.