# Possible pitfalls

In [1]:
using DataFrames

## Know what is copied when creating a DataFrame

In [2]:
x = DataFrame(rand(3, 5), :auto)

| Row | x1 | x2 | x3 | x4 | x5 |
|-----|----|----|----|----|----|
|  | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.587516 | 0.718981 | 0.62205 | 0.791086 | 0.505925 |
| 2 | 0.900673 | 0.0775141 | 0.210494 | 0.15541 | 0.0177345 |
| 3 | 0.436133 | 0.866412 | 0.603696 | 0.360206 | 0.424278 |


`copy` creates a new data frame, so `x` and `y` are not the same object:

In [3]:
y = copy(x)
x === y

false

The `DataFrame` constructor also creates a new object:

In [4]:
y = DataFrame(x)
x === y

false

The columns are copied as well, so none of them is shared:

In [5]:
any(x[!, i] === y[!, i] for i in 1:ncol(x))

false

With `copycols=false`, `x` and `y` are still not the same object:

In [6]:
y = DataFrame(x, copycols=false)
x === y

false

But the columns are the same objects:

In [7]:
all(x[!, i] === y[!, i] for i in 1:ncol(x))

true
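
The practical consequence of sharing columns shows up when one of the data frames is mutated. The following is a minimal sketch; the names `y_copy` and `y_alias` are only illustrative and are not used elsewhere in this notebook.

```julia
using DataFrames

x = DataFrame(a=[1, 2, 3])

y_copy = DataFrame(x)                   # columns are copied (the default)
y_alias = DataFrame(x, copycols=false)  # columns are shared with x

y_copy.a[1] = -1    # changes only y_copy
y_alias.a[2] = -2   # changes x as well, because the column vector is shared

x.a                 # the second element was modified through y_alias
```

If you do not intend to share data, keep the default `copycols=true`.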

The same applies when creating a data frame using keyword-argument syntax: the columns are copied by default.

In [8]:
x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)

| Row | x | y |
|-----|---|---|
|  | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |


`df.y` is a different object from `y`:

In [9]:
y === df.y

false

A range is converted to a vector:

In [10]:
typeof(x), typeof(df.x)

(UnitRange{Int64}, Vector{Int64})

Slicing rows with `:` always creates a copy:

In [11]:
y === df[:, :y]

false

You can avoid copying by passing the `copycols=false` keyword argument to functions that accept it.

In [12]:
df = DataFrame(x=x, y=y, copycols=false)

| Row | x | y |
|-----|---|---|
|  | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |


Now `df.y` is the same object as `y`:

In [13]:
y === df.y

true

By default `select` copies the column, so the result is not the same object:

In [14]:
select(df, :y)[!, 1] === y

false

With `copycols=false`, `select` returns the same object:

In [15]:
select(df, :y, copycols=false)[!, 1] === y

true
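
Because `select(..., copycols=false)` reuses the source column, writing to the result also writes to the source. Here is a small sketch of that hazard; the variable name `s` is hypothetical and unrelated to the view defined later in this notebook.

```julia
using DataFrames

df = DataFrame(y=[1, 2, 3])

# No copy is made: the column of s is the same vector as df.y
s = select(df, :y, copycols=false)

s.y[1] = 100   # mutate through the selected data frame
df.y[1]        # the source data frame sees the change
```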

## Do not modify the parent of `GroupedDataFrame` or view

In [16]:
x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## this is wrong now: g is only a view of x, and its grouping is stale

| Row | id | x |
|-----|----|---|
|  | Int64 | Int64 |
| 1 | 2 | 1 |
| 2 | 2 | 3 |
| 3 | 1 | 5 |

| Row | id | x |
|-----|----|---|
|  | Int64 | Int64 |
| 1 | 2 | 2 |
| 2 | 2 | 4 |
| 3 | 2 | 6 |


In [17]:
s = view(x, 5:6, :)

| Row | id | x |
|-----|----|---|
|  | Int64 | Int64 |
| 1 | 1 | 5 |
| 2 | 2 | 6 |


In [18]:
deleteat!(x, 3:6)

| Row | id | x |
|-----|----|---|
|  | Int64 | Int64 |
| 1 | 2 | 1 |
| 2 | 2 | 2 |


Accessing the view is now an error, because `s` refers to rows that no longer exist in its parent:

```julia
s ## throws a BoundsError
```
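
One way to stay safe is to finish modifying the parent before grouping it, and to take an independent copy instead of a view when the parent is going to shrink. The sketch below illustrates this under the assumption that DataFrames.jl 1.3 or later is installed (for `deleteat!`); the name `ngroups` is only illustrative.

```julia
using DataFrames

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)

# Modify the parent first, and only then create the GroupedDataFrame
x[1:3, :id] = [2, 2, 2]
g = groupby(x, :id)
ngroups = length(g)   # the grouping reflects the current contents of x

# Row indexing with a range copies the data, so s does not depend on x
s = x[5:6, :]
deleteat!(x, 3:6)     # shrinking x does not invalidate s
s
```

If you already hold a view, `copy(view(x, 5:6, :))` materializes it into an independent `DataFrame` in the same way.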

## Single column selection for `DataFrame` creates aliases with `!` and `getproperty` syntax and copies with `:`

In [19]:
x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)

| Row | a | b | c | d | e |
|-----|---|---|---|---|---|
|  | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |


In [20]:
x[1, 1] = 100
display(x)

| Row | a | b | c | d | e |
|-----|---|---|---|---|---|
|  | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 100 | 100 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |
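
Aliasing can be detected with `===` and removed by assigning an explicit copy. The following is a minimal sketch that reuses the column names from the cells above; the names `aliased_before` and `aliased_after` are hypothetical.

```julia
using DataFrames

x = DataFrame(a=1:3)
x.b = x[!, 1]   # alias: b and a are the same vector
x.c = x[:, 1]   # copy: c is an independent vector

aliased_before = x.b === x.a   # true, the columns share memory

# Break the alias by replacing b with an independent copy
x.b = copy(x.a)
aliased_after = x.b === x.a    # false

x[1, :a] = 100  # now only column a changes
x
```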


## When iterating rows of a data frame

- use `eachrow` to avoid compilation cost (wide tables),
- but `Tables.namedtupleiterator` for fast execution (tall tables)
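
The difference between the two iterators comes from what they yield. The sketch below is for illustration only: it uses a small two-column data frame and reaches `Tables` through `using DataFrames`, as the other cells in this notebook do.

```julia
using DataFrames

df = DataFrame(a=1:3, b=[1.0, 2.0, 3.0])

# eachrow yields DataFrameRow objects; the column types are not part of the
# row type, so little needs to be compiled, but field access is type-unstable
r = first(eachrow(df))
r isa DataFrameRow

# namedtupleiterator yields NamedTuples whose type encodes every column name
# and element type; this is fast to run but costly to compile for wide tables
nt = first(Tables.namedtupleiterator(df))
nt isa NamedTuple{(:a, :b), Tuple{Int, Float64}}

# Both iterators give the same results
total_rows = sum(row.a + row.b for row in eachrow(df))
total_nts = sum(row.a + row.b for row in Tables.namedtupleiterator(df))
```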

This table is wide:

In [21]:
df1 = DataFrame([rand([1:2, 'a':'b', false:true, 1.0:2.0]) for i in 1:900], :auto)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,x37,x38,x39,x40,x41,x42,x43,x44,x45,x46,x47,x48,x49,x50,x51,x52,x53,x54,x55,x56,x57,x58,x59,x60,x61,x62,x63,x64,x65,x66,x67,x68,x69,x70,x71,x72,x73,x74,x75,x76,x77,x78,x79,x80,x81,x82,x83,x84,x85,x86,x87,x88,x89,x90,x91,x92,x93,x94,x95,x96,x97,x98,x99,x100,⋯
Unnamed: 0_level_1,Bool,Char,Bool,Int64,Int64,Int64,Float64,Char,Int64,Int64,Char,Float64,Float64,Int64,Char,Int64,Char,Int64,Int64,Float64,Char,Float64,Bool,Char,Int64,Bool,Int64,Float64,Float64,Char,Float64,Bool,Int64,Bool,Char,Int64,Int64,Int64,Char,Int64,Float64,Int64,Bool,Int64,Char,Float64,Bool,Bool,Bool,Float64,Char,Float64,Char,Char,Char,Bool,Float64,Char,Char,Char,Float64,Char,Int64,Bool,Bool,Char,Bool,Float64,Float64,Int64,Int64,Char,Float64,Char,Bool,Int64,Float64,Float64,Float64,Int64,Int64,Char,Int64,Bool,Int64,Char,Char,Float64,Int64,Bool,Char,Float64,Int64,Int64,Char,Bool,Float64,Float64,Int64,Float64,⋯
1,False,a,False,1,1,1,1.0,a,1,1,a,1.0,1.0,1,a,1,a,1,1,1.0,a,1.0,False,a,1,False,1,1.0,1.0,a,1.0,False,1,False,a,1,1,1,a,1,1.0,1,False,1,a,1.0,False,False,False,1.0,a,1.0,a,a,a,False,1.0,a,a,a,1.0,a,1,False,False,a,False,1.0,1.0,1,1,a,1.0,a,False,1,1.0,1.0,1.0,1,1,a,1,False,1,a,a,1.0,1,False,a,1.0,1,1,a,False,1.0,1.0,1,1.0,⋯
2,True,b,True,2,2,2,2.0,b,2,2,b,2.0,2.0,2,b,2,b,2,2,2.0,b,2.0,True,b,2,True,2,2.0,2.0,b,2.0,True,2,True,b,2,2,2,b,2,2.0,2,True,2,b,2.0,True,True,True,2.0,b,2.0,b,b,b,True,2.0,b,b,b,2.0,b,2,True,True,b,True,2.0,2.0,2,2,b,2.0,b,True,2,2.0,2.0,2.0,2,2,b,2,True,2,b,b,2.0,2,True,b,2.0,2,2,b,True,2.0,2.0,2,2.0,⋯


In [22]:
@time collect(eachrow(df1));

  0.041341 seconds (52.71 k allocations: 3.614 MiB, 99.91% compilation time)


In [23]:
@time collect(Tables.namedtupleiterator(df1));

  7.539398 seconds (885.91 k allocations: 70.446 MiB, 99.68% compilation time)


As you can see, the time to compile `Tables.namedtupleiterator` is very large in this case, and it would get much worse if the table were wider (which is why this tip is included in the pitfalls notebook).

The table below is tall:

In [24]:
df2 = DataFrame(rand(10^6, 10), :auto)

| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|-----|----|----|----|----|----|----|----|----|----|-----|
|  | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.791452 | 0.297286 | 0.049207 | 0.88973 | 0.309781 | 0.124727 | 0.526994 | 0.376758 | 0.982059 | 0.201495 |
| 2 | 0.220364 | 0.701363 | 0.37604 | 0.853463 | 0.292551 | 0.160833 | 0.403648 | 0.806145 | 0.686896 | 0.319294 |
| 3 | 0.375985 | 0.0946538 | 0.401264 | 0.259818 | 0.426253 | 0.992351 | 0.351296 | 0.897025 | 0.642545 | 0.36698 |
| 4 | 0.190554 | 0.799914 | 0.814485 | 0.664285 | 0.243209 | 0.538027 | 0.281022 | 0.716538 | 0.735261 | 0.0711319 |
| 5 | 0.794052 | 0.707524 | 0.982855 | 0.679856 | 0.33846 | 0.377405 | 0.393751 | 0.78178 | 0.44972 | 0.205827 |
| 6 | 0.388176 | 0.641699 | 0.793341 | 0.455715 | 0.24483 | 0.519513 | 0.697198 | 0.371501 | 0.707692 | 0.252749 |
| 7 | 0.432196 | 0.0478298 | 0.691303 | 0.952995 | 0.352697 | 0.359571 | 0.597546 | 0.151098 | 0.413731 | 0.00542862 |
| 8 | 0.796189 | 0.0284523 | 0.102914 | 0.570933 | 0.207063 | 0.212477 | 0.598974 | 0.391049 | 0.290673 | 0.416863 |
| 9 | 0.514341 | 0.574274 | 0.00955344 | 0.0426117 | 0.874077 | 0.198039 | 0.504121 | 0.848645 | 0.607743 | 0.464585 |
| 10 | 0.678113 | 0.601279 | 0.30789 | 0.037617 | 0.198317 | 0.0906097 | 0.0780964 | 0.339857 | 0.0657225 | 0.0773766 |


In [25]:
@time map(sum, eachrow(df2));

  2.127519 seconds (60.08 M allocations: 1.056 GiB, 6.17% gc time, 3.69% compilation time)


In [26]:
@time map(sum, eachrow(df2));

  2.119773 seconds (59.99 M allocations: 1.050 GiB, 6.50% gc time)


In [27]:
@time map(sum, Tables.namedtupleiterator(df2));

  0.195608 seconds (200.06 k allocations: 21.156 MiB, 24.98% gc time, 96.98% compilation time)


In [28]:
@time map(sum, Tables.namedtupleiterator(df2));

  0.005788 seconds (22 allocations: 7.631 MiB)


As you can see, this time it is much faster to iterate a type-stable container. Still, you might want to use the `select` syntax, which is optimized for such reductions.

The first run includes compilation time:

In [29]:
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum

  0.392389 seconds (509.34 k allocations: 41.875 MiB, 98.52% compilation time: 93% of which was recompilation)


1000000-element Vector{Float64}:
 4.549489608990553
 4.820598260856793
 4.808170226496026
 5.054427780801031
 5.711228368654866
 5.0724123056568216
 4.00439573440345
 3.61558727488398
 4.637989455369663
 2.474878422732406
 ⋮
 4.377478339806004
 4.073770679842652
 4.40725544523935
 3.836624901881449
 3.110289532802179
 4.945542466292135
 6.168372468024979
 5.415848010904047
 4.899889662604407

Run it again to see the execution time without compilation:

In [30]:
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum

  0.004722 seconds (125 allocations: 7.635 MiB)


1000000-element Vector{Float64}:
 4.549489608990553
 4.820598260856793
 4.808170226496026
 5.054427780801031
 5.711228368654866
 5.0724123056568216
 4.00439573440345
 3.61558727488398
 4.637989455369663
 2.474878422732406
 ⋮
 4.377478339806004
 4.073770679842652
 4.40725544523935
 3.836624901881449
 3.110289532802179
 4.945542466292135
 6.168372468024979
 5.415848010904047
 4.899889662604407
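
The `AsTable(cols) => ByRow(f) => name` pattern also works on a subset of columns and with `transform`, which keeps the original columns. The following is a small sketch on a hypothetical three-column data frame; the output column name `ab_sum` is chosen only for illustration.

```julia
using DataFrames

df = DataFrame(a=[1.0, 2.0], b=[3.0, 4.0], c=[5.0, 6.0])

# Row-wise sum over all columns, as in the cells above
all_sum = select(df, AsTable(:) => ByRow(sum) => "sum").sum

# Row-wise sum over columns a and b only, appended to the original columns
out = transform(df, AsTable([:a, :b]) => ByRow(sum) => :ab_sum)

# The same numbers can be cross-checked with plain broadcasting
out.ab_sum == df.a .+ df.b
```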

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*