Possible pitfalls

Possible pitfalls#

using DataFrames
using BenchmarkTools

Know what is copied when creating a DataFrame#

x = DataFrame(rand(3, 5), :auto)

3×5 DataFrame

Row	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.627966	0.780746	0.216054	0.49819	0.740574
2	0.982644	0.863759	0.521577	0.791425	0.916146
3	0.931923	0.66291	0.42917	0.778559	0.110755

x and y are not the same object

y = copy(x)
x === y

false

x and y are not the same object

y = DataFrame(x)
x === y

false

the columns are also not the same

any(x[!, i] === y[!, i] for i in ncol(x))

false

x and y are not the same object

y = DataFrame(x, copycols=false)
x === y

false

But the columns are the same

all(x[!, i] === y[!, i] for i in ncol(x))

true

the same when creating data frames using kwarg syntax

x = 1:3;
y = [1, 2, 3];
df = DataFrame(x=x, y=y)

3×2 DataFrame

Row	x	y
	Int64	Int64
1	1	1
2	2	2
3	3	3

different object

y === df.y

false

range is converted to a vector

typeof(x), typeof(df.x)

(UnitRange{Int64}, Vector{Int64})

slicing rows always creates a copy

y === df[:, :y]

false

you can avoid copying by using copycols=false keyword argument in functions.

df = DataFrame(x=x, y=y, copycols=false)

3×2 DataFrame

Row	x	y
	Int64	Int64
1	1	1
2	2	2
3	3	3

now it is the same

y === df.y

true

not the same object

select(df, :y)[!, 1] === y

false

the same object

select(df, :y, copycols=false)[!, 1] === y

true

Do not modify the parent of `GroupedDataFrame` or view#

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(x, :id)

x[1:3, 1] = [2, 2, 2]
g ## well - it is wrong now, g is only a view

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2

Row	id	x
	Int64	Int64
1	2	1
2	2	3
3	1	5

⋮

Last Group (3 rows): id = 2

Row	id	x
	Int64	Int64
1	2	2
2	2	4
3	2	6

s = view(x, 5:6, :)

2×2 SubDataFrame

Row	id	x
	Int64	Int64
1	1	5
2	2	6

delete!(x, 3:6)

2×2 DataFrame

Row	id	x
	Int64	Int64
1	2	1
2	2	2

This is an error

s ## Will return BoundsError

Single column selection for a `DataFrame`#

Single column selection for a DataFrame creates aliases with ! and getproperty syntax and copies with :

x = DataFrame(a=1:3)
x.b = x[!, 1] ## alias
x.c = x[:, 1] ## copy
x.d = x[!, 1][:] ## copy
x.e = copy(x[!, 1]) ## explicit copy
display(x)

3×5 DataFrame

Row	a	b	c	d	e
	Int64	Int64	Int64	Int64	Int64
1	1	1	1	1	1
2	2	2	2	2	2
3	3	3	3	3	3

x[1, 1] = 100
display(x)

3×5 DataFrame

Row	a	b	c	d	e
	Int64	Int64	Int64	Int64	Int64
1	100	100	1	1	1
2	2	2	2	2	2
3	3	3	3	3	3

When iterating rows of a data frame#

use eachrow to avoid compilation cost in wide tables,
but Tables.namedtupleiterator for fast execution in tall tables

The table below is tall:

df2 = DataFrame(rand(10^6, 10), :auto)

1000000×10 DataFrame

999975 rows omitted

Row	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	0.827493	0.522678	0.30587	0.697431	0.356574	0.147348	0.165574	0.35231	0.735541	0.734174
2	0.334083	0.721048	0.678431	0.338055	0.162462	0.96368	0.98222	0.310456	0.788059	0.808003
3	0.249476	0.185305	0.386422	0.670001	0.43029	0.300558	0.567526	0.604122	0.993032	0.412015
4	0.331262	0.241808	0.131926	0.47897	0.939951	0.866249	0.713522	0.85524	0.62698	0.94087
5	0.328309	0.913404	0.997025	0.631137	0.0583395	0.535069	0.115982	0.261661	0.771639	0.38268
6	0.49489	0.935854	0.849281	0.0651715	0.702173	0.973227	0.698805	0.911043	0.344635	0.0538166
7	0.639202	0.11087	0.380368	0.99173	0.732873	0.637735	0.130003	0.74666	0.825874	0.635628
8	0.594615	0.952221	0.118798	0.890403	0.731167	0.129171	0.190358	0.306405	0.598677	0.659036
9	0.326836	0.867387	0.779446	0.0514019	0.668467	0.0895685	0.825931	0.313593	0.506275	0.18824
10	0.451944	0.450682	0.695083	0.279971	0.518991	0.481737	0.510053	0.235263	0.755214	0.512064
11	0.391553	0.422148	0.679026	0.815403	0.120333	0.932611	0.784385	0.168465	0.662867	0.567092
12	0.394268	0.510768	0.868931	0.401659	0.701443	0.701251	0.214795	0.632434	0.392702	0.716703
13	0.574541	0.509977	0.733312	0.357177	0.204279	0.7421	0.837905	0.222128	0.960075	0.405046
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
999989	0.0175485	0.480487	0.892262	0.130995	0.670485	0.00171493	0.120122	0.299505	0.951317	0.964039
999990	0.891715	0.647522	0.561725	0.833824	0.299696	0.981391	0.747893	0.205204	0.602296	0.577729
999991	0.716421	0.939543	0.298614	0.0477076	0.410623	0.211471	0.536963	0.0797999	0.541015	0.343898
999992	0.36239	0.76989	0.351988	0.0889811	0.689272	0.0768747	0.860183	0.0105461	0.769641	0.511566
999993	0.875264	0.91345	0.0497442	0.132398	0.501189	0.974346	0.733187	0.0699565	0.0915653	0.492524
999994	0.548949	0.127222	0.188089	0.486114	0.163262	0.518948	0.600134	0.23425	0.120563	0.484254
999995	0.00522245	0.332911	0.332718	0.166368	0.905102	0.660736	0.15789	0.70676	0.379854	0.407764
999996	0.612489	0.967301	0.379855	0.147552	0.662538	0.847811	0.954871	0.116529	0.907558	0.337808
999997	0.774333	0.5798	0.745301	0.7291	0.490831	0.730082	0.964367	0.830893	0.309956	0.728045
999998	0.25082	0.490147	0.227523	0.871503	0.946821	0.19416	0.198351	0.913674	0.50724	0.0972831
999999	0.060027	0.375635	0.558574	0.296796	0.297386	0.747957	0.0859935	0.996066	0.351147	0.0207747
1000000	0.040967	0.316655	0.0913896	0.469002	0.44254	0.0260206	0.739715	0.796013	0.837032	0.0939526

@time map(sum, eachrow(df2));

  2.677138 seconds (60.11 M allocations: 1.056 GiB, 27.30% gc time, 3.70% compilation time)

@time map(sum, eachrow(df2));

  1.773191 seconds (59.99 M allocations: 1.050 GiB, 5.91% gc time)

@time map(sum, Tables.namedtupleiterator(df2));

  0.348400 seconds (1.32 M allocations: 72.984 MiB, 4.03% gc time, 94.13% compilation time)

@time map(sum, Tables.namedtupleiterator(df2));

  0.006175 seconds (25 allocations: 7.631 MiB)

as you can see - this time it is much faster to iterate a type stable container still you might want to use the select syntax, which is optimized for such reductions:

this includes compilation time

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum

  0.554322 seconds (756.88 k allocations: 46.117 MiB, 98.88% compilation time: 95% of which was recompilation)

1000000-element Vector{Float64}:
844991690000428
086497806255862
798747007881859
126779894498693
9952461619646815
028895513503592
83094213599793
170850614032696
617145262159551
891002752222908
 ⋮
4913317950278175
833623718382274
471786429775383
055326374105319
9343112062272665
882706645838512
697521607782131
790353847411509
8532860025938884

Do it again

@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum

  0.004271 seconds (127 allocations: 7.634 MiB)

1000000-element Vector{Float64}:
844991690000428
086497806255862
798747007881859
126779894498693
9952461619646815
028895513503592
83094213599793
170850614032696
617145262159551
891002752222908
 ⋮
4913317950278175
833623718382274
471786429775383
055326374105319
9343112062272665
882706645838512
697521607782131
790353847411509
8532860025938884

This notebook was generated using Literate.jl.

Possible pitfalls

Contents

Possible pitfalls#

Know what is copied when creating a DataFrame#

Do not modify the parent of GroupedDataFrame or view#

Single column selection for a DataFrame#

When iterating rows of a data frame#

Do not modify the parent of `GroupedDataFrame` or view#

Single column selection for a `DataFrame`#