Performance tips

using DataFrames
using BenchmarkTools
using CategoricalArrays
using PooledArrays
using Random

Access by column number is faster than by name

x = DataFrame(rand(5, 1000), :auto)
@btime $x[!, 500]; ## Faster

  3.827 ns (0 allocations: 0 bytes)

@btime $x.x500;  ## Slower

  15.630 ns (0 allocations: 0 bytes)
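
When a column has to be looked up by name, one option is to resolve the name to its position once and reuse the integer afterwards. This is a minimal sketch, assuming the `columnindex` function exported by DataFrames.jl; the variable names are illustrative only.

```julia
using DataFrames

x = DataFrame(rand(5, 1000), :auto)

# resolve the name to its position once
# (columnindex returns 0 when the name is not present)
idx = columnindex(x, :x500)

# both access forms return the same underlying vector without copying
col_by_name = x.x500
col_by_index = x[!, idx]
col_by_name === col_by_index
```

The difference measured above is a few nanoseconds per lookup, so this is likely to matter only when the lookup sits inside a hot loop.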

When working with a DataFrame, use barrier functions or type annotations

function f_bad() ## this function will be slow
    Random.seed!(1)
    x = DataFrame(rand(1000000, 2), :auto)
    y, z = x[!, 1], x[!, 2]
    p = 0.0
    for i in 1:nrow(x)
        p += y[i] * z[i]
    end
    p
end

@btime f_bad();
# running @code_warntype f_bad() shows that
# Julia cannot infer the column types of a `DataFrame`

  149.432 ms (5999022 allocations: 122.06 MiB)
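
The type instability can also be seen without reading `@code_warntype` output. The sketch below compares the run-time type of a column with the type the compiler infers for the extraction; it relies on `Base.return_types`, an introspection helper that is not part of Julia's stable public API, so treat it as a diagnostic aid only.

```julia
using DataFrames

x = DataFrame(rand(3, 2), :auto)

# at run time the column has a concrete type
typeof(x[!, 1])  # Vector{Float64}

# at compile time only an abstract type is known for column extraction,
# because a DataFrame does not encode column types in its own type
inferred = only(Base.return_types(df -> df[!, 1], (DataFrame,)))
isconcretetype(inferred)
```

Because the inferred type is not concrete, every `y[i] * z[i]` in `f_bad` has to be dispatched dynamically, which is where the allocations come from.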

Solution 1 is to use a barrier function for the calculation (this should be possible in almost any code). You will notice far fewer memory allocations and faster performance.

function f_inner(y, z)
    p = 0.0
    for i in eachindex(y, z)
        p += y[i] * z[i]
    end
    p
end

function f_barrier()
    Random.seed!(1)
    x = DataFrame(rand(1000000, 2), :auto)
    f_inner(x[!, 1], x[!, 2])
end

@btime f_barrier();

  3.681 ms (44 allocations: 30.52 MiB)

Or use a built-in function if possible:

using LinearAlgebra

function f_inbuilt()
    Random.seed!(1)
    x = DataFrame(rand(1000000, 2), :auto)
    dot(x[!, 1], x[!, 2])
end

@btime f_inbuilt();

  3.165 ms (44 allocations: 30.52 MiB)

Solution 2 is to annotate the types of the extracted columns. However, there are cases in which you will not know these types.

function f_typed()
    Random.seed!(1)
    x = DataFrame(rand(1000000, 2), :auto)
    y::Vector{Float64}, z::Vector{Float64} = x[!, 1], x[!, 2]
    p = 0.0
    for i in 1:nrow(x)
        p += y[i] * z[i]
    end
    p
end

@btime f_typed();

  3.763 ms (44 allocations: 30.52 MiB)

In general, for tall and narrow tables it is often useful to use Tables.rowtable, Tables.columntable or Tables.namedtupleiterator for intermediate processing of data in a type-stable way.
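
As a rough illustration of the `Tables.columntable` route, the sketch below converts a small data frame into a `NamedTuple` of vectors and passes it to a function. It assumes Tables.jl is available in the environment; `sum_products` is a hypothetical helper, not part of any package.

```julia
using DataFrames
using Tables

# hypothetical helper: the column types are part of the NamedTuple type,
# so the compiler can specialize this loop
function sum_products(tbl)
    p = 0.0
    for i in eachindex(tbl.x1, tbl.x2)
        p += tbl.x1[i] * tbl.x2[i]
    end
    p
end

df = DataFrame(x1 = [1.0, 2.0, 3.0], x2 = [4.0, 5.0, 6.0])
nt = Tables.columntable(df)  # NamedTuple of column vectors
sum_products(nt)             # 32.0
```

The conversion itself compiles code specific to the set of column names and types, which is one reason this approach suits narrow tables better than very wide ones.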

Consider using the delayed `DataFrame` creation technique

Also notice the difference in performance between copying and non-copying data frame creation.

function f1()
    x = DataFrame([Vector{Float64}(undef, 10^4) for i in 1:100], :auto, copycols=false) ## we work with a DataFrame directly
    for c in 1:ncol(x)
        d = x[!, c]
        for r in 1:nrow(x)
            d[r] = rand()
        end
    end
    x
end

function f1a()
    x = DataFrame([Vector{Float64}(undef, 10^4) for i in 1:100], :auto) ## we work with a DataFrame directly
    for c in 1:ncol(x)
        d = x[!, c]
        for r in 1:nrow(x)
            d[r] = rand()
        end
    end
    x
end

function f2()
    x = Vector{Any}(undef, 100)
    for c in eachindex(x)
        d = Vector{Float64}(undef, 10^4)
        for r in eachindex(d)
            d[r] = rand()
        end
        x[c] = d
    end
    DataFrame(x, :auto, copycols=false) ## we delay creation of the DataFrame until the work is done
end

function f2a()
    x = Vector{Any}(undef, 100)
    for c in eachindex(x)
        d = Vector{Float64}(undef, 10^4)
        for r in eachindex(d)
            d[r] = rand()
        end
        x[c] = d
    end
    DataFrame(x, :auto) ## we delay creation of the DataFrame until the work is done
end

@btime f1();
@btime f1a();
@btime f2();
@btime f2a();

541 ms (1949728 allocations: 37.40 MiB)
865 ms (1950028 allocations: 45.03 MiB)
142 ms (728 allocations: 7.66 MiB)
527 ms (1028 allocations: 15.29 MiB)
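
The `copycols` keyword used in the functions above decides whether the constructor copies the vectors it is given. A small sketch of the difference, with illustrative variable names:

```julia
using DataFrames

v = [1.0, 2.0, 3.0]

df_copy = DataFrame([v], :auto)                    # default: the column is a copy of v
df_alias = DataFrame([v], :auto, copycols=false)   # the column is v itself

df_copy.x1 === v    # false
df_alias.x1 === v   # true
```

With `copycols=false` the data frame and the original vector share memory, so later changes to `v` are visible through `df_alias`. That is the price of skipping the copy.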

You can add rows to a DataFrame in place and it is fast

x = DataFrame(rand(10^6, 5), :auto)
y = DataFrame(transpose(1.0:5.0), :auto)
z = [1.0:5.0;]

@btime vcat($x, $y); ## creates a new DataFrame - slow
@btime append!($x, $y); ## in place - fast

x = DataFrame(rand(10^6, 5), :auto) ## reset to the same starting point
@btime push!($x, $z); ## add a single row in place - fast

477 ms (212 allocations: 38.16 MiB)
124 μs (29 allocations: 1.50 KiB)
462 ns (16 allocations: 256 bytes)
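
For reference, here is a small sketch of the in-place forms on a toy data frame. The column names and values are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [1.0, 2.0])

push!(df, (3, 3.0))                          # a tuple is matched by position
push!(df, (a = 4, b = 4.0))                  # a NamedTuple is matched by column name
append!(df, DataFrame(a = [5], b = [5.0]))   # another data frame, added in place

nrow(df)  # 5
```

All three calls mutate `df` rather than allocating a new data frame, unlike `vcat`.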

Allowing missing values or using categorical arrays slows down computations

using StatsBase

function test(data) ## uses countmap function to test performance
    println(eltype(data))
    x = rand(data, 10^6)
    y = categorical(x)
    println(" raw:")
    @btime countmap($x)
    println(" categorical:")
    @btime countmap($y)
    nothing
end

test(1:10)
test([randstring() for i in 1:10])
test(allowmissing(1:10))
test(allowmissing([randstring() for i in 1:10]))

Int64
 raw:
  1.812 ms (8 allocations: 7.63 MiB)
 categorical:
  15.993 ms (1000004 allocations: 30.52 MiB)
String
 raw:
  21.960 ms (4 allocations: 448 bytes)
 categorical:
  32.089 ms (1000004 allocations: 30.52 MiB)
Union{Missing, Int64}
 raw:
  6.031 ms (4 allocations: 464 bytes)
 categorical:
  15.935 ms (1000004 allocations: 30.52 MiB)
Union{Missing, String}
 raw:
  22.191 ms (4 allocations: 448 bytes)
 categorical:
  33.176 ms (1000004 allocations: 30.52 MiB)
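
If a column allows `missing` but does not actually contain any, one possible remedy is to narrow its element type before heavy computation. This is a sketch, assuming the `disallowmissing!` function from DataFrames.jl, which by default throws an error if a missing value is present.

```julia
using DataFrames

df = DataFrame(a = allowmissing([1, 2, 3]))
eltype(df.a)  # Union{Missing, Int64}

# narrow the element type in place; this fails if df.a contains missing
disallowmissing!(df, :a)
eltype(df.a)  # Int64
```

Whether this pays off depends on the operation. In the benchmark above the penalty is visible for integers but negligible for strings.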

When aggregating, use a column selector and prefer an integer, categorical, or pooled array grouping variable

df = DataFrame(x=rand('a':'d', 10^7), y=1);

gdf = groupby(df, :x)

GroupedDataFrame with 4 groups based on key: x

First Group (2500724 rows): x = 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

2500699 rows omitted

Row	x	y
	Char	Int64
1	b	1
2	b	1
3	b	1
4	b	1
5	b	1
6	b	1
7	b	1
8	b	1
9	b	1
10	b	1
11	b	1
12	b	1
13	b	1
⋮	⋮	⋮
2500713	b	1
2500714	b	1
2500715	b	1
2500716	b	1
2500717	b	1
2500718	b	1
2500719	b	1
2500720	b	1
2500721	b	1
2500722	b	1
2500723	b	1
2500724	b	1

⋮

Last Group (2500304 rows): x = 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

2500279 rows omitted

Row	x	y
	Char	Int64
1	c	1
2	c	1
3	c	1
4	c	1
5	c	1
6	c	1
7	c	1
8	c	1
9	c	1
10	c	1
11	c	1
12	c	1
13	c	1
⋮	⋮	⋮
2500293	c	1
2500294	c	1
2500295	c	1
2500296	c	1
2500297	c	1
2500298	c	1
2500299	c	1
2500300	c	1
2500301	c	1
2500302	c	1
2500303	c	1
2500304	c	1

Traditional syntax, slow:

@btime combine(v -> sum(v.y), $gdf)

  15.846 ms (332 allocations: 19.09 MiB)

4×2 DataFrame

Row	x	x1
	Char	Int64
1	b	2500724
2	d	2500237
3	a	2498735
4	c	2500304

Use a column selector:

@btime combine($gdf, :y => sum)

  6.916 ms (198 allocations: 10.14 KiB)

4×2 DataFrame

Row	x	y_sum
	Char	Int64
1	b	2500724
2	d	2500237
3	a	2498735
4	c	2500304
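
The two forms compute the same aggregate and differ only in how the operation is specified. A small sketch on toy data (the values are made up) that checks this:

```julia
using DataFrames

small = DataFrame(x = ['a', 'b', 'a', 'b'], y = [1, 2, 3, 4])
g = groupby(small, :x)

# function form: the function receives a whole sub-data-frame per group
r1 = combine(v -> sum(v.y), g)

# column selector form: only column :y is passed to sum
r2 = combine(g, :y => sum)

r1.x1 == r2.y_sum
```

The selector form also accepts a target name, for example `:y => sum => :total`, if the automatically generated `y_sum` is not wanted.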

transform!(df, :x => categorical => :x);
gdf = groupby(df, :x)

GroupedDataFrame with 4 groups based on key: x

First Group (2498735 rows): x = CategoricalArrays.CategoricalValue{Char, UInt32} 'a'

2498710 rows omitted

Row	x	y
	Cat…	Int64
1	a	1
2	a	1
3	a	1
4	a	1
5	a	1
6	a	1
7	a	1
8	a	1
9	a	1
10	a	1
11	a	1
12	a	1
13	a	1
⋮	⋮	⋮
2498724	a	1
2498725	a	1
2498726	a	1
2498727	a	1
2498728	a	1
2498729	a	1
2498730	a	1
2498731	a	1
2498732	a	1
2498733	a	1
2498734	a	1
2498735	a	1

⋮

Last Group (2500237 rows): x = CategoricalArrays.CategoricalValue{Char, UInt32} 'd'

2500212 rows omitted

Row	x	y
	Cat…	Int64
1	d	1
2	d	1
3	d	1
4	d	1
5	d	1
6	d	1
7	d	1
8	d	1
9	d	1
10	d	1
11	d	1
12	d	1
13	d	1
⋮	⋮	⋮
2500226	d	1
2500227	d	1
2500228	d	1
2500229	d	1
2500230	d	1
2500231	d	1
2500232	d	1
2500233	d	1
2500234	d	1
2500235	d	1
2500236	d	1
2500237	d	1

@btime combine($gdf, :y => sum)

  6.866 ms (206 allocations: 10.62 KiB)

4×2 DataFrame

Row	x	y_sum
	Cat…	Int64
1	a	2498735
2	b	2500724
3	c	2500304
4	d	2500237

transform!(df, :x => PooledArray{Char} => :x)

10000000×2 DataFrame

9999975 rows omitted

Row	x	y
	Char	Int64
1	b	1
2	b	1
3	d	1
4	a	1
5	c	1
6	b	1
7	c	1
8	b	1
9	a	1
10	c	1
11	a	1
12	a	1
13	d	1
⋮	⋮	⋮
9999989	d	1
9999990	a	1
9999991	c	1
9999992	d	1
9999993	c	1
9999994	d	1
9999995	b	1
9999996	b	1
9999997	b	1
9999998	a	1
9999999	a	1
10000000	a	1

gdf = groupby(df, :x)

GroupedDataFrame with 4 groups based on key: x

First Group (2500724 rows): x = 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

2500699 rows omitted

Row	x	y
	Char	Int64
1	b	1
2	b	1
3	b	1
4	b	1
5	b	1
6	b	1
7	b	1
8	b	1
9	b	1
10	b	1
11	b	1
12	b	1
13	b	1
⋮	⋮	⋮
2500713	b	1
2500714	b	1
2500715	b	1
2500716	b	1
2500717	b	1
2500718	b	1
2500719	b	1
2500720	b	1
2500721	b	1
2500722	b	1
2500723	b	1
2500724	b	1

⋮

Last Group (2500304 rows): x = 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

2500279 rows omitted

Row	x	y
	Char	Int64
1	c	1
2	c	1
3	c	1
4	c	1
5	c	1
6	c	1
7	c	1
8	c	1
9	c	1
10	c	1
11	c	1
12	c	1
13	c	1
⋮	⋮	⋮
2500293	c	1
2500294	c	1
2500295	c	1
2500296	c	1
2500297	c	1
2500298	c	1
2500299	c	1
2500300	c	1
2500301	c	1
2500302	c	1
2500303	c	1
2500304	c	1

@btime combine($gdf, :y => sum)

  6.882 ms (200 allocations: 10.20 KiB)

4×2 DataFrame

Row	x	y_sum
	Char	Int64
1	b	2500724
2	d	2500237
3	a	2498735
4	c	2500304
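
Both categorical and pooled arrays store each distinct value once and keep a small integer reference for every element, which grouping can take advantage of. A sketch of how the two differ in how they order those distinct values; reading the `pool` field of a `PooledArray` directly is an implementation detail and is used here only for illustration.

```julia
using CategoricalArrays
using PooledArrays

v = repeat(['b', 'a', 'd', 'c'], 25)

c = categorical(v)
p = PooledArray(v)

levels(c)        # levels are sorted: ['a', 'b', 'c', 'd']
length(p.pool)   # 4 distinct values stored once
p == v           # the pooled array still compares equal to the original data
```

This difference in ordering is consistent with the group order in the outputs above: the categorical grouping lists `a`, `b`, `c`, `d`, while the pooled grouping matches the plain `Char` column.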

Use views instead of materializing a new DataFrame

x = DataFrame(rand(100, 1000), :auto)
@btime $x[1:1, :]

  177.661 μs (3993 allocations: 159.07 KiB)

1×1000 DataFrame

900 columns omitted

Row	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10	x11	x12	x13	x14	x15	x16	x17	x18	x19	x20	x21	x22	x23	x24	x25	x26	x27	x28	x29	x30	x31	x32	x33	x34	x35	x36	x37	x38	x39	x40	x41	x42	x43	x44	x45	x46	x47	x48	x49	x50	x51	x52	x53	x54	x55	x56	x57	x58	x59	x60	x61	x62	x63	x64	x65	x66	x67	x68	x69	x70	x71	x72	x73	x74	x75	x76	x77	x78	x79	x80	x81	x82	x83	x84	x85	x86	x87	x88	x89	x90	x91	x92	x93	x94	x95	x96	x97	x98	x99	x100	⋯
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	⋯
1	0.950058	0.587494	0.177994	0.655275	0.973326	0.409907	0.835926	0.649191	0.202792	0.808865	0.581285	0.34027	0.136758	0.660741	0.97527	0.579335	0.90986	0.848891	0.575617	0.627489	0.290292	0.728637	0.0211294	0.820952	0.846581	0.132386	0.107926	0.430422	0.961199	0.729527	0.59214	0.2748	0.334664	0.524227	0.82397	0.0176395	0.46553	0.135355	0.780029	0.58738	0.483812	0.65528	0.584815	0.0778716	0.537381	0.436296	0.0305809	0.462507	0.505382	0.170584	0.221943	0.535946	0.78498	0.730994	0.41515	0.418	0.450529	0.489728	0.791703	0.465304	0.26486	0.067945	0.508496	0.184748	0.461517	0.193533	0.387085	0.931858	0.425018	0.0951919	0.206334	0.918469	0.775217	0.865825	0.248852	0.430064	0.291865	0.100074	0.739642	0.685014	0.0949525	0.69768	0.466532	0.741147	0.563658	0.865301	0.540111	0.359868	0.0393546	0.540505	0.770954	0.873907	0.0977718	0.86035	0.567951	0.891952	0.0967576	0.345434	0.731782	0.21978	⋯

@btime $x[1, :]

  22.833 ns (0 allocations: 0 bytes)

DataFrameRow (1000 columns)

900 columns omitted

Row	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10	x11	x12	x13	x14	x15	x16	x17	x18	x19	x20	x21	x22	x23	x24	x25	x26	x27	x28	x29	x30	x31	x32	x33	x34	x35	x36	x37	x38	x39	x40	x41	x42	x43	x44	x45	x46	x47	x48	x49	x50	x51	x52	x53	x54	x55	x56	x57	x58	x59	x60	x61	x62	x63	x64	x65	x66	x67	x68	x69	x70	x71	x72	x73	x74	x75	x76	x77	x78	x79	x80	x81	x82	x83	x84	x85	x86	x87	x88	x89	x90	x91	x92	x93	x94	x95	x96	x97	x98	x99	x100	⋯
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	⋯
1	0.950058	0.587494	0.177994	0.655275	0.973326	0.409907	0.835926	0.649191	0.202792	0.808865	0.581285	0.34027	0.136758	0.660741	0.97527	0.579335	0.90986	0.848891	0.575617	0.627489	0.290292	0.728637	0.0211294	0.820952	0.846581	0.132386	0.107926	0.430422	0.961199	0.729527	0.59214	0.2748	0.334664	0.524227	0.82397	0.0176395	0.46553	0.135355	0.780029	0.58738	0.483812	0.65528	0.584815	0.0778716	0.537381	0.436296	0.0305809	0.462507	0.505382	0.170584	0.221943	0.535946	0.78498	0.730994	0.41515	0.418	0.450529	0.489728	0.791703	0.465304	0.26486	0.067945	0.508496	0.184748	0.461517	0.193533	0.387085	0.931858	0.425018	0.0951919	0.206334	0.918469	0.775217	0.865825	0.248852	0.430064	0.291865	0.100074	0.739642	0.685014	0.0949525	0.69768	0.466532	0.741147	0.563658	0.865301	0.540111	0.359868	0.0393546	0.540505	0.770954	0.873907	0.0977718	0.86035	0.567951	0.891952	0.0967576	0.345434	0.731782	0.21978	⋯

@btime view($x, 1:1, :)

  21.908 ns (0 allocations: 0 bytes)

1×1000 SubDataFrame

900 columns omitted

Row	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10	x11	x12	x13	x14	x15	x16	x17	x18	x19	x20	x21	x22	x23	x24	x25	x26	x27	x28	x29	x30	x31	x32	x33	x34	x35	x36	x37	x38	x39	x40	x41	x42	x43	x44	x45	x46	x47	x48	x49	x50	x51	x52	x53	x54	x55	x56	x57	x58	x59	x60	x61	x62	x63	x64	x65	x66	x67	x68	x69	x70	x71	x72	x73	x74	x75	x76	x77	x78	x79	x80	x81	x82	x83	x84	x85	x86	x87	x88	x89	x90	x91	x92	x93	x94	x95	x96	x97	x98	x99	x100	⋯
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	⋯
1	0.950058	0.587494	0.177994	0.655275	0.973326	0.409907	0.835926	0.649191	0.202792	0.808865	0.581285	0.34027	0.136758	0.660741	0.97527	0.579335	0.90986	0.848891	0.575617	0.627489	0.290292	0.728637	0.0211294	0.820952	0.846581	0.132386	0.107926	0.430422	0.961199	0.729527	0.59214	0.2748	0.334664	0.524227	0.82397	0.0176395	0.46553	0.135355	0.780029	0.58738	0.483812	0.65528	0.584815	0.0778716	0.537381	0.436296	0.0305809	0.462507	0.505382	0.170584	0.221943	0.535946	0.78498	0.730994	0.41515	0.418	0.450529	0.489728	0.791703	0.465304	0.26486	0.067945	0.508496	0.184748	0.461517	0.193533	0.387085	0.931858	0.425018	0.0951919	0.206334	0.918469	0.775217	0.865825	0.248852	0.430064	0.291865	0.100074	0.739642	0.685014	0.0949525	0.69768	0.466532	0.741147	0.563658	0.865301	0.540111	0.359868	0.0393546	0.540505	0.770954	0.873907	0.0977718	0.86035	0.567951	0.891952	0.0967576	0.345434	0.731782	0.21978	⋯

@btime $x[1:1, 1:20]

  3.850 μs (70 allocations: 3.09 KiB)

1×20 DataFrame

Row	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10	x11	x12	x13	x14	x15	x16	x17	x18	x19	x20
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	0.950058	0.587494	0.177994	0.655275	0.973326	0.409907	0.835926	0.649191	0.202792	0.808865	0.581285	0.34027	0.136758	0.660741	0.97527	0.579335	0.90986	0.848891	0.575617	0.627489

@btime $x[1, 1:20]

  22.531 ns (0 allocations: 0 bytes)

DataFrameRow (20 columns)

Row	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10	x11	x12	x13	x14	x15	x16	x17	x18	x19	x20
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	0.950058	0.587494	0.177994	0.655275	0.973326	0.409907	0.835926	0.649191	0.202792	0.808865	0.581285	0.34027	0.136758	0.660741	0.97527	0.579335	0.90986	0.848891	0.575617	0.627489

@btime view($x, 1:1, 1:20)

  23.893 ns (0 allocations: 0 bytes)

1×20 SubDataFrame

Row	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10	x11	x12	x13	x14	x15	x16	x17	x18	x19	x20
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	0.950058	0.587494	0.177994	0.655275	0.973326	0.409907	0.835926	0.649191	0.202792	0.808865	0.581285	0.34027	0.136758	0.660741	0.97527	0.579335	0.90986	0.848891	0.575617	0.627489
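
The speed of a view comes from not copying data, which also means writes go through to the parent. A minimal sketch on toy data showing the difference from a materialized copy:

```julia
using DataFrames

x = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

v = view(x, 1:2, :)   # SubDataFrame: no data copied
c = x[1:2, :]         # new DataFrame: selected rows are copied

v[1, :a] = 100        # writes through to x
c[2, :a] = -1         # changes only the copy

x.a  # [100, 2, 3]
```

The same holds for `x[1, :]`, which returns a `DataFrameRow` that refers to the parent data frame rather than a copy of the row.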

This notebook was generated using Literate.jl.
