Performance tips#

using DataFrames
using BenchmarkTools
using CategoricalArrays
using PooledArrays
using Random

Access by column number is faster than by name#

x = DataFrame(rand(5, 1000), :auto)
@btime $x[!, 500]; ## Faster
  3.827 ns (0 allocations: 0 bytes)
@btime $x.x500;  ## Slower
  15.952 ns (0 allocations: 0 bytes)

When working with data DataFrame use barrier functions or type annotation#

function f_bad() ## this function will be slow
    Random.seed!(1)
    x = DataFrame(rand(1000000, 2), :auto)
    y, z = x[!, 1], x[!, 2]
    p = 0.0
    for i in 1:nrow(x)
        p += y[i] * z[i]
    end
    p
end

@btime f_bad();
# if you run @code_warntype f_bad() then you notice
# that Julia does not know column types of `DataFrame`
  101.876 ms (5999022 allocations: 122.06 MiB)

solution 1 is to use barrier function (it should be possible to use it in almost any code) for the calculation. You will notice much less memopry allocations and faster performance.

function f_inner(y, z)
    p = 0.0
    for i in eachindex(y, z)
        p += y[i] * z[i]
    end
    p
end

function f_barrier()
    Random.seed!(1)
    x = DataFrame(rand(1000000, 2), :auto)
    f_inner(x[!, 1], x[!, 2])
end

@btime f_barrier();
  4.273 ms (44 allocations: 30.52 MiB)

or use inbuilt function if possible

using LinearAlgebra

function f_inbuilt()
    Random.seed!(1)
    x = DataFrame(rand(1000000, 2), :auto)
    dot(x[!, 1], x[!, 2])
end

@btime f_inbuilt();
  3.681 ms (44 allocations: 30.52 MiB)

solution 2 is to provide the types of extracted columns. However, there are cases in which you will not know these types.

function f_typed()
    Random.seed!(1)
    x = DataFrame(rand(1000000, 2), :auto)
    y::Vector{Float64}, z::Vector{Float64} = x[!, 1], x[!, 2]
    p = 0.0
    for i in 1:nrow(x)
        p += y[i] * z[i]
    end
    p
end

@btime f_typed();
  4.693 ms (44 allocations: 30.52 MiB)

In general for tall and narrow tables it is often useful to use Tables.rowtable, Tables.columntable or Tables.namedtupleiterator for intermediate processing of data in a type-stable way.

Consider using delayed DataFrame creation technique#

also notice the difference in performance between copying vs non-copying data frame creation

function f1()
    x = DataFrame([Vector{Float64}(undef, 10^4) for i in 1:100], :auto, copycols=false) ## we work with a DataFrame directly
    for c in 1:ncol(x)
        d = x[!, c]
        for r in 1:nrow(x)
            d[r] = rand()
        end
    end
    x
end

function f1a()
    x = DataFrame([Vector{Float64}(undef, 10^4) for i in 1:100], :auto) ## we work with a DataFrame directly
    for c in 1:ncol(x)
        d = x[!, c]
        for r in 1:nrow(x)
            d[r] = rand()
        end
    end
    x
end

function f2()
    x = Vector{Any}(undef, 100)
    for c in 1:length(x)
        d = Vector{Float64}(undef, 10^4)
        for r in eachindex(d)
            d[r] = rand()
        end
        x[c] = d
    end
    DataFrame(x, :auto, copycols=false) ## we delay creation of DataFrame after we have our job done
end

function f2a()
    x = Vector{Any}(undef, 100)
    for c in eachindex(x)
        d = Vector{Float64}(undef, 10^4)
        for r in eachindex(d)
            d[r] = rand()
        end
        x[c] = d
    end
    DataFrame(x, :auto) ## we delay creation of DataFrame after we have our job done
end

@btime f1();
@btime f1a();
@btime f2();
@btime f2a();
  27.122 ms (1949728 allocations: 37.40 MiB)
  27.939 ms (1950028 allocations: 45.03 MiB)
  1.147 ms (728 allocations: 7.66 MiB)
  1.602 ms (1028 allocations: 15.29 MiB)

You can add rows to a DataFrame in place and it is fast#

x = DataFrame(rand(10^6, 5), :auto)
y = DataFrame(transpose(1.0:5.0), :auto)
z = [1.0:5.0;]

@btime vcat($x, $y); ## creates a new DataFrame - slow
@btime append!($x, $y); ## in place - fast

x = DataFrame(rand(10^6, 5), :auto) ## reset to the same starting point
@btime push!($x, $z); ## add a single row in place - fast
  2.827 ms (212 allocations: 38.16 MiB)
  1.161 μs (29 allocations: 1.50 KiB)
  432.384 ns (16 allocations: 256 bytes)

Allowing missing as well as categorical slows down computations#

using StatsBase

function test(data) ## uses countmap function to test performance
    println(eltype(data))
    x = rand(data, 10^6)
    y = categorical(x)
    println(" raw:")
    @btime countmap($x)
    println(" categorical:")
    @btime countmap($y)
    nothing
end

test(1:10)
test([randstring() for i in 1:10])
test(allowmissing(1:10))
test(allowmissing([randstring() for i in 1:10]))
Int64
 raw:
  1.863 ms (8 allocations: 7.63 MiB)
 categorical:
  15.800 ms (1000004 allocations: 30.52 MiB)
String
 raw:
  17.820 ms (4 allocations: 448 bytes)
 categorical:
  28.717 ms (1000004 allocations: 30.52 MiB)
Union{Missing, Int64}
 raw:
  6.757 ms (4 allocations: 464 bytes)
 categorical:
  16.287 ms (1000004 allocations: 30.52 MiB)
Union{Missing, String}
 raw:
  18.310 ms (4 allocations: 448 bytes)
 categorical:
  30.633 ms (1000004 allocations: 30.52 MiB)

When aggregating use column selector and prefer integer, categorical, or pooled array grouping variable#

df = DataFrame(x=rand('a':'d', 10^7), y=1);
gdf = groupby(df, :x)

GroupedDataFrame with 4 groups based on key: x

First Group (2500895 rows): x = 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
2500870 rows omitted
Rowxy
CharInt64
1b1
2b1
3b1
4b1
5b1
6b1
7b1
8b1
9b1
10b1
11b1
12b1
13b1
2500884b1
2500885b1
2500886b1
2500887b1
2500888b1
2500889b1
2500890b1
2500891b1
2500892b1
2500893b1
2500894b1
2500895b1

Last Group (2499685 rows): x = 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
2499660 rows omitted
Rowxy
CharInt64
1a1
2a1
3a1
4a1
5a1
6a1
7a1
8a1
9a1
10a1
11a1
12a1
13a1
2499674a1
2499675a1
2499676a1
2499677a1
2499678a1
2499679a1
2499680a1
2499681a1
2499682a1
2499683a1
2499684a1
2499685a1

traditional syntax, slow

@btime combine(v -> sum(v.y), $gdf)
  17.078 ms (332 allocations: 19.09 MiB)
4×2 DataFrame
Rowxx1
CharInt64
1b2500895
2c2499377
3d2500043
4a2499685

use column selector

@btime combine($gdf, :y => sum)
  6.889 ms (198 allocations: 10.14 KiB)
4×2 DataFrame
Rowxy_sum
CharInt64
1b2500895
2c2499377
3d2500043
4a2499685
transform!(df, :x => categorical => :x);
gdf = groupby(df, :x)

GroupedDataFrame with 4 groups based on key: x

First Group (2499685 rows): x = CategoricalArrays.CategoricalValue{Char, UInt32} 'a'
2499660 rows omitted
Rowxy
Cat…Int64
1a1
2a1
3a1
4a1
5a1
6a1
7a1
8a1
9a1
10a1
11a1
12a1
13a1
2499674a1
2499675a1
2499676a1
2499677a1
2499678a1
2499679a1
2499680a1
2499681a1
2499682a1
2499683a1
2499684a1
2499685a1

Last Group (2500043 rows): x = CategoricalArrays.CategoricalValue{Char, UInt32} 'd'
2500018 rows omitted
Rowxy
Cat…Int64
1d1
2d1
3d1
4d1
5d1
6d1
7d1
8d1
9d1
10d1
11d1
12d1
13d1
2500032d1
2500033d1
2500034d1
2500035d1
2500036d1
2500037d1
2500038d1
2500039d1
2500040d1
2500041d1
2500042d1
2500043d1
@btime combine($gdf, :y => sum)
  6.921 ms (206 allocations: 10.62 KiB)
4×2 DataFrame
Rowxy_sum
Cat…Int64
1a2499685
2b2500895
3c2499377
4d2500043
transform!(df, :x => PooledArray{Char} => :x)
10000000×2 DataFrame
9999975 rows omitted
Rowxy
CharInt64
1b1
2b1
3c1
4d1
5c1
6a1
7c1
8c1
9a1
10b1
11d1
12a1
13d1
9999989a1
9999990b1
9999991d1
9999992c1
9999993c1
9999994c1
9999995a1
9999996a1
9999997c1
9999998c1
9999999a1
10000000d1
gdf = groupby(df, :x)

GroupedDataFrame with 4 groups based on key: x

First Group (2500895 rows): x = 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
2500870 rows omitted
Rowxy
CharInt64
1b1
2b1
3b1
4b1
5b1
6b1
7b1
8b1
9b1
10b1
11b1
12b1
13b1
2500884b1
2500885b1
2500886b1
2500887b1
2500888b1
2500889b1
2500890b1
2500891b1
2500892b1
2500893b1
2500894b1
2500895b1

Last Group (2499685 rows): x = 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
2499660 rows omitted
Rowxy
CharInt64
1a1
2a1
3a1
4a1
5a1
6a1
7a1
8a1
9a1
10a1
11a1
12a1
13a1
2499674a1
2499675a1
2499676a1
2499677a1
2499678a1
2499679a1
2499680a1
2499681a1
2499682a1
2499683a1
2499684a1
2499685a1
@btime combine($gdf, :y => sum)
  6.962 ms (200 allocations: 10.20 KiB)
4×2 DataFrame
Rowxy_sum
CharInt64
1b2500895
2c2499377
3d2500043
4a2499685

Use views instead of materializing a new DataFrame#

x = DataFrame(rand(100, 1000), :auto)
@btime $x[1:1, :]
  180.200 μs (3993 allocations: 159.07 KiB)
1×1000 DataFrame
900 columns omitted
Rowx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20x21x22x23x24x25x26x27x28x29x30x31x32x33x34x35x36x37x38x39x40x41x42x43x44x45x46x47x48x49x50x51x52x53x54x55x56x57x58x59x60x61x62x63x64x65x66x67x68x69x70x71x72x73x74x75x76x77x78x79x80x81x82x83x84x85x86x87x88x89x90x91x92x93x94x95x96x97x98x99x100
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4228010.09021450.4640350.2138220.6625240.8023740.9988220.6867110.4705170.7611530.1350810.5506650.3154530.4799520.5545030.9181380.814460.7110450.07906130.1409460.5992580.2299210.9464160.6381610.004742810.04414690.1517720.7911980.01434870.05235440.4641150.7149330.9737730.2025560.9373590.7198380.5862830.9864120.5443190.8507810.1291030.3772520.8853680.8758450.2266870.7312180.7463060.7070.3254310.2016310.08980950.3126460.445820.06761930.9531450.7297160.3194920.3142940.8937440.3562020.6237070.1895120.2760870.8067480.7649090.9503390.5093950.6072260.7162160.06376630.6874260.7664290.9602370.5902650.193680.3927270.3952470.4972340.457520.9857890.277880.7329450.4150650.3331080.5109740.1469020.7392980.3841250.1985630.2987910.7378010.1248190.3475730.1987610.3405090.4532840.2273760.8807610.9489520.656717
@btime $x[1, :]
  23.227 ns (0 allocations: 0 bytes)
DataFrameRow (1000 columns)
900 columns omitted
Rowx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20x21x22x23x24x25x26x27x28x29x30x31x32x33x34x35x36x37x38x39x40x41x42x43x44x45x46x47x48x49x50x51x52x53x54x55x56x57x58x59x60x61x62x63x64x65x66x67x68x69x70x71x72x73x74x75x76x77x78x79x80x81x82x83x84x85x86x87x88x89x90x91x92x93x94x95x96x97x98x99x100
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4228010.09021450.4640350.2138220.6625240.8023740.9988220.6867110.4705170.7611530.1350810.5506650.3154530.4799520.5545030.9181380.814460.7110450.07906130.1409460.5992580.2299210.9464160.6381610.004742810.04414690.1517720.7911980.01434870.05235440.4641150.7149330.9737730.2025560.9373590.7198380.5862830.9864120.5443190.8507810.1291030.3772520.8853680.8758450.2266870.7312180.7463060.7070.3254310.2016310.08980950.3126460.445820.06761930.9531450.7297160.3194920.3142940.8937440.3562020.6237070.1895120.2760870.8067480.7649090.9503390.5093950.6072260.7162160.06376630.6874260.7664290.9602370.5902650.193680.3927270.3952470.4972340.457520.9857890.277880.7329450.4150650.3331080.5109740.1469020.7392980.3841250.1985630.2987910.7378010.1248190.3475730.1987610.3405090.4532840.2273760.8807610.9489520.656717
@btime view($x, 1:1, :)
  24.081 ns (0 allocations: 0 bytes)
1×1000 SubDataFrame
900 columns omitted
Rowx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20x21x22x23x24x25x26x27x28x29x30x31x32x33x34x35x36x37x38x39x40x41x42x43x44x45x46x47x48x49x50x51x52x53x54x55x56x57x58x59x60x61x62x63x64x65x66x67x68x69x70x71x72x73x74x75x76x77x78x79x80x81x82x83x84x85x86x87x88x89x90x91x92x93x94x95x96x97x98x99x100
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4228010.09021450.4640350.2138220.6625240.8023740.9988220.6867110.4705170.7611530.1350810.5506650.3154530.4799520.5545030.9181380.814460.7110450.07906130.1409460.5992580.2299210.9464160.6381610.004742810.04414690.1517720.7911980.01434870.05235440.4641150.7149330.9737730.2025560.9373590.7198380.5862830.9864120.5443190.8507810.1291030.3772520.8853680.8758450.2266870.7312180.7463060.7070.3254310.2016310.08980950.3126460.445820.06761930.9531450.7297160.3194920.3142940.8937440.3562020.6237070.1895120.2760870.8067480.7649090.9503390.5093950.6072260.7162160.06376630.6874260.7664290.9602370.5902650.193680.3927270.3952470.4972340.457520.9857890.277880.7329450.4150650.3331080.5109740.1469020.7392980.3841250.1985630.2987910.7378010.1248190.3475730.1987610.3405090.4532840.2273760.8807610.9489520.656717
@btime $x[1:1, 1:20]
  3.862 μs (70 allocations: 3.09 KiB)
1×20 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4228010.09021450.4640350.2138220.6625240.8023740.9988220.6867110.4705170.7611530.1350810.5506650.3154530.4799520.5545030.9181380.814460.7110450.07906130.140946
@btime $x[1, 1:20]
  23.871 ns (0 allocations: 0 bytes)
DataFrameRow (20 columns)
Rowx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4228010.09021450.4640350.2138220.6625240.8023740.9988220.6867110.4705170.7611530.1350810.5506650.3154530.4799520.5545030.9181380.814460.7110450.07906130.140946
@btime view($x, 1:1, 1:20)
  24.685 ns (0 allocations: 0 bytes)
1×20 SubDataFrame
Rowx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.4228010.09021450.4640350.2138220.6625240.8023740.9988220.6867110.4705170.7611530.1350810.5506650.3154530.4799520.5545030.9181380.814460.7110450.07906130.140946

This notebook was generated using Literate.jl.