Performance tips#

using DataFrames
using BenchmarkTools
using CategoricalArrays
using PooledArrays
using Random

Access by column number is faster than by name#

x = DataFrame(rand(5, 1000), :auto)
@btime $x[!, 500]; ## Faster
  3.095 ns (0 allocations: 0 bytes)
@btime $x.x500;  ## Slower
  10.862 ns (0 allocations: 0 bytes)

When working with data DataFrame use barrier functions or type annotation#

function f_bad() ## this function will be slow
    Random.seed!(1)
    x = DataFrame(rand(1000000, 2), :auto)
    y, z = x[!, 1], x[!, 2]
    p = 0.0
    for i in 1:nrow(x)
        p += y[i] * z[i]
    end
    p
end

@btime f_bad();
# if you run @code_warntype f_bad() then you notice
# that Julia does not know column types of `DataFrame`
  97.197 ms (5999012 allocations: 122.06 MiB)

solution 1 is to use barrier function (it should be possible to use it in almost any code) for the calculation. You will notice much less memopry allocations and faster performance.

function f_inner(y, z)
    p = 0.0
    for i in eachindex(y, z)
        p += y[i] * z[i]
    end
    p
end

function f_barrier()
    Random.seed!(1)
    x = DataFrame(rand(1000000, 2), :auto)
    f_inner(x[!, 1], x[!, 2])
end

@btime f_barrier();
  4.133 ms (34 allocations: 30.52 MiB)

or use inbuilt function if possible

using LinearAlgebra

function f_inbuilt()
    Random.seed!(1)
    x = DataFrame(rand(1000000, 2), :auto)
    dot(x[!, 1], x[!, 2])
end

@btime f_inbuilt();
  3.480 ms (34 allocations: 30.52 MiB)

solution 2 is to provide the types of extracted columns. However, there are cases in which you will not know these types.

function f_typed()
    Random.seed!(1)
    x = DataFrame(rand(1000000, 2), :auto)
    y::Vector{Float64}, z::Vector{Float64} = x[!, 1], x[!, 2]
    p = 0.0
    for i in 1:nrow(x)
        p += y[i] * z[i]
    end
    p
end

@btime f_typed();
  4.135 ms (34 allocations: 30.52 MiB)

In general for tall and narrow tables it is often useful to use Tables.rowtable, Tables.columntable or Tables.namedtupleiterator for intermediate processing of data in a type-stable way.

Consider using delayed DataFrame creation technique#

also notice the difference in performance between copying vs non-copying data frame creation

function f1()
    x = DataFrame([Vector{Float64}(undef, 10^4) for i in 1:100], :auto, copycols=false) ## we work with a DataFrame directly
    for c in 1:ncol(x)
        d = x[!, c]
        for r in 1:nrow(x)
            d[r] = rand()
        end
    end
    x
end

function f1a()
    x = DataFrame([Vector{Float64}(undef, 10^4) for i in 1:100], :auto) ## we work with a DataFrame directly
    for c in 1:ncol(x)
        d = x[!, c]
        for r in 1:nrow(x)
            d[r] = rand()
        end
    end
    x
end

function f2()
    x = Vector{Any}(undef, 100)
    for c in 1:length(x)
        d = Vector{Float64}(undef, 10^4)
        for r in eachindex(d)
            d[r] = rand()
        end
        x[c] = d
    end
    DataFrame(x, :auto, copycols=false) ## we delay creation of DataFrame after we have our job done
end

function f2a()
    x = Vector{Any}(undef, 100)
    for c in eachindex(x)
        d = Vector{Float64}(undef, 10^4)
        for r in eachindex(d)
            d[r] = rand()
        end
        x[c] = d
    end
    DataFrame(x, :auto) ## we delay creation of DataFrame after we have our job done
end

@btime f1();
@btime f1a();
@btime f2();
@btime f2a();
  26.771 ms (1949524 allocations: 37.40 MiB)
  27.464 ms (1949724 allocations: 45.03 MiB)
  1.185 ms (524 allocations: 7.66 MiB)
  1.585 ms (724 allocations: 15.29 MiB)

You can add rows to a DataFrame in place and it is fast#

x = DataFrame(rand(10^6, 5), :auto)
y = DataFrame(transpose(1.0:5.0), :auto)
z = [1.0:5.0;]

@btime vcat($x, $y); ## creates a new DataFrame - slow
@btime append!($x, $y); ## in place - fast

x = DataFrame(rand(10^6, 5), :auto) ## reset to the same starting point
@btime push!($x, $z); ## add a single row in place - fast
  1.701 ms (211 allocations: 38.16 MiB)
  1.425 μs (30 allocations: 1.80 KiB)
  480.281 ns (16 allocations: 256 bytes)

Allowing missing as well as categorical slows down computations#

using StatsBase

function test(data) ## uses countmap function to test performance
    println(eltype(data))
    x = rand(data, 10^6)
    y = categorical(x)
    println(" raw:")
    @btime countmap($x)
    println(" categorical:")
    @btime countmap($y)
    nothing
end

test(1:10)
test([randstring() for i in 1:10])
test(allowmissing(1:10))
test(allowmissing([randstring() for i in 1:10]))
Int64
 raw:
  1.618 ms (7 allocations: 7.63 MiB)
 categorical:
  10.035 ms (4 allocations: 688 bytes)
String
 raw:
  26.519 ms (4 allocations: 528 bytes)
 categorical:
  31.321 ms (4 allocations: 688 bytes)
Union{Missing, Int64}
 raw:
  6.310 ms (4 allocations: 560 bytes)
 categorical:
  15.702 ms (1000004 allocations: 30.52 MiB)
Union{Missing, String}
 raw:
  19.843 ms (4 allocations: 528 bytes)
 categorical:
  31.528 ms (1000004 allocations: 30.52 MiB)

When aggregating use column selector and prefer integer, categorical, or pooled array grouping variable#

df = DataFrame(x=rand('a':'d', 10^7), y=1);
gdf = groupby(df, :x)

GroupedDataFrame with 4 groups based on key: x

First Group (2497251 rows): x = 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
2497226 rows omitted
Rowxy
CharInt64
1b1
2b1
3b1
4b1
5b1
6b1
7b1
8b1
9b1
10b1
11b1
12b1
13b1
2497240b1
2497241b1
2497242b1
2497243b1
2497244b1
2497245b1
2497246b1
2497247b1
2497248b1
2497249b1
2497250b1
2497251b1

Last Group (2500643 rows): x = 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
2500618 rows omitted
Rowxy
CharInt64
1c1
2c1
3c1
4c1
5c1
6c1
7c1
8c1
9c1
10c1
11c1
12c1
13c1
2500632c1
2500633c1
2500634c1
2500635c1
2500636c1
2500637c1
2500638c1
2500639c1
2500640c1
2500641c1
2500642c1
2500643c1

traditional syntax, slow

@btime combine(v -> sum(v.y), $gdf)
  18.568 ms (305 allocations: 19.07 MiB)
4×2 DataFrame
Rowxx1
CharInt64
1b2497251
2a2501619
3d2500487
4c2500643

use column selector

@btime combine($gdf, :y => sum)
  6.754 ms (186 allocations: 11.45 KiB)
4×2 DataFrame
Rowxy_sum
CharInt64
1b2497251
2a2501619
3d2500487
4c2500643
transform!(df, :x => categorical => :x);
gdf = groupby(df, :x)

GroupedDataFrame with 4 groups based on key: x

First Group (2501619 rows): x = CategoricalArrays.CategoricalValue{Char, UInt32} 'a'
2501594 rows omitted
Rowxy
Cat…Int64
1a1
2a1
3a1
4a1
5a1
6a1
7a1
8a1
9a1
10a1
11a1
12a1
13a1
2501608a1
2501609a1
2501610a1
2501611a1
2501612a1
2501613a1
2501614a1
2501615a1
2501616a1
2501617a1
2501618a1
2501619a1

Last Group (2500487 rows): x = CategoricalArrays.CategoricalValue{Char, UInt32} 'd'
2500462 rows omitted
Rowxy
Cat…Int64
1d1
2d1
3d1
4d1
5d1
6d1
7d1
8d1
9d1
10d1
11d1
12d1
13d1
2500476d1
2500477d1
2500478d1
2500479d1
2500480d1
2500481d1
2500482d1
2500483d1
2500484d1
2500485d1
2500486d1
2500487d1
@btime combine($gdf, :y => sum)
  6.805 ms (193 allocations: 12.03 KiB)
4×2 DataFrame
Rowxy_sum
Cat…Int64
1a2501619
2b2497251
3c2500643
4d2500487
transform!(df, :x => PooledArray{Char} => :x)
10000000×2 DataFrame
9999975 rows omitted
Rowxy
CharInt64
1b1
2a1
3a1
4b1
5d1
6d1
7d1
8b1
9c1
10b1
11b1
12a1
13d1
9999989d1
9999990a1
9999991b1
9999992b1
9999993b1
9999994c1
9999995b1
9999996c1
9999997a1
9999998d1
9999999d1
10000000b1
gdf = groupby(df, :x)

GroupedDataFrame with 4 groups based on key: x

First Group (2497251 rows): x = 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
2497226 rows omitted
Rowxy
CharInt64
1b1
2b1
3b1
4b1
5b1
6b1
7b1
8b1
9b1
10b1
11b1
12b1
13b1
2497240b1
2497241b1
2497242b1
2497243b1
2497244b1
2497245b1
2497246b1
2497247b1
2497248b1
2497249b1
2497250b1
2497251b1

Last Group (2500643 rows): x = 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
2500618 rows omitted
Rowxy
CharInt64
1c1
2c1
3c1
4c1
5c1
6c1
7c1
8c1
9c1
10c1
11c1
12c1
13c1
2500632c1
2500633c1
2500634c1
2500635c1
2500636c1
2500637c1
2500638c1
2500639c1
2500640c1
2500641c1
2500642c1
2500643c1
@btime combine($gdf, :y => sum)
  6.745 ms (188 allocations: 11.52 KiB)
4×2 DataFrame
Rowxy_sum
CharInt64
1b2497251
2a2501619
3d2500487
4c2500643

Use views instead of materializing a new DataFrame#

x = DataFrame(rand(100, 1000), :auto)
@btime $x[1:1, :]
  196.315 μs (2986 allocations: 159.50 KiB)
1×1000 DataFrame
900 columns omitted
Rowx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20x21x22x23x24x25x26x27x28x29x30x31x32x33x34x35x36x37x38x39x40x41x42x43x44x45x46x47x48x49x50x51x52x53x54x55x56x57x58x59x60x61x62x63x64x65x66x67x68x69x70x71x72x73x74x75x76x77x78x79x80x81x82x83x84x85x86x87x88x89x90x91x92x93x94x95x96x97x98x99x100
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5219010.6928870.3664140.1027940.198860.3745430.7297840.3204780.2888890.9182550.3724050.5520240.3175770.5757980.4525720.7209020.568130.3226930.3001730.1167030.8937560.4911090.7755980.4931010.3619710.2617580.9995360.1898690.5460570.8672130.7370090.2939860.7336130.305690.6573650.5835290.6205940.620550.2580630.9421580.731840.8321710.9664580.8390330.7108230.5946070.2634430.7697770.5933570.805930.6687020.1196050.3585390.8782660.09876860.5780320.8277680.003863350.7831990.09035060.5606690.9366140.5506570.2774010.1301330.9398270.5240470.8239730.6412870.4780510.04538310.7536640.3377530.790560.2575760.3454450.7144760.3517750.6684960.7631430.567570.6633630.3732040.1355390.8116450.09797250.1375630.3508680.7840210.5809810.4255740.6529080.208120.8742750.6512650.2657680.02139730.7718020.6632130.396567
@btime $x[1, :]
  19.433 ns (0 allocations: 0 bytes)
DataFrameRow (1000 columns)
900 columns omitted
Rowx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20x21x22x23x24x25x26x27x28x29x30x31x32x33x34x35x36x37x38x39x40x41x42x43x44x45x46x47x48x49x50x51x52x53x54x55x56x57x58x59x60x61x62x63x64x65x66x67x68x69x70x71x72x73x74x75x76x77x78x79x80x81x82x83x84x85x86x87x88x89x90x91x92x93x94x95x96x97x98x99x100
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5219010.6928870.3664140.1027940.198860.3745430.7297840.3204780.2888890.9182550.3724050.5520240.3175770.5757980.4525720.7209020.568130.3226930.3001730.1167030.8937560.4911090.7755980.4931010.3619710.2617580.9995360.1898690.5460570.8672130.7370090.2939860.7336130.305690.6573650.5835290.6205940.620550.2580630.9421580.731840.8321710.9664580.8390330.7108230.5946070.2634430.7697770.5933570.805930.6687020.1196050.3585390.8782660.09876860.5780320.8277680.003863350.7831990.09035060.5606690.9366140.5506570.2774010.1301330.9398270.5240470.8239730.6412870.4780510.04538310.7536640.3377530.790560.2575760.3454450.7144760.3517750.6684960.7631430.567570.6633630.3732040.1355390.8116450.09797250.1375630.3508680.7840210.5809810.4255740.6529080.208120.8742750.6512650.2657680.02139730.7718020.6632130.396567
@btime view($x, 1:1, :)
  19.433 ns (0 allocations: 0 bytes)
1×1000 SubDataFrame
900 columns omitted
Rowx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20x21x22x23x24x25x26x27x28x29x30x31x32x33x34x35x36x37x38x39x40x41x42x43x44x45x46x47x48x49x50x51x52x53x54x55x56x57x58x59x60x61x62x63x64x65x66x67x68x69x70x71x72x73x74x75x76x77x78x79x80x81x82x83x84x85x86x87x88x89x90x91x92x93x94x95x96x97x98x99x100
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5219010.6928870.3664140.1027940.198860.3745430.7297840.3204780.2888890.9182550.3724050.5520240.3175770.5757980.4525720.7209020.568130.3226930.3001730.1167030.8937560.4911090.7755980.4931010.3619710.2617580.9995360.1898690.5460570.8672130.7370090.2939860.7336130.305690.6573650.5835290.6205940.620550.2580630.9421580.731840.8321710.9664580.8390330.7108230.5946070.2634430.7697770.5933570.805930.6687020.1196050.3585390.8782660.09876860.5780320.8277680.003863350.7831990.09035060.5606690.9366140.5506570.2774010.1301330.9398270.5240470.8239730.6412870.4780510.04538310.7536640.3377530.790560.2575760.3454450.7144760.3517750.6684960.7631430.567570.6633630.3732040.1355390.8116450.09797250.1375630.3508680.7840210.5809810.4255740.6529080.208120.8742750.6512650.2657680.02139730.7718020.6632130.396567
@btime $x[1:1, 1:20]
  4.414 μs (51 allocations: 4.28 KiB)
1×20 DataFrame
Rowx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5219010.6928870.3664140.1027940.198860.3745430.7297840.3204780.2888890.9182550.3724050.5520240.3175770.5757980.4525720.7209020.568130.3226930.3001730.116703
@btime $x[1, 1:20]
  21.284 ns (0 allocations: 0 bytes)
DataFrameRow (20 columns)
Rowx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5219010.6928870.3664140.1027940.198860.3745430.7297840.3204780.2888890.9182550.3724050.5520240.3175770.5757980.4525720.7209020.568130.3226930.3001730.116703
@btime view($x, 1:1, 1:20)
  20.047 ns (0 allocations: 0 bytes)
1×20 SubDataFrame
Rowx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20
Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5219010.6928870.3664140.1027940.198860.3745430.7297840.3204780.2888890.9182550.3724050.5520240.3175770.5757980.4525720.7209020.568130.3226930.3001730.116703

This notebook was generated using Literate.jl.