DataFrames Extras#

Selected functionalities/packages

using DataFrames
using CategoricalArrays

Frequency Tables#

nalimilan/FreqTables.jl

using FreqTables
df = DataFrame(a=rand('a':'d', 1000), b=rand(["x", "y", "z"], 1000))
ft = freqtable(df, :a, :b) ## observe that dimensions are sorted if possible
4×3 Named Matrix{Int64}
a ╲ b │  x   y   z
──────┼───────────
a     │ 98  86  78
b     │ 87  92  81
c     │ 87  81  78
d     │ 73  75  84

you can index the result using numbers or names

ft[1,1], ft['b', "z"]
(98, 81)

getting proportions - 1 means we want to calculate them in rows (first dimension)

prop(ft, margins=1)
4×3 Named Matrix{Float64}
a ╲ b │        x         y         z
──────┼─────────────────────────────
a     │ 0.374046  0.328244   0.29771
b     │ 0.334615  0.353846  0.311538
c     │ 0.353659  0.329268  0.317073
d     │ 0.314655  0.323276  0.362069

and columns are normalized to 1.0 now

prop(ft, margins=2)
4×3 Named Matrix{Float64}
a ╲ b │        x         y         z
──────┼─────────────────────────────
a     │ 0.284058  0.257485  0.242991
b     │ 0.252174  0.275449  0.252336
c     │ 0.252174  0.242515  0.242991
d     │ 0.211594  0.224551  0.261682
x = categorical(rand(1:3, 10))
levels!(x, [3, 1, 2, 4]) ## reordering levels and adding an extra level
freqtable(x) ## order is preserved and not-used level is shown
4-element Named Vector{Int64}
Dim1  │ 
──────┼──
3     │ 0
1     │ 5
2     │ 5
4     │ 0

by default missings are listed

freqtable([1,1,2,3,missing])
4-element Named Vector{Int64}
Dim1    │ 
────────┼──
1       │ 2
2       │ 1
3       │ 1
missing │ 1

but we can skip them

freqtable([1,1,2,3,missing], skipmissing=true)
3-element Named Vector{Int64}
Dim1  │ 
──────┼──
1     │ 2
2     │ 1
3     │ 1
df = DataFrame(a=rand(3:4, 1000), b=rand(5:6, 1000))
ft = freqtable(df, :a, :b) # now dimensions are numbers
2×2 Named Matrix{Int64}
a ╲ b │   5    6
──────┼─────────
3     │ 254  251
4     │ 239  256

this is an error - standard array indexing takes precedence

try
    ft[3,5]
catch e
    show(e)
end
BoundsError(
[254 251; 239 256], (3, 5))

you have to use Name() wrapper

ft[Name(3), Name(5)]
254

DataFramesMeta.jl#

DataFramesMeta.jl provides a more terse syntax due to the benefits of metaprogramming.

using DataFramesMeta
df = DataFrame(x=1:8, y='a':'h', z=repeat([true,false], outer=4))
8×3 DataFrame
Rowxyz
Int64CharBool
11atrue
22bfalse
33ctrue
44dfalse
55etrue
66ffalse
77gtrue
88hfalse

expressions with columns of DataFrame

@with(df, :x + :z)
8-element Vector{Int64}:
 2
 2
 4
 4
 6
 6
 8
 8

you can define code blocks

@with df begin
    a = :x[:z]
    b = :x[.!:z]
    :y + [a; b]
end
8-element Vector{Char}:
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 'h': ASCII/Unicode U+0068 (category Ll: Letter, lowercase)
 'k': ASCII/Unicode U+006B (category Ll: Letter, lowercase)
 'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)
 'j': ASCII/Unicode U+006A (category Ll: Letter, lowercase)
 'm': ASCII/Unicode U+006D (category Ll: Letter, lowercase)
 'p': ASCII/Unicode U+0070 (category Ll: Letter, lowercase)

@with creates hard scope so variables do not leak out

df2 = DataFrame(a = [:a, :b, :c])
@with(df2, :a .== ^(:a)) ## sometimes we want to work on a raw Symbol, ^() escapes it
3-element BitVector:
 1
 0
 0
x_str = "x"
y_str = "y"
df2 = DataFrame(x=1:3, y=4:6, z=7:9)
# $expression inderpolates the expression in-place; in particular this way you can use column names passed as strings
@with(df2, $x_str + $y_str)
3-element Vector{Int64}:
 5
 7
 9

a very useful macro for filtering

@subset(df, :x .< 4, :z .== true)
2×3 DataFrame
Rowxyz
Int64CharBool
11atrue
23ctrue

create a new DataFrame based on the old one

@select(df, :x, :y = 2*:x, :z=:y)
8×3 DataFrame
Rowxyz
Int64Int64Char
112a
224b
336c
448d
5510e
6612f
7714g
8816h

create a new DataFrame adding columns based on the old one

@transform(df, :x = 2*:x, :y = :x)
8×3 DataFrame
Rowxyz
Int64Int64Bool
121true
242false
363true
484false
5105true
6126false
7147true
8168false

sorting into a new data frame, less powerful than sort, but lightweight

@orderby(df, :z, -:x)
8×3 DataFrame
Rowxyz
Int64CharBool
18hfalse
26ffalse
34dfalse
42bfalse
57gtrue
65etrue
73ctrue
81atrue

Chaining operations#

jkrumbiegel/Chain.jl

using Chain

chaining of operations on DataFrame

@chain df begin
    @subset(:x .< 5)
    @orderby(:z)
    @transform(:x² = :x .^ 2)
    @select(:z, :x, :x²)
end
4×3 DataFrame
Rowzx
BoolInt64Int64
1false24
2false416
3true11
4true39

Working on grouped DataFrame#

df = DataFrame(a = 1:12, b = repeat('a':'d', outer=3))
12×2 DataFrame
Rowab
Int64Char
11a
22b
33c
44d
55a
66b
77c
88d
99a
1010b
1111c
1212d
g = groupby(df, :b)

GroupedDataFrame with 4 groups based on key: b

First Group (3 rows): b = 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
Rowab
Int64Char
11a
25a
39a

Last Group (3 rows): b = 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
Rowab
Int64Char
14d
28d
312d
using Statistics

groupby+combine in one shot

@by(df, :b, :first = first(:a), :last = last(:a), :mean = mean(:a))
4×4 DataFrame
Rowbfirstlastmean
CharInt64Int64Float64
1a195.0
2b2106.0
3c3117.0
4d4128.0

the same as by but on grouped DataFrame

@combine(g, :first = first(:a), :last = last(:a), :mean = mean(:a))
4×4 DataFrame
Rowbfirstlastmean
CharInt64Int64Float64
1a195.0
2b2106.0
3c3117.0
4d4128.0

similar in DataFrames.jl - we use auto-generated column names

combine(g, :a .=> [first, last, mean])
4×4 DataFrame
Rowba_firsta_lasta_mean
CharInt64Int64Float64
1a195.0
2b2106.0
3c3117.0
4d4128.0

perform operations within a group and return ungrouped DataFrame

@transform(g, :center = mean(:a), :centered = :a .- mean(:a))
12×4 DataFrame
Rowabcentercentered
Int64CharFloat64Float64
11a5.0-4.0
22b6.0-4.0
33c7.0-4.0
44d8.0-4.0
55a5.00.0
66b6.00.0
77c7.00.0
88d8.00.0
99a5.04.0
1010b6.04.0
1111c7.04.0
1212d8.04.0

this is defined in DataFrames.jl

DataFrame(g)
12×2 DataFrame
Rowab
Int64Char
11a
25a
39a
42b
56b
610b
73c
87c
911c
104d
118d
1212d

actually this is not the same as DataFrame() as it perserves the original row order

@transform(g)
12×2 DataFrame
Rowab
Int64Char
11a
22b
33c
44d
55a
66b
77c
88d
99a
1010b
1111c
1212d

Rowwise operations on DataFrame#

df = DataFrame(a = 1:12, b = repeat(1:4, outer=3))
12×2 DataFrame
Rowab
Int64Int64
111
222
333
444
551
662
773
884
991
10102
11113
12124

such conditions are often needed but are complex to write

@transform(df, :x = ifelse.((:a .> 6) .& (:b .== 4), "yes", "no"))
12×3 DataFrame
Rowabx
Int64Int64String
111no
222no
333no
444no
551no
662no
773no
884yes
991no
10102no
11113no
12124yes

one option is to use a function that works on a single observation and broadcast it

myfun(a, b) = a > 6 && b == 4 ? "yes" : "no"
@transform(df, :x = myfun.(:a, :b))
12×3 DataFrame
Rowabx
Int64Int64String
111no
222no
333no
444no
551no
662no
773no
884yes
991no
10102no
11113no
12124yes

or you can use @eachrow macro that allows you to process DataFrame rowwise

@eachrow df begin
   @newcol :x::Vector{String}
    :x = :a > 6 && :b == 4 ? "yes" : "no"
end
12×3 DataFrame
Rowabx
Int64Int64String
111no
222no
333no
444no
551no
662no
773no
884yes
991no
10102no
11113no
12124yes

In DataFramses.jl you would write this as:

transform(df, [:a, :b] => ByRow((a,b) -> ifelse(a > 6 && b == 4, "yes", "no")) => :x)
12×3 DataFrame
Rowabx
Int64Int64String
111no
222no
333no
444no
551no
662no
773no
884yes
991no
10102no
11113no
12124yes

You can also use eachrow from DataFrames to perform the same transformation. However @eachrow will be faster than the operation below.

df2 = copy(df)
df2.x = Vector{String}(undef, nrow(df2))
for row in eachrow(df2)
   row[:x] = row[:a] > 6 && row[:b] == 4 ? "yes" : "no"
end
df2
12×3 DataFrame
Rowabx
Int64Int64String
111no
222no
333no
444no
551no
662no
773no
884yes
991no
10102no
11113no
12124yes

StatsPlots.jl: Visualizing data#

JuliaPlots/StatsPlots.jl

using StatsPlots ## you might need to setup Plots package and some plotting backend first
[ Info: Precompiling IJuliaExt [2f4121a4-3b3a-5ce6-9c5e-1f2673ce168a]

A showcase of StatsPlots.jl functions

using Random
Random.seed!(1)
df = DataFrame(x = sort(randn(1000)), y=randn(1000), z = [fill("b", 500); fill("a", 500)]);

a most basic plot

@df df plot(:x, :y, legend=:topleft, label="y(x)")

density plot

@df df density(:x, label="")

and a histogram

@df df histogram(:y, label="y")

the warning is likely to be removed in future releases of plotting packages

@df df boxplot(:z, :x, label="x")

Violin plot

@df df violin(:z, :y, label="y")