Working with CategoricalArrays#

CategoricalArrays.jl is independent of DataFrames.jl, but the two are often used together.

using DataFrames
using CategoricalArrays

Constructor#

unordered arrays

x = categorical(["A", "B", "B", "C"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"

ordered; by default the level order is the sorting order

y = categorical(["A", "B", "B", "C"], ordered=true)
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"
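If the sorting order is not the order you want, the `levels` keyword of `categorical` can set it at construction time. A minimal sketch; the variable name `sizes` and its values are made up for illustration:

```julia
using CategoricalArrays

# `levels` fixes the level order at construction; `ordered=true` enables `<`
sizes = categorical(["medium", "small", "large"],
                    levels=["small", "medium", "large"], ordered=true)

levels(sizes)        # the order given above, not alphabetical order
sizes[2] < sizes[1]  # compares "small" with "medium" under the custom order
```

With the default sorting order, "large" would have come first instead.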

unordered with missing values

z = categorical(["A", "B", "B", "C", missing])
5-element CategoricalArrays.CategoricalArray{Union{Missing, String},1,UInt32}:
 "A"
 "B"
 "B"
 "C"
 missing

ordered array cut into groups of equal counts; it is also possible to rename the labels and give custom breaks

c = cut(1:10, 5)
10-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "Q1: [1.0, 2.8)"
 "Q1: [1.0, 2.8)"
 "Q2: [2.8, 4.6)"
 "Q2: [2.8, 4.6)"
 "Q3: [4.6, 6.4)"
 "Q3: [4.6, 6.4)"
 "Q4: [6.4, 8.2)"
 "Q4: [6.4, 8.2)"
 "Q5: [8.2, 10.0]"
 "Q5: [8.2, 10.0]"
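The custom breaks and labels mentioned above can be sketched as follows. The data, the break points, and the label names are assumptions chosen for illustration; the values are kept strictly inside the outer breaks so that none falls outside the intervals:

```julia
using CategoricalArrays

ages = [1, 5, 9]

# breaks 0, 5, 10 define two left-closed intervals: [0, 5) and [5, 10)
groups = cut(ages, [0, 5, 10], labels=["low", "high"])

levels(groups)     # labels in break order
isordered(groups)  # cut returns an ordered array
```

Note that 5 lands in the second group because each interval includes its lower bound.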

(we will cover grouping later, but let us use it here to analyze the results; we use Chain.jl for chaining)

using Chain
@chain DataFrame(x=cut(randn(100000), 10)) begin
    groupby(:x)
    combine(nrow) ## just to make sure cut works right
end
10×2 DataFrame
 Row │ x                                                 nrow
     │ Cat…                                              Int64
─────┼────────────────────────────────────────────────────────
   1 │ Q1: [-4.831279179158802, -1.2755672363487889)     10000
   2 │ Q2: [-1.2755672363487889, -0.831281646927726)     10000
   3 │ Q3: [-0.831281646927726, -0.5182226982485669)     10000
   4 │ Q4: [-0.5182226982485669, -0.24846809808482806)   10000
   5 │ Q5: [-0.24846809808482806, 0.002218451424087068)  10000
   6 │ Q6: [0.002218451424087068, 0.25324368684323256)   10000
   7 │ Q7: [0.25324368684323256, 0.526156296608527)      10000
   8 │ Q8: [0.526156296608527, 0.8395138503897597)       10000
   9 │ Q9: [0.8395138503897597, 1.2858811795362373)      10000
  10 │ Q10: [1.2858811795362373, 4.380142247912848]      10000

a categorical array can also contain integers instead of strings

v = categorical([1, 2, 2, 3, 3])
5-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 2
 3
 3

sometimes you need to convert back to a standard vector

Vector{Union{String,Missing}}(z)
5-element Vector{Union{Missing, String}}:
 "A"
 "B"
 "B"
 "C"
 missing

Managing levels#

arr = [x, y, z, c, v]
5-element Vector{CategoricalArrays.CategoricalVector{T, UInt32} where T}:
 CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"]
 CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"]
 Union{Missing, CategoricalArrays.CategoricalValue{String, UInt32}}["A", "B", "B", "C", missing]
 CategoricalArrays.CategoricalValue{String, UInt32}["Q1: [1.0, 2.8)", "Q1: [1.0, 2.8)", "Q2: [2.8, 4.6)", "Q2: [2.8, 4.6)", "Q3: [4.6, 6.4)", "Q3: [4.6, 6.4)", "Q4: [6.4, 8.2)", "Q4: [6.4, 8.2)", "Q5: [8.2, 10.0]", "Q5: [8.2, 10.0]"]
 CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 2, 3, 3]

check if a categorical array is ordered

isordered.(arr)
5-element BitVector:
 0
 1
 0
 1
 0

make x ordered

ordered!(x, true), isordered(x)
(CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"], true)

and unordered again

ordered!(x, false), isordered(x)
(CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"], false)

list levels

levels.(arr)
5-element Vector{Vector}:
 ["A", "B", "C"]
 ["A", "B", "C"]
 ["A", "B", "C"]
 ["Q1: [1.0, 2.8)", "Q2: [2.8, 4.6)", "Q3: [4.6, 6.4)", "Q4: [6.4, 8.2)", "Q5: [8.2, 10.0]"]
 [1, 2, 3]

unique, in contrast to levels, includes missing

unique.(arr)
5-element Vector{Vector}:
 ["A", "B", "C"]
 ["A", "B", "C"]
 Union{Missing, String}["A", "B", "C", missing]
 ["Q1: [1.0, 2.8)", "Q2: [2.8, 4.6)", "Q3: [4.6, 6.4)", "Q4: [6.4, 8.2)", "Q5: [8.2, 10.0]"]
 [1, 2, 3]
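The difference between `levels` and `unique` can be seen on a small array. This is a sketch on made-up data; the name `w` is hypothetical:

```julia
using CategoricalArrays

w = categorical(["b", "a", missing, "b"])

levels(w)  # sorted levels; missing is never a level
unique(w)  # the distinct values that occur, including missing
```

Here `levels(w)` has two entries while `unique(w)` has three, the extra one being `missing`.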

elements can be compared because y is ordered

y[1] < y[2]
true

not comparable, v is unordered although it contains integers

try
    v[1] < v[2]
catch e
    show(e)
end
ArgumentError("Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this")

comparison against a value of the underlying type is not allowed

try
    y[2] < "A"
catch e
    show(e)
end
ArgumentError("cannot compare a `CategoricalValue` to value `v` of type `String`: wrap `v` using `CategoricalValue(v, catvalue)` or `CategoricalValue(v, catarray)` first")

you need to explicitly convert a value to a level

y[2] < CategoricalValue("A", y)
false

but it is treated as a level, and thus only valid levels are allowed

try
    y[2] < CategoricalValue("Z", y)
catch e
    show(e)
end
ArgumentError("level Z not found in source pool")

you can reorder levels, mostly useful for ordered CategoricalArrays

levels!(y, ["C", "B", "A"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"

observe that the order is changed

y[1] < y[2]
false
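Sorting follows the level order as well, not the order of the underlying strings. A small sketch on a fresh array (named `g` here so that `y` is left untouched):

```julia
using CategoricalArrays

g = categorical(["A", "B", "B", "C"], ordered=true)
levels!(g, ["C", "B", "A"])  # reverse the default order

sorted = sort(g)             # sorted by level position, so "C" comes first
unwrap.(sorted)
```

The unwrapped result starts with "C" and ends with "A", mirroring the levels passed to `levels!`.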

you have to specify all levels that are present

try
    levels!(z, ["A", "B"])
catch e
    show(e)
end
ArgumentError("cannot remove level \"C\" as it is used at position 4 and allowmissing=false.")

unless the underlying array allows missing values and you force the removal of levels

levels!(z, ["A", "B"], allowmissing=true)
5-element CategoricalArrays.CategoricalArray{Union{Missing, String},1,UInt32}:
 "A"
 "B"
 "B"
 missing
 missing

now z has only “B” entries

z[1] = "B"
z
5-element CategoricalArrays.CategoricalArray{Union{Missing, String},1,UInt32}:
 "B"
 "B"
 "B"
 missing
 missing

but it remembers the levels it had (the reason is mostly performance)

levels(z)
2-element Vector{String}:
 "A"
 "B"

we can clean it up with droplevels!(z)

droplevels!(z)
levels(z)
1-element Vector{String}:
 "B"

Data manipulation#

x, levels(x)
(CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"], ["A", "B", "C"])

a new level is added at the end (works only for unordered arrays)

x[2] = "0"
x, levels(x)
(CategoricalArrays.CategoricalValue{String, UInt32}["A", "0", "B", "C"], ["A", "B", "C", "0"])

v, levels(v)
(CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 2, 3, 3], [1, 2, 3])

even though the underlying data is Int, we cannot operate on it

try
    v[1] + v[2]
catch e
    show(e)
end
MethodError(+, (CategoricalArrays.CategoricalValue{Int64, UInt32} 1, CategoricalArrays.CategoricalValue{Int64, UInt32} 2), 0x000000000000686b)

you either have to retrieve the data by conversion (which may be expensive)

Vector{Int}(v)
5-element Vector{Int64}:
 1
 2
 2
 3
 3

or get a single value by unwrap

unwrap(v[1]) + unwrap(v[2])
3

this will work for arrays without missing values

unwrap.(v)
5-element Vector{Int64}:
 1
 2
 2
 3
 3

also works on missing values

unwrap.(z)
5-element Vector{Union{Missing, String}}:
 "B"
 "B"
 "B"
 missing
 missing

or do the explicit conversion

Vector{Union{String,Missing}}(z)
5-element Vector{Union{Missing, String}}:
 "B"
 "B"
 "B"
 missing
 missing
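Two ways of getting plain numbers out of a categorical array are sketched below: `unwrap` returns the stored value, while `levelcode` returns the position of the value's level. The array `v2` is made up for illustration:

```julia
using CategoricalArrays

v2 = categorical([10, 20, 20, 30])

total = sum(unwrap, v2)  # adds up the underlying integers
codes = levelcode.(v2)   # index of each element's level in levels(v2)
```

The values are chosen so that the stored integers and the level codes differ, which makes it easy to tell the two apart.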

recode some values in an array; there is also an in-place recode! equivalent

recode([1, 2, 3, 4, 5, missing], 1 => 10)
6-element Vector{Union{Missing, Int64}}:
 10
  2
  3
  4
  5
   missing
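The in-place variant can be sketched on a plain vector; the data is made up for illustration. Because the array is modified in place, the new values must fit its element type (here `Int`):

```julia
using CategoricalArrays

a = [1, 2, 3, 4, 5]

# a vector on the left-hand side maps several values to one target
recode!(a, 1 => 10, [2, 3] => 20)
a
```

Values without a matching pair (4 and 5 here) are left unchanged.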

here we provide a default value for entries that are not mapped

recode([1, 2, 3, 4, 5, missing], "a", 1 => 10, 2 => 20)
6-element Vector{Union{Missing, Int64, String}}:
 10
 20
   "a"
   "a"
   "a"
   missing

to recode missing you have to do it explicitly

recode([1, 2, 3, 4, 5, missing], 1 => 10, missing => "missing")
6-element Vector{Union{Int64, String}}:
 10
  2
  3
  4
  5
   "missing"

t = categorical([1:5; missing])
t, levels(t)
(Union{Missing, CategoricalArrays.CategoricalValue{Int64, UInt32}}[1, 2, 3, 4, 5, missing], [1, 2, 3, 4, 5])

note that the levels that are no longer used are dropped after recode!

recode!(t, [1, 3] => 2)
t, levels(t)
(Union{Missing, CategoricalArrays.CategoricalValue{Int64, UInt32}}[2, 2, 2, 4, 5, missing], [2, 4, 5])

and if you introduce new levels, they are added at the end in the order of appearance

t = categorical([1, 2, 3], ordered=true)
levels(recode(t, 2 => 0, 1 => -1))
3-element Vector{Int64}:
  3
  0
 -1

when a default is used, it becomes the last level

t = categorical([1, 2, 3, 4, 5], ordered=true)
levels(recode(t, 300, [1, 2] => 100, 3 => 200))
3-element Vector{Int64}:
 100
 200
 300

Comparisons#

x = categorical([1, 2, 3])
xs = [x, categorical(x), categorical(x, ordered=true), categorical(x, ordered=true)]
levels!(xs[2], [3, 2, 1])
levels!(xs[4], [2, 3, 1])
[a == b for a in xs, b in xs] ## all are equal - comparison only by contents
4×4 Matrix{Bool}:
 1  1  1  1
 1  1  1  1
 1  1  1  1
 1  1  1  1

this is actually the full signature of a CategoricalArray

signature(x::CategoricalArray) = (x, levels(x), isordered(x))
signature (generic function with 1 method)

all are different; notice that xs[1] and xs[2] are both unordered but have a different order of levels

[signature(a) == signature(b) for a in xs, b in xs]
4×4 Matrix{Bool}:
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1

you cannot compare elements of an unordered CategoricalArray using <

try
    x[1] < x[2]
catch e
    show(e)
end
ArgumentError("Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this")

but you can do it for an ordered one

t[1] < t[2]
true

isless() works within the same CategoricalArray even if it is not ordered

isless(x[1], x[2])
true

across categorical arrays the comparison succeeds here, because y is a deep copy of x and therefore has the same levels

y = deepcopy(x)
try
    isless(x[1], y[2])
catch e
    show(e)
end
true

you can use unwrap to compare the contents of CategoricalArrays

isless(unwrap(x[1]), unwrap(y[2]))
true

equality tests work across CategoricalArrays

x[1] == y[2]
false

Categorical columns in a DataFrame#

df = DataFrame(x=1:3, y='a':'c', z=["a", "b", "c"])
3×3 DataFrame
 Row │ x      y     z
     │ Int64  Char  String
─────┼─────────────────────
   1 │     1  a     a
   2 │     2  b     b
   3 │     3  c     c

Convert all String columns to categorical in-place

transform!(df, names(df, String) => categorical, renamecols=false)
3×3 DataFrame
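 Row │ x      y     z
     │ Int64  Char  Cat…
─────┼───────────────────
   1 │     1  a     a
   2 │     2  b     b
   3 │     3  c     c

describe(df)
3×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union…  Any  Union…  Any  Int64     DataType
─────┼─────────────────────────────────────────────────────────────────────────────
   1 │ x         2.0     1    2.0     3           0  Int64
   2 │ y                 a            c           0  Char
   3 │ z                 a            c           0  CategoricalValue{String, UInt32}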
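One practical reason to keep a categorical column in a data frame is that sorting then follows the level order. A minimal sketch, assuming a made-up data frame `grades` with hypothetical columns `id` and `grade`:

```julia
using DataFrames
using CategoricalArrays

grades = DataFrame(id=1:4, grade=["low", "high", "low", "mid"])

# replace the String column with an ordered categorical one, using a custom level order
grades.grade = categorical(grades.grade, levels=["low", "mid", "high"], ordered=true)

sorted = sort(grades, :grade)  # rows ordered low, mid, high rather than alphabetically
```

Sorting the original String column would have placed "high" before "low".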

This notebook was generated using Literate.jl.