Working with CategoricalArrays
CategoricalArrays.jl is independent of DataFrames.jl, but the two are often used together
using DataFrames
using CategoricalArrays
Constructor
unordered arrays
x = categorical(["A", "B", "B", "C"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"A"
"B"
"B"
"C"
ordered; by default the level order is the sort order of the values
y = categorical(["A", "B", "B", "C"], ordered=true)
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"A"
"B"
"B"
"C"
unordered with missing values
z = categorical(["A", "B", "B", "C", missing])
5-element CategoricalArrays.CategoricalArray{Union{Missing, String},1,UInt32}:
"A"
"B"
"B"
"C"
missing
an ordered array produced by cutting values into groups of equal counts; labels can be renamed and custom breaks can be given
c = cut(1:10, 5)
10-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"Q1: [1.0, 2.8)"
"Q1: [1.0, 2.8)"
"Q2: [2.8, 4.6)"
"Q2: [2.8, 4.6)"
"Q3: [4.6, 6.4)"
"Q3: [4.6, 6.4)"
"Q4: [6.4, 8.2)"
"Q4: [6.4, 8.2)"
"Q5: [8.2, 10.0]"
"Q5: [8.2, 10.0]"
(we will cover grouping later, but let us use it here to analyze the results; we use Chain.jl for chaining)
using Chain
@chain DataFrame(x=cut(randn(100000), 10)) begin
groupby(:x)
combine(nrow) ## just to make sure cut works right
end
Row | x | nrow |
---|---|---|
 | Cat… | Int64 |
1 | Q1: [-4.432714496494445, -1.28061065302075) | 10000 |
2 | Q2: [-1.28061065302075, -0.8429386912986292) | 10000 |
3 | Q3: [-0.8429386912986292, -0.5190187887916001) | 10000 |
4 | Q4: [-0.5190187887916001, -0.24961896865750174) | 10000 |
5 | Q5: [-0.24961896865750174, 0.0033909184514677574) | 10000 |
6 | Q6: [0.0033909184514677574, 0.25756666932443384) | 10000 |
7 | Q7: [0.25756666932443384, 0.5275486789323077) | 10000 |
8 | Q8: [0.5275486789323077, 0.8435861617178939) | 10000 |
9 | Q9: [0.8435861617178939, 1.278357405991371) | 10000 |
10 | Q10: [1.278357405991371, 4.422718660979865] | 10000 |
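As a complement to the quantile-based calls above, here is a minimal sketch of cut with custom breaks and renamed labels. The variable names (ages, groups), the break points, and the label strings are made up for illustration; the values are chosen to stay away from the interval boundaries.

```julia
using CategoricalArrays

ages = [3, 17, 42, 70]

# Explicit break points define one interval per pair of consecutive breaks;
# `labels` replaces the automatically generated interval names.
groups = cut(ages, [0, 18, 65, 120]; labels=["minor", "adult", "senior"])

levels(groups)     # ["minor", "adult", "senior"]
isordered(groups)  # true: cut returns an ordered array
```

The levels follow the order of the breaks, so comparisons such as groups[1] < groups[3] are meaningful.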
contains integers, not strings
v = categorical([1, 2, 2, 3, 3])
5-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
1
2
2
3
3
sometimes you need to convert back to a standard vector
Vector{Union{String,Missing}}(z)
5-element Vector{Union{Missing, String}}:
"A"
"B"
"B"
"C"
missing
Managing levels
arr = [x, y, z, c, v]
5-element Vector{CategoricalArrays.CategoricalVector{T, UInt32} where T}:
CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"]
CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"]
Union{Missing, CategoricalArrays.CategoricalValue{String, UInt32}}["A", "B", "B", "C", missing]
CategoricalArrays.CategoricalValue{String, UInt32}["Q1: [1.0, 2.8)", "Q1: [1.0, 2.8)", "Q2: [2.8, 4.6)", "Q2: [2.8, 4.6)", "Q3: [4.6, 6.4)", "Q3: [4.6, 6.4)", "Q4: [6.4, 8.2)", "Q4: [6.4, 8.2)", "Q5: [8.2, 10.0]", "Q5: [8.2, 10.0]"]
CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 2, 3, 3]
check if a categorical array is ordered
isordered.(arr)
5-element BitVector:
0
1
0
1
0
make x ordered
ordered!(x, true), isordered(x)
(CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"], true)
and unordered again
ordered!(x, false), isordered(x)
(CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"], false)
list levels
levels.(arr)
5-element Vector{Vector}:
["A", "B", "C"]
["A", "B", "C"]
["A", "B", "C"]
["Q1: [1.0, 2.8)", "Q2: [2.8, 4.6)", "Q3: [4.6, 6.4)", "Q4: [6.4, 8.2)", "Q5: [8.2, 10.0]"]
[1, 2, 3]
unique, unlike levels, also reports missing
unique.(arr)
5-element Vector{Vector}:
["A", "B", "C"]
["A", "B", "C"]
Union{Missing, String}["A", "B", "C", missing]
["Q1: [1.0, 2.8)", "Q2: [2.8, 4.6)", "Q3: [4.6, 6.4)", "Q4: [6.4, 8.2)", "Q5: [8.2, 10.0]"]
[1, 2, 3]
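The difference between levels and unique is easier to see on an array that has an unused level. The sketch below uses a hypothetical array w whose level "c" is declared but never appears; no output is shown for unique because its exact return type may depend on the package version.

```julia
using CategoricalArrays

# "c" is declared as a level but never used; missing is a value, not a level.
w = categorical(["b", "a", "b", missing], levels=["a", "b", "c"])

levels(w)  # ["a", "b", "c"]: every declared level, in level order, no missing
unique(w)  # only the values actually present in w, including missing
```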
can compare as y is ordered
y[1] < y[2]
true
not comparable, v is unordered although it contains integers
try
v[1] < v[2]
catch e
show(e)
end
ArgumentError("Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this")
comparison against a value of the type underlying the categorical value is not allowed
try
y[2] < "A"
catch e
show(e)
end
ArgumentError("cannot compare a `CategoricalValue` to value `v` of type `String`: wrap `v` using `CategoricalValue(v, catvalue)` or `CategoricalValue(v, catarray)` first")
you need to explicitly convert a value to a level
y[2] < CategoricalValue("A", y)
false
but it is treated as a level, and thus only valid levels are allowed
try
y[2] < CategoricalValue("Z", y)
catch e
show(e)
end
ArgumentError("level Z not found in source pool")
you can reorder levels, mostly useful for ordered CategoricalArrays
levels!(y, ["C", "B", "A"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"A"
"B"
"B"
"C"
observe that the order is changed
y[1] < y[2]
false
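Reordering levels matters most when the natural order is not alphabetical. A minimal sketch, using a hypothetical sizes array, of how levels! changes both comparisons and sorting:

```julia
using CategoricalArrays

sizes = categorical(["small", "large", "medium", "small"], ordered=true)
levels(sizes)  # sorted by default: ["large", "medium", "small"]

# Impose the meaningful order explicitly.
levels!(sizes, ["small", "medium", "large"])

sizes[1] < sizes[2]   # true: "small" now precedes "large"
sorted = sort(sizes)  # sorts by level order, not alphabetically
```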
you have to specify all levels that are present
try
levels!(z, ["A", "B"])
catch e
show(e)
end
ArgumentError("cannot remove level \"C\" as it is used at position 4 and allowmissing=false.")
unless the underlying array allows missing values and you force the removal of levels
levels!(z, ["A", "B"], allowmissing=true)
5-element CategoricalArrays.CategoricalArray{Union{Missing, String},1,UInt32}:
"A"
"B"
"B"
missing
missing
now z has only “B” entries
z[1] = "B"
z
5-element CategoricalArrays.CategoricalArray{Union{Missing, String},1,UInt32}:
"B"
"B"
"B"
missing
missing
but it remembers the levels it had (the reason is mostly performance)
levels(z)
2-element Vector{String}:
"A"
"B"
we can clean up the unused levels with droplevels!(z)
droplevels!(z)
levels(z)
1-element Vector{String}:
"B"
Data manipulation
x, levels(x)
(CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"], ["A", "B", "C"])
new level added at the end (works only for unordered)
x[2] = "0"
x, levels(x)
(CategoricalArrays.CategoricalValue{String, UInt32}["A", "0", "B", "C"], ["A", "B", "C", "0"])
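For an ordered array the same assignment is rejected, since the position of the new level in the order would be ambiguous. The sketch below, on a hypothetical array o, catches the error generically (the exact exception type is not assumed) and then adds the level explicitly with levels! before assigning:

```julia
using CategoricalArrays

o = categorical(["A", "B", "C"], ordered=true)

# Implicitly adding a level to an ordered array throws.
failed = try
    o[2] = "0"
    false
catch
    true
end

# Declare the new level and its position first, then assign.
levels!(o, ["0", "A", "B", "C"])
o[2] = "0"
o[2] < o[1]  # true: "0" was placed before "A"
```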
v, levels(v)
(CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 2, 3, 3], [1, 2, 3])
even though the underlying data is Int, we cannot operate on it
try
v[1] + v[2]
catch e
show(e)
end
MethodError(+, (CategoricalArrays.CategoricalValue{Int64, UInt32} 1, CategoricalArrays.CategoricalValue{Int64, UInt32} 2), 0x0000000000006892)
you either have to retrieve the data by conversion (may be expensive)
Vector{Int}(v)
5-element Vector{Int64}:
1
2
2
3
3
or get a single value by unwrap
unwrap(v[1]) + unwrap(v[2])
3
this will work for arrays without missing values
unwrap.(v)
5-element Vector{Int64}:
1
2
2
3
3
also works on missing values
unwrap.(z)
5-element Vector{Union{Missing, String}}:
"B"
"B"
"B"
missing
missing
or do the explicit conversion
Vector{Union{String,Missing}}(z)
5-element Vector{Union{Missing, String}}:
"B"
"B"
"B"
missing
missing
recode some values in an array; there is also an in-place equivalent, recode!
recode([1, 2, 3, 4, 5, missing], 1 => 10)
6-element Vector{Union{Missing, Int64}}:
10
2
3
4
5
missing
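A minimal sketch of the in-place variant on a plain vector; nums is a made-up example. A vector on the left-hand side of a pair maps several values at once, and values without a matching pair are left unchanged.

```julia
using CategoricalArrays

nums = [1, 2, 3, 4, 5]

# With a single array argument, nums is both source and destination.
recode!(nums, 1 => 10, [4, 5] => 0)
nums  # [10, 2, 3, 0, 0]
```

Because the result is written back into nums, the replacement values must fit its element type (Int here).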
here we provide a default value for entries that have no mapping
recode([1, 2, 3, 4, 5, missing], "a", 1 => 10, 2 => 20)
6-element Vector{Union{Missing, Int64, String}}:
10
20
"a"
"a"
"a"
missing
to recode missing you have to map it explicitly
recode([1, 2, 3, 4, 5, missing], 1 => 10, missing => "missing")
6-element Vector{Union{Int64, String}}:
10
2
3
4
5
"missing"
t = categorical([1:5; missing])
t, levels(t)
(Union{Missing, CategoricalArrays.CategoricalValue{Int64, UInt32}}[1, 2, 3, 4, 5, missing], [1, 2, 3, 4, 5])
note that levels that are no longer used are dropped after recode!
recode!(t, [1, 3] => 2)
t, levels(t)
(Union{Missing, CategoricalArrays.CategoricalValue{Int64, UInt32}}[2, 2, 2, 4, 5, missing], [2, 4, 5])
and if you introduce new levels, they are added at the end in the order of appearance
t = categorical([1, 2, 3], ordered=true)
levels(recode(t, 2 => 0, 1 => -1))
3-element Vector{Int64}:
3
0
-1
when a default is used, it becomes the last level
t = categorical([1, 2, 3, 4, 5], ordered=true)
levels(recode(t, 300, [1, 2] => 100, 3 => 200))
3-element Vector{Int64}:
100
200
300
Comparisons
x = categorical([1, 2, 3])
xs = [x, categorical(x), categorical(x, ordered=true), categorical(x, ordered=true)]
levels!(xs[2], [3, 2, 1])
levels!(xs[4], [2, 3, 1])
[a == b for a in xs, b in xs] ## all are equal - comparison only by contents
4×4 Matrix{Bool}:
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
this is actually the full signature of CategoricalArray
signature(x::CategoricalArray) = (x, levels(x), isordered(x))
signature (generic function with 1 method)
all are different; notice that xs[1] and xs[2] are both unordered but have a different order of levels
[signature(a) == signature(b) for a in xs, b in xs]
4×4 Matrix{Bool}:
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
you cannot compare elements of unordered CategoricalArray
try
x[1] < x[2]
catch e
show(e)
end
ArgumentError("Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this")
but you can do it for an ordered one
t[1] < t[2]
true
isless works within the same CategoricalArray even if it is not ordered
isless(x[1], x[2])
true
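across categorical arrays the comparison also succeeds here, where x and y have identical levels (the try block only guards against an error)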
y = deepcopy(x)
try
isless(x[1], y[2])
catch e
show(e)
end
true
you can use unwrap to compare the contents of CategoricalArrays directly
isless(unwrap(x[1]), unwrap(y[2]))
true
equality tests work across CategoricalArrays
x[1] == y[2]
false
Categorical columns in a DataFrame
df = DataFrame(x=1:3, y='a':'c', z=["a", "b", "c"])
Row | x | y | z |
---|---|---|---|
 | Int64 | Char | String |
1 | 1 | a | a |
2 | 2 | b | b |
3 | 3 | c | c |
Convert all String columns to categorical in-place
transform!(df, names(df, String) => categorical, renamecols=false)
Row | x | y | z |
---|---|---|---|
 | Int64 | Char | Cat… |
1 | 1 | a | a |
2 | 2 | b | b |
3 | 3 | c | c |
describe(df)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
 | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
1 | x | 2.0 | 1 | 2.0 | 3 | 0 | Int64 |
2 | y | | a | | c | 0 | Char |
3 | z | | a | | c | 0 | CategoricalValue{String, UInt32} |
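An ordered categorical column also controls how a data frame is sorted. The following is a sketch under assumed data: the orders table and its size labels are hypothetical, and the level order is passed through the levels keyword of categorical.

```julia
using DataFrames
using CategoricalArrays

orders = DataFrame(id=1:4, size=["small", "large", "medium", "small"])

# Replace the String column by an ordered categorical one with explicit levels.
orders.size = categorical(orders.size; ordered=true,
                          levels=["small", "medium", "large"])

# Rows are sorted by level order (small < medium < large), not alphabetically.
sorted = sort(orders, :size)
```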
This notebook was generated using Literate.jl.