CategoricalArrays.jl is independent of DataFrames.jl, but the two are often used together.
using DataFrames
using CategoricalArrays

Constructor
unordered arrays
x = categorical(["A", "B", "B", "C"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"A"
"B"
"B"
"C"

ordered; by default the order of levels is their sorting order
y = categorical(["A", "B", "B", "C"], ordered=true)
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"A"
"B"
"B"
"C"

unordered with missing values
z = categorical(["A", "B", "B", "C", missing])
5-element CategoricalArrays.CategoricalArray{Union{Missing, String},1,UInt32}:
"A"
"B"
"B"
"C"
missing

ordered array cut into equal counts; it is also possible to rename labels and give custom breaks
c = cut(1:10, 5)
10-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"[1, 3)"
"[1, 3)"
"[3, 5)"
"[3, 5)"
"[5, 7)"
"[5, 7)"
"[7, 9)"
"[7, 9)"
"[9, 10]"
"[9, 10]"

(we will cover grouping later, but let us use it here to analyze the results; we use Chain.jl for chaining)
using Chain
@chain DataFrame(x=cut(randn(100000), 10)) begin
    groupby(:x)
    combine(nrow) ## just to make sure cut works right
end

contains integers, not strings
v = categorical([1, 2, 2, 3, 3])
5-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
1
2
2
3
3

sometimes you need to convert back to a standard vector
Vector{Union{String,Missing}}(z)
5-element Vector{Union{Missing, String}}:
"A"
"B"
"B"
"C"
missing

Managing levels
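Level names can also be fixed at construction time. The note on cut above says that labels can be renamed and custom breaks given; a minimal sketch of that, assuming the `labels` keyword of `cut` and using hypothetical data, break points, and label names chosen only for illustration:

```julia
using CategoricalArrays

# hypothetical observations and break points, chosen only for illustration
obs = [1, 5, 9, 3]
breaks = [0, 4, 8, 12]   # bins [0, 4), [4, 8), [8, 12)

# one label per bin; the labels become the levels, in bin order
binned = cut(obs, breaks, labels=["low", "mid", "high"])

levels(binned)      # lists the labels in bin order
isordered(binned)   # the result of cut is an ordered array
```

By default, values outside the range of the breaks are expected to raise an error, which is why the break points here cover all the data.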
arr = [x, y, z, c, v]
5-element Vector{CategoricalArrays.CategoricalVector{T, UInt32} where T}:
CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 3)]
CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"], true), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"], true), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"], true), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"], true), 3)]
Union{Missing, CategoricalArrays.CategoricalValue{String, UInt32}}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 3), missing]
CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 3), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 3), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 4), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 4), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 5), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 5)]
CategoricalArrays.CategoricalValue{Int64, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 3), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 3)]

check if categorical array is ordered
isordered.(arr)
5-element BitVector:
0
1
0
1
0

make x ordered
ordered!(x, true), isordered(x)
(CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"], true), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"], true), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"], true), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"], true), 3)], true)

and unordered again
ordered!(x, false), isordered(x)
(CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 3)], false)

list levels
levels.(arr)
5-element Vector{CategoricalArrays.CategoricalVector{T, UInt32, V, C, Union{}} where {T, V, C}}:
CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 3)]
CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"], true), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"], true), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"], true), 3)]
CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 3)]
CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 3), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 4), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 5)]
CategoricalArrays.CategoricalValue{Int64, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 3)]

with unique, missing will be included
unique.(arr)
5-element Vector{CategoricalArrays.CategoricalVector{T, UInt32} where T}:
CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 3)]
CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"], true), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"], true), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"], true), 3)]
Union{Missing, CategoricalArrays.CategoricalValue{String, UInt32}}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 3), missing]
CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 3), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 4), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["[1, 3)", "[3, 5)", "[5, 7)", "[7, 9)", "[9, 10]"], true), 5)]
CategoricalArrays.CategoricalValue{Int64, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 3)]

elements can be compared because y is ordered
y[1] < y[2]
true

not comparable, v is unordered although it contains integers
try
    v[1] < v[2]
catch e
    show(e)
end
ArgumentError("Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this")

comparison against a value of the underlying type is not allowed
try
    y[2] < "A"
catch e
    show(e)
end
ArgumentError("cannot compare a `CategoricalValue` to value `v` of type `String`: wrap `v` using `CategoricalValue(v, catvalue)` or `CategoricalValue(v, catarray)` first")

you need to explicitly convert a value to a level
y[2] < CategoricalValue("A", y)
false

but it is treated as a level, and thus only valid levels are allowed
try
    y[2] < CategoricalValue("Z", y)
catch e
    show(e)
end
ArgumentError("level Z not found in source pool")

you can reorder levels, which is mostly useful for ordered CategoricalArrays
levels!(y, ["C", "B", "A"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"A"
"B"
"B"
"C"

observe that the order is changed
y[1] < y[2]
false

you have to specify all levels that are present
try
    levels!(z, ["A", "B"])
catch e
    show(e)
end
ArgumentError("cannot remove level \"C\" as it is used at position 4 and allowmissing=false.")

unless the underlying array allows missing values and you force the removal of levels
levels!(z, ["A", "B"], allowmissing=true)
5-element CategoricalArrays.CategoricalArray{Union{Missing, String},1,UInt32}:
"A"
"B"
"B"
missing
missing

after this assignment the only non-missing entries of z are "B"
z[1] = "B"
z
5-element CategoricalArrays.CategoricalArray{Union{Missing, String},1,UInt32}:
"B"
"B"
"B"
missing
missing

but it remembers the levels it had (the reason is mostly performance)
levels(z)
2-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"A"
"B"

we can clean this up with droplevels!(z)
droplevels!(z)
levels(z)
1-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"B"

Data manipulation
x, levels(x)
(CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 3)], CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C"]), 3)])

a new level is added at the end (works only for unordered arrays)
x[2] = "0"
x, levels(x)
(CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C", "0"]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C", "0"]), 4), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C", "0"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C", "0"]), 3)], CategoricalArrays.CategoricalValue{String, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C", "0"]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C", "0"]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C", "0"]), 3), CategoricalValue(CategoricalArrays.CategoricalPool{String, UInt32}(["A", "B", "C", "0"]), 4)])

v, levels(v)
(CategoricalArrays.CategoricalValue{Int64, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 3), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 3)], CategoricalArrays.CategoricalValue{Int64, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 3)])

even though the underlying data is Int, we cannot operate on it
try
    v[1] + v[2]
catch e
    show(e)
end
MethodError(+, (CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3]), 2)), 0x00000000000097fb)

you either have to retrieve the data by conversion (may be expensive)
Vector{Int}(v)
5-element Vector{Int64}:
1
2
2
3
3

or get a single value by unwrap
unwrap(v[1]) + unwrap(v[2])
3

broadcasting unwrap works for arrays without missing values
unwrap.(v)
5-element Vector{Int64}:
1
2
2
3
3

and it also works when missing values are present
unwrap.(z)
5-element Vector{Union{Missing, String}}:
"B"
"B"
"B"
missing
missing

or do the explicit conversion
Vector{Union{String,Missing}}(z)
5-element Vector{Union{Missing, String}}:
"B"
"B"
"B"
missing
missing

recode some values in an array; there is also an in-place recode! equivalent
recode([1, 2, 3, 4, 5, missing], 1 => 10)
6-element Vector{Union{Missing, Int64}}:
10
2
3
4
5
missing

here we provide a default value for the values that are not mapped
recode([1, 2, 3, 4, 5, missing], "a", 1 => 10, 2 => 20)
6-element Vector{Union{Missing, Int64, String}}:
10
20
"a"
"a"
"a"
missing

to recode missing you have to do it explicitly
recode([1, 2, 3, 4, 5, missing], 1 => 10, missing => "missing")
6-element Vector{Union{Int64, String}}:
10
2
3
4
5
"missing"

t = categorical([1:5; missing])
t, levels(t)
(Union{Missing, CategoricalArrays.CategoricalValue{Int64, UInt32}}[CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3, 4, 5]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3, 4, 5]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3, 4, 5]), 3), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3, 4, 5]), 4), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3, 4, 5]), 5), missing], CategoricalArrays.CategoricalValue{Int64, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3, 4, 5]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3, 4, 5]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3, 4, 5]), 3), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3, 4, 5]), 4), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([1, 2, 3, 4, 5]), 5)])

note that the recoded levels are dropped after recode!
recode!(t, [1, 3] => 2)
t, levels(t)
(Union{Missing, CategoricalArrays.CategoricalValue{Int64, UInt32}}[CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([2, 4, 5]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([2, 4, 5]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([2, 4, 5]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([2, 4, 5]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([2, 4, 5]), 3), missing], CategoricalArrays.CategoricalValue{Int64, UInt32}[CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([2, 4, 5]), 1), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([2, 4, 5]), 2), CategoricalValue(CategoricalArrays.CategoricalPool{Int64, UInt32}([2, 4, 5]), 3)])

and if you introduce new levels, they are added at the end in the order of appearance
t = categorical([1, 2, 3], ordered=true)
levels(recode(t, 2 => 0, 1 => -1))
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
3
0
-1

when a default is used, it becomes the last level
t = categorical([1, 2, 3, 4, 5], ordered=true)
levels(recode(t, 300, [1, 2] => 100, 3 => 200))
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
100
200
300

Comparisons
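Comparisons between categorical values follow the order of the levels, not the order of the wrapped values. A minimal sketch of this, using a hypothetical ordered array `sizes` whose intended order differs from the alphabetical one:

```julia
using CategoricalArrays

# hypothetical data; alphabetical order would be large, medium, small
sizes = categorical(["small", "large", "medium", "small"], ordered=true)

# impose the intended order of levels explicitly
levels!(sizes, ["small", "medium", "large"])

sizes[1] < sizes[2]     # compared by level position: small comes before large
sorted = sort(sizes)    # sorting follows the level order as well
```

The same strings compared directly, as in `"small" < "large"`, use alphabetical order instead.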
x = categorical([1, 2, 3])
xs = [x, categorical(x), categorical(x, ordered=true), categorical(x, ordered=true)]
levels!(xs[2], [3, 2, 1])
levels!(xs[4], [2, 3, 1])
[a == b for a in xs, b in xs] ## all are equal - comparison only by contents
4×4 Matrix{Bool}:
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1

this is actually the full signature of CategoricalArray
signature(x::CategoricalArray) = (x, levels(x), isordered(x))
signature (generic function with 1 method)

all are different; notice that xs[1] and xs[2] are both unordered but have a different order of levels
[signature(a) == signature(b) for a in xs, b in xs]
4×4 Matrix{Bool}:
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1

you cannot compare elements of an unordered CategoricalArray
try
    x[1] < x[2]
catch e
    show(e)
end
ArgumentError("Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this")

but you can do it for an ordered one
t[1] < t[2]
true

isless() works within the same CategoricalArray even if it is not ordered
isless(x[1], x[2])
true

here it also works across categorical arrays, because y is a copy of x and has the same levels
y = deepcopy(x)
try
    isless(x[1], y[2])
catch e
    show(e)
end
true

you can use unwrap to compare the underlying values directly
isless(unwrap(x[1]), unwrap(y[2]))
true

equality tests work across CategoricalArrays
x[1] == y[2]
false

Categorical columns in a DataFrame
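A minimal sketch of converting several String columns at once, assuming a hypothetical data frame `people` with two such columns (the names and values are made up for illustration). Broadcasting the pair with `.=>` applies categorical to each selected column separately; a plain `=>` with several source columns would instead pass all of them to a single categorical call.

```julia
using DataFrames
using CategoricalArrays

# hypothetical data frame with one integer and two String columns
people = DataFrame(id=1:3, city=["Oslo", "Rome", "Oslo"], team=["a", "b", "a"])

# names(people, String) selects the String columns; .=> builds one pair per column
transform!(people, names(people, String) .=> categorical, renamecols=false)

eltype.(eachcol(people))   # the id column keeps its integer element type
```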
df = DataFrame(x=1:3, y='a':'c', z=["a", "b", "c"])

Convert all String columns to categorical in-place
transform!(df, names(df, String) => categorical, renamecols=false)
describe(df)

This notebook was generated using Literate.jl.