# Working with CategoricalArrays
CategoricalArrays.jl is independent of DataFrames.jl, but the two packages are often used together.

In [1]:
using DataFrames
using CategoricalArrays

## Constructor
unordered arrays

In [2]:
x = categorical(["A", "B", "B", "C"])

4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"

ordered; by default the levels are in sorted order

In [3]:
y = categorical(["A", "B", "B", "C"], ordered=true)

4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"

unordered with missing values

In [4]:
z = categorical(["A", "B", "B", "C", missing])

5-element CategoricalArrays.CategoricalArray{Union{Missing, String},1,UInt32}:
 "A"
 "B"
 "B"
 "C"
 missing

ordered array cut into groups of equal counts; it is also possible to rename the labels and to give custom breaks

In [5]:
c = cut(1:10, 5)

10-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "Q1: [1.0, 2.8)"
 "Q1: [1.0, 2.8)"
 "Q2: [2.8, 4.6)"
 "Q2: [2.8, 4.6)"
 "Q3: [4.6, 6.4)"
 "Q3: [4.6, 6.4)"
 "Q4: [6.4, 8.2)"
 "Q4: [6.4, 8.2)"
 "Q5: [8.2, 10.0]"
 "Q5: [8.2, 10.0]"
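
Custom breaks and labels can be sketched as follows. The data, the break points and the label names are invented for illustration; the call assumes the `cut(x, breaks; labels=...)` method, where the intervals are closed on the left and open on the right, so every value here lies strictly inside the range covered by the breaks.

```julia
using CategoricalArrays

ages = [3, 17, 25, 40, 64]          # hypothetical data
breaks = [0, 18, 65]                # two intervals: [0, 18) and [18, 65)
groups = cut(ages, breaks; labels=["minor", "adult"])

levels(groups)     # the custom labels, in break order
isordered(groups)  # cut returns an ordered array
```

Three break points define two intervals, so exactly two labels are passed.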

(we will cover grouping later, but let us use it here to analyze the results; we use Chain.jl for chaining)

In [6]:
using Chain
@chain DataFrame(x=cut(randn(100000), 10)) begin
    groupby(:x)
    combine(nrow) ## just to make sure cut works right
end

| Row | x | nrow |
|-----|---|------|
| | Cat… | Int64 |
| 1 | Q1: [-4.279904403135173, -1.2805922125580342) | 10000 |
| 2 | Q2: [-1.2805922125580342, -0.8423538018796051) | 10000 |
| 3 | Q3: [-0.8423538018796051, -0.5243664482105579) | 10000 |
| 4 | Q4: [-0.5243664482105579, -0.2522461005507123) | 10000 |
| 5 | Q5: [-0.2522461005507123, 0.003876346583846231) | 10000 |
| 6 | Q6: [0.003876346583846231, 0.2548311831624185) | 10000 |
| 7 | Q7: [0.2548311831624185, 0.5240154299257704) | 10000 |
| 8 | Q8: [0.5240154299257704, 0.8392437647285013) | 10000 |
| 9 | Q9: [0.8392437647285013, 1.2813360858102834) | 10000 |
| 10 | Q10: [1.2813360858102834, 4.278250012822977] | 10000 |


a categorical array can contain integers, not only strings

In [7]:
v = categorical([1, 2, 2, 3, 3])

5-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 2
 3
 3

sometimes you need to convert back to a standard vector

In [8]:
Vector{Union{String,Missing}}(z)

5-element Vector{Union{Missing, String}}:
 "A"
 "B"
 "B"
 "C"
 missing

## Managing levels

In [9]:
arr = [x, y, z, c, v]

5-element Vector{CategoricalArrays.CategoricalVector{T, UInt32} where T}:
 CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"]
 CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"]
 Union{Missing, CategoricalArrays.CategoricalValue{String, UInt32}}["A", "B", "B", "C", missing]
 CategoricalArrays.CategoricalValue{String, UInt32}["Q1: [1.0, 2.8)", "Q1: [1.0, 2.8)", "Q2: [2.8, 4.6)", "Q2: [2.8, 4.6)", "Q3: [4.6, 6.4)", "Q3: [4.6, 6.4)", "Q4: [6.4, 8.2)", "Q4: [6.4, 8.2)", "Q5: [8.2, 10.0]", "Q5: [8.2, 10.0]"]
 CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 2, 3, 3]

check if a categorical array is ordered

In [10]:
isordered.(arr)

5-element BitVector:
 0
 1
 0
 1
 0

make x ordered

In [11]:
ordered!(x, true), isordered(x)

(CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"], true)

and unordered again

In [12]:
ordered!(x, false), isordered(x)

(CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"], false)

list levels

In [13]:
levels.(arr)

5-element Vector{Vector}:
 ["A", "B", "C"]
 ["A", "B", "C"]
 ["A", "B", "C"]
 ["Q1: [1.0, 2.8)", "Q2: [2.8, 4.6)", "Q3: [4.6, 6.4)", "Q4: [6.4, 8.2)", "Q5: [8.2, 10.0]"]
 [1, 2, 3]

unlike `levels`, `unique` includes `missing`

In [14]:
unique.(arr)

5-element Vector{Vector}:
 ["A", "B", "C"]
 ["A", "B", "C"]
 Union{Missing, String}["A", "B", "C", missing]
 ["Q1: [1.0, 2.8)", "Q2: [2.8, 4.6)", "Q3: [4.6, 6.4)", "Q4: [6.4, 8.2)", "Q5: [8.2, 10.0]"]
 [1, 2, 3]

can compare as y is ordered

In [15]:
y[1] < y[2]

true

not comparable, v is unordered although it contains integers

In [16]:
try
    v[1] < v[2]
catch e
    show(e)
end

ArgumentError("Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this")

comparison against a value of the type underlying the categorical value is not allowed

In [17]:
try
    y[2] < "A"
catch e
    show(e)
end

ArgumentError("cannot compare a `CategoricalValue` to value `v` of type `String`: wrap `v` using `CategoricalValue(v, catvalue)` or `CategoricalValue(v, catarray)` first")

you need to explicitly convert a value to a level

In [18]:
y[2] < CategoricalValue("A", y)

false

but it is treated as a level, and thus only valid levels are allowed

In [19]:
try
    y[2] < CategoricalValue("Z", y)
catch e
    show(e)
end

ArgumentError("level Z not found in source pool")

you can reorder levels, mostly useful for ordered CategoricalArrays

In [20]:
levels!(y, ["C", "B", "A"])

4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"

observe that the order is changed

In [21]:
y[1] < y[2]

false

you have to specify all levels that are present

In [22]:
try
    levels!(z, ["A", "B"])
catch e
    show(e)
end

ArgumentError("cannot remove level \"C\" as it is used at position 4 and allowmissing=false.")

unless the underlying array allows missing values and you force the removal of levels with `allowmissing=true`

In [23]:
levels!(z, ["A", "B"], allowmissing=true)

5-element CategoricalArrays.CategoricalArray{Union{Missing, String},1,UInt32}:
 "A"
 "B"
 "B"
 missing
 missing

now the only non-missing entries in z are "B"

In [24]:
z[1] = "B"
z

5-element CategoricalArrays.CategoricalArray{Union{Missing, String},1,UInt32}:
 "B"
 "B"
 "B"
 missing
 missing

but it remembers the levels it had (the reason is mostly performance)

In [25]:
levels(z)

2-element Vector{String}:
 "A"
 "B"

this way we can clean it up by `droplevels!(z)`

In [26]:
droplevels!(z)
levels(z)

1-element Vector{String}:
 "B"

## Data manipulation

In [27]:
x, levels(x)

(CategoricalArrays.CategoricalValue{String, UInt32}["A", "B", "B", "C"], ["A", "B", "C"])

a new level is added at the end (this works only for unordered arrays)

In [28]:
x[2] = "0"
x, levels(x)

(CategoricalArrays.CategoricalValue{String, UInt32}["A", "0", "B", "C"], ["A", "B", "C", "0"])
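
For an ordered array the same assignment is expected to fail, because the position of the new level in the order would be ambiguous. The sketch below assumes that behavior (an exception on implicit level creation) and shows one way around it: extending the levels explicitly with `levels!` first. The array `o` and the level `"Z"` are hypothetical.

```julia
using CategoricalArrays

o = categorical(["A", "B"], ordered=true)

# Assigning an unknown level to an ordered array is assumed to throw.
failed = try
    o[1] = "Z"
    false
catch
    true
end

# Declare the new level explicitly, choosing its place in the order,
# after which the assignment is allowed.
levels!(o, ["A", "B", "Z"])
o[1] = "Z"
o, levels(o)
```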

In [29]:
v, levels(v)

(CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 2, 3, 3], [1, 2, 3])

even though the underlying data is Int, we cannot operate on it

In [30]:
try
    v[1] + v[2]
catch e
    show(e)
end

MethodError(+, (CategoricalArrays.CategoricalValue{Int64, UInt32} 1, CategoricalArrays.CategoricalValue{Int64, UInt32} 2), 0x0000000000007b3f)

you either have to retrieve the data by conversion (which may be expensive)

In [31]:
Vector{Int}(v)

5-element Vector{Int64}:
 1
 2
 2
 3
 3

or get a single value by `unwrap`

In [32]:
unwrap(v[1]) + unwrap(v[2])

3

this will work for arrays without missing values

In [33]:
unwrap.(v)

5-element Vector{Int64}:
 1
 2
 2
 3
 3

it also works for arrays with missing values

In [34]:
unwrap.(z)

5-element Vector{Union{Missing, String}}:
 "B"
 "B"
 "B"
 missing
 missing

or do the explicit conversion

In [35]:
Vector{Union{String,Missing}}(z)

5-element Vector{Union{Missing, String}}:
 "B"
 "B"
 "B"
 missing
 missing

recode some values in an array; there is also an in-place `recode!` equivalent

In [36]:
recode([1, 2, 3, 4, 5, missing], 1 => 10)

6-element Vector{Union{Missing, Int64}}:
 10
  2
  3
  4
  5
   missing

here we provide a default value for the values that are not explicitly mapped

In [37]:
recode([1, 2, 3, 4, 5, missing], "a", 1 => 10, 2 => 20)

6-element Vector{Union{Missing, Int64, String}}:
 10
 20
   "a"
   "a"
   "a"
   missing

to recode `missing` you have to do it explicitly

In [38]:
recode([1, 2, 3, 4, 5, missing], 1 => 10, missing => "missing")

6-element Vector{Union{Int64, String}}:
 10
  2
  3
  4
  5
   "missing"

In [39]:
t = categorical([1:5; missing])
t, levels(t)

(Union{Missing, CategoricalArrays.CategoricalValue{Int64, UInt32}}[1, 2, 3, 4, 5, missing], [1, 2, 3, 4, 5])

note that the levels are dropped after recode

In [40]:
recode!(t, [1, 3] => 2)
t, levels(t)

(Union{Missing, CategoricalArrays.CategoricalValue{Int64, UInt32}}[2, 2, 2, 4, 5, missing], [2, 4, 5])

and if you introduce new levels, they are added at the end in the order of appearance

In [41]:
t = categorical([1, 2, 3], ordered=true)
levels(recode(t, 2 => 0, 1 => -1))

3-element Vector{Int64}:
  3
  0
 -1

when using a default value, it becomes the last level

In [42]:
t = categorical([1, 2, 3, 4, 5], ordered=true)
levels(recode(t, 300, [1, 2] => 100, 3 => 200))

3-element Vector{Int64}:
 100
 200
 300

## Comparisons

In [43]:
x = categorical([1, 2, 3])
xs = [x, categorical(x), categorical(x, ordered=true), categorical(x, ordered=true)]
levels!(xs[2], [3, 2, 1])
levels!(xs[4], [2, 3, 1])
[a == b for a in xs, b in xs] ## all are equal - comparison only by contents

4×4 Matrix{Bool}:
 1  1  1  1
 1  1  1  1
 1  1  1  1
 1  1  1  1

this is actually the full signature of CategoricalArray

In [44]:
signature(x::CategoricalArray) = (x, levels(x), isordered(x))

signature (generic function with 1 method)

all are different; notice that xs[1] and xs[2] are both unordered but have a different order of levels

In [45]:
[signature(a) == signature(b) for a in xs, b in xs]

4×4 Matrix{Bool}:
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1

you cannot compare elements of an unordered CategoricalArray using `<`

In [46]:
try
    x[1] < x[2]
catch e
    show(e)
end

ArgumentError("Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this")

but you can do it for an ordered one

In [47]:
t[1] < t[2]

true

`isless()` works within the same CategoricalArray even if it is not ordered

In [48]:
isless(x[1], x[2])

true
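
Note that `isless` on categorical values follows the order of the levels, not the order of the underlying values. A small sketch with made-up data (the array `a` and its levels are assumptions for illustration):

```julia
using CategoricalArrays

a = categorical(["low", "high", "medium"])   # unordered array
levels!(a, ["low", "medium", "high"])        # set the level order explicitly

by_level = isless(a[1], a[2])                   # "low" vs "high" by level position
by_value = isless(unwrap(a[1]), unwrap(a[2]))   # "low" vs "high" as plain strings
(by_level, by_value)
```

Here the two results differ: by level order "low" precedes "high", while as plain strings "high" sorts before "low".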

across categorical arrays it also works here, because y is a copy of x and so has the same levels

In [49]:
y = deepcopy(x)
try
    isless(x[1], y[2])
catch e
    show(e)
end

true

you can use `unwrap` to compare the underlying contents of CategoricalArrays

In [50]:
isless(unwrap(x[1]), unwrap(y[2]))

true

equality tests work across CategoricalArrays

In [51]:
x[1] == y[2]

false

## Categorical columns in a DataFrame

In [52]:
df = DataFrame(x=1:3, y='a':'c', z=["a", "b", "c"])

| Row | x | y | z |
|-----|---|---|---|
| | Int64 | Char | String |
| 1 | 1 | a | a |
| 2 | 2 | b | b |
| 3 | 3 | c | c |


Convert all String columns to categorical in-place

In [53]:
transform!(df, names(df, String) => categorical, renamecols=false)

| Row | x | y | z |
|-----|---|---|---|
| | Int64 | Char | Cat… |
| 1 | 1 | a | a |
| 2 | 2 | b | b |
| 3 | 3 | c | c |


In [54]:
describe(df)

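| Row | variable | mean | min | median | max | nmissing | eltype |
|-----|----------|------|-----|--------|-----|----------|--------|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | x | 2.0 | 1 | 2.0 | 3 | 0 | Int64 |
| 2 | y | | a | | c | 0 | Char |
| 3 | z | | a | | c | 0 | CategoricalValue{String, UInt32} |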


---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*