Handling missing values#

using DataFrames

A singleton type Missing allows us to deal with missing values.

missing, typeof(missing)
(missing, Missing)

Arrays automatically create an appropriate union type.

x = [1, 2, missing, 3]
4-element Vector{Union{Missing, Int64}}:
 1
 2
  missing
 3

ismissing checks if passed value is missing.

ismissing(1), ismissing(missing), ismissing(x), ismissing.(x)
(false, true, false, Bool[0, 0, 1, 0])

We can extract the type combined with Missing from a Union via nonmissingtype (This is useful for arrays!)

eltype(x), nonmissingtype(eltype(x))
(Union{Missing, Int64}, Int64)

missing comparisons produce missing.

missing == missing, missing != missing, missing < missing
(missing, missing, missing)

This is also true when missings are compared with values of other types.

1 == missing, 1 != missing, 1 < missing
(missing, missing, missing)

isequal, isless, and === produce results of type Bool. Notice that missing is considered greater than any numeric value.

isequal(missing, missing), missing === missing, isequal(1, missing), isless(1, missing)
(true, true, false, true)

In the next few examples, we see that many (not all) functions handle missing.

map(x -> x(missing), [sin, cos, zero, sqrt]) ## part 1
4-element Vector{Missing}:
 missing
 missing
 missing
 missing
map(x -> x(missing, 1), [+, -, *, /, div]) ## part 2
5-element Vector{Missing}:
 missing
 missing
 missing
 missing
 missing
using Statistics ## needed for mean
map(x -> x([1, 2, missing]), [minimum, maximum, extrema, mean, float]) ## part 3
5-element Vector{Any}:
 missing
 missing
 (missing, missing)
 missing
 Union{Missing, Float64}[1.0, 2.0, missing]

skipmissing returns iterator skipping missing values. We can use collect and skipmissing to create an array that excludes these missing values.

collect(skipmissing([1, missing, 2, missing]))
2-element Vector{Int64}:
 1
 2

Here we use replace to create a new array that replaces all missing values with some value (NaN in this case).

replace([1.0, missing, 2.0, missing], missing => NaN)
4-element Vector{Float64}:
   1.0
 NaN
   2.0
 NaN

Another way is to use coalesce()

coalesce.([1.0, missing, 2.0, missing], NaN)
4-element Vector{Float64}:
   1.0
 NaN
   2.0
 NaN

You can also use recode from CategoricalArrays.jl if you have a default output value.

using CategoricalArrays
recode([1.0, missing, 2.0, missing], false, missing => true)
4-element Vector{Bool}:
 0
 1
 0
 1

There are also replace! and recode! functions that work in place. Here is an example how you can process missing input in a data frame.

df = DataFrame(a=[1, 2, missing], b=["a", "b", missing])
3×2 DataFrame
Rowab
Int64?String?
11a
22b
3missingmissing

we change df.a vector in place.

replace!(df.a, missing => 100)
3-element Vector{Union{Missing, Int64}}:
   1
   2
 100

Now we overwrite df.b with a new vector, because the replacement type is different than what eltype(df.b) accepts.

df.b = coalesce.(df.b, 100)
3-element Vector{Any}:
    "a"
    "b"
 100
df
3×2 DataFrame
Rowab
Int64?Any
11a
22b
3100100

You can use unique or levels to get unique values with or without missings, respectively.

unique([1, missing, 2, missing]), levels([1, missing, 2, missing])
(Union{Missing, Int64}[1, missing, 2], [1, 2])

In this next example, we convert x to y with allowmissing, where y has a type that accepts missing values.

x = [1, 2, 3]
y = allowmissing(x)
3-element Vector{Union{Missing, Int64}}:
 1
 2
 3

Then, we convert back with disallowmissing. This would fail if y contained missing values!

z = disallowmissing(y)
x, y, z
([1, 2, 3], Union{Missing, Int64}[1, 2, 3], [1, 2, 3])

disallowmissing has error keyword argument that can be used to decide how it should behave when it encounters a column that actually contains a missing value

df = allowmissing(DataFrame(ones(2, 3), :auto))
2×3 DataFrame
Rowx1x2x3
Float64?Float64?Float64?
11.01.01.0
21.01.01.0
df[1, 1] = missing
missing
df
2×3 DataFrame
Rowx1x2x3
Float64?Float64?Float64?
1missing1.01.0
21.01.01.0

an error is thrown by disallowmissing()

try
    disallowmissing(df)
catch e
    show(e)
end
ArgumentError("Missing value found in column :x1 in row 1")

column :x1 is left untouched as it contains missing

disallowmissing(df, error=false)
2×3 DataFrame
Rowx1x2x3
Float64?Float64Float64
1missing1.01.0
21.01.01.0

In this next example, we show that the type of each column in x is initially Int64. After using allowmissing! to accept missing values in columns 1 and 3, the types of those columns become Union{Int64,Missing}.

x = DataFrame(rand(Int, 2, 3), :auto)
println("Before: ", eltype.(eachcol(x)))
allowmissing!(x, 1) ## make first column accept missings
allowmissing!(x, :x3) ## make :x3 column accept missings
println("After: ", eltype.(eachcol(x)))
Before: DataType[Int64, Int64, Int64]
After: Type[Union{Missing, Int64}, Int64, Union{Missing, Int64}]

In this next example, we’ll use completecases to find all the rows of a DataFrame that have complete data.

x = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])
println("Complete cases:\n", completecases(x))
Complete cases:
Bool[1, 0, 0, 1]

We can use dropmissing or dropmissing! to remove the rows with incomplete data from a DataFrame and either create a new DataFrame or mutate the original in-place.

y = dropmissing(x)
dropmissing!(x)
2×2 DataFrame
RowAB
Int64String
11A
24C
x
2×2 DataFrame
RowAB
Int64String
11A
24C
y
2×2 DataFrame
RowAB
Int64String
11A
24C

When we call describe on a DataFrame with dropped missing values, the columns do not allow missing values any more by default.

describe(x)
2×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64DataType
1A2.512.540Int64
2BAC0String

Alternatively you can pass disallowmissing keyword argument to dropmissing and dropmissing!

x = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])
4×2 DataFrame
RowAB
Int64?String?
11A
2missingB
33missing
44C
dropmissing!(x, disallowmissing=false)
2×2 DataFrame
RowAB
Int64?String?
11A
24C

Making functions missing-aware#

If we have a function that does not handle missing values we can wrap it using passmissing function so that if any of its positional arguments is missing we will get a missing value in return. In the example below we change how string function behaves:

string(missing)
"missing"
string(missing, " ", missing)
"missing missing"
string(1, 2, 3)
"123"
lift_string = passmissing(string)
(::Missings.PassMissing{typeof(string)}) (generic function with 2 methods)
lift_string(missing)
missing
lift_string(missing, " ", missing)
missing
lift_string(1, 2, 3)
"123"

Aggregating rows containing missing values#

Create an example data frame containing missing values:

df = DataFrame(a=[1, missing, missing], b=[1, 2, missing])
3×2 DataFrame
Rowab
Int64?Int64?
111
2missing2
3missingmissing

If we just run sum on the rows we get two missing entries:

sum.(eachrow(df))
3-element Vector{Union{Missing, Int64}}:
 2
  missing
  missing

One can apply skipmissing on the rows to avoid this problem:

try
    sum.(skipmissing.(eachrow(df)))
catch e
    show(e)
end
ArgumentError("reducing over an empty collection of unknown element type is not allowed.\nYou may be able to prevent this error by supplying an `init` value to the reducer.")

However, we get an error. The problem is that the last row of df contains only missing values, and since eachrow is type unstable the eltype of the result of skipmissing is unknown (so it is marked Any)

collect(skipmissing(eachrow(df)[end]))
Any[]

In such cases it is useful to switch to Tables.namedtupleiterator which is type stable as discussed in 01_constructors.ipynb notebook.

sum.(skipmissing.(Tables.namedtupleiterator(df)))
3-element Vector{Int64}:
 2
 2
 0

Later in the tutorial you will learn that you can efficiently calculate such sums using the select function:

select(df, AsTable(:) => ByRow(sum  skipmissing))
3×1 DataFrame
Rowa_b_sum_skipmissing
Int64
12
22
30

Note that it correctly handles the rows with all missing values.


This notebook was generated using Literate.jl.