Handling missing values

Handling missing values#

A singelton type Missing allows us to deal with missing values.

using DataFrames

missing, typeof(missing)

(missing, Missing)

Arrays automatically create an appropriate union type.

x = [1, 2, missing, 3]

4-element Vector{Union{Missing, Int64}}:
 1
 2
  missing
 3

ismissing checks if passed value is missing.

ismissing(1), ismissing(missing), ismissing(x), ismissing.(x)

(false, true, false, Bool[0, 0, 1, 0])

We can extract the type combined with Missing from a Union via nonmissingtype (This is useful for arrays!)

eltype(x), nonmissingtype(eltype(x))

(Union{Missing, Int64}, Int64)

missing comparisons produce missing.

missing == missing, missing != missing, missing < missing

(missing, missing, missing)

This is also true when missings are compared with values of other types.

1 == missing, 1 != missing, 1 < missing

(missing, missing, missing)

isequal, isless, and === produce results of type Bool. Notice that missing is considered greater than any numeric value.

isequal(missing, missing), missing === missing, isequal(1, missing), isless(1, missing)

(true, true, false, true)

In the next few examples, we see that many (not all) functions handle missing.

map(x -> x(missing), [sin, cos, zero, sqrt]) ## part 1

4-element Vector{Missing}:
 missing
 missing
 missing
 missing

map(x -> x(missing, 1), [+, -, *, /, div]) ## part 2

5-element Vector{Missing}:
 missing
 missing
 missing
 missing
 missing

using Statistics ## needed for mean
map(x -> x([1, 2, missing]), [minimum, maximum, extrema, mean, float]) ## part 3

5-element Vector{Any}:
 missing
 missing
 (missing, missing)
 missing
 Union{Missing, Float64}[1.0, 2.0, missing]

skipmissing returns iterator skipping missing values. We can use collect and skipmissing to create an array that excludes these missing values.

collect(skipmissing([1, missing, 2, missing]))

2-element Vector{Int64}:
 1
 2

Here we use replace to create a new array that replaces all missing values with some value (NaN in this case).

replace([1.0, missing, 2.0, missing], missing => NaN)

4-element Vector{Float64}:
   1.0
 NaN
   2.0
 NaN

Another way is to use coalesce()

coalesce.([1.0, missing, 2.0, missing], NaN)

4-element Vector{Float64}:
   1.0
 NaN
   2.0
 NaN

You can also use recode from CategoricalArrays.jl if you have a default output value.

using CategoricalArrays
recode([1.0, missing, 2.0, missing], false, missing => true)

4-element Vector{Bool}:
 0
 1
 0
 1

There are also replace! and recode! functions that work in place. Here is an example how you can to missing inputation in a data frame.

df = DataFrame(a=[1, 2, missing], b=["a", "b", missing])

3×2 DataFrame

Row	a	b
	Int64?	String?
1	1	a
2	2	b
3	missing	missing

we change df.a vector in place.

replace!(df.a, missing => 100)

3-element Vector{Union{Missing, Int64}}:
   1
   2
 100

Now we overwrite df.b with a new vector, because the replacement type is different than what eltype(df.b) accepts.

df.b = coalesce.(df.b, 100)

3-element Vector{Any}:
    "a"
    "b"
 100

df

3×2 DataFrame

Row	a	b
	Int64?	Any
1	1	a
2	2	b
3	100	100

You can use unique or levels to get unique values with or without missings, respectively.

unique([1, missing, 2, missing]), levels([1, missing, 2, missing])

(Union{Missing, Int64}[1, missing, 2], [1, 2])

In this next example, we convert x to y with allowmissing, where y has a type that accepts missings.

x = [1, 2, 3]
y = allowmissing(x)

3-element Vector{Union{Missing, Int64}}:
 1
 2
 3

Then, we convert back with disallowmissing. This would fail if y contained missing values!

z = disallowmissing(y)
x, y, z

([1, 2, 3], Union{Missing, Int64}[1, 2, 3], [1, 2, 3])

disallowmissing has error keyword argument that can be used to decide how it should behave when it encounters a column that actually contains a missing value

df = allowmissing(DataFrame(ones(2, 3), :auto))

2×3 DataFrame

Row	x1	x2	x3
	Float64?	Float64?	Float64?
1	1.0	1.0	1.0
2	1.0	1.0	1.0

df[1, 1] = missing

missing

df

2×3 DataFrame

Row	x1	x2	x3
	Float64?	Float64?	Float64?
1	missing	1.0	1.0
2	1.0	1.0	1.0

an error is thrown

try
    disallowmissing(df)
catch e
    show(e)
end

ArgumentError("Missing value found in column :x1 in row 1")

column :x1 is left untouched as it contains missing

disallowmissing(df, error=false)

2×3 DataFrame

Row	x1	x2	x3
	Float64?	Float64	Float64
1	missing	1.0	1.0
2	1.0	1.0	1.0

In this next example, we show that the type of each column in x is initially Int64. After using allowmissing! to accept missing values in columns 1 and 3, the types of those columns become Union{Int64,Missing}.

x = DataFrame(rand(Int, 2, 3), :auto)
println("Before: ", eltype.(eachcol(x)))
allowmissing!(x, 1) ## make first column accept missings
allowmissing!(x, :x3) ## make :x3 column accept missings
println("After: ", eltype.(eachcol(x)))

Before: DataType

[Int64, Int64, Int64]

After: Type[Union{Missing, Int64}, Int64, Union{Missing, Int64}]

In this next example, we’ll use completecases to find all the rows of a DataFrame that have complete data.

x = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])

4×2 DataFrame

Row	A	B
	Int64?	String?
1	1	A
2	missing	B
3	3	missing
4	4	C

println("Complete cases:\n", completecases(x))

Complete cases:
Bool

[1, 0, 0, 1]

We can use dropmissing or dropmissing! to remove the rows with incomplete data from a DataFrame and either create a new DataFrame or mutate the original in-place.

y = dropmissing(x)
dropmissing!(x)

2×2 DataFrame

Row	A	B
	Int64	String
1	1	A
2	4	C

2×2 DataFrame

Row	A	B
	Int64	String
1	1	A
2	4	C

2×2 DataFrame

Row	A	B
	Int64	String
1	1	A
2	4	C

When we call describe on a DataFrame with dropped missing values, the columns do not allow missing values any more by default.

describe(x)

2×7 DataFrame

Row	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	DataType
1	A	2.5	1	2.5	4	0	Int64
2	B		A		C	0	String

Alternatively you can pass disallowmissing keyword argument to dropmissing and dropmissing!

x = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])

4×2 DataFrame

Row	A	B
	Int64?	String?
1	1	A
2	missing	B
3	3	missing
4	4	C

dropmissing!(x, disallowmissing=false)

2×2 DataFrame

Row	A	B
	Int64?	String?
1	1	A
2	4	C

Making functions `missing`-aware#

If we have a function that does not handle missing values we can wrap it using passmissing function so that if any of its positional arguments is missing we will get a missing value in return. In the example below we change how string function behaves:

string(missing)

"missing"

string(missing, " ", missing)

"missing missing"

string(1, 2, 3)

"123"

lift_string = passmissing(string)

(::Missings.PassMissing{typeof(string)}) (generic function with 2 methods)

lift_string(missing)

missing

lift_string(missing, " ", missing)

missing

lift_string(1, 2, 3)

"123"

Aggregating rows containing missing values#

Create an example data frame containing missing values:

df = DataFrame(a=[1, missing, missing], b=[1, 2, missing])

3×2 DataFrame

Row	a	b
	Int64?	Int64?
1	1	1
2	missing	2
3	missing	missing

If we just run sum on the rows we get two missing entries:

sum.(eachrow(df))

3-element Vector{Union{Missing, Int64}}:
 2
  missing
  missing

One can apply skipmissing on the rows to avoid this problem:

try
    sum.(skipmissing.(eachrow(df)))
catch e
    show(e)
end

ArgumentError("reducing over an empty collection of unknown element type is not allowed.\nYou may be able to prevent this error by supplying an `init` value to the reducer.")

However, we get an error. The problem is that the last row of df contains only missing values, and since eachrow is type unstable the eltype of the result of skipmissing is unknown (so it is marked Any)

collect(skipmissing(eachrow(df)[end]))

Any[]

In such cases it is useful to switch to Tables.namedtupleiterator which is type stable as discussed in 01_constructors.ipynb notebook.

sum.(skipmissing.(Tables.namedtupleiterator(df)))

3-element Vector{Int64}:
 2
 2
 0

Later in the tutorial you will learn that you can efficiently calculate such sums using the select function:

select(df, AsTable(:) => ByRow(sum ∘ skipmissing))

3×1 DataFrame

Row	a_b_sum_skipmissing
	Int64
1	2
2	2
3	0

Note that it correctly handles the rows with all missing values.

Handling missing values

Contents

Handling missing values#

Making functions missing-aware#

Aggregating rows containing missing values#

Making functions `missing`-aware#