Joining DataFrames#
using DataFrames
Preparing DataFrames for a join#
x = DataFrame(ID=[1, 2, 3, 4, missing], name=["Alice", "Bob", "Conor", "Dave", "Zed"])
y = DataFrame(id=[1, 2, 5, 6, missing], age=[21, 22, 23, 24, 99])
| Row | id | age |
|---|---|---|
| Int64? | Int64 | |
| 1 | 1 | 21 |
| 2 | 2 | 22 |
| 3 | 5 | 23 |
| 4 | 6 | 24 |
| 5 | missing | 99 |
Rules for the on keyword argument:
a single
Symbolor string if joining on one column with the same name, e.g.on=:ida
PairofSymbols or string if joining on one column with different names, e.g.on=:id=>:id2a vector of
Symbols or strings if joining on multiple columns with the same name, e.g.on=[:id1, :id2]a vector of
Pairs ofSymbols or strings if joining on multiple columns with different names, e.g.on=[:a1=>:a2, :b1=>:b2]a vector containing a combination of
Symbols or strings orPairofSymbols or strings, e.g.on=[:a1=>:a2, :b1]
try
innerjoin(x, y, on=:ID => :id) ## missing is not allowed to join-on by default
catch e
show(e)
end
ArgumentError("Missing values in key columns are not allowed when matchmissing == :error. `missing` found in column :ID in left data frame.")
innerjoin(x, y, on=:ID => :id, matchmissing=:equal)
| Row | ID | name | age |
|---|---|---|---|
| Int64? | String | Int64 | |
| 1 | 1 | Alice | 21 |
| 2 | 2 | Bob | 22 |
| 3 | missing | Zed | 99 |
leftjoin(x, y, on="ID" => "id", matchmissing=:equal)
| Row | ID | name | age |
|---|---|---|---|
| Int64? | String | Int64? | |
| 1 | 1 | Alice | 21 |
| 2 | 2 | Bob | 22 |
| 3 | missing | Zed | 99 |
| 4 | 3 | Conor | missing |
| 5 | 4 | Dave | missing |
rightjoin(x, y, on=:ID => :id, matchmissing=:equal)
| Row | ID | name | age |
|---|---|---|---|
| Int64? | String? | Int64 | |
| 1 | 1 | Alice | 21 |
| 2 | 2 | Bob | 22 |
| 3 | missing | Zed | 99 |
| 4 | 5 | missing | 23 |
| 5 | 6 | missing | 24 |
outerjoin(x, y, on=:ID => :id, matchmissing=:equal)
| Row | ID | name | age |
|---|---|---|---|
| Int64? | String? | Int64? | |
| 1 | 1 | Alice | 21 |
| 2 | 2 | Bob | 22 |
| 3 | missing | Zed | 99 |
| 4 | 3 | Conor | missing |
| 5 | 4 | Dave | missing |
| 6 | 5 | missing | 23 |
| 7 | 6 | missing | 24 |
semijoin(x, y, on=:ID => :id, matchmissing=:equal)
| Row | ID | name |
|---|---|---|
| Int64? | String | |
| 1 | 1 | Alice |
| 2 | 2 | Bob |
| 3 | missing | Zed |
antijoin(x, y, on=:ID => :id, matchmissing=:equal)
| Row | ID | name |
|---|---|---|
| Int64? | String | |
| 1 | 3 | Conor |
| 2 | 4 | Dave |
Cross join#
(here no on argument)
crossjoin(DataFrame(x=[1, 2]), DataFrame(y=["a", "b", "c"]))
| Row | x | y |
|---|---|---|
| Int64 | String | |
| 1 | 1 | a |
| 2 | 1 | b |
| 3 | 1 | c |
| 4 | 2 | a |
| 5 | 2 | b |
| 6 | 2 | c |
Complex cases of joins#
x = DataFrame(id1=[1, 1, 2, 2, missing, missing],
id2=[1, 11, 2, 21, missing, 99],
name=["Alice", "Bob", "Conor", "Dave", "Zed", "Zoe"]
)
y = DataFrame(id1=[1, 1, 3, 3, missing, missing],
id2=[11, 1, 31, 3, missing, 999],
age=[21, 22, 23, 24, 99, 100]
)
| Row | id1 | id2 | age |
|---|---|---|---|
| Int64? | Int64? | Int64 | |
| 1 | 1 | 11 | 21 |
| 2 | 1 | 1 | 22 |
| 3 | 3 | 31 | 23 |
| 4 | 3 | 3 | 24 |
| 5 | missing | missing | 99 |
| 6 | missing | 999 | 100 |
joining on two columns
innerjoin(x, y, on=[:id1, :id2], matchmissing=:equal)
| Row | id1 | id2 | name | age |
|---|---|---|---|---|
| Int64? | Int64? | String | Int64 | |
| 1 | 1 | 11 | Bob | 21 |
| 2 | 1 | 1 | Alice | 22 |
| 3 | missing | missing | Zed | 99 |
with duplicates all combinations are produced
outerjoin(x, y, on=:id1, makeunique=true, indicator=:source, matchmissing=:equal)
| Row | id1 | id2 | name | id2_1 | age | source |
|---|---|---|---|---|---|---|
| Int64? | Int64? | String? | Int64? | Int64? | String | |
| 1 | 1 | 1 | Alice | 11 | 21 | both |
| 2 | 1 | 11 | Bob | 11 | 21 | both |
| 3 | 1 | 1 | Alice | 1 | 22 | both |
| 4 | 1 | 11 | Bob | 1 | 22 | both |
| 5 | missing | missing | Zed | missing | 99 | both |
| 6 | missing | 99 | Zoe | missing | 99 | both |
| 7 | missing | missing | Zed | 999 | 100 | both |
| 8 | missing | 99 | Zoe | 999 | 100 | both |
| 9 | 2 | 2 | Conor | missing | missing | left_only |
| 10 | 2 | 21 | Dave | missing | missing | left_only |
| 11 | 3 | missing | missing | 31 | 23 | right_only |
| 12 | 3 | missing | missing | 3 | 24 | right_only |
you can force validation of uniqueness of key on which you join
try
innerjoin(x, y, on=:id1, makeunique=true, validate=(true, true), matchmissing=:equal)
catch e
show(e)
end
ArgumentError("Merge key(s) are not unique in both df1 and df2. df1 contains 3 duplicate keys: @NamedTuple{id1::Union{Missing, Int64}}((1,)), ..., @NamedTuple{id1::Union{Missing, Int64}}((missing,)). df2 contains 3 duplicate keys: @NamedTuple{id1::Union{Missing, Int64}}((1,)), ..., @NamedTuple{id1::Union{Missing, Int64}}((missing,)).")
mixed on argument for joining on multiple columns
x = DataFrame(id1=1:6, id2=[1, 2, 1, 2, 1, 2], x1='a':'f')
y = DataFrame(id1=1:6, ID2=1:6, x2='a':'f')
innerjoin(x, y, on=[:id1, :id2 => :ID2])
| Row | id1 | id2 | x1 | x2 |
|---|---|---|---|---|
| Int64 | Int64 | Char | Char | |
| 1 | 1 | 1 | a | a |
| 2 | 2 | 2 | b | b |
joining more than two data frames
xs = [DataFrame("id" => 1:6, "v$i" => ((1:6) .+ 10i)) for i in 1:5]
5-element Vector{DataFrames.DataFrame}:
6×2 DataFrame
Row │ id v1
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
5 │ 5 15
6 │ 6 16
6×2 DataFrame
Row │ id v2
│ Int64 Int64
─────┼──────────────
1 │ 1 21
2 │ 2 22
3 │ 3 23
4 │ 4 24
5 │ 5 25
6 │ 6 26
6×2 DataFrame
Row │ id v3
│ Int64 Int64
─────┼──────────────
1 │ 1 31
2 │ 2 32
3 │ 3 33
4 │ 4 34
5 │ 5 35
6 │ 6 36
6×2 DataFrame
Row │ id v4
│ Int64 Int64
─────┼──────────────
1 │ 1 41
2 │ 2 42
3 │ 3 43
4 │ 4 44
5 │ 5 45
6 │ 6 46
6×2 DataFrame
Row │ id v5
│ Int64 Int64
─────┼──────────────
1 │ 1 51
2 │ 2 52
3 │ 3 53
4 │ 4 54
5 │ 5 55
6 │ 6 56
innerjoin as an example, it also works for outerjoin and crossjoin
innerjoin(xs..., on=:id)
| Row | id | v1 | v2 | v3 | v4 | v5 |
|---|---|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | |
| 1 | 1 | 11 | 21 | 31 | 41 | 51 |
| 2 | 2 | 12 | 22 | 32 | 42 | 52 |
| 3 | 3 | 13 | 23 | 33 | 43 | 53 |
| 4 | 4 | 14 | 24 | 34 | 44 | 54 |
| 5 | 5 | 15 | 25 | 35 | 45 | 55 |
| 6 | 6 | 16 | 26 | 36 | 46 | 56 |
matchmissing keyword argument#
In general you have three options how missing values are handled in joins that are handled by matchmisssing kewyowrd argument value as follows:
:error: throw an error if missings are encountered (this is the default):equal: assumemisssingvalues are equal to themselves:notequal: assumemisssingvalues are not equal to themselves (not available forouterjoin)
Here are some examples comparing the options:
df1 = DataFrame(id=[1, 2, missing], x=1:3)
df2 = DataFrame(id=[1, missing, 3], y=1:3)
try
innerjoin(df1, df2, on=:id)
catch e
show(e)
end
ArgumentError("Missing values in key columns are not allowed when matchmissing == :error. `missing` found in column :id in left data frame.")
innerjoin(df1, df2, on=:id, matchmissing=:equal)
| Row | id | x | y |
|---|---|---|---|
| Int64? | Int64 | Int64 | |
| 1 | 1 | 1 | 1 |
| 2 | missing | 3 | 2 |
innerjoin(df1, df2, on=:id, matchmissing=:notequal)
| Row | id | x | y |
|---|---|---|---|
| Int64? | Int64 | Int64 | |
| 1 | 1 | 1 | 1 |
Since DataFrames.jl 1.3 you can do an efficient left join of two data frames in-place. This means that the left data frame gets updated with new columns, but the columns that exist in it are not affected. This operation requires that there are no duplicates of keys in the right data frame that match keys in left data frame:
df1
| Row | id | x |
|---|---|---|
| Int64? | Int64 | |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | missing | 3 |
df2
| Row | id | y |
|---|---|---|
| Int64? | Int64 | |
| 1 | 1 | 1 |
| 2 | missing | 2 |
| 3 | 3 | 3 |
leftjoin!(df1, df2, on=:id, matchmissing=:notequal)
| Row | id | x | y |
|---|---|---|---|
| Int64? | Int64 | Int64? | |
| 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | missing |
| 3 | missing | 3 | missing |
df1
| Row | id | x | y |
|---|---|---|---|
| Int64? | Int64 | Int64? | |
| 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | missing |
| 3 | missing | 3 | missing |
This notebook was generated using Literate.jl.