Joining DataFrames#
using DataFrames
Preparing DataFrames for a join#
x = DataFrame(ID=[1, 2, 3, 4, missing], name=["Alice", "Bob", "Conor", "Dave", "Zed"])
y = DataFrame(id=[1, 2, 5, 6, missing], age=[21, 22, 23, 24, 99])
Row | id | age |
---|---|---|
Int64? | Int64 | |
1 | 1 | 21 |
2 | 2 | 22 |
3 | 5 | 23 |
4 | 6 | 24 |
5 | missing | 99 |
Rules for the on
keyword argument:
a single
Symbol
or string if joining on one column with the same name, e.g.on=:id
a
Pair
ofSymbol
s or string if joining on one column with different names, e.g.on=:id=>:id2
a vector of
Symbol
s or strings if joining on multiple columns with the same name, e.g.on=[:id1, :id2]
a vector of
Pair
s ofSymbol
s or strings if joining on multiple columns with different names, e.g.on=[:a1=>:a2, :b1=>:b2]
a vector containing a combination of
Symbol
s or strings orPair
ofSymbol
s or strings, e.g.on=[:a1=>:a2, :b1]
try
innerjoin(x, y, on=:ID => :id) ## missing is not allowed to join-on by default
catch e
show(e)
end
ArgumentError("Missing values in key columns are not allowed when matchmissing == :error. `missing` found in column :ID in left data frame.")
innerjoin(x, y, on=:ID => :id, matchmissing=:equal)
Row | ID | name | age |
---|---|---|---|
Int64? | String | Int64 | |
1 | 1 | Alice | 21 |
2 | 2 | Bob | 22 |
3 | missing | Zed | 99 |
leftjoin(x, y, on="ID" => "id", matchmissing=:equal)
Row | ID | name | age |
---|---|---|---|
Int64? | String | Int64? | |
1 | 1 | Alice | 21 |
2 | 2 | Bob | 22 |
3 | missing | Zed | 99 |
4 | 3 | Conor | missing |
5 | 4 | Dave | missing |
rightjoin(x, y, on=:ID => :id, matchmissing=:equal)
Row | ID | name | age |
---|---|---|---|
Int64? | String? | Int64 | |
1 | 1 | Alice | 21 |
2 | 2 | Bob | 22 |
3 | missing | Zed | 99 |
4 | 5 | missing | 23 |
5 | 6 | missing | 24 |
outerjoin(x, y, on=:ID => :id, matchmissing=:equal)
Row | ID | name | age |
---|---|---|---|
Int64? | String? | Int64? | |
1 | 1 | Alice | 21 |
2 | 2 | Bob | 22 |
3 | missing | Zed | 99 |
4 | 3 | Conor | missing |
5 | 4 | Dave | missing |
6 | 5 | missing | 23 |
7 | 6 | missing | 24 |
semijoin(x, y, on=:ID => :id, matchmissing=:equal)
Row | ID | name |
---|---|---|
Int64? | String | |
1 | 1 | Alice |
2 | 2 | Bob |
3 | missing | Zed |
antijoin(x, y, on=:ID => :id, matchmissing=:equal)
Row | ID | name |
---|---|---|
Int64? | String | |
1 | 3 | Conor |
2 | 4 | Dave |
Cross join#
(here no on
argument)
crossjoin(DataFrame(x=[1, 2]), DataFrame(y=["a", "b", "c"]))
Row | x | y |
---|---|---|
Int64 | String | |
1 | 1 | a |
2 | 1 | b |
3 | 1 | c |
4 | 2 | a |
5 | 2 | b |
6 | 2 | c |
Complex cases of joins#
x = DataFrame(id1=[1, 1, 2, 2, missing, missing],
id2=[1, 11, 2, 21, missing, 99],
name=["Alice", "Bob", "Conor", "Dave", "Zed", "Zoe"]
)
y = DataFrame(id1=[1, 1, 3, 3, missing, missing],
id2=[11, 1, 31, 3, missing, 999],
age=[21, 22, 23, 24, 99, 100]
)
Row | id1 | id2 | age |
---|---|---|---|
Int64? | Int64? | Int64 | |
1 | 1 | 11 | 21 |
2 | 1 | 1 | 22 |
3 | 3 | 31 | 23 |
4 | 3 | 3 | 24 |
5 | missing | missing | 99 |
6 | missing | 999 | 100 |
joining on two columns
innerjoin(x, y, on=[:id1, :id2], matchmissing=:equal)
Row | id1 | id2 | name | age |
---|---|---|---|---|
Int64? | Int64? | String | Int64 | |
1 | 1 | 11 | Bob | 21 |
2 | 1 | 1 | Alice | 22 |
3 | missing | missing | Zed | 99 |
with duplicates all combinations are produced
outerjoin(x, y, on=:id1, makeunique=true, indicator=:source, matchmissing=:equal)
Row | id1 | id2 | name | id2_1 | age | source |
---|---|---|---|---|---|---|
Int64? | Int64? | String? | Int64? | Int64? | String | |
1 | 1 | 1 | Alice | 11 | 21 | both |
2 | 1 | 11 | Bob | 11 | 21 | both |
3 | 1 | 1 | Alice | 1 | 22 | both |
4 | 1 | 11 | Bob | 1 | 22 | both |
5 | missing | missing | Zed | missing | 99 | both |
6 | missing | 99 | Zoe | missing | 99 | both |
7 | missing | missing | Zed | 999 | 100 | both |
8 | missing | 99 | Zoe | 999 | 100 | both |
9 | 2 | 2 | Conor | missing | missing | left_only |
10 | 2 | 21 | Dave | missing | missing | left_only |
11 | 3 | missing | missing | 31 | 23 | right_only |
12 | 3 | missing | missing | 3 | 24 | right_only |
you can force validation of uniqueness of key on which you join
try
innerjoin(x, y, on=:id1, makeunique=true, validate=(true, true), matchmissing=:equal)
catch e
show(e)
end
ArgumentError("Merge key(s) are not unique in both df1 and df2. df1 contains 3 duplicate keys: @NamedTuple{id1::Union{Missing, Int64}}((1,)), ..., @NamedTuple{id1::Union{Missing, Int64}}((missing,)). df2 contains 3 duplicate keys: @NamedTuple{id1::Union{Missing, Int64}}((1,)), ..., @NamedTuple{id1::Union{Missing, Int64}}((missing,)).")
mixed on
argument for joining on multiple columns
x = DataFrame(id1=1:6, id2=[1, 2, 1, 2, 1, 2], x1='a':'f')
y = DataFrame(id1=1:6, ID2=1:6, x2='a':'f')
innerjoin(x, y, on=[:id1, :id2 => :ID2])
Row | id1 | id2 | x1 | x2 |
---|---|---|---|---|
Int64 | Int64 | Char | Char | |
1 | 1 | 1 | a | a |
2 | 2 | 2 | b | b |
joining more than two data frames
xs = [DataFrame("id" => 1:6, "v$i" => ((1:6) .+ 10i)) for i in 1:5]
5-element Vector{DataFrames.DataFrame}:
6×2 DataFrame
Row │ id v1
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
5 │ 5 15
6 │ 6 16
6×2 DataFrame
Row │ id v2
│ Int64 Int64
─────┼──────────────
1 │ 1 21
2 │ 2 22
3 │ 3 23
4 │ 4 24
5 │ 5 25
6 │ 6 26
6×2 DataFrame
Row │ id v3
│ Int64 Int64
─────┼──────────────
1 │ 1 31
2 │ 2 32
3 │ 3 33
4 │ 4 34
5 │ 5 35
6 │ 6 36
6×2 DataFrame
Row │ id v4
│ Int64 Int64
─────┼──────────────
1 │ 1 41
2 │ 2 42
3 │ 3 43
4 │ 4 44
5 │ 5 45
6 │ 6 46
6×2 DataFrame
Row │ id v5
│ Int64 Int64
─────┼──────────────
1 │ 1 51
2 │ 2 52
3 │ 3 53
4 │ 4 54
5 │ 5 55
6 │ 6 56
innerjoin
as an example, it also works for outerjoin
and crossjoin
innerjoin(xs..., on=:id)
Row | id | v1 | v2 | v3 | v4 | v5 |
---|---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 1 | 11 | 21 | 31 | 41 | 51 |
2 | 2 | 12 | 22 | 32 | 42 | 52 |
3 | 3 | 13 | 23 | 33 | 43 | 53 |
4 | 4 | 14 | 24 | 34 | 44 | 54 |
5 | 5 | 15 | 25 | 35 | 45 | 55 |
6 | 6 | 16 | 26 | 36 | 46 | 56 |
matchmissing keyword argument#
In general you have three options how missing
values are handled in joins that are handled by matchmisssing
kewyowrd argument value as follows:
:error
: throw an error if missings are encountered (this is the default):equal
: assumemisssing
values are equal to themselves:notequal
: assumemisssing
values are not equal to themselves (not available forouterjoin
)
Here are some examples comparing the options:
df1 = DataFrame(id=[1, 2, missing], x=1:3)
df2 = DataFrame(id=[1, missing, 3], y=1:3)
try
innerjoin(df1, df2, on=:id)
catch e
show(e)
end
ArgumentError("Missing values in key columns are not allowed when matchmissing == :error. `missing` found in column :id in left data frame.")
innerjoin(df1, df2, on=:id, matchmissing=:equal)
Row | id | x | y |
---|---|---|---|
Int64? | Int64 | Int64 | |
1 | 1 | 1 | 1 |
2 | missing | 3 | 2 |
innerjoin(df1, df2, on=:id, matchmissing=:notequal)
Row | id | x | y |
---|---|---|---|
Int64? | Int64 | Int64 | |
1 | 1 | 1 | 1 |
Since DataFrames.jl 1.3 you can do an efficient left join of two data frames in-place. This means that the left data frame gets updated with new columns, but the columns that exist in it are not affected. This operation requires that there are no duplicates of keys in the right data frame that match keys in left data frame:
df1
Row | id | x |
---|---|---|
Int64? | Int64 | |
1 | 1 | 1 |
2 | 2 | 2 |
3 | missing | 3 |
df2
Row | id | y |
---|---|---|
Int64? | Int64 | |
1 | 1 | 1 |
2 | missing | 2 |
3 | 3 | 3 |
leftjoin!(df1, df2, on=:id, matchmissing=:notequal)
Row | id | x | y |
---|---|---|---|
Int64? | Int64 | Int64? | |
1 | 1 | 1 | 1 |
2 | 2 | 2 | missing |
3 | missing | 3 | missing |
df1
Row | id | x | y |
---|---|---|---|
Int64? | Int64 | Int64? | |
1 | 1 | 1 | 1 |
2 | 2 | 2 | missing |
3 | missing | 3 | missing |
This notebook was generated using Literate.jl.