Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

using DataFrames

Preparing DataFrames for a join

x = DataFrame(ID=[1, 2, 3, 4, missing], name=["Alice", "Bob", "Conor", "Dave", "Zed"])
y = DataFrame(id=[1, 2, 5, 6, missing], age=[21, 22, 23, 24, 99])
Loading...

Rules for the on keyword argument:

  • a single Symbol or string if joining on one column with the same name, e.g. on=:id

  • a Pair of Symbols or string if joining on one column with different names, e.g. on=:id=>:id2

  • a vector of Symbols or strings if joining on multiple columns with the same name, e.g. on=[:id1, :id2]

  • a vector of Pairs of Symbols or strings if joining on multiple columns with different names, e.g. on=[:a1=>:a2, :b1=>:b2]

  • a vector containing a combination of Symbols or strings or Pair of Symbols or strings, e.g. on=[:a1=>:a2, :b1]

try
    innerjoin(x, y, on=:ID => :id) ## missing is not allowed to join-on by default
catch e
    show(e)
end
ArgumentError("Missing values in key columns are not allowed when matchmissing == :error. `missing` found in column :ID in left data frame.")
innerjoin(x, y, on=:ID => :id, matchmissing=:equal)
Loading...
leftjoin(x, y, on="ID" => "id", matchmissing=:equal)
Loading...
rightjoin(x, y, on=:ID => :id, matchmissing=:equal)
Loading...
outerjoin(x, y, on=:ID => :id, matchmissing=:equal)
Loading...
semijoin(x, y, on=:ID => :id, matchmissing=:equal)
Loading...
antijoin(x, y, on=:ID => :id, matchmissing=:equal)
Loading...

Cross join

(here no on argument)

crossjoin(DataFrame(x=[1, 2]), DataFrame(y=["a", "b", "c"]))
Loading...

Complex cases of joins

x = DataFrame(id1=[1, 1, 2, 2, missing, missing],
    id2=[1, 11, 2, 21, missing, 99],
    name=["Alice", "Bob", "Conor", "Dave", "Zed", "Zoe"]
)
y = DataFrame(id1=[1, 1, 3, 3, missing, missing],
    id2=[11, 1, 31, 3, missing, 999],
    age=[21, 22, 23, 24, 99, 100]
)
Loading...

joining on two columns

innerjoin(x, y, on=[:id1, :id2], matchmissing=:equal)
Loading...

with duplicates all combinations are produced

outerjoin(x, y, on=:id1, makeunique=true, indicator=:source, matchmissing=:equal)
Loading...

you can force validation of uniqueness of key on which you join

try
    innerjoin(x, y, on=:id1, makeunique=true, validate=(true, true), matchmissing=:equal)
catch e
    show(e)
end
ArgumentError("Merge key(s) are not unique in both df1 and df2. df1 contains 3 duplicate keys: @NamedTuple{id1::Union{Missing, Int64}}((1,)), ..., @NamedTuple{id1::Union{Missing, Int64}}((missing,)). df2 contains 3 duplicate keys: @NamedTuple{id1::Union{Missing, Int64}}((1,)), ..., @NamedTuple{id1::Union{Missing, Int64}}((missing,)).")

mixed on argument for joining on multiple columns

x = DataFrame(id1=1:6, id2=[1, 2, 1, 2, 1, 2], x1='a':'f')
y = DataFrame(id1=1:6, ID2=1:6, x2='a':'f')
innerjoin(x, y, on=[:id1, :id2 => :ID2])
Loading...

joining more than two data frames

xs = [DataFrame("id" => 1:6, "v$i" => ((1:6) .+ 10i)) for i in 1:5]
5-element Vector{DataFrames.DataFrame}: 6×2 DataFrame Row id v1 Int64 Int64 ─────┼────────────── 1 │ 1 11 2 │ 2 12 3 │ 3 13 4 │ 4 14 5 │ 5 15 6 │ 6 16 6×2 DataFrame Row id v2 Int64 Int64 ─────┼────────────── 1 │ 1 21 2 │ 2 22 3 │ 3 23 4 │ 4 24 5 │ 5 25 6 │ 6 26 6×2 DataFrame Row id v3 Int64 Int64 ─────┼────────────── 1 │ 1 31 2 │ 2 32 3 │ 3 33 4 │ 4 34 5 │ 5 35 6 │ 6 36 6×2 DataFrame Row id v4 Int64 Int64 ─────┼────────────── 1 │ 1 41 2 │ 2 42 3 │ 3 43 4 │ 4 44 5 │ 5 45 6 │ 6 46 6×2 DataFrame Row id v5 Int64 Int64 ─────┼────────────── 1 │ 1 51 2 │ 2 52 3 │ 3 53 4 │ 4 54 5 │ 5 55 6 │ 6 56

innerjoin as an example, it also works for outerjoin and crossjoin

innerjoin(xs..., on=:id)
Loading...

matchmissing keyword argument

In general you have three options how missing values are handled in joins that are handled by matchmisssing kewyowrd argument value as follows:

  • :error: throw an error if missings are encountered (this is the default)

  • :equal: assume misssing values are equal to themselves

  • :notequal: assume misssing values are not equal to themselves (not available for outerjoin)

Here are some examples comparing the options:

df1 = DataFrame(id=[1, 2, missing], x=1:3)
df2 = DataFrame(id=[1, missing, 3], y=1:3)

try
    innerjoin(df1, df2, on=:id)
catch e
    show(e)
end
ArgumentError("Missing values in key columns are not allowed when matchmissing == :error. `missing` found in column :id in left data frame.")
innerjoin(df1, df2, on=:id, matchmissing=:equal)
Loading...
innerjoin(df1, df2, on=:id, matchmissing=:notequal)
Loading...

Since DataFrames.jl 1.3 you can do an efficient left join of two data frames in-place. This means that the left data frame gets updated with new columns, but the columns that exist in it are not affected. This operation requires that there are no duplicates of keys in the right data frame that match keys in left data frame:

df1
Loading...
df2
Loading...
leftjoin!(df1, df2, on=:id, matchmissing=:notequal)
Loading...
df1
Loading...

This notebook was generated using Literate.jl.