Joining DataFrames#

using DataFrames

Preparing DataFrames for a join#

x = DataFrame(ID=[1, 2, 3, 4, missing], name=["Alice", "Bob", "Conor", "Dave", "Zed"])
y = DataFrame(id=[1, 2, 5, 6, missing], age=[21, 22, 23, 24, 99])
5×2 DataFrame
Rowidage
Int64?Int64
1121
2222
3523
4624
5missing99

Rules for the on keyword argument:

  • a single Symbol or string if joining on one column with the same name, e.g. on=:id

  • a Pair of Symbols or string if joining on one column with different names, e.g. on=:id=>:id2

  • a vector of Symbols or strings if joining on multiple columns with the same name, e.g. on=[:id1, :id2]

  • a vector of Pairs of Symbols or strings if joining on multiple columns with different names, e.g. on=[:a1=>:a2, :b1=>:b2]

  • a vector containing a combination of Symbols or strings or Pair of Symbols or strings, e.g. on=[:a1=>:a2, :b1]

try
    innerjoin(x, y, on=:ID => :id) ## missing is not allowed to join-on by default
catch e
    show(e)
end
ArgumentError("Missing values in key columns are not allowed when matchmissing == :error. `missing` found in column :ID in left data frame.")
innerjoin(x, y, on=:ID => :id, matchmissing=:equal)
3×3 DataFrame
RowIDnameage
Int64?StringInt64
11Alice21
22Bob22
3missingZed99
leftjoin(x, y, on="ID" => "id", matchmissing=:equal)
5×3 DataFrame
RowIDnameage
Int64?StringInt64?
11Alice21
22Bob22
3missingZed99
43Conormissing
54Davemissing
rightjoin(x, y, on=:ID => :id, matchmissing=:equal)
5×3 DataFrame
RowIDnameage
Int64?String?Int64
11Alice21
22Bob22
3missingZed99
45missing23
56missing24
outerjoin(x, y, on=:ID => :id, matchmissing=:equal)
7×3 DataFrame
RowIDnameage
Int64?String?Int64?
11Alice21
22Bob22
3missingZed99
43Conormissing
54Davemissing
65missing23
76missing24
semijoin(x, y, on=:ID => :id, matchmissing=:equal)
3×2 DataFrame
RowIDname
Int64?String
11Alice
22Bob
3missingZed
antijoin(x, y, on=:ID => :id, matchmissing=:equal)
2×2 DataFrame
RowIDname
Int64?String
13Conor
24Dave

Cross join#

(here no on argument)

crossjoin(DataFrame(x=[1, 2]), DataFrame(y=["a", "b", "c"]))
6×2 DataFrame
Rowxy
Int64String
11a
21b
31c
42a
52b
62c

Complex cases of joins#

x = DataFrame(id1=[1, 1, 2, 2, missing, missing],
    id2=[1, 11, 2, 21, missing, 99],
    name=["Alice", "Bob", "Conor", "Dave", "Zed", "Zoe"]
)
y = DataFrame(id1=[1, 1, 3, 3, missing, missing],
    id2=[11, 1, 31, 3, missing, 999],
    age=[21, 22, 23, 24, 99, 100]
)
6×3 DataFrame
Rowid1id2age
Int64?Int64?Int64
111121
21122
333123
43324
5missingmissing99
6missing999100

joining on two columns

innerjoin(x, y, on=[:id1, :id2], matchmissing=:equal)
3×4 DataFrame
Rowid1id2nameage
Int64?Int64?StringInt64
1111Bob21
211Alice22
3missingmissingZed99

with duplicates all combinations are produced

outerjoin(x, y, on=:id1, makeunique=true, indicator=:source, matchmissing=:equal)
12×6 DataFrame
Rowid1id2nameid2_1agesource
Int64?Int64?String?Int64?Int64?String
111Alice1121both
2111Bob1121both
311Alice122both
4111Bob122both
5missingmissingZedmissing99both
6missing99Zoemissing99both
7missingmissingZed999100both
8missing99Zoe999100both
922Conormissingmissingleft_only
10221Davemissingmissingleft_only
113missingmissing3123right_only
123missingmissing324right_only

you can force validation of uniqueness of key on which you join

try
    innerjoin(x, y, on=:id1, makeunique=true, validate=(true, true), matchmissing=:equal)
catch e
    show(e)
end
ArgumentError("Merge key(s) are not unique in both df1 and df2. df1 contains 3 duplicate keys: @NamedTuple{id1::Union{Missing, Int64}}((1,)), ..., @NamedTuple{id1::Union{Missing, Int64}}((missing,)). df2 contains 3 duplicate keys: @NamedTuple{id1::Union{Missing, Int64}}((1,)), ..., @NamedTuple{id1::Union{Missing, Int64}}((missing,)).")

mixed on argument for joining on multiple columns

x = DataFrame(id1=1:6, id2=[1, 2, 1, 2, 1, 2], x1='a':'f')
y = DataFrame(id1=1:6, ID2=1:6, x2='a':'f')
innerjoin(x, y, on=[:id1, :id2 => :ID2])
2×4 DataFrame
Rowid1id2x1x2
Int64Int64CharChar
111aa
222bb

joining more than two data frames

xs = [DataFrame("id" => 1:6, "v$i" => ((1:6) .+ 10i)) for i in 1:5]
5-element Vector{DataFrames.DataFrame}:
 6×2 DataFrame
 Row  id     v1    
     │ Int64  Int64 
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15
   6 │     6     16
 6×2 DataFrame
 Row  id     v2    
     │ Int64  Int64 
─────┼──────────────
   1 │     1     21
   2 │     2     22
   3 │     3     23
   4 │     4     24
   5 │     5     25
   6 │     6     26
 6×2 DataFrame
 Row  id     v3    
     │ Int64  Int64 
─────┼──────────────
   1 │     1     31
   2 │     2     32
   3 │     3     33
   4 │     4     34
   5 │     5     35
   6 │     6     36
 6×2 DataFrame
 Row  id     v4    
     │ Int64  Int64 
─────┼──────────────
   1 │     1     41
   2 │     2     42
   3 │     3     43
   4 │     4     44
   5 │     5     45
   6 │     6     46
 6×2 DataFrame
 Row  id     v5    
     │ Int64  Int64 
─────┼──────────────
   1 │     1     51
   2 │     2     52
   3 │     3     53
   4 │     4     54
   5 │     5     55
   6 │     6     56

innerjoin as an example, it also works for outerjoin and crossjoin

innerjoin(xs..., on=:id)
6×6 DataFrame
Rowidv1v2v3v4v5
Int64Int64Int64Int64Int64Int64
111121314151
221222324252
331323334353
441424344454
551525354555
661626364656

matchmissing keyword argument#

In general you have three options how missing values are handled in joins that are handled by matchmisssing kewyowrd argument value as follows:

  • :error: throw an error if missings are encountered (this is the default)

  • :equal: assume misssing values are equal to themselves

  • :notequal: assume misssing values are not equal to themselves (not available for outerjoin)

Here are some examples comparing the options:

df1 = DataFrame(id=[1, 2, missing], x=1:3)
df2 = DataFrame(id=[1, missing, 3], y=1:3)

try
    innerjoin(df1, df2, on=:id)
catch e
    show(e)
end
ArgumentError("Missing values in key columns are not allowed when matchmissing == :error. `missing` found in column :id in left data frame.")
innerjoin(df1, df2, on=:id, matchmissing=:equal)
2×3 DataFrame
Rowidxy
Int64?Int64Int64
1111
2missing32
innerjoin(df1, df2, on=:id, matchmissing=:notequal)
1×3 DataFrame
Rowidxy
Int64?Int64Int64
1111

Since DataFrames.jl 1.3 you can do an efficient left join of two data frames in-place. This means that the left data frame gets updated with new columns, but the columns that exist in it are not affected. This operation requires that there are no duplicates of keys in the right data frame that match keys in left data frame:

df1
3×2 DataFrame
Rowidx
Int64?Int64
111
222
3missing3
df2
3×2 DataFrame
Rowidy
Int64?Int64
111
2missing2
333
leftjoin!(df1, df2, on=:id, matchmissing=:notequal)
3×3 DataFrame
Rowidxy
Int64?Int64Int64?
1111
222missing
3missing3missing
df1
3×3 DataFrame
Rowidxy
Int64?Int64Int64?
1111
222missing
3missing3missing

This notebook was generated using Literate.jl.