Joining DataFrames

Joining DataFrames#

using DataFrames

Preparing DataFrames for a join#

x = DataFrame(ID=[1, 2, 3, 4, missing], name=["Alice", "Bob", "Conor", "Dave", "Zed"])
y = DataFrame(id=[1, 2, 5, 6, missing], age=[21, 22, 23, 24, 99])

5×2 DataFrame

Row	id	age
	Int64?	Int64
1	1	21
2	2	22
3	5	23
4	6	24
5	missing	99

Rules for the on keyword argument:

a single Symbol or string if joining on one column with the same name, e.g. on=:id
a Pair of Symbols or string if joining on one column with different names, e.g. on=:id=>:id2
a vector of Symbols or strings if joining on multiple columns with the same name, e.g. on=[:id1, :id2]
a vector of Pairs of Symbols or strings if joining on multiple columns with different names, e.g. on=[:a1=>:a2, :b1=>:b2]
a vector containing a combination of Symbols or strings or Pair of Symbols or strings, e.g. on=[:a1=>:a2, :b1]

try
    innerjoin(x, y, on=:ID => :id) ## missing is not allowed to join-on by default
catch e
    show(e)
end

ArgumentError("Missing values in key columns are not allowed when matchmissing == :error. `missing` found in column :ID in left data frame.")

innerjoin(x, y, on=:ID => :id, matchmissing=:equal)

3×3 DataFrame

Row	ID	name	age
	Int64?	String	Int64
1	1	Alice	21
2	2	Bob	22
3	missing	Zed	99

leftjoin(x, y, on="ID" => "id", matchmissing=:equal)

5×3 DataFrame

Row	ID	name	age
	Int64?	String	Int64?
1	1	Alice	21
2	2	Bob	22
3	missing	Zed	99
4	3	Conor	missing
5	4	Dave	missing

rightjoin(x, y, on=:ID => :id, matchmissing=:equal)

5×3 DataFrame

Row	ID	name	age
	Int64?	String?	Int64
1	1	Alice	21
2	2	Bob	22
3	missing	Zed	99
4	5	missing	23
5	6	missing	24

outerjoin(x, y, on=:ID => :id, matchmissing=:equal)

7×3 DataFrame

Row	ID	name	age
	Int64?	String?	Int64?
1	1	Alice	21
2	2	Bob	22
3	missing	Zed	99
4	3	Conor	missing
5	4	Dave	missing
6	5	missing	23
7	6	missing	24

semijoin(x, y, on=:ID => :id, matchmissing=:equal)

3×2 DataFrame

Row	ID	name
	Int64?	String
1	1	Alice
2	2	Bob
3	missing	Zed

antijoin(x, y, on=:ID => :id, matchmissing=:equal)

2×2 DataFrame

Row	ID	name
	Int64?	String
1	3	Conor
2	4	Dave

Cross join#

(here no on argument)

crossjoin(DataFrame(x=[1, 2]), DataFrame(y=["a", "b", "c"]))

6×2 DataFrame

Row	x	y
	Int64	String
1	1	a
2	1	b
3	1	c
4	2	a
5	2	b
6	2	c

Complex cases of joins#

x = DataFrame(id1=[1, 1, 2, 2, missing, missing],
    id2=[1, 11, 2, 21, missing, 99],
    name=["Alice", "Bob", "Conor", "Dave", "Zed", "Zoe"]
)
y = DataFrame(id1=[1, 1, 3, 3, missing, missing],
    id2=[11, 1, 31, 3, missing, 999],
    age=[21, 22, 23, 24, 99, 100]
)

6×3 DataFrame

Row	id1	id2	age
	Int64?	Int64?	Int64
1	1	11	21
2	1	1	22
3	3	31	23
4	3	3	24
5	missing	missing	99
6	missing	999	100

joining on two columns

innerjoin(x, y, on=[:id1, :id2], matchmissing=:equal)

3×4 DataFrame

Row	id1	id2	name	age
	Int64?	Int64?	String	Int64
1	1	11	Bob	21
2	1	1	Alice	22
3	missing	missing	Zed	99

with duplicates all combinations are produced

outerjoin(x, y, on=:id1, makeunique=true, indicator=:source, matchmissing=:equal)

12×6 DataFrame

Row	id1	id2	name	id2_1	age	source
	Int64?	Int64?	String?	Int64?	Int64?	String
1	1	1	Alice	11	21	both
2	1	11	Bob	11	21	both
3	1	1	Alice	1	22	both
4	1	11	Bob	1	22	both
5	missing	missing	Zed	missing	99	both
6	missing	99	Zoe	missing	99	both
7	missing	missing	Zed	999	100	both
8	missing	99	Zoe	999	100	both
9	2	2	Conor	missing	missing	left_only
10	2	21	Dave	missing	missing	left_only
11	3	missing	missing	31	23	right_only
12	3	missing	missing	3	24	right_only

you can force validation of uniqueness of key on which you join

try
    innerjoin(x, y, on=:id1, makeunique=true, validate=(true, true), matchmissing=:equal)
catch e
    show(e)
end

ArgumentError("Merge key(s) are not unique in both df1 and df2. df1 contains 3 duplicate keys: @NamedTuple{id1::Union{Missing, Int64}}((1,)), ..., @NamedTuple{id1::Union{Missing, Int64}}((missing,)). df2 contains 3 duplicate keys: @NamedTuple{id1::Union{Missing, Int64}}((1,)), ..., @NamedTuple{id1::Union{Missing, Int64}}((missing,)).")

mixed on argument for joining on multiple columns

x = DataFrame(id1=1:6, id2=[1, 2, 1, 2, 1, 2], x1='a':'f')
y = DataFrame(id1=1:6, ID2=1:6, x2='a':'f')
innerjoin(x, y, on=[:id1, :id2 => :ID2])

2×4 DataFrame

Row	id1	id2	x1	x2
	Int64	Int64	Char	Char
1	1	1	a	a
2	2	2	b	b

joining more than two data frames

xs = [DataFrame("id" => 1:6, "v$i" => ((1:6) .+ 10i)) for i in 1:5]

5-element Vector{DataFrames.DataFrame}:
 6×2 DataFrame
 Row │ id     v1    
     │ Int64  Int64 
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15
   6 │     6     16
 6×2 DataFrame
 Row │ id     v2    
     │ Int64  Int64 
─────┼──────────────
   1 │     1     21
   2 │     2     22
   3 │     3     23
   4 │     4     24
   5 │     5     25
   6 │     6     26
 6×2 DataFrame
 Row │ id     v3    
     │ Int64  Int64 
─────┼──────────────
   1 │     1     31
   2 │     2     32
   3 │     3     33
   4 │     4     34
   5 │     5     35
   6 │     6     36
 6×2 DataFrame
 Row │ id     v4    
     │ Int64  Int64 
─────┼──────────────
   1 │     1     41
   2 │     2     42
   3 │     3     43
   4 │     4     44
   5 │     5     45
   6 │     6     46
 6×2 DataFrame
 Row │ id     v5    
     │ Int64  Int64 
─────┼──────────────
   1 │     1     51
   2 │     2     52
   3 │     3     53
   4 │     4     54
   5 │     5     55
   6 │     6     56

innerjoin as an example, it also works for outerjoin and crossjoin

innerjoin(xs..., on=:id)

6×6 DataFrame

Row	id	v1	v2	v3	v4	v5
	Int64	Int64	Int64	Int64	Int64	Int64
1	1	11	21	31	41	51
2	2	12	22	32	42	52
3	3	13	23	33	43	53
4	4	14	24	34	44	54
5	5	15	25	35	45	55
6	6	16	26	36	46	56

matchmissing keyword argument#

In general you have three options how missing values are handled in joins that are handled by matchmisssing kewyowrd argument value as follows:

:error: throw an error if missings are encountered (this is the default)
:equal: assume misssing values are equal to themselves
:notequal: assume misssing values are not equal to themselves (not available for outerjoin)

Here are some examples comparing the options:

df1 = DataFrame(id=[1, 2, missing], x=1:3)
df2 = DataFrame(id=[1, missing, 3], y=1:3)

try
    innerjoin(df1, df2, on=:id)
catch e
    show(e)
end

ArgumentError("Missing values in key columns are not allowed when matchmissing == :error. `missing` found in column :id in left data frame.")

innerjoin(df1, df2, on=:id, matchmissing=:equal)

2×3 DataFrame

Row	id	x	y
	Int64?	Int64	Int64
1	1	1	1
2	missing	3	2

innerjoin(df1, df2, on=:id, matchmissing=:notequal)

1×3 DataFrame

Row	id	x	y
	Int64?	Int64	Int64
1	1	1	1

Since DataFrames.jl 1.3 you can do an efficient left join of two data frames in-place. This means that the left data frame gets updated with new columns, but the columns that exist in it are not affected. This operation requires that there are no duplicates of keys in the right data frame that match keys in left data frame:

df1

3×2 DataFrame

Row	id	x
	Int64?	Int64
1	1	1
2	2	2
3	missing	3

df2

3×2 DataFrame

Row	id	y
	Int64?	Int64
1	1	1
2	missing	2
3	3	3

leftjoin!(df1, df2, on=:id, matchmissing=:notequal)

3×3 DataFrame

Row	id	x	y
	Int64?	Int64	Int64?
1	1	1	1
2	2	2	missing
3	missing	3	missing

df1

3×3 DataFrame

Row	id	x	y
	Int64?	Int64	Int64?
1	1	1	1
2	2	2	missing
3	missing	3	missing

This notebook was generated using Literate.jl.

Joining DataFrames

Contents

Joining DataFrames#

Preparing DataFrames for a join#

Cross join#

Complex cases of joins#

matchmissing keyword argument#