A modern guide to SQL JOINs
There are many SQL JOINs guides and tutorials, but this one takes a different approach. We try to avoid misleading wording and imagery, and we structure the material in a different way. The goal of this article is to clarify your mental model.
Draft outline (this is work in progress)
Chapter 1. We start with LEFT JOIN.
- We suggest using canonical syntax for clarity.
- We suggest using only ID equality in the ON condition;
- “Learning foreign language” metaphor: why you need to know more words and expressions than you actually use;
- Employees/payments example dataset;
- First we introduce the N:1 case of LEFT JOIN.
- SQL may be too permissive for our own good;
- We discuss the 1:N case of LEFT JOIN and why the result dataset is weird.
- We show typical mistakes that can happen if you use anything except for ID equality in the ON condition.
- We discuss self-join (e.g., employee/manager scenario);
- We present fully general LEFT JOIN algorithm based on imperative execution model;
Chapter 2. We continue with INNER JOIN. [TBW]
- We suggest using canonical syntax for clarity, and ID equality in the ON condition;
- We show that INNER JOIN is just a Cartesian product with filtering;
- We introduce the N:1 case of INNER JOIN;
- We introduce the idea of implicit filtering;
- We discuss why the 1:N case of INNER JOIN is not weird;
- We show that ON and WHERE are interchangeable in INNER JOIN;
- We discuss similarities between INNER JOIN and LEFT JOIN;
- Non-canonical syntax for INNER JOIN;
Chapter 3. Critique of common misleading wording and imagery [TBW]
- why Venn diagrams are a bad metaphor for JOINs (and the bearded guy picture too);
- why RIGHT JOIN is a waste of words and pixels;
- why “returns all rows” is misleading;
- why declarative interpretation of LEFT JOIN is not very useful;
- why foreign keys are ignored in JOINs;
Chapter 4. Advanced cases [TBW]
- LEFT JOIN and GROUP BY case;
- Some exotic ON conditions in LEFT JOIN;
- A different view on self-joins;
- a link between JOINs and entity relationships;
- multi-table JOINs (https://kb.databasedesignbook.com/posts/systematic-design-of-join-queries/);
- subqueries;
- correlated subqueries;
- lesser known: FULL OUTER JOIN, CROSS APPLY, OUTER APPLY, lateral joins;
Prerequisites
You need a basic understanding of basic SQL. You may have read something about JOINs: tutorials, textbooks, watched some Youtube videos or whatever, but you feel that you don’t have a good understanding. This text is aimed at fixing your understanding by clarifying your mental model.
The goal of each section is to give you something surprising or enlightening. I’m not sure if that goal has always been achieved.
Table and data definitions and all SQL queries are here: https://dbfiddle.uk/sDnmf_qA.
Table of contents
This is work in progress, subscribe here to receive updates.
Starting with LEFT JOIN
Use canonical syntax
Historically, there are many variations of SQL syntax. Generally speaking, you need to know various dialects to be able to read existing code that you may encounter. But here we’re going to primarily use a canonical simple syntax. This makes it easier to understand the meaning of different queries.
For LEFT JOIN we’re going to use a standard syntax:
SELECT tableA.*, tableB.*
FROM tableA LEFT JOIN tableB ON tableA.id = tableB.another_id
WHERE …
Here we have two different tables, called “tableA” and “tableB”.
tableA goes first, tableB goes second: table order is important. “ON tableA.id = tableB.another_id” is called “ON condition”; it is
different from “WHERE condition”.
Use only ID equality in ON condition
Generally speaking, ON condition can be arbitrary (maybe even more than you thought, see below). We’ll discuss the full range of ON conditions in the second half of this chapter.
Here we’re going to primarily use a restricted case of ON conditions: ID equality. Getting back to our abstract example above:
SELECT tableA.*, tableB.*
FROM tableA LEFT JOIN tableB ON tableA.id = tableB.another_id
WHERE …
Look at the ON condition: it is a simple equality of two IDs: one from the first table, another from the second table.
Any other conditions, if needed, go into WHERE condition.
Why do we insist on that restriction? Virtually all practical SQL queries could be structured in this way. Moreover, later we’ll show how deviations from this discipline can lead to bugs. Don’t worry, we’ll have some fun by discussing some truly exotic ON conditions later in part 4.
“Learning foreign language” metaphor
When you learn a foreign language, there is an inherent asymmetry in what you can read and hear vs what you would write and say.
You can learn a relatively smaller number of words and expressions to say what you want. But other people could use different words and expressions even for the same concepts.
To understand them, you need to learn some words and expressions that you won’t even be using yourself. You can always use “thank you”, but eventually you need to learn that “much obliged” also means expressing gratitude.
Same with SQL (and other programming languages). You need to understand syntax variations, SQL dialects, different ways of designing tables, etc.
In this text we present a very disciplined way of writing SQL queries. We discuss other ways that are technically valid but may be confusing.
Unfortunately, it’s impossible to fully cover all variations. Some queries that you will encounter will genuinely be hard to read and understand, this is just a fact of life.
Example dataset: employees/payments
To demonstrate all the cases we want to discuss, we’re going to use a single tiny database consisting of two tables. In our imaginary scenario, we employ some people and we pay them money more or less regularly. Employees could be either full-time or contractors. Payments could be either salary or bonus. For people, we have names; for payments: amount and date. Additionally, some people are managers to other people, hierarchically. Finally, sometimes we send some payments to other companies, so payments.employee_id could be NULL.
Here is a straightforward table design:
CREATE TABLE people (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(64) NOT NULL,
type VARCHAR(16) NOT NULL,
manager_id INTEGER NULL,
KEY(manager_id)
);
CREATE TABLE payments (
id INTEGER NOT NULL PRIMARY KEY,
employee_id INTEGER NULL,
type VARCHAR(16) NOT NULL,
amount INTEGER NOT NULL,
date DATE NOT NULL,
description VARCHAR(64) NOT NULL,
KEY(employee_id)
);
We also need to insert some representative data that covers all the patterns that we want to demonstrate. Let’s have five people, both full-time employees and contractors. Also, some of them are managers.
| id | name | type | manager_id |
|---|---|---|---|
| 1 | Alice | fulltime | NULL |
| 2 | Bob | fulltime | 1 |
| 3 | Carol | contractor | 2 |
| 4 | David | contractor | 1 |
| 5 | Eugene | fulltime | 2 |
As for payments, we need a different number of payments for each person, including zero. Also, we need payments of both types: bonus and salary. Also, we have one payment that is not directed to an employee.
| id | employee_id | type | amount | date | description |
|---|---|---|---|---|---|
| 1 | 1 | salary | 2000 | 2025-01-25 | Salary Jan 2025 |
| 2 | 1 | bonus | 1000 | 2025-02-10 | Bonus 2024 |
| 3 | 1 | salary | 2000 | 2025-02-25 | Salary Feb 2025 |
| 4 | 1 | salary | 2000 | 2025-03-25 | Salary Mar 2025 |
| 5 | 2 | salary | 1500 | 2025-02-25 | Salary Feb 2025 |
| 6 | 2 | bonus | 700 | 2025-03-10 | Bonus 2024Q4 |
| 7 | 2 | bonus | 300 | 2025-03-15 | Bonus Mar 2025 |
| 8 | 3 | salary | 3000 | 2025-04-25 | Salary Apr 2025 |
| 9 | 3 | salary | 3000 | 2025-05-25 | Salary May 2025 |
| 10 | 4 | bonus | 1500 | 2025-04-05 | Welcome bonus |
| 11 | NULL | external | 5000 | 2025-06-01 | Office rent |
Note that there are no payments at all for Eugene (id=5), no salary payments for David (id=4), no bonus payments for Carol (id=3).
Same data as SQL statement:
INSERT INTO people (id, name, type, manager_id) VALUES
(1, 'Alice', 'fulltime', NULL),
(2, 'Bob', 'fulltime', 1),
(3, 'Carol', 'contractor', 2),
(4, 'David', 'contractor', 1),
(5, 'Eugene', 'fulltime', 2);
INSERT INTO payments (id, employee_id, type, amount, date, description) VALUES
(1, 1, 'salary', 2000, '2025-01-25', 'Salary Jan 2025'),
(2, 1, 'bonus', 1000, '2025-02-10', 'Bonus 2024'),
(3, 1, 'salary', 2000, '2025-02-25', 'Salary Feb 2025'),
(4, 1, 'salary', 2000, '2025-03-25', 'Salary Mar 2025'),
(5, 2, 'salary', 1500, '2025-02-25', 'Salary Feb 2025'),
(6, 2, 'bonus', 700, '2025-03-10', 'Bonus 2024Q4'),
(7, 2, 'bonus', 300, '2025-03-15', 'Bonus Mar 2025'),
(8, 3, 'salary', 3000, '2025-04-25', 'Salary Apr 2025'),
(9, 3, 'salary', 3000, '2025-05-25', 'Salary May 2025'),
(10, 4, 'bonus', 1500, '2025-04-05', 'Welcome bonus'),
(11, NULL, 'external', 5000, '2025-06-01', 'Office rent');
N:1 case of LEFT JOIN
Remember that we’re looking into a restricted case of JOINs: ID equality. Here are all the IDs we have here:
people.id: primary key, so each ID is unique;people.manager_id: several people can have same manager, so each ID can appear multiple times;payments.id: primary key, so each ID is unique;payments.employee_id: one person can have multiple payments, so each ID can appear multiple times;
Let’s run a very simple query that shows list of payments, and the information about the employee:
SELECT payments.id, payments.amount, payments.date,
people.id AS person_id, people.name
FROM payments LEFT JOIN people ON payments.employee_id = people.id
Result:

We have 11 payments and we’re going to have 11 output rows; with filtering, we could have fewer rows, but we’ll never have more than that. So, this is a N:1 case.
Later on we’re going to discuss the fully general LEFT JOIN algorithm, but here we can talk about how this very specific case (N:1) is handled.
Roughly speaking, for each row in payments we’re going to look up a
corresponding row in the people table, using the value of
payments.employee_id.
people.id is a primary key, so we know that there is one
corresponding row, or none at all. The latter case can happen if
payments.employee_id is NULL. If no record is found, then
people.id and people.name would be NULL, as in the last result
row.
This is a very natural case.
This is a very performant case: joining on person ID requires only a primary key lookup which is the most performant data querying operation imaginable.
Here is a surprising suggestion: maybe all of your LEFT JOINs need to be N:1, unless there is a good reason.
This is a rather bold statement, but let’s look at the 1:N case of LEFT JOIN and see that it’s so different from N:1 in all possible ways.
SQL may be too permissive
Due to historical reasons, SQL is very permissive. If your query is syntactically correct, the database is going to try and execute your query. It will go out of its way to execute your query, and you will never receive a pushback.
In this text we’re going to discuss several cases where the database lets you do something that you probably did not intend to do. Particularly:
- 1:N case of LEFT JOIN has a weird semantics;
- anything except for ID equality in ON condition is a potential mistake;
- foreign key definitions are ignored for JOINs: this lets you write syntactically correct but meaningless query;
- chained LEFT JOINs can lead to surprising performance degradation (5-6 or more orders of magnitude) that is not solved by indexes;
All those scenarios will sometimes bite you. Beginners would be especially vulnerable to that lack of guardrails, so in this article we give two strong suggestions for avoiding common query problems:
- always use N:1 case of LEFT JOIN;
- always use ID equality comparison in ON condition;
But again, continue reading for more rationale behind those suggestions.
1:N case of LEFT JOIN
Let’s take a query from the “N:1 case” section and swap the joined tables around:
SELECT people.id, people.name,
payments.id AS payment_id, payments.amount, payments.date
FROM people LEFT JOIN payments ON people.id = payments.employee_id
Result:

For readability, we also swapped the order of fields in the ON condition, and rearranged the list of columns in SELECT.
This query is syntactically correct. It is well-defined: anybody who knows SQL can easily understand what this query will return. Let’s see why we think that this type of query makes no particular sense.
Roughly speaking, for each row in people we’re going to query the
payments table, using the value of people.id. There could be any
number of payments corresponding to each person: zero, one or more.
If there is at least one payment, then there would be the same number
of result rows. However, if there are no payments then we’re going to
emit exactly one row, with the values of “payments.id”,
“payments.amount” and “payments.date” set to NULL.
The last sentence is what makes this case (we call it 1:N) so weird. This logic (NULL if no rows in the right table) works perfectly in the N:1 case, but let’s think about what we have here.
Is it one or another?
Do we have a list of payments here, or a list of people? Actually, we have something in between, depending on the data distribution. Here are three options:
1/ if it so happens that each person has at least one payment, then we’re going to have a list of payments, it makes sense;
2/ if one or more people has no payments (because they just joined the company), then we’re going to have a list of payments AND weird rows that are just record about a person; this makes less sense;
3/ imagine that the table of payments is empty (because our company was just established, and never had the chance to pay anyone). In this case, the result of our query would be basically a list of people. Here is how it looks like with empty list of payments:

Frankly, I know no other widely-used API that would return this sort of mixed results, except for this case. You can probably find some, because our industry has limitless imagination, but would you call such an API a well-designed API?
Forget SQL for a moment, how would you design a better result? You can return two lists: one is a list of payments, like we did in the N:1 case; another is a list of people who were never paid.
Fun fact: you can do that in SQL too, you just need to do two queries. First query we saw in a previous section, the second query would look something like (several variations are possible):
SELECT id, name
FROM people
WHERE id NOT IN (SELECT employee_id FROM payments);
So, we’re in a difficult situation here: this is, of course, one of the most basic SQL queries, it works in literally every database built in the past decades. Yet it’s not clear why this case even needed to be supported. I mean, you can certainly explain the chain of thought that happened around 40 years ago that has led us here, but this explanation does not make this case sensible and logical from the practical point of view.
See also the section about Venn diagrams below.
So what’s the practical outcome here? First, you must understand how LEFT JOIN works in a general case and in specific cases, that goes without saying. Everyone would expect you to understand it, it’s how databases work, it’s a completely non-controversial knowledge, it is what it is.
At the same time, notice when you attempt to use 1:N LEFT JOIN in one of your queries, and ask yourself if this is what you really need to do. There is a high chance that your query would be better written as N:1 LEFT JOIN, or as an INNER JOIN (see below).
However, see also the section about LEFT JOIN and GROUP BY below, where a possible exception from this rule is discussed.
Why you should only use ID equality
(TODO: This subsection needs to be moved below.)
In this guide we strongly suggest that you use only ID equality in ON condition. All other filtering must go into WHERE. This applies to both LEFT JOIN and INNER JOIN. For the latter, this is technically not necessary, but we’ll present a scenario where this can become a gotcha.
If your query does not follow this rule and you are getting unexpected results, the first recommendation would be to rewrite your query and see if it works.
Let’s look first at an example of a simple query that works in a confusing way. Suppose that we want to get a list of payments less than some amount. This is basically a N:1 LEFT JOIN plus an extra filtering condition. Let’s knowingly violate the “ID equality only” rule and see what happens:
SELECT payments.id, payments.amount, payments.date,
people.id AS person_id, people.name
FROM payments LEFT JOIN people
-- this is incorrect condition
ON payments.employee_id = people.id AND payments.amount < 1000
Here is the result:

Wow, what happened to our “AND payments.amount < 1000”? To understand why this happens, you need to look at the fully general LEFT JOIN algorithm below and see that this is exactly how it’s supposed to work.
Here, this part of a condition is used for matching, and not for filtering. And due to the way LEFT JOIN works, when the matching function does returns false, we still output the rows from the left table.
To fix this, you just need to follow the advice and move the filtering condition into WHERE clause:
SELECT payments.id, payments.amount, payments.date,
people.id AS person_id, people.name
FROM payments LEFT JOIN people
ON payments.employee_id = people.id
WHERE payments.amount < 1000

Second part of the potential gotcha is that for INNER JOIN this does not apply: ON and WHERE conditions in INNER JOIN are fully interchangeable (see the section below).
But you still have to keep this discipline of only using ID equality in ON condition, even for INNER JOIN! This is because as you work on query, you can sometimes switch back and forth between LEFT and INNER JOIN. So it happens like this:
- you have a sloppy ON condition in INNER JOIN;
- you change to LEFT JOIN: for example, you want to include even payments that are not sent to employees;
- your ON condition becomes incorrect;
In this example everything is clearly visible, and the query is as simple as possible, you will quickly find the mistake. But if you have a subquery or a CTE it may be much more tricky.
So, keep some discipline and use only ID equality comparison in ON condition. Unfortunately, this is one of the places where SQL does not prevent you from doing the potentially incorrect thing.
Self-join
So, here is a simple problem: show a list of people with the name of their manager. Remember that some people, like Alice, don’t have managers.
In our people table, there are two columns that contain IDs: “id” and “manager_id”. They both contain IDs of people, and that means that we can write a JOIN based on those two IDs. Those two columns are in the same table, but that’s fine, it’s just a self-join.
One thing that we must do when writing a self-join query is to remove
name ambiguity by aliasing one or both tables. This is done using the
“AS” keyword.
Also, we need to make sure that we choose the N:1 JOIN. Let’s talk this through:
- each employee has only one employee as manager, or none; (N:1)
- each employee can be a manager of several employees; (1:N)
Here is the query:
SELECT subordinates.id, subordinates.name, managers.id, managers.name
FROM people AS subordinates LEFT JOIN people AS managers
ON subordinates.manager_id = managers.id
And here is the result:

As you can see, Alice does not have a manager. Bob has Alice as manager, and so on.
Let’s look at the query. We joined a table to itself, and we specified table aliases for both sides of JOIN: “subordinates” and “managers”.
There is nothing complicated about self-joins, really. You should use the N:1 case, and you should alias the table names. Self-joins are mostly used when there is some sort of hierarchy, like in manager-subordinate relationships. Here are some other possible self-join scenarios that come to mind:
- manager/subordinates is an archetypal example;
- nested folders like in Google Docs: a folder can contain other folders;
- referred accounts: if your website allows to refer people for signup, you can have a hierarchy of sign-ups;
- threaded comments, like on Reddit: you can reply directly to the post, or to another comment;
If you want to get acquainted with self-join, try implementing one of the scenarios listed here, add some test data and try running queries. All self-join queries are very similar; only the table names, aliases and column names would be different.
One advanced case not covered here is called recursive joins. Not all databases support it, and you have to read about that elsewhere.
Fully general LEFT JOIN algorithm
So far we’ve discussed three restricted cases of ON condition:
- ID equality comparison, N:1 case;
- ID equality comparison, 1:N case;
- ID equality comparison + some other condition;
But LEFT JOIN was introduced as a fully general algebraic operation half a century ago, and it is implemented as a fully general algebraic operation in all SQL databases.
In this text we will deviate again from the commonly-presented declarative approach to LEFT JOIN. Instead we present an imperative approach to explaining how LEFT JOIN works in the most general case.
(See the corresponding section below for the discussion of declarative vs imperative approaches.)
Anyway, here is the:
Generalized LEFT JOIN algorithm
For each row of the first table we do the following:
-
For each row of the second table we evaluate the ON condition. If it returns true, we add a row to the output. Data from all columns of both tables is available for output.
-
If the ON condition has never returned true, we add one row to the output. But here only the data from the left table is available for output; data from the right table is not available, and NULL values are used instead.
End.
This algorithm is carefully worded to handle all possible cases. You need to understand it because you may sometimes see it in queries, or you may be asked about that as some sort of a quiz.
Here this is abstractly defined as a nested loop, but the real query engine processes LEFT JOINs in a variety of optimized ways depending on ON condition and table schemas. This is why the N:1 case is so important: it is the most efficient to implement in practice: basically there is no inner loop. Do not worry about optimization and execution plan here: our goal is to learn the abstract definition.
To understand why it says “has never returned true”, consider an empty second table. For each row in the first table, we need to go through each row of the second table, evaluating the ON condition. But there are no rows, so ON condition is never evaluated, and thus, it has never returned true. In this case, a row from the first table is added in the output.
Note also that there are no requirements on what ON condition does. In real-world queries it virtually always compares two IDs from two tables. But there are some exotic ON conditions such as “always true” and “always false” that we discuss in the next section.
We’ll discuss the declarative approach to understanding LEFT JOIN below.
Also, we need to discuss why “returns all rows” wording is misleading. Unfortunately, it is very commonly seen on the internet and in LLM outputs, so you need to understand how this wording may hurt your understanding. We, of course, do not use this wording here. See the corresponding section below.
Continue with INNER JOIN (TBW)
Common bad explanations (TBW)
Advanced cases (TBW)
See also: https://kb.databasedesignbook.com/posts/systematic-design-of-join-queries/.
Conclusion
A brief summary of things I want you to remember from this:
-
use only ID equality comparison in ON condition, everything else goes to WHERE;
-
learn LEFT JOIN first, even though traditionally people begin with INNER JOIN;
-
LEFT JOIN is better understood as three distinct subcases:
-
N:1 case, ID equality only: the most useful and most performant;
-
1:N case, ID equality only: quite weird, and you can avoid it;
-
fully general case with arbitrary ON conditions;
-
-
SQL is too permissive and does not protect you from simple mistakes: performance issues and unexpected results may occur;
-
you need to understand the entirety of LEFT JOIN semantics so that you can read and modify queries written by other people;
-
fully-general LEFT JOIN algorithm is better presented as simple imperative nested loop;
-
if you follow advice presented here, you will prevent many typical mistakes and make your queries faster and more readable;
“Database Design Book” (2025)
Learn how to get from business requirements to a database schema
If this post was useful, you may find this book useful too.
Table of contents and sample chapters
Book length: 145 pages, ~32.000 words. Available in both PDF and in EPUB format.
