Foreign Keys
Author: Alexey Makhotkin <squadette@gmail.com>
(Word count: 2900).
Foreign keys are one topic that you cannot ignore if you want to talk about database design. In this informational two-pager I’d like to point out the following aspects of foreign key technology:
- foreign keys are only a partial solution to the problem of database consistency;
- in classic relational databases, eliminating foreign keys may be an easy performance win;
- in many real-world scenarios foreign keys could not be enforced even if the underlying database supports them;
This text is a complementary chapter of “Database Design Book” https://databasedesignbook.com/.
Foreign keys, conceptually
Consider a social network with users who can publish posts. We have the “users
” table, each user has an integer ID, for example ID=3. This ID is stored in the “users.id
” column, and this column is a primary key.
We also have the “posts
” table, where each post has an integer ID, for example ID=100. The “posts
” table refers to the “users
” table via “posts.user_id
” column. If a post with ID=100 has user_id=3 it means that it was written by the user with ID=3.
In this section we talk about foreign keys in the abstract, so that we have no idea which database we use and what are its capabilities. We want our database to be self-consistent. Foreign keys are one aspect of this consistency. We want to make sure that each value that is stored in the “posts.user_id
” column has a corresponding value in the “users.id
” column.
For example, if a post has user_id=3000 and there is no record about a user with ID=3000, the database is inconsistent. We want to prevent this situation and so we declare a foreign key that would somehow prevent this. “Somehow” is an important word here: in this section we don’t care how exactly this is going to happen.
History of foreign keys
Historically (since the very beginning) this particular aspect of data integrity was a hugely important concept for the early proponents of relational database technology. If you listen to some early talks you will hear foreign keys mentioned very pointedly and specifically as a crucial concept. For example (search for “foreign” here):
- https://minimalmodeling.substack.com/p/watching-chris-date-1983-1;
- https://minimalmodeling.substack.com/p/watching-chris-date-1983-3.
As a result of this insistence, many classic relational database management systems (such as Oracle, Postgres, DB2 etc.) had something that we would call enforced foreign keys. If such foreign keys were declared on the database tables, the server would automatically prevent this sort of inconsistencies.
Later on, during the NoSQL revolution, support for foreign keys was often conspicuously missing from new-school databases. But let’s focus on relational databases.
Enforced foreign keys
Let’s go back to our social network example. There are two main ways for our database to become accidentally inconsistent. First is if we create a row in the “posts” table with incorrect user_id value, for example:
INSERT INTO posts (user_id, body) VALUES (3000, “Hello world”);
The second way is that we accidentally delete a row from the “users” table, while there are some posts published by that user:
DELETE FROM users WHERE id = 3;
Note that it’s not clear how exactly we got this “incorrect user_id”. Why is it 3000? Where did the system get that number from? This is an interesting question actually, we’ll discuss it later.
We could declare enforced foreign keys on those two tables in a standard way:
CREATE TABLE users (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(64) NOT NULL
);
CREATE TABLE posts (
id INTEGER NOT NULL PRIMARY KEY,
body TEXT NOT NULL,
user_id INTEGER NOT NULL,
FOREIGN KEY (user_id) REFERENCES users (id) ON DELETE RESTRICT
);
This is not a tutorial on foreign keys, so we’re going to discuss only this particular scenario (preventing deleting). There are some other ways to declare foreign keys, explained elsewhere.
When the “FOREIGN KEY
” declaration is added to the “posts
” table, both queries listed above would fail. You won’t be able to insert a new post with dangling user_id, and you won’t be able to remove a user record if their posts become orphaned.
Many people consider enforced foreign keys a baseline of a competently designed database. In many textbooks and courses on relational database technology foreign keys are also presented as a necessity.
But let’s look at some consequences of enforced foreign keys.
Enforced foreign keys and transactions
Another pervasive database technology is transactions (I mean ACID here: “atomicity, consistency, integrity, durability”). Basically it works this way:
- you begin a transaction;
- you make several database changes (for example, adding several rows in different tables);
- you commit a transaction;
All changes are applied all together, or the entire operation would fail. If you inserted three rows in table A
and five rows in table B
, it’s going to be either that, or none at all. It’s impossible for, say, only the rows in table A
to be added.
Transactions are much, much more fundamental technology than foreign keys. (The full picture is of course much more complicated, and I definitely urge you to study how the transactions work in relational databases: both conceptually and on a lower level.)
One important aspect of transactions is that even single-statement SQL queries run in a transaction of its own.
Let’s get back to the way we insert data into the “posts” table:
INSERT INTO posts (user_id, body) VALUES (3, “Hello world”);
What happens here if the enforced foreign keys are declared? We need to make sure that a user with ID=3 exists. It means that we actually do the following sequence of steps:
- begin transaction;
SELECT id FROM users WHERE id = 3;
- we see that the record is here, so we can proceed;
INSERT INTO posts (user_id, body) VALUES (3, “Hello world”);
- commit transaction;
The database server must protect the integrity of the database. What if there is another process that tries to simultaneously delete a row from the “users” table? Here is a hypothetical scenario that must be prevented:
- process 1: begin transaction;
- process 2: begin transaction;
- (few microseconds later) process 1:
SELECT id FROM users WHERE id = 3;
- (okay, the record does exist, we can proceed)
- (few microseconds later) process 2:
DELETE FROM users WHERE id = 3;
- (few microseconds later) process 1:
INSERT INTO posts (user_id, body) VALUES (3, “Hello world”);
- (few microseconds later) process 2: commit transaction, so that a record about user ID=3 no longer exists;
- (few microseconds later) process 1: commit transaction, so that there is now a record in “
posts
” table that refers to user_id=3 that no longer exists;
Transactions DO prevent this scenario, in the following way:
- process 1: begin transaction;
- process 2: begin transaction;
- (few microseconds later) process 1:
SELECT id FROM users WHERE id = 3;
- (okay, the record does exist, we can proceed)
- (few microseconds later) process 2: submits
DELETE FROM users WHERE id = 3;
- actually, this query is not executed yet! Because there is an outstanding read from the same table on the same ID, the process 2 gets to wait until process 1 commits transaction;
- (few microseconds later) process 1:
INSERT INTO posts (user_id, body) VALUES (3, “Hello world”);
- (few microseconds later) process 1: commit transaction, the database is consistent at that point;
- (few microseconds later) process 2 get unblocked, and tries to execute the
DELETE FROM users WHERE id = 3
statement, but this statement now fails, because there is now a row in the “posts
” table;
So, in the end our database is consistent. The database server has to carefully manage all possible concurrent access patterns from multiple clients to prevent all possible inconsistencies.
This is a beautiful technology, very reliable, there is absolutely nothing wrong with that, it works exactly as we want. But let’s consider some other consequences a few sections below.
JOINs do not honor foreign keys
Suppose we have carefully defined all the foreign keys in our database, just like the textbook says. Let’s run the following query:
SELECT * FROM posts INNER JOIN users ON posts.id = users.id;
Do you see a problem? The ON condition is incorrect: it should be “ON posts.user_id = users.id
”. I don’t know any relational database that would catch that. Why?
This for me is one of the most puzzling aspects of this concept. If foreign keys are so important for database consistency, why don’t you use them for, like, consistency?
One equals one
Why does the query from the previous section work? Because historically, relational database technology used untyped primary keys, where all integer numbers can be freely compared to each other.
User ID=1 is a different thing than Post ID=1: you should not be able to compare them directly (like in the query above). The query compiler should complain about type mismatch and make you fix the mistake.
But that’s not the reality, and we can’t have nice things due to unlucky historical timing. In an alternative timeline, modern data types would have been introduced earlier, and we would have better support for ID data types. Alas.
(But that still does not explain why JOINs can’t take foreign keys into account).
Application bugs and foreign keys
Let’s go back again to our social network example.
INSERT INTO posts (user_id, body) VALUES (3000, “Hello world”);
If there is no user with ID=3000, this query would fail: foreign keys protected our database against inconsistency.
But where did we get this number: 3000? Maybe there is a bug in our application? Foreign keys are supposed to protect our database against bugs in applications, but what sort of bugs?
What if instead our application has a bug that in some cases would provide user_id=1 (hardcoded number), and there is actually a user with ID=1. The posts would get assigned to this user, and foreign keys won’t help.
Or, what if our application has a bug that strips the last digit of user_id? So when the user with ID=321 submits a post, the user_id=32 is used instead?
Real-world case: consider a table of companies. Later, a concept of a holding company was introduced. Initially all existing companies were initialized to be a holding company of themselves. Later, new-format data started to arrive into the system, with several companies having the same holding company. Due to codebase complexity, in one of the if-branches a company ID was accidentally used instead of a holding company ID. Due to the amount of data, it took some time until people noticed that a fraction of data references was incorrect.
And so on, and so on. If you think about that, some bugs that would not get caught by foreign keys are realistically more likely to happen than the bugs that would be caught by foreign keys.
The point of this is that we have a problem to solve: database consistency. We have a solution: foreign keys. This solution is suspiciously incomplete, and it breaks quickly. And when it breaks, there is no fallback solution. If you use an existing but incorrect ID, it won’t get detected. To protect against that, you have to go back to good old testing, sanity checking and data anomalies detection.
Weak references are no big deal
Another thing that I find rather suspicious: the idea of the data structure that is “correct at all time” is actually quite rare in other areas of computer science.
You can treat foreign keys as pointers, they are very similar. Foreign keys aim to guarantee that there is always something at the other side of foreign key. Apparently, you can rely on this guarantee in the way you write your code (here: SQL queries).
But in many other areas you’re supposed to write defensively. If you access an array element by an index, you should be prepared for the possibility that the index is out of bounds.
You could go to some physical address and find that it does not exist any more, maybe because the building has been demolished. You always have a fallback plan for what to do when the address is invalid.
Even closer to the topic of databases: if a customer ordered some item, but this item is no longer in the warehouse (maybe it was lost or stolen), you’re prepared for that possibility.
Enforced foreign keys just go against the grain here. You’re supposed to trust the correctness of foreign keys — but how? Do you need to write queries that would explicitly break if not for foreign keys? Or do you still write the queries in a defensive way? Actually, you should, because as I’m going to show, you need to be prepared to write code, e.g. SQL queries, in the absence of foreign keys too.
Case study: Amazon.com orders
Almost 20 years ago, maybe around 2006 I listened to a podcast with Werner Vogels, CTO of Amazon.com. Since then I tried to find that podcast, but to no avail, so I can only offer a retelling. At that time, Amazon.com (the website) used Oracle as a database management system. Originally they built the database in a classic way, with enforced foreign keys. As the business continued to grow, this became a performance bottleneck.
At that time there was nothing to migrate to (AWS was in very early stages), so they had to reengineer the existing solution. One problem that they had was how they stored orders. Similarly to the scenario explained in “Enforced foreign keys and transactions” section above, they had the following sequence of steps:
- begin transaction;
- create a new row in the “
orders
” table, thus generating the newid
value; - insert one or more rows in the “
order_lines
” table, using the newly-generated value fororder_id
; - commit transaction;
Due to the way transaction isolation works, this requires some locking on both “orders
” and “order_lines
” tables. Amazon’s traffic was quite high even at that time, and this was one of the bottlenecks, according to Vogels. To solve this, engineering team decided to embrace dangling keys, and switched to the following processing sequence:
- allocate new order_id;
- insert one or more rows in the “
order_lines
” table, using the allocated order_id;- this insert implicitly committed the one-statement transaction;
- insert a row into the “orders” table; (also in a separate one-statement transaction);
The way the rest of the system worked is that the order considered valid only after a row in the “orders
” table was inserted. All other queries basically ignored dangling rows in “order_lines
”.
If the process that created the transaction failed in the middle, some dangling rows were left in “order_lines
”. They were ignored, and periodically cleaned up by a cron script.
According to Vogels, this change allowed them to recoup some of the performance budget associated with cross-table transactional locking.
For me hearing this story was a revelation. I can’t remember now what my best mental model of practical relational modeling was, but it has certainly changed the day I heard this.
Anyway, the point of this story is that in some scenarios eliminating foreign keys could be an easy performance win. You may be able to achieve some extra write throughput staying within the same storage and compute constraints.
Can I have more than one database?
Finally, here is another scenario where enforced foreign keys are practically impossible. Enforced foreign keys require a single database (I don’t know what’s the word for that, maybe it could be called “a single transactional domain”).
There are several reasons why you may want to split the database into several databases, so that you wouldn’t be able to keep transactional protection across them all.
First, you may have regulatory requirements: things like SOX (financial controls) and GDPR (personally identifiable information). It’s a very common requirement in business, and having a separate database is basically a very easy first step towards compliance.
Second scenario is architectural splits. For example, you may want to separate your user authentication database (not related to personal information). This is again a very common pattern. Such a database would be a primary source of user IDs. It would be separated by a transactional boundary from all other databases (which is the whole point of such a split).
Third case is technology-driven splits. In our hypothetical social media service example, you may want to implement full-text search by keeping a copy of posts in Elasticsearch. When a user tries to search for some keywords, you query Elasticsearch, get a list of post IDs, and use that list to render the posts from the main database. Here it’s possible that a post was deleted in the main database, but was still available in the Elasticsearch cluster. This is a foreign key violation, and you have no choice but to handle this in your application (which is actually super simple and not a big deal).
Conclusion
Foreign keys are a foundational concept in database modeling. It is necessary to understand what foreign keys are. You must also understand how standard enforced foreign keys work. You must also understand the inevitable consequences of this approach, primarily its interplay with transactions.
But you also need to understand how the absence of foreign keys would affect the design of a database system.
In this text, I tried to illustrate how disappointing foreign keys actually are, as a concept. They are a very limited solution to the problem of database consistency. They don’t even help with querying data. You could say that with foreign keys “there is no there”, or at least not much “there”.
In practice, you must still code defensively, even if you have set up all required foreign keys constraints. As soon as you cross the transactional boundary (by querying a different database), you don’t have the protection of foreign keys. And crossing the transaction boundaries may happen in a lot of situations that seem hard to avoid as your system grows.