Historized attributes: systematic table design
Author: Alexey Makhotkin squadette@gmail.com.
(Word count: 3200).
A common problem in business-oriented database design: keeping the history of values of a certain data attribute. For example, we may want to track the price of various goods, as they change with time. Many other tasks could be reduced to this problem: for example, when people change their address in the government database, we may want to keep track of previous addresses.
TL;DR: A solution is presented below, in the “Final SQL schema” section.
But let’s also try and see how you can arrive at it independently, using the approach explained in my upcoming book about database design: https://databasedesignbook.com/.
Outline of this article
In the first half we build the logical schema. It is completely independent from any specific database server implementation.
In the second half we present physical table design for a typical relational database, using the most common approach. Other approaches are possible, we’ll briefly talk about that too.
We also analyze a solution provided by the commonly-available version of ChatGPT.
We talk a bit about the ambiguity of natural language, and how it affects database design.
Table of Contents
- Problem statement
- Trivial schema without history
- Logical schema: anchors
- Logical schema: attributes
- Logical schema: links
- Physical schema: choosing table strategy
- Draft SQL schema
- Improved SQL schema: using sentinel value for
valid_to
- Final SQL schema: indexes
- TL;DR: Adapting schema for arbitrary attributes
- Extra: Why two timestamps?
- What ChatGPT thinks
- Extra: Dealing with ambiguous human language
- Conclusion
Problem statement
Suppose that our company sells various kinds of fruits, and from time to time the price of each fruit can be changed to accommodate the market situation. We want to keep the information about historical prices: when did it change and to what value. We also want to know the current price of fruits.
For simplicity, we choose to talk about fruits specifically, and not about more generic “items”. We pretend that the price is always “dollar per kilo”, so that we don’t need to bother with “price per box”, “per item”, etc. Extending the schema to take this into account is left as an exercise for the reader.
Trivial schema without history
Let’s begin with a schema for a much simpler task.
Suppose that our company sells various kinds of fruits for a certain price. We sometimes change the price, but we don’t care about the history of those prices. That’s it.
The logical model is very simple: one anchor (“type of fruit”), and two attributes: name of the fruit and price of the fruit. Here is the list of anchors in tabular form:
Noun | Physical table |
---|---|
TypeOfFruit | fruits |
And the list of attributes:
Anchor | Question | Logical data type | Example value | Physical column | Physical type |
---|---|---|---|---|---|
TypeOfFruit | “What is the name of this TypeOfFruit?” | string | “Lime” | fruits. name |
VARCHAR(64) NOT NULL |
TypeOfFruit | “What is the price of this TypeOfFruit, in USD per kilo?” | monetary amount | 10.50 | fruits. price |
DECIMAL(7, 2) NULL |
No links are needed here. The physical schema is almost trivial:
CREATE TABLE fruits (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(64) NOT NULL,
price DECIMAL(7, 2) NULL
);
Even writing down the logical schema for this feels like overkill, but we do that to establish a baseline.
The problem with that schema is that when you change the price of a fruit, the information about previous price is lost. So let’s get back to our original system requirement, and design for that.
Logical schema: anchors
First part of building the logical schema is to find some anchors. Anchors are nouns, and they could be counted.
We already found the first anchor: “type of fruit”. To confirm that it is the right anchor, let’s use it in sentences and see if they make sense:
- “How many types of fruit does our company sell?” (counting sentence)
- “Could you add another type of fruit, lemons, into our database?” (adding sentence)
Fruits have a price. Normally you would say that price is an attribute, but when you change the attribute value, the previous price information would be lost. So we need to find where to store those historical prices. Maybe we have a “historical price” anchor? Let’s use this in sample sentences:
- “How many records of historical prices do we have for that type of fruit?” (counting sentence)
- “Let’s change the price one more time by adding one more price record” (adding sentence)
The sentences make sense, so let’s add it into the list of anchors:
Noun | Physical table |
---|---|
TypeOfFruit | <TBD> |
HistoricalPrice | <TBD> |
“TBD” means “to be discussed”, we’ll talk about physical tables later.
Why did we choose “TypeOfFruit” and not just “Fruit”? Why did we choose “HistoricalPrice” and not just “Price”? We’ll talk about that later in a separate section: “Dealing with ambiguous human language”.
Logical schema: attributes
Now it’s time to define attributes.
NB: This word is quite overloaded, so we have to be a bit careful. The term “historized attribute” means the entire history of attribute values. We implement it here using several other modeling attributes. Let’s get back to that later in the “Dealing with ambiguous human language” section.
Above we already defined the attribute for “name of the fruit”. Let’s look at the HistoricalPrice anchor. For each price we want to store:
- the price itself;
- time period during which this price is valid.
Anchor | Question | Logical data type | Example value | Physical column | Physical type |
---|---|---|---|---|---|
HistoricalPrice | “What is the value of this HistoricalPrice, in USD per kilo?” | monetary amount | 10.50 | <TBD> | <TBD> |
HistoricalPrice | “When did this HistoricalPrice start being valid?” | timestamp (UTC) | “2024-10-27 19:00:00” | <TBD> | <TBD> |
HistoricalPrice | “When did this HistoricalPrice stop being valid?” | timestamp (UTC) | “2024-11-27 09:00:00” | <TBD> | <TBD> |
Let’s discuss what we see here. Note that this here is the logical model: it is more pure than the physical model. Physical model has more variation in possible implementations, we’re going to discuss those in the second half of this post.
This set of attributes allows us to encode the following cases that seem to be useful in practice:
First case is the simple current price, valid from some time ago until some time in the future, e.g.:
(value: 5.5, valid_from: “2024-05-01 00:00:00”, valid_to: <unset>).
Note here that on the logical level, value of attribute can be either unset, or set to a definite value. There are no NULLs, sentinel/fake values etc. Those may appear later on the physical level, we’ll talk about that. Here, if the “valid_to” attribute is unset it means that the price is valid indefinitely in the future.
Second case is a price in the past, e.g.:
(value: 4.5, valid_from: “2023-10-01 00:00:00”, valid_to: “2024-05-01 00:00:00”);
Third case is planned future discount, e.g. a set of three records:
- (value: 5.5, valid_from: “2024-05-01 00:00:00”, valid_to: “2024-12-18 00:00:00”);
- (value: 5.0, valid_from: “2024-12-18 00:00:00”, valid_to: “2024-12-26 00:00:00”);
- (value: 5.5, valid_from: “2024-12-26 00:00:00”, valid_to: <unset>);
Here we see that the week before Christmas we reduce the price for a week and then it automatically goes back to where it was.
Fourth important case is the absence of price. For example, maybe in 2023 there was a world shortage of lime, so our company did not sell at all. We want to keep this information about the absence of trades. Otherwise we would mistakenly believe that the price was constant during 2023. Here is how we do that:
- (value: 4.5, valid_from: “2022-01-01 00:00:00”, valid_to: “2023-01-01 00:00:00”);
- (value: 5.5, valid_from: “2024-01-01 00:00:00”, valid_to: <unset>);
What we see here is that we were selling limes until the beginning of 2023 at $4.5/kg. Then there is an explicit gap in price information, and then starting from the beginning of 2024 we restart trading at $5.5/kg.
In this logical schema, how do we find the current price of a certain fruit? Answer: we find the HistoricalPrice record that is:
- either valid_from <= $current_time and $current_time < valid_to;
- or valid_from <= $current_time and $current_time is unset;
Note that in the physical schema this could be implemented more efficiently.
Logical schema: links
The last part of the logical schema is a list of links. Here we have only one link, the link between fruits and their prices.
Anchor 1 * Anchor 2 |
Sentences | Cardinality | Physical table or column |
---|---|---|---|
TypeOfFruit < HistoricalPrice |
Types of fruits can have several historical prices Historical price may belong to only one fruit |
1:N | <TBD> |
Cardinality is the most important piece of information about the link, so we write it here three times:
- “<” symbol means “1:N”;
- sentences use mandatory words: “several” and “only one”; we use them to confirm that the logical model matches the actual reality;
- we repeat cardinality in a separate column, using a more traditional, but less informative notation. Technically this is duplicate information, but it’s very convenient when you do the schema overview.
Physical schema: choosing table strategy
So far most of the text is completely independent from the physical implementation. But at this point we must choose the general strategy on how to proceed. As we mentioned in the beginning, we assume that we design for a typical relational database, such as Postgres or MySQL.
Moreover, we use the most common and simple table strategy: table per anchor. Other table strategies are possible, but this is discussed in the book. First, let’s gather the full logical schema that we have so far. Also, we fill in the cells that were marked as “<TBD>”.
For anchors, we choose sensible table names (last column):
Noun | Physical table |
---|---|
TypeOfFruit | fruits |
HistoricalPrice | fruit_prices |
For attributes, we choose a sensible column name, and a recommended physical type (last two columns). Full discussion of types is available in the book.
Anchor | Question | Logical data type | Example value | Physical column | Physical type |
---|---|---|---|---|---|
TypeOfFruit | “What is the name of this TypeOfFruit?” | string | “Lime” | fruits. name |
VARCHAR(64) NOT NULL |
HistoricalPrice | “What is the value of this HistoricalPrice, in USD per kilo?” | monetary amount | 10.50 | fruit_prices. value |
DECIMAL(7, 2) NOT NULL |
HistoricalPrice | “When did this HistoricalPrice start to be valid?” | date/time (UTC) | “2024-10-27 19:00:00” | fruit_prices. valid_from |
DATETIME NOT NULL |
HistoricalPrice | “When did this HistoricalPrice stop to be valid?” | date/time (UTC) | “2024-11-27 09:00:00” | fruit_prices. valid_to |
DATETIME NULL * further discussed below |
We have just one link. This is a 1:N link, so we can implement it as a column in the N-side anchor table:
Anchor 1 * Anchor 2 |
Sentences | Cardinality | Physical table or column |
---|---|---|---|
TypeOfFruit < HistoricalPrice |
Types of fruits can have several historical prices Historical price may belong to only one fruit | 1:N | fruit_prices. fruit_id |
Draft SQL schema
Let’s combine all that into the first draft of the schema (MySQL-compatible in this case):
-- this is a draft schema
CREATE TABLE fruits (
id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(64) NOT NULL
);
CREATE TABLE fruit_prices (
id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
fruit_id INTEGER NOT NULL,
value DECIMAL(7, 2) NOT NULL,
valid_from DATETIME NOT NULL,
valid_to DATETIME NULL
);
Writing down this schema is very straightforward and even boring. You go over each element in the logical schema and add the corresponding table or column.
But we need a couple of finishing touches: a) indexes and b) dealing with valid_from
/valid_to
columns on the physical level.
How would a typical query look? Suppose that we want to get the price of fruit with ID=3 at some point in time, say “2023-06-01 00:00:00”. Here is the query that works with this version of the schema:
SELECT *
FROM fruit_prices
WHERE fruit_id = 3
AND valid_from <= "2023-06-01 00:00:00"
AND (valid_to > "2023-06-01 00:00:00" OR valid_to IS NULL);
There are several things to notice here. First, for “valid_from
” we
used the “less than or equal” operator, whereas for “valid_to
” we
used the “greater than” operator. This is important.
Second, for “valid_to
” we have to handle both a case of NULL and a case of definite timestamp. Can we improve that to simplify queries?
Third, there are no indexes. If we look at the query execution plan, we’ll see that this query is doing a full table scan. We certainly want to improve it in the name of performance.
Fourth, there are a lot of texts on the internet that attempt to solve this problem using only one timestamp. Why did we choose two timestamps? Let’s discuss it below.
Improved SQL schema: using sentinel value for valid_to
Logically, “valid_to
” could contain a specific time or be indefinite. We tried to use a NULL
value to represent this. We can also do it with the fake (sentinel value). DATETIME physical type has a ridiculous upper limit, such as “9999-12-31 23:59:59”. What if we could exploit this? Let’s choose some constant that would never be reached: “9000-01-01 00:00:00”, and use it when the “valid_to
” time is indefinite. Then we can mark it as non-null:
CREATE TABLE fruit_prices (
-- ...
valid_to DATETIME NOT NULL
);
Our main query would then look simply:
SELECT *
FROM fruit_prices
WHERE fruit_id = 3
AND valid_from <= "2023-06-01 00:00:00"
AND valid_to > "2023-06-01 00:00:00";
(Unfortunately, you cannot use BETWEEN here, because it uses “<=” for both left and right side, and you must use “<” here. This is a real bummer, it would have been so nice if there was something like “XBETWEEN” operator, or a Python-style comparison such as “valid_from <= "2023-06-01 00:00:00" < valid_to
”. But alas, we can’t have nice things.)
Getting the current price is also simple, just use the corresponding function:
SELECT *
FROM fruit_prices
WHERE fruit_id = 3
AND valid_from <= NOW()
AND valid_to > NOW();
Final SQL schema: indexes
To understand indexes fully, you may want to read the book about indexes, such as “Use The Index, Luke” by Markus Winand: https://use-the-index-luke.com/
For our purposes, we need to add an index covering two columns: (fruit_id
, valid_from
).
Here is the final schema with the two changes applied to the “fruit_prices
” table:
CREATE TABLE fruits (
id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(64) NOT NULL
);
CREATE TABLE fruit_prices (
id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
fruit_id INTEGER NOT NULL,
value DECIMAL(7, 2) NOT NULL,
valid_from DATETIME NOT NULL,
valid_to DATETIME NOT NULL,
INDEX (fruit_id, valid_from)
);
TL;DR: Adapting schema for arbitrary attributes
We used a very concrete example, “fruit prices”, but the same pattern could be reused for any piece of data that changes (for example, “person’s address”). You need to change:
- table name that contains values;
- reference to the anchor ID;
- physical type of value.
Here is the template to copy-paste and adjust to your use case:
CREATE TABLE <<table_name>> (
id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
<<anchor_id>> INTEGER NOT NULL,
value <<PHYSICAL_TYPE> NOT NULL,
valid_from DATETIME NOT NULL,
valid_to DATETIME NOT NULL,
INDEX (<<anchor_id>>, valid_from)
);
Extra: Why two timestamps?
There are lots of texts on the internet that attempt to solve this problem using only one timestamp. Such a schema would look like this:
CREATE TABLE fruit_prices (
id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
fruit_id INTEGER NOT NULL,
value DECIMAL(7, 2) NULL,
valid_from DATETIME NOT NULL,
INDEX (fruit_id, valid_from)
);
This approach works, but it requires a subquery to get the historical price for any given time (and the current price too):
SELECT *
FROM fruit_prices
WHERE fruit_id = 3
AND valid_from = (SELECT MAX(valid_from)
FROM fruit_prices
WHERE fruit_id = 3
AND `valid_from` <= "2023-06-01 00:00:00")
I hope I did not screw up the query. But that’s exactly the point: the two-timestamp schema with sentinel values is super straightforward, it’s just so much easier to use, and more performant.
What ChatGPT thinks
I used ChatGPT 4o to ask for advice: https://chatgpt.com/share/67237760-bb98-8012-9d61-b10057433f65
Here are things that are discussed in this post that are worth noticing in that answer:
- it uses one-timestamp approach;
- it suggests more indexes than needed: it creates an index on (
fruit_id
) and another index on (fruit_id
,change_date
); actually the second index supersedes the first one (read the book!); - it loves foreign keys which I completely ignored in this post;
- the “
price
” column is declared asNOT NULL
: this assumes that there is always a price for a fruit, even if sometimes they are simply not available; to handle this case you need either a two-timestamp solution, or make this column nullable; - it suggests keeping a copy of the current price directly in the “
fruits
” table; this may be a great performance improvement, but you would have to keep this up-to-date carefully. Particularly, note that our schema handles the use-case of “schedule a discounted price for some time period, then go back to the original one”. Thinking through how you would implement that is a great exercise.
Extra: Dealing with ambiguous human language
There are several places in this text where we can’t help but use somewhat ambiguous language. Noticing this may sometimes lead to small breakthroughs in data modeling of a certain cases.
First elephant in the room is that there are two different meanings of “attribute” here. “Historized attribute” is a commonly used term, but it’s quite fuzzy. On the logical level, every historized attribute is modeled by three logical attributes, one anchor and one link. Overall the word “attribute” is hopelessly overloaded, but it’s not clear what to replace it with. Sometimes I use “piece of data” for intentional vagueness.
Second, why do we use “TypeOfFruit”, and not “Fruit”? Suppose that our company specializes in watermelons and grapefruits, and does not sell anything else. Suppose that we’ve sold a truckload of watermelons and a truckload of grapefruit: how many fruits did we sell? If we write a report about our business, we’d probably say: “our company sells only two types of fruit, but we’re selling it by thousands every week.”
For a nastier example, imagine a company sells airline tickets. What does “flight” mean? Without clarification, this word could mean at least three things:
- the fact that there is an aerial connection between Amsterdam to Berlin;
- the fact that there is a daily flight at 19:35 that has flight number “KL 1783”;
- a specific flight that happened on 2024-10-01, with ten minutes delay.
Carefully distinguishing between those three things greatly simplifies collaboration when designing and implementing the system.
Third, why do we use “HistoricalPrice” and not “Price”? Again, we want to avoid ambiguity in the sentence “This fruit has several (historical) prices in our database”. You can be confused and assume that there are, say, different prices for different groups of customers.
Thinking about how you would model this particular case could be a great exercise in data modeling.
Fourth, and final thing is the ambiguity of the word “selling”. “We sell watermelons for $10 per item”. But does it mean that anybody is buying them at this price? Maybe you would better use something like “offer”. I’ve seen a case where this particular realization became a small breakthrough for the person trying to write a complicated SQL query over undocumented database. So, sharing it here.
Conclusion
Keeping a history of some attribute is a very common requirement. For some reason people sometimes treat it as something advanced, almost a black box. In this article we showed how historized attributes could be designed and implemented in a systematic way.
This article demonstrates the approach explained in the upcoming book: https://databasedesignbook.com/. We lay out the logical model first, clarifying all the business-level requirements. Logical model is independent from any particular database server technology: relational, NoSQL, etc.
As a second step, we construct a traditional relational implementation of historized attributes. This step is very straightforward because most complexity is contained in the logical model. It is possible to build alternative physical implementations, based on the same logical model.
You can use this approach to design any database, no matter if you start from scratch or extend the existing schema.