Monday, May 11, 2020

Renaming Tables and Columns in PostgreSQL with a Zero-Downtime Pipeline

As any programmer will tell you, one of the hardest things in programming is naming things. There are many reasons for this, and we don't need to get into the weeds with it, but this leads to a fairly common scenario where something is given a poor name, and later on it needs to be refactored to be given a better name. (Or possibly it had a great name to begin with, but as the code evolved, the name became outdated. This leads to the same scenario. It needs to be renamed.)

But when it comes to databases, programmers frequently find that renaming things is hard. In addition to writing the migration to rename the table or column, you'll also need to scour all the queries in your code and make sure that you've changed all references. And if you missed something, you won't know about it until you or someone else happens to hit that query during runtime. (If things were done right, you'll have an automated test suite that you can run that will instantly tell you whether or not you missed something, but in the real world, less than ideal situations are common, and you very well might not have this luxury.)

And with zero-downtime deploys, things become even trickier. While in many programmers' minds they believe that all changes in a commit go out simultaneously, this actually is not the case. The process of deploying, which involves running migrations, bringing down old servers, starting up new servers, etc., takes time, and these parts can't happen all at once. As such, a zero-downtime deploy pipeline will often look something like this:
  1. The migration is run against the database
  2. The old application containers (of which there are many instances) are brought down one at a time and replaced with the new application containers in turn.
This results in a period of time where there are both old application servers and new application servers running. This time is short, but it is still there. And any user that is on an old application server after the migration is run will get errors because they are trying to reference the old table or column name.

Some programmers, knowing about these issues, will wait until off hours to run the deploy. In best case scenarios, this has you up late at night, outside of normal work hours, running a deploy. In worst case scenarios, where you have a global audience, this isn't even an option.

As a result of all these issues, I've frequently run into situations where programmers simply forgo renaming things in the database, which results in most of the code having the cleaner, clearer refactored name, but once you reach the database level, you're looking at the old ugly name.

To be quite frank, there had to be a better way to deal with this, so that sent me doing some looking and researching, and something that I discovered is that in Postgres views are updatable if they are kept simple enough. An article on this can be found here.

So with this being the case, we can use views to create aliases of sorts, and I'll give examples for table names and column names below.

But before I get to the examples, I'd like to address a concern that I've heard regarding views. The concern is that through using views the database will be drastically slowed down because the views do not have access to the indexes on the table. This is simply not true. Views in Postgres are implemented through rules, and the query planner is more than capable of taking a query referencing the view and the query embedded inside the view and combining and optimizing them. Now, if you have some rather complex views, or views referencing views, or other such things, then the query planner might not be intelligent enough to properly optimize it in the same way that a hand crafted query would, but that won't be a concern for the very simplistic views that we use below.

Renaming a Table


Let's walk through an example of renaming a table. In the query below I create a table with the name "old_table_name" which I'll plan on refactoring to the name "new_table_name". I also insert some basic data in to this table.
create table old_table_name (
    id serial primary key,
    some_value text not null
);

insert into old_table_name (some_value) values
('Value 1'),
('Value 2'),
('Value 3'),
('Value 4'),
('Value 5'),
('Value 6'),
('Value 7'),
('Value 8'),
('Value 9'),
('Value 10');
Now as a way verify that the migrations that we'll write won't be causing problems, let's insert a large amount of data into this table. By running the following query 19 times we'll end up with over 5 million rows in the table.
insert into old_table_name (some_value)
select some_value
from old_table_name;

select count(*)
from old_table_name;
Additionally, here are some crud operation queries that we can use to represent the kind of queries that we'll find in the code. We'll want to make sure that these don't break during the migration. Note that I've intentionally designed these four queries so that if they are run in order, it will return the table back to it's initial state. This gives us a way to repeatedly test our crud operations.
select id, some_value
from old_table_name
where id = 1;

update old_table_name
set some_value = 'New Value'
where id = 1
returning id, some_value;

delete from old_table_name
where id = 1;

insert into old_table_name (id, some_value) values
(1, 'Value 1')
returning id, some_value;
Now, for our first migration. We'll add a view with the "new_table_name" that we're wanting to use and have it reference the table using the "old_table_name". This should put us in a situation where queries using the "old_table_name" and queries using the "new_table_name" will both work. After running this migration, we can rerun the above crud operation queries and verify that they still work. Note that this migration ran in 50 milliseconds (times will of course vary).
create view new_table_name as
select id, some_value
from old_table_name;
This then allows us to update the queries in the code at our leisure, whether that's part of the deploy with the above migration, or in a subsequent migration, or even many subsequent migrations. This puts us in a nice situation where all renames are not immediately required. We'll discuss the benefits of this later. In any case, over time you'll want to update all of your queries. We can take the above four crud operation queries and update them to use the "new_table_name" and verify that the queries work:
select id, some_value
from new_table_name
where id = 1;

update new_table_name
set some_value = 'New Value'
where id = 1
returning id, some_value;

delete from new_table_name
where id = 1;

insert into new_table_name (id, some_value) values
(1, 'Value 1')
returning id, some_value;
Once you're certain that you've updated all queries referencing the table to use the "new_table_name", then you can run the final migration to drop the view and update the table name. Note that you'll want to run these two queries inside of a transaction. Be sure that you understand how your migration tool works and how it handles transactions, because it may not mean explicitly running "begin;" and "commit;" as I'm showing below. In my tests dropping the view took 45 milliseconds, and renaming the table took 45 milliseconds, so even though this migration will lock up the table with an access exclusive lock, everything will still work fine, because the amount of time is inconsequential.
begin;
drop view new_table_name;

alter table old_table_name rename to new_table_name;
commit;
After running the migration, the queries referencing the "new_table_name" work, and any queries referencing the "old_table_name" will not.

Renaming a Column


This will have a number of similarities to table renaming, along with some key differences. So to set up our example, we'll create a table with a column named "old_column_name" that we intend to update to "new_column_name", and we'll add some data.
create table table_name (
    id serial primary key,
    old_column_name text not null
);

insert into table_name (old_column_name) values
('Value 1'),
('Value 2'),
('Value 3'),
('Value 4'),
('Value 5'),
('Value 6'),
('Value 7'),
('Value 8'),
('Value 9'),
('Value 10');
Once again, just to verify that our migrations won't cause issues with large amounts of data, we'll run the following query 19 times to give ourselves over 5 million rows.
insert into table_name (old_column_name)
select old_column_name
from table_name;

select count(*)
from table_name;
And here are our crud operation queries referencing the "old_column_name", which will be representative of queries in our application code:
select id, old_column_name
from table_name
where id = 1;

update table_name
set old_column_name = 'New Value'
where id = 1
returning id, old_column_name;

delete from table_name
where old_column_name = 'New Value';

insert into table_name (id, old_column_name) values
(1, 'Value 1')
returning id, old_column_name;
At this point we can run a migration that gives the table a temporary name, and creates a view with the table name, that provides columns with both the "old_column_name" and the "new_column_name", where the "new_column_name" is just an alias to the "old_column_name". Note that These two queries should be run inside of a transaction. In my tests, the table rename ran in 45 milliseconds, and the create view ran in 47 milliseconds. After the migration runs, you can verify that the above crud operation queries still work.
begin;
alter table table_name rename to temp_table_name;

create view table_name as
select id, old_column_name, old_column_name as new_column_name
from temp_table_name;
commit;
At this point we can change the queries above to reference the "new_column_name" and verify that they still work.
select id, new_column_name
from table_name
where id = 1;

update table_name
set new_column_name = 'New Value'
where id = 1
returning id, new_column_name;

delete from table_name
where new_column_name = 'New Value';

insert into table_name (id, new_column_name) values
(1, 'Value 1')
returning id, new_column_name;
Once we are certain that all queries have been updated to reference the "new_column_name", we can then run a migration to drop the view, rename the table back to its original name, and rename the column to the "new_column_name". Once again, these queries should be run inside of a transaction. In my tests, dropping the view took 45 milliseconds, altering the table name took 46 milliseconds, and renaming the column took 45 milliseconds.
begin;
drop view table_name;

alter table temp_table_name rename to table_name;

alter table table_name rename old_column_name to new_column_name;
commit;
After this migration is run, then only the queries referencing the "new_column_name" will work, and the ones referencing the "old_column_name" will no longer work.

So with the how to out of the way, let's discuss a few benefits. The first benefit is that by breaking thing into at least two separate deploys, the first with the first migration and the changes to the queries, and then a follow up with the final migration, we are able to rename a table or column, and have a zero-downtime deploy go off without a hitch. During that short period of time where there are still old services making database calls in the old format as well as new services using the new format, everything will work without issue.

But what's perhaps even more beneficial here is the fact that you actually don't need to update all the queries at once. You could have one deploy that only runs the first migration that adds the view, then after that you could have one or many deploys updating the queries in the code, and these updates can be spread over time. Then once you know that all necessary queries have been updated, you could run the final migration removing the view and making the final changes to the table. There are multiple benefits that can come from this.

The first of these benefits is small batch size. There are many sources out there that discuss the benefits of small batch size, which involve fewer bugs, code that is easier to review and deploy, and the ability to be more responsive to changing priorities. And this gives us the ability to do small batches. Instead of being forced to update all queries in the database all at once, I can instead make smaller changes, where perhaps a deploy simply changes one or two queries. It gives the developer more control and say over how to handle things.

Another benefit that can come from this is that it can help with those scenarios where you're not sure if you managed to find and update all of the references. There is nothing that says that that final migration needs to be run immediately. It could instead make sense to have a period of time where you wait and you monitor logs to see if there are any queries coming through that are referencing the old names (this is assuming that you are logging the queries that are run against your database). Once you've had a period of time where no references have been made to the old name, then you can run the final migration with some level of confidence that nothing is going to break.

And so with that, we have a strategy for renaming things at the database level that is fairly straightforward and simple to implement that gives us the flexibility that we need to do the job with a level of confidence that we won't be breaking things in the process.