Monday, January 13, 2020

Caveat for Optimizing PostgreSQL Migrations

In last week's post I wrote about minimizing ACCESS EXCLUSIVE lock time on a table when writing a migration, and gave the example of taking a migration that looked like this:
alter table people
add column if not exists guid varchar(50) default uuid_generate_v4() not null;

create index if not exists people_guid_index
    on people using btree(guid);
And changing it to this:
alter table people
add column if not exists guid varchar(50);

alter table people
alter column guid set default uuid_generate_v4();

update people set guid = uuid_generate_v4()
where guid is null;

alter table people
add constraint temp_null_check check ( guid is not null ) not valid;

alter table people
validate constraint temp_null_check;

alter table people
alter column guid set not null;

alter table people
drop constraint temp_null_check;

create index concurrently if not exists people_guid_index
    on people using btree(guid);
Because on our example data set the first version caused the people table to have 1 m 6 s 455 ms of ACCESS EXCLUSIVE lock time and 25 s 751 ms of SHARE lock time during the migration, which meant that for over a minute, no select, update, insert, or delete queries could be run against the people table, and then for nearly another 30 seconds following that select queries could run, but update, insert, and delete queries still could not. Alternatively, the second version resulted in 35 ms of ACCESS EXCLUSIVE lock time, 41 s 450 ms of SHARE UPDATE EXCLUSIVE lock time, and 1 m 46 s 636 ms of ROW EXCLUSIVE lock time, which meant that your select, update, insert, and delete queries would only be blocked from running for a total of 35 milliseconds.

Now one important caveat that I'd like to mention here is that this only matters if it's important to still be able to run queries against the table during the migration. This is typically important if you have a zero-downtime deployment pipeline in place. But I have worked at places where the deployment pipeline would intentionally take the servers down before running the migrations, and then bring them back up afterwards. Now, is this an ideal pipeline? No, not really. But many times in the real world we need to work in less than ideal situations. While it would be nice to fix the pipeline in such a situation, it is likely to take a lot of time and effort that may or may not be able to spare at the current time, and it is unrealistic to expect all other work to halt until the pipeline is addressed.

With that being said, which of the two versions of the migration given above would be better in the situation with the pipeline described? The first version. Why? Because at this point we don't care about the lock times. What we care about is how long the server will be down, and the first version's total run time is 1 m 32 s 206 ms, whereas the second version's total run time is 2 m 28 s 121 ms. As such, the second version would cause the downtime to be nearly a minute longer.

It is always important when optimizing to know which metrics are important to optimize to.