Dead Letter Notes
Notes on data pipelines, message queues, and the failure modes in between.

Backfilling a partitioned table without taking down production

Backfills are where careful pipelines go to die. You changed a transformation, or fixed a bug, and now three years of history need to be recomputed. The naive approach — one giant INSERT ... SELECT over the whole range — locks things up, blows out your temp space, and if it fails at hour six you get to start over. Here's the boring, reliable way I do it now.

Partition by time, iterate by partition

If the table is partitioned by day or month (and a table big enough to need a careful backfill almost always should be), treat each partition as an independent unit of work. Recompute one partition at a time. Each one is small enough to fit in memory, fast enough to retry, and isolated enough that a failure in April 2024 doesn't touch March.

Write to a shadow, then swap

Rather than mutating the live partition in place, I build the corrected data in a staging table and swap it in atomically. In Postgres, partition exchange makes this nearly instant and keeps readers happy:

-- build the corrected partition off to the side
CREATE TABLE events_2024_04_new (LIKE events INCLUDING ALL);
INSERT INTO events_2024_04_new
SELECT * FROM recompute_source WHERE day >= '2024-04-01' AND day < '2024-05-01';

-- swap: detach the old partition, attach the new one
ALTER TABLE events DETACH PARTITION events_2024_04;
ALTER TABLE events ATTACH PARTITION events_2024_04_new
  FOR VALUES FROM ('2024-04-01') TO ('2024-05-01');

Readers see the old data until the instant of the swap, then the new data. No half-written partitions, no long-held locks over millions of rows.

Make it resumable and rate-limited

Drive the loop from a small control table that records which partitions are done, running, or pending. If the job dies, it picks up where it left off instead of restarting. And put a deliberate sleep between partitions — a backfill that saturates your I/O will page someone at 3am even if it technically "isn't touching" production reads. Slower and invisible beats fast and disruptive every time.

postgresbackfillpartitioning