statement_timeout 0: Dirty plates, dead cooks, and kitchen fires

so you got tired of alerts that your queries were timing out. that’s bad because alerts are good even though nobody likes the alerts. my mom didn’t like acknowledging she might have cancer either so she just ignored the headaches and vertigo and balance issues and then it was like ope stage IV brain cancer.

the name makes it sound like a fix. it’s not. removing the timeout doesn’t make the slow queries faster, it just lets them run forever, and “forever” on a 1tb db with bad queries is how you lose the restaurant.

let me explain what i mean by restaurant.

the kitchen

think of postgres like a restaurant’s kitchen. food goes out as it’s done, not waiting for the plates to be done. the house rule is that if a cook takes more than 5 minutes (statement_timeout) on an order, kill them (pg_terminate_backend) and tell the customer they died but we got a new one already. the kitchen stays functional.

above is what happens when u dont kill the cook.

and the side effects compound. it’s not “one bad query takes a long time, oh well.” it’s “one bad query slowly kills every other query in the kitchen.” here’s the order things go bad in.

the slow order doesn’t get faster

this is obvious. it’ll just sit on the stove.

but whoever is cooking it can’t take other orders cuz they only get one burner. each burner is a backend process—pg forks a process per connection, and that process is pinned to whatever query it’s running until that query finishes or gets killed. because kitchen has a fixed number of burners (max_connections / pgbouncer pool size), once they’re all stuck on slow orders, every new customer waits at the door.

from outside it looks like the restaurant is closed even though it’s “running normally”—pg_stat_activity shows like 6 active queries and you go “wtf the db is fine” but the app is throwing connection pool timeouts because there’s nothing left to hand out.

this is ok because we have 1000 burners and 1000 cooks. but that’s not really okay—but we’ll pretend like it’s okay because i’m more concerned about other things.

the busser is locked out

because nothing has left the kitchen, the dishwasher is just chillin’ on his phone waiting to do something.

pg has a background process that cleans up after itself—a busser clearing plates (autovacuum). when you UPDATE or DELETE a row, pg doesn’t actually remove the old version, it just marks it dead and leaves it there. mvcc. autovacuum comes through later and reclaims the space.

but the busser is only allowed to clear plates from tables where everyone has already left—specifically, dead tuples newer than the oldest running transaction’s xmin can’t be cleaned up, because that transaction might still need to see them. if one table has been sitting there for 3 hours (one query running for 3 hours), the busser can’t clear any plates from the last 3 hours, anywhere in the restaurant.

dirty plates pile up. the cooks have to walk around them. everything gets slower. the slower it gets, the more tables stay seated for hours, the more plates pile up. we’ve been in this death spiral before; it took weeks to recover from.

why your 1-second query suddenly takes a minute

when the busser can’t keep up, every table has dirty plates stacked on it. a waiter who used to grab one clean plate now has to dig through 60 dirty ones to find it.

nothing about the waiter changed. the plate they want is still there. but the pile on top of it grew.

or, translated: a SELECT COUNT(*) that used to scan 10k live rows now scans 10k live rows plus 600k dead ones pg can’t throw away yet. it has to read every dead tuple, check the xmin/xmax to figure out it’s invisible to this transaction, and skip it. (the “heap” is just the pile of rows on disk for a table—live and dead mixed together. pg has to walk the whole pile.) same query, same code, same indexes—60x slower because the heap is full of bloat nobody’s allowed to vacuum while the long query is running.

and it’s not just the bad query (the one you “optimized” using statement_timeout) that gets slow. it’s everyone’s queries against that table. the bug isn’t in the slow query. it’s in everything around it.

(side note: this is why pg_stat_user_tables.n_dead_tup is the number i look at first, not query duration. by the time query duration spikes, the dead tuples already told you what was coming.)

the walk-in cooler

this is where it gets worse and where the analogy earns its keep, because i want you to understand DataFileRead without making you read the postgres source.

shared_buffers is the prep counter—the small workspace next to the stove where cooks keep the ingredients they’re using right now. it’s fast because everything is within arm’s reach. the walk-in cooler in the back is disk. ingredients in the cooler are slower to grab because somebody has to walk back there, open the door, find the thing, walk it back to the line.

in a healthy kitchen, almost everything a cook needs is already on the prep counter. they grab it and go. fast.

now bloat shows up. every shelf in the walk-in has 60x more stuff on it, and most of it is rotten food nobody’s allowed to throw out because the busser is locked out (probably again, since you’re reading this). the prep counter is still the same size. it doesn’t fit the working set anymore. cooks who used to find their ingredients on the counter now have to walk to the walk-in for every order, dig through piles of rotten food to find the one good onion, and walk it back.

they’re not cooking. they’re walking. and standing in front of the cooler with the door open.

DataFileRead is pg telling you “this cook is currently standing in front of the walk-in waiting for an ingredient.” you can see it in pg_stat_activity—there’s a column called wait_event, and when it says DataFileRead, that backend is not doing work, it’s waiting on the kernel to hand it a page from disk. in a healthy db you barely see this column populated. in a bloated one, every cook in the kitchen is stuck on it. you open pg_stat_activity during the incident and the wait_event column is just DataFileRead, DataFileRead, DataFileRead all the way down.

turtles.

the buffer cache hit ratio is the metric for “what percent of the time did the cook find what they needed on the prep counter vs having to walk to the cooler.” healthy kitchen, 99%. bloated kitchen, less than 93%. the difference between those two numbers sounds small. it isn’t, watch: 99% uptime vs 93% uptime—your business is dead.

grabbing something off the prep counter is ~100 nanoseconds. walking to the walk-in is ~100 microseconds. that’s 1000x slower. drop the hit ratio from 99% to 60% and your average query gets dramatically slower without anything in the app or the query plan changing at all.

(the query plan is just pg’s recipe for how to cook the order. same recipe. same ingredients. same steps. just way more walking.)

and this is the part that makes people lose their minds during the incident: nothing changed. no deploy. no new code. wow, weird, nothing changed! the recipe must still be the same because the indexes are the same and my query certainly didn’t change because it was always select * from users.

but your graphs are vertical and everyone is on a call. and the answer is: the heap got fat (more dead rows piled up on the shelves), the prep counter didn’t, and now every cook is walking to the cooler for every order. the only thing that changed is the busser stopped being allowed to throw out rotten food, because of the 3-hour query nobody noticed.

DataFileRead in pg_stat_activity.wait_event is the smoking gun. it’s pg pointing at the cooler door and saying “this is where your latency went.”

one slow table blocks the whole restaurant

if someone’s slowly eating at table 7 (long SELECT holding AccessShareLock) and a contractor shows up to repaint table 7 (ALTER TABLE wanting AccessExclusiveLock), the contractor waits.

and then every new customer who wants table 7 lines up behind the contractor—even the ones who’d be done in 2 minutes—because pg’s lock queue is fifo. doesn’t matter that a fresh SELECT only needs AccessShareLock and would be compatible with the running one; it’s behind the ALTER, so it waits.

one slow eater freezes the whole section.

translated: one slow query blocks a migration, and the migration blocks every other query on that table. deploys hang. writes hang. and the worst part is you’ll see the waiting queries in pg_stat_activity and think those are the problem, when really it’s the 3-hour SELECT at the front of the queue.

spilled food keeps spilling

when a query needs more memory than work_mem allows, it spills to disk—hash joins, sorts, ctes, the usual suspects. those temp files live in base/pgsql_tmp on the data volume.

right now those queries get cut off before they finish spilling. without the timeout they finish. and a single bad join can write tens of gigs.

we’ve already had an outage this year from running out of floor space. this is a direct path to another one. and pg_stat_tmp doesn’t alert on “about to fill the disk,” it just fills it. the next thing that happens is pg can’t write wal, replication breaks, and now you’re in a real outage instead of a slow one.

you’re firing the smoke alarm because it keeps going off

right now slow queries show up in sentry as PG::QueryCanceled with a stack trace pointing at the exact line of code. that’s free debugging info—pg is literally telling you which activerecord call is the problem.

turn off the timeout and the alerts disappear because there’s no cancellation to raise. the queries don’t get faster. you just stop seeing them.

and the symptom isn’t gonna be “the bad query is slow,” it’s gonna be “everything is slow,” because the bloat pileup and the cache evictions affect every query touching those tables. way harder to debug at 3am from pg_stat_activity than from a stack trace pointing at the controller action.