A single idle-in-transaction connection can starve every table in your database. Most senior Rails devs don’t know that.
You learned MVCC the way most of us did—”writers don’t block readers,” concurrent transactions get isolated snapshots, UPDATE doesn’t lock the table. That’s marketing copy. The operational reality is that those snapshots have a cost, and the cost compounds across every table at once if one connection gets stuck.
I’m going to try to translate what this mental model is that I’ve built while building and operating a Rails-based data pipeline that handles 10TB/day. Not “what is MVCC”—you’ve heard that—but instead consequences of MVCC. You know, the ones that turn a quiet Monday into a panic akin to “why is production at 95% disk.”
This application of mine hangs out on a single server.
The application
Before you’re like “omg Rails” I will disclose that ActiveRecord is only used at the read layer for the user-facing API, for its database connection, and the orchestration of the pipeline. The work is largely done in SQL.
I have a nice little helper that I’ve enjoyed using. You may enjoy it too; here are the good parts:
module CoolBeans
class SQL
class Error < StandardError; end
class UnknownProfile < Error; end
class AlreadyInTransaction < Error; end
class InvalidSetting < Error; end
class << self
def execute(profile, connection: ActiveRecord::Base.connection, &block)
raise ArgumentError, "block required" unless block
new(profile: profile, connection: connection).call(&block)
end
def settings_for(profile)
config.fetch(profile.to_s) do
raise UnknownProfile, "no profile=#{profile} in config/sql.yml for env=#{Rails.env}"
end
end
def profiles
config.keys
end
def reload!
@config = nil
end
def config
@config ||= Rails.application.config_for(:sql).to_h.deep_stringify_keys
end
end
def initialize(profile:, connection:)
@profile = profile.to_s
@connection = connection
@settings = self.class.settings_for(@profile)
end
def call
if connection.transaction_open?
raise AlreadyInTransaction,
"SQL.execute(#{profile.inspect}) cannot nest inside an existing transaction; " \
"SET LOCAL would leak past this block. Call it at the top of the unit of work."
end
logger = ActiveRecord::Base.logger
original_level = logger&.local_level
begin
if logger
logger.silence(Logger::ERROR) do
connection.transaction do
apply_settings!
logger.silence(original_level) do
yield(connection)
end
end
end
else
connection.transaction do
apply_settings!
yield(connection)
end
end
ensure
ActiveSupport::Notifications.unsubscribe(subscriber) if subscriber
warn_if_insert_heavy(stats)
end
end
private
attr_reader :profile, :connection, :settings
def apply_settings!
settings.each do |key, value|
connection.execute("SET LOCAL #{key} = #{connection.quote(value.to_s)}")
end
end
module ConnectionMixin
def tuned(profile, &block)
proxy = Proxy.new(self, profile)
return proxy unless block
proxy.call(&block)
end
end
class Proxy
DELEGATED = %i[
execute exec_query exec_insert exec_update exec_delete
select_all select_one select_value select_values
].freeze
def initialize(connection, profile)
@connection = connection
@profile = profile
end
def call(&block)
SQL.execute(@profile, connection: @connection, &block)
end
DELEGATED.each do |m|
define_method(m) do |*args, **kwargs, &blk|
SQL.execute(@profile, connection: @connection) do |c|
c.public_send(m, *args, **kwargs, &blk)
end
end
end
end
end
end
That fancy config file:
development:
bulk_import:
statement_timeout: 0
lock_timeout: 0
synchronous_commit: "off"
maintenance_work_mem: "16GB"
work_mem: "512MB"
max_parallel_maintenance_workers: 8
max_parallel_workers_per_gather: 8
promote:
statement_timeout: 0
lock_timeout: 0
synchronous_commit: "off"
session_replication_role: "replica"
maintenance_work_mem: "16GB"
work_mem: "512MB"
max_parallel_maintenance_workers: 8
max_parallel_workers_per_gather: 8
create:
statement_timeout: 0
lock_timeout: 0
synchronous_commit: "off"
session_replication_role: "replica"
maintenance_work_mem: "16GB"
work_mem: "32GB"
max_parallel_maintenance_workers: 8
max_parallel_workers_per_gather: 8
production:
bulk_import:
statement_timeout: 0
lock_timeout: 0
synchronous_commit: "off"
maintenance_work_mem: "16GB"
work_mem: "512MB"
max_parallel_maintenance_workers: 8
max_parallel_workers_per_gather: 8
promote:
statement_timeout: 0
lock_timeout: 0
synchronous_commit: "off"
session_replication_role: "replica"
maintenance_work_mem: "16GB"
work_mem: "512MB"
max_parallel_maintenance_workers: 8
max_parallel_workers_per_gather: 8
create:
statement_timeout: 0
lock_timeout: 0
synchronous_commit: "off"
session_replication_role: "replica"
maintenance_work_mem: "16GB"
work_mem: "32GB"
max_parallel_maintenance_workers: 8
max_parallel_workers_per_gather: 8
Before we get into things
A lot of this becomes irrelevant if your scale is not scale.
MVCC, compressed
Every row has two hidden columns: xmin (the transaction that wrote it) and xmax (the transaction that deleted or superseded it). A transaction reads a row when xmin is committed and visible to its snapshot, and xmax either isn’t set or isn’t visible. That’s how postgres avoids locking readers against writers—they’re looking at different versions of the same rows, decided by visibility rules instead of locks.
UPDATE doesn’t modify a row in place. It writes a new tuple, sets the old one’s xmax, and leaves the old version on the page until vacuum proves nobody can still see it. DELETE does even less—it sets xmax and walks away. The space stays allocated. Vacuum is the garbage collector. Without it, the heap grows monotonically and the indexes follow.
That’s the model. The interesting question is what happens when vacuum can’t or won’t do its job.
The xmin horizon and cluster-wide vacuum suckage
Vacuum can only reclaim a dead tuple if it’s invisible to every active snapshot in the database. Not just transactions touching this table—every transaction in the cluster. Postgres tracks the oldest xmin still in use as the xmin horizon, and refuses to remove any tuple newer than that horizon.
So: one transaction starts at 9am, holds the snapshot, and goes idle. At 9:01, a thousand UPDATEs land on users. At 9:02, ten thousand UPDATEs land on events. Vacuum runs all afternoon, looks at every dead tuple in every table, and skips them. They’re newer than the 9am horizon.
Every table bloats. Indexes bloat worse. Replicas slow because they’re replaying the WAL backlog of those updates and can’t trim WAL until the primary tells them it’s safe. Disk fills. Plans degrade. The query that was 8ms last week is 400ms now and the indexes look fine in \d.
The cause is usually one of:
- A Rails console someone left open with a
transaction doand abinding.pryon the stack - A
pg_dumprunning unexpectedly long against the primary - A worker that began a transaction, raised before commit, and ended up in a connection state where the rollback never landed
- A long analytics query someone runs from a notebook against production
The diagnostic—the one query you should be able to type from memory in an outage:
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY duration DESC;
If anything shows up with a duration measured in hours, that connection is your xmin horizon, and your bloat is its fault. The fix is pg_terminate_backend(pid). The follow-up is idle_in_transaction_session_timeout set to something sane—five or ten minutes—so the database protects itself when you forget.
This is the most underused safety net in postgres. The default is zero. Zero means “let one developer’s rails console eat the cluster.”
WAL, checkpoints, and the bulk-import storm
Every write goes to the write-ahead log first. The actual data pages get dirtied in shared buffers and written to disk later, at checkpoint time. WAL is sequential and fast. Random heap writes are not. That’s the trick that lets postgres survive a crash and also keeps writes cheap.
Checkpoints flush all dirty buffers to disk so WAL up to that point can be discarded. They run on a schedule (checkpoint_timeout) or when WAL volume hits max_wal_size, whichever comes first. The schedule is gentle. The volume trigger is not.
Here’s the failure mode: You start a bulk import; a worker is COPYing into staging tables and you’re generating WAL faster than the schedule can absorb. You hit max_wal_size halfway between scheduled checkpoints, postgres triggers a forced checkpoint, IO saturates, and every query holding a buffer pin waits its turn behind the flush. Latency on totally unrelated requests spikes for ten or twenty seconds.
Pagers scream.
The fix is configuration, not architecture:
-
max_wal_sizebig enough that your peak write rate doesn’t hit it between scheduled checkpoints. I use 48GB. That’s not a typo—modern disks are cheap, and the alternative is checkpoint storms. -
checkpoint_timeoutlong enough that the schedule does the work. I use 30 minutes. -
checkpoint_completion_target = 0.9so postgres spreads the flush across 90% of the interval instead of bunching it.
You’ll see the symptom in pg_stat_bgwriter: checkpoints_req (forced) climbing while checkpoints_timed (scheduled) stays flat. If checkpoints_req is more than a small fraction of checkpoints_timed, your max_wal_size is too small.
The thing nobody tells you: bigger max_wal_size is also disk space. 64GB of WAL means 64GB of disk reserved for WAL. Budget accordingly.
The lock graph that actually matters
Postgres has eight lock modes. You don’t need to memorize the table. You need to know:
-
AccessExclusiveblocks everything, including reads. This is whatALTER TABLE,DROP TABLE, and most non-CONCURRENTLYindex operations take. If you take it on a hot table during peak traffic, your app stops. You may have done this. -
ShareUpdateExclusiveis whatVACUUM,ANALYZE, andCREATE INDEX CONCURRENTLYtake. It conflicts with itself. You cannot run twoVACUUMon the same table at once. Autovacuum can be blocked by your manualVACUUM, and vice versa. -
RowExclusiveis whatUPDATE,DELETE,INSERTtake. It conflicts with locks meant for schema changes, but not with itself. That’s why your app keeps writing while another connection is also writing.
That’s most of what matters in production. The other modes exist mainly because schema changes need to coordinate with reads in specific ways. The two consequential rules: AccessExclusive stops the world, and ShareUpdateExclusive serializes maintenance with itself.
The Rails-shaped trap: a migration that runs add_column with a default value on a table with millions of rows, in a transaction. AccessExclusive for the duration of the rewrite. Every read on the table queues. PG11+ made the constant-default case cheap—adding a column with a literal default no longer rewrites the table — but ALTER COLUMN TYPE, adding NOT NULL the wrong way, and indexes without CONCURRENTLY all still rewrite.
The blocker tree
When the app feels stuck—requests piling up, response times climbing, no obvious culprit in the application logs—the question is “who’s blocking who.” Postgres ships the answer.
SELECT
blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocking.pid AS blocking_pid,
blocking.query AS blocking_query,
now() - blocking.xact_start AS blocking_duration
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';
pg_blocking_pids returns the array of pids that are blocking a given pid. The join expands the graph. In practice the output is short—usually one or two blockers at the root, with a fan of blocked queries waiting on them. The root is what you kill.
I keep this in a snippet file. So should you. The day you need it, you don’t have time to look it up.
What you do with the result is the harder question. pg_cancel_backend(pid) is polite—it asks the query to stop. pg_terminate_backend(pid) is not—it kills the connection and rolls the transaction back. Polite first, terminate second. If the polite kill doesn’t take inside a few seconds, the query is in an uninterruptible kernel call (usually IO) and you need the hammer.
TOAST and the silently-slow big-column problem
Postgres pages are 8KB. A row larger than the page can’t fit. TOAST—The Oversized Attribute Storage Technique (postgres devs have a sense of humor)—handles this by compressing large values, breaking them into chunks, and storing the chunks in a side table. The main row gets a pointer.
The compression and chunking are transparent. The performance characteristics are not.
A text or jsonb column that grows from kilobytes to megabytes—say, a vendor response payload you started caching—quietly turns every SELECT against the row into an extra round of TOAST chunk fetches. The query plan looks identical. EXPLAIN shows the same Index Scan. The actual IO doubles or triples.
The symptoms:
- Bitmap Heap Scan reading way more buffers than rows
- A query that was fast last quarter is slow now and nothing about it changed
- Disk read volume on the table is wildly disproportionate to row count
The fixes, in order of effort:
-
ALTER TABLE x ALTER COLUMN big_jsonb SET STORAGE EXTERNALto skip compression if it isn’t helping -
ALTER TABLE x ALTER COLUMN big_jsonb SET COMPRESSION lz4(PG14+, I think?) for faster decompression - Stop selecting the column when you don’t need it. If your AR model has a
payloadcolumn you only need on detail pages, setself.ignored_columns = %w[payload]on a slim subclass for list queries. Or useselect(:id, :name, ...)everywhere it matters. - Move the column to a side table you join in only when needed. Sounds redundant—postgres already moved it—but the catalog lookup, planner cost, and chunked decompression are real overhead avoided when the column isn’t selected at all.
The deeper lesson: a jsonb column is not free, and the cost is hidden under a layer that doesn’t show up in EXPLAIN until you read it carefully.
Transaction ID wraparound
Every transaction gets a 32-bit ID. Postgres needs xmin to determine row visibility. 32 bits gives you about 4 billion transactions, after which the counter wraps. To keep visibility correct across the wrap, postgres “freezes” old rows—marks them as definitively visible to everyone—so their original xmin doesn’t matter anymore. Vacuum does the freezing.
If autovacuum can’t keep up—usually because it’s blocked, throttled, or starving—the database approaches wraparound. At a configurable threshold postgres starts running aggressive anti-wraparound vacuums. If those can’t keep up, postgres shuts down the database to prevent corruption. This has happened to enough big companies to be a meme.
The query you should run on a schedule:
SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY 2 DESC;
Default autovacuum_freeze_max_age is 200 million. A reasonable alarm is 500 million. Panic at 1.5 billion. If you’re seeing 1.5B and rising, something has been blocking autovacuum for a long time—usually that idle-in-transaction connection from earlier, or a manual VACUUM holding ShareUpdateExclusive on a huge table, or autovacuum workers maxed out on something else.
This is one of those operational concerns Rails developers never see until they own a database. Then they see it once and never forget.
The catalog as data
For whatever reason we don’t believe that the postgres catalog is not metadata. It’s tables. Real tables. With rows. You can query them. It’s like the postgres devs are smart.
pg_class is one row per table, index, sequence, or other relation. pg_attribute is one row per column, across every relation. pg_index is one row per index with the column references and the predicate for partials. pg_proc is one row per function and stored procedure.
These aren’t an API surface that postgres “exposes”—they’re how postgres stores its own bookkeeping, and you can SELECT from them like anything else. You can JOIN them. You can build views over them. You can point an ActiveRecord model at them.
A pattern from a codebase I work in:
module StoredProcedures
class BaseStoredProcedure < ApplicationRecord
self.table_name = "pg_proc"
self.primary_key = "oid"
end
end
StoredProcedures::BaseStoredProcedure.where("proname LIKE 'tasker_%'") returns every stored procedure prefixed with tasker_ in the catalog. AR doesn’t care that the table happens to be the function catalog—it’s a table with columns, AR is happy.
(I use this to manage migrations for stored procedures, which I’m usually against but at the scale of this application, I love saving 10ms across inserts because I have a couple million to do a day.)
What this enables, once you accept the framing:
- A migration that asserts a stored procedure exists with the right signature, written as an AR query
- A test that fails when a function definition drifts from what your code expects
- A diagnostic page in your admin tool that lists every function with its source, fed by
pg_get_functiondef(oid) - A check-on-boot that loads function definitions from disk and compares them to the catalog, surfacing manual edits
The framing scales further. A StagingTable model in the same codebase points at pg_stat_user_tables for live row counts and last-vacuum timestamps. It’s an AR model whose underlying table is a postgres-maintained statistics view. The implementation is two lines. The capability is “show me the freshness of every staging table on the dashboard, with sorting and search, for free.”
Maybe I need to create some sort of gem that exposes these things.
Beginners don’t realize you can do this. Many seniors haven’t tried. It costs almost nothing to start, and it changes how you think about the database—the catalog isn’t a black box you query through \d or DBeaver. It’s data. Your application can read it and reason about it like anything else.
What this buys you
The compressed version:
- One idle transaction can bloat every table in the cluster. Set
idle_in_transaction_session_timeout. Run thepg_stat_activityquery when something feels off - Bulk imports cause checkpoint storms. Bigger
max_wal_size, longercheckpoint_timeout, completion target 0.9 -
AccessExclusivestops the world. Most migrations should not take it on hot tables during peak - The blocker tree CTE is your “the app feels stuck” diagnostic. Memorize it or keep it in a snippet
- Big
jsonbcolumns turn into TOAST overhead that doesn’t show in EXPLAIN. Stop selecting them when you don’t need them - Wraparound is real. Monitor
age(datfrozenxid). Default thresholds will not save you if autovacuum is blocked - The catalog is data
None of this is exotic. It’s the operational side of postgres that the docs cover but the Rails ecosystem doesn’t talk about. Senior devs may have heard the words, but knowing what they cost in production is the difference.