How Postgres works at scale - Josh Brody How Postgres works at scale | Josh Brody
Back

How Postgres works at scale

A single idle-in-transaction connection can starve every table in your database. Most senior Rails devs don’t know that.

You learned MVCC the way most of us did—”writers don’t block readers,” concurrent transactions get isolated snapshots, UPDATE doesn’t lock the table. That’s marketing copy. The operational reality is that those snapshots have a cost, and the cost compounds across every table at once if one connection gets stuck.

I’m going to try to translate what this mental model is that I’ve built while building and operating a Rails-based data pipeline that handles 10TB/day. Not “what is MVCC”—you’ve heard that—but instead consequences of MVCC. You know, the ones that turn a quiet Monday into a panic akin to “why is production at 95% disk.”

This application of mine hangs out on a single server.

The application

Before you’re like “omg Rails” I will disclose that ActiveRecord is only used at the read layer for the user-facing API, for its database connection, and the orchestration of the pipeline. The work is largely done in SQL.

I have a nice little helper that I’ve enjoyed using. You may enjoy it too; here are the good parts:

module CoolBeans
  class SQL
    class Error < StandardError; end
    class UnknownProfile < Error; end
    class AlreadyInTransaction < Error; end
    class InvalidSetting < Error; end

    class << self
      def execute(profile, connection: ActiveRecord::Base.connection, &block)
        raise ArgumentError, "block required" unless block

        new(profile: profile, connection: connection).call(&block)
      end

      def settings_for(profile)
        config.fetch(profile.to_s) do
          raise UnknownProfile, "no profile=#{profile} in config/sql.yml for env=#{Rails.env}"
        end
      end

      def profiles
        config.keys
      end

      def reload!
        @config = nil
      end

      def config
        @config ||= Rails.application.config_for(:sql).to_h.deep_stringify_keys
      end
    end

    def initialize(profile:, connection:)
      @profile = profile.to_s
      @connection = connection
      @settings = self.class.settings_for(@profile)
    end

    def call
      if connection.transaction_open?
        raise AlreadyInTransaction,
              "SQL.execute(#{profile.inspect}) cannot nest inside an existing transaction; " \
                "SET LOCAL would leak past this block. Call it at the top of the unit of work."
      end

      logger = ActiveRecord::Base.logger
      original_level = logger&.local_level

      begin
        if logger
          logger.silence(Logger::ERROR) do
            connection.transaction do
              apply_settings!
              logger.silence(original_level) do
                yield(connection)
              end
            end
          end
        else
          connection.transaction do
            apply_settings!
            yield(connection)
          end
        end
      ensure
        ActiveSupport::Notifications.unsubscribe(subscriber) if subscriber
        warn_if_insert_heavy(stats)
      end
    end

    private

    attr_reader :profile, :connection, :settings

    def apply_settings!
      settings.each do |key, value|
        connection.execute("SET LOCAL #{key} = #{connection.quote(value.to_s)}")
      end
    end

    module ConnectionMixin
      def tuned(profile, &block)
        proxy = Proxy.new(self, profile)
        return proxy unless block

        proxy.call(&block)
      end
    end

    class Proxy
      DELEGATED = %i[
      execute exec_query exec_insert exec_update exec_delete
      select_all select_one select_value select_values
    ].freeze

      def initialize(connection, profile)
        @connection = connection
        @profile = profile
      end

      def call(&block)
        SQL.execute(@profile, connection: @connection, &block)
      end

      DELEGATED.each do |m|
        define_method(m) do |*args, **kwargs, &blk|
          SQL.execute(@profile, connection: @connection) do |c|
            c.public_send(m, *args, **kwargs, &blk)
          end
        end
      end
    end
  end
end

That fancy config file:

development:
  bulk_import:
    statement_timeout: 0
    lock_timeout: 0
    synchronous_commit: "off"
    maintenance_work_mem: "16GB"
    work_mem: "512MB"
    max_parallel_maintenance_workers: 8
    max_parallel_workers_per_gather: 8
  promote:
    statement_timeout: 0
    lock_timeout: 0
    synchronous_commit: "off"
    session_replication_role: "replica"
    maintenance_work_mem: "16GB"
    work_mem: "512MB"
    max_parallel_maintenance_workers: 8
    max_parallel_workers_per_gather: 8
  create:
    statement_timeout: 0
    lock_timeout: 0
    synchronous_commit: "off"
    session_replication_role: "replica"
    maintenance_work_mem: "16GB"
    work_mem: "32GB"
    max_parallel_maintenance_workers: 8
    max_parallel_workers_per_gather: 8

production:
  bulk_import:
    statement_timeout: 0
    lock_timeout: 0
    synchronous_commit: "off"
    maintenance_work_mem: "16GB"
    work_mem: "512MB"
    max_parallel_maintenance_workers: 8
    max_parallel_workers_per_gather: 8
  promote:
    statement_timeout: 0
    lock_timeout: 0
    synchronous_commit: "off"
    session_replication_role: "replica"
    maintenance_work_mem: "16GB"
    work_mem: "512MB"
    max_parallel_maintenance_workers: 8
    max_parallel_workers_per_gather: 8
  create:
    statement_timeout: 0
    lock_timeout: 0
    synchronous_commit: "off"
    session_replication_role: "replica"
    maintenance_work_mem: "16GB"
    work_mem: "32GB"
    max_parallel_maintenance_workers: 8
    max_parallel_workers_per_gather: 8

Before we get into things

A lot of this becomes irrelevant if your scale is not scale.

MVCC, compressed

Every row has two hidden columns: xmin (the transaction that wrote it) and xmax (the transaction that deleted or superseded it). A transaction reads a row when xmin is committed and visible to its snapshot, and xmax either isn’t set or isn’t visible. That’s how postgres avoids locking readers against writers—they’re looking at different versions of the same rows, decided by visibility rules instead of locks.

UPDATE doesn’t modify a row in place. It writes a new tuple, sets the old one’s xmax, and leaves the old version on the page until vacuum proves nobody can still see it. DELETE does even less—it sets xmax and walks away. The space stays allocated. Vacuum is the garbage collector. Without it, the heap grows monotonically and the indexes follow.

That’s the model. The interesting question is what happens when vacuum can’t or won’t do its job.

The xmin horizon and cluster-wide vacuum suckage

Vacuum can only reclaim a dead tuple if it’s invisible to every active snapshot in the database. Not just transactions touching this table—every transaction in the cluster. Postgres tracks the oldest xmin still in use as the xmin horizon, and refuses to remove any tuple newer than that horizon.

So: one transaction starts at 9am, holds the snapshot, and goes idle. At 9:01, a thousand UPDATEs land on users. At 9:02, ten thousand UPDATEs land on events. Vacuum runs all afternoon, looks at every dead tuple in every table, and skips them. They’re newer than the 9am horizon.

Every table bloats. Indexes bloat worse. Replicas slow because they’re replaying the WAL backlog of those updates and can’t trim WAL until the primary tells them it’s safe. Disk fills. Plans degrade. The query that was 8ms last week is 400ms now and the indexes look fine in \d.

The cause is usually one of:

  • A Rails console someone left open with a transaction do and a binding.pry on the stack
  • A pg_dump running unexpectedly long against the primary
  • A worker that began a transaction, raised before commit, and ended up in a connection state where the rollback never landed
  • A long analytics query someone runs from a notebook against production

The diagnostic—the one query you should be able to type from memory in an outage:

SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY duration DESC;

If anything shows up with a duration measured in hours, that connection is your xmin horizon, and your bloat is its fault. The fix is pg_terminate_backend(pid). The follow-up is idle_in_transaction_session_timeout set to something sane—five or ten minutes—so the database protects itself when you forget.

This is the most underused safety net in postgres. The default is zero. Zero means “let one developer’s rails console eat the cluster.”

WAL, checkpoints, and the bulk-import storm

Every write goes to the write-ahead log first. The actual data pages get dirtied in shared buffers and written to disk later, at checkpoint time. WAL is sequential and fast. Random heap writes are not. That’s the trick that lets postgres survive a crash and also keeps writes cheap.

Checkpoints flush all dirty buffers to disk so WAL up to that point can be discarded. They run on a schedule (checkpoint_timeout) or when WAL volume hits max_wal_size, whichever comes first. The schedule is gentle. The volume trigger is not.

Here’s the failure mode: You start a bulk import; a worker is COPYing into staging tables and you’re generating WAL faster than the schedule can absorb. You hit max_wal_size halfway between scheduled checkpoints, postgres triggers a forced checkpoint, IO saturates, and every query holding a buffer pin waits its turn behind the flush. Latency on totally unrelated requests spikes for ten or twenty seconds.

Pagers scream.

The fix is configuration, not architecture:

  • max_wal_size big enough that your peak write rate doesn’t hit it between scheduled checkpoints. I use 48GB. That’s not a typo—modern disks are cheap, and the alternative is checkpoint storms.
  • checkpoint_timeout long enough that the schedule does the work. I use 30 minutes.
  • checkpoint_completion_target = 0.9 so postgres spreads the flush across 90% of the interval instead of bunching it.

You’ll see the symptom in pg_stat_bgwriter: checkpoints_req (forced) climbing while checkpoints_timed (scheduled) stays flat. If checkpoints_req is more than a small fraction of checkpoints_timed, your max_wal_size is too small.

The thing nobody tells you: bigger max_wal_size is also disk space. 64GB of WAL means 64GB of disk reserved for WAL. Budget accordingly.

The lock graph that actually matters

Postgres has eight lock modes. You don’t need to memorize the table. You need to know:

  • AccessExclusive blocks everything, including reads. This is what ALTER TABLE, DROP TABLE, and most non-CONCURRENTLY index operations take. If you take it on a hot table during peak traffic, your app stops. You may have done this.
  • ShareUpdateExclusive is what VACUUM, ANALYZE, and CREATE INDEX CONCURRENTLY take. It conflicts with itself. You cannot run two VACUUM on the same table at once. Autovacuum can be blocked by your manual VACUUM, and vice versa.
  • RowExclusive is what UPDATE, DELETE, INSERT take. It conflicts with locks meant for schema changes, but not with itself. That’s why your app keeps writing while another connection is also writing.

That’s most of what matters in production. The other modes exist mainly because schema changes need to coordinate with reads in specific ways. The two consequential rules: AccessExclusive stops the world, and ShareUpdateExclusive serializes maintenance with itself.

The Rails-shaped trap: a migration that runs add_column with a default value on a table with millions of rows, in a transaction. AccessExclusive for the duration of the rewrite. Every read on the table queues. PG11+ made the constant-default case cheap—adding a column with a literal default no longer rewrites the table — but ALTER COLUMN TYPE, adding NOT NULL the wrong way, and indexes without CONCURRENTLY all still rewrite.

The blocker tree

When the app feels stuck—requests piling up, response times climbing, no obvious culprit in the application logs—the question is “who’s blocking who.” Postgres ships the answer.

SELECT
  blocked.pid AS blocked_pid,
  blocked.query AS blocked_query,
  blocking.pid AS blocking_pid,
  blocking.query AS blocking_query,
  now() - blocking.xact_start AS blocking_duration
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';

pg_blocking_pids returns the array of pids that are blocking a given pid. The join expands the graph. In practice the output is short—usually one or two blockers at the root, with a fan of blocked queries waiting on them. The root is what you kill.

I keep this in a snippet file. So should you. The day you need it, you don’t have time to look it up.

What you do with the result is the harder question. pg_cancel_backend(pid) is polite—it asks the query to stop. pg_terminate_backend(pid) is not—it kills the connection and rolls the transaction back. Polite first, terminate second. If the polite kill doesn’t take inside a few seconds, the query is in an uninterruptible kernel call (usually IO) and you need the hammer.

TOAST and the silently-slow big-column problem

Postgres pages are 8KB. A row larger than the page can’t fit. TOAST—The Oversized Attribute Storage Technique (postgres devs have a sense of humor)—handles this by compressing large values, breaking them into chunks, and storing the chunks in a side table. The main row gets a pointer.

The compression and chunking are transparent. The performance characteristics are not.

A text or jsonb column that grows from kilobytes to megabytes—say, a vendor response payload you started caching—quietly turns every SELECT against the row into an extra round of TOAST chunk fetches. The query plan looks identical. EXPLAIN shows the same Index Scan. The actual IO doubles or triples.

The symptoms:

  • Bitmap Heap Scan reading way more buffers than rows
  • A query that was fast last quarter is slow now and nothing about it changed
  • Disk read volume on the table is wildly disproportionate to row count

The fixes, in order of effort:

  • ALTER TABLE x ALTER COLUMN big_jsonb SET STORAGE EXTERNAL to skip compression if it isn’t helping
  • ALTER TABLE x ALTER COLUMN big_jsonb SET COMPRESSION lz4 (PG14+, I think?) for faster decompression
  • Stop selecting the column when you don’t need it. If your AR model has a payload column you only need on detail pages, set self.ignored_columns = %w[payload] on a slim subclass for list queries. Or use select(:id, :name, ...) everywhere it matters.
  • Move the column to a side table you join in only when needed. Sounds redundant—postgres already moved it—but the catalog lookup, planner cost, and chunked decompression are real overhead avoided when the column isn’t selected at all.

The deeper lesson: a jsonb column is not free, and the cost is hidden under a layer that doesn’t show up in EXPLAIN until you read it carefully.

Transaction ID wraparound

Every transaction gets a 32-bit ID. Postgres needs xmin to determine row visibility. 32 bits gives you about 4 billion transactions, after which the counter wraps. To keep visibility correct across the wrap, postgres “freezes” old rows—marks them as definitively visible to everyone—so their original xmin doesn’t matter anymore. Vacuum does the freezing.

If autovacuum can’t keep up—usually because it’s blocked, throttled, or starving—the database approaches wraparound. At a configurable threshold postgres starts running aggressive anti-wraparound vacuums. If those can’t keep up, postgres shuts down the database to prevent corruption. This has happened to enough big companies to be a meme.

The query you should run on a schedule:

SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY 2 DESC;

Default autovacuum_freeze_max_age is 200 million. A reasonable alarm is 500 million. Panic at 1.5 billion. If you’re seeing 1.5B and rising, something has been blocking autovacuum for a long time—usually that idle-in-transaction connection from earlier, or a manual VACUUM holding ShareUpdateExclusive on a huge table, or autovacuum workers maxed out on something else.

This is one of those operational concerns Rails developers never see until they own a database. Then they see it once and never forget.

The catalog as data

For whatever reason we don’t believe that the postgres catalog is not metadata. It’s tables. Real tables. With rows. You can query them. It’s like the postgres devs are smart.

pg_class is one row per table, index, sequence, or other relation. pg_attribute is one row per column, across every relation. pg_index is one row per index with the column references and the predicate for partials. pg_proc is one row per function and stored procedure.

These aren’t an API surface that postgres “exposes”—they’re how postgres stores its own bookkeeping, and you can SELECT from them like anything else. You can JOIN them. You can build views over them. You can point an ActiveRecord model at them.

A pattern from a codebase I work in:

module StoredProcedures
  class BaseStoredProcedure < ApplicationRecord
    self.table_name = "pg_proc"
    self.primary_key = "oid"
  end
end

StoredProcedures::BaseStoredProcedure.where("proname LIKE 'tasker_%'") returns every stored procedure prefixed with tasker_ in the catalog. AR doesn’t care that the table happens to be the function catalog—it’s a table with columns, AR is happy.

(I use this to manage migrations for stored procedures, which I’m usually against but at the scale of this application, I love saving 10ms across inserts because I have a couple million to do a day.)

What this enables, once you accept the framing:

  • A migration that asserts a stored procedure exists with the right signature, written as an AR query
  • A test that fails when a function definition drifts from what your code expects
  • A diagnostic page in your admin tool that lists every function with its source, fed by pg_get_functiondef(oid)
  • A check-on-boot that loads function definitions from disk and compares them to the catalog, surfacing manual edits

The framing scales further. A StagingTable model in the same codebase points at pg_stat_user_tables for live row counts and last-vacuum timestamps. It’s an AR model whose underlying table is a postgres-maintained statistics view. The implementation is two lines. The capability is “show me the freshness of every staging table on the dashboard, with sorting and search, for free.”

Maybe I need to create some sort of gem that exposes these things.

Beginners don’t realize you can do this. Many seniors haven’t tried. It costs almost nothing to start, and it changes how you think about the database—the catalog isn’t a black box you query through \d or DBeaver. It’s data. Your application can read it and reason about it like anything else.

What this buys you

The compressed version:

  • One idle transaction can bloat every table in the cluster. Set idle_in_transaction_session_timeout. Run the pg_stat_activity query when something feels off
  • Bulk imports cause checkpoint storms. Bigger max_wal_size, longer checkpoint_timeout, completion target 0.9
  • AccessExclusive stops the world. Most migrations should not take it on hot tables during peak
  • The blocker tree CTE is your “the app feels stuck” diagnostic. Memorize it or keep it in a snippet
  • Big jsonb columns turn into TOAST overhead that doesn’t show in EXPLAIN. Stop selecting them when you don’t need them
  • Wraparound is real. Monitor age(datfrozenxid). Default thresholds will not save you if autovacuum is blocked
  • The catalog is data

None of this is exotic. It’s the operational side of postgres that the docs cover but the Rails ecosystem doesn’t talk about. Senior devs may have heard the words, but knowing what they cost in production is the difference.

Stay in the loop

Occasional essays on design, tools, and the craft of building things. No spam, unsubscribe anytime.

Ambient weather

The background of this site reflects the current weather and time of day in Saint Paul. The orbs shift in color and behavior based on what's happening outside my window.

Learn more about how this works