I’m building an ETL system that imports billions of person records from data vendors. Each vendor sends different attributes—household income, automotive interests, education level, charitable donor status—in their own formats with their own column names. The system normalizes everything into a canonical schema and exposes a query interface for building audiences.
The obvious choice for storing hundreds of optional attributes is JSONB or EAV. Flexible, no migrations needed, handles sparse data well. I went with wide tables instead: every queryable attribute is a column on the people table.
This sounds like a 2005 decision. But Postgres handles wide tables fine—you can have up to 1,600 columns, and everything feels normal until you hit about 500 columns. The real win is developer experience: you get actual columns, ActiveRecord just works, queries are plain SQL.
Why not JSONB or EAV?
You could absolutely build validation on top of JSONB or EAV. The schema registry approach works with any storage backend. But wide tables give you a better developer experience.
With JSONB, your queries look like this:
Person.where("attributes->>'household_income' = ?", "150k_to_200k")
With wide tables:
Person.where(household_income: "150k_to_200k")
The second version works with ActiveRecord’s query interface, supports scopes, plays nice with Arel, and doesn’t require remembering JSON path syntax. Your editor can autocomplete column names. Your database can index them directly. rails console tab-completion works.
EAV has similar ergonomic problems—you’re always joining, always filtering on key/value pairs, always one abstraction away from the data.
Wide tables mean your attributes are just columns. Normal columns. The kind ActiveRecord was built for.
The query interface
I built a query interface that validates before it queries:
People.where(household_income: "150k_to_200k")
.where(individual_charitable_donor: true)
.near(latitude: 40.7128, longitude: -74.0060, miles: 25)
.count
This looks like ActiveRecord, but it’s not. It’s a query builder that checks every attribute and value against a canonical registry before generating SQL.
Try to query an attribute that doesn’t exist:
People.where(fake_attribute: "whatever")
# => People::UnknownAttributeError: unknown attribute: fake_attribute
Try to query an attribute that exists but isn’t enabled:
People.where(individual_language: "english")
# => People::AttributeNotActiveError: individual_language exists but is not active.
# Add it to CanonicalAttributes::ACTIVE to enable.
Try to query with an invalid value:
People.where(household_income: "rich")
# => People::InvalidAttributeValueError: household_income must be one of:
# under_20k, 20k_to_30k, 30k_to_40k, ... Got: "rich"
The query never runs. The error happens at the interface, with a message that tells you exactly what went wrong and what values are allowed.
The canonical registry
The validation layer works because there’s a single source of truth for what attributes exist and what values they accept. It’s a Ruby hash:
module CanonicalAttributes
ACTIVE = [
:household_income,
:individual_charitable_donor,
:individual_automotive_purchases,
# ... enabled attributes
]
ALL = {
household_income: {
type: :text,
description: "Estimated household income bracket",
values: %w[
under_20k 20k_to_30k 30k_to_40k 40k_to_50k
50k_to_60k 60k_to_75k 75k_to_100k 100k_to_125k
125k_to_150k 150k_to_200k 200k_to_250k 250k_to_500k
500k_plus
],
validates: [
{ bracket: { type: :income } }
]
},
individual_age: {
type: :integer,
description: "Estimated age of the individual (18-99)",
validates: [
{ numericality: { greater_than_or_equal_to: 18, less_than_or_equal_to: 99 } }
]
},
individual_gender: {
type: :text,
values: %w[male female],
validates: [
{ inclusion: { in: %w[male female] } }
]
},
# ... hundreds more
}
end
Sure, not sexy. I thought about:
attribute :household_income, :text, description: "Estimated household income bracket", validations: { bracket: { type: :income } }, values: %w[ ... ]
And I might go there, idk. But this feels fine for now, and there’s less runtime overhead—which is negligible until you scale.
This registry does three things.
- It defines what exists—if it’s not in
ALL, it’s not a valid attribute. - It controls what’s queryable—if it’s not in
ACTIVE, you can’t query it yet, maybe the column hasn’t been added, maybe the data isn’t populated. - It specifies valid values—the
validateskey mirrors ActiveModel validations, but runs against query values instead of record values.
The same registry drives migrations. When I add a new attribute, a generator reads the registry and creates the migration:
add_column :people, :household_income, :string
add_index :people, :household_income, where: "household_income IS NOT NULL"
Schema, validation rules, and column definitions all come from the same place. They can’t diverge.
Validation at query time
The query builder runs validation before it builds SQL:
def add_filter(key, value, type)
if @canonical_attributes.key?(key.to_s)
validate_attribute_value!(key, value)
@attribute_filters << { key: key, value: value, type: type }
elsif CanonicalAttributes::REGISTRY.key?(key.to_sym)
raise People::AttributeNotActiveError,
"#{key} exists but is not active."
else
raise People::UnknownAttributeError, "unknown attribute: #{key}"
end
end
The validation method handles different value types—arrays, ranges, hashes for comparisons:
def validate_attribute_value!(key, value)
config = CanonicalAttributes::REGISTRY[key.to_sym]
return unless config
validates = config[:validates]
return unless validates
values_to_check = case value
when Array then value
when Range then [value.begin, value.end].compact
when Hash then value.values
else [value]
end
values_to_check.each do |val|
validate_single_value!(key, val, config, validates)
end
end
This means all these query styles get validated:
# Exact match
People.where(household_income: "100k_to_125k")
# Multiple values (OR)
People.where(household_income: ["100k_to_125k", "125k_to_150k"])
# Comparisons
People.where(individual_age: { gte: 25, lte: 45 })
# Ranges
People.where(individual_age: 25..45)
# Negation
People.where.not(household_income: "under_20k")
Every value in every query form gets checked. If you pass individual_age: { gte: 150 }, you get an error before the query runs.
The result interface
The queries return wrapped results, not ActiveRecord objects:
person = People.where(household_income: "150k_to_200k").first
person.household_income # => "150k_to_200k"
person.first_name # => "Jane"
person.city # => "houston"
person.emails # => [{ "email" => "jane@example.com", ... }]
It looks like a model, but it’s a read-only projection. You can’t person.update!—this is a query interface, not an ORM. The underlying Person model exists, but you get to it explicitly with person.to_model if you need it.
(In case you were wondering: ActiveRecord::Persistence isn’t used for any of these attributes; there’s a separate pipeline for them entirely using some deep Postgres stuff.)
The class inspection shows you what’s available:
People::Result.inspect
# => "People::Result(id: integer, first_name: string, last_name: string,
# city: string, state: string, household_income: text, ...)"
Try it
The examples dropdown shows valid queries, invalid values, unknown attributes, and inactive attributes—each fails differently, before any SQL runs.
puts People.where(household_income: "150k_to_200k").first
Person Load (0.2ms) SELECT "people".* FROM "people" WHERE "people"."household_income" = '150k_to_200k' ORDER BY "people"."id" ASC LIMIT 1
#<People::Result id: 1, first_name: "Jane", last_name: "Smith", city: "houston", state: "TX", latitude: 0.297604e2, longitude: -0.953698e2, household_income: "150k_to_200k", individual_age: 34, individual_gender: "female", individual_charitable_donor: true, individual_automotive_purchases: "luxury_car", individual_education_level: "masters">
=> nilTrade-offs
Wide tables have real costs.
Every new queryable attribute requires a migration. In a system with hundreds of attributes across multiple vendors, this means more deploys. That’s fine, deploys are cheap. I’ve mitigated this with a generator that reads the registry and produces the migration, but it’s still a deploy.
Sparse data means lots of NULLs. Most people don’t have values for most attributes. Postgres handles this well—NULL storage is cheap, about 1 bit per column in the null bitmap—but it’s not as compact as JSONB for very sparse data.
Schema changes require coordination. If two vendors need new attributes at the same time, the migrations need to be sequenced. With JSONB, you’d just start writing different keys.
These are real trade-offs. I chose wide tables anyway because the developer experience wins outweigh the operational costs for my use case.
Why this works
The registry is the key. It’s not just documentation—it’s the schema, the validation rules, and the query contract in one place. The same hash that defines valid query values also drives migrations, generates API documentation, and powers the import normalizers.
When you add an attribute, you add it to the registry. A generator creates the migration. The query interface automatically validates it. The import pipeline knows how to normalize vendor data into canonical values. Nothing can diverge because everything reads from the same source.
Wide tables make the schema explicit. Your columns are your contract. ActiveRecord understands them. Your database can optimize for them. Your tests can assert on them. There’s no layer of indirection between “what attributes exist” and “what columns are in the database.”
The flexibility of JSONB is real. But so is the friction of working one abstraction away from your data. Wide tables put your attributes where Rails expects them to be.
JSONB and EAV are the right tools for plenty of problems. I considered both and decided they weren’t right for this one. I could be wrong—I often am—but a month in, the developer experience has been worth the migration overhead.