PostgreSQL Scales Further Than Most People Think
How to get more out of a single PostgreSQL instance before you move to a distributed database or a rewrite.
This post is about my production and non-production experience on scaling postgres for real and synthetic highload. Most teams which develop highload projects still have an extra room to move forward with postgres. And the possible issues are almost always caused by a missing index or a connection count nobody is tracking. So before we move to something new, we always need to pull out all the capabilities of the database we already have. The main areas of improvements are Connection pooling, queries inspection, indexes inspection, partitioning and replication. All these will be covered in this post.
Connection pooling
PostgreSQL starts a new process for each connection. That's fine when we have a few connections but it breaks if we have hundreds or thousands of connections.
PgBouncer is a good solution for this. It sits in front of Postgres and shares connections. The app can open many connections and Postgres will only see a few.
; pgbouncer.ini
[databases]
myapp = host=127.0.0.1 port=5432 dbname=myapp
[pgbouncer]
listen_port = 6432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25The setting that matters is pool_mode = transaction. Connections go back to the pool after each transaction instead of after each session, which is much more efficient. You lose prepared statements across transactions so to keep them, use protocol_native mode in newer PgBouncer versions, or switch to pgcat. On Kubernetes with a connection pool in each pod, you can still end up with thousands of connections even with app-level pooling. To handle this we need to put PgBouncer between the pods and Postgres and not inside each pod. Because "inside pod" setup still can use up the connection limit during a traffic spike.
Look at the queries first
Most scaling problems are a missing index and nothing more.
-- Find the slowest queries
SELECT
calls,
round(total_exec_time::numeric, 2) AS total_ms,
round(mean_exec_time::numeric, 2) AS avg_ms,
query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;Make sure "pg_stat_statements" is enabled, it is the first extension which we need to turn on for any performance improvement work.
When you know what is slow, check whether the indexes exist. I will write a detailed post on EXPLAIN later.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42 AND status = 'pending';If that shows a sequential scan on a large table:
CREATE INDEX CONCURRENTLY idx_orders_customer_status
ON orders (customer_id, status);"CONCURRENTLY" avoids locking the table while the index is built. If most queries only need active records, index just those:
CREATE INDEX idx_orders_active ON orders (customer_id)
WHERE status != 'archived';Such index is smaller and the lookups are faster.
Splitting big tables into partitions
Once a table grows past tens of millions of rows, even good indexes start to slow down and in this case we can use declarative partitioning. It has been available since postgres v10 and stable since postgres v12:
CREATE TABLE events (
id bigint GENERATED ALWAYS AS IDENTITY,
created_at timestamptz NOT NULL,
event_type text NOT NULL,
payload jsonb
) PARTITION BY RANGE (created_at);
CREATE TABLE events_2026_01 PARTITION OF events
FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE events_2026_02 PARTITION OF events
FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
CREATE TABLE events_2026_03 PARTITION OF events
FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');The query planner automatically skips partitions it does not need. We can automate partition creation with pg_partman or a cron job:
#!/bin/bash
# Run monthly to create next month's partition
NEXT_MONTH=$(date -d "+1 month" +%Y-%m-01)
MONTH_AFTER=$(date -d "+2 months" +%Y-%m-01)
TABLE="events_$(date -d '+1 month' +%Y_%m)"
psql -d myapp -c "
CREATE TABLE IF NOT EXISTS $TABLE PARTITION OF events
FOR VALUES FROM ('$NEXT_MONTH') TO ('$MONTH_AFTER');
"One other profit of partitioning is that you don't need DELETE for old data. Deleting millions of rows is slow. And with partitioning you can do that by dropping the old partition:
DROP TABLE events_2025_01;It will work very fast.
Read replicas
Most applications read much more than they write, so we can utilize the streaming replication which is a reliable feature. One primary node handles writes, one or more replicas handle reads, and the app sends each query to the primary node or replica depending on the operation.
import psycopg2
write_conn = psycopg2.connect("host=primary dbname=myapp")
read_conn = psycopg2.connect("host=replica1 dbname=myapp")
def get_orders(customer_id):
# Reads go to replica
with read_conn.cursor() as cur:
cur.execute("SELECT * FROM orders WHERE customer_id = %s", (customer_id,))
return cur.fetchall()
def place_order(customer_id, items):
# Writes go to primary
with write_conn.cursor() as cur:
cur.execute(
"INSERT INTO orders (customer_id, items) VALUES (%s, %s)",
(customer_id, items)
)
write_conn.commit()The downside is a replication lag. A write to the primary takes a few milliseconds to show up on the replicas, so a user can save something and then not see it on the next page load. Usually the right move is to just accept the lag. For many usecases a few milliseconds lag does not matter. But when it does matter, then send reads to the primary for a short window right after a write. In this case the user will always see his own changes. There is a synchronous replication which kills the lag entirely but it adds latency to every write operation. So it should be applied to tables which really cannot show stale data.
Large values and TOAST
Store large JSON blobs or text and Postgres compresses them and moves them out-of-line using TOAST. It mostly happens on its own, and the nice side effect is that wide rows don't slow down scans. This is painful for SELECT * on tables with large JSONB columns, because you decompress data you may never read. The fix is simple - you should always list the columns you need:
-- Avoid this
SELECT * FROM events WHERE created_at > now() - interval '1 hour';
-- Prefer this
SELECT id, event_type, created_at FROM events WHERE created_at > now() - interval '1 hour';Precomputing expensive queries with materialized views
Some queries are just expensive, for example monthly report with complex aggregations across millions of rows. In such cases we need to compute it once and read the result.
CREATE MATERIALIZED VIEW monthly_revenue AS
SELECT
date_trunc('month', created_at) AS month,
sum(amount) AS total_revenue,
count(*) AS order_count
FROM orders
WHERE status = 'completed'
GROUP BY 1;
-- Refresh periodically
REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_revenue;"CONCURRENTLY" keeps the view readable while it refreshes. For this to work, you need a unique index on the view:
CREATE UNIQUE INDEX idx_monthly_revenue_month ON monthly_revenue (month);Now few words about dead rows. Dead rows are being collected as you update and delete. Autovacuum can be used to clear them, but the defaults are very conservative. On write-heavy tables you need it to run more often:
ALTER TABLE events SET (
autovacuum_vacuum_scale_factor = 0.01, -- default 0.2
autovacuum_analyze_scale_factor = 0.005, -- default 0.1
autovacuum_vacuum_cost_delay = 2 -- default 2ms (was 20ms in older versions)
);On a large table, the default "vacuum when 20% of rows are dead" means millions of dead rows can build up before vacuum even starts. Lowering the scale factor fixes this. Also you can monitor dead tuples:
SELECT
relname,
n_dead_tup,
n_live_tup,
round(n_dead_tup::numeric / greatest(n_live_tup, 1) * 100, 2) AS dead_pct,
last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;What I would not suggest to use
- Application-level sharding which adds a lot of complexity. Cross-shard queries and rebalancing become your problem, and so does every migration from then on. Unless you are at a scale that genuinely needs it, it is too early.
- Switching to nosql databases for performance. It is a total disaster to rebuild the joins and constraints for big and complex project. Postgres with the right indexes is fast enough for almost all workloads.
Regarding Citus and distributed Postgres, I can not tell you much. Maybe I will look at them precisely a bit later. But in case when you need horizontal write scaling they may be the only options for you. They also change how you model data, since everything needs a distribution key, and some queries get slow or stop working entirely. So I would consider moving to these to be a complex and dangerous task.