Reading PostgreSQL Execution Plans with EXPLAIN ANALYZE

In this post I describe how I debug query performance, and why I reach for EXPLAIN (ANALYZE) instead of plain EXPLAIN.

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;

A query is slow and you want to know why before you start changing anything. This post covers the things I check first, not the whole feature.

Why ANALYZE, and not plain EXPLAIN

EXPLAIN shows the plan the planner intends to use, along with its estimates. EXPLAIN ANALYZE actually runs the query and prints the real timings and row counts next to those estimates. The interesting problems live in the gap between estimated rows and actual rows.

The dangerous part is that EXPLAIN ANALYZE really executes the query. That is fine for a SELECT. An UPDATE, DELETE, or INSERT should be wrapped in a transaction that you can roll back:

BEGIN;
EXPLAIN ANALYZE UPDATE orders SET status = 'shipped' WHERE id = 7;
ROLLBACK;

The options I always turn on

Plain EXPLAIN ANALYZE leaves out the most useful information. Here is what I actually type:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS, FORMAT TEXT)
SELECT ...;

BUFFERS is the option I never skip. It shows buffer hits and reads, which tells me whether a node is slow because of CPU work or because it reads from disk. VERBOSE adds each node's output column list and schema-qualified table names, which helps when there are a few subqueries. SETTINGS prints any planner settings that differ from the defaults.

FORMAT JSON also exists, and it is useful when a tool reads the output. For reading it yourself, TEXT is fine. Since Postgres 13, BUFFERS works without ANALYZE, but then it only reports buffers used during planning; the real I/O numbers show up only once the query runs.
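
For the tool-facing case, the two variants mentioned above might look like this. This is a sketch that reuses the orders query from the top of the post; the second statement assumes Postgres 13 or later.

```sql
-- Machine-readable plan: one JSON document with a nested object per plan node.
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM orders WHERE customer_id = 42;

-- BUFFERS without ANALYZE (Postgres 13+): the query is NOT executed,
-- so only buffers touched during planning can be reported.
EXPLAIN (BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
```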

Reading a single node

One plan node looks like this:

Index Scan using orders_customer_id_idx on orders  (cost=0.43..8.45 rows=1 width=64) (actual time=0.024..0.031 rows=3 loops=1)
  Index Cond: (customer_id = 42)
  Buffers: shared hit=4

Those parentheses hold two sets of numbers. The first, cost=0.43..8.45 rows=1 width=64, is what the planner estimated before running anything: a startup cost, a total cost, an estimated row count, and an average row width in bytes. The costs use arbitrary units, so they only mean something when you compare one plan to another. The second set, actual time=0.024..0.031 rows=3 loops=1, is what really happened: time to the first row and time to the last row (both in milliseconds), the row count, and how many times the node ran.

Watch the loops number, because it is the one people often miss. The times and row counts are averages per loop. A node that shows actual time=0.5..0.5 rows=1 loops=10000 spent about five seconds in total, not half a millisecond. You have to multiply.
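
The multiplication is simple enough to do in psql while reading a plan. A minimal sketch, using the numbers from the example node above:

```sql
-- Total time for a node = per-loop "actual time" (the last-row value, in ms) × loops.
SELECT 0.5 * 10000 AS total_ms;   -- 5000.0 ms, i.e. about 5 seconds
```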

Scans and joins

For scans, a Seq Scan reads the whole table. That is fine on a small table, or when you need most of the rows anyway. On a big table with a selective WHERE, it is the first thing I check.

An Index Scan reads an index and fetches each matching row from the heap. This is what you want for selective predicates.

An Index Only Scan does the same but can skip the heap and answer from the index alone. It needs a covering index and recently vacuumed pages (the visibility map decides which heap pages can be skipped), so it shows up less often than you would like.

The Bitmap Index Scan / Bitmap Heap Scan pair builds a bitmap of matching rows and reads them in physical order. It is for the case in between, where the index matches too many rows for a plain Index Scan but too few to justify reading the whole table.

There are three join strategies. A Nested Loop scans the inner side once per outer row. It is cheap when the outer side really is small, and very slow when the planner only thought it was small. A Hash Join builds a hash table on one side and probes it with the other. It is the default choice for larger unsorted inputs. A Merge Join reads two sorted inputs together and is best when both sides already arrive sorted, usually from an index.
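
To see an Index Only Scan in practice, something like the following could work. It is a sketch under assumptions: the index name is hypothetical, the total column is borrowed from the join example later in this post, and INCLUDE needs Postgres 11 or later.

```sql
-- Hypothetical covering index: customer_id is the search key,
-- total is stored in the index only so the heap need not be visited.
CREATE INDEX orders_customer_id_total_idx
    ON orders (customer_id) INCLUDE (total);

-- VACUUM updates the visibility map that Index Only Scans rely on.
VACUUM (ANALYZE) orders;

-- relallvisible close to relpages means most heap pages can be skipped.
SELECT relpages, relallvisible
FROM pg_class
WHERE relname = 'orders';

-- Select only columns present in the index, then look for the
-- "Heap Fetches" line under the Index Only Scan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, total FROM orders WHERE customer_id = 42;
```

A low Heap Fetches count is the sign that the visibility map is doing its job; a high one means the scan still had to visit the heap for many rows.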
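
When I suspect the planner picked the wrong join strategy, one way to test that is to discourage it for a single transaction and compare the two plans. This is a diagnostic sketch only, not something to leave enabled; the query reuses the orders and customers tables from later in this post.

```sql
BEGIN;
-- Makes nested loops look very expensive to the planner.
-- SET LOCAL limits the change to this transaction.
SET LOCAL enable_nestloop = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'UA';

ROLLBACK;
```

If the plan without the Nested Loop is much faster, the problem is usually a bad row estimate on the outer side rather than the join strategy itself.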

Things I check, in order of priority

First, estimated vs. actual rows on every node. A 10x or 100x mismatch low in the tree usually spreads all the way up, and it is often the root cause. The planner picks a Nested Loop because it expected 5 rows on the outer side, gets 50,000, and the whole plan falls apart from there.

Second, the node where most of the time goes. "Most time" means actual time × loops, not the last-row number printed on the node. A child that runs ten thousand times is the cause far more often than its parent.

Third, Buffers. A shared read=N means N pages were not found in shared buffers and had to be read in. Postgres cannot tell whether a read was served by the OS page cache or by the disk, so "disk" is approximate. A lot of read and little hit on a query that should be hot usually means you do not have enough cache, or the rows are spread all over the heap.

Fourth and last, filters. A Filter: under a Seq Scan reporting Rows Removed by Filter: 9,999,000 means you scanned ten million rows to keep a thousand. That usually points to a missing index.
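
For the Buffers check, it can help to compare the per-query numbers with the table's long-run cache behaviour. A sketch using the standard pg_statio_user_tables view; the table names are the ones used elsewhere in this post, and the counters are cumulative since the last statistics reset.

```sql
SELECT relname,
       heap_blks_read,   -- blocks that had to be read in (OS cache or disk)
       heap_blks_hit,    -- blocks found in shared buffers
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
FROM pg_statio_user_tables
WHERE relname IN ('orders', 'customers');
```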

Putting it together

Take a slow query:

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'UA' AND o.created_at > now() - interval '7 days';

And the plan comes back like this (abbreviated):

Nested Loop  (actual time=0.412..2841.003 rows=1842 loops=1)
  ->  Seq Scan on customers c  (actual time=0.011..18.402 rows=4120 loops=1)
        Filter: (country = 'UA'::text)
        Rows Removed by Filter: 195880
  ->  Index Scan using orders_customer_id_idx on orders o  (actual time=0.681..0.684 rows=0 loops=4120)
        Index Cond: (customer_id = c.id)
        Filter: (created_at > (now() - '7 days'::interval))
        Rows Removed by Filter: 47

Two things stand out here. The customers Seq Scan drops 195k rows to keep 4k, so an index on customers(country) is worth adding. And the inner Index Scan runs 4,120 times at about 0.68 ms each, which accounts for nearly all of the 2.8 seconds. It drops almost everything it fetches, because the index only covers customer_id and created_at is left as a filter. A composite index moves both predicates into the index condition:

CREATE INDEX ON customers (country);
CREATE INDEX ON orders (customer_id, created_at);

After that the planner might switch to a Hash Join with two Index Scans, or keep the Nested Loop with the inner side no longer doing useless heap fetches. Either way the Rows Removed by Filter numbers should drop close to zero when you run it again.
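
On a busy production table I would probably build the same indexes without blocking writes. A sketch, with hypothetical index names:

```sql
-- CONCURRENTLY avoids holding a lock that blocks writes for the whole build.
-- It cannot run inside a transaction block, and it is slower than a plain build.
CREATE INDEX CONCURRENTLY customers_country_idx
    ON customers (country);

CREATE INDEX CONCURRENTLY orders_customer_id_created_at_idx
    ON orders (customer_id, created_at);
```

If a concurrent build fails partway, it leaves an invalid index behind that has to be dropped before retrying.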

When the estimates can be wrong

Stale statistics are the usual issue. ANALYZE table_name; refreshes them. Autovacuum usually keeps up, but right after a bulk load it may not have run yet, so I run ANALYZE by hand before I trust any plan.
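
To check whether statistics are stale before blaming the planner, the pg_stat_user_tables view records when ANALYZE last ran. A minimal sketch, assuming the orders table from the earlier examples:

```sql
SELECT relname,
       last_analyze,          -- last manual ANALYZE
       last_autoanalyze,      -- last ANALYZE run by autovacuum
       n_mod_since_analyze    -- rows changed since statistics were last gathered
FROM pg_stat_user_tables
WHERE relname = 'orders';
```

A large n_mod_since_analyze relative to the table size is a hint to run ANALYZE by hand.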

Correlated columns are the other common one. The planner assumes columns are independent, so if country and city go together it badly underestimates WHERE country='UA' AND city='Kyiv'. Extended statistics tell it about the correlation:

CREATE STATISTICS addresses_country_city (dependencies)
  ON country, city FROM addresses;
ANALYZE addresses;
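
To confirm that the statistics object was populated, the pg_stats_ext view can be queried after the ANALYZE. This sketch assumes Postgres 12 or later, where that view exists:

```sql
-- dependencies holds the functional-dependency coefficients ANALYZE computed.
SELECT statistics_name, attnames, dependencies
FROM pg_stats_ext
WHERE statistics_name = 'addresses_country_city';
```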

Then there is a histogram that is too small. The default_statistics_target of 100 is fine for most columns. For a column with a very skewed distribution, raise it for that one column with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS 1000, rather than raising it globally.
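
Spelled out for one column, it could look like this. The choice of orders.customer_id is only an assumption for illustration; use whichever column is skewed in your data.

```sql
-- Raise the statistics target for a single column only.
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;

-- The new target takes effect the next time the table is analyzed.
ANALYZE orders;
```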

Once the plan gets large

Raw text is fine for a few nodes. Once a plan gets large, I paste it into explain.depesz.com. It colours the nodes whose estimates are far off, which finds the problem faster than reading the text on my own. explain.dalibo.com shows the same thing as a tree.

There are a few things to know before you trust the numbers. The per-row timing calls add real overhead, so a query that runs in 50 ms might report 80 ms; trust the shape of the plan more than the absolute times. The first run usually reads from disk and only the second reads from cache, so run it twice and compare Buffers before you decide anything about I/O. On parallel plans the Gather node shows total time while the children below it show the average per worker, which is confusing the first time you hit it.

To conclude, I suggest running EXPLAIN (ANALYZE, BUFFERS) before you change anything. Most slow queries you come across will have a bad row estimate or a missing index.
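
If the timing overhead itself is the concern, the TIMING option can be switched off. A sketch reusing the orders query from the top of the post:

```sql
-- TIMING OFF skips the per-node clock calls. Actual row counts, loops,
-- buffers, and the total execution time are still reported;
-- only the per-node "actual time" values are omitted.
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM orders WHERE customer_id = 42;
```

Comparing the total execution time with and without TIMING OFF gives a rough idea of how much the instrumentation costs on your machine.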