PostgreSQL query optimization

With nanomize we built an URL Shortener complying strictly with the German Federal Data Protection Act. One core element of nanomize is the analysis dashboard of a customer, helping to understand the user flow behind links.

While the application scales perfect with the current requirements, we want to guarantee the same performance for companies with high traffic, too. This way we learnt a lot about how to tweak a PostgreSQL database with many rows.

Understand the architecture

A user doesn't recognize much about what's going on, when she's redirected from a short URL to the destination. But nanomize stores a bunch of data on this redirect. That's the related link, the referrer url, the extracted country code from the ip address and many more. All in all each redirect causes the creation of one row in a table of nanomize, where analysis can be performed later.

Things become slow

To understand the upcoming bottlenecks, we use a simplified table of nanomize, which we call hits and fill with 100 million rows. The timestamp is distributed uniformly over one year, the link_id and the country_code are chosen randomly (#link_ids = 100 and #country_codes = 250). There are no indices yet, so we can improve performance step by step. An extract from the hits table is shown below.

id	link_id	country_code	created_at
1	1	US	2016-03-07 10:51
2	1	US	2016-03-07 10:52
3	2	UK	2016-03-07 10:53

Furthermore we take care of caching by dropping the buffer cache and the system cache before each query.

service postgresql stop
sync
echo 3 > /proc/sys/vm/drop_caches
service postgresql start

As no indices and triggers exist until now, an INSERT with 171 ms is as fast as possible. That's useful to note and compare with a later constellation again. The next query is a SELECT to count the total number of entries in the hits table.

SELECT count(*) FROM hits;
Time: 64306,54935 ms

It takes over one minute until a result returns. Counting is due to implementation reasons slow in Postgres and there is no trick to speed up execution significantly. Keep this in mind and be sure, that you really need an exact count. Often you just use a wrong implementation (e.g. infinite scrolling) or an estimation is enough. However we want to understand if and how an index could improve counting. Therefore we need to analyse the query.

Postgres execution plan

Postgres uses a planner to analyse a query and to decide what the cheapest way of a execution might be. When executing a query it's possible to output the underlying decisions of the planner with the (EXPLAIN ANALYSE) statement, that combines the estimations of the planner and the costs of the real execution. Mistreating the syntax highlighting for didactic reasons the output of the count query looks like this.

EXPLAIN ANALYSE SELECT count(*) FROM hits;</span>
Aggregate
  (cost=1886943.00..1886943.01 rows=1 width=0)
  (actual time=63654.115..63654.115 rows=1 loops=1)
  ->  Seq Scan on hits
    (cost=0.00..1636943.00 rows=100000000 width=0)
    (actual time=48.896..57394.584 rows=100000000 loops=1)
Planning time: 50.857 ms
Execution time: 63654.274 ms

We read the query from bottom to top. The query itself took 63654.274 ms, whereas the planner has needed 50.857 ms before the real execution could be started. The next parts of the execution plan are separated in blocks and labeled with an action. The first block is a Seq Scan on hits including two additional information surrounded by brackets. For the following it's enough to just understand the first two attributes.

Estimated vs. actual costs: cost=0.00..1636943.00 vs. 48.896..57394.584
Estimated vs. real number of rows output: rows=100000000 vs. rows=100000000

The estimated cost of the planner have no units, whereas the actual results are in milliseconds. The first value describes the elapsed time before the first output begins. The last value measures the costs/time until the execution runs to completion. Most often we are interested in the real execution time, but always take care, that the estimations of the planner return similar values. For the inspected block this means, that the first row appeared within 48.896 ms, but the complete execution took over 57 seconds. After fetching all rows they were counted in the Aggregation block. The aggregation needed to wait for the result of the sequence scan, otherwise it may had started before 57394.584 ms.

Now we have seen the output without an index, so we add an index with the hope to make execution magically faster. Creating a primary key in Postgres adds an index with a unique constraint.

ALTER TABLE hits ADD PRIMARY KEY (id);

Unfortunately running the same query again changes nothing. For further understandings we need to take a closer look at what a sequence scan actual is.

Sequential scan vs. index only scan

Postgres stores every table and index as an array of pages of a fixed size on disc. The easiest ways to find a row with the ID 15 is to walk through every page and look if this page includes a row with the ID 15. If the row with the ID 15 is inside the last page of the array this might be expensive. The described method is called a sequence scan.

Now let's look at another approach called an index scan: If we know, that we are searching for IDs we could store the ID and the page address in a special data structure. Postgres uses for that by default a balanced tree. Searching in a balanced three is very fast and a found tuple can be resolved to a page address, where the complete row could be read.

Furthermore it's important to know that reading many pages in a row is much faster for a hard drive than reading the same pages in a random order, as the hard disk head is moved less. On top of that the cheapest way is an index only scan, which doesn't need to load any row at all, as all required columns are contained in the index.

We go back to our hits table to understand what's said above. Just postpone the count problem and look at a simple select with a condition on the ID, the column we created already an index for. The first thing we do, is to check the different execution times of an index only scan and a sequence scan. Therefore we need to disable some query plans.

EXPLAIN ANALYSE SELECT id FROM hits WHERE id < 10000;
Index Only Scan using hits_pkey on hits
    (cost=0.57..274.59 rows=9487 width=4)
    (actual time=22.931..79.963 rows=9999 loops=1)
   Index Cond: (id < 10000)
   Heap Fetches: 0
 Planning time: 150.086 ms
 Execution time: 80.296 ms

SET enable_indexscan = OFF;
SET enable_bitmapscan = OFF;
EXPLAIN ANALYSE SELECT id FROM hits WHERE id < 10000;
 Seq Scan on hits
    (cost=0.00..2523885.70 rows=9487 width=4)
    (actual time=53.630..131530.808 rows=9999 loops=1)
   Filter: (id < 10000)
   Rows Removed by Filter: 99990001
 Planning time: 15.590 ms
 Execution time: 131531.165 ms

As you can see, there is a big difference between the index only scan (80 ms) and the sequence scan (131531 ms). Now we can compare a normal index scan with a sequential scan. To prevent Postgres using an index only scan we select more columns than the index contains.

EXPLAIN ANALYSE SELECT id, country_code FROM hits WHERE id < 10000;
Index Scan using hits_pkey on hits
  (cost=0.57..398.59 rows=9487 width=7)
  (actual time=0.011..126.566 rows=9999 loops=1)
 Index Cond: (id < 10000)
Planning time: 128.091 ms
Execution time: 126.919 ms

SET enable_indexscan = OFF;
SET enable_bitmapscan = OFF;

EXPLAIN ANALYSE SELECT id, country_code FROM hits WHERE id < 10000;
Seq Scan on hits
  (cost=0.00..2523885.70 rows=9487 width=7)
  (actual time=47.433..131746.804 rows=9999 loops=1)
 Filter: (id < 10000)
 Rows Removed by Filter: 99990001
Planning time: 128.130 ms
Execution time: 131747.157 ms

The index scan is much faster. The next step is to look at the the query by changing the condition to match all rows.

SET enable_seqscan = OFF;
EXPLAIN ANALYSE SELECT id, country_code FROM hits WHERE id > 0;
Index Scan using hits_pkey on hits
  (cost=0.57..4120665.15 rows=99999976 width=7)
  (actual time=0.010..100631.736 rows=100000000 loops=1)
 Index Cond: (id > 0)
Planning time: 127.865 ms
Execution time: 103648.355 ms

SET enable_seqscan = ON;
EXPLAIN ANALYSE SELECT id, country_code FROM hits WHERE id > 0;
Seq Scan on hits
  (cost=0.00..2523885.70 rows=99999976 width=7)
  (actual time=0.344..128026.136 rows=100000000 loops=1)
 Filter: (id > 0)
Planning time: 3.234 ms
Execution time: 131079.683 ms

An index scan is still faster, but the percentage of the difference is far smaller. The reason for that is the already mentioned background of random reads. As the planner chooses the sequence scan over the faster index scan, you need to adjust the query configuration to support the planner making the right decision. We skip this, as it is a own topic itself. Also keep in mind that you should adjust the runtime resources depending on your system to get the best performance.

Back to the count query, we encounter the same estimation problem of the planner as in the query before. The planner prefers a sequence scan instead of the much cheaper index only scan.

SET enable_seqscan = OFF;
EXPLAIN ANALYSE SELECT count(*) FROM hits;
Aggregate
  (cost=2846817.45..2846817.46 rows=1 width=0)
  (actual time=23349.313..23349.313 rows=1 loops=1)
 ->  Index Only Scan using hits_pkey on hits
   (cost=0.57..2596811.61 rows=100002336 width=0)
   (actual time=60.504..18335.820 rows=100000000 loops=1)
       Heap Fetches: 0
Planning time: 142.866 ms
Execution time: 23349.428 ms

The index only scan is with 23349 ms more than twice faster than the sequence scan (63654 ms). But this value is actually a little misleading, as we concealed the heap fetches an index scan needs to made. The number of heap fetches depends on the visibility map, discussed in the next section.

Visibility Map (VM)

Postgres needs to check the rows in an index that were modified by other transactions at the moment the query runs. Therefore it uses a compact visibility map (VM), in which DML operations mark pages before execution as invisible for others. That way an index knows for each page if it needs to visit the page on disk to check the included rows or not. An index only scan is only that fast, if it needs to visit few page tuples, which are called heap fetches. Postgres uses a daemon called autovacuum, which runs periodically in the background and updating the visibility map of a table. Also you can run a VACUUM manually. The planner now can decide to use an index only scan over a sequence scan, as there is no need of many heap fetches.

For our previous count we see that the number of heap fetches is 0, that means the hits table is fully vacuumed. We run a pseudo-manipulating query on each row and check the execution time of the index only scan on ID again.

UPDATE hits SET country_code = country_code;
SET enable_seqscan = OFF;
EXPLAIN ANALYSE SELECT count(*) FROM hits;
Aggregate
   (cost=5044201.45..5044201.46 rows=1 width=0)
   (actual time=602128.776..602128.776 rows=1 loops=1)
 ->  Index Only Scan using hits_pkey on hits
   (cost=0.57..4794195.61 rows=100002336 width=0)
   (actual time=154.424..595217.106 rows=100000000 loops=1)
       Heap Fetches: 199999944
Planning time: 180.733 ms
Execution time: 602129.478 ms

With 602129 ms it's nearly ten times slower than the sequence scan. In conclusion we can say that the execution time of a count with an index depends on the correct estimation of the planner and the number of visible rows in the VM. Let's stop at this point with the count query and move on with bitmap index scans and multicolumn indices.

Bitmap Index Scan and Bitmap Heap Scan

Often you see a bitmap index scan followed by a bitmap heap scan. The output looks like the following.

SET enable_indexscan = OFF;
EXPLAIN ANALYSE SELECT id, country_code FROM hits WHERE id < 10000;
Bitmap Heap Scan on hits
   (cost=425.48..39329.72 rows=10440 width=7)
   (actual time=147.553..908.460 rows=9999 loops=1)
 Recheck Cond: (id < 10000)
 Heap Blocks: exact=66
 ->  Bitmap Index Scan on hits_pkey
   (cost=0.00..422.87 rows=10440 width=0)
   (actual time=81.640..81.640 rows=9999 loops=1)
       Index Cond: (id < 10000)
Planning time: 216.926 ms
Execution time: 909.083 ms

The index doesn't contain all desired columns, therefore they need to be loaded afterwards. While an index scan performs random reads, the bitmap heap scan read the pages in a sequential order. For this a bitmap in physically order is needed. That's what a bitmap index scan returns. It checks a condition on every tuple in an index and saves the result as a compressed bitmap. That way bitmaps from different indices can be combined before loading the corresponding rows from disk.

In the example we can speed up the query by adding an multicolumn index.

CREATE INDEX ON hits(id, country_code);
VACUUM hits;
EXPLAIN ANALYSE SELECT id, country_code FROM hits WHERE id < 10000;
 Index Only Scan using hits_id_country_code_idx on hits  (cost=0.57..229.53 rows=7141 width=7) (actual time=38.125..110.524 rows=9999 loops=1)
   Index Cond: (id < 10000)
   Heap Fetches: 0
 Planning time: 346.084 ms
 Execution time: 110.878 ms

Note that we VACUUM the table again, otherwise the planner won't choose the index only scan. Finally we INSERT a row again and see that with two indices the execution time increased from 171 ms to 332 ms. Many queries in nanomize were complexer to optimize, but the simplified queries needed the same knowledge and strategy.

Emanuel Denzel Mar 18, 2016