Preamble
After you ANALYZE a PostgreSQL table to collect value distribution statistics, you will find the gathered statistics for each column in the pg_stats system view. This article will explain the meaning of the correlation column and its impact on index scans.
Physical vs. logical ordering
Most common PostgreSQL data types have an ordering: they support the operators <, <=, =, >= and >.
These data types can be used with the “standard” index type, the B-tree index.
The values in a column of this kind tell you how the table rows should be put in order. An index on this column will be sorted according to that ordering.
A PostgreSQL table is made up of one or more 8KB block files. The physical ordering refers to the arrangement of the rows as they are stored in the file.
You can examine the physical ordering of the rows by selecting the ctid system column: it contains the block number and the item number inside the block, which describe the physical location of the table row.
The correlation for a column is a value between -1 and 1. It reveals how well logical and physical ordering match each other.
- The rows are stored in the table file in ascending column order if the correlation is 1; otherwise, they are stored in descending order.
- Less ideal matching is denoted by values between -1 and 1.
- There is no relationship between the physical and logical orders when the value is 0.
Why should I care?
On your tables, you’ll create indexes for quicker access.
The effectiveness of an index scan is impacted by a column’s correlation.
A full index or a portion of it is read in index sequential order during an index scan. The table’s corresponding row is retrieved for each entry that is located (this is skipped in an “index only scan,” but that is another story).
The rows that are fetched will come from all over the table if the correlation of the indexed column is close to zero. This will lead to a large number of randomly distributed reads of various table blocks.
The next row fetched during the index scan, however, tends to be in the same or the following table block if the correlation is close to 1 or -1.
High correlation offers two benefits:
- If many of the table rows fetched during the index scan are contained in the same table block, only a small number of blocks need to be read from storage because blocks read by the database are cached in shared memory.
- Due to the proximity of the blocks that need to be read from storage, sequential I/O—which is much faster than random I/O on spinning disks—is enabled.
An example
Let’s make two identical tables with different correlations, but with the same content:
CREATE TABLE corr (id, val) AS SELECT i, 'some text ' || i FROM generate_series(1, 100000) AS i; CREATE INDEX corr_idx ON corr (id); VACUUM (ANALYZE) corr; SELECT correlation FROM pg_stats WHERE tablename = 'corr' AND attname = 'id'; correlation ------------- 1 (1 row) CREATE TABLE uncorr AS SELECT * FROM corr ORDER BY random(); CREATE INDEX uncorr_idx ON uncorr (id); VACUUM (ANALYZE) uncorr; SELECT correlation FROM pg_stats WHERE tablename = 'uncorr' AND attname = 'id'; correlation ------------- -0.00522369 (1 row)
In order to compare index scans on the two tables, we disable bitmap index scans.
Then we assess how index scans function:
SET enable_bitmapscan = off; EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM corr WHERE id BETWEEN 1001 AND 1300; QUERY PLAN --------------------------------------------------- Index Scan using corr_idx on corr (cost=0.29..15.23 rows=297 width=19) (actual time=0.108..0.732 rows=300 loops=1) Index Cond: ((id >= 1001) AND (id <= 1300)) Buffers: shared hit=6 Planning time: 0.456 ms Execution time: 1.049 ms (5 rows) EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM uncorr WHERE id BETWEEN 1001 AND 1300; QUERY PLAN --------------------------------------------------- Index Scan using uncorr_idx on uncorr (cost=0.29..978.15 rows=298 width=19) (actual time=0.105..2.352 rows=300 loops=1) Index Cond: ((id >= 1001) AND (id <= 1300)) Buffers: shared hit=303 Planning time: 0.548 ms Execution time: 2.736 ms (5 rows)
Since all blocks were already in shared buffers, 2.7 milliseconds is actually not that bad.
The 303 blocks from the second query will perform much worse than the 6 from the first if some of these blocks need to be read from disk!
Each result row for the second query was located in a different table block. This resulted in the touching of 300 blocks. Index blocks make up the final three blocks.
Only three table blocks are touched by the first query:
SELECT ctid, id FROM corr WHERE id BETWEEN 1001 AND 1300; ctid | id ---------+------ (6,58) | 1001 (6,59) | 1002 (6,60) | 1003 (6,61) | 1004 (6,62) | 1005 (6,63) | 1006 (6,64) | 1007 ... (8,37) | 1294 (8,38) | 1295 (8,39) | 1296 (8,40) | 1297 (8,41) | 1298 (8,42) | 1299 (8,43) | 1300 (300 rows)
In fact, blocks 6, 7 and 8 of the table contain every row!
Correlation and the optimizer
The PostgreSQL optimizer figures out how much it will cost to run an SQL statement in each way.
Better cost estimates for an index scan can be obtained by using the correlation, which will help you make better plan decisions.
If the correlation is close to or equal to 1, the PostgreSQL optimizer will favor index scans.
Correlation and BRIN indexes
The BRIN index (block range index) was first made available in PostgreSQL 9.5.
This index functions by keeping track of the minimum and maximum values across all ranges of table block values. Only columns with perfect correlation can benefit from it. Its size advantage over the B-tree index makes it an intriguing alternative for large tables.
How to make use of correlation?
Keep the table in index order if you need to efficiently scan larger sections of an index.
In PostgreSQL, there are no “index ordered tables.”
Still, there are two ways to keep a column’s correlation high:
- Automatically:Good examples for that are primary key columns generated by sequences or measurements with a timestamp. If the table rows are inserted in logical column order and there are no updates or deletes on the table, the physical ordering will be identical to the logical ordering.
A BRIN index might be an interesting choice because correlation is always perfect in this scenario.Table partitioning can be used to delete outdated data from a table without affecting the physical ordering.
- Clustering: The SQL statement
CLUSTERcan be used to rewrite a table so that the physical ordering is identical to the logical ordering of an index.However, subsequent modifications of the table will reduce the correlation again. Because of that, you need to re-cluster the table regularly to maintain high correlation. This is annoying, becauseCLUSTERblocks all concurrent access to the table.
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.
Are you interested in writing for Enteros’ Blog? Please send us a pitch!
RELATED POSTS
Why Intelligent Database Workload Management Is Essential for High-Growth SaaS Platforms
- 19 June 2026
- Database Performance Management
Introduction Telecommunications providers are operating in one of the most competitive and technology-intensive industries in the world. While demand for connectivity, mobile services, broadband access, and digital experiences continues to grow, profit margins are increasingly challenged by rising infrastructure costs, complex network operations, and expanding customer expectations. Modern telecom organizations must support: 5G networks Cloud-native … Continue reading “Why Intelligent Database Workload Management Is Essential for High-Growth SaaS Platforms”
Reducing Operational Complexity with AI-Driven Database Observability and AIOps
Modern enterprises operate in increasingly complex digital environments. Applications now span hybrid cloud infrastructures, multi-cloud deployments, containerized platforms, microservices architectures, and globally distributed data systems. While this transformation enables greater scalability, agility, and innovation, it also creates significant operational challenges for IT and engineering teams. At the heart of these complex environments lies the database … Continue reading “Reducing Operational Complexity with AI-Driven Database Observability and AIOps”
How Predictive SQL Performance Analytics Accelerates Application Modernization
- 18 June 2026
- Database Performance Management
Application modernization has become a strategic priority for enterprises seeking greater agility, scalability, and competitive advantage. Organizations are increasingly transforming legacy systems into cloud-ready, data-driven, and highly scalable architectures to meet growing digital demands. Whether migrating monolithic applications to microservices, adopting cloud-native platforms, or modernizing data infrastructure, enterprises face a common challenge: maintaining database performance … Continue reading “How Predictive SQL Performance Analytics Accelerates Application Modernization”
How to Modernize BFSI Cost Management with Enteros Database Software and Cost Attribution Analytics
Introduction The Banking, Financial Services, and Insurance (BFSI) industry is undergoing rapid transformation driven by digital banking, fintech innovation, regulatory requirements, customer expectations, and growing data volumes. As organizations continue investing in cloud platforms, digital services, AI-powered applications, and advanced analytics, technology spending has become one of the largest operational expenses across the financial sector. … Continue reading “How to Modernize BFSI Cost Management with Enteros Database Software and Cost Attribution Analytics”