TABLE BLOAT ESTIMATION IN POSTGRESQL

Preamble

Recently, there have been a few posts on Planet Postgresql about “bloat” that are both extremely detailed and highly technical. Bloat is a necessary side effect of Postgres’s excellent multi-version concurrency control (MVCC) model, which makes new copies of rows when data is changed in a situation where multiple users are accessing the database at the same time. Your tables (and indexes) will eventually use more disk space than they should, especially if you run busy multi-user applications where “autovacuum” occasionally cannot keep up with the rate of data changes.

But ok, let’s skip over the more in-depth bloat and bloat fighting details this time and focus on how to actually quickly determine if you do in fact have a “bloat problem” on one of your tables (VACUUM FULL, CLUSTER, pg_repack).

Estimation approaches

There are 2 most common approaches here: to use a Postgres extension called “pgstattuple” (part of the official contrib modules), which includes a function named pgstattuple_approx() available since Postgres version 9.5, or to rely on a community-developed pure SQL query where no setup is needed and which works also for older versions. You can find community scripts on the Postgresql Wiki bloat page. There are a few scripts that differ slightly. Both methods of estimating are very cheap and involve making some educated guesses based on statistics. The “pgstattuple” extension also has a pgstattuple() method, which we will include in our comparisons to get accurate information on bloat, but this method may be expensive because it does a full table scan.

So let’s move on and try to use our different estimation methods to show how bloat happens and how it can be found.

Generating bloat artificially

The first important question is: How can we make bloat happen on purpose for our tests? Evidently, it’s not a simple matter.
We’ll use the handy command-line performance testing tool “pgbench” that comes with standard Postgres installation to generate test data (UPDATE statements). The tricky part is that we need to make sure that old versions of updated rows aren’t cleaned up and marked for re-use during our test; otherwise, we won’t get the desired bloat. To make the transaction truly active, I believe the simplest solution is to start a new parallel transaction in REPEATABLE READ mode and then run a data reading query in that session. You could also turn off autovacuum in postgresql.conf or on a per-table basis with “ALTER TABLE SET (autovacuum_enabled = false)”. This makes sure that all visible rows will stay until our “repeatable” transaction is finished. This approach would also prevent any potential instances of HOT pruning (on-the-fly page level mini-vacuuming) for updated fields without indexes.

First, we’ll set up the test schema with a scaling factor of 10, which means that 1 million bank accounts will be initialized in our chosen test table called “pgbench_accounts,” and then 1 million account updates will be run, which should double the table size and create a bloat ratio of 50%.

Running the test

 krl@r840:~$ pgbench -i -s10 pgbench creating tables: "10000 of 100000 tuples (10%) done (elapsed 0.04 s, remaining 0.32 s)"; "100000 of 100000 tuples (100%) done (elapsed 0.44 s, remaining 0.00 s)"; "Set primary keys"; "Done" pgbench=# \dt+ List of relations Schema | Name | Type | Owner | Size | Description --------+------------------+-------+--------+---------+------------- public | pgbench_accounts | table | krl | 128 MB | public | pgbench_branches | table | krl | 40 kB | public | pgbench_history | table | krl | 0 bytes | public | pgbench_tellers | table | krl | 40 kB | (4 rows)

Let’s run our “vacuum stopper” query in a separate session now, then begin the data update procedure.

 krl@pgbench=# begin transaction isolation level repeatable read ; BEGIN krl@pgbench=#* select * from pgbench_accounts limit 1; aid │ bid │ abalance │ filler ─────┼─────┼──────────┼────────────────────────────────────────────────────────────────────────────────────── 3 │ 1 │ 0 │

Let’s now begin our test for bloat generation. To achieve the best possible performance, I’m using the “prepared transaction” protocol here. Take note that the “-transactions” flag is per client.

 krl@r840:~$ pgbench --no-vacuum --client=2 --jobs=2 --transactions=500000 --protocol=prepared pgbench transaction type: TPC-B (sort of) scaling factor: 10; query mode: prepared number of clients: 2 number of threads; 2 number of transactions per client: 500000 Number of transactions actually processed: 1000000/1000000 Latency average: 0.000 ms tps = 1903.859294 (including connections establishing) tps = 1904.357546 (excluding connections establishing) krl@pgbench=# \dt+ List of relations Schema │ Name │ Type │ Owner │ Size │ Description ────────┼──────────────────┼───────┼───────┼────────┼───────────── public │ pgbench_accounts │ table │ krl │ 256 MB │ public │ pgbench_branches │ table │ krl │ 35 MB │ public │ pgbench_history │ table │ krl │ 50 MB │ public │ pgbench_tellers │ table │ krl │ 42 MB │ (4 rows)

Completed…and it appears that the size of our test table, “pgbench_accounts,” has indeed doubled.

Installing the “pgstattuple” extension will allow us to attempt to obtain some estimates.
We also need to end our parallel session and run VACUUM ANALYZE to update the internal statistics and get rid of old versions of rows, which will help us make educated guesses.

 krl@pgbench=#* rollback ; ROLLBACK krl@pgbench=# VACUUM ANALYZE pgbench_accounts; VACUUM krl@pgbench=# CREATE EXTENSION pgstattuple ; CREATE EXTENSION

Let’s now examine the estimate figures. In order to cut down on noise, I’ve only chosen the most important columns here and also added our table name to the WHERE clause of the community query.
Note: In the case of pgstattuple_approx(), a scanned_percent value of 0 means that we do have an estimated value. This means that this function can still do table scans even when things aren’t going well.

krl@pgbench=# select table_len, scanned_percent, approx_free_space, and approx_free_percent from pgstattuple_approx('pgbench_accounts'); table_len  scanned_percent  approx_free_space  approx_free_percent ───────────┼─────────────────┼───────────────────┼───────────────────── 268607488 │ 0 │ 131164800 │ 48.8314011558754 (1 row) krl@pgbench=# i /temp/bloat_investigation/table_bloat_ioguix.sql real_size  extra_size  extra_ratio ───────────┼────────────┼────────────────── 268607488 │ 136126464 │ 50.6785812315106 (1 row) pgbench=# select table_len, free_space, and free_percent from pgstattuple('pgbench_accounts'); table_len, free_space, and free_percent ───────────┼────────────┼────────────── 268607488 │ 131689396 │ 49.03 (1 row)

Making sense of the results

The actual number and the estimates are now available. We can see that the estimates from pgstattuple_approx (approx_free_space = 131164800) and pure SQL (extra_size = 136126464) are very close to the actual value of 131689396 (pgstattuple.free_space), with differences of 0.5% and 3%.
In order to wrap up, I’ll say that estimation methods for simple tables with mostly fixed data-type columns work surprisingly well and can actually be used to build a bloat removal check or alert.

About Enteros

Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.

The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.

Are you interested in writing for Enteros’ Blog? Please send us a pitch!

How Real-Time Database Observability Enables Proactive IT Operations

22 July 2026
Software Engineering

Introduction In today’s digital-first economy, enterprise success depends on applications that are always available, responsive, and secure. Whether processing financial transactions, managing patient records, supporting online learning, or powering e-commerce platforms, modern applications rely on high-performing databases to deliver seamless user experiences. As organizations accelerate cloud adoption and embrace hybrid and multi-cloud architectures, database environments … Continue reading “How Real-Time Database Observability Enables Proactive IT Operations”

Optimizing Enterprise Database Workloads with AI-Driven FinOps Strategies

Software Engineering

Introduction Cloud computing has revolutionized enterprise IT, enabling organizations to build scalable, resilient, and high-performing applications. Today, businesses across banking, healthcare, retail, manufacturing, telecommunications, education, and SaaS increasingly rely on cloud databases to support mission-critical operations. As digital transformation accelerates, enterprise workloads continue to grow in size and complexity, creating new challenges in database performance … Continue reading “Optimizing Enterprise Database Workloads with AI-Driven FinOps Strategies”

How Intelligent SQL Analytics Helps Organizations Reduce Cloud Infrastructure Costs

Software Engineering

Introduction Cloud computing has become the foundation of modern digital enterprises, enabling organizations to scale applications, support global users, and innovate faster than ever before. Businesses across industries—including banking, healthcare, retail, telecommunications, education, and SaaS—are rapidly migrating mission-critical databases to public, private, and hybrid cloud environments. However, while cloud platforms offer unprecedented flexibility and scalability, … Continue reading “How Intelligent SQL Analytics Helps Organizations Reduce Cloud Infrastructure Costs”

How to Optimize Telecommunications Infrastructure with Enteros Database Software, AI-Powered Operational Intelligence, and Cloud FinOps

Software Engineering

Introduction The telecommunications industry is at the center of the digital economy, enabling global connectivity for businesses, consumers, governments, and emerging technologies. With the rapid expansion of 5G networks, fiber broadband, IoT, edge computing, cloud services, and AI-powered customer experiences, telecommunications providers are processing unprecedented volumes of data every second. Every phone call, video stream, … Continue reading “How to Optimize Telecommunications Infrastructure with Enteros Database Software, AI-Powered Operational Intelligence, and Cloud FinOps”