Preamble
Recently, there have been a few posts on Planet Postgresql about “bloat” that are both extremely detailed and highly technical. Bloat is a necessary side effect of Postgres’s excellent multi-version concurrency control (MVCC) model, which makes new copies of rows when data is changed in a situation where multiple users are accessing the database at the same time. Your tables (and indexes) will eventually use more disk space than they should, especially if you run busy multi-user applications where “autovacuum” occasionally cannot keep up with the rate of data changes.
But ok, let’s skip over the more in-depth bloat and bloat fighting details this time and focus on how to actually quickly determine if you do in fact have a “bloat problem” on one of your tables (VACUUM FULL, CLUSTER, pg_repack).
Estimation approaches
There are 2 most common approaches here: to use a Postgres extension called “pgstattuple” (part of the official contrib modules), which includes a function named pgstattuple_approx() available since Postgres version 9.5, or to rely on a community-developed pure SQL query where no setup is needed and which works also for older versions. You can find community scripts on the Postgresql Wiki bloat page. There are a few scripts that differ slightly. Both methods of estimating are very cheap and involve making some educated guesses based on statistics. The “pgstattuple” extension also has a pgstattuple() method, which we will include in our comparisons to get accurate information on bloat, but this method may be expensive because it does a full table scan.
So let’s move on and try to use our different estimation methods to show how bloat happens and how it can be found.
Generating bloat artificially
The first important question is: How can we make bloat happen on purpose for our tests? Evidently, it’s not a simple matter.
We’ll use the handy command-line performance testing tool “pgbench” that comes with standard Postgres installation to generate test data (UPDATE statements). The tricky part is that we need to make sure that old versions of updated rows aren’t cleaned up and marked for re-use during our test; otherwise, we won’t get the desired bloat. To make the transaction truly active, I believe the simplest solution is to start a new parallel transaction in REPEATABLE READ mode and then run a data reading query in that session. You could also turn off autovacuum in postgresql.conf or on a per-table basis with “ALTER TABLE SET (autovacuum_enabled = false)”. This makes sure that all visible rows will stay until our “repeatable” transaction is finished. This approach would also prevent any potential instances of HOT pruning (on-the-fly page level mini-vacuuming) for updated fields without indexes.
First, we’ll set up the test schema with a scaling factor of 10, which means that 1 million bank accounts will be initialized in our chosen test table called “pgbench_accounts,” and then 1 million account updates will be run, which should double the table size and create a bloat ratio of 50%.
Running the test
krl@r840:~$ pgbench -i -s10 pgbench creating tables: "10000 of 100000 tuples (10%) done (elapsed 0.04 s, remaining 0.32 s)"; "100000 of 100000 tuples (100%) done (elapsed 0.44 s, remaining 0.00 s)"; "Set primary keys"; "Done" pgbench=# \dt+ List of relations Schema | Name | Type | Owner | Size | Description --------+------------------+-------+--------+---------+------------- public | pgbench_accounts | table | krl | 128 MB | public | pgbench_branches | table | krl | 40 kB | public | pgbench_history | table | krl | 0 bytes | public | pgbench_tellers | table | krl | 40 kB | (4 rows)
Let’s run our “vacuum stopper” query in a separate session now, then begin the data update procedure.
krl@pgbench=# begin transaction isolation level repeatable read ; BEGIN krl@pgbench=#* select * from pgbench_accounts limit 1; aid │ bid │ abalance │ filler ─────┼─────┼──────────┼────────────────────────────────────────────────────────────────────────────────────── 3 │ 1 │ 0 │
Let’s now begin our test for bloat generation. To achieve the best possible performance, I’m using the “prepared transaction” protocol here. Take note that the “-transactions” flag is per client.
krl@r840:~$ pgbench --no-vacuum --client=2 --jobs=2 --transactions=500000 --protocol=prepared pgbench transaction type: TPC-B (sort of) scaling factor: 10; query mode: prepared number of clients: 2 number of threads; 2 number of transactions per client: 500000 Number of transactions actually processed: 1000000/1000000 Latency average: 0.000 ms tps = 1903.859294 (including connections establishing) tps = 1904.357546 (excluding connections establishing) krl@pgbench=# \dt+ List of relations Schema │ Name │ Type │ Owner │ Size │ Description ────────┼──────────────────┼───────┼───────┼────────┼───────────── public │ pgbench_accounts │ table │ krl │ 256 MB │ public │ pgbench_branches │ table │ krl │ 35 MB │ public │ pgbench_history │ table │ krl │ 50 MB │ public │ pgbench_tellers │ table │ krl │ 42 MB │ (4 rows)
Completed…and it appears that the size of our test table, “pgbench_accounts,” has indeed doubled.
Installing the “pgstattuple” extension will allow us to attempt to obtain some estimates.
We also need to end our parallel session and run VACUUM ANALYZE to update the internal statistics and get rid of old versions of rows, which will help us make educated guesses.
krl@pgbench=#* rollback ; ROLLBACK krl@pgbench=# VACUUM ANALYZE pgbench_accounts; VACUUM krl@pgbench=# CREATE EXTENSION pgstattuple ; CREATE EXTENSION
Let’s now examine the estimate figures. In order to cut down on noise, I’ve only chosen the most important columns here and also added our table name to the WHERE clause of the community query.
Note: In the case of pgstattuple_approx(), a scanned_percent value of 0 means that we do have an estimated value. This means that this function can still do table scans even when things aren’t going well.
krl@pgbench=# select table_len, scanned_percent, approx_free_space, and approx_free_percent from pgstattuple_approx('pgbench_accounts'); table_len scanned_percent approx_free_space approx_free_percent ───────────┼─────────────────┼───────────────────┼───────────────────── 268607488 │ 0 │ 131164800 │ 48.8314011558754 (1 row) krl@pgbench=# i /temp/bloat_investigation/table_bloat_ioguix.sql real_size extra_size extra_ratio ───────────┼────────────┼────────────────── 268607488 │ 136126464 │ 50.6785812315106 (1 row) pgbench=# select table_len, free_space, and free_percent from pgstattuple('pgbench_accounts'); table_len, free_space, and free_percent ───────────┼────────────┼────────────── 268607488 │ 131689396 │ 49.03 (1 row)
Making sense of the results
The actual number and the estimates are now available. We can see that the estimates from pgstattuple_approx (approx_free_space = 131164800) and pure SQL (extra_size = 136126464) are very close to the actual value of 131689396 (pgstattuple.free_space), with differences of 0.5% and 3%.
In order to wrap up, I’ll say that estimation methods for simple tables with mostly fixed data-type columns work surprisingly well and can actually be used to build a bloat removal check or alert.
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.
Are you interested in writing for Enteros’ Blog? Please send us a pitch!
RELATED POSTS
How to Reduce Operational Complexity with Enteros Database Optimization and Cloud Financial Intelligence
- 24 May 2026
- Database Performance Management
Introduction Modern enterprises are operating in increasingly complex digital environments driven by cloud computing, artificial intelligence, real-time analytics, distributed applications, and rapidly growing data ecosystems. Organizations across industries rely heavily on cloud-native platforms and database-driven applications to support scalability, operational agility, and customer experiences. Today’s technology ecosystems support: Cloud infrastructures SaaS applications AI and machine … Continue reading “How to Reduce Operational Complexity with Enteros Database Optimization and Cloud Financial Intelligence”
How to Improve Retail Cloud Efficiency with Enteros Database Software and Infrastructure Intelligence
Introduction The retail industry is rapidly transforming as organizations accelerate digital commerce initiatives, modernize customer engagement platforms, and expand cloud-based infrastructures. Retailers today operate highly connected ecosystems involving ecommerce platforms, supply chain systems, customer analytics applications, payment processing environments, and omnichannel retail experiences. Modern retail organizations rely heavily on cloud technologies to support: Ecommerce platforms … Continue reading “How to Improve Retail Cloud Efficiency with Enteros Database Software and Infrastructure Intelligence”
How AI-Driven Database Analytics Enhances Performance and Scalability in Modern Insurance Platforms
Introduction The insurance industry is undergoing rapid digital transformation. Modern insurance platforms now support a wide range of digital services, including online policy management, automated claims processing, customer self-service portals, fraud detection systems, and AI-powered risk analysis. As customer expectations continue to evolve, insurance providers must deliver faster, more personalized, and highly reliable digital experiences. … Continue reading “How AI-Driven Database Analytics Enhances Performance and Scalability in Modern Insurance Platforms”
How to Drive Intelligent Cloud Governance with Enteros Database Management Platform and AIOps
- 22 May 2026
- Database Performance Management
Introduction Cloud computing has become the foundation of modern digital transformation. Organizations across industries increasingly rely on cloud-native infrastructures, distributed applications, AI-driven services, and real-time analytics platforms to support innovation, scalability, and operational agility. Today’s enterprises operate highly complex cloud ecosystems that support: Business-critical applications Database environments Customer engagement platforms AI and machine learning workloads … Continue reading “How to Drive Intelligent Cloud Governance with Enteros Database Management Platform and AIOps”