Preamble
Postgres is a fantastic all-around product with an unbeatable price-performance ratio, but there is one area where things could be done a little more effectively: replica validation, or, more broadly, cluster validation.
What’s the issue? There is no real “push-button” way to determine whether a freshly built replica is flawless. The standard replica-building tool, pg_basebackup, simply streams the datafiles and transaction logs over the network and writes them to disk, and that’s it. There are of course other tools and approaches, but I suspect pg_basebackup has 90% of the “market.” Then you typically start the server, check that you can connect to it, and perhaps run a few “select count(*) from my_favorite_table” queries to make sure everything looks more or less plausible, before grinning with satisfaction and going for a coffee to work on other pressing issues. But how far did that actually go toward ensuring that the cluster is intact?
What could go wrong with replicas?
What could possibly go wrong when creating a new replica? Naturally, if pg_basebackup reports success, there isn’t much cause for concern; it is highly unlikely that something is wrong. But a few things still nag at me. I may be a little paranoid here, but keep reading to see what I mean.
Starting with hard disks: disk errors should theoretically be a thing of the past, and business-grade disk systems should have error detection built in… But everything in the modern world is so virtualized and abstracted that you hardly even know what’s underneath. Perhaps there is some network storage involved? That storage needs drivers, both for the hardware and in the operating system. And every piece of software has bugs; Google suggests a typical figure of 15 to 50 bugs per 1,000 lines of shipped code. So when writing datafiles, silent data corruption can never be completely ruled out, especially since replicas frequently run on new hardware that has not yet been thoroughly burned in.
In these circumstances, Postgres’ “data-checksums” flag can be useful, but it must be enabled during cluster initialization and carries a small performance penalty, so it isn’t always used.
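For reference, checksums are enabled at initdb time, and you can check whether an existing cluster has them via SQL; the data directory path below is just an example, not a recommendation:

```shell
# Initialize a new cluster with page checksums enabled (example path).
initdb --data-checksums -D /var/lib/postgresql/9.6/main

# On a running cluster, check whether checksums are on ("on" / "off"):
psql -h /var/run/postgresql -qAtc "SHOW data_checksums;"
```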
What else? Postgres allows replicas to run on different operating systems (for instance, master on Ubuntu and replica on CentOS, which is a great feature in and of itself), and there is also the potential for minor version differences, such as the master on 9.6.2 and the replica accidentally on 9.6.0. Furthermore, there is always a chance of a bug in the replication code itself; I can think of at least one instance from a few years ago in which committed rows showed up on replicas as uncommitted. This is very rare, of course, but could technically happen again.
Status Quo
In other words, there are a few “ifs,” and the question is: what can be done about them? The only 100 percent trustworthy option is a full SQL dump/restore because, as stated, there is sadly no tool from the Postgres project for full verification (and I’m not even sure there should be one, or whether it can even be done). However, if you need to switch over to the new master quickly, this obviously won’t work, and for databases of 1TB or larger it could take hours and hours.
Therefore, simply dumping the database is a good compromise for feeling at least marginally better before making the switchover! At the very least, this confirms that there is no silent data corruption. Depending on your database size and hardware, though, this can also take hours. But there are some tricks to make it go faster; more on that below.
Speeding up dummy SQL dumps on replicas
1. Dump to “/dev/null”! Since we only need to confirm that the data can be read out, not actually store the SQL dump, this is a great performance booster. With the default plain SQL mode of pg_dumpall or pg_dump, it is as simple as:
pg_dumpall -h /var/run/postgresql >/dev/null
2. Connect to the local server via the Unix socket rather than the “localhost” TCP stack; this alone should yield a 10-15% speedup.
3. Use several processes to dump the data! When time is of the essence, this can be very helpful for larger databases; simply set the “jobs” flag to a value that puts a decent load on the IO subsystem. NB! Multiple jobs require the “directory” output format of pg_dump (-Fd/--format=directory plus -f/--file=dirname), which has some drawbacks:
- pg_dump’s “directory” format does not get along with /dev/null, as it wants to set up a directory structure first. There are workarounds, e.g. the nullfs FUSE filesystem driver, but they are tedious to say the least.
- It is not possible to obtain a consistent snapshot when dumping replicas with multiple processes (this will be fixed in the upcoming Postgres 10). For our “quick verification” goal this is not too catastrophic, and we can ignore the initial error by providing the “--no-synchronized-snapshots” flag.
- pg_dump only works “per database,” so if your cluster contains a large number of DBs, you may need a few lines of Bash around it.
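Putting the points above together, a parallel verification dump over all databases could look roughly like this; the job count of 4, the socket path, and the scratch directory are assumptions to adapt to your setup:

```shell
# Sketch: dump every non-template database with 4 parallel jobs each,
# using the directory format, then throw the output away. Assumes a
# replica listening on the default Unix socket and free space in /tmp.
DUMPDIR=/tmp/dump_verify
mkdir -p "$DUMPDIR"

for db in $(psql -h /var/run/postgresql -qAtc \
      "SELECT datname FROM pg_database WHERE NOT datistemplate"); do
    pg_dump -h /var/run/postgresql -Fd -j 4 \
        --no-synchronized-snapshots \
        -f "$DUMPDIR/$db" "$db" || echo "verification FAILED for $db"
    rm -rf "$DUMPDIR/$db"   # we only wanted to read the data, not keep it
done
```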
4. Create a quick custom script that takes care of the unfavorable side effects of approach number 3, so you don’t have to. Mine is a straightforward Python script that simply spawns a predetermined number of worker processes (by default, half of the CPUs) and dumps all tables one by one to /dev/null; all it needs is the Postgres “bindir” and the Unix socket as the host.
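The script itself isn’t reproduced here, but the core idea can be sketched in a few lines of shell; the database name, the worker count, and the schema filter are placeholders, and table names containing special characters would need extra quoting:

```shell
# Rough shell equivalent of the idea: list all user tables and stream
# each one to /dev/null, keeping 4 pg_dump workers busy at a time
# (xargs -P is a GNU extension).
HOST=/var/run/postgresql   # Unix socket directory
DB=mydb                    # database to verify

psql -h "$HOST" -d "$DB" -qAtc \
    "SELECT schemaname || '.' || tablename FROM pg_tables
     WHERE schemaname NOT IN ('pg_catalog', 'information_schema')" |
xargs -P 4 -I {} \
    pg_dump -h "$HOST" -d "$DB" --data-only -t {} -f /dev/null
```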
NB! If there is activity on the master, make sure query conflicts are taken care of before starting the dump on the replica, as it will otherwise fail quickly. Two options: enable “hot_standby_feedback,” or allow the replica to “fall behind” by setting “max_standby_streaming_delay” to -1 or some large value.
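In postgresql.conf terms, either of the following (pick one) should do; -1 here means “wait forever”:

```
# On the replica, one of the two approaches:
hot_standby_feedback = on            # master retains rows the dump still needs
# ...or let the replica lag while the dump runs:
max_standby_streaming_delay = -1     # or some generously large value
```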
That’s all, I sincerely hope you followed along and found some food for thought. Feedback is always welcome!
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.