Preamble
Postgres is a fantastic all-around product with an unbeatable price-performance ratio, but there is one area where things could be done a little more effectively: replica validation, or, more broadly, cluster validation.
What’s the issue? There isn’t actually a “pushbutton” way to determine whether a freshly built replica is flawless. The standard replica-building tool, pg_basebackup, simply streams over the datafiles and transaction logs and writes them to disk, and that’s it. There are of course other tools and approaches, but I suspect pg_basebackup has 90% of the “market.” Then you typically start the server, check that you can connect to it, and possibly run a few “select count(*) from my_favorite_table” queries to make sure everything looks more-or-less plausible, before grinning with satisfaction and going for a coffee to deal with other pressing issues. But how much did that really do to ensure the cluster is intact?
What could go wrong with replicas?
What could possibly go wrong when creating a new replica? Naturally, there isn’t much cause for concern if pg_basebackup reports success; it is highly unlikely that something is wrong. But a few things still irritate me. I may be a little paranoid here, but keep reading to find out what I mean.
Starting with the hard disks: disk errors should theoretically be a thing of the past, and business-grade disk systems have error detection built in… but everything in the modern world is so virtualized and abstracted that you hardly even know what’s underneath. Perhaps there is some network storage involved? That storage, in turn, needs drivers both for the hardware and in the operating system, and every piece of software has bugs, as we are all aware; Google suggests an industry average of 15 to 50 errors per 1000 lines of delivered code. Therefore, when writing the datafiles, it is impossible to completely rule out silent data corruption, especially since replicas frequently run on new hardware that has not yet been thoroughly tested.
In these circumstances, Postgres’ “data checksums” feature can be useful, but it must be enabled during cluster initialization and carries a small performance overhead, so it isn’t always used.
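For reference, on current Postgres versions checksums can only be switched on when the cluster is initialized; a minimal sketch (the data directory path is just an example) of enabling and later verifying the setting:

# enable page checksums at initdb time (cannot be changed afterwards)
initdb --data-checksums -D /var/lib/postgresql/data
# verify on a running instance
psql -c "SHOW data_checksums;"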
What else? Postgres makes it possible to run replicas on different operating systems (for instance, master on Ubuntu and replica on CentOS, which is of course a great feature in itself), and there is also the potential for minor version differences, such as the master on 9.6.2 and the replica accidentally on 9.6.0. Furthermore, there is always a chance that the replication code itself misbehaves; I can think of at least one instance from a few years ago in which committed rows showed up on replicas as uncommitted. This is of course very rare, but it could technically happen again.
Status Quo
So there are a few “ifs,” and the question is: what can be done about them? As stated before, there is sadly no tool from the Postgres project to perform a full verification (and I’m not even sure there should be one, or whether it is even doable), so the only 100 percent trustworthy option is a full SQL dump/restore. That, however, obviously won’t work if you need to switch over to the new master quickly, and it could take many hours for databases of 1 TB or more.
Therefore, simply dumping the database (without restoring it) is a good compromise to feel at least marginally better before making the switchover! At the very least, it will confirm that all the data can be read out and there is no silent data corruption. Depending on your DB size and hardware this could still take hours, but there are some tricks to make it go faster; more on that below.
Speeding up dummy SQL dumps on replicas
1. Redirect the output to “/dev/null”! Since we only need to confirm that the data can be read out, there is no point in actually storing the SQL dump, and skipping it is a great performance booster. With the default plain SQL mode of pg_dumpall or pg_dump it is as simple as:
pg_dumpall -h /var/run/postgresql >/dev/null
2. Connect to the local server via the Unix socket rather than the “localhost” TCP stack (note the -h /var/run/postgresql in the command above); this alone should already yield a 10-15% speedup.
3. Use several processes to dump the data! When time is of the essence, this can be very helpful for larger databases. Simply set the “jobs” flag (-j/--jobs) to a value your IO subsystem can cope with. NB! Using multiple jobs requires pg_dump’s “directory” output format (-Fd/--format=directory plus -f/--file=dirname), which comes with some drawbacks:
- The “directory” format does not get along with /dev/null, as it wants to set up a directory structure of its own. There are of course workarounds, e.g. check out the nullfs FUSE filesystem driver, but it’s tedious to say the least.
- Currently it is not possible to obtain a consistent snapshot when dumping a replica with multiple processes (this will be fixed in the upcoming Postgres 10). For our “quick verification” goal this is not too catastrophic, and we can ignore the resulting error by providing the “--no-synchronized-snapshots” flag.
- pg_dump only works “per database,” so if your cluster contains a large number of DBs, you may need a few lines of Bash to cover them all (see the sketch after this list).
4. Use a quick custom script that takes care of the downsides of approach number 3, so you don’t have to deal with them yourself. Mine is a straightforward Python script that simply spawns a predetermined number of worker processes (by default, half of the CPUs) and dumps all tables sequentially to /dev/null; all that’s needed as input is the Postgres “bindir” and the Unix socket as the host.
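To illustrate the “some lines of Bash” from the last bullet of point 3, something like the following minimal sketch could do the job (this is only an illustration, not the Python script from point 4; the socket directory is an assumption and may need adjusting). It loops over all connectable databases of the local cluster and dumps each one to /dev/null, one after another:

#!/bin/bash
# loop over all connectable databases and dump each one to /dev/null
HOST=/var/run/postgresql   # assumed socket directory, adjust as needed
for db in $(psql -h "$HOST" -qAtc "SELECT datname FROM pg_database WHERE datallowconn" postgres); do
    echo "verifying $db ..."
    pg_dump -h "$HOST" "$db" >/dev/null || echo "WARNING: pg_dump failed for $db"
done

For crude parallelism you could launch a few of those pg_dump calls in the background (appending “&” and finishing with “wait”), keeping an eye on the IO subsystem.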
NB: If there is activity on the master, make sure you have dealt with the query conflict situation before starting the dump on the replica, as otherwise it will fail pretty quickly! Configuring hot_standby_feedback or allowing the replica to “fall behind” by setting max_standby_streaming_delay to -1 or a large value are two options.
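For reference, both knobs live in the replica’s postgresql.conf (or can be set with ALTER SYSTEM) and take effect after a configuration reload; a minimal sketch, picking one of the two approaches:

# option 1: ask the master to hold back vacuum so standby queries aren't cancelled
hot_standby_feedback = on
# option 2: let the standby pause WAL replay instead of cancelling long-running queries
max_standby_streaming_delay = -1    # or a generous value such as '12h'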
That’s all. I sincerely hope you followed along and found some food for thought. Feedback is always welcome!
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.