Article
A query about how to fork PostgreSQL databases like you can on Heroku was sent to me. Since I could not find any good examples of how to do this, I made the decision to write a brief article.
Making a copy of a database so you can make changes to it without affecting the original source database is known as “forking” a database. In this case, the easiest thing to do is make a copy of the database, which is also called a backup, and start a new database instance on it. I won’t linger on this point because there are enough articles and documents that cover backup procedures. The obvious solution has a performance problem because copying a lot of data takes a long time. We can’t afford to have to wait several hours every time we use database forks to update our staging or testing environments (depending on whether or not they need to be kept secret).
We must find a way to copy everything without actually making copies of it. The good news is that you can accomplish this with your storage layer’s help. The technical term for what we’re after is a copy-on-write snapshot. Specialized storage units and the Linux LVM layer, Btrfs, and ZFS filesystems all do this in software.
Btrfs is the easiest to set up for this demonstration, and it’s also what I just so happened to have on hand. I have a 99% filled up RAID1 Btrfs filesystem mounted at /mnt/data, backed by three Western Digital Green spinning disks (<6000RPM). To put it another way, a storage system as slow as you can make it.
I first created a database on which to run our tests. I make a fresh PostgreSQL database, start it up on port 6000, and create a fresh subvolume for the master database.
/mnt/data/dbforking$ btrfs subvolume create master Create subvolume './master /mnt/data/dbforking$ initdb master/ ... snipped /mnt/data/dbforking $ sed -i 's/# port =.*/port = 6000/' master/postgresql.conf /mnt/data/dbforking $ pg_ctl -D master/ -l master/postgres.log start server starting /mnt/data/dbforking $ createdb -p 6000 /mnt/data/dbforking $ psql -p 6000 -c "Select "Hello, World"?" column? ------------- hello world (1 row)
Let’s create some fake data on the master server now that it is operational. For this, I’ll use pgBench.
/mnt/data/dbforking$ time pgbench -p 6000 -i -s 1000 ... few minutes of progress reports go by 100000000 of 100000000 tuples (100%) done (elapsed 173.58 s, remaining 0.00 s). vacuum... set primary keys... done. real 8m35.011s user 0m21.746s sys 0m0.739s /mnt/data/dbforking /mnt/data/dbforking $ du -sh master 15G master/
I currently have a master database that contains 15 GB of data. Let’s update some master data as well so we can keep track of our forks. We will set thefiller column on a row in pgbench_branches table to do this.
/mnt/data/dbforking$ psql -p 6000 -c "UPDATE pgbench_branches" Set filler = "master before forking" WHERE bid = 1" UPDATE 1 /mnt/data/dbforking$ psql -p 6000 -c " SELECT * FROM pgbench_branches WHERE bid = "1" bid | bbalance | filler -----+----------+------------------------------------------------------------------------------------------ 1 | 0 | master before forking (1 row)
Normal PostgreSQL hot backup procedure is to start your backup with pg_start_backup(), copy database contents over, end backup with pg_stop_backup() and then copy xlogs over. However, Btrfs supports atomic snapshots. With atomic snapshots, we can quickly create a snapshot with the exact same contents as a regular copy that would have been made if PostgreSQL had been SIGKILLed at that time. PostgreSQL’s mechanisms for durability make sure that we will get a consistent state that only shows the successful commits up to that point. In our situation, the backup management commands are not necessary.
To add some spice, I will launch a workload on the master in a separate shell to show that we can do this on a production database with ease. It should be possible to maintain a steady load of 50 transactions per second with 16 clients. 50 transactions per second is slow, but it’s a big challenge for the slow storage system we’re using here.
/mnt/data/dbforking$ pgbench -p 6000 --rate=50 --client=16 --progress=5 --time=1200 starting vacuum... end. progress: 5.0 s, 44.6 tps, lat 129.484 ms, stddev 30.302, lag 0.304 ms.
Let’s fork the database now that we are prepared to do so:
/mnt/data/dbforking$ time btrfs subvolume snapshot master fork1 Create a snapshot of "master" in "./fork1". real 0m6.295s user 0m0.000s sys 0m0.202s /mnt/data/dbforking $ du -sh fork1/ 15G fork1/
And these are the performance metrics from that period:
progress: 30.0 s, 54.5 tps, lat 134.819 ms stddev 34.012, lag 0.500 ms progress: 35.1 s, 44.4 tps, lat 199.910 ms stddev 75.235, lag 3.243 ms progress: 40.1 s, 40.5 tps, lat 1281.642 ms stddev 791.303, lag 970.009 ms progress: 45.0 s, 62.1 tps, lat 349.229 ms stddev 313.576, lag 145.631 ms progress: 50.0 s, 50.0 tps, lat 146.155 ms stddev 45.599, lag 2.543 ms progress: 55.1 s, 53.0 tps, lat 146.554 ms stddev 40.694, lag 0.562 ms
So, we were able to split a 15 GB database in 6 seconds with only a small hiccup in performance. The forked database can now be launched.
We must do a few things before the fork can begin. The master’s pid file must first be deleted. PostgreSQL can usually remove it by itself, but since the master is still running on the same machine, it doesn’t know if it’s safe to do so in this case, so we have to do it ourselves. Second, the fork needs to be set up to use a different port from the master. When this is finished, we can restart the server and see that it recovers successfully from a crash.
/mnt/data/dbforking$ rm fork1/postmaster.pid /mnt/data/dbforking $ sed -i "s/port =. */port = 6001/" fork1/postgresql.conf /mnt/data/dbforking /mnt/data/dbforking $ pg_ctl -D fork1 -l fork1/postgres.log start /mnt/data/dbforking $ tail fork1/postgres.log LOG: The database system was interrupted; it was last known to be up at 2015-01-09 14:47:29 EET. LOG: database system was not properly shut down; automatic recovery in progress LOG: Undo starts at 0/7086E90 LOG: record with zero length at 0/8BDD888 LOG: redo done at 0/8BDD858 LOG: last completed transaction was at log time 2015-01-09 14:51:44.56637+02 Log: The database system is ready to accept connections. LOG: autovacuum launcher started
Next, we can confirm that we are actually using two different databases. Change the tracking row to ensure that it is distinct in both the fork database and the master database.
Let’s update the database that was forked:
/mnt/data/dbforking$ psql -p 6001 -c "UPDATE pgbench_branches" SET filler = "fork1 after forking" WHERE bid = 1" UPDATE 1 /mnt/data/dbforking$ psql -p 6001 -c " SELECT * FROM pgbench_branches WHERE bid = "1" bid | bbalance | filler -----+----------+------------------------------------------------------------------------------------------ 1 | 1025 | fork1 after forking (1 row)
And check that the master database still has the old data (except for thebbalance column that pgbench has updated):
/mnt/data/dbforking$ psql -p 6000 -c "SELECT * FROM pgbench_branches" WHERE bid = "1" bid | bbalance | filler -----+----------+------------------------------------------------------------------------------------------ 1 | -33546 | master before forking (1 row)
With this, we have a plan for how to fork a production database without any major problems. On the same machine as the master, you probably don’t want to run your staging tests or your development environment. To copy new databases from the streaming standby and put them on different machines, you would need a streaming replication standby running in your staging or development environment. Just make sure to replace or delete the recovery.conf file. If you don’t, your snapshot will keep running in standby mode.
Additionally, it’s a good idea to move the fork to a new timeline so PostgreSQL is aware of the fork and can alert you to mistakes like having the fork replicate data from master (or vice versa) if necessary. To do this create a recovery.conf that contains the single line restore_command = '/bin/false'. This will switch PostgreSQL to point-in-time recovery mode (as opposed to regular crash recovery), creating a timeline switch at the end of the transaction log./bin/false Is there any sign that there is no archive where more transaction logs can be found?
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.
Are you interested in writing for Enteros’ Blog? Please send us a pitch!
RELATED POSTS
How Intelligent Database Analytics Improves Data Infrastructure Performance in Modern Hospital Management Systems
- 20 May 2026
- Database Performance Management
Introduction Healthcare organizations are undergoing rapid digital transformation. Hospitals now rely on advanced technologies such as Electronic Health Records (EHR), telemedicine platforms, AI-driven diagnostics, medical imaging systems, and hospital management systems to deliver faster and more efficient patient care. At the center of these systems lies a powerful data infrastructure supported by complex databases. Modern … Continue reading “How Intelligent Database Analytics Improves Data Infrastructure Performance in Modern Hospital Management Systems”
Enhancing Retail Platform Performance and Customer Experience with Intelligent Database Analytics
- 19 May 2026
- Database Performance Management
Retail has undergone a major transformation over the past decade. With the rapid growth of digital commerce, omnichannel experiences, and real-time customer engagement, modern retail platforms depend heavily on high-performance data infrastructure. From product searches and inventory updates to personalized recommendations and checkout transactions, every step of the customer journey relies on fast, reliable, and … Continue reading “Enhancing Retail Platform Performance and Customer Experience with Intelligent Database Analytics”
How to Optimize Retail Infrastructure Costs with Enteros Database Optimization and Azure Cost Allocation
Introduction The retail industry is rapidly evolving as organizations embrace cloud computing, AI-driven analytics, ecommerce expansion, omnichannel operations, and data-intensive customer engagement platforms. Modern retailers must manage increasingly complex digital infrastructures while delivering seamless customer experiences across online stores, physical retail locations, mobile applications, and supply chain ecosystems. Today’s retail organizations rely heavily on digital … Continue reading “How to Optimize Retail Infrastructure Costs with Enteros Database Optimization and Azure Cost Allocation”
How to Strengthen Ecommerce Scalability with Enteros AIOps Platform and Financial Intelligence
Introduction The ecommerce industry is evolving rapidly as organizations embrace cloud-native infrastructures, AI-powered customer engagement, real-time analytics, and digital commerce platforms to support global growth and customer expectations. Modern ecommerce businesses must manage massive operational complexity while delivering seamless digital experiences across websites, mobile applications, marketplaces, and omnichannel retail systems. Today’s ecommerce organizations rely heavily … Continue reading “How to Strengthen Ecommerce Scalability with Enteros AIOps Platform and Financial Intelligence”