I was recently asked how to fork PostgreSQL databases the way you can on Heroku. Since I could not find any good examples of how to do this, I decided to write up a short article.
“Forking” a database means making a copy of it that you can change without affecting the original source database. The simplest way to do this is to take a copy of the database, also known as a backup, and start a new database instance on top of it. I won’t linger on this point, because backup procedures are already well covered by plenty of articles and documentation. The problem with this obvious approach is performance: copying a lot of data takes a long time. If we use database forks to refresh our staging or testing environments (which of the two depends on whether the data needs to be kept confidential), we can’t afford to wait several hours every time.
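For reference, the obvious approach is roughly the following: take a full physical copy with pg_basebackup and start it up as an independent instance. This is only a sketch; the directory name and port numbers are made up for illustration, and the copy step is exactly what takes so long on a large database.

# Naive fork: full physical copy, then start it as an independent instance
# (the directory name and ports here are illustrative)
pg_basebackup -p 6000 -D full_copy -X stream
sed -i 's/#port = .*/port = 6001/' full_copy/postgresql.conf
pg_ctl -D full_copy -l full_copy/postgres.log start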
We need a way to copy everything without actually copying it, and the good news is that the storage layer can help. The technical term for what we are after is a copy-on-write snapshot. Specialized storage appliances can do this, and on Linux the LVM layer and the Btrfs and ZFS filesystems implement it in software.
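As a rough sketch of what that looks like outside of Btrfs, here are the equivalent copy-on-write snapshot commands for ZFS and LVM; the pool, volume group, and volume names are made up for illustration:

# ZFS: snapshot the dataset holding the data directory, then clone it into a writable copy
zfs snapshot tank/pgdata@fork1
zfs clone tank/pgdata@fork1 tank/pgdata_fork1

# LVM: create a writable copy-on-write snapshot of the logical volume
lvcreate --snapshot --name pgdata_fork1 --size 10G /dev/vg0/pgdata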
Btrfs is the easiest to set up for this demonstration, and it also happens to be what I have on hand: a RAID1 Btrfs filesystem, about 99% full, mounted at /mnt/data and backed by three Western Digital Green spinning disks (<6000 RPM). In other words, a storage system about as slow as you can make it.
First, I create a database to run our tests on: a fresh subvolume for the master database, a fresh PostgreSQL cluster initialized inside it, and the server started on port 6000.
/mnt/data/dbforking$ btrfs subvolume create master
Create subvolume './master'
/mnt/data/dbforking$ initdb master/
... snipped
/mnt/data/dbforking$ sed -i 's/#port = .*/port = 6000/' master/postgresql.conf
/mnt/data/dbforking$ pg_ctl -D master/ -l master/postgres.log start
server starting
/mnt/data/dbforking$ createdb -p 6000
/mnt/data/dbforking$ psql -p 6000 -c "SELECT 'hello world'"
  ?column?
-------------
 hello world
(1 row)
Now that the master server is up, let’s generate some test data on it with pgbench.
/mnt/data/dbforking$ time pgbench -p 6000 -i -s 1000
... few minutes of progress reports go by ...
100000000 of 100000000 tuples (100%) done (elapsed 173.58 s, remaining 0.00 s).
vacuum...
set primary keys...
done.

real    8m35.011s
user    0m21.746s
sys     0m0.739s
/mnt/data/dbforking$ du -sh master/
15G     master/
I now have a master database containing 15 GB of data. Let’s also update some data on the master so we can keep track of our forks. To do this, we set the filler column on a row of the pgbench_branches table.
/mnt/data/dbforking$ psql -p 6000 -c "UPDATE pgbench_branches
    SET filler = 'master before forking' WHERE bid = 1"
UPDATE 1
/mnt/data/dbforking$ psql -p 6000 -c "SELECT * FROM pgbench_branches WHERE bid = 1"
 bid | bbalance |         filler
-----+----------+------------------------
   1 |        0 | master before forking
(1 row)
The normal PostgreSQL hot backup procedure is to start the backup with pg_start_backup(), copy the database contents over, end the backup with pg_stop_backup(), and then copy the transaction logs (xlogs) over. However, Btrfs supports atomic snapshots. An atomic snapshot quickly gives us exactly the contents that a regular copy would have captured if PostgreSQL had been SIGKILLed at that moment, and PostgreSQL’s durability mechanisms ensure that recovering from such a copy yields a consistent state containing only the transactions that had committed by that point. In our situation, the backup management commands are therefore not necessary.
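For comparison, a minimal sketch of that traditional procedure might look like this; the backup destination directory is made up for illustration:

/mnt/data/dbforking$ psql -p 6000 -c "SELECT pg_start_backup('fork', true)"
/mnt/data/dbforking$ rsync -a --exclude=postmaster.pid --exclude=pg_xlog master/ /backup/fork_copy/
/mnt/data/dbforking$ psql -p 6000 -c "SELECT pg_stop_backup()"
/mnt/data/dbforking$ rsync -a master/pg_xlog/ /backup/fork_copy/pg_xlog/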
To add some spice, and to show that this can be done on a production database without much trouble, I’ll launch a workload against the master in a separate shell: a steady load of 50 transactions per second from 16 clients. That is not a high rate, but it is a real challenge for the slow storage system we are using here.
/mnt/data/dbforking$ pgbench -p 6000 --rate=50 --client=16 --progress=5 --time=1200
starting vacuum...end.
progress: 5.0 s, 44.6 tps, lat 129.484 ms stddev 30.302, lag 0.304 ms
Let’s fork the database now that we are prepared to do so:
/mnt/data/dbforking$ time btrfs subvolume snapshot master fork1
Create a snapshot of 'master' in './fork1'

real    0m6.295s
user    0m0.000s
sys     0m0.202s
/mnt/data/dbforking$ du -sh fork1/
15G     fork1/
And these are the performance metrics from that period:
progress: 30.0 s, 54.5 tps, lat 134.819 ms stddev 34.012, lag 0.500 ms
progress: 35.1 s, 44.4 tps, lat 199.910 ms stddev 75.235, lag 3.243 ms
progress: 40.1 s, 40.5 tps, lat 1281.642 ms stddev 791.303, lag 970.009 ms
progress: 45.0 s, 62.1 tps, lat 349.229 ms stddev 313.576, lag 145.631 ms
progress: 50.0 s, 50.0 tps, lat 146.155 ms stddev 45.599, lag 2.543 ms
progress: 55.1 s, 53.0 tps, lat 146.554 ms stddev 40.694, lag 0.562 ms
So we were able to fork a 15 GB database in about 6 seconds, with only a small hiccup in performance. Now we can start the forked database.
A couple of things need to be done before the fork can be started. First, the master’s pid file has to be removed. PostgreSQL can usually clean it up by itself, but because the master is still running on the same machine it cannot tell whether that is safe here, so we do it ourselves. Second, the fork needs to be configured to listen on a different port from the master. Once that is done, we can start the server and watch it successfully perform crash recovery.
/mnt/data/dbforking$ rm fork1/postmaster.pid
/mnt/data/dbforking$ sed -i 's/port = .*/port = 6001/' fork1/postgresql.conf
/mnt/data/dbforking$ pg_ctl -D fork1 -l fork1/postgres.log start
/mnt/data/dbforking$ tail fork1/postgres.log
LOG:  database system was interrupted; last known up at 2015-01-09 14:47:29 EET
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/7086E90
LOG:  record with zero length at 0/8BDD888
LOG:  redo done at 0/8BDD858
LOG:  last completed transaction was at log time 2015-01-09 14:51:44.56637+02
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
Next, let’s confirm that we really have two separate databases by changing the tracking row so that it differs between the fork and the master.

Let’s update the forked database:
/mnt/data/dbforking$ psql -p 6001 -c "UPDATE pgbench_branches
    SET filler = 'fork1 after forking' WHERE bid = 1"
UPDATE 1
/mnt/data/dbforking$ psql -p 6001 -c "SELECT * FROM pgbench_branches WHERE bid = 1"
 bid | bbalance |        filler
-----+----------+----------------------
   1 |     1025 | fork1 after forking
(1 row)
And check that the master database still has the old data (except for the bbalance column, which pgbench has been updating):
/mnt/data/dbforking$ psql -p 6000 -c "SELECT * FROM pgbench_branches WHERE bid = 1"
 bid | bbalance |         filler
-----+----------+------------------------
   1 |   -33546 | master before forking
(1 row)
With this, we have a working recipe for forking a production database without any major disruption. You probably don’t want to run your staging tests or your development environment on the same machine as the master, though. To fork onto other machines, run a streaming replication standby in your staging or development environment and take the snapshots from the standby’s data directory instead. Just make sure to replace or delete the recovery.conf file in the fork; otherwise the snapshot will keep running in standby mode.
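A minimal sketch of what that could look like on the staging or development machine, assuming the standby’s data directory is itself a Btrfs subvolume at /mnt/data/standby (all paths and the port here are illustrative):

# On the machine running the streaming standby
btrfs subvolume snapshot /mnt/data/standby /mnt/data/fork1
rm /mnt/data/fork1/postmaster.pid
rm /mnt/data/fork1/recovery.conf        # otherwise the fork keeps running as a standby
sed -i 's/port = .*/port = 6001/' /mnt/data/fork1/postgresql.conf
pg_ctl -D /mnt/data/fork1 -l /mnt/data/fork1/postgres.log start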
Additionally, it’s a good idea to move the fork onto a new timeline, so PostgreSQL is aware of the fork and can warn you about mistakes such as the fork replicating from the master (or vice versa). To do this, create a recovery.conf that contains the single line restore_command = '/bin/false'. This switches PostgreSQL into point-in-time recovery mode (as opposed to regular crash recovery), which creates a timeline switch at the end of the transaction log. The /bin/false restore command simply signals that there is no archive from which to fetch further transaction logs.
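Concretely, that just means dropping the file into the fork’s data directory before starting it for the first time, for example:

/mnt/data/dbforking$ echo "restore_command = '/bin/false'" > fork1/recovery.conf
/mnt/data/dbforking$ pg_ctl -D fork1 -l fork1/postgres.log start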
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.