I decided to try joins after reading the information on the pgstrom wiki page to see how much speed we could gain by offloading some of the work that PostgreSQL needs to do to the graphics card.
I started by producing some test data:
test=# CREATE TABLE t_test AS SELECT id, id % 10 AS ten, id % 20 AS twenty FROM generate_series(1, 25000000) AS id ORDER BY id;
SELECT 25000000
test=# CREATE TABLE t_join AS SELECT * FROM t_test ORDER BY random() LIMIT 1000000;
SELECT 1000000
test=# SELECT count(*) FROM t_join;
  count
---------
 1000000
(1 row)

test=# SELECT count(*) FROM t_test;
  count
----------
 25000000
(1 row)
So 1 million rows, all of which are contained in the original data, are to be joined with 25 million rows.
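To make the setup concrete, here is a small Python sketch of the same data shape, scaled down to 25,000 rows (the sizes here are illustrative stand-ins, not the article's real ones): a big table with `ten` and `twenty` group keys, and a random subset that is fully contained in it.

```python
import random

# Scaled-down stand-in for the test data: 25,000 rows instead of 25 million
N_TEST, N_JOIN = 25_000, 1_000

# t_test: id plus the two group keys (id % 10 and id % 20)
t_test = [(i, i % 10, i % 20) for i in range(1, N_TEST + 1)]

# t_join: a random subset of t_test, so every row is guaranteed to join back
t_join = random.sample(t_test, N_JOIN)

all_rows = set(t_test)
print(len(t_join), all(row in all_rows for row in t_join))   # 1000 True
```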
A first test already reveals a nice improvement. Here are the first two queries, executed on the CPU:
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id GROUP BY a.ten;
                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=724919.18..724919.28 rows=10 width=4) (actual time=11283.769..11283.772 rows=10 loops=1)
   Group Key: a.ten
   ->  Hash Join  (cost=31813.00..719919.18 rows=1000000 width=4) (actual time=340.218..11061.192 rows=1000000 loops=1)
         Hash Cond: (a.id = b.id)
         ->  Seq Scan on t_test a  (cost=0.00..385135.40 rows=24999940 width=8) (actual time=0.046..3590.336 rows=25000000 loops=1)
         ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=338.516..338.516 rows=1000000 loops=1)
               Buckets: 131072  Batches: 16  Memory Usage: 3226 kB
               ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.031..142.085 rows=1000000 loops=1)
 Planning time: 0.587 ms
 Execution time: 11284.411 ms
(10 rows)

test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id GROUP BY a.twenty;
                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=724919.18..724919.38 rows=20 width=4) (actual time=10991.615..10991.619 rows=20 loops=1)
   Group Key: a.twenty
   ->  Hash Join  (cost=31813.00..719919.18 rows=1000000 width=4) (actual time=317.089..10766.779 rows=1000000 loops=1)
         Hash Cond: (a.id = b.id)
         ->  Seq Scan on t_test a  (cost=0.00..385135.40 rows=24999940 width=8) (actual time=0.050..3636.188 rows=25000000 loops=1)
         ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=316.321..316.321 rows=1000000 loops=1)
               Buckets: 131072  Batches: 16  Memory Usage: 3226 kB
               ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.026..141.551 rows=1000000 loops=1)
 Planning time: 0.240 ms
 Execution time: 10992.203 ms
(10 rows)
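The Hash Join in these plans builds a hash table on the smaller table (`t_join`) and probes it while scanning the big one. A minimal Python sketch of that strategy, with tiny illustrative tables and ignoring the batching and memory limits the real executor handles:

```python
from collections import defaultdict

# Tiny stand-ins for the two tables (real sizes: 25M and 1M rows)
t_test = [(i, i % 10) for i in range(1, 101)]   # (id, ten)
t_join = [(i,) for i in range(1, 101, 2)]       # odd ids only

# Build phase: hash the smaller relation on the join key
# (this mirrors the "Hash" node built over t_join b)
hash_table = defaultdict(list)
for (jid,) in t_join:
    hash_table[jid].append(jid)

# Probe phase + aggregation: scan the big relation, probe the hash table
# ("Hash Cond: (a.id = b.id)"), and count matches per group key
# (the HashAggregate on top: count(*) ... GROUP BY a.ten)
counts = defaultdict(int)
for tid, ten in t_test:
    for _ in hash_table.get(tid, ()):
        counts[ten] += 1

print(dict(sorted(counts.items())))   # {1: 10, 3: 10, 5: 10, 7: 10, 9: 10}
```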
The “AMD FX(tm)-8350 Eight-Core Processor” used here runs at 4 GHz.
Let’s attempt the same on the GPU:
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id GROUP BY a.ten;
                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=175445.25..175445.35 rows=10 width=4) (actual time=2556.159..2556.161 rows=10 loops=1)
   Group Key: a.ten
   ->  Custom Scan (GpuPreAgg)  (cost=27659.66..174601.72 rows=22 width=8) (actual time=2538.558..2556.123 rows=30 loops=1)
         Bulkload: Off
         Reduction: Local + Global
         ->  Custom Scan (GpuJoin)  (cost=23659.66..170445.25 rows=1000000 width=4) (actual time=1392.571..2437.772 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
               Inner Buffer: (82.25MB), DMA nums: 13, size: 1069.32MB
               ->  Custom Scan (BulkScan) on t_test a  (cost=0.00..385137.60 rows=25000160 width=8) (actual time=16.137..1333.313 rows=25000000 loops=1)
               ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.018..109.725 rows=1000000 loops=1)
 Planning time: 1.720 ms
 Execution time: 3264.747 ms
(14 rows)

test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id GROUP BY a.twenty;
                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=175445.25..175445.45 rows=20 width=4) (actual time=2532.113..2532.121 rows=20 loops=1)
   Group Key: a.twenty
   ->  Custom Scan (GpuPreAgg)  (cost=27659.66..174601.94 rows=44 width=8) (actual time=2508.512..2531.991 rows=120 loops=1)
         Bulkload: Off
         Reduction: Local + Global
         ->  Custom Scan (GpuJoin)  (cost=23659.66..170445.25 rows=1000000 width=4) (actual time=1373.296..2410.850 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
               Inner Buffer: (82.25MB), DMA nums: 11, size: 904.81MB
               ->  Custom Scan (BulkScan) on t_test a  (cost=0.00..385137.60 rows=25000160 width=8) (actual time=12.961..1308.733 rows=25000000 loops=1)
               ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.018..… rows=1000000 loops=1)
 Planning time: 0.539 ms
 Execution time: 3225.901 ms
(14 rows)
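The GpuPreAgg node with “Reduction: Local + Global” deserves a remark: instead of shipping all 1 million joined rows back for aggregation, each GPU work unit produces partial per-group counts (note `rows=30` and `rows=120` for 10 and 20 groups), which are then merged globally. A rough Python analogy of that two-level reduction, with a hypothetical chunk size standing in for the actual PG-Strom kernel granularity:

```python
from collections import Counter

# 1000 joined rows with group key "ten" (a stand-in for the 1M join results)
rows = [i % 10 for i in range(1, 1001)]

# Local reduction: each chunk -- think of it as one GPU work unit -- aggregates
# its own rows independently, producing a few partial rows per group
chunks = [rows[i:i + 300] for i in range(0, len(rows), 300)]
partials = [Counter(chunk) for chunk in chunks]

# Global reduction: the small partial results are merged into the final counts,
# so only len(partials) * 10 rows cross the chunk boundary instead of 1000
total = Counter()
for partial in partials:
    total += partial

print(sum(total.values()), len(total))   # 1000 10
```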
In my opinion this is a nice development: the query executes much faster.
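Putting a number on “much faster”, using the execution times reported by EXPLAIN ANALYZE above:

```python
# Execution times reported by EXPLAIN ANALYZE (milliseconds)
cpu_ms = 11284.411   # plain CPU hash join, GROUP BY a.ten
gpu_ms = 3264.747    # GpuJoin + GpuPreAgg, same query

# The GPU plan is roughly 3.5x faster on this query
print(f"speedup: {cpu_ms / gpu_ms:.2f}x")   # speedup: 3.46x
```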
Utilizing indexes
It is interesting to see what happens when indexes are added to both sides of the join.
test=# CREATE INDEX idx_id ON t_test (id);
CREATE INDEX
test=# CREATE INDEX idx_id2 ON t_join (id);
CREATE INDEX
The standard query shown before does not behave any differently, because the join still needs too much of the data. So the test is repeated with a reasonable filter:
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id AND a.id < 100000 GROUP BY a.ten;
                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=32541.70..32541.71 rows=1 width=4) (actual time=45.845..45.846 rows=10 loops=1)
   Group Key: a.ten
   ->  Merge Join  (cost=1.49..32519.75 rows=4390 width=4) (actual time=0.076..44.423 rows=4137 loops=1)
         Merge Cond: (a.id = b.id)
         ->  Index Scan using idx_id on t_test a  (cost=0.44..3722.05 rows=109749 width=8) (actual time=0.025..27.556 rows=99999 loops=1)
               Index Cond: (id < 100000)
         ->  Index Only Scan using idx_id2 on t_join b  (cost=0.42..25980.42 rows=1000000 width=4) (actual time=0.022..0.931 rows=4138 loops=1)
               Heap Fetches: 0
 Planning time: 0.449 ms
 Execution time: 45.946 ms
(10 rows)
The CPU version runs fairly quickly. PostgreSQL performs a nice merge join after scanning the indexes and using the sorted input. The aggregation happens fairly quickly as well.
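The merge join works because both index scans deliver ids in sorted order, so the two inputs can be walked in lockstep. A compact Python sketch of that two-pointer merge (illustrative only, with small made-up id ranges and unique join keys assumed):

```python
def merge_join(left, right):
    """Two-pointer merge over two sorted lists of unique ids, the way a
    Merge Join consumes its pre-sorted index scans."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i]); i += 1; j += 1
        elif left[i] < right[j]:
            i += 1   # advance whichever side holds the smaller key
        else:
            j += 1
    return out

a_ids = list(range(1, 100))     # t_test ids passing the filter (scaled down)
b_ids = list(range(1, 200, 3))  # a sparse set of t_join ids, also sorted

print(len(merge_join(a_ids, b_ids)))   # 33
```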
With the GPU enabled, the plan chosen in the default setting looks less efficient:
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id AND a.id < 100000 GROUP BY a.ten;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=13404.21..13404.22 rows=1 width=4) (actual time=118.154..118.156 rows=10 loops=1)
   Group Key: a.ten
   ->  Custom Scan (GpuJoin)  (cost=9649.48..13383.34 rows=4174 width=4) (actual time=70.993..117.099 rows=4137 loops=1)
         Bulkload: On (density: 100.00%)
         Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                  Nrows (in:1000000 out:4137, 0.41% planned 0.42%), KDS-Hash (size: 5859.39KB planned 8.58MB, nbatches: 1 planned 1)
         Inner Buffer: (8.58MB), DMA nums: 1, size: 8.58MB
         ->  Custom Scan (BulkScan) on t_join b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=7.164..24.709 rows=1000000 loops=1)
         ->  Index Scan using idx_id on t_test a  (cost=0.44..3542.55 rows=104349 width=8) (actual time=0.018..21.684 rows=99999 loops=1)
               Index Cond: (id < 100000)
 Planning time: 0.858 ms
 Execution time: 553.543 ms
(12 rows)
Even though the query is slower in this case, it should not be a big deal. If the cost models are balanced properly and the planner is taught how to pick the best plan, the system should be able to figure out that the CPU version is the faster one here. So the speed penalty is not really a problem: once all the infrastructure is in place, balancing the cost model and making minor planner adjustments does not look like a deal-breaker (at least not compared to what has been achieved already).
Joining more tables
Just to see what happens, let’s simply add more tables to the join. I once again removed my two indexes to give the CPU and GPU a fair chance.
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b, t_test AS c, t_join AS d, t_test AS e, t_join AS f WHERE a.id = b.id AND b.id = c.id AND c.id = d.id AND d.id = e.id AND e.id = f.id;
                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1745125.48..1745125.49 rows=1 width=0) (actual time=34234.209..34234.209 rows=1 loops=1)
   ->  Hash Join  (cost=1266218.99..1745121.48 rows=1600 width=0) (actual time=23054.940..34133.602 rows=1000000 loops=1)
         Hash Cond: (e.id = a.id)
         ->  Seq Scan on t_test e  (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.056..2728.388 rows=25000000 loops=1)
         ->  Hash  (cost=1266198.99..1266198.99 rows=1600 width=20) (actual time=23054.441..23054.441 rows=1000000 loops=1)
               Buckets: 131072 (originally 2048)  Batches: 32 (originally 1)  Memory Usage: 4205 kB
               ->  Hash Join  (cost=787296.49..1266198.99 rows=1600 width=20) (actual time=12247.386..22803.559 rows=1000000 loops=1)
                     Hash Cond: (c.id = a.id)
                     ->  Seq Scan on t_test c  (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.004..2727.218 rows=25000000 loops=1)
                     ->  Hash  (cost=787276.49..787276.49 rows=1600 width=16) (actual time=12246.958..12246.958 rows=1000000 loops=1)
                           Buckets: 131072 (originally 2048)  Batches: 16 (originally 1)  Memory Usage: 3960 kB
                           ->  Hash Join  (cost=768104.49..787276.49 rows=1600 width=16) (actual time=11330.994..12044.482 rows=1000000 loops=1)
                                 Hash Cond: (f.id = a.id)
                                 ->  Seq Scan on t_join f  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.030..111.371 rows=1000000 loops=1)
                                 ->  Hash  (cost=767604.49..767604.49 rows=40000 width=12) (actual time=11330.687..11330.687 rows=1000000 loops=1)
                                       Buckets: 131072 (originally 65536)  Batches: 16 (originally 1)  Memory Usage: 3716 kB
                                       ->  Hash Join  (cost=63626.00..767604.49 rows=40000 width=12) (actual time=620.425..11124.981 rows=1000000 loops=1)
                                             Hash Cond: (a.id = d.id)
                                             ->  Hash Join  (cost=31813.00..719920.49 rows=1000000 width=8) (actual time=306.078..10215.246 rows=1000000 loops=1)
                                                   Hash Cond: (a.id = b.id)
                                                   ->  Seq Scan on t_test a  (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.002..3430.792 rows=25000000 loops=1)
                                                   ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=305.466..305.466 rows=1000000 loops=1)
                                                         Buckets: 131072  Batches: 16  Memory Usage: 3226 kB
                                                         ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.002..137.925 rows=1000000 loops=1)
                                             ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=314.051..314.051 rows=1000000 loops=1)
                                                   Buckets: 131072  Batches: 16  Memory Usage: 3226 kB
                                                   ->  Seq Scan on t_join d  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.006..137.796 rows=1000000 loops=1)
 Planning time: 4.833 ms
 Execution time: 34238.964 ms
(29 rows)
As anticipated, the CPU has a hard time putting all of this together. It is a nasty join, and in the end it takes 34 seconds.
This is the kind of environment where the GPU can really shine. After all, expensive joins are what this is all about:
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b, t_test AS c, t_join AS d, t_test AS e, t_join AS f WHERE a.id = b.id AND b.id = c.id AND c.id = d.id AND d.id = e.id AND e.id = f.id;
NOTICE:  GpuJoin(0x204508860) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) Inners: (0, 1000000) Results: [312465, 312465]
NOTICE:  GpuJoin(0x204506060) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) Inners: (0, 1000000) Results: [312465, 312465]
NOTICE:  GpuJoin(0x204505860) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) Inners: (0, 1000000) Results: [312465, 312465]
                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=471554.15..471554.16 rows=1 width=0) (actual time=7270.748..7270.748 rows=1 loops=1)
   ->  Custom Scan (GpuJoin)  (cost=334846.07..471550.15 rows=1600 width=0) (actual time=6212.393..7184.973 rows=1000000 loops=1)
         Bulkload: On (density: 100.00%)
         Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                  Nrows (in:25000000 out:1000000, 4.00% planned 0.01%), KDS-Hash (size: 64.85MB planned 134.86KB, nbatches: 1 planned 1)
         Inner Buffer: (124.00MB), DMA nums: 9, size: 1116.01MB
         ->  Custom Scan (BulkScan) on t_test e  (cost=0.00..385136.36 rows=25000036 width=4) (actual time=13.106..1103.447 rows=25000000 loops=1)
         ->  Custom Scan (GpuJoin)  (cost=192385.58..329089.66 rows=1600 width=20) (actual time=4240.682..5277.841 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 0.01%), KDS-Hash (size: 57.22MB planned 119.23KB, nbatches: 1 planned 1)
               Inner Buffer: (62.00MB), DMA nums: 15, size: 930.01MB
               ->  Custom Scan (BulkScan) on t_test c  (cost=0.00..385136.36 rows=25000036 width=4) (actual time=13.160..1093.463 rows=25000000 loops=1)
               ->  Custom Scan (GpuJoin)  (cost=182921.05..186629.17 rows=1600 width=16) (actual time=3320.939..3444.434 rows=1000000 loops=1)
                     Bulkload: On (density: 100.00%)
                     Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                              Nrows (in:1000000 out:1000000, 100.00% planned 0.16%), KDS-Hash (size: 57.22MB planned 2978.59KB, nbatches: 1 planned 1)
                     Inner Buffer: (62.00MB), DMA nums: 1, size: 62.00MB
                     ->  Custom Scan (BulkScan) on t_join f  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=13.220..45.445 rows=1000000 loops=1)
                     ->  Custom Scan (GpuJoin)  (cost=41543.18..176974.98 rows=40000 width=12) (actual time=1736.740..2818.735 rows=1000000 loops=1)
                           Bulkload: On (density: 100.00%)
                           Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                                    Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
                           Depth 2: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                                    Nrows (in:1000000 out:1000000, 100.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
                           Inner Buffer: (82.25MB)x(82.25MB), DMA nums: 10, size: 1645.10MB
                           ->  Custom Scan (BulkScan) on t_test a  (cost=0.00..385136.36 rows=25000036 width=4) (actual time=16.505..1599.395 rows=25000000 loops=1)
                           ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.022..111.602 rows=1000000 loops=1)
                           ->  Seq Scan on t_join d  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.005..95.175 rows=1000000 loops=1)
 Planning time: 11.106 ms
 Execution time: 8051.245 ms
(31 rows)
This is a really nice number to see, especially given how many rows are involved. Just 8 seconds: this is where the real potential of a GPU shows.
You must, of course, keep in mind that I am executing arbitrary SQL statements here, many of which are not very realistic. I have not yet tested PG-Strom with a real workload consisting of the shorter, more frequent queries typically seen in PostgreSQL applications.
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.