Preamble
You have probably noticed that artificial intelligence and machine learning are hot topics right now, and rightly so: they will undoubtedly shape the way we live in the future. However, looking at most of the machine learning code available online, it strikes me how much “coding” people actually put into preparing the data. Often the input data is “just a matrix,” assembled over a long period of time by many people (and occasionally from memory).
The obvious question for me is: why not prepare the data in SQL? It is much easier, more flexible, and requires less effort.
Creating some sample data
I’ll create a straightforward table with just 20 random values to show what we can do with PostgreSQL. Let’s define a table first:
test=# CREATE TABLE inputdata (
    id    int,
    data  numeric DEFAULT random() * 1000
);
CREATE TABLE

The table is populated:

test=# INSERT INTO inputdata SELECT * FROM generate_series(1, 20);
INSERT 0 20
Although we currently only have a table containing 20 random values, we already have some data to work with:
test=# SELECT * FROM inputdata;
 id |       data
----+------------------
  1 |  542.76927607134
  2 | 813.954454381019
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
  6 | 326.086463406682
  7 | 24.8975520953536
  8 | 266.512574627995
  9 | 86.0621216706932
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 13 | 975.230060052127
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 17 | 208.689162041992
 18 | 529.119417071342
 19 | 260.399237740785
 20 | 563.285110052675
(20 rows)
Thoughts on sampling, training, and verification
If you are training an AI model (perhaps a support vector machine (SVM), a neural network, or whatever), you will always begin by splitting the data into different parts:
• Data needed to train your AI model
• Data for testing your AI model
The training data is used to train your model; the test data is then used to evaluate its performance. Splitting the data properly is crucial and, in a sense, the key to success.
Table sampling in PostgreSQL
Sampling has been part of the PostgreSQL core since version 9.5. Here is an example:
test=# SELECT * FROM inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (87);
 id |       data
----+------------------
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 19 | 260.399237740785
 20 | 563.285110052675
(11 rows)
The TABLESAMPLE clause is added after the table name, along with instructions telling it how to behave. In this case I decided to use the Bernoulli table sampling method (TSM) and make the sample repeatable. In the context of machine learning it makes sense to use the REPEATABLE clause, because we want to be able to train our model over and over again with the same input data. No matter how often we run the query, PostgreSQL will return the same sample (as long as the underlying data does not change, of course).
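To see the effect of REPEATABLE, it is enough to compare two identical samples. The following check is my own addition, not part of the original example; it should return zero rows because both sides yield exactly the same subset:

-- Both samples use the same seed, so the result sets are identical
-- and the EXCEPT removes every row.
SELECT * FROM inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (87)
EXCEPT
SELECT * FROM inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (87);

With a different seed (or without REPEATABLE at all), the two sides would most likely differ.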
Machine learning: Lots and lots of data …
So far everything looks fine. However, this strategy has a small drawback: a state-of-the-art model needs a LOT of input data to work properly. We are talking about millions of rows. Sooner or later we will run into the following issue (expressed in pseudo code):
SELECT * FROM inputdata WHERE id NOT IN (SELECT id FROM test_data_set);
This raises the following two issues:
• It requires a lot of storage space to keep the test data set.
• The large NOT IN statement is fairly expensive (see the sketch below).
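Just to make those drawbacks tangible, the naive approach from the pseudo code above would roughly look like this (my own sketch, not part of the original example):

-- Materialize the test set once so it stays stable between training runs.
CREATE TABLE test_data_set AS
    SELECT * FROM inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (87);

-- Anti-join to get the training set. With millions of rows this both
-- duplicates a lot of data and makes the NOT IN quite costly.
SELECT *
FROM inputdata
WHERE id NOT IN (SELECT id FROM test_data_set);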
So is there a better way to get this done? What I came up with is a fairly simple strategy that works well.
Because PostgreSQL lacks a “NOT IN TABLESAMPLE” clause and because we don’t want to duplicate our data, the idea is to use a view from which the sample can be extracted:
test=# CREATE VIEW preparation_step1 AS
    SELECT *, abs(hashtext(id::text) % 100)
    FROM inputdata;
CREATE VIEW
The hashtext function is used to turn the input data into a hash. hashtext returns evenly distributed numbers, which is exactly what we need here. Applying % 100 produces 100 slices of data, each containing roughly 1% of the total. Keep in mind that hashtext can return negative values; the abs function turns them into positive ones.
test=# \x
Expanded display is on.
test=# \df *hashtext*
List of functions
-[ RECORD 1 ]-------+-----------
Schema              | pg_catalog
Name                | hashtext
Result data type    | integer
Argument data types | text
Type                | normal
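To get a feeling for how evenly hashtext spreads the data across the 100 slices, a quick sanity check might look like this (my own addition, run against a generated series of one million numbers rather than our tiny 20-row table):

-- Count how many rows fall into each of the 100 buckets.
-- With 1 million ids, each bucket should end up close to 10,000 rows.
SELECT abs(hashtext(id::text) % 100) AS bucket, count(*)
FROM generate_series(1, 1000000) AS id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;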
Looking at the view, the result is already quite useful:
test=# SELECT * FROM preparation_step1 LIMIT 10;
 id |       data       | abs
----+------------------+-----
  1 |  542.76927607134 |  47
  2 | 813.954454381019 |  26
  3 |  215.18046176061 |   4
  4 | 989.989245776087 |  92
  5 | 142.890753224492 |  58
  6 | 326.086463406682 |  12
  7 | 24.8975520953536 |  95
  8 | 266.512574627995 |  88
  9 | 86.0621216706932 |  36
 10 | 801.756543107331 |  81
(10 rows)
Now we can filter the data. For example: abs < 50 could be the training data, and the rest could be used to validate and test our model, as shown in the sketch below.
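To make the split reusable, the two halves can be wrapped into views. This is only a sketch; the view names and the 50/50 threshold are my own choice:

-- Roughly half of the data goes into training ...
CREATE VIEW training_data AS
    SELECT id, data
    FROM preparation_step1
    WHERE abs < 50;

-- ... and the rest is kept for validation / testing.
CREATE VIEW validation_data AS
    SELECT id, data
    FROM preparation_step1
    WHERE abs >= 50;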
This strategy is generally fine if your dataset is really large (xxx-million rows or so). It may not be ideal if your dataset is small. In that case it is better to use the sampling techniques offered by your favorite library (TensorFlow, sklearn, etc.). Why is that? If your dataset is small (e.g., only a few hundred rows), you run the risk of producing a biased sample. Why does that matter? Suppose you have a dataset containing information about men and women. The sample should have the same distribution as the original data set, i.e., the same proportion of men and women. Libraries such as sklearn support “stratification” to solve exactly this problem. In my simple SQL example, I assume the model will receive a large amount of data, so stratification is not an issue.
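If you ever do need stratification directly in SQL, one possible approach is to sample within each group so that every group contributes the same fraction. The following sketch is purely hypothetical: the people table and its gender column do not exist in our example.

-- Keep ~20% of each gender group; ordering by a hash makes the choice
-- within each group pseudo-random but deterministic.
SELECT id, gender
FROM (
    SELECT *,
           percent_rank() OVER (PARTITION BY gender
                                ORDER BY hashtext(id::text)) AS pr
    FROM people
) AS t
WHERE pr < 0.2;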
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.