Preamble
You have probably noticed that artificial intelligence and machine learning are hot topics at the moment, and rightly so, because they will undoubtedly shape the way we live in the future. However, looking at most of the machine learning code available online, it strikes me how much “coding” people actually spend on preparing the data. Often the input data is “just a matrix,” built up over a long period of time by many people (and occasionally from memory).
The obvious question for me is: why not prepare the data in SQL? It is much easier, more flexible, and requires less effort.
Creating some sample data
To show what we can do with PostgreSQL, I’ll use a simple table containing just 20 random values. Let’s define the table first:
test=# CREATE TABLE inputdata (
        id    int,
        data  numeric DEFAULT random() * 1000
);
CREATE TABLE

The table is populated:

test=# INSERT INTO inputdata
        SELECT * FROM generate_series(1, 20);
INSERT 0 20
So far this is just a table containing 20 random values, but we already have some data to work with:
test=# SELECT * FROM inputdata;
 id |       data
----+------------------
  1 |  542.76927607134
  2 | 813.954454381019
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
  6 | 326.086463406682
  7 | 24.8975520953536
  8 | 266.512574627995
  9 | 86.0621216706932
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 13 | 975.230060052127
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 17 | 208.689162041992
 18 | 529.119417071342
 19 | 260.399237740785
 20 | 563.285110052675
(20 rows)
Thoughts on sampling, training, and verification
If you are training an AI model (maybe a Support Vector Machine / SVM, a neural network, or whatever else), you will always start by splitting the data into different parts:
• Data needed to train your AI model
• Data for testing your AI model
The training data is used to train your model; the test data is then used to check how well the model performs. Splitting the data properly is crucial and, in a way, the secret to success.
Table sampling in PostgreSQL
As of version 9.5, sampling is already part of the PostgreSQL core. Here is an example:
test=# SELECT * FROM inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (87);
 id |       data
----+------------------
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 19 | 260.399237740785
 20 | 563.285110052675
(11 rows)
The TABLESAMPLE clause is added after the table name, together with instructions on how the sampling should work. In this case I decided to use the Bernoulli table sampling method (TSM) and make the sample repeatable. In a machine learning context it makes sense to use the REPEATABLE clause, so that we can train our model again and again with the same input data. PostgreSQL will return the same sample no matter how often we run the query (as long as the underlying data stays the same, of course).
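If we want to reuse the very same repeatable sample later on, one option is to wrap it in a view. This is only a minimal sketch under my own naming (training_sample is not part of the example above):

-- roughly 50 % of inputdata; identical on every read
-- as long as the underlying data does not change
CREATE VIEW training_sample AS
    SELECT *
    FROM   inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (87);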
Machine learning: Lots and lots of data …
So far, so good. However, this strategy has a small downside: to work properly, a state-of-the-art model needs a LOT of input data; the number of rows is in the millions. Sooner or later we will run into the following problem (expressed in pseudocode):
SELECT * FROM inputdata WHERE id NOT IN (SELECT id FROM test_data_set);
This raises the following two issues:
• It requires a lot of storage space to keep the test data set.
• The large NOT IN statement is fairly expensive (see the sketch below).
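To make the problem concrete, the naive approach would look roughly like this (a sketch only; test_data_set is the hypothetical table from the pseudocode above):

-- materialize the test set separately (costs storage) ...
CREATE TABLE test_data_set AS
    SELECT * FROM inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (87);

-- ... and fetch the training set with an expensive anti-join
SELECT *
FROM   inputdata
WHERE  id NOT IN (SELECT id FROM test_data_set);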
So is there another way to get the job done? The strategy I came up with is fairly easy to implement and works well.
Because PostgreSQL has no “NOT IN TABLESAMPLE” clause and because we don’t want to duplicate our data, the idea is to use a view from which the sample can be extracted:
test=# CREATE VIEW preparation_step1 AS
        SELECT *, abs(hashtext(id::text) % 100)
        FROM   inputdata;
CREATE VIEW
The idea is to first hash the input using the hashtext function. hashtext returns evenly distributed integers, which is exactly what we need here. Taking the hash modulo 100 produces 100 slices of data, each holding roughly 1% of the total. Keep in mind that hashtext can return negative values; the abs function turns them into positive ones.
test=# \x
Expanded display is on.
test=# \df *hashtext*
List of functions
-[ RECORD 1 ]-------+-----------
Schema              | pg_catalog
Name                | hashtext
Result data type    | integer
Argument data types | text
Type                | normal
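To get a feeling for how evenly hashtext spreads the rows across the 100 slices, a simple aggregate can be run (just a sketch; with only 20 demo rows each slice will of course contain at most a handful of rows):

-- count how many rows end up in each of the 100 slices
SELECT abs(hashtext(id::text) % 100) AS slice,
       count(*)
FROM   inputdata
GROUP  BY 1
ORDER  BY 1;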
Looking at the view, the result is already quite useful:
test=# SELECT * FROM preparation_step1 LIMIT 10;
 id |       data       | abs
----+------------------+-----
  1 |  542.76927607134 |  47
  2 | 813.954454381019 |  26
  3 |  215.18046176061 |   4
  4 | 989.989245776087 |  92
  5 | 142.890753224492 |  58
  6 | 326.086463406682 |  12
  7 | 24.8975520953536 |  95
  8 | 266.512574627995 |  88
  9 | 86.0621216706932 |  36
 10 | 801.756543107331 |  81
(10 rows)
Now we can filter the data. For example, rows with “abs < 50” could serve as training data, while the rest is used to validate and test our models.
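A minimal sketch of how such a split could look, building on the view defined above (the 50 % cut-off and the view names training_data and test_data are my own choices, not part of the original example):

-- roughly half of the slices for training ...
CREATE VIEW training_data AS
    SELECT id, data
    FROM   preparation_step1
    WHERE  abs < 50;

-- ... and the remaining slices for validation / testing
CREATE VIEW test_data AS
    SELECT id, data
    FROM   preparation_step1
    WHERE  abs >= 50;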
This strategy is generally fine if your dataset is really large (xxx-million rows or so). If your dataset is too small, it may not be ideal; in that case it is better to use the sampling techniques offered by your favorite library (TensorFlow, sklearn, etc.). Why is that? If your data set is small (say, only a few hundred rows), you run the risk of producing a biased sample. Why does that matter? Suppose you have a dataset containing information about men and women. The sample should show the same distribution of men and women as the original data set. Libraries such as sklearn support “stratification” to solve exactly this problem. In my simple SQL example I assume that the model will be fed a large amount of data, so stratification is not an issue.
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.