Preamble
You have probably noticed that artificial intelligence and machine learning are hot topics right now, and rightly so: they will undoubtedly shape the way we live in the future. However, looking at most of the machine learning code available online, it strikes me how much “coding” people actually put into preparing the data. Often the input data is “just a matrix,” assembled over a long period of time by many people (and occasionally from memory).
The obvious question for me is: why not prepare the data in SQL? It is much easier, more flexible, and requires less effort.
Creating some sample data
I’ll create a straightforward table with just 20 random values to show what we can do with PostgreSQL. Let’s define a table first:
test=# CREATE TABLE inputdata (
    id    int,
    data  numeric DEFAULT random() * 1000
);
CREATE TABLE

The table is populated:

test=# INSERT INTO inputdata SELECT * FROM generate_series(1, 20);
INSERT 0 20
Although we currently only have a table containing 20 random values, we already have some data to work with:
test=# SELECT * FROM inputdata;
 id |       data
----+------------------
  1 |  542.76927607134
  2 | 813.954454381019
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
  6 | 326.086463406682
  7 | 24.8975520953536
  8 | 266.512574627995
  9 | 86.0621216706932
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 13 | 975.230060052127
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 17 | 208.689162041992
 18 | 529.119417071342
 19 | 260.399237740785
 20 | 563.285110052675
(20 rows)
Thoughts on sampling, training, and verification
If you are training an AI model (perhaps a support vector machine (SVM), a neural network, or whatever), you will always begin by splitting the data into different parts:
• Data needed to train your AI model
• Data for testing your AI model
The training data is used to train your model; the test data is then used to evaluate its performance. Splitting the data properly is crucial and, in a sense, the key to success.
Table sampling in PostgreSQL
Sampling has been part of the PostgreSQL core since version 9.5. Here is an example:
test=# SELECT * FROM inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (87);
 id |       data
----+------------------
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 19 | 260.399237740785
 20 | 563.285110052675
(11 rows)
The TABLESAMPLE clause is added after the table name, along with instructions telling it how to behave. In this case I decided to use the Bernoulli table sampling method (TSM) and make the sample repeatable. In the context of machine learning it makes sense to use the REPEATABLE clause, because we want to be able to train our model over and over again with the same input data. No matter how often we run the query, PostgreSQL will return the same sample (as long as the underlying data does not change, of course).
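To see the effect of REPEATABLE, it is enough to compare two identical samples. The following check is my own addition, not part of the original example; it should return zero rows because both sides yield exactly the same subset:

-- Both samples use the same seed, so the result sets are identical
-- and the EXCEPT removes every row.
SELECT * FROM inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (87)
EXCEPT
SELECT * FROM inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (87);

With a different seed (or without REPEATABLE at all), the two sides would most likely differ.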
Machine learning: Lots and lots of data …
So far everything looks fine. However, this strategy has a small drawback: a state-of-the-art model needs a LOT of input data to work properly. We are talking about millions of rows. Sooner or later we will run into the following issue (expressed in pseudo code):
SELECT * FROM inputdata WHERE id NOT IN (SELECT id FROM test_data_set);
This raises the following two issues:
• It requires a lot of storage space to keep the test data set.
• The large NOT IN statement is fairly expensive (see the sketch below).
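Just to make those drawbacks tangible, the naive approach from the pseudo code above would roughly look like this (my own sketch, not part of the original example):

-- Materialize the test set once so it stays stable between training runs.
CREATE TABLE test_data_set AS
    SELECT * FROM inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (87);

-- Anti-join to get the training set. With millions of rows this both
-- duplicates a lot of data and makes the NOT IN quite costly.
SELECT *
FROM inputdata
WHERE id NOT IN (SELECT id FROM test_data_set);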
So is there a better way to get this done? What I came up with is a fairly simple strategy that works well.
Because PostgreSQL lacks a “NOT IN TABLESAMPLE” clause and because we don’t want to duplicate our data, the idea is to use a view from which the sample can be extracted:
test=# CREATE VIEW preparation_step1 AS
    SELECT *, abs(hashtext(id::text) % 100)
    FROM inputdata;
CREATE VIEW
The hashtext function is used to turn the input data into a hash. hashtext returns evenly distributed numbers, which is exactly what we need here. Applying % 100 produces 100 slices of data, each containing roughly 1% of the total. Keep in mind that hashtext can return negative values; the abs function turns them into positive ones.
test=# \x
Expanded display is on.
test=# \df *hashtext*
List of functions
-[ RECORD 1 ]-------+-----------
Schema              | pg_catalog
Name                | hashtext
Result data type    | integer
Argument data types | text
Type                | normal
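To get a feeling for how evenly hashtext spreads the data across the 100 slices, a quick sanity check might look like this (my own addition, run against a generated series of one million numbers rather than our tiny 20-row table):

-- Count how many rows fall into each of the 100 buckets.
-- With 1 million ids, each bucket should end up close to 10,000 rows.
SELECT abs(hashtext(id::text) % 100) AS bucket, count(*)
FROM generate_series(1, 1000000) AS id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;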
Looking at the view, the result is already quite useful:
test=# SELECT * FROM preparation_step1 LIMIT 10;
 id |       data       | abs
----+------------------+-----
  1 |  542.76927607134 |  47
  2 | 813.954454381019 |  26
  3 |  215.18046176061 |   4
  4 | 989.989245776087 |  92
  5 | 142.890753224492 |  58
  6 | 326.086463406682 |  12
  7 | 24.8975520953536 |  95
  8 | 266.512574627995 |  88
  9 | 86.0621216706932 |  36
 10 | 801.756543107331 |  81
(10 rows)
Now we can filter the data. For example: abs < 50 could be the training data, and the rest could be used to validate and test our model, as shown in the sketch below.
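To make the split reusable, the two halves can be wrapped into views. This is only a sketch; the view names and the 50/50 threshold are my own choice:

-- Roughly half of the data goes into training ...
CREATE VIEW training_data AS
    SELECT id, data
    FROM preparation_step1
    WHERE abs < 50;

-- ... and the rest is kept for validation / testing.
CREATE VIEW validation_data AS
    SELECT id, data
    FROM preparation_step1
    WHERE abs >= 50;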
This strategy is generally fine if your dataset is really large (xxx-million rows or so). It may not be ideal if your dataset is small. In that case it is better to use the sampling techniques offered by your favorite library (TensorFlow, sklearn, etc.). Why is that? If your dataset is small (e.g., only a few hundred rows), you run the risk of producing a biased sample. Why does that matter? Suppose you have a dataset containing information about men and women. The sample should have the same distribution as the original data set, i.e., the same proportion of men and women. Libraries such as sklearn support “stratification” to solve exactly this problem. In my simple SQL example, I assume the model will receive a large amount of data, so stratification is not an issue.
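If you ever do need stratification directly in SQL, one possible approach is to sample within each group so that every group contributes the same fraction. The following sketch is purely hypothetical: the people table and its gender column do not exist in our example.

-- Keep ~20% of each gender group; ordering by a hash makes the choice
-- within each group pseudo-random but deterministic.
SELECT id, gender
FROM (
    SELECT *,
           percent_rank() OVER (PARTITION BY gender
                                ORDER BY hashtext(id::text)) AS pr
    FROM people
) AS t
WHERE pr < 0.2;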
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.