If You’re Not Monitoring Your Resource Pools, You’re Doing It Wrong
Developing production-ready software today entails much more than just adding functionality. Building a “functionally complete” system is only half the battle. Systems must be designed to considerably higher standards to compete in today’s market; gone are the days of deploying software as soon as it passes your QA team’s functional validation.
You must be ready to deal with third-party dependency failures and malicious users, scale your system as you add customers, and meet your reliability service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs), among other things.
Monitoring is, of course, an essential aspect of reliability. If you don’t have visibility into the health of your system, you’ll only know something is wrong when customers call (or tweet) to complain, which is bad. And the only way you’ll figure out what’s wrong is to stumble around aimlessly, which is worse.
But when reliability experts tell you to monitor the health of your systems, how do you know what you need to monitor? Throughput? Response time? Latency? These are the most obvious options, and while they can often signal that you have a problem, they don’t tell you much about what’s causing it.
You need to take a look at your pool of resources.
Any non-trivial software system will have pools of resources ready to handle requests as they come in. A collection of database connections is required to communicate with a database. A pool of threads is required to process tasks from a queue. The work queue is a pool, although one that fills rather than drains. (Keep in mind that a single “non-pooled” connection is practically the same as a single connection in a pool.)
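To make this concrete, here is a minimal, hypothetical Java sketch (the class name, pool size, and queue capacity are illustrative, not taken from any particular service) showing that a worker thread pool and its bounded work queue both expose their size and fullness directly:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSizingExample {
    public static void main(final String[] args) {
        // A bounded work queue: a pool that fills rather than drains.
        final BlockingQueue<Runnable> workQueue = new ArrayBlockingQueue<>(1_000);

        // A worker thread pool: a pool that drains the queue.
        final ThreadPoolExecutor workers =
                new ThreadPoolExecutor(8, 8, 0L, TimeUnit.MILLISECONDS, workQueue);

        // Both pools expose their size and fullness directly.
        System.out.println("queue depth:    " + workQueue.size());
        System.out.println("queue capacity: " + (workQueue.size() + workQueue.remainingCapacity()));
        System.out.println("pool size:      " + workers.getPoolSize());
        System.out.println("active threads: " + workers.getActiveCount());

        workers.shutdown();
    }
}
```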
Streaming systems, which are made up of any number of services, are built on top of a collection of resource pools. Even if your service, such as a simple windowing data aggregator, doesn’t interact with databases or make any external requests, reading from and writing to your message broker requires several threads and buffers.
The same is true for HTTP services. In an ASP.NET application running on Microsoft Internet Information Services (IIS), for example, the request queue is a pool of requests waiting to be handled by a collection of request threads.
The sizes of resource pools are simple to measure, and this information is remarkably useful: when something is wrong with your system, symptoms will inevitably appear in one or more of your resource pools.

Monitoring the agent state downsampler
The agent state downsampler is a basic Apache Kafka service that minimises the amount of data traveling to our downstream consumers from the language agents our clients have installed in their apps. It receives a large stream of agent metadata but only sends out one message per hour per agent. It keeps track of which agents have already received a message in the last hour using Memcached.
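The forwarding decision itself is simple to sketch. The following is a hypothetical illustration of the “one message per agent per hour” check; the DeduplicationCache interface and its markIfAbsent method are stand-ins for the real Memcached calls, which aren’t shown here:

```java
import java.time.Duration;

/** Hypothetical sketch of the downsampling decision, not the actual service code. */
public class AgentStateDownsampler {

    /** Minimal cache abstraction standing in for Memcached's atomic "add if absent". */
    public interface DeduplicationCache {
        /** Returns true only if agentId was not already marked within the TTL window. */
        boolean markIfAbsent(String agentId, Duration ttl);
    }

    private static final Duration ONE_HOUR = Duration.ofHours(1);

    private final DeduplicationCache cache;

    public AgentStateDownsampler(final DeduplicationCache cache) {
        this.cache = cache;
    }

    /** Returns true if this agent's metadata should be forwarded downstream. */
    public boolean shouldForward(final String agentId) {
        // Forward at most one message per agent per hour; everything else is dropped.
        return cache.markIfAbsent(agentId, ONE_HOUR);
    }
}
```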
So, how are we going to monitor this service? Let’s start with the obvious metrics: throughput, processing time, and lag.
This looks like helpful information. But what will happen to these graphs if the downsampler starts to lag? Throughput will drop, while processing time and lag will increase. That’s useful, but then what? On its own, this data can only tell us that “something is wrong,” which is fine for alerting but doesn’t help us figure out what’s causing the issue. We need to dig a little deeper.
Now that we understand the service better, we can think about it more critically. The first question to ask whenever something goes wrong is: “How full are our queues and buffers, and how busy are our thread pools?”
Here is a short list of failure cases we can immediately diagnose simply by monitoring our resource pools:
| Symptoms | Problem | Next Steps |
|---|---|---|
| Throughput is down and the Memcached thread pool is fully utilised | Memcached is down/slow | Investigate the health of the Memcached cluster |
| Throughput is down and the Kafka producer buffer is full | The destination Kafka brokers are down/slow | Investigate the health of the destination Kafka brokers |
| Throughput is down and the work queue is mostly empty | The source Kafka brokers are down/slow, and the consumer thread isn’t pulling messages fast enough | Investigate the health of the source Kafka cluster |
| Throughput is up and the Kafka producer buffer is full | An increase in traffic has caused us to hit a bottleneck in the producer | Address the bottleneck (tune the producer, possibly by increasing the buffer) or scale the service |
A tried-and-true method for keeping track of resource pools
The first step is to gather information about your resource pools. As previously stated, this is quite simple: create a background thread in your service whose sole purpose is to periodically record each resource pool’s size and fullness. For a thread pool, for example, ThreadPoolExecutor.getPoolSize() and ThreadPoolExecutor.getActiveCount() return the pool’s current size and the number of active threads.
Here’s a basic example using Guava’s AbstractScheduledService and Apache’s HttpClient libraries. Note that you should check the thread pool’s stats fairly frequently (I recommend once per second) so that you have good data granularity:
```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableMap;
import com.google.common.io.CharStreams;
import com.google.common.util.concurrent.AbstractScheduledService;
import com.newrelic.api.agent.NewRelic;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadPoolReporter extends AbstractScheduledService {

    private final ObjectMapper jsonObjectMapper = new ObjectMapper();
    private final ThreadPoolExecutor threadPoolToWatch;
    private final HttpClient httpClient;

    public ThreadPoolReporter(final ThreadPoolExecutor threadPoolToWatch, final HttpClient httpClient) {
        this.threadPoolToWatch = threadPoolToWatch;
        this.httpClient = httpClient;
    }

    @Override
    protected void runOneIteration() {
        try {
            // Sample the pool's current size and how many threads are busy.
            final int poolSize = threadPoolToWatch.getPoolSize();
            final int activeTaskCount = threadPoolToWatch.getActiveCount();

            final ImmutableMap<String, Object> attributes = ImmutableMap.of(
                    "eventType", "ServiceStatus",
                    "timestamp", System.currentTimeMillis(),
                    "poolSize", poolSize,
                    "activeTaskCount", activeTaskCount);

            final String json = jsonObjectMapper.writeValueAsString(ImmutableList.of(attributes));
            final HttpResponse response = sendRequest(json);
            handleResponse(response);
        } catch (final Exception e) {
            NewRelic.noticeError(e);
        }
    }

    private HttpResponse sendRequest(final String json) throws IOException {
        final HttpPost request = new HttpPost("http://example-api.net");
        request.setHeader("X-Insert-Key", "secret key value");
        request.setHeader("content-type", "application/json");
        request.setHeader("accept-encoding", "compress, gzip");
        request.setEntity(new StringEntity(json));
        return httpClient.execute(request);
    }

    private void handleResponse(final HttpResponse response) throws Exception {
        try (final InputStream responseStream = response.getEntity().getContent()) {
            final int statusCode = response.getStatusLine().getStatusCode();
            if (statusCode != 200) {
                final String responseBody = extractResponseBody(responseStream);
                throw new Exception(String.format(
                        "Received HTTP %s response from Insights API. Response body: %s",
                        statusCode, responseBody));
            }
        }
    }

    private String extractResponseBody(final InputStream responseStream) throws Exception {
        try (final InputStreamReader responseReader =
                     new InputStreamReader(responseStream, Charset.defaultCharset())) {
            return CharStreams.toString(responseReader);
        }
    }

    @Override
    protected Scheduler scheduler() {
        // Report once per second so the data has good granularity.
        return Scheduler.newFixedDelaySchedule(1, 1, TimeUnit.SECONDS);
    }
}
```
Once the events are flowing, you can chart the data. This Insights query, for example, builds a histogram of the activeTaskCount attribute, faceted by host:

```
SELECT histogram(activeTaskCount, width: 300, buckets: 30) FROM ServiceStatus SINCE 1 minute ago FACET host LIMIT 100
```
You can analyse this information as a line graph, but I prefer to display resource pool utilisation as two-dimensional histograms (or heat maps) because they make problems easier to spot.
During “normal” operation, our thread pools are mostly idle, which leaves plenty of headroom for traffic bursts. If the dark squares on the heat map begin to migrate to the right, it’s a clear sign that something is wrong.
Add monitoring code to each of your resource pools in the same way. If you want to limit the number of events you store, consider combining the data from every pool into a single Insights event.
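As a rough sketch of that consolidation (the pool names and attribute names below are hypothetical, not from the actual downsampler), a single snapshot method can gather every pool’s stats into one attribute map, which can then be serialised and posted exactly as in the ThreadPoolReporter example above:

```java
import com.google.common.collect.ImmutableMap;

import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;

/** Hypothetical helper that rolls several pools' stats into one event payload. */
public class ServiceStatusEvent {

    /** Builds a single "ServiceStatus" attribute map covering every pool we care about. */
    public static Map<String, Object> snapshot(final ThreadPoolExecutor workerPool,
                                               final ThreadPoolExecutor memcachedPool,
                                               final BlockingQueue<?> workQueue) {
        return ImmutableMap.<String, Object>builder()
                .put("eventType", "ServiceStatus")
                .put("timestamp", System.currentTimeMillis())
                .put("workerPoolSize", workerPool.getPoolSize())
                .put("workerActiveCount", workerPool.getActiveCount())
                .put("memcachedPoolSize", memcachedPool.getPoolSize())
                .put("memcachedActiveCount", memcachedPool.getActiveCount())
                .put("workQueueDepth", workQueue.size())
                .build();
    }
}
```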
Finally, to tie everything together, create an Insights dashboard. With the whole agent state downsampler on one dashboard, a quick glance is all it takes to see whether anything is wrong with the service or its resource pools.
It’s all about being proactive!
Resource pool monitoring has helped every system I’ve worked on, but high-throughput streaming services have benefited the most. We’ve identified a slew of nasty problems in record time.
For example, we recently hit a devastating issue in one of our highest-throughput streaming systems that caused all processing to stop. It turned out to be a problem with the Kafka producer’s buffer space, which would have been extremely difficult to diagnose without this monitoring. Instead, we could open the service’s dashboards, look at the Kafka producer charts, and see that the buffer was full. Within minutes, we had reconfigured the producer with a larger buffer and were back in business.
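The fix itself was a configuration change. As a hedged sketch of what “a larger buffer” means for a Kafka producer: buffer.memory is the standard producer setting that bounds buffered, unsent records, though the exact values our team used aren’t given here:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProducerTuningExample {
    public static KafkaProducer<String, String> buildProducer(final String bootstrapServers) {
        final Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // buffer.memory bounds how many bytes of unsent records the producer will hold;
        // when it is full, send() blocks (up to max.block.ms) and throughput stalls.
        // Doubling it from the 32 MB default is one way to relieve the bottleneck
        // (the 64 MB figure here is illustrative, not the value from the incident).
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64L * 1024 * 1024);

        return new KafkaProducer<>(props);
    }
}
```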
Monitoring also lets you prevent problems before they occur. Look for historical trends in your dashboards, not just during incidents but on a regular schedule (once a week, for example). If you notice your thread pool utilisation slowly increasing, scale the service before it becomes sluggish and a potential incident occurs.