If You’re Not Monitoring Your Resource Pools, You’re Doing It Wrong
Developing production-ready software in today’s world entails much more than just adding functionality. It’s only half the struggle to create a “functionally complete” software system. Systems must be designed to considerably higher standards to compete in today’s market; gone are the days of deploying software as soon as it passes your QA team’s functional validation.
You must be ready to deal with third-party dependency failures and malicious users, scale your system as you add customers, and meet your reliability service-level objectives (SLOs), indicators (SLIs), and agreements (SLAs), among other things.
Monitoring is, of course, an essential aspect of reliability. Without visibility into the health of your system, you’ll only know something is wrong when customers call or tweet to complain (which is bad). And the only way you’ll figure out what’s wrong is to stumble around aimlessly (which is very bad).
But how do you know what to monitor when reliability experts tell you to monitor the health of your systems? Throughput? Response time? Latency? These are the most obvious choices, and while they can often signal that you have a problem, they don’t tell you much about what’s causing it.
You need to take a look at your resource pools.
Any non-trivial software system will have pools of resources ready to handle requests as they come in. A pool of database connections is needed to communicate with a database. A pool of threads is needed to process tasks from a queue. The work queue is itself a pool, although one that fills rather than drains. (Keep in mind that a single “non-pooled” connection is effectively a pool of size one.)
Streaming systems, which are made up of any number of services, are underpinned by a collection of resource pools. Even if your service, such as a basic windowing data aggregator, doesn’t interact with databases or make any external requests, reading from and writing to your message broker requires several threads and buffers.
The same is true for HTTP services. In an ASP.NET application running on Microsoft Internet Information Services (IIS), for example, the request queue is a pool of requests waiting to be handled by a pool of request threads.
The sizes of resource pools are simple to measure, and this information is extremely useful. When something is wrong with your system, symptoms will inevitably appear in one or more of your resource pools.

Monitoring the agent state downsampler
The agent state downsampler is a basic Apache Kafka service that reduces the amount of data flowing to our downstream consumers from the language agents our customers have installed in their applications. It consumes a large stream of agent metadata but emits only one message per agent per hour. It uses Memcached to keep track of which agents have already had a message sent in the last hour.
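The downsampling rule described above can be sketched in a few lines of Java. This is only an illustration, not the service’s actual code: the class name `HourlyDownsampler`, the method `shouldForward`, and the in-memory map standing in for Memcached are all assumptions made for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: forwards at most one message per agent per hour.
// A ConcurrentHashMap stands in for Memcached (an assumption for illustration).
final class HourlyDownsampler {
    private static final long WINDOW_MILLIS = TimeUnit.HOURS.toMillis(1);

    // agentId -> timestamp (ms) of the last message forwarded for that agent
    private final ConcurrentMap<String, Long> lastForwarded = new ConcurrentHashMap<>();

    /** Returns true if a message for this agent should be sent downstream now. */
    boolean shouldForward(final String agentId, final long nowMillis) {
        final boolean[] forward = {false};
        // compute() runs atomically per key, so two threads cannot both forward.
        lastForwarded.compute(agentId, (id, last) -> {
            if (last == null || nowMillis - last >= WINDOW_MILLIS) {
                forward[0] = true;
                return nowMillis; // record this send
            }
            return last; // still inside the one-hour window: drop the message
        });
        return forward[0];
    }
}
```

With a real Memcached client, a similar effect is commonly achieved with an `add` operation and a one-hour expiry, since `add` only succeeds when the key is not already present.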
So, how are we going to monitor this? Let’s start with the obvious metrics: throughput, processing time, and lag.
This looks like helpful information. But what will happen to these graphs if the downsampler begins to lag? Throughput will drop, while processing time and lag will increase. That’s good to know, but what happens next? On its own, this data can’t tell us anything other than “something’s wrong”, which is fine for alerting purposes but doesn’t help us figure out what’s causing the issue. We need to dig a little deeper.
Now that we understand the service better, we can think more critically about it. “How full are our queues and buffers, and how busy are our thread pools?” should be the first question we ask whenever something goes wrong.
Here is a short list of scenarios that we can diagnose right away by monitoring our resource pools:
| Symptoms | Problem | Next Steps |
|---|---|---|
| Throughput is down and the Memcached thread pool is fully utilised | Memcached is down/slow | Investigate the health of the Memcached cluster |
| Throughput is down and the Kafka producer buffer is full | The destination Kafka brokers are down/slow | Investigate the health of the destination Kafka brokers |
| Throughput is down and the work queue is mostly empty | The source Kafka brokers are down/slow, and the consumer thread isn’t pulling messages fast enough | Investigate the health of the source Kafka cluster |
| Throughput is up and the Kafka producer buffer is full | An increase in traffic has caused us to hit a bottleneck in the producer | Address the bottleneck (tune the producer, possibly by increasing the buffer) or scale the service |
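The rows of the table can also be written down as a small set of rules. The sketch below is hypothetical: the class `PoolDiagnosis`, the `Throughput` enum, and the thresholds for “full” (95%) and “mostly empty” (10%) are assumptions chosen for illustration, and each utilisation is passed in as a fraction between 0.0 and 1.0.

```java
// Hypothetical rule table mirroring the symptom/problem rows above.
final class PoolDiagnosis {
    enum Throughput { DOWN, NORMAL, UP }

    // Assumed thresholds; tune these for your own service.
    static final double FULL = 0.95;
    static final double MOSTLY_EMPTY = 0.10;

    static String diagnose(final Throughput throughput,
                           final double memcachedThreadPool,
                           final double kafkaProducerBuffer,
                           final double workQueue) {
        if (throughput == Throughput.DOWN && memcachedThreadPool >= FULL) {
            return "Memcached is down/slow";
        }
        if (throughput == Throughput.DOWN && kafkaProducerBuffer >= FULL) {
            return "Destination Kafka brokers are down/slow";
        }
        if (throughput == Throughput.DOWN && workQueue <= MOSTLY_EMPTY) {
            return "Source Kafka brokers are down/slow";
        }
        if (throughput == Throughput.UP && kafkaProducerBuffer >= FULL) {
            return "Producer bottleneck: tune the producer or scale the service";
        }
        return "No known pattern: investigate further";
    }
}
```

The rules are checked in the same order as the rows, so the first matching symptom wins. A real service would likely surface these conditions on a dashboard rather than in code, but writing them down makes the reasoning explicit.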
A tried-and-true method for keeping track of resource pools
The first step is to gather information about your resource pools. As previously stated, this is quite simple: create a background thread in your service whose sole purpose is to periodically measure the size and fullness of each of your resource pools. ThreadPoolExecutor.getPoolSize() and ThreadPoolExecutor.getActiveCount(), for example, return the current size of a thread pool and the approximate number of threads actively executing tasks.
Using Guava’s AbstractScheduledService and Apache’s HttpClient libraries, here’s a basic example:
```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableMap;
import com.google.common.io.CharStreams;
import com.google.common.util.concurrent.AbstractScheduledService;
import com.newrelic.api.agent.NewRelic;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadPoolReporter extends AbstractScheduledService {
    private final ObjectMapper jsonObjectMapper = new ObjectMapper();
    private final ThreadPoolExecutor threadPoolToWatch;
    private final HttpClient httpClient;

    public ThreadPoolReporter(final ThreadPoolExecutor threadPoolToWatch, final HttpClient httpClient) {
        this.threadPoolToWatch = threadPoolToWatch;
        this.httpClient = httpClient;
    }

    // Runs on every scheduled tick: samples the pool and posts one event.
    @Override
    protected void runOneIteration() {
        try {
            final int poolSize = threadPoolToWatch.getPoolSize();
            final int activeTaskCount = threadPoolToWatch.getActiveCount();
            // The attribute is named activeTaskCount so it matches the dashboard query.
            final ImmutableMap<String, Object> attributes = ImmutableMap.of(
                    "eventType", "ServiceStatus",
                    "timestamp", System.currentTimeMillis(),
                    "poolSize", poolSize,
                    "activeTaskCount", activeTaskCount);
            final String json = jsonObjectMapper.writeValueAsString(ImmutableList.of(attributes));
            final HttpResponse response = sendRequest(json);
            handleResponse(response);
        } catch (final Exception e) {
            // Report the failure but keep the reporter running.
            NewRelic.noticeError(e);
        }
    }

    private HttpResponse sendRequest(final String json) throws IOException {
        final HttpPost request = new HttpPost("http://example-api.net");
        request.setHeader("X-Insert-Key", "secret key value");
        request.setHeader("content-type", "application/json");
        request.setHeader("accept-encoding", "compress, gzip");
        request.setEntity(new StringEntity(json));
        return httpClient.execute(request);
    }

    private void handleResponse(final HttpResponse response) throws Exception {
        // Always open and close the response stream so the connection is released.
        try (final InputStream responseStream = response.getEntity().getContent()) {
            final int statusCode = response.getStatusLine().getStatusCode();
            if (statusCode != 200) {
                final String responseBody = extractResponseBody(responseStream);
                throw new Exception(String.format(
                        "Received HTTP %s response from Insights API. Response body: %s",
                        statusCode, responseBody));
            }
        }
    }

    private String extractResponseBody(final InputStream responseStream) throws Exception {
        try (final InputStreamReader responseReader =
                     new InputStreamReader(responseStream, Charset.defaultCharset())) {
            return CharStreams.toString(responseReader);
        }
    }

    // Sample once per second, starting one second after the service starts.
    @Override
    protected Scheduler scheduler() {
        return Scheduler.newFixedDelaySchedule(1, 1, TimeUnit.SECONDS);
    }
}
```
So that you have good data granularity, you should check the thread pool’s stats fairly frequently (I recommend once per second).
The data can be plotted as a line graph. However, I like to display my resource pool utilisation as two-dimensional histograms (or heat maps) because they make problems easier to spot:
```
SELECT histogram(activeTaskCount, width: 300, buckets: 30) FROM ServiceStatus SINCE 1 minute ago FACET host LIMIT 100
```
Our thread pools are mainly idle during “normal” operations. As you can see, we want to have a lot of headroom for traffic bursts. If the dark squares begin to migrate to the right, it’s a clear indication that something is wrong.
Add monitoring code to each of your resource pools in the same way. If you want to limit the number of events you store, consider combining the data from all of your pools into a single Insights event.
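As a rough sketch of the single-event idea, the helper below gathers stats from a thread pool and a work queue into one attribute map. The class name `CombinedPoolStats` and the attribute names (`workerPoolSize`, `workQueueSize`, and so on) are hypothetical; only the `ThreadPoolExecutor` and `BlockingQueue` calls come from the JDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;

// Hypothetical helper: one event carrying the stats of several resource pools.
final class CombinedPoolStats {
    static Map<String, Object> snapshot(final ThreadPoolExecutor workers,
                                        final BlockingQueue<?> workQueue,
                                        final long nowMillis) {
        final Map<String, Object> event = new LinkedHashMap<>();
        event.put("eventType", "ServiceStatus");
        event.put("timestamp", nowMillis);
        // Thread pool: current size and number of threads busy with tasks.
        event.put("workerPoolSize", workers.getPoolSize());
        event.put("workerActiveCount", workers.getActiveCount());
        // Work queue: how full it is and how much room remains.
        event.put("workQueueSize", workQueue.size());
        event.put("workQueueRemainingCapacity", workQueue.remainingCapacity());
        return event;
    }
}
```

The resulting map can be serialised and posted once per tick, so the event count stays constant no matter how many pools the service has.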
Finally, to tie everything together, create an Insights dashboard. Our whole agent state downsampler dashboard is shown below; all it takes is a quick glance to see whether anything is wrong with our service or its resource pools.
It’s all about taking charge!
Resource pool monitoring has helped every system I’ve worked on, but high-throughput streaming services have benefited the most. We’ve identified a slew of unpleasant problems in record time.
For example, we recently experienced a severe issue in one of our highest-throughput streaming services, which caused all processing to stop. It turned out to be an issue with the Kafka producer’s buffer space, which would have been extremely difficult to diagnose without monitoring. Instead, we could open the service’s dashboards, look at the Kafka producer charts, and see that the buffer was full. Within minutes, we had reconfigured the producer with a larger buffer and were back in business.
Monitoring also allows you to prevent problems before they occur. Look for historical trends in your dashboards, not just during incidents but regularly (once a week, for example). If you notice your thread pool utilisation slowly increasing, scale the service before it becomes sluggish and an incident occurs.
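One way to make the weekly trend check concrete is to fit a straight line to recent utilisation samples and estimate how long the remaining headroom will last. The sketch below is an assumption-laden illustration, not part of the original tooling: the class `UtilisationTrend`, its method names, and the idea of using an ordinary least-squares slope over equally spaced samples are all choices made for the example.

```java
// Hypothetical trend check over equally spaced utilisation samples (0.0 to 1.0).
final class UtilisationTrend {
    /** Least-squares slope: change in utilisation per sampling interval. */
    static double slope(final double[] samples) {
        final int n = samples.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        final double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        for (final double y : samples) {
            meanY += y;
        }
        meanY /= n;
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < n; i++) {
            numerator += (i - meanX) * (samples[i] - meanY);
            denominator += (i - meanX) * (i - meanX);
        }
        return numerator / denominator;
    }

    /** Intervals until the limit is reached at the current trend; infinite if not rising. */
    static double intervalsUntil(final double[] samples, final double limit) {
        final double latest = samples[samples.length - 1];
        if (latest >= limit) {
            return 0.0;
        }
        final double perInterval = slope(samples);
        if (perInterval <= 0.0) {
            return Double.POSITIVE_INFINITY;
        }
        return (limit - latest) / perInterval;
    }
}
```

For example, weekly averages of 0.30, 0.35, 0.40, and 0.45 give a slope of about 0.05 per week, so a limit of 0.80 would be reached in roughly seven weeks if the trend continues. A linear fit is a simplification; real traffic growth is often uneven.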