Every team wants to measure technical performance. The problem is that most teams measure the wrong things—or measure the right things in the wrong way. We've seen projects where a dashboard full of green numbers masked a system that was falling apart under load. We've also seen teams that abandoned good metrics because they misinterpreted the data. This guide is for engineers, tech leads, and engineering managers who need to decide which metrics actually matter for their specific context. By the end, you'll have a clear framework for choosing, tracking, and acting on the metrics that drive real improvements.
1. Choosing Your Metrics: The Decision Frame
Before you can measure anything, you need to decide what you're optimizing for. Technical performance isn't one thing—it's a bundle of trade-offs. A system that's blazing fast for one user might buckle under ten thousand concurrent requests. A database that's perfectly normalized might be too slow for your real-time dashboard. The first step is to identify the primary constraint that matters to your users and your business.
This decision isn't just technical—it's strategic. Do you need to improve page load time for a public-facing website? Then metrics like Time to First Byte (TTFB) and Largest Contentful Paint (LCP) are non-negotiable. Are you building an API that powers a mobile app? Then you might care more about p99 latency and error rates. The mistake we often see is teams trying to measure everything at once, which leads to dashboard overload and analysis paralysis. Pick no more than three to five primary metrics that align with your current goals.
Another common error is choosing metrics that are easy to measure but not meaningful. CPU utilization, for example, is trivial to collect but rarely tells you anything useful about user experience. A server can be at 90% CPU but still serving requests in milliseconds—or it could be at 30% CPU but struggling with a lock contention that causes timeouts. The metric itself isn't the problem; it's the lack of context. That's why we recommend starting with user-facing metrics and working backward to system metrics, not the other way around.
Finally, set a timeline. Review the metric values themselves at least monthly (weekly for your primary metrics), and revisit the decision of which metrics to prioritize every quarter. As your system evolves, so will your constraints. What mattered last quarter might be irrelevant today. By framing the decision as an ongoing process rather than a one-time choice, you avoid the trap of clinging to outdated indicators.
Who Needs to Be Involved in the Decision
This isn't a decision for the infrastructure team alone. Product managers, developers, and operations staff all have different perspectives on what "performance" means. A product manager might care about conversion rates, while a developer cares about code execution time. We recommend forming a small cross-functional group to define the top three metrics for the next quarter. This group should include at least one person who understands the user experience, one who understands the code, and one who understands the infrastructure.
2. The Metric Landscape: Options and Approaches
Once you know what you're optimizing for, it's time to look at the available metrics. Here are the most common categories, along with their strengths and weaknesses.
Latency Metrics: Average, Percentiles, and Tail Latency
Latency is the most intuitive performance metric: how long does it take to get a response? But the way you measure latency matters enormously. The average (mean) is often misleading because it's easily skewed by outliers. A handful of very slow requests can pull the average up, making the system look worse than it is for most users. Conversely, a large volume of fast requests can keep the average low and hide a problem that affects a small percentage of users. That's why percentiles are more useful. The p50 (median) tells you what the typical request experiences; p95 and p99 reveal the tail, where frustrated users are waiting too long. Many teams focus on p99 latency, but we've seen cases where optimizing p99 at the expense of p50 made the system feel slower for everyone. The right approach depends on your user base: if you have a few power users who generate most of your revenue, tail latency might matter more than median.
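To make the gap between the mean and the percentiles concrete, here is a minimal sketch using the nearest-rank method. The `percentile` helper and the sample latencies are illustrative assumptions, not data from a real system, and production monitoring tools may use a different interpolation rule.

```python
import math
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [2000] * 2

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
print(summary)
```

With this made-up data the mean lands at 59.6 ms, a value no request actually experienced, while p50 and p95 stay at 20 ms and p99 exposes the 2000 ms tail.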
Throughput: Requests per Second and Beyond
Throughput measures how much work your system can handle. The classic metric is requests per second (RPS), but raw throughput without context is dangerous. A system that serves 10,000 requests per second with a 50% error rate is not performing well—it's failing fast. We recommend measuring successful throughput (requests that return a 2xx or 3xx status) separately from total throughput. Also, be aware that throughput can be gamed: if you add caching, you might increase throughput while hiding backend latency issues. Always pair throughput with a latency metric to get the full picture.
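One way to keep successful throughput separate from total throughput is sketched below. It assumes you already have the list of HTTP status codes observed in a fixed time window; the `throughput` function and the sample window are hypothetical.

```python
def throughput(status_codes, window_seconds):
    """Return (total_rps, successful_rps) for the status codes seen in one window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(status_codes)
    # Count only 2xx and 3xx responses as successful work.
    successful = sum(1 for code in status_codes if 200 <= code < 400)
    return total / window_seconds, successful / window_seconds

# Hypothetical 10-second window in which half of the responses are server errors.
window = [200] * 300 + [500] * 300
total_rps, successful_rps = throughput(window, window_seconds=10)
print(f"total: {total_rps:.0f} rps, successful: {successful_rps:.0f} rps")
```

Reporting both numbers side by side makes a "failing fast" system visible: total throughput looks healthy while successful throughput is only half of it.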
Error Rates: The Silent Killer
Error rate is the percentage of requests that fail. It's a straightforward metric, but teams often set error rate thresholds too high or monitor them too infrequently. A 1% error rate on a high-traffic system means thousands of users are having a bad experience every hour. We recommend setting a maximum acceptable error rate and alerting when it's exceeded. The amount of failure that this target allows over a given period is often called an error budget. But be careful: not all errors are equal. A 503 Service Unavailable is different from a 404 Not Found. Categorize errors by severity and track them separately.
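A simple way to track error categories separately is to split them by status class. The sketch below assumes that 5xx responses count as server errors and 4xx responses as client errors; the category names and sample data are illustrative, and your own severity scheme may be finer-grained.

```python
from collections import Counter

def classify(status):
    """Map an HTTP status code to a coarse category."""
    if status >= 500:
        return "server_error"
    if status >= 400:
        return "client_error"
    return "ok"

def error_rates(status_codes):
    """Return the fraction of requests in each error category."""
    total = len(status_codes)
    if total == 0:
        return {"server_error": 0.0, "client_error": 0.0}
    counts = Counter(classify(s) for s in status_codes)
    return {kind: counts[kind] / total for kind in ("server_error", "client_error")}

# Hypothetical sample: 1000 requests with twenty 404s and ten 503s.
sample = [200] * 970 + [404] * 20 + [503] * 10
print(error_rates(sample))
```

Here the combined error rate is 3%, but only 1% is the server's fault, which is the part an on-call engineer can usually act on.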
Resource Utilization: CPU, Memory, Disk, Network
These are the classic system-level metrics. They're useful for capacity planning but not for user experience. High CPU utilization might indicate a need to scale up, but it could also mean a runaway process. The key is to correlate resource metrics with user-facing metrics. If CPU goes up but latency stays flat, you probably have headroom. If latency spikes when CPU hits 80%, you have a bottleneck. We've found that tracking resource utilization trends over time is more valuable than setting static thresholds.
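One rough way to check whether latency tracks CPU is to group paired samples into CPU bands and compare the median latency of each band. The function name, band width, and sample values below are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import median

def latency_by_cpu_band(samples, band_width=20):
    """Group (cpu_percent, latency_ms) pairs into CPU bands; return median latency per band."""
    bands = defaultdict(list)
    for cpu, latency in samples:
        lower = int(cpu // band_width) * band_width  # e.g. 82% falls in the 80 band
        bands[lower].append(latency)
    return {lower: median(values) for lower, values in sorted(bands.items())}

# Hypothetical paired samples taken at the same timestamps.
samples = [
    (35, 100), (38, 104), (55, 102), (58, 98),
    (72, 110), (78, 106), (82, 240), (88, 400), (93, 900),
]
print(latency_by_cpu_band(samples))
```

In this made-up data, median latency stays near 100 ms up to the 60 to 80% band and jumps to 400 ms above 80%, which is the pattern that suggests a CPU bottleneck rather than spare headroom.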
Saturation: Queue Length, Connection Pool Usage
Saturation metrics measure how close your system is to its limits. Queue length (number of requests waiting to be processed) is a leading indicator of trouble: when queues grow, latency increases. Connection pool usage in databases is another example. These metrics are often overlooked because they require instrumentation, but they're some of the most predictive. We recommend monitoring queue length for any component that has a backlog, whether it's a web server, a database, or a message queue.
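As a sketch of what saturation numbers can look like, the helpers below compute connection pool usage and a rough wait estimate for a backlog. Both functions are hypothetical, and the wait estimate assumes the queue drains at a steady rate, which real systems only approximate.

```python
def pool_saturation(in_use, pool_size):
    """Fraction of a connection pool currently in use (1.0 means fully saturated)."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return in_use / pool_size

def estimated_wait_seconds(queue_length, completions_per_second):
    """Rough time a newly queued request waits, assuming a steady drain rate."""
    if completions_per_second <= 0:
        return float("inf")  # nothing is draining, so the wait is unbounded
    return queue_length / completions_per_second

# Hypothetical readings: 18 of 20 connections busy, 120 queued requests, 40 completed per second.
print(pool_saturation(18, 20))
print(estimated_wait_seconds(120, 40))
```

A pool at 90% usage with a three-second backlog is an early warning even if the latency graphs have not moved yet.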
3. How to Compare Metrics: Criteria for Choosing What to Track
Not all metrics are created equal. When deciding which metrics to put on your primary dashboard, use these criteria:
Actionability
A metric is only useful if it tells you what to do next. Latency above your threshold? You might need to optimize code or add resources. Error rate climbing? Check recent deployments. CPU at 99%? That might be a problem, but the number alone doesn't tell you the next step. If a metric doesn't suggest a clear action, it's probably not worth tracking as a primary indicator. We recommend asking: "If this metric goes red, do I know what to do?" If the answer is no, find a different metric.
Correlation with User Experience
The best metrics are those that directly reflect what users feel. Browser-side metrics such as Time to Interactive (TTI), First Input Delay (FID), or Interaction to Next Paint (INP), which replaced FID as a Core Web Vital in 2024, are better than server-side response time because they capture the full user experience, including network latency and browser rendering. If you can't measure user-facing metrics directly, choose system metrics that have a strong correlation. For example, database query time is a good proxy for page load time if your app is database-bound.
Sensitivity to Change
A good metric should change when you make a change. If you optimize a slow query, your query time metric should drop. If you add a cache, your database load should decrease. Metrics that are too coarse (like daily average response time) can mask the impact of improvements. We prefer metrics that can be measured in short windows (minutes, not hours) so you can see the effect of deployments and experiments quickly.
Resistance to Gaming
Any metric that becomes a target can be gamed. If you reward teams for low p99 latency, they might drop slow requests instead of handling them properly. If you reward high throughput, they might add unnecessary caching that hides real performance issues. Choose metrics that are hard to manipulate without actually improving the system. Error rate, for example, is hard to game because you can't hide errors without fixing them—unless you stop reporting them, which is a different kind of problem.
Cost of Measurement
Some metrics are expensive to collect and store. Full request tracing can generate terabytes of data per day. High-cardinality metrics (like per-user latency) can overwhelm your monitoring system. Balance the value of the metric against the cost of collecting it. Often, sampling is a good compromise: track every request for error rates, but sample latency data at 1% to reduce overhead.
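The sampling compromise can be sketched as follows: count every request and every error, but keep latency values only for a deterministic subset of request IDs. The `RequestStats` class, the hash-based sampling rule, and the request ID format are assumptions made for this example.

```python
import hashlib

def keep_latency_sample(request_id, rate=0.01):
    """Deterministically keep roughly `rate` of requests, keyed on the request ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return fraction < rate

class RequestStats:
    """Counts every request and error, but stores latency only for sampled requests."""

    def __init__(self, latency_sample_rate=0.01):
        self.latency_sample_rate = latency_sample_rate
        self.total = 0
        self.errors = 0
        self.latency_samples_ms = []

    def record(self, request_id, latency_ms, is_error):
        self.total += 1
        if is_error:
            self.errors += 1
        if keep_latency_sample(request_id, self.latency_sample_rate):
            self.latency_samples_ms.append(latency_ms)

    @property
    def error_rate(self):
        return self.errors / self.total if self.total else 0.0
```

Hashing the request ID, rather than drawing a random number, means every service that sees the same ID makes the same keep-or-drop decision, so sampled latency data can still be joined across components.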
4. Trade-Offs: Choosing What to Optimize
No metric is perfect. Every choice involves trade-offs. Here's a structured comparison of common metric pairs.
| Metric Pair | Trade-Off | When to Prioritize the First | When to Prioritize the Second |
|---|---|---|---|
| Average vs. p99 Latency | Average hides outliers; p99 focuses on the tail but can be noisy | When most users are homogeneous and outliers are rare | When a small percentage of users have high value or when outliers cause churn |
| Throughput vs. Error Rate | High throughput can mask high error rates | When you're capacity planning and need to know maximum load | When reliability is more important than raw speed |
| CPU Utilization vs. Queue Length | CPU is easy to measure but not predictive; queue length is predictive but harder to instrument | When you have limited instrumentation and need a rough proxy | When you need early warning of saturation before latency spikes |
| Server-Side Latency vs. Client-Side Latency | Server-side is easier to measure; client-side reflects real user experience but is harder to collect | When you control the full stack and want to isolate backend issues | When user experience is the primary goal and network conditions vary |
In practice, you'll likely need a mix. For example, we often recommend tracking p99 latency (user-facing) alongside queue length (system-level) to get both the symptom and the cause. The trade-off is that you now have two metrics to maintain, but the insight is worth it.
5. Implementation Path: From Metrics to Action
Choosing the right metrics is only half the battle. You also need a process for turning data into decisions. Here's a step-by-step approach that we've seen work across many teams.
Step 1: Instrument and Collect
Start by instrumenting your code to emit the metrics you've chosen. Use a standard format like Prometheus metrics or StatsD so that you can aggregate them in a single dashboard. Don't over-instrument at first; focus on your three to five primary metrics. You can always add more later. Make sure your collection pipeline is reliable: intermittent gaps in the data can be worse than having no data at all, because they can trigger false alarms or hide real problems.
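To show what emitting a metric looks like at the wire level, here is a minimal sketch of the plain-text StatsD line format (`name:value|type`) sent over UDP. The metric names are hypothetical, the host and port assume a StatsD agent listening locally on its conventional port 8125, and in practice you would more likely use an existing client library than hand-rolled helpers like these.

```python
import socket

def statsd_timer(name, millis):
    """Format a timing measurement in milliseconds."""
    return f"{name}:{millis}|ms"

def statsd_counter(name, count=1):
    """Format a counter increment."""
    return f"{name}:{count}|c"

def statsd_gauge(name, value):
    """Format a point-in-time gauge reading."""
    return f"{name}:{value}|g"

def send(lines, host="127.0.0.1", port=8125):
    """Send newline-separated metric lines in one UDP datagram (fire and forget)."""
    payload = "\n".join(lines).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

# Hypothetical metrics for one handled request.
lines = [
    statsd_timer("checkout.latency", 187),
    statsd_counter("checkout.errors"),
    statsd_gauge("checkout.queue_length", 12),
]
print(lines)
```

Because UDP is fire and forget, a dropped datagram fails silently, which is one reason to monitor the collection pipeline itself.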
Step 2: Set Baselines and Thresholds
Before you can detect anomalies, you need to know what normal looks like. Collect data for at least two weeks to establish baselines for each metric. Then set warning thresholds (e.g., p99 latency above 200ms) and critical thresholds (e.g., error rate above 1%). Use dynamic thresholds that adjust based on time of day and day of week if your traffic patterns vary.
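A dynamic threshold can be as simple as a separate baseline per hour of day. The sketch below sets each hour's limit at the historical mean plus `k` standard deviations; the function names, the choice of `k = 3`, and the sample history are assumptions, and real traffic may call for per-weekday baselines or a more robust statistic.

```python
from collections import defaultdict
from statistics import mean, pstdev

def hourly_thresholds(history, k=3.0):
    """history: iterable of (hour_of_day, value). Returns {hour: mean + k * stddev}."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: mean(values) + k * pstdev(values) for hour, values in by_hour.items()}

def is_anomalous(hour, value, thresholds):
    """True if the value exceeds the baseline threshold for that hour."""
    limit = thresholds.get(hour)
    return limit is not None and value > limit

# Hypothetical p99 latency history (ms): quiet at 03:00, busy at 14:00.
history = [(3, 80), (3, 90), (3, 100), (14, 180), (14, 200), (14, 220)]
thresholds = hourly_thresholds(history)
print(is_anomalous(3, 150, thresholds))   # 150 ms is unusual at 03:00
print(is_anomalous(14, 150, thresholds))  # but normal at 14:00
```

The same 150 ms reading is flagged overnight and ignored in the afternoon, which is exactly what a single static threshold cannot do.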
Step 3: Build Dashboards That Tell a Story
A good dashboard doesn't just show numbers—it shows relationships. Put your primary user-facing metric at the top, then system-level metrics below. Use time-series graphs so you can see correlations. For example, overlay latency and queue length on the same chart to see if they move together. Avoid dashboards with dozens of metrics; limit to one screen so you can grasp the state of the system at a glance.
Step 4: Create a Response Playbook
For each primary metric, document what to do when it exceeds a threshold. Who gets alerted? What's the first thing to check? What's the escalation path? We've seen teams waste hours during incidents because they had to figure out the response from scratch. A playbook turns a metric from a number into a decision trigger.
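A playbook entry can live next to the alert definition as structured data, so that the threshold and the response are reviewed together. The sketch below is one possible shape; every field name, team name, and threshold value in it is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlaybookEntry:
    metric: str
    warning: float       # threshold for a business-hours notification
    critical: float      # threshold for an immediate page
    notify: str          # who gets alerted first
    first_check: str     # the first thing to look at
    escalate_to: str     # where the incident goes if unresolved

    def severity(self, value):
        """Classify a metric reading against this entry's thresholds."""
        if value >= self.critical:
            return "critical"
        if value >= self.warning:
            return "warning"
        return "ok"

# Hypothetical entry for p99 latency in milliseconds.
p99_entry = PlaybookEntry(
    metric="p99_latency_ms",
    warning=200,
    critical=500,
    notify="on-call backend engineer",
    first_check="deployments in the last 30 minutes",
    escalate_to="engineering manager",
)
print(p99_entry.severity(320))
```

Keeping the entry immutable (`frozen=True`) nudges changes through review instead of ad hoc edits during an incident.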
Step 5: Review and Iterate
Every month, review your metrics with the team. Are they still aligned with your goals? Are there new metrics that would be more useful? Are any metrics causing unintended behavior? Adjust as needed. The goal is continuous improvement, not a static dashboard.
6. Risks of Choosing the Wrong Metrics
Bad metrics can do more harm than good. Here are the most common risks and how to avoid them.
Optimizing for the Wrong Thing
If you measure only latency, you might ignore error rates and sacrifice reliability for speed. If you measure only throughput, you might overload your system until it collapses. The classic example is a team that optimized for low server response time by adding aggressive caching, but the cache invalidation logic was buggy, causing users to see stale data. The metric looked great, but the user experience was terrible. Always balance metrics that could conflict.
Metric Manipulation
When a metric becomes a target, people will find ways to hit it without actually improving the system. We've heard of teams that reduced p99 latency by dropping slow requests (returning a 503) instead of fixing them. The metric improved, but the user experience got worse. To prevent this, use multiple metrics that cross-check each other. If latency drops but error rate spikes, something is wrong.
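A cross-check of this kind can be automated. The sketch below flags a latency improvement that coincides with a jump in error rate; the function name, dictionary keys, and the 20% and 50% cutoffs are illustrative assumptions to tune for your own system.

```python
def looks_gamed(before, after, latency_drop=0.2, error_rise=0.5):
    """Flag a p99 improvement that coincides with a jump in error rate.

    `before` and `after` are dicts with "p99_ms" and "error_rate" keys.
    """
    latency_improved = after["p99_ms"] <= before["p99_ms"] * (1 - latency_drop)
    errors_worse = after["error_rate"] > before["error_rate"] * (1 + error_rise)
    return latency_improved and errors_worse

# Hypothetical readings before and after a change that sheds slow requests.
before = {"p99_ms": 800, "error_rate": 0.002}
after = {"p99_ms": 300, "error_rate": 0.03}
print(looks_gamed(before, after))
```

A flag here is a prompt to investigate, not proof of manipulation: a genuine fix and an unrelated outage can land in the same window.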
Alert Fatigue
Too many alerts lead to ignored alerts. If every minor metric fluctuation triggers a notification, your team will start tuning them out—and miss the ones that matter. Be selective: alert only on metrics that require immediate action. Use warning levels for things that can wait until business hours. And regularly review your alert rules to remove those that never fire or that fire too often.
Ignoring Context
Metrics without context are dangerous. A spike in latency could be caused by a legitimate traffic surge (like a product launch) or a bug. Always correlate metrics with events: deployments, traffic changes, third-party service outages. We recommend using a tool that can annotate your graphs with deployment markers and incident timelines.
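If your tooling does not annotate graphs for you, the same correlation can be approximated by listing the events that happened shortly before a spike. The helper, the 30-minute window, and the event log below are hypothetical.

```python
from datetime import datetime, timedelta

def events_before(spike_time, events, window=timedelta(minutes=30)):
    """Return names of events that occurred within `window` before the spike."""
    return [name for ts, name in events if spike_time - window <= ts <= spike_time]

# Hypothetical event log and latency spike.
events = [
    (datetime(2024, 5, 1, 9, 0), "deploy api v1.41"),
    (datetime(2024, 5, 1, 13, 50), "deploy api v1.42"),
    (datetime(2024, 5, 1, 14, 5), "marketing email sent"),
]
spike = datetime(2024, 5, 1, 14, 10)
print(events_before(spike, events))
```

In this example both a deployment and a traffic-generating email fall inside the window, so the spike alone cannot tell you which one is responsible.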
Over-Reliance on a Single Metric
No single metric tells the whole story. Even the best metrics should be part of a balanced set. If you're only tracking p99 latency, you might miss that your throughput has dropped by half because users are leaving due to errors. We recommend a minimum of three metrics: one user-facing (latency or error rate), one system-level (queue length or utilization), and one business-level (conversion rate or revenue per request, if applicable).
7. Mini-FAQ: Common Questions About Performance Metrics
Should we track all requests or sample?
Sampling is fine for latency and throughput, but error rates should be tracked for every request. A single error can be critical even if it's statistically rare. For latency, 1% sampling is often enough to get a good picture, but be aware that sampling can miss tail events. We recommend full tracing for error paths and sampled tracing for normal paths.
How often should we review our metrics?
At a minimum, review your primary metrics weekly in a team standup. Monthly deep dives are good for identifying trends and adjusting thresholds. Quarterly, revisit the choice of metrics themselves. If your system has changed significantly (e.g., you moved from monolith to microservices), your metrics should change too.
What's the best way to share metrics with non-technical stakeholders?
Translate technical metrics into business outcomes. Instead of saying "p99 latency is 500ms," say "the slowest 1% of requests take half a second or longer." Instead of "error rate is 0.5%," say "one in two hundred requests fails." Use simple charts with clear thresholds (green/yellow/red). Avoid jargon like "percentile" unless you explain it. We've found that a single dashboard with three metrics (user-facing latency, error rate, and uptime) works well for executive reporting.
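If you generate stakeholder reports automatically, the translation can be scripted. The two helpers below are a hypothetical sketch of that idea; the wording of the output is a suggestion, not a standard.

```python
def describe_error_rate(rate):
    """Turn a fractional error rate into a 'one in N' phrase."""
    if rate <= 0:
        return "no requests fail"
    return f"about 1 in {round(1 / rate)} requests fails"

def describe_p99(p99_ms):
    """Explain a p99 latency value without using the word 'percentile'."""
    seconds = p99_ms / 1000
    return f"99% of requests finish within {seconds:g} seconds; the slowest 1% take longer"

print(describe_error_rate(0.005))
print(describe_p99(500))
```

Note that percentiles describe requests, not users: a single user who makes many requests will probably hit the slow 1% at some point.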
How do we handle metrics for microservices?
Microservices introduce the challenge of distributed tracing. You need to track metrics per service, but also end-to-end metrics that capture the user's full request path. We recommend using a trace ID to correlate metrics across services. Focus on the services that are most critical to user experience, and don't try to monitor every service equally. A service that rarely changes and has low traffic might only need basic health checks.
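The core of trace ID correlation is small: reuse the incoming ID if there is one, create one at the edge if there is not, and pass it on with every downstream call. The sketch below uses a hypothetical `X-Trace-Id` header and treats headers as a plain dict; real deployments typically rely on a tracing library and a standard header such as the W3C `traceparent`.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name chosen for this sketch

def ensure_trace_id(headers):
    """Return a copy of the headers with a trace ID, reusing the incoming one if present.

    Assumes exact-case header keys; real HTTP header names are case-insensitive.
    """
    outgoing = dict(headers)
    outgoing.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return outgoing

def tag_metric(name, value, headers):
    """Attach the trace ID to a metric record so services can be correlated later."""
    return {"metric": name, "value": value, "trace_id": headers[TRACE_HEADER]}

# An edge service creates the ID; a downstream service reuses it.
edge_headers = ensure_trace_id({})
downstream_headers = ensure_trace_id(edge_headers)
print(tag_metric("db.query_ms", 42, downstream_headers))
```

Attaching a trace ID to every metric point creates high-cardinality data, so it usually belongs on traces or sampled logs rather than on aggregated dashboard metrics.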
What about synthetic monitoring vs. real user monitoring?
Synthetic monitoring (using scripted transactions from controlled environments) is useful for catching regressions before users see them. Real user monitoring (RUM) captures actual user experiences, including network conditions and device variations. Both have value. We recommend using synthetic monitoring for pre-deployment checks and RUM for ongoing monitoring. The key is to understand that synthetic metrics are not the same as real user metrics; they measure potential performance, not actual performance.
Measuring technical performance is a practice, not a destination. The metrics you choose today will evolve as your system and your users change. By focusing on a small set of meaningful, actionable metrics, and avoiding the common pitfalls we've outlined, you'll be able to make informed decisions that actually improve performance. Start with the three to five metrics that matter most to your context, build a process around them, and iterate from there.