Reliability Foundations: Understanding Anomaly Detection in Monitoring

Anomaly detection: concepts, methods, & implementation

15 min. Read
16 September 2024

In this article, we explore what anomaly detection is and what benefits it can provide to an organization.

Anomaly detection is the process of identifying potential issues or incidents within a system. It is a critical component of maintaining system reliability and performance.

Anomaly detection is essential because it allows organizations to address problems proactively, before they escalate, minimizing downtime and reducing the impact on users. There are two main types of detection: dynamic, which identifies deviations from learned normal behavior, and static, which triggers alerts when predefined conditions are met.

Benefits of Anomaly Detection

  • Early Issue Detection: Identifies problems before they become critical.
  • Reduced False Alarms: Minimizes unnecessary alerts and IT team fatigue.
  • Adaptive Monitoring: Adjusts to changing system behaviors automatically.
  • Enhanced Performance & Security: Quickly spots performance issues and potential threats.
  • Cost-Effective: Prevents costly downtime and optimizes resource use.
  • Scalable: Efficiently handles large, complex systems and growing data volumes.

Static vs. Dynamic

There are two main approaches to performing anomaly detection: static and dynamic. Each approach has its own strengths and is suited to different types of data and scenarios. Static anomaly detection relies on predefined thresholds that remain constant over time, while dynamic anomaly detection adjusts thresholds based on the evolving characteristics of the data.

Before discussing static and dynamic threshold detection, we need to understand what we mean by "normal" in this context.

What is “normal” here?

"normal" refers to the expected range of values or behavior for a particular metric or system performance parameter under typical operating conditions. Defining what constitutes "normal" is crucial for identifying anomalies or issues that deviate from usual patterns.

Consider an example:
Imagine you are monitoring the response time of a web application. Historically, the response time has stayed between 100 ms and 300 ms, so this range is considered normal. If the response time suddenly drops to 50 ms or spikes to 500 ms, it may indicate an issue, since these values deviate from the established normal range.

Static Threshold 

Static thresholds are predefined, fixed values set to monitor specific metrics such as CPU utilization, error rates, or response times. Alerts are triggered when these metrics exceed or fall below the set threshold.

A static threshold sets a fixed limit for a metric, and incoming real-time data is compared against that limit to decide whether an alert should be triggered. If the metric goes above or below the threshold, an alert is generated. Since the threshold doesn't adjust automatically, it must be updated manually if normal behavior changes.

The graph below illustrates a basic static threshold.

To handle temporary spikes or drops, you can set time limits so alerts are raised only if the metric stays outside the threshold for a certain period, as shown in picture (B).

For example: 

Imagine you set a static threshold of 70% for CPU utilization. If the CPU usage goes above 70%, an alert is triggered. However, if CPU usage temporarily spikes above 70% but then drops back down quickly, no alert is raised, as long as the usage does not stay above 70% for the duration of the set time limit. If the CPU usage consistently stays above 70% for a specified period, an alert will be generated to notify you of a potential issue.
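
To make this concrete, here is a minimal Python sketch of that logic. The 70% threshold, the number of consecutive samples required, and the sample readings are illustrative assumptions, not settings from any particular monitoring tool.

```python
# Minimal sketch: a static threshold that only alerts on a sustained breach.
# The 70% threshold, the required number of consecutive samples, and the
# sample data below are illustrative assumptions.

THRESHOLD = 70.0        # fixed CPU utilization limit (%)
SUSTAINED_SAMPLES = 5   # consecutive samples that must breach before alerting

def check_static_threshold(samples, threshold=THRESHOLD, sustained=SUSTAINED_SAMPLES):
    """Return True (alert) only if `sustained` consecutive samples exceed the threshold."""
    consecutive = 0
    for value in samples:
        if value > threshold:
            consecutive += 1
            if consecutive >= sustained:
                return True        # sustained breach -> alert
        else:
            consecutive = 0        # brief spike that recovered -> reset the counter
    return False

# A short spike above 70% does not alert...
print(check_static_threshold([40, 55, 85, 60, 50, 45, 52, 48]))   # False
# ...but five consecutive samples above 70% do.
print(check_static_threshold([40, 72, 75, 78, 80, 74, 71, 73]))   # True
```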

What if: 

What if the CPU usage exceeds a static threshold, but there's no real issue? Alternatively, what if a problem arises within the range defined by the threshold because the normal trend has shifted?

For example, if your CPU usage typically ranges from 20% to 60%, but over time it shifts to 30% to 70%, the static threshold set at 65% might no longer be effective. You would then need to update the threshold to match the new range. 

What if the range changes again, say to 27% to 67%? Constantly adjusting these thresholds manually is time-consuming and impractical, potentially leading to missed issues or false alarms.

This is where dynamic thresholds become essential. Unlike static thresholds, dynamic threshold detection adapts to changing usage patterns, identifying deviations from current trends. This approach ensures that alerts are triggered based on the system's actual behavior rather than outdated thresholds, reducing blind spots.

Dynamic Threshold

A dynamic threshold automatically adjusts based on real-time data, eliminating the need for constant manual updates. It uses statistical algorithms to analyze incoming data points against historical data, identify trends, and detect anomalies. This approach reduces false alarms and ensures alerts are triggered only when there's a genuine issue, making it more efficient and less of a manual burden than static thresholds.

The graph below illustrates a basic dynamic threshold.

For example: 

Let's revisit the earlier CPU example, this time using dynamic anomaly detection to monitor CPU utilization. Instead of a fixed threshold, the system learns that normal CPU usage typically ranges from 30% to 60%, based on the past 30 days of data.

The system calculates that the average usage is 45% with a standard deviation of 7.5%. Using a 2-standard-deviation rule, the dynamic threshold sets a normal range of 30% to 60%:

  • Lower bound: 45% - (2 × 7.5%) = 30%
  • Upper bound: 45% + (2 × 7.5%) = 60%

If CPU usage goes outside this range, the system starts monitoring more closely. 

However, a brief spike to 65% that quickly returns to normal wouldn't trigger an alert. If the CPU usage consistently stays above 60% or below 30% for a specified period (let's say an hour), an alert is generated to notify you of a potential issue. This approach allows for normal fluctuations while still catching sustained abnormal behavior. 

The key difference is adaptability. If over time normal CPU usage shifts to 40% to 70%, the system automatically adjusts its "normal" range to match, ensuring that alerts remain relevant without manual reconfiguration.
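
Here is a minimal sketch of this adaptive behavior, assuming a simple rolling window of recent readings and a mean ± 2 standard deviation rule; the window size, minimum history, and sample data are illustrative assumptions.

```python
# Minimal sketch of a dynamic threshold: the "normal" range is recomputed from
# a rolling window of recent readings as mean +/- 2 standard deviations.
# Window size, minimum history, and the sample data are illustrative assumptions.
import statistics
from collections import deque

class DynamicThreshold:
    def __init__(self, window_size=30, min_history=10, n_sd=2.0):
        self.history = deque(maxlen=window_size)   # e.g. the last 30 readings
        self.min_history = min_history             # wait for enough data first
        self.n_sd = n_sd

    def bounds(self):
        mean = statistics.mean(self.history)
        sd = statistics.stdev(self.history)
        return mean - self.n_sd * sd, mean + self.n_sd * sd

    def observe(self, value):
        """Check `value` against the current adaptive range, then update the baseline."""
        is_anomaly = False
        if len(self.history) >= self.min_history:
            lower, upper = self.bounds()
            is_anomaly = value < lower or value > upper
        self.history.append(value)                 # the baseline keeps adapting
        return is_anomaly

detector = DynamicThreshold()
for cpu in [45, 48, 42, 50, 47, 44, 46, 49, 43, 45, 90]:
    if detector.observe(cpu):
        print(f"Anomalous CPU reading: {cpu}%")    # only the 90% reading is flagged
```

In practice, you would typically also require the breach to persist for some period, as described above, before raising an alert.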

Understanding the Key Parameters

Metric

A metric is the specific quantitative measurement you're monitoring for anomalies. It's the foundation of your anomaly detection process, providing the raw data that you'll analyze. Metrics can be simple or complex, depending on your needs and the system you're monitoring.

For example, in an e-commerce setting, you might track metrics such as daily sales revenue, number of transactions, or average order value. In a technical environment, metrics could include server response time, CPU usage, or network throughput. The choice of metric is crucial as it directly impacts what kinds of anomalies you can detect.

Datapoints

Datapoints are the individual measurements of your chosen metric, collected at regular intervals. Each datapoint represents a specific value at a particular moment in time, forming the basic units of your dataset.

For instance, if you're monitoring hourly website traffic, each datapoint would represent the number of visitors for a specific hour. So you might have datapoints like:

  • 9:00 AM: 1000 visitors
  • 10:00 AM: 1200 visitors
  • 11:00 AM: 950 visitors

The frequency of datapoint collection depends on the nature of your metric and the granularity of analysis you need.

Time slice/Window

A time slice, also known as a time window, is the duration over which you group and analyze your data points. It defines the granularity of your analysis and can significantly impact your ability to detect different types of anomalies.

For example, if you're analyzing server performance, you might use different time windows for different purposes:

  • A 5-minute window to detect sudden spikes in traffic
  • An hourly window to identify unusual patterns in resource usage
  • A daily window to spot trends in overall system performance

The choice of time window depends on the typical behavior of your metric and the types of anomalies you're looking to detect.
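As a rough illustration, the sketch below groups raw (timestamp, value) datapoints into fixed 5-minute slices and averages each slice; the window length and the sample data are assumptions for the example.

```python
# Minimal sketch: grouping raw (timestamp, value) datapoints into fixed
# time slices and summarizing each slice. The 5-minute window and the
# sample data are illustrative assumptions.
from collections import defaultdict

WINDOW_SECONDS = 5 * 60   # one 5-minute time slice

def bucket_by_window(datapoints, window=WINDOW_SECONDS):
    """datapoints: iterable of (unix_timestamp, value) -> {window_start: [values]}."""
    buckets = defaultdict(list)
    for ts, value in datapoints:
        window_start = ts - (ts % window)   # align each timestamp to its slice
        buckets[window_start].append(value)
    return buckets

samples = [(0, 120), (60, 130), (310, 180), (320, 175), (620, 125)]
for start, values in sorted(bucket_by_window(samples).items()):
    print(start, sum(values) / len(values))   # average value per 5-minute slice
```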

Evaluation Period

The evaluation period is the specific timeframe over which you apply your anomaly detection analysis. It's the duration for which you want to identify anomalies, often representing the most recent data you're examining.

For instance, in monitoring website traffic:

  • You might set an evaluation period of the past 24 hours to detect recent anomalies
  • A week-long evaluation period could help identify patterns across different days
  • A month-long period might be used to catch slower-developing anomalies

The evaluation period is distinct from the historical look back (which establishes baselines) and the time slice (which determines analysis granularity). It focuses your analysis on a specific, usually recent, timeframe of interest. Choosing an appropriate evaluation period depends on the nature of your data and the types of anomalies you're trying to detect.

Historical look back

The historical look back period refers to how far back in time you consider data when establishing patterns and detecting anomalies. This historical context is crucial for understanding the normal behavior of your metric over time.

For instance, if you're monitoring retail sales:

  • A 30-day look back might help you identify short-term sales trends
  • A 1-year look back could reveal seasonal patterns
  • A 5-year look back might show long-term growth or decline

The appropriate look back period depends on factors like the stability of your metric, the presence of seasonal patterns, and how quickly the underlying system changes.

Baseline

Using your historical data, you establish a baseline - the expected "normal" behavior of your metric. This serves as a reference point for detecting deviations. The baseline can be static or dynamic, depending on the nature of your data.

For example, in monitoring energy consumption of a building:

  • A static baseline might be the average daily consumption over the past year
  • A dynamic baseline could adjust for factors like day of the week, season, or occupancy levels

Establishing an accurate baseline is crucial for effective anomaly detection, as it provides the context for determining what constitutes "abnormal" behavior.
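A minimal sketch of the distinction, assuming hypothetical daily energy readings: a static baseline is a single overall average, while a simple dynamic baseline here adjusts by day of the week.

```python
# Minimal sketch contrasting a static baseline (one overall average) with a
# simple dynamic baseline (an average per day of week). The readings are
# illustrative assumptions; 0 = Monday ... 6 = Sunday.
import statistics
from collections import defaultdict

readings = [(0, 520), (1, 540), (2, 530), (3, 525), (4, 535), (5, 310), (6, 300),
            (0, 510), (1, 545), (2, 528), (3, 532), (4, 541), (5, 305), (6, 295)]

# Static baseline: one expected value, regardless of context.
static_baseline = statistics.mean(value for _, value in readings)

# Dynamic baseline: the expected value depends on the day of the week.
by_day = defaultdict(list)
for day, value in readings:
    by_day[day].append(value)
dynamic_baseline = {day: statistics.mean(values) for day, values in by_day.items()}

print(round(static_baseline, 1))   # a single number applied to every day
print(dynamic_baseline[5])         # weekend days have a much lower "normal"
```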

Trend Analysis

Trend analysis involves examining how your metric changes over time. It helps in identifying patterns, seasonality, and gradual shifts that might not be apparent when looking at individual datapoints.

For example, in analyzing stock prices:

  • An upward trend might indicate growing investor confidence
  • A cyclical trend could reveal seasonal factors affecting the stock
  • A sudden change in trend might signal important news or market shifts

Trend analysis can help distinguish between normal variations and true anomalies, especially in metrics with complex patterns.

Snapshots

Snapshots are periodic "pictures" of your data at specific points in time. Comparing snapshots can help identify sudden changes or anomalies that might not be apparent in continuous data.

For instance, in monitoring database performance:

  • Daily snapshots of query response times could reveal gradual degradation
  • Weekly snapshots of data volume might show unusual growth patterns
  • Monthly snapshots of user activity could highlight shifts in usage patterns

Snapshots are particularly useful for detecting slow-moving anomalies or changes that occur over longer time scales.

Outlier detection

Outlier detection is the process of identifying datapoints that significantly differ from other observations. Various statistical and machine learning techniques can be used for outlier detection, considering factors like the baseline, threshold, and trends.

For example, in fraud detection:

  • A transaction amount far exceeding a customer's usual spending could be flagged
  • A login attempt from an unusual geographic location might be identified as an outlier
  • A sudden spike in account activity could trigger further investigation

Effective outlier detection requires careful tuning to balance between catching genuine anomalies and avoiding false positives.
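As a rough sketch of one common technique, the snippet below flags values that sit more than a chosen number of standard deviations from the mean (a z-score style check); the transaction amounts and the 2-SD cutoff are illustrative assumptions.

```python
# Minimal sketch: flag outliers by z-score, i.e. how many standard deviations
# a point sits from the mean. The transaction amounts and the 2-SD cutoff are
# illustrative assumptions; real fraud systems combine many more signals.
import statistics

def find_outliers(values, n_sd=2.0):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > n_sd * sd]

amounts = [42, 55, 38, 60, 47, 51, 45, 49, 53, 2500]   # one suspicious transaction
print(find_outliers(amounts))                          # -> [2500]
```

Note that a single extreme value also inflates the mean and standard deviation used to judge it, which is one reason careful tuning and sufficiently long baselines matter for balancing missed anomalies against false positives.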

Use of Statistics 

Standard deviation (SD) tells us how much data typically varies from the average. In anomaly detection, we use it to separate "normal" from "unusual" data. The 68-95-99.7 rule is a simple way to remember how much data falls within 1, 2, or 3 standard deviations of the average.

Here's a breakdown:

  • 68% Rule (1 SD): Approximately 68% of the data falls within one standard deviation (1 SD) of the mean.
  • 95% Rule (2 SD): Approximately 95% of the data falls within two standard deviations (2 SD) of the mean.
  • 99.7% Rule (3 SD): Approximately 99.7% of the data falls within three standard deviations (3 SD) of the mean.

In anomaly detection, we often use 2 SD (95% rule) or 3 SD (99.7% rule) as thresholds for identifying outliers or anomalies.

Simplified Example:

Let's say we're monitoring the CPU usage of a web server. After analyzing the historical CPU usage data, we find:

  • The average (mean) CPU usage is 50%
  • The standard deviation of the CPU usage is 10%.

Using the standard deviation rule for anomaly detection:

1 SD (68% rule):

  • Range: 40% to 60% CPU usage (50% ± 10%)
    • This range represents normal daily fluctuations in CPU usage. Most of the time, usage will stay within this range, and it's considered routine server behavior.

2 SD (95% rule):

  • Range: 30% to 70% CPU usage (50% ± 20%)
    • If CPU usage falls within this range, it's still considered normal, though readings near the edges may be worth monitoring closely. Spikes toward 70% could indicate temporary heavy load, and dips near 30% may suggest underutilization.

3 SD (99.7% rule):

  • Range: 20% to 80% CPU usage (50% ± 30%)
    • CPU usage outside this range is highly unusual and likely warrants immediate investigation. Such deviations might signal an application failure (low usage) or excessive load leading to potential crashes (high usage).

Anomaly Detection Example:

  • If we use the 2 SD rule for anomaly detection:
    • A day where CPU usage hits 75% could be flagged as an anomaly (perhaps due to an unexpected surge in traffic or a resource-intensive process).
    • A day where CPU usage drops to 25% could also be flagged (indicating possible underuse or a misconfiguration).
  • Using the 3 SD rule would make the system less sensitive, only triggering alerts for more severe anomalies, such as when CPU usage reaches 85% or drops to 15%.

This simple example demonstrates how standard deviation can be used to create thresholds for normal vs. anomalous behavior in a metric, allowing for automated detection of unusual patterns or events.
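
The same arithmetic can be expressed in a few lines of Python as a rough sketch; the mean, standard deviation, and sample readings simply reuse the numbers from the worked example above.

```python
# Minimal sketch reproducing the worked example above: mean CPU usage of 50%,
# standard deviation of 10%, and bands at 1, 2, and 3 standard deviations.
MEAN, SD = 50.0, 10.0

for n in (1, 2, 3):
    print(f"{n} SD range: {MEAN - n * SD:.0f}% to {MEAN + n * SD:.0f}%")

def classify(cpu, n_sd=2):
    """Return 'anomaly' if the reading falls outside mean +/- n_sd * SD."""
    lower, upper = MEAN - n_sd * SD, MEAN + n_sd * SD
    return "anomaly" if cpu < lower or cpu > upper else "normal"

print(classify(75))   # anomaly under the 2 SD rule
print(classify(25))   # anomaly under the 2 SD rule
print(classify(65))   # normal under the 2 SD rule
```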

Anomaly Detection: Basic Approach & Process

Basic Approach

1. Establish Normal Behavior:

  • Collect historical data for your chosen metric.
  • Calculate the average (mean) and standard deviation.
  • Define the "normal" range, typically using 2 standard deviations from the mean.

2. Monitor Current Data:

  • Continuously collect new data points for your metric.
  • Compare each new data point to the established normal range.

3. Identify and Respond to Anomalies:

  • Flag any data points that fall outside the normal range as potential anomalies.
  • Investigate the cause of these anomalies and take appropriate action.

4. Update Normal Behavior:

  • Periodically recalculate the average and standard deviation using recent data.
  • Adjust the normal range accordingly to adapt to gradual changes over time.

Understanding the Process with an Example

Example: Detecting anomalies in daily website visitors

Step 1: Define the metric

  • We'll track the number of daily visitors to a website.

Step 2: Collect historical data

  • Gather data for the past 30 days of daily visitor counts.

Step 3: Calculate baseline statistics

  • Calculate the average (mean) daily visitors: Let's say it's 1,000 visitors per day.
  • Calculate the standard deviation: Let's say it's 200 visitors.

Step 4: Set the anomaly threshold

  • We'll use the 2 standard deviation rule.
  • Lower threshold: 1,000 - (2 x 200) = 600 visitors
  • Upper threshold: 1,000 + (2 x 200) = 1,400 visitors

Step 5: Monitor current data

  • Start collecting daily visitor counts for the current period.

Step 6: Compare current data to thresholds

  • For each day, compare the visitor count to the thresholds.

Step 7: Identify anomalies

  • If a day's visitor count falls below 600 or above 1,400, flag it as an anomaly.

Step 8: Analyze and respond

  • Investigate the cause of any anomalies detected.
  • For low traffic: Check for website issues or tracking problems.
  • For high traffic: Look for successful marketing campaigns or viral content.

Step 9: Update the model (for dynamic thresholds)

  • Regularly recalculate the average and standard deviation using recent data.
  • This allows the model to adapt to gradual changes in traffic patterns.

Example scenario:

  • Day 1: 950 visitors (normal)
  • Day 2: 1,100 visitors (normal)
  • Day 3: 1,500 visitors (anomaly - unusually high traffic)
  • Day 4: 500 visitors (anomaly - unusually low traffic)
  • Day 5: 1,050 visitors (normal)

This process helps identify unusual patterns in your website traffic, allowing you to respond quickly to potential issues or opportunities. 
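
Putting the steps together, here is a minimal end-to-end sketch. The 30-day history is an illustrative assumption constructed so the baseline works out to roughly 1,000 visitors per day with a standard deviation of about 200, matching the numbers used above.

```python
# Minimal end-to-end sketch of the process above: build a baseline from
# 30 days of history, derive 2-SD thresholds, then flag each new day.
# The history is an illustrative assumption chosen so the mean is
# 1,000 visitors/day with a standard deviation of roughly 200.
import statistics

history = [1150, 850, 1250, 900, 750, 1100, 1300, 800, 700, 1200,
           1050, 825, 950, 1175, 1275, 875, 725, 1125, 1225, 750,
           775, 1250, 1075, 725, 925, 1275, 1025, 800, 975, 1200]

mean = statistics.mean(history)
sd = statistics.stdev(history)
lower, upper = mean - 2 * sd, mean + 2 * sd
print(f"Baseline: mean={mean:.0f}, sd={sd:.0f}, normal range {lower:.0f}-{upper:.0f}")

for day, visitors in enumerate([950, 1100, 1500, 500, 1050], start=1):
    status = "anomaly" if visitors < lower or visitors > upper else "normal"
    print(f"Day {day}: {visitors} visitors ({status})")
```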

How Temperstack eases this process

Temperstack is here to revolutionize your workflow. We've combined the best of static and dynamic approaches into one seamless platform, eliminating the need for manual calculations and constant monitoring.

With Temperstack, you can say goodbye to:

  • Tedious statistical analyses
  • Complicated metric setting
  • Endless historical data tracking

Our solution comes pre-loaded with expert-crafted thresholds, meticulously developed through extensive research and industry insights. These intelligent settings provide a robust foundation for your anomaly detection needs right out of the box.

But we don't stop there. Temperstack offers the flexibility to fine-tune these thresholds to your organization's unique requirements. In just a few clicks, you can customize your anomaly detection system to perfection.

Ready to transform your approach to anomaly detection? Start your free trial today and experience the Temperstack difference. Streamline your processes, enhance your insights, and focus on what truly matters – growing your business.

About the Co-Authors

Hari is an accomplished engineering leader and innovator with over 15 years of experience across various industries. Currently serving as the cofounder and CTO of Temperstack, Hari has been instrumental in scaling engineering teams, products, and infrastructure to support hyper-growth. Previously, he held Director of Engineering positions at Practo, Dunzo, Zeta, and Aknamed, where he consistently drove innovation and operational excellence.

Samdisha is a skilled technical writer at Temperstack, leveraging her expertise to create clear and comprehensive documentation. In her role, she has been pivotal in developing user manuals, API documentation, and product specifications, contributing significantly to the company's technical communication strategy.


