
Process Control in IT Infrastructure: Implementing Process Control Learnings from Manufacturing

In modern IT infrastructure management, we face a critical challenge: despite abundant observability data and sophisticated tools, organizations struggle with alert fatigue and ineffective monitoring. This white paper demonstrates how manufacturing process control principles can revolutionize infrastructure and application performance monitoring, drawing specific parallels with dairy plant quality control systems.


1 January 2025

Alert Fatigue & the Reliability Crisis

In our recent series exploring manufacturing reliability principles, we traced the journey from understanding foundational principles to implementing control points to processing signals effectively. Now it's time to move beyond theory to practical implementation.

Organizations commonly face:

  • Alert storms during incidents that paralyze response teams
  • High volumes of auto-resolving alerts that train operators to ignore notifications
  • Unclear alert ownership leading to delayed responses
  • Alert fatigue causing teams to miss critical issues
  • Inconsistent alert configuration across different services
  • Missing alerts, with no mechanism to audit alert deployment and active state
  • Inconsistent alert policy application

The Root Cause

The challenge isn't a lack of observability data or sophisticated tools. Instead, we've fallen into three fundamental traps:

  • The belief that more observability will automatically deliver better reliability
  • The illusion that more signals and abundant telemetry equal understanding
  • Too much time and mindshare spent on technology rather than on the fundamentals of process control

Infrastructure & Services Monitoring as a Process Control System: The Manufacturing Parallel

Infrastructure performance can be understood as a process control problem, similar to quality control in food processing. Just as a dairy plant must ensure milk quality from input to output, your infrastructure must process code efficiently and reliably. Bad input code, like contaminated milk, can compromise the entire system - no amount of processing will make it safe for consumption. Similarly, even good code, like quality milk, must be processed correctly to maintain its integrity.

Core Process Control Metrics (referencing Google SRE's Four Golden Signals)

Latency (Processing Time)

  • This is like cycle time in manufacturing
  • Every request must be processed within specific time bounds
  • For example, each API call should complete within 200ms
  • Just as a manufacturing process must maintain consistent cycle times, your infrastructure must maintain consistent response times

Traffic (Production Rate)

  • This represents your throughput - how many requests you're handling
  • Like a production line's units per hour
  • For example: 1000 requests per second
  • Your infrastructure, like any process, has a designed throughput capacity

Errors (Quality Issues)

  • These are your defects - failed requests
  • Like rejected products in manufacturing
  • For example: 0.1% error rate threshold
  • Must stay below acceptable limits for the process to be "in control"

Saturation (Machine Utilization)

  • This is your resource utilization level
  • Like how close a machine is running to its maximum capacity
  • For example: CPU at 80% utilization
  • When you hit 100%, like any machine, you can't process more work

Process Control Parameters

Process Control in this context means:

  • Maintaining latency within acceptable bounds
  • Handling designed traffic levels
  • Keeping errors below threshold
  • Operating at efficient but safe saturation levels

For example, if your application needs to:

  • Process requests within 200ms (latency)
  • Handle 1000 requests/second (traffic)
  • Keep errors below 0.1% (quality)
  • Run at 70% saturation (efficiency)

Your process is "in control" when all these parameters are met.

Detailed Process Parameters: The Dairy Plant Parallel

Imagine a dairy plant's pasteurization unit processing milk. Just as milk must be heated to exactly 72°C (161°F) for 15-20 seconds, your CPU must process requests within specific performance parameters.

Core Components and Their IT Equivalents

Processing Unit Comparison

  • Dairy Plant: A pasteurizer rated for 10,000 liters/hour at optimal temperature
  • IT Equivalent: An 8-core CPU rated at 100% utilization per core
  • Capacity Parallel: Both have finite processing capacity that affects quality when exceeded

Process Parameters and Golden Metrics

Processing Time (Latency)

  • Dairy Plant: Time milk stays at 72°C (must be 15-20 seconds)
  • IT Equivalent: Time CPU takes to complete a request
  • Control Factor: CPU time per request must stay under 100ms
  • Impact: Just as underheated milk is unsafe, slow CPU response makes applications unusable

Production Rate (Traffic)

  • Dairy Plant: Current flow rate (e.g., 8,000 liters/hour)
  • IT Equivalent: Requests processed per second
  • Control Factor: Request volume that CPU must handle
  • Impact: Like milk backing up in pipes, requests queue when volume exceeds capacity

Quality Issues (Errors)

  • Dairy Plant: Batches failing temperature standards
  • IT Equivalent: Failed request processing due to CPU constraints
  • Control Factor: Error rate must stay below 0.1%
  • Impact: Both result in service failure

Machine Utilization (Saturation)

  • Dairy Plant: Pasteurizer running at 80% of max capacity
  • IT Equivalent: CPU running at 80% utilization
  • Control Factor: Operating efficiency vs. maximum capacity
  • Impact: Both systems degrade rapidly as they approach 100% utilization

Process Control and Variations

Normal Variation

  • Dairy Plant: Temperature fluctuating 71.5°C to 72.5°C
  • IT Equivalent: CPU utilization varying between 40-60%
  • Impact: Expected variation, process remains in control
  • Example: CPU utilization increasing during business hours

Abnormal Variation

  • Dairy Plant: Temperature suddenly dropping to 70°C
  • IT Equivalent: CPU suddenly spiking to 85%
  • Impact: Requires immediate investigation
  • Example: Memory leak causing unexpected CPU spikes

Understanding Downtime

Capacity-Related Downtime

  • Dairy Plant: Trying to process 12,000 liters/hour through a 10,000 liters/hour pasteurizer
  • IT Equivalent: Running at 95% CPU utilization with increasing load
  • Result: Complete service failure
  • Example: Black Friday traffic exceeding CPU capacity

Process-Related Downtime

  • Dairy Plant: Heating element malfunction causing temperature variations
  • IT Equivalent: Application bug causing CPU thrashing
  • Result: Service degradation before capacity is reached
  • Example: Infinite loop in code causing CPU spikes

Anomaly Detection and Prevention

Early Warning Indicators

  • Dairy Plant: Temperature trending upward over hours
  • IT Equivalent: CPU utilization trending upward over days
  • Value: Allows intervention before failure
  • Example: CPU trending up 5% daily indicates a growing problem

Capacity Planning

  • Dairy Plant: Adding second pasteurizer at 80% sustained utilization
  • IT Equivalent: Adding CPU cores at 80% sustained utilization
  • Goal: Maintain headroom for spikes
  • Example: Scaling up instance size before holiday season

Process Control Success Criteria

Optimal Operation

  • Dairy Plant: Pasteurizer running at 60-70% capacity, maintaining 72°C
  • IT Equivalent: CPU running at 40-60%, maintaining sub-100ms latency
  • Indicator: Stable, predictable performance
  • Example: CPU handling daily peak loads without issues

Risk Indicators

  • Dairy Plant: Temperature control becoming erratic
  • IT Equivalent: CPU utilization becoming erratic
  • Warning Signs: Increasing variation in metrics
  • Example: CPU showing random spikes during normal operations

Statistical Process Control in Modern IT Operations

Learning from Manufacturing

The principles of Statistical Process Control (SPC), which revolutionized manufacturing quality control, are already embedded in your modern IT observability tools - you just might not recognize them. When your observability platform alerts you about "anomalies," it's applying the same fundamental concepts that dairy plants use to maintain consistent product quality.

Key Insights from Manufacturing to IT

No Need for New Tools

  • Your existing observability platforms (like Datadog, New Relic, or Grafana) already have built-in anomaly detection capabilities that mirror traditional SPC control charts
  • These tools are doing the complex statistical calculations behind the scenes, just like in manufacturing

Focus on Fundamentals, Not Complex Statistics

  • As demonstrated in the referenced study, you don't need deep statistical knowledge to apply SPC effectively
  • Understanding variation and the use of control charts does not require understanding of probabilities, the normal distribution, the binomial distribution, or any other probability distribution
  • Your observability tools handle this complexity automatically

Real-World Success Stories

  • A software team reduced incidents from 8.67 to 4.5 per week by applying basic SPC principles
  • This was achieved through:
  • Distinguishing between normal system variation and actual problems
  • Avoiding overreaction to regular fluctuations
  • Identifying true special causes that needed intervention

Practical Application in Modern IT

Common Cause vs Special Cause

  • When your observability tool shows an "anomaly," it's identifying what SPC calls a "special cause variation"
  • Regular performance fluctuations within expected bounds are "common cause variation" Process Improvement Goals
  • Reduce false alerts by understanding normal system behavior
  • Focus efforts on true anomalies
  • Validate improvements through sustained performance changes

Implementation Approach

  • Use your existing observability tools' anomaly detection features
  • Apply manufacturing-proven SPC principles to interpret the data
  • Focus on understanding system behavior rather than statistical complexity
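
To ground the common cause vs. special cause distinction, here is a minimal Python sketch of the classic Shewhart-style 3-sigma rule: readings inside the control limits are treated as common cause variation, readings outside them as special cause variation worth investigating. The baseline numbers are made up, and real observability platforms use more sophisticated models; this only illustrates the underlying idea.

```python
import statistics

def control_limits(baseline: list[float], sigmas: float = 3.0) -> tuple[float, float]:
    """Derive lower/upper control limits from a known-normal baseline period."""
    mean = statistics.mean(baseline)
    stdev = statistics.pstdev(baseline)
    return mean - sigmas * stdev, mean + sigmas * stdev

def classify(samples: list[float], lcl: float, ucl: float) -> list[str]:
    """Label each sample: inside the limits = common cause, outside = special cause."""
    return ["special cause" if s < lcl or s > ucl else "common cause" for s in samples]

if __name__ == "__main__":
    # Baseline: CPU utilization (%) during a known-normal week (made-up numbers).
    baseline = [42, 45, 48, 51, 47, 44, 50, 46, 49, 43, 52, 48]
    lcl, ucl = control_limits(baseline)

    today = [47, 50, 53, 85, 49]  # the 85% spike is the kind of point that needs investigation
    for value, label in zip(today, classify(today, lcl, ucl)):
        print(f"CPU {value:5.1f}% -> {label}")
```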

Design of Monitoring Strategy: Core Tenets of Process Control through an Alerting System

Control by Exception

  • Only alert on actual anomalies requiring human intervention
  • Every alert must drive a specific action
  • No alert should auto-resolve

Mandatory Response Protocol

  • Every alert requires either manual intervention or automated remediation
  • No alert should be ignored or auto-resolved
  • Clear escalation paths for each alert type

Human-Centric Design

  • Alert volume must match human capacity
  • Critical alerts must be distinguishable
  • Each alert requires clear next actions
  • Response procedures must be documented
  • Audit implementation and coverage

Design of an Alert: Foundational Tenets

Control by Exception

  • You only intervene when a process exceeds defined limits

Complete Alert Definition

Every alert must have these three components:

  • Operating mean
  • Threshold deviation limits
  • Evaluation period

No Auto-Resolution

  • An alert that auto-resolves is defective
  • It indicates incorrect threshold or evaluation period
  • Such alerts create noise and must be eliminated

Mandatory Intervention

Every alert must require either:

  • Manual intervention
  • Automated remediation

Process Variation Detection

  • Set up anomaly detection
  • Define control limits
  • Eliminate alerts that fire for normal variation

Capacity Monitoring

  • Alerts for process control and capacity saturation are usually not differentiated, yet they address different time horizons
  • Process control looks at short-term variation within a band
  • Capacity saturation looks at long-term trends to anticipate when capacity bands will be breached
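
As a sketch of what the long-term view can look like, the example below fits a simple trend line to daily utilization averages and estimates how many days remain before a capacity band is breached. The 80% warning band, the sample history, and the `days_until_breach` helper are illustrative assumptions, not a prescribed method.

```python
import statistics
from typing import Optional

def days_until_breach(daily_utilization: list[float], limit: float) -> Optional[float]:
    """Fit a least-squares trend line to daily averages and project when `limit` is crossed.

    Returns the estimated number of days from the last sample, or None when there is
    no upward trend to project or the limit is already breached (that is a saturation
    alert, not a trend warning).
    """
    xs = list(range(len(daily_utilization)))
    x_mean = statistics.mean(xs)
    y_mean = statistics.mean(daily_utilization)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, daily_utilization)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    latest = daily_utilization[-1]
    if latest >= limit or slope <= 0:
        return None
    return (limit - latest) / slope

if __name__ == "__main__":
    # Illustrative daily CPU averages trending up roughly 4-5% per day.
    history = [55, 58, 61, 67, 70, 74, 79]
    eta = days_until_breach(history, limit=80.0)
    if eta is not None:
        print(f"Estimated days until the 80% warning band is breached: {eta:.1f}")
    else:
        print("No upward trend to project, or the band is already breached")
```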

Total Control Elements Per Metric

Eleven distinct elements are needed per metric:

Saturation Limits (3):

  • Red (P0) limit
  • Orange (Warning) limit
  • Blue (Underutilization) limit

Evaluation Periods (3):

  • Red alert evaluation period
  • Orange alert evaluation period
  • Blue alert evaluation period

Sampling Periods (3):

  • Red alert sampling frequency
  • Orange alert sampling frequency
  • Blue alert sampling frequency

Additional Elements (2):

  • Operating Mean (1): Baseline with seasonal adjustment
  • Deviation Control (1): Threshold from operating mean with its evaluation period and sampling frequency
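
One way to keep these eleven elements together is a per-metric configuration object. The Python sketch below is a hypothetical schema (the field names and values are invented for illustration): each band carries its limit, evaluation period, and sampling period (nine elements), plus the operating mean and the deviation control.

```python
from dataclasses import dataclass

@dataclass
class BandConfig:
    """One saturation band (Red / Orange / Blue): limit, evaluation period, sampling period."""
    limit_pct: float          # static saturation limit, e.g. 90.0 for Red
    evaluation_minutes: int   # how long the condition must hold before the alert fires
    sampling_seconds: int     # how often the metric is sampled

@dataclass
class DeviationControl:
    """Dynamic control around the operating mean."""
    threshold_pct: float      # allowed deviation from the operating mean
    evaluation_minutes: int
    sampling_seconds: int

@dataclass
class MetricControl:
    """All eleven control elements for one metric (e.g. CPU utilization)."""
    metric: str
    red: BandConfig               # three elements: limit, evaluation period, sampling period
    orange: BandConfig            # three more
    blue: BandConfig              # three more
    operating_mean_pct: float     # element 10: seasonally adjusted baseline
    deviation: DeviationControl   # element 11: deviation threshold with its own periods

# Illustrative values only.
cpu_control = MetricControl(
    metric="cpu_utilization",
    red=BandConfig(limit_pct=90, evaluation_minutes=5, sampling_seconds=60),
    orange=BandConfig(limit_pct=80, evaluation_minutes=30, sampling_seconds=60),
    blue=BandConfig(limit_pct=20, evaluation_minutes=7 * 24 * 60, sampling_seconds=300),
    operating_mean_pct=55.0,
    deviation=DeviationControl(threshold_pct=15, evaluation_minutes=15, sampling_seconds=60),
)
```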

Implementation Framework

A. Saturation Control (Static Limits)

Needed to prevent downtime:

Red Alert (P0)

  • Critical saturation breach
  • Requires immediate intervention
  • No delay tolerated

Orange Alert (Warning)

  • Approaching saturation
  • 24-48 hour intervention window
  • Allows planned response

Blue Alert (Optimization)

  • Resource underutilization
  • Weekly/monthly review
  • Cost optimization focus

B. Deviation Control (Dynamic Limits)

Needed to detect anomalies and get to root causes before saturation limits are breached

Operating Mean

  • Adapts to seasonality
  • Accounts for time-based variations
  • Reflects normal business cycles

Deviation Thresholds

  • Set around operating mean
  • Must consider seasonal patterns
  • Triggers on anomalous behavior
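
As a simplified sketch of a seasonally aware operating mean (a rough stand-in for the anomaly-detection features built into observability tools), the example below builds an hour-of-day baseline from historical samples and flags readings that fall outside a deviation threshold around that baseline. The hour-of-day bucketing and the 3-sigma threshold are assumptions made for the example.

```python
import statistics
from collections import defaultdict

def hourly_baseline(samples: list[tuple[int, float]]) -> dict[int, tuple[float, float]]:
    """Build an operating mean (and standard deviation) per hour of day from (hour, value) history."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)
    return {h: (statistics.mean(v), statistics.pstdev(v)) for h, v in buckets.items()}

def deviates(hour: int, value: float, baseline: dict[int, tuple[float, float]], sigmas: float = 3.0) -> bool:
    """True when a reading sits outside the deviation threshold for that hour's baseline."""
    mean, stdev = baseline[hour]
    return abs(value - mean) > sigmas * max(stdev, 1e-9)

if __name__ == "__main__":
    # Two weeks of (hour_of_day, cpu_pct) history: business hours run hotter,
    # with a little day-to-day variation.
    history = []
    for day in range(14):
        for hour in range(24):
            base = 55 if 9 <= hour <= 18 else 30
            history.append((hour, base + (day % 5)))

    baseline = hourly_baseline(history)
    print(deviates(14, 58, baseline))  # ~58% at 2 PM is normal business-hours load -> False
    print(deviates(3, 58, baseline))   # the same 58% at 3 AM is anomalous -> True
```

Because the mean is computed per hour, the same reading can be normal during business hours and anomalous overnight, which is exactly the behavior deviation thresholds need.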

Operational Scenarios

Scenario 1: Within Saturation Limits

When mean shifts within max/min bounds:

  • Operating mean adjusts for seasonality
  • Deviation alerts detect anomalies
  • Investigation required for deviations
  • No auto-resolution permitted

Scenario 2: Saturation Breach

When capacity limits are exceeded:

  • P0 (Red): Immediate action required
  • Orange: Plan intervention within 48 hours
  • Blue: Schedule optimization review

Scenario 3: Mean Deviation

When operating mean deviates but saturation limits aren't breached:

  • Alert triggers based on deviation thresholds
  • Requires investigation despite capacity being okay
  • Must be resolved through intervention

Best Practices

Alert Design

  • Use anomaly detection for mean shifts
  • Account for seasonal changes
  • Set appropriate evaluation periods

Threshold Setting

  • Based on business impact
  • Consider intervention windows
  • Align with operational capacity

Monitoring Strategy

  • Prioritize saturation alerts
  • Balance sensitivity vs. noise
  • Regular threshold review

Implementation Checklist for Senior Leaders and SRE Leads

Strategic Planning

  • Form alert governance team
  • Define SLOs and error budgets
  • Establish alert implementation audit & review process
  • Create alert tuning framework

Technical Implementation

  • Audit existing alerts
  • Configure new thresholds
  • Set up alert routing
  • Implement runbooks

Operational Readiness

  • Train response teams
  • Document escalation paths
  • Set up on-call rotations
  • Create feedback loops

Monitoring and Optimization

  • Track alert metrics
  • Measure response times
  • Monitor false positive rates
  • Regular threshold reviews

Culture and Process

  • Clear ownership model
  • Regular team reviews
  • Continuous improvement process
  • Knowledge sharing and an implementation standardization framework

How an SRE Automation Product Can Help: The Six-Step Reliability Framework

1. Automated Discovery

  • Integrates seamlessly with existing tooling (AWS, Azure, Google Cloud, Datadog, Splunk, etc.)
  • Automatically discovers all infrastructure components and services
  • Creates comprehensive resource inventory without manual intervention
  • Establishes baseline performance patterns for intelligent thresholding

2. Alert Coverage Audit

  • Analyzes current alert coverage using golden templates
  • Generates an ALCOM (alert comprehensiveness) score
  • Identifies gaps in monitoring coverage
  • Provides actionable recommendations
  • Evaluates alert quality and noise levels
  • Detects redundant and non-actionable alerts

3. Automated Protection with Advanced Anomaly Detection

  • Auto-applies standardized alert templates to new resources
  • Ensures consistent monitoring across all services
  • Implements best practices automatically
  • Maintains coverage as infrastructure grows
  • Deploys intelligent thresholds based on historical patterns
  • Distinguishes between normal and abnormal variations
  • Adapts to seasonal and business patterns automatically
  • Implements control-by-exception principle for alert generation
  • Enforces mandatory response protocols

4. Service Mapping

  • Creates relationships between infrastructure components and business services
  • Maps APIs and services to responsible teams
  • Enables contextual alerting and routing
  • Improves incident response accuracy
  • Groups related alerts to reduce noise
  • Provides topology-aware alert correlation

5. AI-Assisted Recovery

  • Generates context-aware runbooks
  • Provides AI-driven troubleshooting assistance
  • Correlates alerts across tools
  • Accelerates incident resolution
  • Predicts potential failures before they occur
  • Offers automated remediation suggestions
  • Performs root cause analysis using ML
  • Reduces alert fatigue through intelligent suppression

6. Governance & Analytics

  • Delivers real-time reporting on coverage, MTTR, and SLOs
  • Enforces standardization across teams
  • Tracks improvement metrics
  • Enables data-driven reliability decisions
  • Monitors alert effectiveness and noise levels
  • Provides alert quality metrics and trends
  • Measures alert fatigue impact on teams
  • Ensures adherence to alert design principles
  • Audits alert actionability and response patterns

Quick Reference: Key Success Metrics

Alert Volume Metrics

  • Total alerts per day
  • Auto-resolving alert count
  • Alert distribution by severity

Response Metrics

  • Mean Time To Acknowledge (MTTA)
  • Mean Time To Resolve (MTTR)
  • Escalation frequency

Quality Metrics

  • False positive rate
  • Alert noise ratio
  • Missed detection rate
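
Most alerting tools can export alert history, from which these metrics can be derived. The sketch below assumes a simple, hypothetical record format (`fired_at`, `acknowledged_at`, `resolved_at`, `auto_resolved`, `actionable`); real exports will differ, but the calculations are the same.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional

@dataclass
class AlertRecord:
    fired_at: datetime
    acknowledged_at: Optional[datetime]  # None if never acknowledged
    resolved_at: Optional[datetime]      # None if still open
    auto_resolved: bool                  # closed without any intervention
    actionable: bool                     # did the alert lead to a real intervention?

def summarize(alerts: list[AlertRecord]) -> dict[str, float]:
    acked = [a for a in alerts if a.acknowledged_at]
    resolved = [a for a in alerts if a.resolved_at]
    return {
        "total_alerts": len(alerts),
        "auto_resolve_rate": sum(a.auto_resolved for a in alerts) / len(alerts),
        "false_positive_rate": sum(not a.actionable for a in alerts) / len(alerts),
        "mtta_minutes": mean((a.acknowledged_at - a.fired_at).total_seconds() / 60 for a in acked),
        "mttr_minutes": mean((a.resolved_at - a.fired_at).total_seconds() / 60 for a in resolved),
    }

if __name__ == "__main__":
    t0 = datetime(2025, 1, 1, 9, 0)
    alerts = [
        AlertRecord(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=35), False, True),
        AlertRecord(t0, None, t0 + timedelta(minutes=10), True, False),  # auto-resolved noise
        AlertRecord(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=20), False, True),
    ]
    for name, value in summarize(alerts).items():
        print(f"{name}: {value:.2f}")
```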

References

1. Holmwood, L. (2014). "Cardiac Alarms and Ops". Retrieved from https://fractio.nl/2014/08/26/cardiac-alarms-and-ops/

2. Google SRE Book - Chapter 6: Monitoring Distributed Systems.

3. Understanding control charting and anomaly detection https://youtu.be/Ugcb7Vlp0Ts?feature=shared

4. Anomaly detection in observability tools (vendor examples).

5. Temperstack feature walkthrough: Temperstack demo, November 2024.

6. https://www.qualitydigest.com/inside/six-sigma-article/using-control-charts-software-applications-071519.html

7. https://www.temperstack.com/blog/the-lost-art-of-control-when-observability-masks-our-reliability-crisis-5-min-read

8. https://www.temperstack.com/blog/the-lost-art-of-control-points-what-it-can-learn-from-manufacturing-floors

9. Temperstack Capabilities and impact https://drive.google.com/file/d/15P9LbQLg7RfYMZwwdqYx3d2bK81OUovM/view?usp=sharing

About the Author

Mohan Narayanaswamy Natarajan is a technology executive and entrepreneur with over 20 years of experience in operations and systems management. His unique perspective comes from implementing process control systems at ITC's food processing facilities, where he learned the fundamentals of quality control and automated monitoring, and later at Amazon, where he helped build reliability mechanisms at scale. As co-founder of Temperstack, he focuses on bringing manufacturing-grade reliability to IT through SRE process automation.
