Alert Fatigue & the Reliability Crisis
In our recent series exploring manufacturing reliability principles, we traced the journey from understanding foundational principles to implementing control points to processing signals effectively. Now it's time to move beyond theory to practical implementation.
Organizations commonly face:
- Alert storms during incidents that paralyze response teams
- High volumes of auto-resolving alerts that train operators to ignore notifications
- Unclear alert ownership leading to delayed responses
- Alert fatigue causing teams to miss critical issues
- Inconsistent alert configuration across different services
- Missing alerts, with no mechanism to audit alert deployment and active state
- Inconsistent alert policy application
The Root Cause
The challenge isn't a lack of observability data or sophisticated tools. Instead, we've fallen into three fundamental traps:
- The belief that more observability will automatically deliver better reliability
- The illusion that abundant signals and telemetry equal understanding
- Too much time and attention spent on technology rather than on the fundamentals of process control
Infrastructure & Services Monitoring as a Process Control System: The Manufacturing Parallel
Infrastructure performance can be understood as a process control problem, similar to quality control in food processing. Just as a dairy plant must ensure milk quality from input to output, your infrastructure must process code efficiently and reliably. Bad input code, like contaminated milk, can compromise the entire system - no amount of processing will make it safe for consumption. Similarly, even good code, like quality milk, must be processed correctly to maintain its integrity.
Core Process Control Metrics (based on Google SRE's Four Golden Signals)
Latency (Processing Time)
- This is like cycle time in manufacturing
- Every request must be processed within specific time bounds
- For example, each API call should complete within 200ms
- Just as a manufacturing process must maintain consistent cycle times, your infrastructure must maintain consistent response times
Traffic (Production Rate)
- This represents your throughput - how many requests you're handling
- Like a production line's units per hour
- For example: 1000 requests per second
- Your infrastructure, like any process, has a designed throughput capacity
Errors (Quality Issues)
- These are your defects - failed requests
- Like rejected products in manufacturing
- For example: 0.1% error rate threshold
- Must stay below acceptable limits for the process to be "in control"
Saturation (Machine Utilization)
- This is your resource utilization level
- Like how close a machine is running to its maximum capacity
- For example: CPU at 80% utilization
- When you hit 100%, like any machine, you can't process more work
Process Control Parameters
Process Control in this context means:
- Maintaining latency within acceptable bounds
- Handling designed traffic levels
- Keeping errors below threshold
- Operating at efficient but safe saturation levels
For example, if your application needs to:
- Process requests within 200ms (latency)
- Handle 1000 requests/second (traffic)
- Keep errors below 0.1% (quality)
- Run at 70% saturation (efficiency)
Your process is "in control" when all these parameters are met.
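To make this concrete, here is a minimal sketch in Python that checks one set of readings against the example targets above. The class and field names are illustrative only, not any specific monitoring tool's API.

```python
# Minimal sketch: is the process "in control" against the example targets?
# Thresholds and names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    latency_ms: float     # observed request latency
    traffic_rps: float    # observed requests per second
    error_rate: float     # fraction of failed requests
    saturation: float     # resource utilization, 0.0 - 1.0

@dataclass
class ControlLimits:
    max_latency_ms: float = 200.0
    max_traffic_rps: float = 1000.0
    max_error_rate: float = 0.001   # 0.1%
    max_saturation: float = 0.70    # 70%

def out_of_control(observed: GoldenSignals, limits: ControlLimits) -> list[str]:
    """Return the parameters that breach their limits; an empty list means 'in control'."""
    breaches = []
    if observed.latency_ms > limits.max_latency_ms:
        breaches.append("latency")
    if observed.traffic_rps > limits.max_traffic_rps:
        breaches.append("traffic")
    if observed.error_rate > limits.max_error_rate:
        breaches.append("errors")
    if observed.saturation > limits.max_saturation:
        breaches.append("saturation")
    return breaches

print(out_of_control(GoldenSignals(180, 950, 0.0005, 0.65), ControlLimits()))  # [] -> in control
print(out_of_control(GoldenSignals(250, 950, 0.002, 0.65), ControlLimits()))   # ['latency', 'errors']
```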
Detailed Process Parameters: The Dairy Plant Parallel
Imagine a dairy plant's pasteurization unit processing milk. Just as milk must be heated to exactly 72°C (161°F) for 15-20 seconds, your CPU must process requests within specific performance parameters.
Core Components and Their IT Equivalents
Processing Unit Comparison
- Dairy Plant: A pasteurizer rated for 10,000 liters/hour at optimal temperature
- IT Equivalent: An 8-core CPU rated at 100% utilization per core
- Capacity Parallel: Both have finite processing capacity that affects quality when exceeded
Process Parameters and Golden Metrics
Processing Time (Latency)
- Dairy Plant: Time milk stays at 72°C (must be 15-20 seconds)
- IT Equivalent: Time CPU takes to complete a request
- Control Factor: CPU time per request must stay under 100ms
- Impact: Just as underheated milk is unsafe, slow CPU response makes applications unusable
Production Rate (Traffic)
- Dairy Plant: Current flow rate (e.g., 8,000 liters/hour)
- IT Equivalent: Requests processed per second
- Control Factor: Request volume that CPU must handle
- Impact: Like milk backing up in pipes, requests queue when volume exceeds capacity
Quality Issues (Errors)
- Dairy Plant: Batches failing temperature standards
- IT Equivalent: Failed request processing due to CPU constraints
- Control Factor: Error rate must stay below 0.1%
- Impact: Both result in service failure
Machine Utilization (Saturation)
- Dairy Plant: Pasteurizer running at 80% of max capacity
- IT Equivalent: CPU running at 80% utilization
- Control Factor: Operating efficiency vs. maximum capacity
- Impact: Both systems degrade rapidly when approaching 100% utilization
Process Control and Variations
Normal Variation
- Dairy Plant: Temperature fluctuating 71.5°C to 72.5°C
- IT Equivalent: CPU utilization varying between 40-60%
- Impact: Expected variation, process remains in control
- Example: CPU utilization increasing during business hours
Abnormal Variation
- Dairy Plant: Temperature suddenly dropping to 70°C
- IT Equivalent: CPU suddenly spiking to 85%
- Impact: Requires immediate investigation
- Example: Memory leak causing unexpected CPU spikes
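A minimal sketch of this distinction, reusing the 40-60% band and the 85% spike from the bullets above; the band values and function name are illustrative assumptions rather than recommended settings.

```python
# Minimal sketch: classify a CPU sample as normal or abnormal variation
# against an expected operating band. Band values are illustrative.
def classify_cpu_sample(cpu_pct: float, band_low: float = 40.0, band_high: float = 60.0) -> str:
    if band_low <= cpu_pct <= band_high:
        return "normal variation - process in control, no action"
    return "abnormal variation - investigate immediately"

for sample in [44, 52, 58, 47, 85]:   # the last sample is the sudden spike
    print(sample, "->", classify_cpu_sample(sample))
```

In practice the band itself should come from observed history, with seasonal adjustment, as discussed later in this post.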
Understanding Downtime
Capacity-Related Downtime
- Dairy Plant: Trying to process 12,000 liters/hour through a 10,000 liters/hour pasteurizer
- IT Equivalent: Running at 95% CPU utilization with increasing load
- Result: Complete service failure
- Example: Black Friday traffic exceeding CPU capacity
Process-Related Downtime
- Dairy Plant: Heating element malfunction causing temperature variations
- IT Equivalent: Application bug causing CPU thrashing
- Result: Service degradation before capacity is reached
- Example: Infinite loop in code causing CPU spikes
Anomaly Detection and Prevention
Early Warning Indicators
- Dairy Plant: Temperature trending upward over hours
- IT Equivalent: CPU utilization trending up over days
- Value: Allows intervention before failure
- Example: CPU trending up 5% daily indicates growing problem
Capacity Planning
- Dairy Plant: Adding second pasteurizer at 80% sustained utilization
- IT Equivalent: Adding CPU cores at 80% sustained utilization
- Goal: Maintain headroom for spikes
- Example: Scaling up instance size before holiday season
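A minimal sketch of such an early-warning check, assuming daily CPU utilization samples and an illustrative 80% limit: it fits a simple least-squares trend and estimates how many days of headroom remain.

```python
# Minimal sketch: flag a slow upward trend before the saturation limit is hit.
# The 80% limit and the sample data are illustrative assumptions.
from statistics import linear_regression

def days_until_breach(daily_cpu_pct: list[float], limit_pct: float = 80.0) -> float | None:
    """Estimate days until the fitted trend crosses the limit; None if not trending upward."""
    days = list(range(len(daily_cpu_pct)))
    slope, _intercept = linear_regression(days, daily_cpu_pct)
    if slope <= 0:
        return None
    return (limit_pct - daily_cpu_pct[-1]) / slope

history = [50, 55, 61, 65, 70]        # trending up roughly 5% per day, as in the example above
print(days_until_breach(history))     # ~2 days of headroom before the 80% limit
```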
Process Control Success Criteria
Optimal Operation
- Dairy Plant: Pasteurizer running at 60-70% capacity, maintaining 72°C
- IT Equivalent: CPU running at 40-60%, maintaining sub-100ms latency
- Indicator: Stable, predictable performance
- Example: CPU handling daily peak loads without issues
Risk Indicators
- Dairy Plant: Temperature control becoming erratic
- IT Equivalent: CPU utilization becoming erratic
- Warning Signs: Increasing variation in metrics
- Example: CPU showing random spikes during normal operations
Statistical Process Control in Modern IT Operations
Learning from Manufacturing
The principles of Statistical Process Control (SPC), which revolutionized manufacturing quality control, are already embedded in your modern IT observability tools - you just might not recognize them. When your observability platform alerts you about "anomalies," it's applying the same fundamental concepts that dairy plants use to maintain consistent product quality.
Key Insights from Manufacturing to IT
No Need for New Tools
- Your existing observability platforms (like Datadog, New Relic, or Grafana) already have built-in anomaly detection capabilities that mirror traditional SPC control charts
- These tools are doing the complex statistical calculations behind the scenes, just like in manufacturing
Focus on Fundamentals, Not Complex Statistics
- As demonstrated in the referenced study, you don't need deep statistical knowledge to apply SPC effectively
- Understanding variation and the use of control charts does not require understanding of probabilities, normal distribution, binomial distribution, or any other probability distribution
- Your observability tools handle this complexity automatically
Real-World Success Stories
- A software team reduced incidents from 8.67 to 4.5 per week by applying basic SPC principles
- This was achieved through:
- Distinguishing between normal system variation and actual problems
- Avoiding overreaction to regular fluctuations
- Identifying true special causes that needed intervention
Practical Application in Modern IT
Common Cause vs Special Cause
- When your observability tool shows an "anomaly," it's identifying what SPC calls a "special cause variation"
- Regular performance fluctuations within expected bounds are "common cause variation"
Process Improvement Goals
- Reduce false alerts by understanding normal system behavior
- Focus efforts on true anomalies
- Validate improvements through sustained performance changes
Implementation Approach
- Use your existing observability tools' anomaly detection features
- Apply manufacturing-proven SPC principles to interpret the data
- Focus on understanding system behavior rather than statistical complexity
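For illustration, here is a minimal sketch of the Shewhart-style rule behind much of this: control limits are derived from a known-good baseline window, and points outside mean ± 3σ are treated as special cause variation. Observability platforms do considerably more than this behind the scenes; the sketch only shows the core idea, with illustrative data.

```python
# Minimal sketch: common cause vs. special cause using +/- 3 sigma control
# limits computed from a known-good baseline. Data is illustrative.
from statistics import mean, stdev

def control_limits(baseline: list[float]) -> tuple[float, float]:
    m, s = mean(baseline), stdev(baseline)
    return m - 3 * s, m + 3 * s

def classify(value: float, lower: float, upper: float) -> str:
    return "common cause (expected variation)" if lower <= value <= upper else "special cause (investigate)"

baseline_latency_ms = [98, 102, 101, 97, 103, 99, 100, 101]
lower, upper = control_limits(baseline_latency_ms)
for value in [104, 96, 240]:
    print(value, "->", classify(value, lower, upper))
# 104 and 96 fall inside the limits; 240 falls outside and is flagged
```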
Design of Monitoring Strategy: Core Tenets of Process Control through an Alerting System
Control by Exception
- Only alert on actual anomalies requiring human intervention
- Every alert must drive a specific action
- No alert should auto-resolve
Mandatory Response Protocol
- Every alert requires either manual intervention or automated remediation
- No alert should be ignored or auto-resolved
- Clear escalation paths for each alert type
Human-Centric Design
- Alert volume must match human capacity
- Critical alerts must be distinguishable
- Each alert requires clear next actions
- Response procedures must be documented
- Audit implementation and coverage
Design of an Alert: Foundational Tenets
Control by Exception
- You only intervene when a process exceeds defined limits
Complete Alert Definition
Every alert must have these three components:
- Operating mean
- Threshold deviation limits
- Evaluation period
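A minimal sketch of a complete alert definition with these three components, assuming one reasonable firing rule: the metric must stay outside the deviation band for the full evaluation period before the alert fires. Names and values are illustrative.

```python
# Minimal sketch: an alert definition with its three required components and
# a firing rule that only triggers on sustained deviation. Values are illustrative.
from dataclasses import dataclass

@dataclass
class AlertDefinition:
    operating_mean: float       # expected value of the metric
    deviation_limit: float      # allowed deviation from the operating mean
    evaluation_period_s: int    # how long a breach must persist before firing

def should_fire(defn: AlertDefinition, samples: list[float], sample_interval_s: int) -> bool:
    """Fire only if every sample in the evaluation window is outside the deviation band."""
    needed = defn.evaluation_period_s // sample_interval_s
    window = samples[-needed:]
    if len(window) < needed:
        return False
    return all(abs(s - defn.operating_mean) > defn.deviation_limit for s in window)

cpu_alert = AlertDefinition(operating_mean=50.0, deviation_limit=15.0, evaluation_period_s=300)
print(should_fire(cpu_alert, [52, 55, 71, 48, 53], sample_interval_s=60))   # False: transient blip
print(should_fire(cpu_alert, [70, 72, 74, 71, 73], sample_interval_s=60))   # True: sustained deviation
```

A definition tuned this way is also less likely to produce the auto-resolving alerts the next tenet warns against.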
No Auto-Resolution
- An alert that auto-resolves is defective
- It indicates incorrect threshold or evaluation period
- Such alerts create noise and must be eliminated
Mandatory Intervention
Every alert must require either:
- Manual intervention
- Automated remediation
Process Variation Detection
- Set up anomaly detection
- Define control limits
- Eliminate alerts that fire for normal variation
Capacity Monitoring
- Alerts for process control and for capacity saturation are often not differentiated, but they serve different purposes
- Process control looks at short-term variation within the operating band
- Capacity saturation looks at long-term trends toward the point where capacity bands will be breached
Total Control Elements Per Metric
Eleven distinct elements are needed per metric:
Saturation Limits (3):
- Red (P0) limit
- Orange (Warning) limit
- Blue (Underutilization) limit
Evaluation Periods (3):
- Red alert evaluation period
- Orange alert evaluation period
- Blue alert evaluation period
Sampling Periods (3):
- Red alert sampling frequency
- Orange alert sampling frequency
- Blue alert sampling frequency
Additional Elements (2):
- Operating Mean (1): Baseline with seasonal adjustment
- Deviation Control (1): Threshold from operating mean with its evaluation period and sampling frequency
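One way to express these eleven elements is as a single configuration object per metric, sketched below. The field names and numbers are illustrative assumptions, not any particular monitoring product's schema.

```python
# Minimal sketch: the eleven control elements per metric as configuration.
# Three bands (limit + evaluation period + sampling period each), plus the
# operating mean and a deviation control. Values are illustrative.
from dataclasses import dataclass

@dataclass
class BandConfig:
    limit: float                # saturation or deviation limit for this band
    evaluation_period_s: int    # how long a breach must persist
    sampling_period_s: int      # how often the metric is sampled

@dataclass
class MetricControlConfig:
    red: BandConfig                 # P0: critical saturation
    orange: BandConfig              # warning: approaching saturation
    blue: BandConfig                # underutilization / cost optimization
    operating_mean: float           # baseline, with seasonal adjustment applied upstream
    deviation_control: BandConfig   # threshold from the operating mean, with its own periods

cpu_config = MetricControlConfig(
    red=BandConfig(limit=90.0, evaluation_period_s=300, sampling_period_s=60),
    orange=BandConfig(limit=80.0, evaluation_period_s=3600, sampling_period_s=300),
    blue=BandConfig(limit=20.0, evaluation_period_s=7 * 24 * 3600, sampling_period_s=3600),
    operating_mean=55.0,
    deviation_control=BandConfig(limit=15.0, evaluation_period_s=900, sampling_period_s=60),
)
print(cpu_config.orange.limit)   # 80.0 -> plan intervention when sustained above this
```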
Implementation Framework
A. Saturation Control (Static Limits)
Needed to prevent downtime:
Red Alert (P0)
- Critical saturation breach
- Requires immediate rescue
- No delay tolerated
Orange Alert (Warning)
- Approaching saturation
- 24-48 hour intervention window
- Allows planned response
Blue Alert (Optimization)
- Resource underutilization
- Weekly/monthly review
- Cost optimization focus
B. Deviation Control (Dynamic Limits)
Needed to detect anomalies and get to root causes before saturation limits are breached
Operating Mean
- Adapts to seasonality
- Accounts for time-based variations
- Reflects normal business cycles
Deviation Thresholds
- Set around operating mean
- Must consider seasonal patterns
- Triggers on anomalous behavior
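A minimal sketch of a seasonally adjusted operating mean, using an hour-of-day baseline as the simplest form of seasonal adjustment; the observations and deviation limit are illustrative.

```python
# Minimal sketch: an operating mean computed per hour of day, so the deviation
# threshold follows the normal business cycle. Data and limits are illustrative.
from collections import defaultdict
from statistics import mean

def hourly_operating_mean(history: list[tuple[int, float]]) -> dict[int, float]:
    """history holds (hour_of_day, metric_value) observations."""
    by_hour: dict[int, list[float]] = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: mean(values) for hour, values in by_hour.items()}

def is_anomalous(hour: int, value: float, baseline: dict[int, float], deviation_limit: float) -> bool:
    return abs(value - baseline[hour]) > deviation_limit

history = [(10, 62), (10, 65), (10, 63), (3, 21), (3, 19), (3, 20)]
baseline = hourly_operating_mean(history)
print(is_anomalous(3, 45, baseline, deviation_limit=15))    # True: 45% CPU at 3 AM is unusual
print(is_anomalous(10, 66, baseline, deviation_limit=15))   # False: normal business-hours load
```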
Operational Scenarios
Scenario 1: Within Saturation Limits
When mean shifts within max/min bounds:
- Operating mean adjusts for seasonality
- Deviation alerts detect anomalies
- Investigation required for deviations
- No auto-resolution permitted
Scenario 2: Saturation Breach
When capacity limits exceeded:
- P0 (Red): Immediate action required
- Orange: Plan intervention within 48 hours
- Blue: Schedule optimization review
Scenario 3: Mean Deviation
When operating mean deviates but saturation limits aren't breached:
- Alert triggers based on deviation thresholds
- Requires investigation despite capacity being okay
- Must be resolved through intervention
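A minimal sketch that maps a single reading onto these three scenarios, reusing the illustrative limits from the earlier configuration sketch; the thresholds are assumptions, not recommendations.

```python
# Minimal sketch: route a reading to the right operational scenario.
# All limits are illustrative.
def classify_scenario(value: float, operating_mean: float, deviation_limit: float,
                      red_limit: float, orange_limit: float, blue_limit: float) -> str:
    if value >= red_limit:
        return "Scenario 2 (Red/P0): saturation breach - act immediately"
    if value >= orange_limit:
        return "Scenario 2 (Orange): approaching saturation - plan intervention within 48 hours"
    if value <= blue_limit:
        return "Scenario 2 (Blue): underutilization - schedule optimization review"
    if abs(value - operating_mean) > deviation_limit:
        return "Scenario 3: mean deviation - investigate despite capacity headroom"
    return "Scenario 1: within limits - process in control"

for reading in [55, 75, 92, 10]:
    print(reading, "->", classify_scenario(reading, operating_mean=55, deviation_limit=15,
                                           red_limit=90, orange_limit=80, blue_limit=20))
```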
Best Practices
Alert Design
- Use anomaly detection for mean shifts
- Account for seasonal changes
- Set appropriate evaluation periods
Threshold Setting
- Based on business impact
- Consider intervention windows
- Align with operational capacity
Monitoring Strategy
- Prioritize saturation alerts
- Balance sensitivity vs. noise
- Regular threshold review
Implementation Checklist for Senior Leaders and SRE Leads
Strategic Planning
- Form alert governance team
- Define SLOs and error budgets
- Establish alert implementation audit & review process
- Create alert tuning framework
Technical Implementation
- Audit existing alerts
- Configure new thresholds
- Set up alert routing
- Implement runbooks
Operational Readiness
- Train response teams
- Document escalation paths
- Set up on-call rotations
- Create feedback loops
Monitoring and Optimization
- Track alert metrics
- Measure response times
- Monitor false positive rates
- Regular threshold reviews
Culture and Process
- Clear ownership model
- Regular team reviews
- Continuous improvement process
- Knowledge sharing and implementation standardisation framework
How an SRE automation product can help: The Six-Step Reliability Framework
1. Automated Discovery
- Integrates seamlessly with existing tooling (AWS, Azure, Google Cloud, Datadog, Splunk, etc.)
- Automatically discovers all infrastructure components and services
- Creates comprehensive resource inventory without manual intervention
- Establishes baseline performance patterns for intelligent thresholding
2. Alert Coverage Audit
- Analyzes current alert coverage using golden templates
- Generates an ALCOM (Alert Comprehensiveness) score
- Identifies gaps in monitoring coverage
- Provides actionable recommendations
- Evaluates alert quality and noise levels
- Detects redundant and non-actionable alerts
3. Automated Protection with Advanced Anomaly Detection
- Auto-applies standardized alert templates to new resources
- Ensures consistent monitoring across all services
- Implements best practices automatically
- Maintains coverage as infrastructure grows
- Deploys intelligent thresholds based on historical patterns
- Distinguishes between normal and abnormal variations
- Adapts to seasonal and business patterns automatically
- Implements control-by-exception principle for alert generation
- Enforces mandatory response protocols
4. Service Mapping
- Creates relationships between infrastructure components and business services
- Maps APIs and services to responsible teams
- Enables contextual alerting and routing
- Improves incident response accuracy
- Groups related alerts to reduce noise
- Provides topology-aware alert correlation
5. AI-Assisted Recovery
- Generates context-aware runbooks
- Provides AI-driven troubleshooting assistance
- Correlates alerts across tools
- Accelerates incident resolution
- Predicts potential failures before they occur
- Offers automated remediation suggestions
- Performs root cause analysis using ML
- Reduces alert fatigue through intelligent suppression
6. Governance & Analytics
- Delivers real-time reporting on coverage, MTTR, and SLOs
- Enforces standardization across teams
- Tracks improvement metrics
- Enables data-driven reliability decisions
- Monitors alert effectiveness and noise levels
- Provides alert quality metrics and trends
- Measures alert fatigue impact on teams
- Ensures adherence to alert design principles
- Audits alert actionability and response patterns
Quick Reference: Key Success Metrics
Alert Volume Metrics
- Total alerts per day
- Auto-resolving alert count
- Alert distribution by severity
Response Metrics
- Mean Time To Acknowledge (MTTA)
- Mean Time To Resolve (MTTR)
- Escalation frequency
Quality Metrics
- False positive rate
- Alert noise ratio
- Missed detection rate
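These metrics are straightforward to compute from alert history. Below is a minimal sketch assuming a flat list of alert records; the record fields (fired_at, acknowledged_at, resolved_at, actionable) are illustrative, not any particular tool's export format.

```python
# Minimal sketch: compute MTTA, MTTR, and false positive rate from alert records.
# Field names and data are illustrative assumptions.
from dataclasses import dataclass
from statistics import mean

@dataclass
class AlertRecord:
    fired_at: float          # epoch seconds
    acknowledged_at: float
    resolved_at: float
    actionable: bool         # did it require real intervention?

def summarize(alerts: list[AlertRecord]) -> dict[str, float]:
    return {
        "mtta_s": mean(a.acknowledged_at - a.fired_at for a in alerts),
        "mttr_s": mean(a.resolved_at - a.fired_at for a in alerts),
        "false_positive_rate": sum(not a.actionable for a in alerts) / len(alerts),
    }

alerts = [
    AlertRecord(fired_at=0, acknowledged_at=120, resolved_at=1800, actionable=True),
    AlertRecord(fired_at=0, acknowledged_at=300, resolved_at=5400, actionable=True),
    AlertRecord(fired_at=0, acknowledged_at=60, resolved_at=60, actionable=False),  # noise
]
print(summarize(alerts))
```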
References
1. Holmwood, L. (2014). "Cardiac Alarms and Ops". Retrieved from https://fractio.nl/2014/08/26/cardiac-alarms-and-ops/
2. Google SRE Book - Chapter 6: Monitoring Distributed Systems.
3. Understanding control charting and anomaly detection. https://youtu.be/Ugcb7Vlp0Ts?feature=shared
4. Anomaly detection in observability tools: examples.
5. Temperstack feature walkthrough: Temperstack demo, November 2024.
6. https://www.qualitydigest.com/inside/six-sigma-article/using-control-charts-software-applications-071519.html
7. https://www.temperstack.com/blog/the-lost-art-of-control-when-observability-masks-our-reliability-crisis-5-min-read
8. https://www.temperstack.com/blog/the-lost-art-of-control-points-what-it-can-learn-from-manufacturing-floors
9. Temperstack Capabilities and impact https://drive.google.com/file/d/15P9LbQLg7RfYMZwwdqYx3d2bK81OUovM/view?usp=sharing
About the Author
Mohan Narayanaswamy Natarajan is a technology executive and entrepreneur with over 20 years of experience in operations and systems management. His unique perspective comes from implementing process control systems at ITC's food processing facilities, where he learned the fundamentals of quality control and automated monitoring, and later at Amazon, where he helped build reliability mechanisms at scale. As co-founder of Temperstack, he focuses on bringing manufacturing-grade reliability to IT through SRE process automation.