AI & SRE · March 15, 2026 · 9 min read

Adopting AI in SRE, Service Assurance, and Observability

AI · SRE · AIOps · Observability · Service Assurance · Machine Learning

I recently stepped into the role of Team Lead for SRE, Service Assurance, and Observability. One of the first questions I get from leadership, vendors, and peers alike is: "What's your AI strategy?"

It's a fair question. AI is reshaping how we think about operations — from anomaly detection to root cause analysis to predictive maintenance. But having spent years in the trenches of telecom operations, I've learned that the gap between an AI demo and a production-ready operations tool is enormous.

Here's how I'm thinking about AI adoption for my team — where it genuinely helps, where it falls short, and how to get started without drowning in hype.

Where AI Actually Adds Value in SRE and Observability

Not all AI use cases are created equal. Some are transformative. Others are solutions looking for a problem. Here are the areas where I see real, measurable impact:

1. Anomaly Detection at Scale

A network with thousands of devices generates millions of data points per minute. No human team can watch all of it. Traditional threshold-based alerting catches the obvious failures — a link goes down, CPU hits 95% — but misses the subtle ones:

  • A gradual increase in interface errors that precedes a fiber degradation
  • A slow drift in latency across a specific path that signals an OSPF reconvergence loop
  • A pattern of intermittent packet loss that correlates with a specific line card's buffer behavior

Machine learning models — particularly time-series anomaly detection — can baseline normal behavior per interface, per device, per time of day, and flag when something deviates. This isn't replacing alerting; it's augmenting it with a layer that catches what static thresholds miss.

What we're doing: Starting with unsupervised anomaly detection on our top 200 critical interfaces. We're using historical SNMP data to train baselines and flagging deviations as "investigations" rather than hard alerts. The goal is to catch emerging issues before they become customer-impacting events.
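The core of this baselining idea can be sketched in a few lines: keep per-interface, per-hour-of-day history and flag values that sit several standard deviations outside it. This is an illustrative toy, not our production pipeline; the class name, data shape, and z-score threshold are all assumptions for the sketch.

```python
from collections import defaultdict
from statistics import mean, stdev

class InterfaceBaseline:
    """Per-interface, per-hour-of-day baseline over historical samples
    (e.g. SNMP interface error counts). Hypothetical structure for
    illustration; a real pipeline would add seasonality, decay, etc."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.samples = defaultdict(list)  # hour of day -> [values]

    def observe(self, hour, value):
        """Record a historical sample for the given hour of day."""
        self.samples[hour].append(value)

    def is_anomalous(self, hour, value):
        """Flag `value` if it deviates more than `threshold` standard
        deviations from the baseline for that hour."""
        hist = self.samples[hour]
        if len(hist) < 2:
            return False  # not enough history to judge
        mu, sigma = mean(hist), stdev(hist)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > self.threshold
```

In practice you would run one baseline per interface and treat a flag as an "investigation" rather than a page, exactly as described above.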

2. Intelligent Alert Correlation and Noise Reduction

This is, in my opinion, the single highest-ROI application of AI in network operations. Our NOC receives thousands of alarms per day. Many of them are symptoms of the same root cause — a single fiber cut can generate 50+ alarms across different platforms and monitoring tools.

Traditional event correlation uses static rules: "if alarm A and alarm B fire within 60 seconds, group them." This works for known failure patterns but fails for novel ones. ML-based correlation can learn from historical incident data which alarms tend to co-occur, and group them dynamically.

What we're doing: Building a correlation engine that ingests alarms from our EMS platforms, syslog streams, and SNMP traps, and clusters them using temporal and topological proximity. Early results show a 60-70% reduction in alert volume reaching the NOC — that's hours of human attention freed up every day.

3. Predictive Capacity Planning

Traditional capacity planning is reactive — you look at utilization trends, project growth, and plan upgrades when links approach threshold. AI can make this smarter:

  • Forecasting traffic patterns based on historical seasonality, growth trends, and event-driven anomalies
  • Predicting when a link will hit capacity under different growth scenarios
  • Identifying underutilized resources that can be reclaimed or repurposed

This isn't about replacing the network planning team. It's about giving them better forecasts so they can make proactive decisions instead of reacting to capacity crises.

4. Assisted Root Cause Analysis

When an incident occurs, engineers spend significant time correlating logs, metrics, topology data, and recent changes to identify the root cause. AI can accelerate this by:

  • Surfacing similar past incidents and their resolutions
  • Correlating the timeline of events leading up to the failure
  • Highlighting recent changes (configuration pushes, firmware upgrades, maintenance activities) that overlap with the incident window

We're not at a point where AI can autonomously diagnose complex network failures — and I'm skeptical we will be anytime soon. But AI as a "co-pilot" that gathers context and suggests hypotheses? That's already valuable.

Where the Hype Outpaces Reality

I'd be doing a disservice if I didn't call out where AI promises more than it delivers in our domain:

"Fully Autonomous Operations"

The idea of a self-healing network that detects, diagnoses, and remediates issues without human intervention sounds appealing. In reality, the consequences of an incorrect automated action on a production network are severe. Imagine an AI that decides to reroute traffic away from a "failing" core router — except the anomaly was a false positive, and now you've created a traffic black hole.

Automation should be graduated: detect automatically, diagnose with assistance, remediate with human approval. We're nowhere near removing humans from the loop for critical infrastructure.

"Drop-In AIOps Platforms"

Vendors love to sell AIOps platforms that promise to "just work" out of the box. In practice, every network is unique. The topology, the vendor mix, the alarm formats, the operational workflows — all of this requires significant customization. A model trained on generic IT infrastructure data doesn't understand MPLS LSPs, DWDM wavelength paths, or BGP community policies.

The best AI tools are the ones that learn from your data, your incidents, and your network. That takes time and investment.

"AI-Generated Runbooks"

Large language models can generate plausible-looking troubleshooting runbooks, but "plausible" isn't the same as "correct." A runbook that suggests restarting a routing process on a core router without mentioning the impact on downstream BGP sessions is worse than no runbook at all. AI-generated content needs expert review, especially for procedures that touch production infrastructure.

My Framework for AI Adoption

Here's the approach I'm taking with my team:

Start with the Pain, Not the Technology

Don't start with "how can we use AI?" Start with "what's our biggest operational pain point?" For us, that's alert noise. So that's where we're investing first.

Measure Before You Model

AI needs data. Good data. Before building any models, we invested in cleaning up our telemetry pipeline — consistent labeling, proper tagging, accurate topology mapping. A model trained on bad data produces bad results, no matter how sophisticated the algorithm.

Build Trust Incrementally

We're introducing AI outputs as "suggestions" and "investigations," not as actionable alerts. This gives the team time to evaluate the model's accuracy, build trust in its outputs, and provide feedback that improves the model. When accuracy reaches a threshold the team is comfortable with, we'll promote AI-detected anomalies to a higher priority tier.

Invest in Your Team's AI Literacy

AI adoption isn't just a technology challenge — it's a people challenge. My team consists of network engineers and NOC analysts, not data scientists. We're investing in training so the team understands the basics: what a model is doing, what its limitations are, how to interpret confidence scores, and when to trust or override an AI recommendation.

Keep Humans in the Loop

For critical infrastructure, the role of AI is to make humans faster and more informed — not to replace them. Every AI-driven recommendation should be explainable. If a model flags an anomaly, the engineer should be able to see why: which metrics deviated, by how much, and what the historical baseline looks like.

What's Next

Over the coming months, I'll be sharing more about our specific implementations:

  • Alert correlation engine — architecture, data pipeline, and early results
  • Anomaly detection on network telemetry — choosing the right models for time-series infrastructure data
  • Building an AI-literate operations team — training approaches that work for network engineers

AI in SRE and observability is not about replacing the expertise that operations teams have built over years. It's about giving that expertise better tools — tools that can process the scale and complexity of modern networks at a speed humans can't match alone.

The teams that get this right won't be the ones chasing every AI trend. They'll be the ones who start with real problems, measure rigorously, and build trust one use case at a time.


Have thoughts on AI adoption in network operations? I'd love to hear from other SRE and observability practitioners. Reach out on LinkedIn.