Most SRE content on the internet is written from a software engineering perspective — microservices, Kubernetes, CI/CD pipelines, and cloud-native applications. And that makes sense. SRE was born at Google, and it evolved in a world of web-scale software.
But I work in telecom. My "services" are MPLS cores, DWDM transport rings, fixed wireless access networks, and 5G CNFs. My "deployments" are firmware upgrades on core routers during 2 AM maintenance windows. My "customers" notice when their internet drops for 30 seconds during a BGP reconvergence.
So what does SRE mean in this world? Quite a lot, actually.
The Core Principles Still Apply
The fundamental ideas of SRE translate directly to telecom:
1. Define What "Reliable" Means (SLIs and SLOs)
In a software context, you might define an SLI as "the proportion of HTTP requests that return in under 200ms." In a telecom context, SLIs look different but serve the same purpose:
- Core network availability — percentage of time the MPLS core is forwarding traffic without loss
- Alarm-to-acknowledgment time — how quickly the NOC picks up and triages a service-impacting alarm
- Circuit restoration time — how long it takes to restore a failed transport circuit
- Customer-impacting event duration — total minutes of service degradation experienced by end users
The magic of SLIs is that they force you to define what you care about. And SLOs force you to decide how much you care. An SLO of 99.95% availability sounds abstract until you calculate that it means roughly 22 minutes of downtime per month (21.6 minutes in a 30-day month). Suddenly, every incident longer than a few minutes matters.
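The downtime arithmetic above is worth wiring into a small helper so the numbers are never hand-waved. A minimal sketch (assuming a 30-day month; a real tool might use the calendar month or a rolling window):

```python
# Convert an availability SLO into a monthly error budget in minutes.
# Assumes a fixed 30-day month; adjust `days` for other windows.

def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime for a given availability SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo)

print(round(error_budget_minutes(0.9995), 1))  # 21.6 minutes per 30-day month
print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes per 30-day month
```

Running the same function against a few candidate SLOs makes the trade-off concrete before anyone signs up for a target they can't meet.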
2. Measure Everything That Matters
Observability in telecom means watching a lot of things:
- SNMP metrics — interface counters, CPU, memory, environmental sensors across thousands of devices
- Syslog and event streams — alarms from EMS platforms, routing protocol events, hardware warnings
- Flow data — NetFlow/sFlow for traffic analysis, DDoS detection, and capacity planning
- Synthetic checks — probing reachability, latency, and packet loss across the network from the customer's perspective
The challenge isn't collecting data — it's making it useful. A network with 5,000 devices each generating 50 metrics means 250,000 time series. Without proper labeling, tagging, and service mapping, that data is just noise.
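One way to turn that firehose into something queryable is to enrich every raw sample with service-mapping labels at ingest time. A sketch of the idea, where the inventory dict and label names are illustrative, not a real schema:

```python
# Sketch: enrich a raw SNMP metric with inventory labels so 250,000
# time series can be grouped by site, role, and vendor instead of
# being queried one hostname at a time. Inventory data is invented.

INVENTORY = {
    "core-rtr-01": {"site": "dc-east", "role": "mpls-core", "vendor": "cisco"},
    "agg-sw-17":   {"site": "pop-12",  "role": "aggregation", "vendor": "nokia"},
}

def label_metric(device: str, metric: str, value: float) -> dict:
    """Merge device inventory labels into a flat time-series sample."""
    labels = INVENTORY.get(
        device, {"site": "unknown", "role": "unknown", "vendor": "unknown"}
    )
    return {"metric": metric, "device": device, "value": value, **labels}

sample = label_metric("core-rtr-01", "ifInErrors", 42.0)
# Now alerts and dashboards can key on role="mpls-core", not hostnames.
```

The design point: labels come from a single source of truth (the inventory), so renaming a device or re-homing a site doesn't break every query.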
3. Reduce Toil
In the SRE book, "toil" is defined as manual, repetitive, automatable work that scales linearly with service size. In telecom, toil is everywhere:
- Manually acknowledging alarms that could be auto-correlated
- Hand-building device configurations that could be templated
- Copying and pasting metrics into reports that could be auto-generated
- Running the same health checks before and after every maintenance window
Every hour spent on toil is an hour not spent on reliability improvements. The SRE mindset says: if you're doing it more than twice, automate it.
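The templating item above is usually the cheapest win. A minimal sketch of replacing hand-built interface configs with a template, using only the standard library (the config snippet and variable names are illustrative):

```python
# Sketch: render interface configs from a template instead of hand-editing.
# The CLI syntax shown is a generic example, not any one vendor's.
from string import Template

IFACE_TEMPLATE = Template(
    "interface $name\n"
    " description $descr\n"
    " mtu $mtu\n"
    " no shutdown\n"
)

def render_iface(name: str, descr: str, mtu: int = 9100) -> str:
    """Fill the interface template with per-port values."""
    return IFACE_TEMPLATE.substitute(name=name, descr=descr, mtu=mtu)

config = render_iface("TenGigE0/0/0/1", "uplink to agg-sw-17")
```

Even this trivial version eliminates a whole class of typo-driven outages; a real deployment would likely use a fuller templating engine and a per-role variable store.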
Where Telecom SRE Gets Unique
While the principles transfer, the execution differs in a few important ways:
Hardware-Centric Failures
In software SRE, most failures are in code or configuration. In telecom, failures are often physical — a fiber cut, a power supply failure, a line card that overheats. You can't "roll back" a dead optic. This means your incident response playbooks need to account for hardware logistics, spare inventory, and field dispatch — things that don't exist in a cloud-native SRE world.
Multi-Vendor Complexity
A typical ISP core might have Cisco routers, Nokia routers, Ciena DWDM gear, Fortinet firewalls, Sandvine traffic management, and half a dozen EMS platforms. Each vendor has its own alarm format, CLI syntax, SNMP MIBs, and firmware lifecycle. Building unified observability across this landscape is a genuine engineering challenge.
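The usual first step toward unified observability is an alarm normalization layer: every vendor format gets mapped onto one common schema before anything downstream sees it. A sketch, where the field names and vendor payloads are invented for illustration:

```python
# Sketch: normalize vendor-specific alarm records into one common shape
# (device, severity, text). The raw field names are hypothetical.

def normalize_alarm(vendor: str, raw: dict) -> dict:
    """Map a vendor alarm dict onto a common schema."""
    if vendor == "cisco":
        return {"device": raw["hostname"], "severity": raw["sev"], "text": raw["msg"]}
    if vendor == "nokia":
        return {
            "device": raw["node"],
            "severity": raw["severity"].lower(),
            "text": raw["description"],
        }
    # Unknown vendors still produce a well-formed record.
    return {"device": raw.get("source", "unknown"), "severity": "unknown", "text": str(raw)}

a = normalize_alarm("cisco", {"hostname": "core-rtr-01", "sev": "critical", "msg": "LINK-3-UPDOWN"})
b = normalize_alarm("nokia", {"node": "agg-01", "severity": "MAJOR", "description": "port down"})
```

Once everything speaks one schema, correlation, deduplication, and dashboards only have to be built once instead of once per vendor.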
Change Windows, Not Continuous Deployment
In software, you deploy continuously and roll back automatically. In telecom, changes to the core network happen in planned maintenance windows — often at 2 AM — with detailed rollback procedures and NOC coordination. The SRE principle of "reducing risk of change" still applies, but the mechanisms are different: pre-change health checks, staged rollouts across regions, and post-change validation scripts.
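The pre/post-change validation mentioned above can be as simple as snapshotting a set of health counters before the window and diffing them afterward. A sketch (the check names and tolerance scheme are illustrative):

```python
# Sketch: compare pre- and post-change health snapshots and flag any
# check that regressed beyond its allowed tolerance. Keys are examples.

def compare_snapshots(pre: dict, post: dict, tolerances: dict) -> list:
    """Return a list of checks that regressed beyond their tolerance."""
    regressions = []
    for check, before in pre.items():
        after = post.get(check)
        if after is None:
            regressions.append(f"{check}: missing after change")
        elif abs(after - before) > tolerances.get(check, 0):
            regressions.append(f"{check}: {before} -> {after}")
    return regressions

pre  = {"bgp_sessions": 48, "isis_adjacencies": 12, "ospf_neighbors": 6}
post = {"bgp_sessions": 47, "isis_adjacencies": 12, "ospf_neighbors": 6}
issues = compare_snapshots(pre, post, tolerances={"bgp_sessions": 0})
# issues == ["bgp_sessions: 48 -> 47"]
```

An empty list is the "proceed to close the window" signal; anything else triggers the rollback procedure before the window ends, not after the morning's trouble tickets.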
Regulatory and Safety Concerns
Telecom networks carry emergency services (911), regulatory obligations, and critical infrastructure dependencies. An SRE framework in telecom must account for compliance requirements that don't exist for most web applications.
Building an SRE Practice in Telecom: Where to Start
If you're in a telecom or ISP environment and want to adopt SRE practices, here's where I'd start:
Start with alert quality. Before defining SLOs, fix your alerts. If your NOC is drowning in noise, nothing else matters. Deduplicate alarms, tune thresholds, and ensure every alert has a clear action.
Define 3-5 SLIs for your most critical services. Don't try to measure everything at once. Pick your core network, your most important customer segment, or your most problematic platform, and define what "healthy" looks like.
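It helps to express those first few SLIs/SLOs as data rather than prose, so the same definitions can drive both dashboards and alerting. A sketch, where the service names and targets are examples rather than recommendations:

```python
# Sketch: declare SLO targets as data and evaluate them uniformly.
# Service names and targets below are illustrative examples.

SLOS = {
    "mpls-core availability": 0.9995,
    "transport restoration under 4h": 0.99,
    "noc alarm ack under 5min": 0.95,
}

def slo_met(good_events: int, total_events: int, target: float) -> bool:
    """True if the measured good/total ratio meets the SLO target."""
    return total_events > 0 and good_events / total_events >= target

ok = slo_met(good_events=9995, total_events=10000,
             target=SLOS["mpls-core availability"])
```

Keeping targets in one declared structure means the monthly report, the dashboard, and the alert thresholds can never silently disagree about what "healthy" means.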
Build one good dashboard. Not a wall of green dots — a dashboard that tells the story of your service. Customer-impacting events this month. Trending toward capacity limits. Top recurring alarms. This becomes the heartbeat of your SRE practice.
Start post-incident reviews. Even informal ones. The goal isn't blame — it's learning. What happened? What did we detect? What did we miss? What would we do differently?
Automate one thing per sprint. Pick the most painful manual process and automate it. Then pick the next one. Over time, this compounds.
SRE Is Not a Title — It's a Practice
You don't need "SRE" in your job title to do SRE work. If you're measuring service health, reducing alert noise, driving post-incident improvements, and automating toil — you're doing SRE. The title is secondary; the practice is what matters.
In telecom, we've been doing reliability engineering for decades. We just didn't call it that. The SRE framework gives us a shared vocabulary, a structured approach, and a community of practitioners to learn from. It's worth adopting, even if your "services" are measured in gigabit interfaces rather than API endpoints.
Next post: how to actually design SLIs and SLOs for network services. Stay tuned.