Observability · February 28, 2026 · 9 min read

Designing SLIs and SLOs for Network Services

SLIs · SLOs · Observability · Network Engineering · SRE

If SRE has one foundational practice, it's this: define what "reliable" means before you try to make things more reliable. That's what SLIs and SLOs do. They replace gut feelings with data, and they give teams a shared language for talking about service health.

But most SLI/SLO guidance is written for software services — request latency, error rates, throughput. How do you apply these concepts to network infrastructure where your "service" is an MPLS backbone, a DWDM transport ring, or a fixed wireless access layer?

Here's how I approach it.

SLIs: What to Measure

A Service Level Indicator (SLI) is a quantitative measure of some aspect of the service you care about. The key word is care. Not every metric is an SLI. An SLI should reflect something that matters to your users — internal or external.

Good SLIs for Network Services

Availability — Is the service reachable and forwarding traffic?

  • Core network: percentage of time all core routing adjacencies are established and forwarding
  • Transport: percentage of time DWDM circuits are in-service (no alarms, no protection switches)
  • Access: percentage of time access aggregation links are up and passing customer traffic
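The availability SLIs above all reduce to the same computation: the fraction of measurement intervals in which the service was up and forwarding. A minimal sketch, assuming per-minute reachability probes (the function name and sample data are illustrative, not from a real system):

```python
# Sketch: computing an availability SLI from periodic reachability probes.
# Probe cadence and data here are illustrative.

def availability_sli(samples: list[bool]) -> float:
    """Fraction of probe intervals in which the service was up."""
    if not samples:
        raise ValueError("no samples collected")
    return sum(samples) / len(samples)

# One probe per minute for an hour; three failed intervals.
samples = [True] * 57 + [False] * 3
print(f"{availability_sli(samples):.2%}")  # 95.00%
```

The same function works whether the samples come from BFD session state, DWDM alarm history, or synthetic pings — only the data source changes.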

Latency — How fast is the service?

  • Round-trip time between key measurement points (data center to data center, core to access edge)
  • Jitter on latency-sensitive paths (especially relevant for voice/video services)

Packet Loss — Is the service delivering traffic intact?

  • Measured via synthetic probes across the network or via interface error counters
  • Particularly important on long-haul transport and wireless access links

Capacity / Saturation — How close is the service to its limits?

  • Interface utilization trending — are we approaching capacity on critical links?
  • Control plane CPU utilization on core routers
  • BGP table size trending

Detection Speed — How quickly do we know about problems?

  • Time from failure to first alarm (alarm detection latency)
  • Time from alarm to human acknowledgment (MTTA)

Choosing the Right SLIs

Not every metric should be an SLI. Here's my filter:

  1. Does this metric reflect user experience? Interface error counters on a redundant link don't matter to users if traffic has failed over. But packet loss on the only path to a customer site? That's user-facing.

  2. Can we measure it consistently? An SLI you can only measure sometimes isn't useful. Make sure your measurement infrastructure is reliable before committing to an SLI.

  3. Does it drive the right behavior? If a team optimizes for this metric, will the service actually get better? Alarm count is a bad SLI — teams might suppress alarms instead of fixing problems. Alarm actionability (percentage of alarms that required action) is better.

SLOs: Setting Targets

A Service Level Objective (SLO) is a target value for an SLI. It answers the question: "How reliable is reliable enough?"

The Error Budget Concept

This is the most powerful idea in SLO thinking. If your SLO is 99.95% availability, your error budget is 0.05% — roughly 22 minutes of downtime per month. As long as you're within budget, you have room for changes, experiments, and maintenance. When you're burning through budget, you slow down and focus on reliability.

In network engineering terms: if your core network SLO is 99.99% availability (about 4.3 minutes per month), a single incident that takes 10 minutes to resolve has already blown your monthly budget. That's a powerful motivator for investing in faster detection and automated failover.
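The arithmetic behind those numbers is simple enough to sketch. Assuming a 30-day month:

```python
# Sketch: translating an availability SLO into a monthly error budget
# in minutes, matching the 99.99% and 99.95% examples above.

def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Downtime allowed per period while still meeting the SLO."""
    return (1 - slo_percent / 100) * days * 24 * 60

print(round(error_budget_minutes(99.99), 1))  # 4.3
print(round(error_budget_minutes(99.95), 1))  # 21.6
```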

Practical SLO Examples

Here are SLOs I've found useful in telecom environments:

| Service | SLI | SLO |
|---------|-----|-----|
| Core MPLS Network | Forwarding availability | 99.99% (monthly) |
| DWDM Transport | Circuit in-service time | 99.95% (monthly) |
| Access Aggregation | Link availability | 99.9% (monthly) |
| Incident Detection | Time to first alarm | < 5 minutes for P1 events |
| Incident Response | MTTA (acknowledgment) | < 15 minutes for P1 events |
| Incident Resolution | MTTR (restoration) | < 60 minutes for P1 events |
| Alert Quality | Alert actionability rate | > 80% of alerts require action |

Setting the Right Level

The hardest part of SLOs is picking the number. Too aggressive, and you're constantly "failing" even when the service is healthy — which demoralizes teams and undermines trust in the framework. Too lenient, and the SLO doesn't drive improvement.

My approach:

  1. Measure your baseline. Before setting an SLO, measure your actual performance for 3-6 months. If your core network has historically been 99.97% available, setting an SLO of 99.99% is ambitious. Setting it at 99.9% is meaningless.

  2. Start slightly above your baseline. Set the SLO where you need to improve, but not where you need a miracle. This creates productive tension.

  3. Differentiate by criticality. Not every service needs the same SLO. Your core MPLS backbone should have a higher availability target than a lab network.

  4. Review quarterly. SLOs aren't permanent. As you improve, ratchet them up. If a new platform is unstable, adjust expectations while you stabilize it.

Making SLOs Actionable

SLIs and SLOs are useless if they live in a document that nobody reads. They need to be visible, reviewed, and tied to action.

Dashboards

Build an SLO dashboard that shows:

  • Current SLI values vs. SLO targets
  • Error budget remaining this month
  • Trending — are we getting better or worse?
  • Recent events that burned error budget

This dashboard should be the first thing your team looks at in morning standup. It's the heartbeat of your SRE practice.
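The "error budget remaining" panel is just the monthly budget minus the downtime logged so far. A minimal sketch, assuming incident durations are tracked in minutes (the function name and incident data are illustrative):

```python
# Sketch: the "error budget remaining" dashboard panel, fed by incident
# durations for the month. Incident data is illustrative.

def budget_remaining(slo_percent: float, downtime_minutes: float,
                     days: int = 30) -> float:
    """Minutes of error budget left this period (never negative)."""
    budget = (1 - slo_percent / 100) * days * 24 * 60
    return max(0.0, budget - downtime_minutes)

incidents = [2.5, 1.0]  # minutes of downtime logged this month
print(f"{budget_remaining(99.99, sum(incidents)):.2f} min")  # 0.82 min
```

At a 99.99% SLO, two short incidents leave well under a minute of budget — exactly the kind of number that belongs on the standup dashboard.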

Incident Triggers

When you breach an SLO or burn through error budget too fast, that should trigger a response:

  • Yellow zone (50% budget consumed in first half of month): investigate trending, review upcoming changes
  • Red zone (budget exhausted): freeze non-critical changes, focus team on reliability improvements
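The zone logic above is simple enough to automate. A sketch of the classification, with thresholds taken from the yellow/red rules as written (the function and zone names are illustrative):

```python
# Sketch: classifying error-budget state into the zones described above.
# Thresholds follow the yellow/red rules in the text.

def budget_zone(fraction_consumed: float,
                fraction_of_month_elapsed: float) -> str:
    """Map budget burn to a response zone."""
    if fraction_consumed >= 1.0:
        return "red"     # budget exhausted: freeze non-critical changes
    if fraction_consumed >= 0.5 and fraction_of_month_elapsed <= 0.5:
        return "yellow"  # burning too fast: review trending and changes
    return "green"

print(budget_zone(0.6, 0.4))  # yellow
```

A real implementation would use multi-window burn rates rather than a single monthly fraction, but the principle is the same: the faster you burn, the stronger the response.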

Post-Incident Tie-In

Every post-incident review should reference SLO impact:

  • How much error budget did this incident consume?
  • What's our remaining budget for the month?
  • What corrective actions will prevent a repeat?

This connects individual incidents to the bigger picture of service health.

Common Mistakes

A few traps I've seen (and fallen into):

Too many SLIs. Start with 3-5 SLIs per service. You can always add more later. Too many SLIs means nobody knows which ones matter.

Measuring infrastructure instead of service. CPU utilization on a router is an infrastructure metric, not a service SLI. Customers don't care about CPU — they care about whether their traffic is flowing. Measure what customers experience.

No error budget culture. If breaching an SLO has no consequences, it's just a number. The error budget needs to influence decisions — change freezes, reliability investment, and team priorities.

Perfect data syndrome. Don't wait for perfect measurement before defining SLOs. Start with what you can measure today, and improve your instrumentation over time.

Getting Started

If you've never defined SLIs and SLOs for your network, here's a simple starting plan:

  1. Pick your single most critical service (probably your core network)
  2. Define 3 SLIs: availability, latency, packet loss
  3. Measure your baseline for 30 days
  4. Set initial SLOs slightly above your baseline
  5. Build one dashboard showing SLI values vs. SLO targets
  6. Review weekly with your team

That's it. You can get more sophisticated later — composite SLIs, tiered SLOs, automated error budget tracking. But this foundation is enough to start driving meaningful improvement.

The goal isn't perfection. The goal is a shared, measurable definition of what "reliable" means — and a framework for getting closer to it every month.


This is part of a series on SRE in telecom. Previous post: What SRE Means in a Telecom/ISP Context.