Product Manager and SLIs/SLOs/SLAs. Part 2: is the service healthy or not?
In the previous post, we discussed what Service Level Indicators (SLIs) are and why a PM should care. Let's now talk about SLOs and SLAs, how they differ from SLIs, and what the PM's role is there.
Why SLIs are not enough
We defined a metric (SLI), but it doesn't answer the question, “Is the business process healthy?” For that, we need an SLO (Service Level Objective): the health threshold of the metric. That is, an SLI is a metric, and an SLO is a “healthy” target for it.
For example, you can measure that 96% of requests sent to the Weather Forecast API are successful, with 4% returning errors. If the SLO for this metric is set to 95% (96 > 95), then all good: the service is healthy (even though not perfect!). If it is set to 99%, then nope, the service is not feeling good (96 < 99).
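Here is a minimal Python sketch of the SLI-versus-SLO comparison from the Weather Forecast API example. The function names and the request counts (960 successful out of 1,000) are assumptions made up for illustration; only the 96%, 95% and 99% figures come from the example.

```python
def sli_success_rate(successful: int, total: int) -> float:
    """SLI: share of successful requests, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successful / total


def is_healthy(sli_percent: float, slo_percent: float) -> bool:
    """The service is healthy when the measured SLI meets or beats the SLO."""
    return sli_percent >= slo_percent


# Hypothetical traffic: 960 successful requests out of 1,000 gives an SLI of 96%.
sli = sli_success_rate(successful=960, total=1000)

print(is_healthy(sli, slo_percent=95.0))  # 96 >= 95: healthy
print(is_healthy(sli, slo_percent=99.0))  # 96 < 99: not healthy
```

The same measurement gives two different verdicts: the SLI only tells you what happened, and the SLO decides whether that is good enough.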
A classic pitfall for less technical PMs is trying to set SLOs to 100% to “make things simpler.” The reality is that backend logic doesn't “just always work.” Instead, something constantly happens to servers (actual physical computers in the data centers break), network connections lose data packets, code contains bugs, and so on. That's why SLI graphs look like a “saw” in the pictures. This has two implications.
First, a 100% SLO is simply unrealistic. Technically speaking, it is achievable over short periods, but over a month it is highly unlikely. Even Google's page doesn't always load (even though it has one of the strictest SLOs in the world).
Second, every additional “nine” (99%, 99.9%, 99.99%, etc.) will cost you more effort. Think about it: if you promised 99.99%, that allows you 0.01% of failures. Assuming your service has a constant traffic load, a 30-day month has 60 * 60 * 24 * 30 = 2,592,000 seconds, so you can afford outages of just 2,592,000 * 0.0001 = 259.2 seconds, or about 4.3 minutes, in a month! Be careful about promising lightspeed outage resolution unless you are Stripe or Amazon.
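To see how quickly each extra “nine” shrinks your room for failure, here is a small Python sketch of the same arithmetic. It assumes, as above, a 30-day month and constant traffic; the function name is made up for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of full outage per period that an availability SLO still tolerates."""
    total_seconds = 60 * 60 * 24 * days      # seconds in the period
    error_budget = 1 - slo_percent / 100     # allowed share of failures
    return total_seconds * error_budget / 60


for slo in (99.0, 99.9, 99.99):
    print(f"SLO {slo}% allows {allowed_downtime_minutes(slo):.1f} min of outage per month")
```

This gives roughly 432 minutes for 99%, 43.2 minutes for 99.9%, and 4.3 minutes for 99.99%: every extra nine cuts your outage allowance by a factor of ten.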
Pause and think about the next question for at least 10 seconds; it is deeper than it sounds.
How do you choose the right SLO? Talk to clients! They will push it up to 100%, while your job is to push it down (using historical values or other considerations) to the value that doesn’t block your innovation: less room for failure means fewer features and experiments to play with.
Last beast: SLA
Service Level Agreements (SLA) are rules service owners will follow if they violate SLOs. These might include a complete blockage of new development, investing in testing, outage response practices, and—if agreed upon in a contract—even a financial penalty.
Outside of top tech companies, SLAs are quite rare because they require rigor and discipline to follow, but you can occasionally see them in critical product departments (e.g., the Stripe card processing engine). This is one of the reasons why those companies are big: they grew by being adamant in their pursuit of an uninterrupted and fast customer experience.
Summary
Product managers should keep in mind:
- An SLI is a metric, one of the 2–3 most important ones for a service from a client's perspective. If the client is not on board, then it is not an SLI.
- An SLO is a health threshold for an SLI (e.g., 99% or 99.99%).
- An SLA defines the service owner's obligations in case of an SLO breach.
Defining the right SLIs/SLOs/SLAs for a service is hard, especially at the beginning, but it pays off because you do it once and then ensure that your tech team is accountable for keeping SLIs green.
Then, you as a PM can focus on core activities to drive innovation, which is exactly what you are here for.
If you want more practice on this topic, there are two lessons and 30+ practical tasks dedicated to SLIs/SLOs/SLAs in the “Tech for PMs” hands-on course (20% off for my subscribers). There is also a live version of it as a Maven cohort in June; join by 18th May for an early bird discount. I hope a 4.9 out of 5 rating from PMs from all over the world speaks for itself 🦾
Weekly challenge 🦾🤖
Quite a few of you asked whether we could bring back the weekly challenge. Let’s do it! I'll now give you a task on SLOs. If you want to solidify this Tech concept, you can answer in the comments and receive my feedback. In 5 days, I will add my own answer there.
To make it more interesting, the author of the best answer will get one of the ProductDo mini-simulators (“SQL for PMs”, “ML basics for PMs”, “Product Planning: sprints, OKRs, vision” and “Product Discovery: JTBD, MVP, Unit Economy”), worth about $100, for free.
The challenge: Imagine you are a PM of a service that sends SMS to any telephone number around the world (you can google “sms API” to see that there are plenty of these). Big and small businesses call your service thousands of times per day. Some send marketing SMS, others send confirmation ones, others send verification ones (e.g., account verification), and so on. How would you go about defining SLIs/SLOs for a service with such a diverse set of clients?
The answer to the weekly challenge is to introduce a "priority" parameter and define a graded SLO, e.g.:
- For prio=1 messages, the latency SLO is 99.9% delivered in < 10 sec (e.g., for security SMSs)
- For prio=2 messages, the latency SLO is 99% delivered in < 60 sec (e.g., for transactional SMSs)
- For prio=3 messages, the latency SLO is 90% delivered in < 600 sec (e.g., for marketing SMSs)
In this case, the PM should ensure that not every client gives itself priority one; otherwise, the graded SLO loses its purpose.
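A graded SLO like this can be sketched in a few lines of Python. This is only an illustration under assumptions: the `LatencySlo` class, the `meets_slo` helper and the sample latencies are hypothetical, and a real SMS service would compute the SLI from its own delivery data. The thresholds are the three from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySlo:
    target_percent: float  # share of messages that must be delivered in time
    max_seconds: float     # delivery latency threshold


# Graded SLOs per priority, taken from the answer above.
SLOS = {
    1: LatencySlo(target_percent=99.9, max_seconds=10),   # security SMS
    2: LatencySlo(target_percent=99.0, max_seconds=60),   # transactional SMS
    3: LatencySlo(target_percent=90.0, max_seconds=600),  # marketing SMS
}


def meets_slo(latencies_sec: list[float], slo: LatencySlo) -> bool:
    """Check whether enough messages were delivered under the latency threshold."""
    if not latencies_sec:
        raise ValueError("no messages to evaluate")
    in_time = sum(1 for s in latencies_sec if s < slo.max_seconds)
    return 100.0 * in_time / len(latencies_sec) >= slo.target_percent


# Made-up sample: 95 messages delivered in 30 sec, 5 delivered in 15 minutes.
sample = [30.0] * 95 + [900.0] * 5

for prio, slo in SLOS.items():
    print(f"prio={prio}: SLO met = {meets_slo(sample, slo)}")
```

The same delivery times are fine for marketing messages (95% under 600 sec beats the 90% target) but breach the SLO for both prio=1 and prio=2, which is why the priority a client is allowed to use matters so much.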