Understanding SLA Uptime Percentages: The Complete Guide

"Five nines" sounds impressive in a sales pitch, but what does 99.999% uptime actually mean? And more importantly, what SLA should you commit to for your own service? The answer depends on your architecture, your budget, and how much downtime your users will tolerate.

The nines of availability

Uptime is measured in "nines" — the number of 9s in your availability percentage. Each additional nine reduces allowed downtime by a factor of 10:

| SLA | Nines | Downtime/month | Downtime/year |
|---|---|---|---|
| 99% | Two nines | 7h 18m | 3d 15h 39m |
| 99.5% | Two and a half | 3h 39m | 1d 19h 49m |
| 99.9% | Three nines | 43m 49s | 8h 45m 56s |
| 99.95% | Three and a half | 21m 55s | 4h 22m 58s |
| 99.99% | Four nines | 4m 23s | 52m 35s |
| 99.999% | Five nines | 26s | 5m 15s |

Notice the jump between three nines and four nines: you go from 43 minutes of allowed monthly downtime to just over 4 minutes. That's the difference between "we can do a rolling deploy" and "we need zero-downtime deployments with automated failover."
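The downtime figures follow directly from the percentage. Here is a minimal sketch of the arithmetic, assuming an average year of 365.2425 days and a month of one twelfth of that; this convention and the name `downtime_budget` are assumptions for illustration, not something the table specifies.

```python
from datetime import timedelta

# Assumed convention: average Gregorian year, and a month as one twelfth of it.
YEAR = timedelta(days=365.2425)
MONTH = YEAR / 12


def downtime_budget(sla_percent: float, period: timedelta) -> timedelta:
    """Return the downtime allowed over `period` at the given SLA percentage."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period * (1 - sla_percent / 100)


for sla in (99.0, 99.9, 99.99, 99.999):
    monthly = downtime_budget(sla, MONTH)
    yearly = downtime_budget(sla, YEAR)
    print(f"{sla}%: {monthly} per month, {yearly} per year")
```

Passing a different `period`, such as a 30-day billing cycle, shows how much the budget shifts with the definition of "month" written into the contract.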

What each tier actually requires

99% — The baseline

Two nines gives you over 7 hours of downtime per month. This is achievable with a single server and manual deployments. Most hobby projects and internal tools live here. If each deploy takes the service offline for a few minutes and you deploy a couple of times a week, deploys alone cost roughly half an hour a month. That leaves almost no margin under a three-nines budget, so two nines is the realistic commitment.

99.9% — The standard for production SaaS

Three nines is the most common SLA for production web services. You get about 43 minutes of downtime per month. This requires automated deployments, health checks, and basic redundancy. A load balancer with two app servers and a managed database gets you here.

99.99% — Serious infrastructure

Four nines means under 5 minutes of downtime per month. You need multi-region failover, automated incident response, zero-downtime deployments, and a mature on-call rotation. Database failover must be automatic. Deployments must be blue-green or canary. Every dependency must have a fallback.

99.999% — The gold standard

Five nines allows 26 seconds of monthly downtime. This is what payment processors, core banking systems, and emergency services target. It requires active-active multi-region architecture, circuit breakers on every dependency, and automated remediation. Most companies don't need this and shouldn't promise it.
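For readers unfamiliar with the circuit-breaker pattern mentioned above, here is a minimal single-threaded sketch. The class name, the thresholds, and the injectable `clock` parameter are all illustrative assumptions; a production service would more likely use an established library and add thread safety.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency is failing; call skipped")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The caller catches `CircuitOpenError` and serves a fallback (a cached value, a degraded response) instead of waiting on a dependency that is known to be down.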

How to choose your SLA

Don't promise more than you can deliver. Start by measuring your actual uptime over the past 3-6 months. If you're currently at 99.7%, don't commit to 99.99% — commit to 99.9% and work toward improving.
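One way to ground that measurement is to total the outage durations over the window and see which tiers the history already satisfies. The sketch below assumes you have a list of outage durations; the incident log and function names are hypothetical.

```python
from datetime import timedelta


def measured_availability(window: timedelta, outages: list[timedelta]) -> float:
    """Availability percentage over `window`, given the duration of each outage."""
    downtime = sum(outages, timedelta())
    if downtime > window:
        raise ValueError("total downtime exceeds the measurement window")
    return 100 * (1 - downtime / window)


def highest_tier_met(availability: float,
                     tiers=(99.999, 99.99, 99.95, 99.9, 99.5, 99.0)):
    """Return the strictest tier the measured figure satisfies, or None."""
    for tier in tiers:  # ordered strictest first
        if availability >= tier:
            return tier
    return None


# Hypothetical incident log for a 90-day window.
outages = [timedelta(minutes=95), timedelta(minutes=40), timedelta(hours=4)]
availability = measured_availability(timedelta(days=90), outages)
print(f"measured: {availability:.3f}%, "
      f"highest tier already met: {highest_tier_met(availability)}")
```

The gap between the tier already met and the tier you want to commit to is the reliability work that has to land before the commitment is safe.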

Consider your dependencies

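Your SLA can't be higher than your weakest dependency, and when several dependencies sit on the critical path it is usually a little lower, because their availabilities multiply. If your cloud provider guarantees 99.99% and your payment processor guarantees 99.9%, the combination works out to about 99.89% (0.9999 × 0.999), so your end-to-end SLA can't realistically exceed 99.9% and will likely land just below it. Map your dependencies and their SLAs before setting your own.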
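As a rough check when mapping dependencies, the composite availability of components that must all be up can be estimated as the product of their individual availabilities. This sketch assumes failures are independent, which real systems only approximate; `serial_availability` is an illustrative name.

```python
from math import prod


def serial_availability(*percentages: float) -> float:
    """Composite availability (percent) when every component must be up.

    Assumes independent failures; correlated outages make this estimate
    either optimistic or pessimistic depending on how they overlap.
    """
    return 100 * prod(p / 100 for p in percentages)


# Cloud provider at 99.99% and payment processor at 99.9%.
composite = serial_availability(99.99, 99.9)
print(f"composite availability: {composite:.4f}%")
```

The result comes out just under 99.9%, and every additional hard dependency pulls it down further.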

Factor in planned maintenance

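Some SLA definitions exclude planned maintenance windows. Others include all downtime. Be explicit about which definition you use. A weekly 30-minute maintenance window adds up to roughly 130 minutes a month, about three times the entire 43-minute budget of a 99.9% SLA, so a 99.9% SLA that excludes the window is a very different promise from one that includes it.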
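The effect of the two definitions can be made concrete with a short calculation. The sketch below assumes an average month of one twelfth of a 365.2425-day year and a hypothetical month with 20 minutes of unplanned incidents.

```python
from datetime import timedelta

# Assumed convention: average month, one twelfth of a 365.2425-day year.
MONTH = timedelta(days=365.2425 / 12)
WEEK = timedelta(days=7)


def monthly_maintenance(window: timedelta) -> timedelta:
    """Total maintenance time per average month for one window per week."""
    return window * (MONTH / WEEK)


def availability(unplanned: timedelta, maintenance: timedelta,
                 count_maintenance: bool) -> float:
    """Monthly availability in percent, with or without maintenance counted."""
    downtime = unplanned + (maintenance if count_maintenance else timedelta())
    return 100 * (1 - downtime / MONTH)


maintenance = monthly_maintenance(timedelta(minutes=30))
unplanned = timedelta(minutes=20)  # hypothetical incident total for the month
print(f"maintenance excluded: {availability(unplanned, maintenance, False):.3f}%")
print(f"maintenance included: {availability(unplanned, maintenance, True):.3f}%")
```

The same month of operation clears 99.9% under one definition and misses it under the other, which is why the contract needs to state which one applies.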

Measuring uptime accurately

You can't manage what you don't measure. External monitoring is the source of truth for uptime because it measures availability from your users' perspective, not from inside your infrastructure.

A good monitoring setup checks every 10-30 seconds, from multiple regions, with assertions beyond just "returns 200." Track response time percentiles, not just averages — your p99 tells you more about user experience than your mean.
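To illustrate why the mean can hide a bad tail, here is a small sketch that computes uptime and latency statistics from a batch of check results. The `Check` record and the sample data are hypothetical; a real monitor would also record region and timestamp.

```python
from dataclasses import dataclass
from statistics import mean, quantiles


@dataclass
class Check:
    ok: bool           # True when the status and body assertions both passed
    latency_ms: float  # response time observed by the external probe


def uptime_percent(checks: list[Check]) -> float:
    """Share of checks that passed, as a percentage."""
    return 100 * sum(c.ok for c in checks) / len(checks)


def p99(latencies: list[float]) -> float:
    """99th percentile: the last of the 99 cut points for n=100."""
    return quantiles(latencies, n=100, method="inclusive")[-1]


# Hypothetical sample: 980 fast responses, 20 very slow ones, 1 failed check.
checks = [Check(True, 100.0)] * 980 + [Check(True, 3000.0)] * 19 + [Check(False, 3000.0)]
latencies = [c.latency_ms for c in checks]
print(f"uptime: {uptime_percent(checks):.1f}%")
print(f"mean latency: {mean(latencies):.0f} ms, p99: {p99(latencies):.0f} ms")
```

In this sample the mean stays close to the fast responses while the p99 reflects the three-second requests that some users actually experienced.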

When SLA breaches happen

SLA breaches happen. What matters is how you handle them. Have a clear policy for service credits or compensation. More importantly, publish a postmortem explaining what happened and what you're doing to prevent it. Transparency after a breach builds more trust than a perfect uptime number.
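A credit policy is easier to apply consistently when it is written down as a schedule. The tiers and percentages below are entirely hypothetical, included only to show the shape such a policy might take.

```python
# Hypothetical schedule: (minimum measured availability %, credit as % of monthly fee).
# Ordered from best availability to worst; values are illustrative only.
CREDIT_TIERS = [(99.9, 0), (99.0, 10), (95.0, 25)]
FLOOR_CREDIT = 50  # hypothetical credit when availability falls below every tier


def service_credit(measured_availability: float, monthly_fee: float) -> float:
    """Return the credit owed for a month, given measured availability."""
    for threshold, credit_percent in CREDIT_TIERS:
        if measured_availability >= threshold:
            return monthly_fee * credit_percent / 100
    return monthly_fee * FLOOR_CREDIT / 100


print(service_credit(99.5, 200.0))  # a month that missed the 99.9% target
```

Publishing the schedule alongside the SLA lets customers verify a credit themselves instead of negotiating it after each incident.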

The goal of an SLA isn't to promise perfection — it's to set clear expectations and demonstrate accountability when things go wrong.
