Site Reliability Engineer

Role Summary

Embedded SRE bringing reliability and operational discipline to client engineering teams. Work covers SLO design, incident-response process, observability stack consolidation, on-call rotation maturity, and the platform-engineering substrate that lets product teams ship faster without sacrificing reliability.

Operates as a member of the engineering team rather than as an auditor. Earns trust by being on-call alongside the team during incidents. Insists on blameless post-mortems with concrete action items, not abstract “we should do better next time” closure. Prefers observability investments that pay back immediately over framework migrations that pay back never.

Skills

Kubernetes operations (EKS, GKE, AKS) at production scale
GitOps tooling (Argo CD, Flux) and progressive-delivery patterns
Service-mesh selection and operation (Istio, Linkerd, AWS App Mesh)
Observability platforms (Grafana, Prometheus, Datadog, New Relic, Honeycomb)
Distributed tracing implementation (OpenTelemetry, Jaeger, Tempo)
Log aggregation and structured logging at scale
Metrics design and cardinality management
Alerting strategy and pager hygiene
SLO/SLI/error-budget framework design
Incident-command process and severity-classification design
Runbook design and tabletop exercise facilitation
Blameless post-mortem facilitation and action-item closure tracking
On-call rotation design with sustainable load
Chaos engineering and resilience testing programs
Disaster-recovery testing and tabletop drills
Performance engineering and load-test design
Capacity planning and headroom analysis
CI/CD pipeline design with canary and blue-green rollouts
Infrastructure-as-code (Terraform, Pulumi, CloudFormation)
Linux production operations and kernel-level debugging

Capabilities & Focus Areas

SLO/SLI/error-budget framework design tied to product cadence
Incident response process design including runbooks, severities, and post-mortems
Observability stack consolidation (metrics, logs, traces) on a single backbone
On-call rotation design with sustainable load
GitOps and CI/CD substrate for safe production change
Reliability remediation programs with measurable impact
Chaos engineering and resilience testing
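
As an illustration of what tying an error budget to product cadence can look like, here is a minimal Python sketch for a request-based availability SLO. The class name, the release_policy helper, and the 25% freeze threshold are hypothetical choices made for this example, not a prescribed framework; real engagements set these with the client's engineering and product teams.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for a request-based availability SLO over one window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int   # requests observed in the window so far
    failed_requests: int  # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        # Fraction of the budget still unspent; negative once the SLO is breached.
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / self.allowed_failures


def release_policy(budget: ErrorBudget, freeze_below: float = 0.25) -> str:
    """Hypothetical policy: slow feature releases when little budget is left."""
    if budget.budget_remaining >= freeze_below:
        return "ship"
    return "prioritize reliability"


if __name__ == "__main__":
    window = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"budget remaining: {window.budget_remaining:.0%}")
    print(f"policy: {release_policy(window)}")
```

The point of the sketch is the coupling: the same number that reports reliability also gates release pace, so product and engineering argue about one threshold rather than about individual incidents.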

Typical Engagement Patterns

Embedded SRE for three to twelve months with client engineering teams
Incident-response process design engagements (four to six weeks)
Observability consolidation programs (eight to sixteen weeks)
Reliability remediation engagements for clients in incident-heavy operating states
Standalone SLO and error-budget design engagements ahead of major launches

Outcomes Delivered

Incident MTTR reductions documented quarter-over-quarter
On-call rotations that retain engineers rather than burn them out
Observability stacks that surface root cause in minutes, not days
SLOs that engineering and product teams actually respect
Post-mortem action items closed within agreed timeframes
