Cloud & Infrastructure
Site Reliability Engineer
Role Summary
Embedded SRE bringing reliability and operational discipline to client engineering teams. Work covers SLO design, incident-response process, observability stack consolidation, on-call rotation maturity, and the platform-engineering substrate that lets product teams ship faster without sacrificing reliability.
Operates as a member of the engineering team rather than as an auditor. Earns trust by being on call alongside the team during incidents. Insists on blameless post-mortems that end in concrete action items, not an abstract “we should do better next time” closure. Prefers observability investments that pay back immediately over framework migrations that never pay back.
Skills
- Kubernetes operations (EKS, GKE, AKS) at production scale
- GitOps tooling (Argo CD, Flux)
- Service-mesh selection and operation (Istio, Linkerd, AWS App Mesh)
- Observability platforms (Grafana, Prometheus, Datadog, New Relic, Honeycomb)
- Distributed tracing implementation (OpenTelemetry, Jaeger, Tempo)
- Log aggregation and structured logging at scale
- Metrics design and cardinality management
- Alerting strategy and pager hygiene
- SLO/SLI/error-budget framework design
- Incident-command process and severity-classification design
- Runbook design and tabletop exercise facilitation
- Blameless post-mortem facilitation and action-item closure tracking
- On-call rotation design with sustainable load
- Chaos engineering and resilience testing programs
- Disaster-recovery testing and tabletop drills
- Performance engineering and load-test design
- Capacity planning and headroom analysis
- CI/CD pipeline design and progressive-delivery patterns (canary, blue-green)
- Infrastructure-as-code (Terraform, Pulumi, CloudFormation)
- Linux production operations and kernel-level debugging
Capabilities & Focus Areas
- SLO/SLI/error-budget framework design tied to product cadence
- Incident response process design including runbooks, severities, and post-mortems
- Observability stack consolidation (metrics, logs, traces) on a single backbone
- On-call rotation design with sustainable load
- GitOps and CI/CD substrate for safe production change
- Reliability remediation programs with measurable impact
- Chaos engineering and resilience testing
Typical Engagement Patterns
- Embedded SRE placements of three to twelve months with client engineering teams
- Incident-response process design engagements (four to six weeks)
- Observability consolidation programs (eight to sixteen weeks)
- Reliability remediation engagements for clients experiencing frequent incidents
- Standalone SLO and error-budget design engagements ahead of major launches
Outcomes Delivered
- Incident MTTR reductions documented quarter-over-quarter
- On-call rotations that retain engineers rather than burn them out
- Observability stacks that surface root cause in minutes, not days
- SLOs that engineering and product teams actually respect
- Post-mortem action items closed within agreed timeframes

