Role Summary

Embedded SRE bringing reliability and operational discipline to client engineering teams. Work covers SLO design, incident-response process, observability stack consolidation, on-call rotation maturity, and the platform-engineering substrate that lets product teams ship faster without sacrificing reliability.

Operates as a member of the engineering team rather than as an auditor. Earns trust by being on-call alongside the team during incidents. Insists on blameless post-mortems with concrete action items, not abstract “we should do better next time” closure. Prefers observability investments that pay back immediately over framework migrations that never pay back.

Skills

  • Kubernetes operations (EKS, GKE, AKS) at production scale
  • GitOps tooling (Argo CD, Flux)
  • Service-mesh selection and operation (Istio, Linkerd, AWS App Mesh)
  • Observability platforms (Grafana, Prometheus, Datadog, New Relic, Honeycomb)
  • Distributed tracing implementation (OpenTelemetry, Jaeger, Tempo)
  • Log aggregation and structured logging at scale
  • Metrics design and cardinality management
  • Alerting strategy and pager hygiene
  • SLO/SLI/error-budget framework design
  • Incident-command process and severity-classification design
  • Runbook design and tabletop exercise facilitation
  • Blameless post-mortem facilitation and action-item closure tracking
  • On-call rotation design with sustainable load
  • Chaos engineering and resilience testing programs
  • Disaster-recovery testing and tabletop drills
  • Performance engineering and load-test design
  • Capacity planning and headroom analysis
  • CI/CD pipeline design and progressive-delivery patterns (canary, blue-green)
  • Infrastructure-as-code (Terraform, Pulumi, CloudFormation)
  • Linux production operations and kernel-level debugging

Capabilities & Focus Areas

  • SLO/SLI/error-budget framework design tied to product cadence
  • Incident-response process design, including runbooks, severities, and post-mortems
  • Observability stack consolidation (metrics, logs, traces) on a single backbone
  • On-call rotation design with sustainable load
  • GitOps and CI/CD substrate for safe production change
  • Reliability remediation programs with measurable impact
  • Chaos engineering and resilience testing

Typical Engagement Patterns

  • Embedded SRE for three to twelve months with client engineering teams
  • Incident-response process design engagements (four to six weeks)
  • Observability consolidation programs (eight to sixteen weeks)
  • Reliability remediation engagements for clients in incident-heavy operating states
  • Standalone SLO and error-budget design engagements ahead of major launches

Outcomes Delivered

  • Incident MTTR reductions documented quarter-over-quarter
  • On-call rotations that retain engineers rather than burn them out
  • Observability stacks that surface root cause in minutes, not days
  • SLOs that engineering and product teams actually respect
  • Post-mortem action items closed within agreed timeframes

Need this role for an engagement?

Brief us on the scope and timeline and we'll match a senior practitioner.
