Overview:
We are seeking a self-driven, inquisitive Site Reliability Engineer (SRE) to drive reliability, availability, performance, and security across our global digital product ecosystem. This role is central to ensuring a seamless and resilient experience for our users by blending deep engineering expertise with operational excellence and automation.
You will be part of a global SRE practice supporting a portfolio of 260+ modern cloud-native applications across consumer, commercial, supply chain, and enablement functions. Your mission: prevent incidents before they occur, ensure rapid recovery when they do, and build scalable systems that evolve with our growing business.
Responsibilities:
- Champion reliability, observability, and operational excellence across mission-critical applications.
- Develop and maintain service-level indicators (SLIs), objectives (SLOs), and error budgets to measure and improve system performance.
- Implement automated monitoring, alerting, and recovery mechanisms to reduce manual intervention and improve response times.
- Collaborate closely with software engineering, platform, and operations teams to embed SRE practices across the development lifecycle.
- Lead and participate in incident response, root cause analysis, and postmortem reviews to drive long-term improvements.
- Identify and eliminate sources of toil through automation, tooling, and process refinement.
- Continuously improve resiliency design, capacity planning, and release management in production systems.
- Influence engineering teams with best practices on cloud-native architecture, observability, and deployment strategies.
Qualifications:
Required Skills:
- 5+ years of experience in production engineering, DevOps, or SRE roles.
- Strong foundation in Linux systems, networking, and cloud platforms (Azure, AWS, or GCP).
- Hands-on experience with observability tools (e.g., AppDynamics, Prometheus, Grafana, ELK, FullStory).
- Proficiency in scripting or programming (e.g., Python, Bash, Go) and automation frameworks (e.g., Ansible, Terraform).
- Deep understanding of CI/CD pipelines, release strategies, and deployment automation.
- Experience in managing high-scale, distributed systems in cloud-native environments.
- Strong analytical skills and a passion for continuous improvement.
Preferred Skills:
- Familiarity with microservices, Kubernetes, containers, and service mesh architecture.
- Exposure to incident and problem management frameworks (e.g., ITIL, RCA practices).
- Experience working in global teams supporting mission-critical applications.