Are you ready to write your next chapter?
Make your mark at one of the biggest names in payments. We're looking for a Site Reliability Engineer (SRE) to join our ever-evolving TSO team and help us unleash the potential of every business.
What you'll own as a Site Reliability Engineer (SRE)
- Analyze incident data from platforms like ServiceNow, IBM Netcool, Everbridge xMatters, OpsGenie, and PagerDuty to identify recurring issues and service instability trends.
- Collaborate with cross-functional teams to improve platform availability, stability, and performance.
- Identify and close observability gaps in logging, monitoring, and alerting; recommend and implement new tools as needed.
- Integrate pre- and post-change validation testing into CI/CD pipelines and manual deployments.
- Develop and pilot automated runbooks for common incident types to improve incident response and reduce MTTR.
- Participate in Change Advisory Boards (CABs), major incident triage, and root cause analysis processes.
- Evaluate and implement tools for incident remediation, change validation, and performance benchmarking.
- Contribute to monthly retrospectives, publish quarterly SRE health reports, and drive continuous improvement initiatives.
What you bring
You bring a passion for building reliable, scalable systems backed by hands-on experience in observability, automation, and collaborative problem-solving.
- 3+ years of experience in Site Reliability Engineering, DevOps, or a related technical role.
- Strong understanding of incident management, root cause analysis, and service reliability principles.
- Experience in IT Operations, with a focus on observability and log management.
- Solid understanding of observability concepts, including metrics, traces, log aggregation and management, event management, alerting, and OpenTelemetry (OTEL) concepts and best practices.
- Hands-on experience with observability and monitoring tools (e.g., Splunk Enterprise, Splunk Cloud, Splunk Observability, OTEL agents, collectors and gateways, Prometheus, Grafana, Zabbix).
- Experience developing Splunk queries and dashboards using Splunk Search Processing Language (SPL).
- Proficiency in scripting languages (e.g., Python, Bash) and infrastructure-as-code tools.
- Familiarity with CI/CD pipelines and automated testing frameworks.
- Excellent problem-solving skills and a proactive, collaborative mindset.
- Strong communication skills and the ability to work effectively across teams.
Added bonus if you have
- Experience working in high-availability or financial services environments.
- Experience with Software Development Life Cycle (SDLC) concepts.
- Experience working within an Agile environment.
- Knowledge of ITIL processes and prior participation in CABs.
- Familiarity with cloud platforms such as AWS, Azure, or GCP.
- Exposure to performance benchmarking, capacity planning, and service-level objective (SLO) management.
- Experience in container monitoring (e.g., Kubernetes, Docker) and cloud-native architectures.
- Experience with one or more of the following application development or scripting languages:
  - Java
  - Python
  - C#
  - .NET
  - JavaScript
  - SQL
  - C++
  - Go (Golang)
  - Rust
  - Scala
  - Kotlin
  - Ruby
  - Unix Scripting (e.g., Bash, Korn Shell)
- Certifications:
  - Cloud: AWS, Azure
  - Observability: Splunk, Datadog, Dynatrace
  - Infrastructure: Red Hat, VMware, MCSE
About the team
Our Tech and Security teams keep us moving each day, no matter where we are in the world. From the hardware to the networks and everything between, they humbly make it all happen.
As a Site Reliability Engineer (SRE) within the WP Technology Services Operations (TSO) team, you will play a critical role in enhancing the reliability, stability, and performance of our platforms and services in support of innovative fintech products that change the way the world pays, banks, and invests. This role blends software engineering with systems engineering to proactively prevent incidents, automate operations, and improve observability across complex environments. You'll collaborate closely with infrastructure, development, and incident management teams to reduce service disruptions, implement scalable solutions, and drive continuous improvement. This is a unique opportunity to help shape a high-performing SRE function from the ground up, with a clear roadmap and strong executive support.
What makes a Worldpayer
What makes a Worldpayer? It's simple: Think, Act, Win. We stay curious, always asking the right questions and finding creative solutions to simplify the complex. We're dynamic: every Worldpayer is empowered to make the right decisions for their customers. And we're determined, always staying open and winning and failing as one.