Site Reliability Engineer II

1 - 6 years

20 - 25 Lacs

Posted: 5 days ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

Join Zuora's high-impact Operations team, where you'll be instrumental in maintaining the reliability, scalability, and performance of our SaaS platform. This role involves proactive service monitoring, incident response, infrastructure service management, and ownership of internal and external shared services to ensure optimal system availability and performance. You will work alongside a team of skilled engineers dedicated to operational excellence through automation, observability, and continuous improvement.

In this cross-functional role, you'll collaborate daily with Product Engineering & Management, Customer Support, Deal Desk, Global Services, and Sales teams to ensure a seamless, customer-centric service delivery model. As a core member of the team, you'll have the opportunity to design and implement operational best practices, contribute to service provisioning strategies, and drive innovations that enhance the overall platform experience.

If you're driven by solving complex problems in a fast-paced environment and are passionate about operational resilience and service reliability, we'd love to hear from you.

Our Tech Stack:

Linux Administration, Python, Docker, Kubernetes, MySQL, Kafka, ActiveMQ, Tomcat App & Web, Oracle, Load Balancers, Redis Cache, Debezium, AWS, WAF, LBs, Jenkins, GitOps, Terraform, Ansible, Puppet, Prometheus, Grafana, OpenTelemetry

In this role you'll get to

Architect and implement intelligent automation workflows for infrastructure lifecycle management, including self-healing systems, automated incident remediation, and configuration anomaly detection using Infrastructure as Code (IaC) and AI-driven tooling.
Leverage predictive monitoring and anomaly detection techniques powered by AI/ML to proactively assess system health, optimize performance, and preempt service degradation or outages.
Lead complex incident response efforts, applying deep root cause analysis (RCA) and postmortem practices to drive long-term stability, while integrating automated detection and remediation capabilities.
Partner with development and platform engineering teams to build resilient CI/CD pipelines, enforce infrastructure standards, and embed observability and reliability into application deployments.
Identify and eliminate reliability bottlenecks through automated performance tuning, dynamic scaling policies, and advanced telemetry instrumentation.
Maintain and continuously evolve operational runbooks by incorporating machine learning insights, updating playbooks with AI-suggested resolutions, and identifying automation opportunities for manual steps.
Stay abreast of emerging trends in AI for IT operations (AIOps), distributed systems, and cloud-native technologies to influence strategic reliability engineering decisions and tool adoption.

Who we're looking for

Has hands-on experience with Linux server administration and Python programming.
Has deep experience with containerization and orchestration using Docker and Kubernetes, managing highly available services at scale.
Has worked with messaging systems like Kafka and ActiveMQ, databases like MySQL and Oracle, and caching solutions like Redis.
Understands and applies AI/ML techniques in operations, including anomaly detection, predictive monitoring, and self-healing systems.
Has a solid track record in incident management, root cause analysis, and building systems that prevent recurrence through automation.
Is proficient in developing and maintaining CI/CD pipelines with a strong emphasis on observability, performance, and reliability.
Has experience with monitoring and observability using Prometheus, Grafana, and OpenTelemetry, with a focus on real-time anomaly detection and proactive alerting.
Is comfortable writing and maintaining runbooks and enjoys enhancing them with automation and machine learning insights.
Keeps up to date with industry trends such as AIOps, distributed systems, SRE best practices, and emerging cloud technologies.
Brings a collaborative mindset, working cross-functionally with engineering, product, and operations teams to align system design with business objectives.
Has 1+ years of experience working in a SaaS environment.

Nice to Have:

Red Hat Certified System Administrator (RHCSA) - Red Hat
AWS Certification
Certified Associate in Python Programming (PCAP) - Python Institute
Docker Certified Associate (DCA) or Certified Kubernetes Administrator (CKA)
Good knowledge of Jenkins
Advanced certifications in SRE or related fields

As part of our commitment to building an inclusive, high-performance culture where ZEOs feel inspired, connected and valued, we support ZEOs with:

Competitive compensation, corporate bonus program, performance rewards and retirement programs
Medical insurance
Generous, flexible time off
Paid holidays, wellness days and company-wide end-of-year break
6 months fully paid parental leave
Learning & Development stipend
Opportunities to volunteer and give back, including charitable donation match
Free resources and support for your mental wellbeing
