Posted:5 days ago|
Platform:
Work from Office
Full Time
Join Zuora s high-impact Operations team, where you'll be instrumental in maintaining the reliability, scalability, and performance of our SaaS platform. This role involves proactive service monitoring, incident response, infrastructure service management, and ownership of internal and external shared services to ensure optimal system availability and performance. You will work alongside a team of skilled engineers dedicated to operational excellence through automation, observability, and continuous improvement. In this cross-functional role, you'll collaborate daily with Product Engineering & Management, Customer Support, Deal Desk, Global Services, and Sales teams to ensure a seamless and customer-centric service delivery model. As a core member of the team, you'll have the opportunity to design and implement operational best practices, contribute to service provisioning strategies, and drive innovations that enhance the overall platform experience. If you're driven by solving complex problems in a fast-paced environment and are passionate about operational resilience and service reliability, we d love to hear from you. Our Tech Stack: Linux Administration, Python, Docker, Kubernetes, MySQL, Kafka, ActiveMQ, Tomcat App & Web, Oracle, Load Balancers, REDIS Cache, Debezium, AWS, WAF, LBs, Jenkins, GitOps, Terraform, Ansible, Puppet, Prometheus, Grafana, Open Telemetry In this role you'll get to Architect and implement intelligent automation workflows for infrastructure lifecycle management, including self-healing systems, automated incident remediation, and configuration analomy detection using Infrastructure as Code (IaC) and AI-driven tooling. Leverage predictive monitoring and anomaly detection techniques powe'red by AI/ML to proactively assess system health, optimize performance, and preempt service degradation or outages. Lead complex incident response efforts, applying deep root cause analysis (RCA) and postmortem practices to drive long-term stability, while integrating automated detection and remediation capabilities. Partner with development and platform engineering teams to build resilient CI/CD pipelines, enforce infrastructure standards, and embed observability and reliability into application deployments. Identify and eliminate reliability bottlenecks through automated performance tuning, dynamic scaling policies, and advanced telemetry instrumentation. Maintain and continuously evolve operational runbooks by incorporating machine learning insights, updating playbooks with AI-suggested resolutions, and identifying automation opportunities for manual steps. Stay abreast of emerging trends in AI for IT operations (AIOps), distributed systems, and cloud-native technologies to influence strategic reliability engineering decisions and tool adoption. Who we're looking for Hands-on experience with Linux Servers Administration and Python Programming. Deep experience with containerization and orchestration using Docker and Kubernetes, managing highly available services at scale. Working with messaging systems like Kafka and ActiveMQ, databases like MySQL and Oracle, and caching solutions like REDIS. Understands and applies AI/ML techniques in operations, including anomaly detection, predictive monitoring, and self-healing systems. Has a solid track record in incident management, root cause analysis, and building systems that prevent recurrence through automation. Is proficient in developing and maintaining CI/CD pipelines with a strong emphasis on observability, performance, and reliability. Monitoring and observability using Prometheus, Grafana, and OpenTelemetry, with a focus on real-time anomaly detection and proactive alerting. Is comfortable writing and maintaining runbooks and enjoys enhancing them with automation and machine learning insights. Keeps up-to-date with industry trends such as AIOps, distributed systems, SRE best practices, and emerging cloud technologies. Brings a collaborative mindset, working cross-functionally with engineering, product, and operations teams to align system design with business objectives. 1+ years of experience working in a SaaS environment. Nice to Have: Red Hat Certified System Administrator (RHCSA) - Red Hat AWS Certification Certified Associate in Python Programming (PCAP) - Python Institute Docker Certified Associate (DCA) or Certified Kubernetes Administrator (CKA) Good knowledge of Jenkins Advanced certifications in SRE or related fields As part of our commitment to building an inclusive, high-performance culture where ZEOs feel inspired, connected and valued, we support ZEOs with: Competitive compensation, corporate bonus program, performance rewards and retirement programs Medical insurance Generous, flexible time off Paid holidays, we'llness days and company wide end of year break 6 months fully paid parental leave Learning & Development stipend Opportunities to volunteer and give back, including charitable donation match Free resources and support for your mental we'llbeing
Zuora
Upload Resume
Drag or click to upload
Your data is secure with us, protected by advanced encryption.
Browse through a variety of job opportunities tailored to your skills and preferences. Filter by location, experience, salary, and more to find your perfect fit.
We have sent an OTP to your contact. Please enter it below to verify.
Practice Python coding challenges to boost your skills
Start Practicing Python Now13.0 - 17.0 Lacs P.A.
Bengaluru
5.0 - 7.0 Lacs P.A.
Chennai
20.0 - 25.0 Lacs P.A.
Experience: Not specified
6.2 - 9.2 Lacs P.A.
Mumbai Metropolitan Region
Experience: Not specified
Salary: Not disclosed
25.0 - 30.0 Lacs P.A.
Hyderabad
7.0 - 11.0 Lacs P.A.
Bengaluru
3.0 - 8.0 Lacs P.A.
Bengaluru
10.0 - 14.0 Lacs P.A.
Kolkata, Mumbai, New Delhi, Hyderabad, Pune, Chennai, Bengaluru
20.0 - 25.0 Lacs P.A.