5.0 - 10.0 years

5 - 10 Lacs

Pune, Maharashtra, India

On-site

Role Overview: Business Operations Site Reliability Engineer (SRE)

The role of the Business Operations team is to act as the production readiness steward for Mastercard products. As a BizOps SRE, the primary responsibility is ensuring the stability and health of the platform. Foster developer run ownership and empower developers to build resilient products. Support developers during the application build phase with operational design, automation, capacity planning, and monitoring, ensuring fault-tolerant and scalable products. Create and enforce operational standards while fostering an agile and learning culture. Focus on triage and root cause analysis, understanding the business impact of products, and performing blameless post-mortems. Engage early in the development lifecycle to be proactive and manage production and change activities to maximize customer experience. Focus on risk management, compliance, and risk mitigation across all environments. Align product and customer-focused priorities with operational needs by providing continuous feedback throughout the lifecycle.

Mission:

The mission is to ensure production readiness through close collaboration with developers to design, build, implement, and support technology services. Ensure operational criteria such as system availability, capacity, performance, monitoring, self-healing, and deployment automation are implemented throughout the delivery process. Lead the DevOps transformation at Mastercard through tooling and by advocating for change and standards across development, quality, release, and product organizations. Support daily operations with a hyper-focus on triage and root cause analysis, understanding business impacts and conducting blameless post-mortems. Shift left in the development process, becoming more proactive to maximize customer experience and increase the value of supported applications. Focus on streamlining and standardizing application-specific support activities and centralizing points of interaction for both internal and external partners. Communicate effectively with key stakeholders to align product and customer-focused priorities with operational needs.

Key Responsibilities:

Operational Readiness Architect: Serve as the primary contact responsible for the overall health, performance, and capacity of applications. Support services before they go live by engaging in system design consulting, capacity planning, and launch reviews. Partner with development and product teams to establish monitoring and alerting strategies, ensuring zero downtime during deployment.

Site Reliability Engineering (SRE): Ensure application scalability, performance, and resilience. Practice sustainable incident response and blameless post-mortems. Take a holistic approach to problem-solving and optimize recovery time. Automate data-driven alerts to proactively escalate issues and work with development teams to establish Service Level Objectives (SLOs) to improve reliability.

DevOps/Automation: Address complex development, automation, and business process challenges. Engage in and improve the entire lifecycle of services, from inception and design to deployment, operation, and refinement. Support the CI/CD pipeline, ensuring smooth promotion of software into higher environments through validation and operational gating. Lead Mastercard in DevOps automation and best practices. Increase automation and tooling to reduce manual interventions and toil.

ITSM Practices: Analyze ITSM activities of the platform and provide feedback to development teams on operational gaps or resiliency concerns.

Role Qualifications:

Education and Experience: BS degree in Computer Science, a related technical field (e.g., physics, mathematics), or equivalent practical experience. Exposure to coding and/or scripting. An appetite for pushing the boundaries of automation and exploring new technology, infrastructure, and practices to scale architecture for future growth.

Technical and Analytical Skills: Experience with algorithms, data structures, scripting, pipeline management, and software design. Systematic problem-solving approach with strong communication skills and a sense of ownership. Interest in designing, analyzing, and troubleshooting large-scale distributed systems. Comfortable collaborating with cross-functional teams to ensure expected system behavior is understood and monitoring is in place to detect anomalies.

Additional Skills: Ability to balance doing things correctly with fixing issues quickly. Flexible and pragmatic, working towards the long-term health of systems. Willingness to learn and take on challenging opportunities while being part of a matrix-based, diverse, and geographically distributed team. Ability to prioritize and build relationships across development, operations, and product teams.

Posted 3 days ago

Manager, Site Reliability Engineering Cvent

5.0 - 9.0 years

0 Lacs

Haryana

On-site

Cvent is a global leader in meeting, event, travel, and hospitality technology, with a workforce of over 4000 employees worldwide. Our cloud-based solutions cater to more than 28,000 customers in over 100 countries, including 80% of the Fortune 100 companies.

As a Lead - Site Reliability Engineer at Cvent, you will leverage your expertise in development and operations to identify and address issues, develop universal solutions, and provide guidance to junior staff. Your responsibilities will also include enabling and supporting multi-disciplinary teams, resolving complex development and automation challenges, promoting Cvent's standards and best practices, ensuring the scalability and performance of our product suite, and collaborating with various teams to establish effective monitoring and alerting strategies.

Key Responsibilities:
- Utilize advanced knowledge in development and operations to prioritize and resolve issues
- Mentor and support junior staff members
- Empower and collaborate with multi-disciplinary teams across different applications and locations
- Address complex development, automation, and business process challenges
- Advocate for Cvent standards and best practices
- Ensure product scalability, performance, and resilience
- Establish monitoring and alerting strategies for new applications
- Share best practices with the acquisition's DevOps team
- Develop automation solutions for deployment targeting multiple environments
- Assist in achieving zero-downtime deployments for the legacy code base
- Contribute to open source projects
- Automate tasks to streamline operations

Requirements:
- Knowledge of SDLC methodologies, preferably Agile
- Proficiency in Java, Python, or Ruby
- Experience with managing AWS services
- Familiarity with configuration management tools like Chef, Puppet, or Ansible
- Strong Windows and Linux administration skills
- Working knowledge of APM, monitoring, and logging tools
- Experience with 3-tier application stacks and incident response
- Familiarity with build tools such as Jenkins, CircleCI, etc.
- Exposure to containerization technologies like Docker, ECS, EKS, and Kubernetes
- Experience with NoSQL and relational databases like MongoDB, Couchbase, PostgreSQL, etc.
- Self-motivated with the ability to work independently

Preferred Skills:
- Understanding of F5 load balancing concepts
- Basic knowledge of observability, SLIs/SLOs, and message queues
- Familiarity with basic networking concepts
- Experience with artifact repository managers like Nexus, Artifactory, etc.
- Strong communication and people management skills

Join us at Cvent to be part of a dynamic team that is driving innovation and excellence in the world of event management technology.

Posted 1 week ago

Site Reliability Engineer Virtusa

3.0 - 5.0 years

0 - 3 Lacs

Hyderabad, Telangana, India

On-site

Job description

The SRE function is a highly visible force multiplier with a growth mindset, going through a period of increased investment, where you can contribute to the delivery of a highly reliable banking solution.

As part of an SRE squad, you will partner with engineering teams within Macquarie to help develop and drive the adoption of SRE best practices and tooling across the organisation. The role will require close engagement and collaboration with the entire engineering community. You will be involved in projects such as measuring, testing and improving our resilience (chaos engineering), our capacity to deal with increasing load (demand forecasting and capacity planning), our ability to make changes safely (change management and system design) and our observability (metrics, monitoring, and alerts).

What you offer
- Strong experience in software engineering and system design utilising Java, Golang or a similar language
- An understanding of the benefits and correct use of SLOs, metrics, logs and traces
- Cloud native at heart, ready to build on the shoulders of giants
- Excellent understanding of modern software development practices, tools and technologies
- Strong DevOps fundamentals with a preference for Java, Golang, microservices and other cloud technologies
- Experience in APM and observability tools, such as New Relic, Datadog, Dynatrace, the Grafana stack, etc.

Posted 1 week ago

DevOps & Site Reliability Engineering MakeMyTrip

15.0 - 19.0 years

0 Lacs

Haryana

On-site

As the Vice President of DevOps & SRE, you will hold a senior leadership position with the primary responsibility of driving platform reliability, secure operations, and DevOps excellence throughout the enterprise. Your role will involve integrating site reliability engineering practices with scalable DevOps automation and maintaining a robust cybersecurity posture. Leading high-performing teams, defining technology strategy, managing infrastructure, and safeguarding systems and data to support business growth and digital innovation will be key aspects of your role.

You will be expected to lead enterprise-wide DevOps adoption and continuous delivery transformation, implementing and optimizing CI/CD pipelines, infrastructure-as-code (IaC), and cloud-native architectures. Championing automation in deployment, monitoring, and infrastructure provisioning will be essential, along with experience in containerization (Kubernetes, Docker), service mesh, and serverless environments. Facilitating collaboration between development, operations, and QA for rapid and reliable releases will also be a critical part of your responsibilities.

Establishing and leading the Site Reliability Engineering (SRE) function to ensure system reliability, scalability, and performance will be another key aspect of your role. You will define and monitor SLAs, SLOs, and SLIs for critical applications and services, drive incident management and root cause analysis, and foster a postmortem culture. Developing and deploying observability strategies using tools like Prometheus, Grafana, Zabbix, or enterprise tools such as New Relic, Dynatrace, or Splunk will also be within your purview.

In terms of leadership and strategic alignment, you will build and mentor cross-functional teams across DevOps and SRE, partnering with engineering, product, and business leaders to align technical initiatives with organizational goals. Managing departmental budgets, tools, and vendor relationships, as well as reporting on KPIs, operational health, security posture, and risk to the executive leadership team, will also be part of your responsibilities.

To qualify for this role, you must hold a Bachelor's or Master's degree in Computer Science, Engineering, or a related field, along with at least 15 years of experience in IT/engineering, including a minimum of 5 years in leadership roles. Proven expertise in implementing DevOps, SRE, and security practices at scale, as well as hands-on experience with AWS, Azure, or GCP, CI/CD tools, and SRE observability platforms, are essential requirements for this position.

Posted 2 weeks ago
