35 Observability Jobs - Page 2

JobPe aggregates results for easy application access, but you actually apply on the job portal directly.

3.0 - 5.0 years

15 - 18 Lacs

Pune

Work from Office

Naukri logo

Experience: 3 to 5 years in cloud infrastructure operations, L1 incident management, automation support, and observability, with team coordination or mentoring experience.
Location: Pune
Shift: 24x7 Support (Rotational Shifts)
Education: BE/B.Tech (relevant certifications preferred: AWS Cloud Practitioner/Associate, Azure Fundamentals, CKA, Terraform Associate)

Job Summary: We are seeking an L1 Lead – Site Reliability Engineer (SRE) to guide and manage the frontline SRE team in ensuring the stability, availability, and efficiency of enterprise-scale cloud infrastructure operations. This role involves supervising incident response, ensuring adherence to runbooks and SOPs, providing technical guidance to L1 engineers, and being the key escalation point for L1 issues. You will be responsible for monitoring cloud services, triaging alerts, validating remediation efforts, mentoring junior engineers, and collaborating with L2/L3 teams for escalations and root cause analysis.

Responsibilities: Lead and mentor the L1 SRE team during shifts, ensuring timely response and proper handling of incidents, service requests, and alerts. Oversee infrastructure and application monitoring using tools such as Prometheus, Grafana, AWS CloudWatch, and Azure Monitor. Validate and guide remediation actions like pod restarts, disk space cleanup, scaling, and alert verification. Ensure SOPs, runbooks, and shift handover notes are followed and updated regularly. Execute and validate predefined Ansible playbooks, Terraform scripts, and CI/CD pipelines with junior team members. Act as the first point of escalation for unresolved L1 issues and coordinate with L2/L3 teams for resolution and RCA. Govern and track shift performance, including SLA compliance, FCR (First Call Resolution), and ticket hygiene. Coordinate patching, backup checks, standard changes, and validations in AWS/Azure environments. Facilitate onboarding of new L1 engineers, and deliver knowledge-sharing and refresher training sessions. Support automation initiatives by identifying repetitive tasks and creating or reviewing simple scripts. Prepare weekly/monthly shift reports and participate in SRE governance and review calls with operations leadership. Monitor the health of Kubernetes clusters and guide the team in basic pod/node/service troubleshooting.

Skills/Expertise: 3+ years of experience in cloud infrastructure operations with at least 1 year in a lead or mentoring role. Strong troubleshooting, coordination, documentation, and escalation management skills. Proven ability to lead shifts in a 24x7 support model. Familiarity with ITSM practices and SLA management (ServiceNow or similar). Proactive and structured communicator, capable of shift planning, reporting, and stakeholder updates.

Technical Skills: Experience monitoring and operating cloud-based environments with basic troubleshooting for system and application-level issues. Familiarity with AWS services and concepts (such as EC2, S3, IAM, and VPC) and with Azure DevOps services. Basic knowledge of container platforms such as Docker and Kubernetes (understanding pod/service basics, logs, etc.). Exposure to scripting using Shell, Bash, or Python for automation of routine tasks. Basic understanding of version control systems like Git, GitHub, or GitLab. Awareness of infrastructure-as-code and automation tools such as Ansible, Terraform, or CloudFormation (execution under guidance). Familiarity with CI/CD concepts and tools like Jenkins or GitLab CI (executing builds, monitoring pipelines). Understanding of alerting and monitoring tools like Grafana, ELK, Site24x7, CloudWatch, and Prometheus. Hands-on experience with ITSM tools such as ServiceNow for incident and ticket tracking.

Posted 1 month ago

6.0 - 9.0 years

6 - 9 Lacs

Bengaluru / Bangalore, Karnataka, India

On-site

Foundit logo

The Role: The LeadSquared platform and product suite is 100% on the cloud and currently all on AWS. The product suite comprises a large number of applications, services, and APIs built on various open-source and AWS-native tech stacks and deployed across multiple AWS accounts. The role involves leading the mission-critical responsibility of ensuring that all our online services are available, reliable, performant, and running at optimal cost. We firmly believe in a code- and automation-driven approach to Site Reliability.

Key Responsibilities: Build processes and platforms to ensure full observability and automated incident response management of all systems, applications, platforms, and infrastructure. Track incidents, perform RCA for every incident, and focus on prevention. Work closely with Engineering teams to improve the performance, reliability, and operability of various applications and services. Work with customers to address their concerns on infrastructure availability, performance, and security.

Key Requirements: 6+ years of experience in building tools for observability and incident response management for AWS resources as well as custom applications, of which 3+ years should be on AWS Cloud. 2+ years of experience in leading an SRE team. Deep understanding of observability of all major AWS services: EC2, RDS, Elasticsearch, Redis, SQS, API Gateway, Lambda, etc. Operational experience in deploying, operating, scaling, and troubleshooting large-scale production systems on the cloud. Strong interpersonal communication skills (including listening, speaking, and writing). Ability to create and work well in a diverse, team-focused environment with other DevOps and engineering teams. Ability to function well in a fast-paced, rapidly changing environment.

Posted 1 month ago

6.0 - 10.0 years

0 Lacs

Bengaluru / Bangalore, Karnataka, India

On-site

Foundit logo

Oracle Cloud Infrastructure (OCI) is one of the fastest-growing cloud platforms, and we are assembling a world-class team to build the next generation of security products. We're seeking a Principal Software Engineer to drive the design and development of mission-critical systems that protect OCI customers at hyperscale.

As a Principal Engineer in the Security Products Group, you will play a key leadership role in: Architecting and delivering complex, distributed systems with a focus on security, resiliency, and scalability. Driving strategic technical decisions and shaping the long-term vision for OCI's security offerings. Mentoring engineers, influencing cross-team engineering practices, and raising the technical bar across the organization. Leading design reviews, setting coding standards, and fostering a culture of operational excellence.

What You'll Do: Lead design and development of major features and large-scale systems from concept to production. Set the direction for platform architecture and system design in areas such as identity, data protection, threat detection, and vulnerability management. Operate and improve high-scale services, driving initiatives to increase reliability, observability, and automation. Collaborate across teams and orgs to align architecture, resolve dependencies, and ensure delivery of high-impact security capabilities.

What We're Looking For: Deep experience in building and operating distributed systems at scale. Proven ability to design and deliver complex features with cross-cutting impact. Hands-on experience with services operating across regions and subject to strict compliance and regulatory requirements. Strong coding skills and the ability to dive deep into technical details across the stack, from low-level systems internals to API design. A bias for simplicity, a passion for scale, and a pragmatic approach to problem-solving.

Why Security at OCI: The OCI Security Products Group is on a mission to build the most secure cloud platform. We deliver a portfolio of cloud-native services that enable our customers to: Isolate workloads, encrypt data, and control access securely. Detect vulnerabilities and threats across applications, containers, and infrastructure. Remediate risks proactively, leveraging intelligence from CVEs, CIS benchmarks, and threat modeling. We are investing heavily in advanced security systems that detect, analyze, and block malicious activity in real time, empowering our customers to build and scale confidently on Oracle Cloud.

Lead the design and development of large-scale, mission-critical security services within OCI, ensuring they are reliable, scalable, and secure by default. Define technical strategy and architecture for key areas such as identity, access control, data protection, threat detection, and vulnerability management. Drive end-to-end delivery of complex features, from ideation and design through development, testing, deployment, and operational support. Mentor and guide engineers across multiple teams, fostering technical growth, improving code quality, and raising the bar for design and execution. Champion engineering excellence by setting high standards for design, code, observability, automation, and operational readiness. Collaborate across functional teams (security, platform, compliance, product management) to align on strategy, resolve architectural challenges, and accelerate delivery. Continuously improve system reliability and performance through proactive observability, incident response, chaos engineering, and root cause analysis. Evaluate and adopt new technologies and patterns to improve security posture, performance, and developer productivity. Contribute to the broader OCI engineering community through leadership in design reviews, architecture discussions, and cross-org initiatives.

Career Level: IC4

Posted 1 month ago

10 - 13 years

18 - 25 Lacs

Bengaluru

Hybrid

Naukri logo

Hiring: Lead Site Reliability Engineer with the following skills and expertise.

What will this person do? Provide leadership in designing and implementing reliable, scalable, and secure infrastructure solutions. Develop and maintain observability solutions, ensuring visibility into system performance using native Azure Cloud solutions. Define and track SLIs, ensuring compliance with SLOs and SLAs. Lead incident response efforts, conduct root cause analysis, and implement preventive measures to minimize downtime. Automate infrastructure provisioning, configuration, and management using Terraform and Ansible. Build and maintain robust observability pipelines to support automated deployments and continuous monitoring practices. Continuously analyze system health and optimize performance by identifying and resolving bottlenecks. Work with our BCDR team to minimize business impact during failures and measure the quality of services. Work with the Cloud Governance team to monitor cloud infrastructure spending and implement cost-saving strategies. Implement centralized logging, metric collection, and distributed tracing for troubleshooting and debugging. Deploy, manage, and monitor containerized workloads. Maintain configuration consistency and compliance across cloud environments using tools like Ansible. Partner with software development teams to integrate reliability best practices into the application development lifecycle. Conduct detailed post-mortems, document learnings, and drive improvements to reduce future incidents. Develop automation scripts in Python, Bash, or other languages to reduce manual effort and improve efficiency. Provide mentorship to junior engineers, fostering a culture of learning and continuous technical growth. Research and evaluate new technologies, tools, and methodologies to improve system reliability and efficiency. Maintain detailed documentation on infrastructure, monitoring setups, incident responses, and best practices.

Qualifications: Bachelor's degree in Computer Science, Engineering, or a related field. 10+ years in Observability, DevOps, and Site Reliability Engineering (SRE). At least 2 years of experience in defining Observability KPIs for both on-premises and cloud environments. Strong experience with cloud platforms (AWS, Azure, GCP) and cloud-native technologies. Passion for automation, reducing toil, and implementing reliability-focused best practices. Deep knowledge of services and tools like Grafana, Power BI, Prometheus, Azure Monitor, Application Insights, and Azure Metrics. Expertise in Terraform, Ansible, Chef, CI/CD pipeline tools like GitHub Actions and Jenkins, and GitOps methodologies. Working understanding of load balancing, authentication (AAA), encryption, and network parameter monitoring. Strong troubleshooting skills and experience handling on-call incidents and post-mortem analysis. Ability to work cross-functionally, drive technical discussions, and mentor junior engineers. Ability to work in a dynamic team environment, with the time management skills to meet deadlines. Sense of ownership and pride in your performance and its impact on the company's success. Critical thinker with problem-solving skills. Good interpersonal and communication skills.

Posted 1 month ago

5 - 8 years

15 - 25 Lacs

Chennai, Bengaluru

Work from Office

Naukri logo

We are looking for a Senior Platform Engineer (Airflow & Control-M) with 5-10 years of experience to join our team in Bangalore or Chennai. The ideal candidate will have strong expertise in Airflow, Control-M, Kubernetes, observability (OpenTelemetry), Python, and Bash scripting. The role involves managing critical data workflows, enhancing platform automation, and ensuring system reliability and scalability. Excellent communication skills and hands-on experience in stabilizing production environments are essential.

Posted 1 month ago

8 - 12 years

16 - 27 Lacs

Kolkata

Work from Office

Naukri logo

Role: Observability Engineer (AWS)
Experience: 8+ years

Essential Skills (two top skills):
AWS Ecosystem: EKS, EC2, DynamoDB, Lambda, etc.
Dynatrace (or similar)
Monitoring: site, trend analysis, and log analysis.

Key Responsibilities: Design, implement, and maintain observability solutions using AWS and Dynatrace to monitor application performance and infrastructure health. Collaborate with development and operations teams to define observability requirements and ensure seamless integration of monitoring tools. Develop and manage dashboards, alerts, and reports to provide insights into system performance and user experience. Troubleshoot complex issues by analyzing logs, metrics, and traces to identify root causes and recommend solutions. Optimize existing monitoring frameworks to enhance visibility across cloud environments and applications. Stay updated on industry trends and best practices in observability, cloud technologies, and performance monitoring.

8+ years of proven experience as an Observability Engineer or in a similar role with a strong focus on AWS services. Proficiency in using Dynatrace for application performance monitoring and observability. Strong understanding of cloud architecture, microservices, containers, and serverless computing. Experience with scripting languages (e.g., Python, Bash) for automation tasks. Excellent problem-solving skills with the ability to work under pressure in a fast-paced environment. Strong communication skills to collaborate effectively with cross-functional teams.

Posted 1 month ago

10 - 20 years

25 - 35 Lacs

Pune, Bengaluru, Delhi / NCR

Work from Office

Naukri logo

Role & responsibilities:
• SRE Architect with experience running large Reliability & Observability programs for large, complex infrastructure deployments and distributed systems for major banking customers.
• Proficiency in using Application Performance Monitoring (APM) tools New Relic/Dynatrace for monitoring, logging, and tracing, and Splunk for log monitoring.
• Should have implemented solutions around Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for services.
• Understanding of software delivery life cycles, particularly Agile/Lean and DevOps.
• Proven experience in handling large-scale and growing infrastructure across data centers and heterogeneous cloud platforms.
• Expert-level hands-on knowledge of cloud platforms like PCF.

Preferred candidate profile:
• Understanding of software delivery life cycles, particularly Agile/Lean and DevOps.
• Proven experience in handling large-scale and growing infrastructure across data centers and heterogeneous cloud platforms.

Posted 1 month ago

8 - 13 years

30 - 45 Lacs

Bengaluru

Work from Office

Naukri logo

Drive SRE implementation and DevOps best practices. Reduce technical debt, automate reliability workflows, and ensure performance, scalability, and observability across cloud-based digital platforms.

Required candidate profile: Experienced SRE with deep knowledge of Azure cloud, CI/CD, observability, automation, and programming. Strong DevOps mindset, troubleshooting ability, and alignment with digital transformation goals.

Posted 1 month ago

7.0 - 12.0 years

12 - 22 Lacs

Pune

Work from Office

Naukri logo

Experience: 7+ years
Job Location: Pune
Notice Period: 30 days

Job Description:
AWS Ecosystem: EKS, EC2, DynamoDB, Lambda, etc.
Dynatrace (or similar): The Observability team should include some members with Dynatrace experience, while the rest can have experience with similar tools.
Monitoring: site, trend analysis, and log analysis.

Key Responsibilities: Design, implement, and maintain observability solutions using AWS and Dynatrace to monitor application performance and infrastructure health. Collaborate with development and operations teams to define observability requirements and ensure seamless integration of monitoring tools. Develop and manage dashboards, alerts, and reports to provide insights into system performance and user experience. Troubleshoot complex issues by analyzing logs, metrics, and traces to identify root causes and recommend solutions. Optimize existing monitoring frameworks to enhance visibility across cloud environments and applications. Stay updated on industry trends and best practices in observability, cloud technologies, and performance monitoring.

7+ years of proven experience as an Observability Engineer or in a similar role with a strong focus on AWS services. Proficiency in using Dynatrace for application performance monitoring and observability. Strong understanding of cloud architecture, microservices, containers, and serverless computing. Experience with scripting languages (e.g., Python, Bash) for automation tasks. Excellent problem-solving skills with the ability to work under pressure in a fast-paced environment. Strong communication skills to collaborate effectively with cross-functional teams.

Posted 1 month ago
