Posted: 5 hours ago | Platform: Foundit

Work Mode

On-site

Job Type

Full Time

Job Description

Key Responsibilities:

  • Design, build, and maintain observability platforms including monitoring, logging, tracing, and alerting systems.
  • Implement and optimize metrics collection using tools like Prometheus, Grafana, OpenTelemetry, or similar.
  • Develop and maintain centralized logging infrastructure (e.g., Datadog, OpenTelemetry, Splunk, or Google Cloud Logging).
  • Implement distributed tracing solutions using tools such as Jaeger, Zipkin, AppDynamics, or OpenTelemetry.
  • Collaborate with engineering teams to define SLIs, SLOs, and alerting thresholds.
  • Automate observability workflows and integrate observability into CI/CD pipelines.
  • Analyze and interpret telemetry data to proactively identify system issues and performance bottlenecks.
  • Provide training and documentation to teams on best practices in observability.
  • Continuously evaluate and adopt new observability technologies and practices.

Tools & Technologies:

  • Skilled in AppDynamics, Splunk, ThousandEyes, and ITRS for instrumentation, monitoring, alerting, and incident response.
  • Deep hands-on knowledge of Terraform, Kubernetes (GKE), and GitLab CI/CD.
  • Familiar with modern observability practices and tools such as OpenTelemetry, Grafana, and Datadog.
  • Strong knowledge of data platforms: BigQuery, Cassandra, Kafka, PostgreSQL, MySQL.
  • Experience with AI/ML-based operations tools for automation, anomaly detection, and predictive alerting.

Qualifications:

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
  • Proven experience as an SRE or DevOps engineer, particularly on Google Cloud Platform (GCP).
  • Expertise in designing and managing observability platforms and tools.
  • Hands-on experience with monitoring systems like Prometheus, Grafana, Datadog, New Relic, etc.
  • Proficient in logging solutions such as ELK, Splunk, Fluentd, or Google Cloud Logging.
  • Familiarity with distributed tracing tools like OpenTelemetry, Jaeger, or Zipkin.
  • Strong scripting and automation skills using Python, Go, Bash, or similar.
  • Experience with cloud platforms (AWS, GCP, Azure) and their observability services.
  • Solid understanding of Kubernetes and observability in containerized environments.
  • Deep knowledge of networking, application performance, and distributed systems.
  • Exposure to AI/ML-based observability or anomaly detection tools.
  • Excellent troubleshooting, debugging, and analytical capabilities.
  • Strong communication and cross-team collaboration skills.
