Senior Principal Site Reliability Engineer

15 - 20 years

17 - 22 Lacs

Posted: 1 day ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

  • F5xc SRE: Work as a hands-on SRE focused on automation and toil reduction, and participate in Ops cycles to support our product.
  • Perform on-call support on a rotation basis, resolving issues promptly and ensuring operational excellence in managing and maintaining distributed networking and security products.
  • Easy-to-Use Automation: Continue to grow the infra-automation (k8s, ArgoCD, Helm Charts, Golang services, AWS, GCP, Terraform) with a focus on ease of configuration.
  • Environment Stability using Observability: Create and continue to evolve existing Observability (metrics & alerts) and participate in regular monitoring of infrastructure for stability.
  • Collaborative Engagement: Collaborate closely with application owners and SRE team members as part of roadmap execution and continuous improvement of existing systems.
  • Scalable & Resilient Systems: Design & deploy systems/infra that are highly available and resilient for the configured failure domains.
  • Design systems using strong security principles with security by default.

The Job Description is intended to be a general representation of the responsibilities and requirements of the job. However, the description may not be all-inclusive, and responsibilities and requirements are subject to change.

Knowledge, Skills and Abilities

  • Hands-on experience with Cortex and related observability tools, including Loki, Tempo, and Prometheus integration, for scalable, multi-tenant monitoring systems.
  • Proficient in deploying and managing Cortex in microservice environments, including configuration of distributors, ingesters, queriers, and store-gateways for high availability and performance.
  • Experienced with Grafana Mimir, including cluster setup, alerting, rule evaluation, and long-term metric storage at scale.
  • Skilled in optimizing Cortex/Mimir query performance, tuning compaction, and managing sharding/replication for massive telemetry workloads.
  • Familiar with integrating Cortex/Mimir with Grafana dashboards, Thanos, or Prometheus Remote Write to support observability-as-a-service use cases.
  • Elasticsearch: Deep understanding of indexing strategies, query optimization, cluster management, and tuning for high-throughput use cases. Familiarity with slow query analysis, scaling, and shard management.
  • ClickHouse: Proven experience in designing and managing OLAP workloads, optimizing query performance, and implementing efficient table engines and materialized views.
  • Apache Kafka: Expertise in event streaming architecture, topic design, producer/consumer configuration, and handling high-volume, low-latency data pipelines. Experience with Kafka Connect and Schema Registry is a plus.
  • Vector (Datadog/Timber.io/Logs): Proficiency in configuring Vector for observability pipelines, including log transformation, enrichment, and routing to multiple sinks (e.g., Elasticsearch, S3, ClickHouse).
  • Hands-on programming experience in at least one language (Python or Golang), plus shell scripting.
  • Strong networking fundamentals and experience dealing with different layers of the networking stack.
  • SRE/DevOps on Linux & Kubernetes: Demonstrate excellent, hands-on knowledge of deploying workloads and managing their lifecycle on Kubernetes, with practical experience debugging issues.
  • Experience in upgrading workloads for SaaS Services without downtime.
  • On-call experience managing everyday Ops for production environments. Experience in production alert management and using dashboards to debug issues.
  • GitOps: Experience with Helm charts/kustomizations and GitOps tools like ArgoCD/FluxCD.
  • CI/CD: Experience working with/designing functional CI/CD systems.
  • Cloud Infrastructure: Prior experience in deploying workloads and managing their lifecycle on any cloud provider (AWS/GCP/Azure).

Qualifications

  • Typically requires at least 15 years of related experience with a bachelor's degree, 12+ years with a master's degree, or 10+ years with a PhD, or equivalent experience.
  • Excellent organizational agility and communication skills throughout the organization.

Environment

  • Empowered Work Culture: Experience an environment that values autonomy, fostering a culture where creativity and ownership are encouraged.
  • Continuous Learning: Benefit from the mentorship of experienced professionals with solid backgrounds across diverse domains, supporting your professional growth.
  • Team Cohesion: Join a collaborative and supportive team where you'll feel at home from day one, contributing to a positive and inspiring workplace.
