Site Reliability Engineer - Terraform Emperen Technologies

5.0 - 10.0 years

8 - 12 Lacs

Jaipur

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 8 hours ago

Site Reliability Engineer - Terraform Emperen Technologies

5.0 - 10.0 years

8 - 12 Lacs

Bengaluru

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 8 hours ago

Site Reliability Engineer - Terraform Emperen Technologies

5.0 - 10.0 years

8 - 12 Lacs

Lucknow

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 day ago

Site Reliability Engineer - Terraform Emperen Technologies

5.0 - 10.0 years

8 - 12 Lacs

Hyderabad

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 day ago

Site Reliability Engineer - Terraform Emperen Technologies

5.0 - 10.0 years

8 - 12 Lacs

Kolkata

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 day ago

Site Reliability Engineer - Terraform Emperen Technologies

5.0 - 10.0 years

8 - 12 Lacs

Nagpur

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 day ago

Site Reliability Engineer - Terraform Emperen Technologies

5.0 - 10.0 years

8 - 12 Lacs

Mumbai

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 day ago

Site Reliability Engineer - Terraform Emperen Technologies

5.0 - 10.0 years

8 - 12 Lacs

Chandigarh

Work from Office

Job Description : We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 day ago

Site Reliability Engineer - Terraform Emperen Technologies

5.0 - 10.0 years

8 - 12 Lacs

Ahmedabad

Work from Office

Job Description : We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 day ago

Site Reliability Engineer - Terraform Emperen Technologies

5.0 - 10.0 years

7 - 12 Lacs

Pune

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 day ago

DevOps/Site Reliability Engineer - IAC Terraform Emperen Technologies

6.0 - 9.0 years

12 - 16 Lacs

Pune

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 day ago

Site Reliability Engineer _ Contract Role _ Pan India Cygnus Professionals

8.0 - 13.0 years

0 - 0 Lacs

Hyderabad, Chennai, Bengaluru

Hybrid

Job Title : Site Reliability Engineer (SRE) / Observability Engineer
Shift Type : Rotational shifts, including night shift and weekend availability
Experience : 7 years
Job Summary : We are looking for a skilled and adaptable Site Reliability Engineer (SRE) / Observability Engineer to join our dynamic project team. The ideal candidate will play a critical role in ensuring system reliability, scalability, observability, and performance while collaborating closely with development and operations teams. This position requires strong technical expertise, problem-solving abilities, and a commitment to 24/7 operational excellence.
Key Responsibilities (Site Reliability Engineering) : - Design, build, and maintain scalable and reliable infrastructure - Automate system provisioning and configuration using tools like Terraform, Ansible, Chef, or Puppet - Develop tools and scripts in Python, Go, Java, or Bash for automation and monitoring - Administer and optimize Linux/Unix systems with a strong understanding of TCP/IP, DNS, load balancers, and firewalls - Implement and manage cloud infrastructure across AWS or Kubernetes - Maintain and enhance CI/CD pipelines using tools like Jenkins and ArgoCD - Monitor systems using Prometheus, Grafana, Nagios, or Datadog and respond to incidents efficiently - Conduct postmortems and define SLAs/SLOs for system reliability and performance - Plan for capacity and performance using benchmarking tools, and implement autoscaling and failover systems
Key Responsibilities (Observability Engineering) : - Instrument services with relevant metrics, logs, and traces using OpenTelemetry, Prometheus, Jaeger, Zipkin, etc. - Build and manage observability pipelines using Grafana, ELK Stack, Splunk, Datadog, or Honeycomb - Work with time-series databases (e.g. InfluxDB, Prometheus) and log aggregation platforms - Design actionable alerts and dashboards to improve system observability and reduce alert fatigue - Partner with developers to promote observability best practices and define key performance indicators (KPIs)
Required Skills & Qualifications : - Proven experience as an SRE or Observability Engineer in complex production environments - Hands-on expertise in Linux/Unix systems and cloud infrastructure (AWS/Kubernetes) - Strong programming and scripting skills in Python, Go, Bash, or Java - Deep understanding of monitoring, logging, and alerting systems - Experience with modern Infrastructure as Code and CI/CD practices - Ability to analyze and troubleshoot production issues in real time
Additional Requirements : - Excellent communication skills to collaborate with cross-functional teams and stakeholders - Flexibility to work in rotational shifts, including night shifts and weekends, as required by project demands - A proactive mindset with a focus on continuous improvement and reliability
Mandatory Skills : Ansible, AWS Automation Services, AWS CloudFormation, AWS Code Pipeline, AWS CodeDeploy, AWS DevOps Services

Posted 1 week ago

Site Reliability Engineer Aspire Systems India Private Limited

2.0 - 6.0 years

1 - 3 Lacs

Singapore

On-site

Description : We are looking for a Site Reliability Engineer to join our team. The ideal candidate will have 2-6 years of experience in a similar role and will be responsible for ensuring the reliability, scalability, and performance of our systems.
Responsibilities : - Design, build, and maintain highly available systems - Monitor and respond to system alerts and incidents - Identify and resolve performance issues - Develop and implement automation tools for system management - Collaborate with development teams to ensure seamless deployment and operation of applications - Maintain documentation of system architecture and processes - Participate in on-call rotation for after-hours support
Skills and Qualifications : - Bachelor's degree in Computer Science or related field - 2-6 years of experience in a Site Reliability Engineer or similar role - Strong knowledge of Linux systems administration - Experience with configuration management tools such as Puppet, Chef, or Ansible - Experience with cloud infrastructure providers such as AWS, Azure, or GCP - Strong scripting skills in at least one language such as Python, Ruby, or Bash - Familiarity with monitoring tools such as Nagios, Zabbix, or Prometheus - Excellent problem-solving and troubleshooting skills - Strong communication and collaboration skills

Posted 2 weeks ago

Site Reliability Engineer(Bengaluru) Innova Solutions

5.0 - 10.0 years

15 - 25 Lacs

Bengaluru

Hybrid

Dear candidate, We are looking for an SRE (Site Reliability Engineer) for our Bangalore location. Requirement 1: SRE (GitLab) * GitLab setup & administration * Implement best practices to improve pipeline performance * AWS with Terraform coding * Linux administration & troubleshooting * Strong coding skills in any language (preferably Python) * Familiar with container technologies (Docker / Kubernetes) * Good knowledge of infrastructure and application monitoring (Prometheus / Grafana / CloudWatch) Requirement 2: SRE (Artifactory) * JFrog Artifactory setup & administration * JFrog Xray setup & administration * AWS with Terraform coding * Linux administration & troubleshooting * Strong coding skills in any language (preferably Python) * Familiar with container technologies (Docker / Kubernetes) * Good knowledge of infrastructure and application monitoring (Prometheus / Grafana / CloudWatch) Location: Bangalore (Whitefield) Work mode: Hybrid Interview mode: Face to face (Monday - Friday) If interested, please share your CV at ruchika.gahlawat@innovasolutions.com.

Posted 2 weeks ago

Senior Site Reliability Engineer Okta

5.0 - 10.0 years

4 - 8 Lacs

Bengaluru

Work from Office

We are looking for an experienced Senior BT Reliability Engineer to join our Business Technology team to maintain and continually improve our cloud-based services. The Site Reliability Engineering team in Bangalore is brand new, and builds foundational back-end infrastructure services and tooling for Okta's corporate teams. We enable teams to build infrastructure at scale and automate their software reliably and predictably. SREs are team players and innovators who build and operate technology using best practices and an agile mindset. We are looking for a smart, innovative, and passionate engineer for this role, someone who has a passion for designing and implementing complex cloud-based infrastructure. This is a new team, and the ideal candidate welcomes the challenge of building something new. They enjoy seeing their designs run at scale with automation, testing, and an excellent operational mindset. If you exemplify the ethic of "If you have to do something more than once, automate it," we want to hear from you!
Responsibilities : - Build and run development tools, pipelines, and infrastructure with a security-first mindset - Actively participate in Agile ceremonies, write stories, and support team members through demos, knowledge sharing, and architecture sessions - Promote and apply best practices for building secure, scalable, and reliable cloud infrastructure - Develop and maintain technical documentation, network diagrams, runbooks, and procedures - Design, build, run, and monitor Okta's IT infrastructure and cloud services - Drive initiatives to evolve our current cloud platforms to increase efficiency and keep them in line with current security standards and best practices - Recommend, develop, implement, and manage appropriate policy, standards, processes, and procedural updates - Work with software engineers to ensure that development follows established processes and works as intended - Create and maintain centralized technical processes, including container and image management - Provide excellent customer service to our internal users and be an advocate for SRE services and DevOps practices
Qualifications : - 5+ years of experience as an SRE, DevOps, Systems Engineer, or equivalent - Demonstrated ability to develop complex applications for cloud infrastructure at scale and deliver projects on schedule and within budget - Proficient in managing AWS multi-account environments and AWS authentication and governance, and in using the org management suite, including, but not limited to, AWS Orgs, AWS IAM, AWS Identity Center, and StackSets - Proficient with automating systems and infrastructure via Terraform - Proficient in developing applications running on AWS or other cloud infrastructure resources, including compute, storage, networking, and virtualization - Proficient with Git and building deployment pipelines using commercial tools, especially GitHub Actions - Proficient with developing tooling and automation using Python - Proficient with AWS container-based workloads and concepts, especially EKS, ECS, and ECR - Experience with monitoring tools, especially Splunk, CloudWatch, and Grafana - Experience with reliability engineering concepts and security best practices on public cloud platforms - Experience with image creation and management, especially for container and EC2-based workloads - Knowledgeable in Linux system administration - Familiar with configuration management tools, such as Ansible and SSM - Familiar with GitHub Actions Runner Controller self-hosted runners - Good communication skills, with the ability to influence others and communicate complex technical concepts to different audiences

Posted 2 weeks ago

Senior Site Reliability Engineer III - Ansible/Terraform GreyOrange

6.0 - 8.0 years

13 - 18 Lacs

Gurugram

Work from Office

Responsibilities : - Define and enforce SLOs, SLIs, and error budgets across microservices - Architect an observability stack (metrics, logs, traces) and drive operational insights - Automate toil and manual ops with robust tooling and runbooks - Own incident response lifecycle: detection, triage, RCA, and postmortems - Collaborate with product teams to build fault-tolerant systems - Champion performance tuning, capacity planning, and scalability testing - Optimise costs while maintaining the reliability of cloud infrastructure Must have Skills : - 6+ years in SRE/Infrastructure/Backend related roles using Cloud Native Technologies - 2+ years in SRE-specific capacity - Strong experience with monitoring/observability tools (Datadog, Prometheus, Grafana, ELK etc.) - Experience with infrastructure-as-code (Terraform/Ansible) - Proficiency in Kubernetes, service mesh (Istio/Linkerd), and container orchestration - Deep understanding of distributed systems, networking, and failure domains - Expertise in automation with Python, Bash, or Go - Proficient in incident management, SLAs/SLOs, and system tuning - Hands-on experience with GCP (preferred)/AWS/Azure and cloud cost optimisation - Participation in on-call rotations and running large-scale production systems Nice to have skills : - Familiarity with chaos engineering practices and tools (Gremlin, Litmus) - Background in performance testing and load simulation (Gatling, Locust, k6, JMeter)

Posted 2 weeks ago

Senior Site Reliability Engineer I - Marketplace Booking Holdings

5.0 - 8.0 years

4 - 7 Lacs

Bengaluru

Work from Office

Key Responsibilities :
Building Software Applications : - Is responsible to build software applications by using relevant development languages and applying knowledge of systems, services and tools appropriate for the business area and guide more junior members of the team in this topic. - Is responsible to refactor and simplify code by introducing design patterns when necessary and guide more junior members of the team in this topic. - Is responsible to ensure the quality of the application by following standard testing techniques and methods that adhere to the test strategy. - Is responsible to write readable and reusable code by applying standard patterns and using standard libraries. - Is responsible to maintain data security, integrity and quality by effectively following company standards and best practices.
Software Systems Design : - Is responsible to evaluate possible architecture solutions by taking into account cost, business requirements, technology requirements and emerging technologies. - Is responsible to describe the implications of changing an existing system or adding a new system to a specific area, by having a broad, high-level understanding of the infrastructure and architecture of our systems. - Is responsible to help grow the business and/or accelerate software development by applying engineering techniques (e.g. prototyping, spiking and vendor evaluation) and standards. - Is responsible to meet business needs by designing solutions that meet current requirements and are adaptable for future enhancements.
End to End System Ownership : - Is responsible to own a service end to end by actively monitoring application health and performance, setting and monitoring relevant metrics and acting accordingly when they are violated. - Is responsible to reduce business continuity risks and bus factor by applying state-of-the-art practices and tools, and writing the appropriate documentation such as runbooks and OpDocs. - Is responsible to reduce risk and obtain customer feedback by using continuous delivery and experimentation frameworks. - Is responsible to independently manage an application or service by working through deployment and operations in production and guide more junior members of the team in this topic. - Is responsible to maintain data security, integrity and quality by effectively following company standards and best practises.
Technical Incident Management : - Is responsible to address and resolve live production issues by mitigating the customer impact within SLA. - Is responsible to improve the overall reliability of systems by producing long term solutions through root cause analysis. - Is responsible to keep track of incidents by contributing to postmortem processes and logging live issues.
Automation and Toil Reduction : - Is responsible to ensure that infrastructure stays current by reducing technical debt, searching for bottlenecks and preparing for scaling. - Is responsible to reduce cost of operations and maintenance by leveraging new technologies and automation, and by partnering with vendors to ensure we stay current. - Is responsible to reduce human labour by writing small software features that address availability, scalability, latency and efficiency.
Monitoring and Alerting Improvements : - Is responsible to review and verify performance of production systems and network infrastructure by continuously monitoring appropriate observability metrics, business KPIs and capacity planning. - Is responsible to improve application reliability by partnering with development teams to advise on setting appropriate observability metrics.
Critical Thinking : - Is responsible to systematically identify patterns and underlying issues in complex situations, and to find solutions by applying logical and analytical thinking. - Is responsible to constructively evaluate and develop ideas, plans and solutions by reviewing them, objectively taking into account external knowledge, initiating 'SMART' improvements and articulating their rationale.
Continuous Quality and Process Improvement : - Is responsible to identify opportunities for process, system and structural improvements (i.e. performance gains) by examining and evaluating current process flows, methods and standards. - Is responsible to design and implement relevant improvements by defining adapted/new process flows, standards, and practices that enable business performance.
Effective Communication : - Has sufficient knowledge to deliver clear, well-structured, and meaningful information to a target audience by using suitable communication mediums and language tailored to the audience. - Has sufficient knowledge to achieve mutually agreeable solutions by staying adaptable, communicating ideas in clear coherent language and practising active listening. - Has sufficient knowledge to ask relevant (follow-up) questions to properly engage with the speaker and really understand what they are saying, by applying listening and reflection techniques.
Architectural Guidance : - Is responsible to advise product teams towards a technical solution that meets the functional, non-functional & architectural requirements by challenging the rationale for an application design and providing context in the wider architectural landscape. - Has sufficient knowledge to set a clear direction for a technical capability by evaluating and aligning the target architecture improvements, reframing architectural designs and decisions for varied stakeholders.
Coaching/Mentoring : - Has basic knowledge to coach, guide and improve the overall performance of stakeholders and colleagues at all levels, when appropriate, by sharing experience, knowledge and approaches to work.
Communication (Stakeholder : Type ; Frequency) : - Track members : Cooperation - Persuasion - Information ; Continuous - Product stakeholders : Cooperation - Persuasion ; Frequent - Peers : Cooperation - Persuasion ; Frequent
Level of Education : Master's degree
Years of relevant Job Knowledge : Advanced Knowledge (5 - 8 years)
Requirements of special knowledge/skills : - Building Software Applications - Software System Design - End to End System Ownership - Technical Incident Management - Operations (Automation & Toil) - Observability (Monitoring & Alerting) - Critical Thinking - Continuous Quality & Process Improvement - Effective Communication - Architectural Guidance - Coaching & Mentoring

Posted 3 weeks ago

Site Reliability Engineer - Cloud Platforms Agivant Technologies

7.0 - 12.0 years

18 - 22 Lacs

Pune

Work from Office

We are looking for a highly skilled Site Reliability Engineer (SRE) with strong engineering and architectural expertise to design, implement, and manage large-scale, mission-critical infrastructure across multiple data centers and cloud providers. As an SRE, you will be responsible for architecting and optimizing our global infrastructure, enabling development teams to roll out new features efficiently while maintaining high availability and reliability. You will be hands-on with automation, performance tuning, infrastructure scalability, and cloud-native technologies to ensure a seamless user experience for millions of customers. Key Responsibilities : 1. Architect and implement highly scalable, fault-tolerant, and distributed systems across multi-cloud (OCI, AWS, GCP) and on-premise environments using modern DevOps and SRE principles. 2. Design and deploy next-generation cloud infrastructure with a strong focus on automation, self-healing systems, and performance optimization. Develop and maintain infrastructure-as-code (IaC) using Terraform and configuration management tools such as Ansible and Puppet for automated provisioning and orchestration. 3. Build and optimize containerized environments using Kubernetes and Docker for seamless deployment and scaling. 4. Drive performance, scalability, and security improvements across our cloud and on-prem infrastructure, ensuring high availability and disaster recovery capabilities. Monitor, troubleshoot, and resolve complex system issues by implementing advanced observability solutions, logging, and real-time monitoring frameworks. 5. Develop and enforce SRE best practices, including SLI/SLO definition, capacity planning, and incident management strategies. 6. Eliminate toil and automate repetitive tasks using scripting languages such as Python, Golang, or Shell scripting to improve operational efficiency. 7. Collaborate closely with engineering, architecture, and security teams to improve system resiliency, optimize application performance, and streamline CI/CD workflows. Lead the transition of legacy systems to modern, cloud-native architectures, advocating for DevOps and infrastructure automation. 8. Participate in 24/7 on-call rotations, ensuring rapid response to critical incidents and driving post-mortem analysis for continuous improvement. Requirements : 1. 7+ years of hands-on experience in a Site Reliability Engineering (SRE) role, with a strong focus on designing, implementing, and managing cloud-native infrastructure. Proficient with any cloud platform (preferably OCI): not just operational experience but actual design and implementation expertise. 2. Proven experience in building, deploying, and optimizing infrastructure-as-code (IaC) using Terraform. 3. Strong automation mindset with proficiency in Ansible, Puppet, or other configuration management tools. 4. Hands-on experience with container orchestration using Kubernetes, Docker, and microservices architecture. 5. Advanced scripting and automation skills in Python, Golang, or Shell scripting to eliminate manual operations. 6. Working knowledge of load balancing technologies (HAProxy, Nginx, F5, Varnish, dnsdist) and web servers (Apache, Nginx). 7. Strong understanding of networking, distributed systems, and observability tools (Prometheus, Grafana, ELK stack, Datadog). 8. Experience in designing and implementing highly available, scalable, and secure architectures across cloud and hybrid environments. 9. AWS and/or GCP certifications are a plus but not required. 10. This is not a support-focused role; we are looking for engineers who have built, deployed, and optimized complex distributed systems from the ground up.

Posted 3 weeks ago

DevOps/Site Reliability Engineer Gemini Solutions

5.0 - 8.0 years

13 - 17 Lacs

Gurugram

Work from Office

POSITION SUMMARY : In this role, you will play a crucial part in shaping the firm's infrastructure reliability and efficiency by implementing robust Site Reliability Engineering practices. Your contribution will be pivotal in ensuring the availability, scalability, and performance of our systems and applications. Leveraging your strong technical skills and expertise in DevOps principles, you will work towards enhancing the reliability of our infrastructure and minimizing downtime, thus enabling the organization to deliver high-quality software with maximum efficiency. EXPERIENCE AND REQUIRED SKILL SETS : - Ensure 24/7 uptime and stability of production systems - Investigate and troubleshoot production issues - Collaborate with developers to optimize system performance - Participate in on-call rotation to provide 24/7 support for critical systems - Work on automation and enhancements to reduce manual processes / intervention - 5+ years of relevant experience in an SRE / Production / Product Support role, with a track record of implementing SRE practices - Basic understanding of cloud solutions from providers such as AWS or Azure - Basic to intermediate knowledge of scripting in Bash, Python, or PowerShell - Good presentation, communication and interpersonal skills, with the ability to collaborate effectively with cross-functional teams and stakeholders across different countries and cultures - Good problem-solving and troubleshooting skills - Continuous learning mindset and willingness to adapt to new technologies and industry trends - Good understanding of operating system commands (Linux) and SQL (ability to write and analyze queries and to deduce / build important information per requirement) - In-depth knowledge of the trading life cycle: the candidate should possess a comprehensive understanding of the trading life cycle, including order management, trade execution, settlement and post-trade processes. Familiarity with various financial products such as Equities, Derivatives, Currencies, Commodities, and FX is a plus. - Incident and problem management expertise: the candidate must demonstrate strong problem-solving skills and the ability to manage incidents frequently and efficiently within a fast-paced trading environment. This includes identifying, analyzing and resolving issues related to trading systems and processes, as well as collaborating with cross-functional teams to implement long-term solutions and improve operational efficiency. - Good understanding of tools : (a) Orchestration: Autosys / Airflow or Cron (b) Monitoring & Logging: PagerDuty, Prometheus & Grafana or Datadog, Splunk (c) Project Management / ITSM: ServiceNow (basic ability to navigate and create change tickets / incidents), Jira (basic ability to create Jira tickets and filter your work) EDUCATION : - Bachelor's or master's degree in Computer Science, Engineering, Software Engineering or a relevant field

Posted 4 weeks ago

Implementation Specialist - Grafana/Prometheus Steadfast It Consulting

5.0 - 7.0 years

3 - 7 Lacs

Pune

Remote

We are seeking a Grafana Implementation Expert with deep expertise in Grafana and Prometheus, focusing on core development and customization rather than SRE or DevOps responsibilities. This role requires a specialist in monitoring tools, responsible for designing, developing, and optimizing Grafana dashboards, plugins, and data sources to provide real-time observability and analytics. Key Responsibilities : - Develop, customize, and optimize Grafana dashboards with advanced visualizations, queries, and alerting mechanisms. - Integrate Grafana with Prometheus and other data sources (e.g. Loki, InfluxDB, Elasticsearch, MySQL, PostgreSQL, OpenTelemetry). - Extend Grafana capabilities by developing custom plugins, panels, and data sources using JavaScript, TypeScript, React, and Go. - Optimize Prometheus queries (PromQL) and storage solutions to ensure efficient data retrieval and visualization. - Automate dashboard provisioning using JSON, Terraform, or Grafana APIs for seamless deployment across environments. - Work closely with engineering teams to translate monitoring requirements into scalable and maintainable solutions. - Troubleshoot and enhance Grafana performance, including load balancing, scaling, and security hardening. - Implement advanced alerting mechanisms using Alertmanager, Grafana Alerts, and webhook integrations. - Stay updated on Grafana ecosystem advancements and contribute to best practices in observability tooling. - Document configurations, implementation guidelines, and best practices for internal stakeholders. 
Required Skills & Experience : - 5+ years of experience in monitoring and observability tools with a strong focus on Grafana and Prometheus. - Expertise in Grafana internals, including API usage, dashboard templating, and custom plugin development. - Strong hands-on experience with Prometheus, including metric collection, relabeling, and PromQL queries. - Proficiency in JavaScript, TypeScript, React, and Go for Grafana plugin and dashboard development. - Familiarity with infrastructure monitoring, including Kubernetes, cloud services (AWS, GCP, Azure), and system-level metrics. - Experience with time-series databases and log aggregation tools (e.g. Loki, Elasticsearch, InfluxDB). - Knowledge of security best practices in Grafana, including authentication, RBAC, and API security. - Experience with automation and infrastructure-as-code (IaC) for monitoring stack deployment. - Strong problem-solving skills with the ability to debug and optimize dashboards and alerting configurations. - Excellent communication and documentation skills to collaborate with cross-functional teams. Preferred Qualifications : - Grafana Certified Observability Engineer or equivalent certifications. - Experience contributing to open-source Grafana projects or plugin development. - Knowledge of distributed tracing tools like Jaeger or Zipkin. - Familiarity with service meshes (Istio, Linkerd) and their monitoring strategies. - This is a high-impact role focused on developing and enhancing Grafana-based monitoring solutions for enterprise-grade observability.

Posted 4 weeks ago

Site Reliability Engineer II/III - Google Cloud Platform Shopsense Retail Technologies Limited

3.0 - 8.0 years

16 - 20 Lacs

Mumbai

Work from Office

What will you do at Fynd? - Run the production environment by monitoring availability and taking a holistic view of system health. - Improve reliability, quality, and time-to-market of our suite of software solutions. - Be the first person to report an incident. - Debug production issues across services and levels of the stack. - Envision the overall solution for defined functional and non-functional requirements, and define the technologies, patterns and frameworks to realise it. - Build automated tools in Python / Java / GoLang / Ruby etc. - Help Platform and Engineering teams gain visibility into our infrastructure. - Lead design of software components and systems, to ensure availability, scalability, latency, and efficiency of our services. - Participate actively in detecting, remediating and reporting on production incidents, ensuring the SLAs are met and driving Problem Management for permanent remediation. - Participate in on-call rotation to ensure coverage for planned/unplanned events. - Perform other tasks such as load testing and generating system health reports. - Periodically check the readiness of all dashboards. - Engage with other Engineering organizations to implement processes, identify improvements, and drive consistent results. - Work with your SRE and Engineering counterparts to drive Game days, training and other response readiness efforts. - Participate in the 24x7 support coverage as needed. - Troubleshoot and solve complex issues with thorough root cause analysis on customer and SRE production environments. - Collaborate with Service Engineering organizations to build and automate tooling, implement best practices to observe and manage the services in production and consistently achieve our market-leading SLA. - Improve the scalability and reliability of our systems in production. - Evaluate, design and implement new system architectures. Some specific Requirements : - B.Tech. in Engineering, Computer Science, technical degree, or equivalent work experience - At least 3 years of managing production infrastructure. - Leading / managing a team is a huge plus. - Experience with cloud platforms like AWS, GCP. - Experience developing and operating large-scale distributed systems with Kubernetes, Docker and Serverless (Lambdas) - Experience in running real-time, low-latency, highly available applications (Kafka, gRPC, RTP) - Comfortable with Python, Go, or any relevant programming language. - Experience with monitoring and alerting using technologies like New Relic / Zabbix / Prometheus / Grafana / CloudWatch / Kafka / PagerDuty etc. - Experience with one or more orchestration and deployment tools, e.g. CloudFormation / Terraform / Ansible / Packer / Chef. - Experience with configuration management systems such as Ansible / Chef / Puppet. - Knowledge of load testing methodologies and tools like Gatling, Apache JMeter. - Ability to work your way around the Unix shell. - Experience running hybrid clouds and on-prem infrastructures on Red Hat Enterprise Linux / CentOS - A focus on delivering high-quality code through strong testing practices.

Posted 4 weeks ago

Lead Site Reliability Engineer - ITIL/ITSM Visionyle Solutions

6.0 - 10.0 years

13 - 17 Lacs

Hyderabad

Remote

Mode of Interview : 2-3 rounds (Virtual/In-person) Notice : Immediate - 15 Days Max Technical Skill Requirements : ServiceNow Business Analyst, ITIL, ITSM, Dashboard Creation, APM, Scripting, Datadog Role and Responsibilities : - 6+ years of experience as an SRE, with thorough knowledge of ITIL/ITSM processes - Certification in the ITIL v4 framework and deep knowledge of ITSM platforms preferable - Hands-on experience with the APM tool Datadog - Demonstrable ability to implement complex process workflows, and evidence performance through metrics-driven reporting - Strong understanding of IT Operations - Strong written and verbal communication skills with the ability to understand and present complex technical information in a clear and concise manner to a variety of audiences including executive leadership - Ability to develop strategic relationships with other teams, departments, business stakeholders, and 3rd parties - Ability to understand business requirements and define KPIs which can showcase stability of the application in production and give meaningful insights to business - Proven troubleshooting experience and a strong focus on incident reduction - Should be able to surface recurring issues and toil and suggest automations - Strong problem-solving skills and the ability to think quickly and execute on short time frames

Posted 4 weeks ago

Site Reliability Engineer APPLIED INFORMATION SCIENCES (AIS)

8.0 - 13.0 years

15 - 25 Lacs

Hyderabad

Work from Office

Greetings from AIS!! AIS (Applied Information Sciences) is a highly regarded software and systems engineering firm providing professional application development services to commercial and government clients since 1982. One of Microsoft's oldest and largest Managed Gold partners in the U.S., AIS is exclusively focused on building enterprise-class custom applications using Microsoft technologies. As we continue to experience extraordinary growth, we are seeking professionals to join our AIS Team in India. For more information, please visit: http://www.ais.com https://www.ais.com/blog/ Job Summary : Role: Site Reliability Engineer. Mode of Hire: Full-time / Contract opportunity. Responsibilities : The Site Reliability Engineer will bring enhanced reliability, performance, and security to the project. - Implementing comprehensive monitoring solutions to track system performance, detect anomalies, and prevent outages - Setting up real-time alerts to quickly respond to issues, minimizing downtime and ensuring continuous service availability - Automating routine tasks such as deployments, backups, and scaling, which reduces manual intervention and increases efficiency - Integrating Continuous Integration/Continuous Deployment (CI/CD) pipelines to streamline the development and deployment process - Optimizing the use of cloud resources to ensure cost-effectiveness and high performance - Implementing load balancing strategies to distribute traffic evenly and prevent bottlenecks - Applying security best practices to protect sensitive data and ensure compliance with regulatory requirements - Regularly scanning for and addressing vulnerabilities to maintain a secure environment - Developing and implementing incident response plans to quickly address and resolve issues - Establishing disaster recovery protocols to ensure data integrity and service continuity in case of failures - Working closely with development, operations, and business teams to align technical solutions with business goals - Creating detailed documentation and providing training to ensure all team members are equipped to handle the system Requirements : - Oracle Cloud Infrastructure experience - Proficiency in Oracle databases, including performance tuning and optimization - Scripting skills in JSON and Python - Familiarity with CI/CD pipelines to ensure smooth deployments - Understanding of security principles and practices to protect data and systems - Knowledge of regulatory requirements and how to implement them within Oracle Cloud - Ability to work effectively with cross-functional teams, including developers and operations - Communication skills: strong verbal and written communication skills to articulate technical issues and solutions If you are interested, please reply to meghana.mandhala@ais.com Thanks & Regards, Meghana Reddy M, Sr. Talent Acquisition Business Partner

Posted 4 weeks ago

Senior Network Operations Engineer (Automation & TEM) Consult Asia

9.0 - 14.0 years

20 - 35 Lacs

Bengaluru

Work from Office

Lead automation and expense management initiatives across global network platforms. Ensure secure, cost-effective operations, enhance reliability via SRE practices, and oversee vendor TEM performance, reporting, and billing accuracy. Required Candidate Profile : Experience in network automation, CI/CD, and cost governance. Skilled in SRE, telecom expense management (TEM), circuit cleanup, vendor coordination, and performance reporting using Power BI and Microsoft 365.

Posted 1 month ago

Lead Network & Voice Engineer (RF, Telephony, Mobility & SRE) Consult Asia

10.0 - 18.0 years

30 - 45 Lacs

Bengaluru

Work from Office

Lead and support RF, Voice/IPT, telephony, and mobile infrastructure globally. Drive innovation, reliability, and automation across network platforms, ensuring secure, scalable, and high-performance communication systems. Required Candidate Profile : Experienced in RF design, VoIP/IPT systems, UC tools, wireless/mobility, and SRE practices. Skilled in Tier-3 support, automation, and vendor management.

Posted 1 month ago

28 Site Reliability Jobs
