4 AWS SRE Jobs

JobPe aggregates results for easy application access, but you actually apply on the job portal directly.

10.0 - 20.0 years

30 - 40 Lacs

Hyderabad

Work from Office

Overview

We are looking for a seasoned Senior Manager of Site Reliability Engineering (SRE) to lead our AWS-focused SRE initiatives. In this role, you will be responsible for overseeing the reliability, scalability, and performance of critical applications and infrastructure hosted on AWS. You will lead a team of experienced SREs, drive strategic operational improvements, and ensure the seamless functioning of our cloud ecosystem to meet business and customer needs.

Responsibilities

Leadership and Team Management: Lead and mentor a team of SRE professionals, fostering a culture of innovation, collaboration, and accountability. Develop and implement career development plans, provide coaching, and facilitate knowledge-sharing within the team.

Operational Excellence: Drive the adoption of SRE principles, including SLAs, SLOs, and error budgets, to enhance system reliability and performance. Oversee incident management processes, ensuring timely resolution and comprehensive root cause analysis. Establish and monitor operational KPIs to measure and improve system availability and performance.

Automation and Tooling: Champion the use of automation to reduce manual processes, improve efficiency, and enhance system reliability. Implement and optimize Infrastructure as Code (IaC) using tools like Terraform, CloudFormation, or CDK.

AWS Infrastructure Management: Design, build, and maintain scalable and secure AWS-based infrastructure to support current and future workloads. Leverage AWS services such as EC2, RDS, Lambda, S3, CloudWatch, and others to enhance operational capabilities.

Collaboration and Stakeholder Engagement: Partner with engineering, product, and DevOps teams to align SRE initiatives with business objectives. Act as a key liaison between the SRE team and executive stakeholders, communicating updates on reliability and risks.

Risk and Security Management: Ensure compliance with security standards and best practices within AWS environments. Identify risks related to cloud infrastructure and implement strategies for mitigation.

Qualifications

Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
10+ years of experience in cloud-based infrastructure and operations, with at least 4 years in a leadership role.
Deep expertise in AWS services, architecture, and tools, including hands-on experience with core AWS services (e.g., EC2, ECS, Lambda, S3, VPC, IAM).
Proficiency in automation scripting (e.g., Python, Bash) and Infrastructure as Code (e.g., Terraform, CloudFormation).
Strong knowledge of monitoring and observability tools like CloudWatch, Prometheus, Grafana, or Datadog.
Proven experience managing large-scale production environments, incident response, and operational scaling.
Hands-on experience with CI/CD pipelines and DevOps methodologies.

Preferred Qualifications

AWS certifications, such as AWS Certified Solutions Architect (Professional) or AWS Certified DevOps Engineer.
Experience with Kubernetes (EKS) and containerization technologies like Docker.
Familiarity with FinOps principles for cost optimization in AWS environments.
Strong analytical skills and a data-driven approach to decision-making.
Exceptional communication, leadership, and stakeholder management abilities.

Posted 2 months ago

9.0 - 12.0 years

20 - 25 Lacs

Hyderabad

Work from Office

Designing, managing, and optimizing our cloud infrastructure to ensure high availability, reliability, and scalability of services. Architect, deploy, and maintain AWS infrastructure using Infrastructure-as-Code (IaC) tools such as Terraform or CloudFormation.

Required candidate profile: Experience in a Site Reliability Engineer or DevOps role, with a focus on AWS cloud infrastructure. Experience with AWS services such as EC2, S3, RDS, VPC, Lambda, CloudFormation, and CloudWatch.

Posted 3 months ago

12.0 - 20.0 years

25 - 40 Lacs

Hyderabad, Chennai, Bengaluru

Hybrid

10+ years of overall experience in the IT industry.
Experience in a technical architect role using service and hosting solutions such as public cloud IaaS and PaaS platforms.
Hands-on experience in cloud-native architecture design and implementation of distributed, fault-tolerant enterprise applications for the cloud.
Hands-on experience with Terraform, Ansible, GitHub, AWS CodePipeline, Puppet, and Chef.
Hands-on multi-tier architecting skills.
Sound knowledge of infrastructure design (compute, storage, network).
Hands-on cloud security experience (endpoint and portal security using AWS Security Hub, AWS Inspector, AWS WAF, AWS GuardDuty).
Hands-on expertise in Linux and Windows.
Strong knowledge of and hands-on experience with scripting languages: YAML, Python, Shell, Bash, PowerShell, etc.
Expertise in build tools like Jenkins to build reusable release pipelines, and in release and configuration management using Chef/Puppet/Ansible platforms.
Experience in at least one configuration management tool (Puppet, Ansible, Chef, Maven, etc.).
AWS Solutions Architect Professional.

Posted Date not available

4.0 - 8.0 years

20 - 27 Lacs

Hyderabad, Pune, Bengaluru

Hybrid

Role & responsibilities

Experience: 4-7 years. We need candidates who can join immediately or within 15 days; those with a longer notice period, please refrain from applying.

Technical Expertise Required

Primary skills required:

Strong knowledge of Linux/Unix systems and command line tools.
Proficiency in scripting languages such as Python, Shell, or Perl.
Experience with configuration management tools like Ansible, Puppet, or Chef.
Familiarity with cloud platforms like AWS, Azure, or Google Cloud.
Understanding of networking principles and protocols (TCP/IP, HTTP, DNS, etc.).
Knowledge of containerization technologies (Docker, Kubernetes) and orchestration tools.
Expertise in monitoring and logging tools such as Prometheus, Grafana, ELK stack, or Splunk. (Optional - But Good to Know)
Strong problem-solving and troubleshooting skills, with the ability to analyze and resolve complex technical issues.
Excellent communication and collaboration skills to work effectively with cross-functional teams.
Strong attention to detail and ability to work in a fast-paced, dynamic environment.
Terraform basic syntax and GitLab CI/CD configuration, pipelines, and jobs.
Cloud resource provisioning and configuration through CLI/API.
Understanding of how to run basic queries in log tools for general questions.
Operating system (Linux) configuration, package management, startup, and troubleshooting.
Block and object storage configuration.
Networking: VPCs, proxies, and CDNs.

Secondary skills required for the role:

Bachelor's degree in computer science, engineering, or a related field.
Proven experience as a Site Reliability Engineer or in a similar role.
Solid understanding of software development methodologies and DevOps principles.
Experience with agile and iterative development processes.
Certification in relevant technologies or frameworks is a plus (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator).
Familiarity with continuous integration/continuous deployment (CI/CD) pipelines.
Experience with source control systems such as Git or SVN.
Knowledge of security best practices and experience implementing security measures in a production environment.
Ability to work independently and handle multiple projects and priorities simultaneously.
Strong analytical and problem-solving skills, with a focus on continuous improvement and automation.

Role & Responsibilities of the Profile

Design and implement highly available and scalable systems, ensuring the reliability and performance of the company's website or application.
Collaborate with cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) for critical systems.
Monitor systems and applications, proactively identifying and resolving any performance bottlenecks or availability issues.
Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance.
Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents.
Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
Create and maintain documentation for system architecture, configuration, and troubleshooting procedures.
Perform capacity planning and resource allocation to ensure optimal system performance and scalability.
Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards.
Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering.

Objectives of this role

Run the production environment by monitoring availability and taking a holistic view of system health.
Build software and systems to manage platform infrastructure and applications.
Improve reliability, quality, and time-to-market of our suite of software solutions.
Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.
Provide primary operational support and engineering for multiple large-scale distributed software applications.

Posted Date not available
