3.0 - 8.0 years
7 - 12 Lacs
Gurugram
Work from Office
Responsible for IT Infrastructure cross-platform technology areas, demonstrating design and build expertise. Responsible for developing, architecting, and building AWS Cloud services following best practices, blueprints, and patterns, with high availability and multi-region disaster recovery. Strong communication and collaboration skills.

Required education: Bachelor's Degree
Preferred education: Master's Degree

Required technical and professional expertise:
- BE / B.Tech in any stream, or M.Sc. (Computer Science/IT) / M.C.A., with a minimum of 3-5+ years of experience
- 3+ years of relevant experience in Python/Java, AWS, and Terraform (IaC)
- Experience with Kubernetes, Docker, and shell scripting
- Strong Python scripting proficiency, beyond writing small scripts

Preferred technical and professional experience:
- Experience using DevOps tools in a cloud environment, such as Ansible, Artifactory, Docker, GitHub, Jenkins, Kubernetes, Maven, and SonarQube
- Experience installing and configuring application servers such as JBoss, Tomcat, and WebLogic
- Experience using monitoring solutions like CloudWatch, ELK Stack, and Prometheus
Posted 1 month ago
2.0 - 5.0 years
4 - 8 Lacs
Bengaluru
Work from Office
Review and implement functional business requirements and non-functional technical requirements. Translate business requirements into technical design documents and drive implementation with developers. Research and analyze new technologies (e.g., libraries, IDEs, tools). Develop high-level architecture and detailed design for the backend application stack. Assist engineering and operational teams in debugging critical production problems. Perform application code reviews and ensure creation and maintenance of appropriate artifacts for architecture and design work. Develop back-end portions of web services, with a primary focus on building backend REST API services. Implement server-side application logic and design architectures. Create and consume REST services. Shift between multiple projects and technologies. Write clean code and test it throughout the development process to ensure quality is up to standard. Work on software used by millions of people around the world, a challenge you are willing to tackle. Perform peer reviews and mentor the team to evolve into backend developers. Encourage a self-motivated squad model of working, handling design, development, test, and operations for the microservices.

Required education: Bachelor's Degree

Required technical and professional expertise:
- Kubernetes: deep knowledge of Kubernetes architecture, pods, deployments, services, and persistent volumes
- Storage classes and volumes: how Kubernetes manages persistent storage and snapshots
- Networking basics: understanding of Kubernetes networking
- Container Storage Interface (CSI): familiarity with how storage plugins work in Kubernetes
- CI/CD pipelines: integrating backup/restore into automation pipelines using Jenkins, GitHub Actions, Travis CI, etc.
- Scripting: proficiency in Bash, Python, or Go for writing automation scripts
- Disaster recovery: designing and implementing DR solutions for containerized environments
- Data replication: understanding of synchronous and asynchronous replication techniques
- Access control: implementing RBAC (Role-Based Access Control) in Kubernetes

Good to have:
- Compliance knowledge: GDPR, HIPAA, or other data protection regulations relevant to backup data
- Monitoring and logging: using tools like Prometheus, Grafana, or ICD to monitor backup jobs and system health
- Backup tools: experience with tools like Velero, Kasten K10, rsync, Restic, or Portworx for Kubernetes
- 5+ years of experience in backend services development and microservices architecture
- Proven experience implementing distributed applications in a container environment (Docker/Kubernetes), along with considerable experience configuring and administering Linux (or other Unix-like) systems
- Software engineering experience designing enterprise cloud applications with Go, C, C++, Python, etc.
- Proven experience in REST API development (REST / RESTful APIs)
- Expertise in defining business architecture, business process definition and modelling, use cases, and requirements definition, along with associated best-practice processes for defining these artifacts
- Proven proficiency in grasping requirements and building illustrative features with minimal specifications
- Experience working in agile development environments

Preferred technical and professional experience:
- Understanding of networking concepts and experience in network development
- Understanding of cloud storage concepts and experience in cloud storage development
- Knowledge of security and compliance standards and requirements
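The RBAC item above (Role-Based Access Control in Kubernetes) can be illustrated with a minimal, hypothetical manifest: a namespaced Role granting read-only access to backup-relevant resources, bound to a service account. All names here are illustrative assumptions, not details from the listing.

```yaml
# Hypothetical sketch: read-only access for a backup operator.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: backup-reader          # illustrative name
  namespace: backups
rules:
  - apiGroups: [""]
    resources: ["pods", "persistentvolumeclaims"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: backup-reader-binding
  namespace: backups
subjects:
  - kind: ServiceAccount
    name: backup-operator      # illustrative service account
    namespace: backups
roleRef:
  kind: Role
  name: backup-reader
  apiGroup: rbac.authorization.k8s.io
```

A Role (rather than a ClusterRole) keeps the grant scoped to a single namespace, which is the usual least-privilege starting point for a backup job.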
Posted 1 month ago
2.0 - 7.0 years
3 - 7 Lacs
Bengaluru
Work from Office
At IBM, we are driven to shift our technology to an as-a-service model and to help our clients transform themselves to take full advantage of the cloud. With industry leadership in AI, analytics, security, commerce, and quantum computing, and with unmatched hardware and software design and industrial research capabilities, no other company is as well positioned to address the full opportunity of enterprise cloud computing. We are looking for a backend developer to join our IBM Cloud VPC Observability team. This team is part of the IBM Cloud VPC Service, dedicated to ensuring that the IBM Cloud is at the forefront of reliable enterprise cloud technology. We are building observability platforms to deliver performance, reliability, and predictability for our customers' most demanding workloads, at global scale and with leadership efficiency, resiliency, and security. In this role, you will be responsible for producing and enhancing features that collect, transform, and surface data on the various components of our cloud. The ability to take in requirements on an agile basis and work autonomously with a high-level perspective is a must. You understand cloud-native concepts and have experience with highly tunable and scalable Kubernetes-based cloud deployments. You will participate in the design of the service, writing tools and automation, building containers, developing tests, determining monitoring best practices, and handling complex escalations. If you are the kind of person who is collaborative, able to handle responsibility, and enjoys not only sharing a vision but getting your hands dirty to be sure that the vision is made a reality in a fast-paced, challenging environment, then we want to talk to you!
Required education: Bachelor's Degree

Required technical and professional expertise:
- Bachelor's in Engineering, Computer Science, or relevant experience
- 2+ years of experience and expertise in at least one programming language: Python, Go, or Node.js
- 1+ years of experience developing and deploying applications on Kubernetes and containerization technologies like Docker
- 2+ years of familiarity with working in a CI/CD environment
- 2+ years of experience developing and operating highly available, distributed applications in production environments on Kubernetes
- Experience building automated tests and handling customer escalations
- 1+ years of experience managing service dependencies via Terraform or Ansible
- At least 2 years of experience coding and troubleshooting applications written in Go, Python, Node.js, or Express.js
- 1+ years of experience operating with secure principles
- At least 3 years of experience with microservice development
- At least 1 year of experience with NoSQL database systems such as MongoDB
- At least 1 year of experience operating, configuring, and developing with caching systems like Redis
- Proven understanding of REST principles and architecture
- Familiarity with cloud services (IBM Cloud, GCP, AWS, Azure)

Preferred technical and professional experience:
- Advanced experience with Kubernetes
- Experience with development on PostgreSQL, Kafka, Elastic, MySQL, Redis, or MongoDB
- 2 years of experience managing Linux machines using configuration management (e.g., Chef, Puppet, Ansible); Debian experience is preferred
- 2+ years of experience automating with scripting languages like Python and Shell
- Experience troubleshooting, using, and configuring Linux systems
- 2+ years of experience with infrastructure automation
- 2+ years of experience using monitoring tooling like Grafana and Prometheus
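The Redis caching requirement above is commonly applied as the cache-aside pattern: check the cache first, fall back to the backing store on a miss, then populate the cache. Here is a minimal sketch in which a plain dict stands in for a Redis client; all names are illustrative, not from the listing.

```python
# Minimal cache-aside sketch. A plain dict stands in for Redis here;
# a real service would swap it for a redis client with a TTL.

def make_cached_fetch(backing_store, cache):
    """Return a fetch function that checks the cache before the store."""
    def fetch(key):
        if key in cache:                  # cache hit: skip the store
            return cache[key]
        value = backing_store(key)        # cache miss: load from source
        cache[key] = value                # populate for later reads
        return value
    return fetch

calls = []
def slow_lookup(key):
    calls.append(key)                     # track expensive reads
    return key.upper()

cache = {}
fetch = make_cached_fetch(slow_lookup, cache)
print(fetch("user:1"))   # miss: hits the store
print(fetch("user:1"))   # hit: served from cache
print(len(calls))        # the store was only called once
```

In production the same shape holds; only the cache object changes, and eviction/TTL policy becomes the interesting design decision.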
Posted 1 month ago
10.0 - 15.0 years
8 - 13 Lacs
Bengaluru
Work from Office
Understand product vision and business needs to define product requirements and product architectural solutions. Use tools and methodologies to create representations of the functions and user interface of the desired product. Develop high-level product specifications with attention to system integration and feasibility. Define all aspects of development, from appropriate technology and workflow to coding standards. Communicate all concepts and guidelines successfully to the development team. Ensure software meets all requirements of quality, security, modifiability, extensibility, etc. Collaborate with other professionals to determine functional and non-functional requirements for new software or applications. Provide support for production escalations and problem resolution for customers. Analyse requirements and design, develop, and maintain software products in alignment with the technology strategy of the organization. Participate in technical reviews of requirements, specifications, designs, code, and other artifacts. Ensure commitments are agreed, reviewed, and met. Learn new skills and adopt new practices readily in order to develop innovative and cutting-edge software products that maintain the company's technical leadership position. Plan, develop, and manage the infrastructure to enable strategic and effective use of tools. Lead the evaluation and evolution of tools, technologies, and programs with input from internal teams and external developers. Proactively identify issues and improvement opportunities. Direct resources to diagnose and resolve complex system, application software, security, and related problems that impact system availability.

Required education: Bachelor's Degree

Required technical and professional expertise:
1. Requires 10-15 years of experience.
2. Experience in Python/Go and REST APIs.
3. Ability to write shell scripts for automation.
4. Experience in cloud services and technologies such as VPC, gateways, NACLs, and security groups.
5. Experience in network debugging and network routing protocols such as BGP, IS-IS, and others.
6. Experience in DevOps and Site Reliability Engineering.
7. Understanding of microservice architecture, Docker, Kubernetes, and other cloud-native technologies.
8. Debugging and monitoring knowledge of cloud-native applications using DevOps tools such as Prometheus, New Relic, Instana, and others.
9. Experience in building, architecting, and designing/implementing highly distributed, global, cloud-based systems.
10. Knowledge of technology solutions and the ability to learn, understand, and work quickly with new and emerging technologies, methodologies, and solutions in the cloud technology space.
11. Ability to deliver results and work cross-functionally; ability to engage and influence audiences and identify expansion engagements.

Preferred technical and professional experience:
- Networking protocol knowledge (TCP/IP, iptables, routing models)
- Cloud concepts: VPC, subnets, floating IPs
Posted 1 month ago
5.0 - 10.0 years
3 - 7 Lacs
Bengaluru
Work from Office
We are looking for a Software Developer with container platform and systems-level experience to join our Fabric Development team in Bangalore, India. We seek individuals who innovate and share our passion for winning in the cloud marketplace. The Fabric Development team is dedicated to ensuring that the IBM Cloud is at the forefront of cloud technology, from bootstrapping data centres, to application architecture, to flexible infrastructure services. We are running IBM's next-generation cloud platform to deliver performance and predictability for our customers' most demanding workloads, at global scale and with leadership efficiency, resiliency, and security. It is an exciting time, and as a team we are driven by this incredible opportunity to thrill our clients.

- Design and develop innovative, company- and industry-impacting services using open-source and commercial technologies at scale
- Design and architect enterprise solutions to complex problems
- Present technical solutions and designs to the engineering team
- Follow compliant procedures and secure engineering best practices
- Collaborate on and review technical designs with architecture and offering management
- Take ownership of, and keen involvement in, projects that vary in size and scope depending on requirements
- Write and execute unit, functional, and integration test cases

Required education: Bachelor's Degree
Preferred education: Bachelor's Degree

Required technical and professional expertise:
- Bachelor's degree in Computer Science, Information Technology, or a related field
- 5+ years of experience as a software developer, with a focus on Python and Ansible
- 3+ years of experience in Ansible for automation and configuration management
- Strong Python programming skills for scripting and automation tasks
- 3+ years of in-depth knowledge of networking protocols and security principles; experience with security tools and best practices
- 3+ years of experience with CI/CD pipelines and Ansible CI/CD practices
- 3+ years of experience with Kubernetes and Docker deployments
- Familiarity with Agile development methodologies
- Experience with cloud platforms
- Excellent problem-solving skills and attention to detail
- Strong communication and team collaboration skills
- Demonstrated skills in troubleshooting, debugging, and maintaining code

Preferred technical and professional experience:
- Familiarity with CI/CD pipelines
- Experience with version control systems (e.g., Git)
- Experience with Ansible Collections and writing custom Ansible modules
- Experience with monitoring and logging tools for load balancers (Prometheus, Grafana, ELK Stack)
- An understanding of, and hands-on experience with, networking methodologies
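The Ansible automation and configuration-management work described above can be sketched as a minimal playbook. The host group and package names are hypothetical placeholders, not details from the listing.

```yaml
# Hypothetical sketch: keep time sync installed and running on a host group.
- name: Ensure baseline packages and service state
  hosts: fabric_nodes        # illustrative inventory group
  become: true
  tasks:
    - name: Install chrony for time sync
      ansible.builtin.package:
        name: chrony
        state: present

    - name: Ensure chrony is running and enabled
      ansible.builtin.service:
        name: chronyd
        state: started
        enabled: true
```

Declaring desired state (`present`, `started`, `enabled`) rather than imperative steps is what makes such playbooks safe to re-run, which is the property CI/CD pipelines rely on.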
Posted 1 month ago
3.0 - 5.0 years
4 - 8 Lacs
Mumbai
Work from Office
Practical experience with containerization and clustering (Kubernetes, OpenShift, Rancher, Tanzu, GKE, AKS, EKS, etc.). Version control system experience (e.g., Git, SVN). Experience implementing CI/CD (e.g., Jenkins). Experience with configuration management tools (e.g., Ansible, Chef). Container registry solutions (Harbor, JFrog, Quay, etc.). Good understanding of Kubernetes networking and security best practices. Monitoring tools like Datadog, or other open-source tools like Prometheus, Nagios, and ELK. Mandatory skills: hands-on experience with Kubernetes and Kubernetes networking.
Posted 1 month ago
5.0 - 8.0 years
20 - 30 Lacs
Hyderabad
Work from Office
About the Role
We are looking for a highly skilled Site Reliability Engineer (SRE) to lead the implementation and management of our observability stack across Azure-hosted infrastructure and .NET Core applications. This role will focus on configuring and managing OpenTelemetry, Prometheus, Loki, and Tempo, along with setting up robust alerting systems across all services, including Azure infrastructure and MSSQL databases. You will work closely with developers, DevOps, and infrastructure teams to ensure the performance, reliability, and visibility of our .NET Core applications and cloud services.

Key Responsibilities
- Observability platform implementation: design and maintain distributed tracing, metrics, and logging using OpenTelemetry, Prometheus, Loki, and Tempo. Ensure complete instrumentation of .NET Core applications for end-to-end visibility. Implement telemetry pipelines for application logs, performance metrics, and traces.
- Monitoring and alerting: develop and manage SLIs, SLOs, and error budgets. Create actionable, noise-free alerts using Prometheus Alertmanager and Azure Monitor. Monitor key infrastructure components, applications, and databases with a focus on reliability and performance.
- Azure and infrastructure integration: integrate Azure services (App Services, VMs, Storage, etc.) with the observability stack. Configure monitoring for MSSQL databases, including performance tuning metrics and health indicators. Use Azure Monitor, Log Analytics, and custom exporters where necessary.
- Automation and DevOps: automate observability configurations using Terraform, PowerShell, or other IaC tools. Integrate telemetry validation and health checks into CI/CD pipelines. Maintain observability as code for repeatable deployments and easy scaling.
- Resilience and reliability engineering: conduct capacity planning to anticipate scaling needs based on usage patterns and growth. Define and implement disaster recovery strategies for critical Azure-hosted services and databases. Perform load and stress testing to identify performance bottlenecks and validate infrastructure limits. Support release engineering by integrating observability checks and rollback strategies in CI/CD pipelines. Apply chaos engineering practices in lower environments to uncover potential reliability risks proactively.
- Collaboration and documentation: partner with engineering teams to promote observability best practices in .NET Core development. Create dashboards (Grafana preferred) and runbooks for system insights and incident response. Document monitoring standards, troubleshooting guides, and onboarding materials.

Required Skills and Experience
- 4+ years of experience in SRE, DevOps, or infrastructure-focused roles
- Deep experience with .NET Core application observability using OpenTelemetry
- Proficiency with Prometheus, Loki, Tempo, and related observability tools
- Strong background in Azure infrastructure monitoring, including App Services and VMs
- Hands-on experience monitoring MSSQL databases (deadlocks, query performance, etc.)
- Familiarity with Infrastructure as Code (Terraform, Bicep) and scripting (PowerShell, Bash)
- Experience building and tuning alerts, dashboards, and metrics for production systems

Preferred Qualifications
- Azure certifications (e.g., AZ-104, AZ-400)
- Experience with Grafana, Azure Monitor, and Log Analytics integration
- Familiarity with distributed systems and microservice architectures
- Prior experience in high-availability, regulated, or customer-facing environments
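The SLI/SLO and error-budget responsibility above reduces to simple arithmetic: an SLO leaves a fixed slice of the measurement window as permitted unavailability. A minimal sketch (the SLO value and window are illustrative assumptions):

```python
# Error-budget arithmetic for an availability SLO. For a 99.9% SLO
# over a 30-day window, the budget is the 0.1% of the window during
# which the service may be unavailable without breaching the SLO.

def error_budget_minutes(slo, window_days):
    """Allowed downtime (in minutes) for a given SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes

budget = error_budget_minutes(0.999, 30)
print(round(budget, 1))   # → 43.2 minutes of downtime per 30 days
```

Alerting policy then keys off how fast the budget is being consumed, which is what makes alerts "actionable and noise-free" rather than threshold spam.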
Posted 1 month ago
3.0 - 8.0 years
0 - 1 Lacs
Bangalore Rural, Bengaluru
Hybrid
You Lead the Way. We've Got Your Back. With the right backing, people and businesses have the power to progress in incredible ways. When you join Team Amex, you become part of a global and diverse community of colleagues with an unwavering commitment to back our customers, communities, and each other. Here, you'll learn and grow as we help you create a career journey that's unique and meaningful to you, with benefits, programs, and flexibility that support you personally and professionally. At American Express, you'll be recognized for your contributions, leadership, and impact; every colleague has the opportunity to share in the company's success. Together, we'll win as a team, striving to uphold our company values and powerful backing promise to provide the world's best customer experience every day. And we'll do it with the utmost integrity, and in an environment where everyone is seen, heard, and feels like they belong. Join #TeamAmex and let's lead the way together.

Key Responsibilities:
- SRE strategy and leadership: develop and implement a comprehensive SRE strategy aligned with the company's goals and objectives. Lead a team of SRE professionals to drive the reliability, performance, and scalability of GRC technology solutions.
- Observability and monitoring: establish observability practices to ensure real-time insights into system performance, availability, and customer experience. Implement monitoring tools, metrics, and dashboards to proactively identify and address potential issues.
- Production support optimization: lead all aspects of the end-to-end production support process, including incident management, problem resolution, and service-level agreement (SLA) compliance. Drive continuous improvement initiatives to enhance operational effectiveness and reduce mean time to resolution (MTTR).
- GRC customer journeys: collaborate with multi-functional teams to enhance customer journeys through seamless and reliable technology experiences.
- Reliability engineering best practices: promote and implement standard methodologies, including error budgeting, chaos engineering, and disaster recovery planning. Cultivate a culture of resilience and reliability within technology.
- Automation and efficiency: champion automation initiatives to streamline operational workflows, deployment processes, and incident response tasks. Leverage automation tools and orchestration to improve reliability and reduce manual intervention.

Qualifications:
- 3-12 years of experience and a degree or equivalent experience in Computer Science, Information Technology, or a related field
- Advanced certifications in SRE or related areas are a plus
- Deep understanding of observability tools and methodologies, including experience with logging, monitoring, tracing, and performance analysis platforms
- Strong leadership and people management skills, with the ability to inspire and empower successful SRE teams

Preferred Skills:
- Hands-on coding and system design of highly available distributed systems: Java/Go/JavaScript, Kubernetes, Docker
- Knowledge of the modern observability stack: Splunk, Elasticsearch, Prometheus, Grafana
- Knowledge of cloud-based SRE practices and experience with public cloud platforms such as AWS, Azure, or Google Cloud
- Familiarity with containerization technologies (e.g., Kubernetes, Docker) and microservices architecture
- Demonstrated expertise in driving culture change, DevOps practices, and continuous improvement in SRE and production support functions

Join our innovative team and be at the forefront of advancing Site Reliability Engineering and production support in the Global Risk and Compliance Technology space. If you are passionate about driving reliability, observability, and excellence in customer experiences, we invite you to apply and join our mission to redefine the future of risk and compliance technology. Apply now and join us in shaping the reliability and performance of GRC solutions for a secure and compliant world.

We back our colleagues and their loved ones with benefits and programs that support their holistic well-being. That means we prioritize their physical, financial, and mental health through each stage of life. Benefits include:
- Competitive base salaries
- Bonus incentives
- Support for financial well-being and retirement
- Comprehensive medical, dental, vision, life insurance, and disability benefits (depending on location)
- Flexible working model with hybrid, onsite, or virtual arrangements depending on role and business need
- Generous paid parental leave policies (depending on your location)
- Free access to global on-site wellness centers staffed with nurses and doctors (depending on location)
- Free and confidential counseling support through our Healthy Minds program
- Career development and training opportunities
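The MTTR-reduction goal in the responsibilities above is measured as the mean of detection-to-resolution times across incidents. A minimal sketch with hypothetical incident timestamps:

```python
# MTTR (mean time to resolution) from incident records.
# The incident timestamps below are hypothetical examples.
from datetime import datetime, timedelta

def mttr(incidents):
    """Mean time to resolution across (detected, resolved) pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)

incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),  # 30 min
    (datetime(2024, 1, 2, 9, 0),  datetime(2024, 1, 2, 10, 30)),  # 90 min
]
print(mttr(incidents))  # → 1:00:00 (one hour on average)
```

In practice the timestamps come from the incident-management system, and the same aggregation is tracked per severity tier against SLA targets.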
Posted 1 month ago
5.0 - 8.0 years
8 - 12 Lacs
Bengaluru
Work from Office
Expertise in container technologies like Docker and Kubernetes. Strong knowledge of architecture, design, and implementation of microservices in container environments like Kubernetes. Experience managing Azure Kubernetes clusters. Experience building and maintaining the container ecosystem. Proficient in Ansible/Terraform. Experience with Azure DevOps, GitHub Actions, or Jenkins. Experience with system monitoring tools such as Lens, Prometheus, Azure monitoring, etc. Experience in Helm. Experience building CI/CD pipelines for web application backend services. Exceptional skills in debugging, performance tuning, optimization, and troubleshooting of large software systems, and the ability to guide development teams and deliver in fast-paced environments. Proficient understanding of code versioning tools like GitHub/TFS. Familiarity with DevSecOps practices and tools. BE/B.Tech/MCA or any relevant degree; CKA, CKAD, or DevOps certification desirable.

Key Responsibilities
- Design and implement: architect, design, and implement cloud-native applications and services using AKS
- Containerization: develop and deploy containerized applications utilizing Docker and Kubernetes; migrate applications from IaaS to AKS
- Automation: automate the deployment, scaling, and management of containerized applications
- CI/CD pipeline: build and maintain robust CI/CD pipelines to ensure smooth and efficient delivery of software
- Monitoring and optimization: monitor application performance, troubleshoot issues, and optimize resource usage
- Collaboration: work closely with development, operations, and security teams to ensure seamless integration and alignment with business goals
- Documentation: create and maintain comprehensive documentation for architecture, processes, and procedures
Posted 1 month ago
4.0 - 8.0 years
12 - 30 Lacs
Hyderabad
Work from Office
Strong Linux and AWS experience. Strong Active Directory experience. Manage Hadoop clusters on Linux with Active Directory integration. Collaborate with the data science team on project delivery using Splunk and Spark. Experience managing Big Data clusters in production.
Posted 1 month ago
10.0 - 13.0 years
20 - 25 Lacs
Pune
Work from Office
Company Overview
With 80,000 customers across 150 countries, UKG is the largest U.S.-based private software company in the world. And we're only getting started. Ready to bring your bold ideas and collaborative mindset to an organization that still has so much more to build and achieve? Read on.

At UKG, you get more than just a job. You get to work with purpose. Our team of U Krewers are on a mission to inspire every organization to become a great place to work through our award-winning HR technology built for all. Here, we know that you're more than your work. That's why our benefits help you thrive personally and professionally, from wellness programs and tuition reimbursement to U Choose, a customizable expense reimbursement program that can be used for more than 200 needs that best suit you and your family, from student loan repayment, to childcare, to pet insurance. Our inclusive culture, active and engaged employee resource groups, and caring leaders value every voice and support you in doing the best work of your career. If you're passionate about our purpose, people, then we can't wait to support whatever gives you purpose. We're united by purpose, inspired by you.

Site Reliability Engineers at UKG are team members with a breadth of knowledge encompassing all aspects of service delivery. They develop software solutions to enhance, harden, and support our service delivery processes. This can include building and managing CI/CD deployment pipelines, automated testing, capacity planning, performance analysis, monitoring, alerting, chaos engineering, and auto-remediation. Site Reliability Engineers must have a passion for learning and evolving with current technology trends. They strive to innovate and are relentless in their pursuit of a flawless customer experience. They have an automate-everything mindset, helping us bring value to our customers by deploying services with incredible speed, consistency, and availability.

Primary/Essential Duties and Key Responsibilities:
- Proficient in Splunk/ELK and Datadog
- Experience with observability tools such as Prometheus/InfluxDB and Grafana
- Strong knowledge of at least one scripting language such as Python, Bash, PowerShell, or other relevant languages
- Design, develop, and maintain observability tools and infrastructure
- Collaborate with other teams to ensure observability best practices are followed
- Develop and maintain dashboards and alerts for monitoring system health
- Troubleshoot and resolve issues related to observability tools and infrastructure
- Engage in and improve the lifecycle of services from conception to EOL, including system design consulting and capacity planning
- Define and implement standards and best practices related to system architecture, service delivery, metrics, and the automation of operational tasks
- Support services, product, and engineering teams by providing common tooling and frameworks to deliver increased availability and improved incident response
- Improve system performance, application delivery, and efficiency through automation, process refinement, postmortem reviews, and in-depth configuration analysis
- Collaborate closely with engineering professionals within the organization to deliver reliable services
- Identify and eliminate operational toil by treating operational challenges as software engineering problems
- Actively participate in incident response, including on-call responsibilities
- Partner with stakeholders to influence and help drive the best possible technical and business outcomes
- Guide junior team members and serve as a champion for Site Reliability Engineering
- Engineering degree, or a related technical discipline, and 10+ years of experience in SRE
- Experience coding in higher-level languages (e.g., Python, JavaScript, C++, or Java)
- Knowledge of cloud-based applications and containerization technologies
- Demonstrated understanding of best practices in metric generation and collection, log aggregation pipelines, time-series databases, and distributed tracing
- Ability to analyze current technology and engineering practices within the company and develop steps and processes to improve and expand upon them
- Working experience with industry standards like Terraform and Ansible

(Experience, Education, Certification, License and Training)
- Must have hands-on experience working within Engineering or Cloud
- Experience with public cloud platforms (e.g., GCP, AWS, Azure)
- Experience in configuration and maintenance of applications and systems infrastructure
- Experience with distributed system design and architecture
- Experience building and managing CI/CD pipelines

Where we're going
UKG is on the cusp of something truly special. Worldwide, we already hold the #1 market share position for workforce management and the #2 position for human capital management. Tens of millions of frontline workers start and end their days with our software, with billions of shifts managed annually through UKG solutions today. Yet it's our AI-powered product portfolio, designed to support customers of all sizes, industries, and geographies, that will propel us into an even brighter tomorrow!

UKG is proud to be an equal opportunity employer and is committed to promoting diversity and inclusion in the workplace, including the recruitment process.

Disability Accommodation
For individuals with disabilities that need additional assistance at any point in the application and interview process, please email UKGCareers@ukg.com
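The metric-generation and time-series practices listed above often reduce to percentile aggregation over raw samples, e.g. reporting a p95 latency on a dashboard. A minimal sketch using only the standard library (the sample latencies are hypothetical):

```python
# Sketch of percentile aggregation: computing a p95 latency from raw
# samples, the kind of figure a time-series database or dashboard reports.
import statistics

def p95(samples):
    """95th-percentile latency via statistics.quantiles (exclusive method)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
    return statistics.quantiles(samples, n=20)[18]

# Hypothetical latency samples in milliseconds, with a couple of outliers.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 15, 180,
                13, 14, 12, 15, 13, 16, 14, 12, 15, 13]
print(p95(latencies_ms))
```

Percentiles are preferred over means for latency because a handful of slow outliers (like the 180 ms and 240 ms samples above) dominate user-perceived tail behavior without moving the average much.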
Posted 1 month ago
3.0 - 7.0 years
15 - 20 Lacs
Noida, Pune
Work from Office
The duties of a Site Reliability Engineer will be to support and maintain various Cloud Infrastructure Technology Tools in our hosted production/DR environments. He/she will be the subject matter expert for specific tool(s) or monitoring solution(s). Will be responsible for testing, verifying and implementing upgrades, patches and implementations. He/She will also partner with the other service and/or service functions to investigate and/or improve monitoring solutions. May mentor one or more tools team members or provide training to other cross functional teams as required. May motivate, develop, and manage performance of individuals and teams while on shift. May be assigned to produces regular and adhoc management reports in a timely manner. Proficient in Splunk/ELK, and Datadog. Experience with observability tools such as Prometheus/InfluxDB, and Grafana. Possesses strong knowledge of at least one scripting language such as Python, Bash, Powershell or any other relevant languages. Design, develop, and maintain observability tools and infrastructure. Collaborate with other teams to ensure observability best practices are followed. Develop and maintain dashboards and alerts for monitoring system health. Troubleshoot and resolve issues related to observability tools and infrastructure. Bachelors Degree in information systems or Computer Science or related discipline with relevant experience of 5-8 years Proficient in Splunk/ELK, and Datadog. 
Experience with Enterprise Software Implementations for Large Scale Organizations Extensive knowledge of new technology trends prevalent in the market, such as SaaS, Cloud, Hosting Services and Application Management Services Monitoring tools like Grafana, Prometheus, and Datadog Experience in deployment of application & infrastructure clusters within a Public Cloud environment utilizing a Cloud Management Platform Professional and positive with outstanding customer-facing practices Can-do attitude, willing to go the extra mile Consistently follows up and follows through on delegated tasks and actions
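The alerting work described in this posting often reduces to a small piece of evaluation logic. Below is a minimal sketch of threshold-based alerting with flap suppression; the 5% error-rate threshold, window shape, and function names are illustrative assumptions, not any specific employer's stack:

```python
def error_rate(errors, total):
    """Error rate for one evaluation window; 0.0 when there is no traffic."""
    return errors / total if total else 0.0

def should_alert(windows, threshold=0.05, consecutive=3):
    """Fire only when the last `consecutive` windows all breach the
    threshold, which suppresses one-off spikes (alert flapping)."""
    if len(windows) < consecutive:
        return False
    recent = windows[-consecutive:]
    return all(error_rate(e, t) > threshold for e, t in recent)

# Five 1-minute windows as (errors, total_requests) pairs.
history = [(1, 200), (30, 200), (25, 200), (18, 200), (40, 200)]
print(should_alert(history))  # last three windows all exceed 5% -> True
```

Requiring several consecutive breaching windows is a common design choice: it trades a few minutes of detection latency for far fewer pages on transient spikes.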
Posted 1 month ago
12.0 - 18.0 years
16 - 20 Lacs
Pune
Work from Office
Seasoned DevOps Architect to lead the design, implementation, and maintenance of cloud-based infrastructure and the DevOps team. Will collaborate closely with development, operations, and security teams to ensure the seamless delivery of high-quality software solutions. Qualifications: 18+ years of IT experience, with 8+ years dedicated to DevOps roles Deep knowledge of cloud platforms (AWS, Azure, GCP) Expertise in infrastructure automation tools (Terraform, Ansible, Puppet, Chef) Proficiency in containerization and orchestration (Docker, Kubernetes) Experience with CI/CD pipelines and tools (Jenkins, GitLab CI/CD, Azure DevOps) Strong knowledge of monitoring and logging tools (Prometheus, Grafana, ELK stack) Advanced scripting abilities (Bash, Python, Ruby) Solid understanding of security best practices and related tools Ability to work effectively both independently and within a team
Posted 1 month ago
2.0 - 7.0 years
3 - 7 Lacs
Ahmedabad
Work from Office
To help us build functional systems that improve customer experience, we are now looking for an experienced DevOps Engineer. They will be responsible for deploying product updates, identifying production issues, and implementing integrations that meet our customers' needs. If you have a solid background in software engineering and are familiar with Ruby or Python, we'd love to speak with you. Responsibilities: Work with development teams to ideate software solutions Building and setting up new development tools and infrastructure Working on ways to automate and improve development and release processes Ensuring that systems are safe and secure against cybersecurity threats Deploy updates and fixes Perform root cause analysis for production errors Develop scripts to automate infrastructure provisioning Working with software developers and software engineers to ensure that development follows established processes and works as intended Technologies we use: GitOps: GitHub, GitLab, Bitbucket CI/CD: Jenkins, CircleCI, Travis CI, TeamCity, Azure DevOps Containerization: Docker, Swarm, Kubernetes Provisioning: Terraform CloudOps: Azure, AWS, GCP Observability: Prometheus, Grafana, Graylog, ELK Qualifications: Graduate / Postgraduate in the technology sector Proven experience as a DevOps Engineer or similar role Effective communication and teamwork skills
Posted 1 month ago
3.0 - 8.0 years
1 - 4 Lacs
Chandigarh
Work from Office
Opportunity: We are seeking a highly skilled and experienced AI Infrastructure Engineer (or MLOps Engineer) to design, build, and maintain the robust and scalable AI/ML platforms that power our cutting-edge asset allocation strategies. In this critical role, you will be instrumental in enabling our AI Researchers and Quantitative Developers to efficiently develop, deploy, and monitor machine learning models in a high-performance, secure, and regulated financial environment. You will bridge the gap between research and production, ensuring our AI initiatives run smoothly and effectively. Responsibilities: Platform Design & Development: Architect, implement, and maintain the end-to-end AI/ML infrastructure, including data pipelines, feature stores, model training environments, inference serving platforms, and monitoring systems. Environment Setup & Management: Configure and optimize AI/ML development and production environments, ensuring access to necessary compute resources (CPUs, GPUs), software libraries, and data. MLOps Best Practices: Implement and advocate for MLOps best practices, including version control for models and data, automated testing, continuous integration/continuous deployment (CI/CD) pipelines for ML models, and robust model monitoring. Resource Optimization: Manage and optimize cloud computing resources (AWS, Azure, GCP, or on-premise) for cost-efficiency and performance, specifically for AI/ML workloads. Data Management: Collaborate with data engineers to ensure seamless ingestion, storage, and accessibility of high-quality financial and alternative datasets for AI/ML research and production. Tooling & Automation: Select, implement, and integrate various MLOps tools and platforms (e.g., Kubeflow, MLflow, Sagemaker, DataRobot, Vertex AI, Airflow, Jenkins, GitLab CI/CD) to streamline the ML lifecycle. 
Security & Compliance: Ensure that all AI/ML infrastructure and processes adhere to strict financial industry security standards, regulatory compliance, and data governance policies. Troubleshooting & Support: Provide expert support and troubleshooting for AI/ML infrastructure issues, resolving bottlenecks and ensuring system stability. Collaboration: Work closely with AI Researchers, Data Scientists, Software Engineers, and DevOps teams to translate research prototypes into scalable production systems. Documentation: Create and maintain comprehensive documentation for all AI/ML infrastructure components, processes, and best practices. Qualifications: Bachelor's or Master's degree in Computer Science, Software Engineering, Data Science, or a related quantitative field. Experience: 3+ years of experience in a dedicated MLOps, AI Infrastructure, DevOps, or Site Reliability Engineering role, preferably in the financial services industry. Proven experience in designing, building, and maintaining scalable data and AI/ML pipelines and platforms. Strong proficiency in cloud platforms (AWS, Azure, GCP) including services relevant to AI/ML (e.g., EC2, S3, Sagemaker, Lambda, Azure ML, Google AI Platform). Expertise in containerization technologies (Docker) and orchestration platforms (Kubernetes). Solid understanding of CI/CD principles and tools (Jenkins, GitLab CI/CD, CircleCI, Azure DevOps). Proficiency in scripting languages like Python (preferred), Bash, or similar. Experience with Infrastructure as Code (IaC) tools (e.g., Terraform, CloudFormation, Ansible). Familiarity with distributed computing frameworks (e.g., Spark, Dask) is a plus. Understanding of machine learning concepts and lifecycle, even if not directly developing models. Technical Skills: Deep knowledge of Linux/Unix operating systems. Strong understanding of networking, security, and database concepts. Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack). 
Familiarity with data warehousing and data lake concepts. Preferred candidate profile Exceptional problem-solving and debugging skills. Proactive and self-driven with a strong sense of ownership. Excellent communication and interpersonal skills, able to collaborate effectively with diverse teams. Ability to prioritize and manage multiple tasks in a fast-paced environment. A keen interest in applying technology to solve complex financial problems.
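Model monitoring, one of the MLOps responsibilities above, frequently starts with a cheap statistical drift check on live feature distributions. A hedged sketch using only the standard library; the 3-sigma cutoff and the feature values are illustrative assumptions, not a prescribed method:

```python
from statistics import mean, pstdev

def mean_shift_zscore(train, live):
    """Z-score of the live-traffic feature mean against the training
    distribution; a large absolute value is a cheap first drift signal."""
    mu, sigma = mean(train), pstdev(train)
    if sigma == 0:
        return 0.0 if mean(live) == mu else float("inf")
    return (mean(live) - mu) / sigma

train = [10, 11, 9, 10, 12, 10, 9, 11]    # hypothetical training feature
drifted = [15, 16, 14, 15]                # hypothetical live feature window
print(abs(mean_shift_zscore(train, drifted)) > 3)  # clearly shifted -> True
```

Production platforms would typically use richer tests (PSI, KS) over many features, but the shape is the same: compare a live window against a training baseline and alert past a threshold.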
Posted 1 month ago
5.0 - 10.0 years
7 - 11 Lacs
Mumbai
Work from Office
We are looking for an experienced Senior Java Developer with a strong background in observability and telemetry to join our talented team. In this role, you will be responsible for designing, implementing, and maintaining robust and scalable solutions that enable us to gain deep insights into the performance, reliability, and health of our systems and applications. WHAT'S IN IT FOR YOU : - You will get a pivotal role in the project and associated incentives based on your contribution towards the project's success. - Working on optimizing performance of a platform handling data volume in the range of 5-8 petabytes. - An opportunity to collaborate and work with engineers from Google, AWS, and Elastic (ELK). - You will be enabled to take up a leadership role in future and set up your own team as you grow with the customer during the project engagement. - Opportunity for advancement within the company, with clear paths for career progression based on performance and demonstrated capabilities. - Be part of a company that values innovation and encourages experimentation, where your ideas are heard and your contributions are recognized and rewarded. - Work in a zero micro-management culture where you get to enjoy accountability and ownership of your tasks. RESPONSIBILITIES : - Design, develop, and maintain Java-based microservices and applications with a focus on observability and telemetry. - Implement best practices for instrumenting, collecting, analyzing, and visualizing telemetry data (metrics, logs, traces) to monitor and troubleshoot system behavior and performance. - Collaborate with cross-functional teams to integrate observability solutions into the software development lifecycle, including CI/CD pipelines and automated testing frameworks. - Drive improvements in system reliability, scalability, and performance through data-driven insights and continuous feedback loops. 
- Stay up-to-date with emerging technologies and industry trends in observability, telemetry, and distributed systems to ensure our systems remain at the forefront of innovation. - Mentor junior developers and provide technical guidance and expertise in observability and telemetry practices. REQUIREMENTS : - Bachelor's or Master's degree in Computer Science, Engineering, or a related field. - 5+ years of professional experience in software development with a strong focus on Java programming. - Expertise in observability and telemetry tools and practices, including but not limited to Prometheus, Grafana, Jaeger, the ELK stack (Elasticsearch, Logstash, Kibana), and distributed tracing. - Solid understanding of microservices architecture, containerization (Docker, Kubernetes), and cloud-native technologies (AWS, Azure, GCP). - Proficiency in designing and implementing scalable, high-performance, and fault-tolerant systems. - Strong analytical and problem-solving skills with a passion for troubleshooting complex issues. - Excellent communication and collaboration skills with the ability to work effectively in a fast-paced, agile environment. - Experience with Agile methodologies and DevOps practices is a plus.
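The metrics instrumentation this role centers on usually means bucketed histograms. A simplified Python sketch of the cumulative-bucket model used by Prometheus-style client libraries; the bucket bounds are arbitrary here, and real Java instrumentation would use an actual client library rather than hand-rolled code:

```python
import bisect

class Histogram:
    """Cumulative-bucket latency histogram in the style of Prometheus
    client libraries: each bucket counts observations <= its upper bound."""
    def __init__(self, buckets=(0.005, 0.01, 0.05, 0.1, 0.5, 1.0)):
        self.bounds = sorted(buckets)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.total = 0.0
        self.samples = 0

    def observe(self, value):
        # bisect_left keeps "less than or equal to bound" semantics.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.samples += 1

    def cumulative(self):
        """Running totals per bound, as exported in the le= label."""
        out, running = {}, 0
        for bound, c in zip(self.bounds + [float("inf")], self.counts):
            running += c
            out[bound] = running
        return out

h = Histogram()
for latency_s in (0.004, 0.02, 0.02, 0.3, 2.0):
    h.observe(latency_s)
print(h.cumulative()[0.05], h.samples)  # 3 observations <= 50ms, 5 total
```

Cumulative buckets are what make server-side quantile estimation (e.g., histogram_quantile in PromQL) possible without shipping raw samples.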
Posted 1 month ago
5.0 - 10.0 years
7 - 11 Lacs
Ahmedabad
Work from Office
We are looking for an experienced Senior Java Developer with a strong background in observability and telemetry to join our talented team. In this role, you will be responsible for designing, implementing, and maintaining robust and scalable solutions that enable us to gain deep insights into the performance, reliability, and health of our systems and applications. WHAT'S IN IT FOR YOU : - You will get a pivotal role in the project and associated incentives based on your contribution towards the project's success. - Working on optimizing performance of a platform handling data volume in the range of 5-8 petabytes. - An opportunity to collaborate and work with engineers from Google, AWS, and Elastic (ELK). - You will be enabled to take up a leadership role in future and set up your own team as you grow with the customer during the project engagement. - Opportunity for advancement within the company, with clear paths for career progression based on performance and demonstrated capabilities. - Be part of a company that values innovation and encourages experimentation, where your ideas are heard and your contributions are recognized and rewarded. - Work in a zero micro-management culture where you get to enjoy accountability and ownership of your tasks. RESPONSIBILITIES : - Design, develop, and maintain Java-based microservices and applications with a focus on observability and telemetry. - Implement best practices for instrumenting, collecting, analyzing, and visualizing telemetry data (metrics, logs, traces) to monitor and troubleshoot system behavior and performance. - Collaborate with cross-functional teams to integrate observability solutions into the software development lifecycle, including CI/CD pipelines and automated testing frameworks. - Drive improvements in system reliability, scalability, and performance through data-driven insights and continuous feedback loops. 
- Stay up-to-date with emerging technologies and industry trends in observability, telemetry, and distributed systems to ensure our systems remain at the forefront of innovation. - Mentor junior developers and provide technical guidance and expertise in observability and telemetry practices. REQUIREMENTS : - Bachelor's or Master's degree in Computer Science, Engineering, or a related field. - 5+ years of professional experience in software development with a strong focus on Java programming. - Expertise in observability and telemetry tools and practices, including but not limited to Prometheus, Grafana, Jaeger, the ELK stack (Elasticsearch, Logstash, Kibana), and distributed tracing. - Solid understanding of microservices architecture, containerization (Docker, Kubernetes), and cloud-native technologies (AWS, Azure, GCP). - Proficiency in designing and implementing scalable, high-performance, and fault-tolerant systems. - Strong analytical and problem-solving skills with a passion for troubleshooting complex issues. - Excellent communication and collaboration skills with the ability to work effectively in a fast-paced, agile environment. - Experience with Agile methodologies and DevOps practices is a plus.
Posted 1 month ago
4.0 - 9.0 years
6 - 11 Lacs
Hyderabad
Work from Office
ABOUT AMGEN Amgen harnesses the best of biology and technology to fight the world's toughest diseases, and make people's lives easier, fuller and longer. We discover, develop, manufacture and deliver innovative medicines to help millions of patients. Amgen helped establish the biotechnology industry more than 45 years ago and remains on the cutting edge of innovation, using technology and human genetic data to push beyond what's known today. ABOUT THE ROLE Role Description: We are seeking a detail-oriented and highly skilled Data Engineering Test Automation Engineer with deep expertise in the life sciences R&D domain to ensure the quality, reliability, and performance of our data pipelines and platforms. The ideal candidate will have a strong background in data testing, ETL validation, and test automation frameworks. You will work closely with data engineers, analysts, and DevOps teams to build robust test suites for large-scale data solutions. This role combines deep technical execution with a solid foundation in QA best practices, including test planning, defect tracking, and test lifecycle management. You will be responsible for designing and executing manual and automated test strategies for complex real-time and batch data pipelines, contributing to the design of automation frameworks, and ensuring high-quality data delivery across our AWS- and Databricks-based analytics platforms. The role is highly technical and hands-on, with a strong focus on automation, data accuracy, completeness, and consistency, and on ensuring data governance practices are seamlessly integrated into development pipelines. Roles & Responsibilities: Design, develop, and maintain automated test scripts for data pipelines, ETL jobs, and data integrations. Validate data accuracy, completeness, transformations, and integrity across multiple systems. Collaborate with data engineers to define test cases and establish data quality metrics. 
Develop reusable test automation frameworks and CI/CD integrations (e.g., Jenkins, GitHub Actions). Perform performance and load testing for data systems. Maintain test data management and data mocking strategies. Identify and track data quality issues, ensuring timely resolution. Perform root cause analysis and drive corrective actions. Contribute to QA ceremonies (standups, planning, retrospectives) and drive continuous improvement in QA processes and culture. Must-Have Skills: Experience in QA roles, with strong exposure to data pipeline validation and ETL testing. Domain knowledge of life sciences R&D. Validate data accuracy, transformations, schema compliance, and completeness across systems using PySpark and SQL. Strong hands-on experience with Python, and optionally PySpark, for developing automated data validation scripts. Proven experience in validating ETL workflows, with a solid understanding of data transformation logic, schema comparison, and source-to-target mapping. Experience working with data integration and processing platforms like Databricks/Snowflake, AWS EMR, Redshift, etc. Experience in manual and automated testing of data pipeline executions for both batch and real-time data pipelines. Perform performance testing of large-scale complex data engineering pipelines. Ability to troubleshoot data issues independently and collaborate with engineering teams for root cause analysis. Strong understanding of QA methodologies, test planning, test case design, and defect lifecycle management. Hands-on experience with API testing using Postman, pytest, or custom automation scripts. Experience integrating automated tests into CI/CD pipelines using tools like Jenkins, GitHub Actions, or similar. Knowledge of cloud platforms such as AWS, Azure, GCP. Good-to-Have Skills: Certifications in Databricks, AWS, Azure, or data QA (e.g., ISTQB). Understanding of data privacy, compliance, and governance frameworks. 
Knowledge of UI automated testing frameworks like Selenium, JUnit, TestNG. Familiarity with monitoring/observability tools such as Datadog, Prometheus, or CloudWatch. Education and Professional Certifications: Master's degree and 3 to 7 years of Computer Science, IT or related field experience, or Bachelor's degree and 4 to 9 years of Computer Science, IT or related field experience. Soft Skills: Excellent analytical and troubleshooting skills. Strong verbal and written communication skills. Ability to work effectively with global, virtual teams. High degree of initiative and self-motivation. Ability to manage multiple priorities successfully. Team-oriented, with a focus on achieving team goals. Strong presentation and public speaking skills.
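Source-to-target validation, the core ETL-testing task in this posting, can be illustrated with plain Python before scaling it up in PySpark or SQL. A sketch with hypothetical row data; the report shape and key column are invented for illustration:

```python
def validate_load(source_rows, target_rows, key="id"):
    """Compare a source extract against the loaded target: row counts,
    keys missing in the target, and per-key field mismatches."""
    src = {r[key]: r for r in source_rows}
    tgt = {r[key]: r for r in target_rows}
    report = {
        "count_match": len(src) == len(tgt),
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "mismatched": sorted(
            k for k in src.keys() & tgt.keys() if src[k] != tgt[k]
        ),
    }
    report["passed"] = (
        report["count_match"]
        and not report["missing_in_target"]
        and not report["mismatched"]
    )
    return report

source = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
target = [{"id": 1, "amt": 10}, {"id": 2, "amt": 99}]
print(validate_load(source, target))
# counts match but id 2 differs, so "passed" is False
```

On real volumes the same comparisons would be expressed as PySpark joins or SQL EXCEPT queries, but the test cases (counts, missing keys, field drift) stay identical.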
Posted 1 month ago
3.0 - 6.0 years
4 - 8 Lacs
Bengaluru
Work from Office
We are looking for a Kibana Subject Matter Expert (SME) to support our Network Operations Center (NOC) by designing, developing, and maintaining real-time dashboards and alerting mechanisms. The ideal candidate will have strong experience in working with Elasticsearch and Kibana to visualize key performance indicators (KPIs), system health, and alerts related to NOC-managed infrastructure. Key Responsibilities: Design and develop dynamic and interactive Kibana dashboards tailored for NOC monitoring. Integrate various NOC elements such as network devices, servers, applications, and services into Elasticsearch/Kibana. Create real-time visualizations and trend reports for system health, uptime, traffic, errors, and performance metrics. Configure alerts and anomaly detection mechanisms for critical infrastructure issues using Kibana or related tools (e.g., ElastAlert, Watcher). Collaborate with NOC engineers, infrastructure teams, and DevOps to understand monitoring requirements and deliver customized dashboards. Optimize Elasticsearch queries and index mappings for performance and data integrity. Provide expert guidance on best practices for log ingestion, parsing, and data retention strategies. Support troubleshooting and incident response efforts by providing actionable insights through Kibana visualizations. Primary Skills Proven experience as a Kibana SME or similar role with a focus on dashboards and alerting. Strong hands-on experience with Elasticsearch and Kibana (7.x or higher). Experience in working with log ingestion tools (e.g., Logstash, Beats, Fluentd). Solid understanding of NOC operations and common infrastructure elements (routers, switches, firewalls, servers, etc.). Proficiency in JSON, Elasticsearch Query DSL, and Kibana scripting for advanced visualizations. Familiarity with alerting frameworks such as ElastAlert, Kibana Alerting, or Watcher. Good understanding of Linux-based systems and networking fundamentals. 
Strong problem-solving skills and attention to detail. Excellent communication and collaboration skills. Preferred Qualifications: Experience in working within telecom, ISP, or large-scale IT operations environments. Exposure to Grafana, Prometheus, or other monitoring and visualization tools. Knowledge of scripting languages such as Python or Shell for automation. Familiarity with SIEM or security monitoring solutions.
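To make the Elasticsearch Query DSL requirement concrete, here is a sketch of the raw aggregation behind a typical Kibana error-trend panel. The ECS-style field names (`log.level`, `@timestamp`) are assumptions; a real cluster's mappings may differ:

```python
import json

def error_trend_query(interval="5m", level="ERROR"):
    """Build a date_histogram aggregation counting error-level log events
    per interval, the raw form of a Kibana trend visualization.
    Note: the index pattern (e.g. an NOC logs index) goes in the request
    path, not in this body."""
    return {
        "size": 0,  # aggregation only, no raw hits
        "query": {"bool": {"filter": [{"term": {"log.level": level}}]}},
        "aggs": {
            "errors_over_time": {
                "date_histogram": {
                    "field": "@timestamp",
                    "fixed_interval": interval,
                }
            }
        },
    }

print(json.dumps(error_trend_query(), indent=2))
```

Filter context (rather than scoring query context) is deliberate: filters are cacheable and skip relevance scoring, which matters on high-volume NOC indices.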
Posted 1 month ago
4.0 - 9.0 years
9 - 14 Lacs
Bengaluru
Work from Office
Primary Skills Strong hands-on experience with observability tools like AppDynamics, Dynatrace, Prometheus, Grafana, and ELK Stack Proficient in AppDynamics setup, including installation, configuration, monitor creation, and integration with ServiceNow, email, and Teams Ability to design and implement monitoring solutions for logs, traces, telemetry, and KPIs Skilled in creating dashboards and alerts for application and infrastructure monitoring Experience with AppDynamics features such as NPM, RUM, and synthetic monitoring Familiarity with AWS and Kubernetes, especially in the context of observability Scripting knowledge in Python or Bash for automation and tool integration Understanding of ITIL processes and APM support activities Good grasp of non-functional requirements like performance, capacity, and security Secondary Skills AppDynamics Performance Analyst or Implementation Professional certification Experience with other APM tools like New Relic, Datadog, or Splunk Exposure to CI/CD pipelines and integration of monitoring into DevOps workflows Familiarity with infrastructure-as-code tools like Terraform or Ansible Understanding of network protocols and troubleshooting techniques Experience in performance tuning and capacity planning Knowledge of compliance and audit requirements related to monitoring and logging Ability to work in Agile/Scrum environments and contribute to sprint planning from an observability perspective
Posted 1 month ago
5.0 - 10.0 years
7 - 12 Lacs
Bengaluru
Hybrid
Position Overview: We are seeking a Senior Software Engineer to help drive our build, release, and testing infrastructure to the next level. You will focus on scaling and optimizing our systems for large-scale, high-performance deployments, reducing build times from days to mere minutes while maintaining high-quality releases. As part of our collaborative, fast-paced engineering team, you will play a pivotal role in delivering tools and processes that support continuous delivery, test-driven development, and agile methodologies. Key Responsibilities: Automation & Tooling Development: Build, maintain, and improve our automated build, release, and testing infrastructure. Your focus will be on developing tools and scripts that automate our deployment pipeline, enabling a seamless and efficient continuous delivery process. Cross-functional Collaboration: Collaborate closely with development, QA, and SRE teams to ensure our build infrastructure meets the needs of all teams. Work with teams across the organization to create new tools, processes, and technologies that will streamline and enhance our delivery pipeline. Innovative Technology Integration: Stay on top of the latest advancements in cloud technology, automation, and infrastructure tools. You'll have the opportunity to experiment with and recommend new technologies, including AWS services, to enhance our CI/CD system. Scaling Infrastructure: Work on scaling our infrastructure to meet the demands of running thousands of automated tests for every commit. Help us reduce compute time from days to minutes, addressing scalability and performance challenges as we grow. Continuous Improvement & Feedback Loops: Be a champion for continuous improvement by collecting feedback from internal customers, monitoring the adoption of new tools, and fine-tuning processes to maximize efficiency, stability, and overall satisfaction. 
Process & Project Ownership: Lead the rollout of new tools and processes, from initial development through to full implementation. You'll be responsible for ensuring smooth adoption and delivering value to internal teams. Required Qualifications: 5+ years of experience in software development with strong proficiency in at least one of the following languages: Python, Go, Java, or JavaScript. Deep understanding of application development, microservices architecture, and the elements that drive a successful multi-service ecosystem. Familiarity with building and deploying scalable services is essential. Strong automation skills: Experience scripting and building tools for automation in the context of continuous integration and deployment pipelines. Cloud infrastructure expertise: Hands-on experience with AWS services (e.g., EC2, S3, Lambda, RDS) and Kubernetes or containerized environments. Familiarity with containerization: Strong understanding of Docker and container orchestration, with a particular focus on cloud-native technologies. Problem-solving mindset: Ability to identify, troubleshoot, and resolve technical challenges, particularly in large-scale systems. Agile experience: Familiarity with Agile methodologies, and the ability to collaborate effectively within cross-functional teams to deliver on time and with high quality. Collaboration skills: Ability to communicate complex technical concepts to both technical and non-technical stakeholders. Strong team-oriented mindset with a focus on delivering value through collaboration. Bachelor's degree in Computer Science or a related field, or equivalent professional experience. Preferred Qualifications: Experience with Kubernetes (K8s): In-depth knowledge of Kubernetes architecture and operational experience in managing Kubernetes clusters at scale. CI/CD expertise: Solid experience working with CI/CD pipelines and tools (e.g., Terraform, Ansible, Spinnaker). 
Infrastructure-as-code experience: Familiarity with Terraform, CloudFormation, or similar tools for automating cloud infrastructure deployments. Container orchestration & scaling: Experience with Karpenter or other auto-scaling tools for Kubernetes. Monitoring & Logging: Familiarity with tools such as Prometheus, Grafana, and CloudWatch for tracking infrastructure performance and debugging production issues.
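One concrete lever for cutting build times from days to minutes, as this posting describes, is deterministic test sharding across parallel CI workers. A sketch; the shard count and file names are illustrative assumptions:

```python
import hashlib

def shard_for(test_path, num_shards):
    """Stable shard assignment: hashing the path keeps a test on the same
    shard across runs, so per-shard caches and timing data stay useful."""
    digest = hashlib.sha256(test_path.encode()).hexdigest()
    return int(digest, 16) % num_shards

def split(tests, num_shards):
    """Partition the test suite; each CI worker runs exactly one shard."""
    shards = [[] for _ in range(num_shards)]
    for t in sorted(tests):  # sort for reproducible ordering within shards
        shards[shard_for(t, num_shards)].append(t)
    return shards

tests = [f"tests/test_{i}.py" for i in range(10)]
shards = split(tests, 4)
assert sum(len(s) for s in shards) == len(tests)  # every test runs once
```

Hash-based sharding is the simplest scheme; mature pipelines often rebalance using recorded per-test durations instead, at the cost of needing a timing database.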
Posted 1 month ago
9.0 - 10.0 years
11 - 12 Lacs
Hyderabad
Work from Office
We are seeking a highly skilled DevOps Engineer to join our dynamic development team. In this role, you will be responsible for designing, developing, and maintaining both frontend and backend components of our applications using DevOps and associated technologies. You will collaborate with cross-functional teams to deliver robust, scalable, and high-performing software solutions that meet our business needs. The ideal candidate will have a strong background in DevOps, experience with modern frontend frameworks, and a passion for full-stack development. Requirements: Bachelor's degree in Computer Science Engineering, or a related field. 9 to 10+ years of experience in full-stack development, with a strong focus on DevOps. DevOps with AWS Data Engineer - Roles & Responsibilities: Use AWS services like EC2, VPC, S3, IAM, RDS, and Route 53. Automate infrastructure using Infrastructure as Code (IaC) tools like Terraform or AWS CloudFormation. Build and maintain CI/CD pipelines using tools such as AWS CodePipeline, Jenkins, and GitLab CI/CD. Cross-functional collaboration. Automate build, test, and deployment processes for Java applications. Use Ansible, Chef, or AWS Systems Manager for managing configurations across environments. Containerize Java apps using Docker. Deploy and manage containers using Amazon ECS, EKS (Kubernetes), or Fargate. Monitoring & logging using Amazon CloudWatch, Prometheus + Grafana, the ELK Stack (Elasticsearch, Logstash, Kibana), and AWS X-Ray for distributed tracing. Manage access with IAM roles/policies. Use AWS Secrets Manager / Parameter Store for managing credentials. Enforce security best practices, encryption, and audits. Automate backups for databases and services using AWS Backup, RDS Snapshots, and S3 lifecycle rules. Implement Disaster Recovery (DR) strategies. Work closely with development teams to integrate DevOps practices. Document pipelines, architecture, and troubleshooting runbooks. Monitor and optimize AWS resource usage. 
Use AWS Cost Explorer, Budgets, and Savings Plans. Must-Have Skills: Experience working on Linux-based infrastructure. Excellent understanding of Ruby, Python, Perl, and Java. Configuring and managing databases such as MySQL and MongoDB. Excellent troubleshooting skills. Selecting and deploying appropriate CI/CD tools. Working knowledge of various tools, open-source technologies, and cloud services. Awareness of critical concepts in DevOps and Agile principles. Managing stakeholders and external interfaces. Setting up tools and required infrastructure. Defining and setting development, testing, release, update, and support processes for DevOps operation. Have the technical skills to review, verify, and validate the software code developed in the project. Interview Mode: F2F for candidates residing in Hyderabad / Zoom for other states. Location: 43/A, MLA Colony, Road no 12, Banjara Hills, 500034. Time: 2 - 4pm
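The backup-automation duties above (RDS snapshots, S3 lifecycle rules) boil down to retention logic like this sketch. The keep-the-newest rule and 7-day window are illustrative assumptions; a real implementation would list and delete snapshots through the AWS APIs (e.g., via boto3) rather than operate on dicts:

```python
from datetime import datetime, timedelta

def expired_snapshots(snapshots, retention_days=7, now=None):
    """Return snapshot ids older than the retention window, always keeping
    the newest one even if it has expired (never delete the last backup)."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=retention_days)
    ordered = sorted(snapshots, key=lambda s: s["created"], reverse=True)
    rest = ordered[1:]  # ordered[0] is the newest; it is always kept
    return [s["id"] for s in rest if s["created"] < cutoff]

now = datetime(2024, 1, 15)
snaps = [
    {"id": "snap-a", "created": datetime(2024, 1, 14)},
    {"id": "snap-b", "created": datetime(2024, 1, 5)},
    {"id": "snap-c", "created": datetime(2023, 12, 1)},
]
print(expired_snapshots(snaps, retention_days=7, now=now))
# -> ['snap-b', 'snap-c']
```

Passing `now` explicitly keeps the pruning logic unit-testable, the same reason the cloud calls themselves would be isolated behind a thin wrapper.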
Posted 1 month ago
6.0 - 8.0 years
13 - 18 Lacs
Gurugram
Work from Office
Responsibilities: - Define and enforce SLOs, SLIs, and error budgets across microservices - Architect an observability stack (metrics, logs, traces) and drive operational insights - Automate toil and manual ops with robust tooling and runbooks - Own incident response lifecycle: detection, triage, RCA, and postmortems - Collaborate with product teams to build fault-tolerant systems - Champion performance tuning, capacity planning, and scalability testing - Optimise costs while maintaining the reliability of cloud infrastructure Must-have Skills: - 6+ years in SRE/Infrastructure/Backend related roles using Cloud Native Technologies - 2+ years in SRE-specific capacity - Strong experience with monitoring/observability tools (Datadog, Prometheus, Grafana, ELK etc.) - Experience with infrastructure-as-code (Terraform/Ansible) - Proficiency in Kubernetes, service mesh (Istio/Linkerd), and container orchestration - Deep understanding of distributed systems, networking, and failure domains - Expertise in automation with Python, Bash, or Go - Proficient in incident management, SLAs/SLOs, and system tuning - Hands-on experience with GCP (preferred)/AWS/Azure and cloud cost optimisation - Participation in on-call rotations and running large-scale production systems Nice-to-have Skills: - Familiarity with chaos engineering practices and tools (Gremlin, Litmus) - Background in performance testing and load simulation (Gatling, Locust, k6, JMeter)
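The SLO and error-budget responsibility listed first is worth making concrete. A sketch of the standard arithmetic for an availability SLO; the function names are invented, but the math is the usual one (a 99.9% SLO over 30 days allows about 43.2 minutes of downtime):

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime for an availability SLO over a rolling window."""
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative = budget blown)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes per 30 days
print(round(budget_remaining(0.999, 10.8), 2)) # 0.75 after ~11 min of downtime
```

Teams typically gate risky changes on the remaining fraction: plenty of budget left means ship freely, a nearly spent budget means slow down and invest in reliability.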
Posted 1 month ago
6.0 - 10.0 years
15 - 25 Lacs
Gurugram, Bengaluru
Hybrid
What you will be doing
The Site Reliability Engineer (SRE) operates and maintains production systems in the cloud. The primary goal is to keep systems up and running and performing as expected. This involves daily operations tasks of monitoring, deployment, and incident management, as well as strategic tasks such as capacity planning, provisioning, and continuous improvement of processes. A major part of the role is designing for reliability, scalability, and efficiency, and automating everyday system operations tasks. SREs work closely with technical support teams, application developers, and DevOps engineers, both on incident resolution and on the long-term evolution of systems. Employees will primarily work on creating Terraform, Shell, and Ansible scripts and will take part in application deployments using Azure Kubernetes Service. Employees will work with a cybersecurity client/company.
Monitor production systems' health, usage, and performance using dashboards and monitoring tools.
Track provisioned resources, infrastructure, and their configuration.
Perform regular maintenance activities on databases, services, and infrastructure.
Respond to alerts and incidents: investigate, resolve, or dispatch according to SLAs.
Respond to emergencies: recover systems and restore services with minimal downtime.
Coordinate with customer success and engineering teams on incident resolution.
Perform postmortems after major incidents.
Change management: perform rollouts, rollbacks, patching, and configuration changes.
Drive demand forecasting and capacity planning with engineering and customer success teams, considering projected growth and demand spikes.
Provision production resources according to capacity demands.
Work with the engineering teams on the design and testing for reliability, scalability, performance, efficiency, and security.
Track resource utilization and cost-efficiency of production services.
What we are looking for
BSc/MSc/B.Tech degree in STEM, 6+ years of relevant industry experience.
Technical skills: Terraform, Docker Swarm/Kubernetes, Python, Unix/Linux shell scripting, DevOps, GitHub Actions, Azure Active Directory, Azure Monitor & Log Analytics. Experience integrating Grafana with Prometheus is an added advantage.
Strong verbal and written communication skills. Ability to perform on-call duties.
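The capacity-planning duties described above (projecting growth and allowing for demand spikes) can be sketched in a few lines of Python. The growth rate, horizon, and headroom figures here are illustrative assumptions, not values from the posting:

```python
# Minimal capacity-planning sketch. All numbers are hypothetical examples:
# project future resource demand from a compound monthly growth rate, then
# add a fixed headroom fraction to absorb demand spikes.

def projected_capacity(current_usage, monthly_growth, months, spike_headroom=0.3):
    """Capacity to provision after `months` of compound growth,
    plus a headroom fraction reserved for demand spikes."""
    projected = current_usage * (1 + monthly_growth) ** months
    return projected * (1 + spike_headroom)

# e.g. 400 vCPUs today, 5% monthly growth, planning 6 months ahead
print(round(projected_capacity(400, 0.05, 6)))  # → 697
```

In practice the growth rate would come from historical utilization data in the monitoring stack (e.g. Azure Monitor or Prometheus) rather than a hard-coded constant.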
Posted 1 month ago
0.0 - 3.0 years
3 - 6 Lacs
Hyderabad
Work from Office
The ideal candidate will have a deep understanding of automation, configuration management, and infrastructure-as-code principles, with a strong focus on Ansible. You will work closely with developers, system administrators, and other collaborators to automate infrastructure-related processes, improve deployment pipelines, and ensure consistent configurations across multiple environments. The Infrastructure Automation Engineer will be responsible for developing innovative self-service solutions for our global workforce and further enhancing our self-service automation built using Ansible. As part of a scaled Agile product delivery team, the Developer works closely with product feature owners, project collaborators, operational support teams, and peer developers and testers to enhance self-service capabilities and solve business problems by identifying requirements and conducting feasibility analysis, proofs of concept, and design sessions. The Developer serves as a subject matter expert on the design, integration, and operability of solutions that support innovation initiatives with business partners and shared-services technology teams. Please note, this is an onsite role based in Hyderabad.
Key Responsibilities:
Automating repetitive IT tasks - Collaborate with multi-functional teams to gather requirements and build automation solutions for infrastructure provisioning, configuration management, and software deployment.
Configuration management - Design, implement, and maintain code, including Ansible playbooks, roles, and inventories, for automating system configurations and deployments and ensuring consistency.
Ensure the scalability, reliability, and security of automated solutions.
Troubleshoot and resolve issues related to automation scripts, infrastructure, and deployments.
Perform infrastructure automation assessments and implementations, providing solutions to increase efficiency, repeatability, and consistency.
DevOps - Facilitate continuous integration and deployment (CI/CD).
Orchestration - Coordinate multiple automated tasks across systems.
Develop and maintain clear, reusable, and version-controlled playbooks and scripts.
Manage and optimize cloud infrastructure using Ansible and Terraform automation (AWS, Azure, GCP, etc.).
Continuously improve automation workflows and practices to enhance speed, quality, and reliability.
Ensure that infrastructure automation adheres to best practices, security standards, and regulatory requirements.
Document and maintain processes, configurations, and changes in the automation infrastructure.
Participate in design reviews, client requirements sessions, and development teams to deliver features and capabilities supporting automation initiatives.
Collaborate with product owners, collaborators, testers, and other developers to understand, estimate, prioritize, and implement solutions.
Design, code, debug, document, deploy, and maintain solutions in a highly efficient and effective manner.
Participate in problem analysis, code review, and system design.
Remain current on new technology and apply innovation to improve functionality.
Collaborate closely with collaborators and team members to configure, improve, and maintain current applications.
Work directly with users to resolve support issues within product team responsibilities.
Monitor the health, performance, and usage of developed solutions.
What we expect of you
We are all different, yet we all use our unique contributions to serve patients.
Basic Qualifications:
Bachelor's degree and 0 to 3 years of computer science, IT, or related field experience OR Diploma and 4 to 7 years of computer science, IT, or related field experience.
Deep hands-on experience with Ansible, including playbooks, roles, and modules.
Proven experience as an Ansible engineer or in a similar automation role.
Scripting skills in Python, Bash, or other programming languages.
Proficiency in Terraform and CloudFormation for AWS infrastructure automation.
Experience with other configuration management tools (e.g., Puppet, Chef).
Experience with Linux administration, scripting (Python, Bash), and CI/CD tools (GitHub Actions, CodePipeline, etc.).
Familiarity with monitoring tools (e.g., Dynatrace, Prometheus, Nagios).
Experience working in an Agile (SAFe, Scrum, Kanban) environment.
Preferred Qualifications:
Red Hat Certified Specialist in Developing with Ansible Automation Platform
Red Hat Certified Specialist in Managing Automation with Ansible Automation Platform
Red Hat Certified System Administrator
AWS Certified Solutions Architect - Associate or Professional
AWS Certified DevOps Engineer - Professional
Terraform Associate certification
Good-to-Have Skills:
Experience with Kubernetes (EKS) and service mesh architectures.
Knowledge of AWS Lambda and event-driven architectures.
Familiarity with AWS CDK, Ansible, or Packer for cloud automation.
Exposure to multi-cloud environments (Azure, GCP).
Experience operating within a validated systems environment (FDA, European Agency for the Evaluation of Medicinal Products, Ministry of Health, etc.).
Soft Skills:
Strong analytical and problem-solving skills.
Effective communication and collaboration with multi-functional teams.
Ability to work in a fast-paced, cloud-first environment.
Shift Information: This position is an onsite role and may require working during later hours to align with business hours. Candidates must be willing and able to work outside of standard hours as required to meet business needs.
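The configuration-management duties this role describes (playbooks, roles, and inventories that keep systems consistent) can be sketched as a minimal Ansible playbook. The host group, package, and template names here are hypothetical, not taken from the posting:

```yaml
# Minimal sketch of the kind of playbook this role would maintain.
# "webservers", nginx, and nginx.conf.j2 are illustrative assumptions.
- name: Ensure web tier is configured consistently
  hosts: webservers
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Deploy the shared nginx config from a template
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: Reload nginx

  handlers:
    - name: Reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
```

The handler pattern shown here is what makes runs idempotent in practice: the service is only reloaded when the template task actually changes the file.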
Posted 1 month ago