35 Pagerduty Jobs

JobPe aggregates results for easy application access, but you actually apply on the job portal directly.

1.0 - 5.0 years

0 Lacs

pune, maharashtra

On-site

As a Site Reliability Engineer - Incident Management, you will be responsible for monitoring, maintaining, and managing the entire Qualys infrastructure and services installed at different data centers. In the event of any malfunction in products/services, you will be required to monitor, troubleshoot, repair, and restore the service/system promptly to ensure maximum service availability and performance. Your role will also involve providing support services for Engineering and other technical teams, collaborating for quicker issue resolution, performing end-to-end incident management, documentation, and task automation. Your main responsibilities will include monitoring the performance and capacity of computer systems, utilizing various tools to identify and address issues effectively. You will be expected to conduct basic troubleshooting of platform/product issues, utilize tools such as Splunk, Grafana, Kibana for performance checking, and manage PagerDuty. Additionally, you will assist in task automation wherever applicable, ensure timely resolution of incident tickets, and work on triaging and troubleshooting problems affecting products or services. It will be crucial for you to meticulously track and document all issues and resolutions in detail on the ticketing/documentation tools to enhance the knowledge base and maintain a record of system health. In cases where troubleshooting complex issues is not feasible, you should escalate the problem to management, IT resources, or 3rd party vendors for further assistance. Communication within the team and externally to stakeholders, keeping them informed of relevant information, known issues, and steps being taken, will be an integral part of your role. The Site Reliability Engineer - Incident Management team will operate 24*7*365 on a monthly shift rotation basis as per requirements. 
To excel in this role, you should possess one to two years of IT Operations (Infra/System admin/Linux) experience or relevant certification. Familiarity with monitoring and integration tools like Splunk, Prometheus, Grafana, Kibana, PagerDuty, Runscope, and incident management tools such as Jira/ServiceNow is beneficial. A good understanding of ITSM main functions and tools, along with strong interpersonal skills to interact with employees at all levels professionally, will be essential. Certifications in computer functionality, Linux, System Admin, VMware, IT Security, or ITSM/ITIL, and knowledge of DevOps/SRE basics, Python, and Cloud will be advantageous for this role.

Posted 2 days ago

3.0 - 7.0 years

0 Lacs

maharashtra

On-site

This role is eligible for our hybrid work model: Two days in-office. Rotational Shift - Two shifts starting at 6 am and 2 pm (IST) & 2 pm to 10 pm IST. Why this job is a big deal: Are you interested in learning cutting-edge technologies? Do you enjoy solving complex problems? The priceline.com Site Reliability Operations Team offers these and many more opportunities while working in a fast-paced and challenging environment. The team is responsible for ensuring that every area of Priceline.com's site is highly available, reliable, and performing optimally. In this role, you will get to manage and track issues: ticket creation, updates, escalations, and participation on incident bridge calls. You will adhere to established response SLOs/SLAs and maintain a working knowledge of all monitoring and support tools. Maintain a culture of continuous improvement by providing suggestions for process improvements, providing updates to documentation, providing transfer of knowledge to peers in your area of expertise, and assisting in the training of new hires. Frontline Tier I/II monitoring / escalation / incident response and impact mitigation. Execute Command & Control tasks on our infrastructure. Orchestrate and manage incident lifecycle between external 3rd party vendors, the Site Reliability Engineers (SRE), and internal development teams. Analyze and support the continuous improvement of our monitoring as well as command and control capabilities. Maintain a high level of communication and knowledge sharing: incident lifecycle tracking, runbooks, and operational documentation. Report the health and availability of the site and related services. Who you are: Bachelor's degree in Computer Science or related field or 3-4 years of relevant work experience. Experience with New Relic, PagerDuty, Splunk, Jira, Confluence. Working experience with Incident Management and Change Management. 
Prior experience in Operations or a fast-paced, high-stress environment with the requirement to resolve multiple interruption-driven priorities simultaneously. Solid understanding of Open Source environments and TCP/IP Networking. Self-motivated and can work both independently and within a team in our 24/7 Operations Center; available for off-hours shift coverage and be able to own technical issues in the role of Incident Commander. Illustrated history of living the values necessary to Priceline: Customer, Innovation, Team, Accountability, and Trust. The Right Results, the Right Way is not just a motto at Priceline; it's a way of life. Unquestionable integrity and ethics are essential. Who we are: WE ARE PRICELINE. Our success as one of the biggest players in online travel is all thanks to our incredible, dedicated team of talented employees. Priceliners are focused on being the best travel deal makers in the world, motivated by our passion to help everyone experience the moments that matter most in their lives. Whether it's a dream vacation, your cousin's graduation, or your best friend's wedding - we make travel affordable and accessible to our customers. Our culture is unique and inspiring (that's what our employees tell us). We're a grown-up, startup. We deliver the excitement of a new venture, without the struggles and chaos that can come with a business that hasn't stabilized. We're on the cutting edge of innovative technologies. We keep the customer at the center of all that we do. Our ability to meet their needs relies on the strength of a workforce as diverse as the customers we serve. We bring together employees from all walks of life, and we are proud to provide the kind of inclusive environment that stimulates innovation, creativity, and collaboration. Priceline is part of the Booking Holdings, Inc. (Nasdaq: BKNG) family of companies, a highly profitable global online travel company with a market capitalization of over $80 billion. 
Our sister companies include Booking.com, BookingGo, Agoda, Kayak, and OpenTable. If you want to be part of something truly special, check us out! Flexible work at Priceline: Priceline is following a hybrid working model, which includes two days onsite as determined by you and your manager (ideally selecting among Tuesday, Wednesday, or Thursday). On the remaining days, you can choose to be remote or in the office. Diversity and Inclusion are a Big Deal! To be the best travel dealmakers in the world, it's important we have a workforce that reflects the diverse customers and communities we serve. We are committed to cultivating a culture where all employees have the freedom to bring their individual perspectives, life experiences, and passion to work. Priceline is a proud equal opportunity employer. We embrace and celebrate the unique lenses through which our employees see the world. We'd love you to join us and add to our rich mix! Applying for this position: We're excited that you are interested in a career with us. For all current employees, please use the internal portal to find jobs and apply. External candidates are required to have an account before applying.,

Posted 3 days ago

5.0 - 12.0 years

0 Lacs

pune, maharashtra

On-site

As a Senior Service Reliability Engineer at Proofpoint, you will develop a deep understanding of the various services and applications that come together to deliver Proofpoint's next-generation security products. Your primary responsibility will be maintaining and extending the Elasticsearch and Splunk clusters used for critical near-real-time data analysis. This role involves continually evaluating the performance of these clusters, identifying and addressing developing problems, planning changes for high-load events, applying security fixes, testing and performing upgrades, as well as enhancing the monitoring and alert infrastructure. You will also play a key role in maintaining other components of the data pipeline, which may involve serverless or server-based systems for data ingestion into the Elasticsearch pipeline. Optimizing cost vs. performance will be a focus, including testing new hosts or configurations. Automation is a priority, utilizing tools like Puppet and various scripting mechanisms to achieve a build once/run everywhere system. Your work will span various types of infrastructure, including public cloud, Kubernetes clusters, and private data centers, providing exposure to diverse operational environments. Building effective partnerships across different teams within the organization, such as Product, Engineering, and Operations, is crucial. Participation in an on-call rotation and addressing escalated issues promptly are also part of the role. To excel in this position, you are expected to have a Bachelor's degree in computer science, information technology, engineering, or a related discipline. Your expertise should include proficient administration and management of Elasticsearch clusters, with secondary experience in managing Splunk clusters. Proficiency in provisioning and Configuration Management tools like Puppet, Ansible, and Rundeck is essential. 
Experience in building Automations and Infrastructure as Code using tools like Terraform, Packer, or CloudFormation templates is a plus. You should also be familiar with monitoring and logging tools such as Splunk, Prometheus, and PagerDuty, as well as scripting languages like Python, Bash, Go, Ruby, and Perl. Experience with CI/CD tools like Jenkins, Pipelines, and Artifactory will be beneficial. An inquisitive mind, effective troubleshooting skills, and the ability to navigate a complex system to extract meaningful data are essential qualities for success in this role. In addition to a competitive salary and benefits package, Proofpoint offers a culture focused on talent development, regular promotion cycles, company-sponsored education, and certifications. You will have the opportunity to work with cutting-edge technologies, participate in employee engagement initiatives, and benefit from annual health check-ups and insurance coverage. The company is committed to fostering diversity and inclusion in the workplace, offering hybrid work options, flexible hours, and inclusive facilities to support employees with diverse needs. Persistent Ltd. is an Equal Opportunity Employer that values diversity and prohibits discrimination and harassment. Join us to accelerate your growth professionally and personally, make a positive impact using the latest technologies, and collaborate in an innovative and inclusive environment to unlock global opportunities for learning and development. Let's unleash your full potential at Persistent.,

Posted 5 days ago

3.0 - 8.0 years

4 - 8 Lacs

Bengaluru

Work from Office

Job Description Document Job Role: Customer Success Engineer Function: Level 2 Escalation Support Engineer Location: Bangalore Shift: Rotational. Primarily US time zones (EST/PST support coverage) Job Summary: We are looking for a highly motivated and technically adept Customer Success Engineer (CSE) to serve as a key escalation point for Zeta Marketing Platform (ZMP). This role will interface directly with enterprise customers and internal teams to resolve complex technical issues, provide proactive guidance, and contribute to the continuous improvement of our customer experience. Key Responsibilities: Handle escalated customer tickets (L2), perform in-depth root cause analysis, and drive timely resolution. Communicate with customers primarily via e-mail, and also through Slack, MS Teams and phone as needed. Collaborate cross-functionally with Product, Engineering, QA, Design and DevOps teams to investigate and resolve platform-level issues. Apply a structured and data-driven approach to debugging issues in areas such as API integration, campaign workflows, user interface, and data syncing. Provide technical walkthroughs and consultative guidance to customers on platform capabilities and best practices. Document solutions thoroughly in ticketing systems and contribute to the knowledge base for internal and customer use. Identify trends and proactively suggest product or documentation improvements based on recurring customer pain points. Participate in post-incident reviews, RCA documentation, and follow-ups with impacted customers. Provide support during product upgrades or critical incidents, including weekends or holiday coverage on a rotational basis. Required Skills & Experience: 3+ years of experience in a technical support or product support role in a SaaS or MarTech environment. Demonstrated ownership of L2+ escalation issues with strong analytical thinking and troubleshooting depth. 
Strong written and verbal communication skills with the ability to simplify complex technical concepts. Hands-on experience with web technologies: APIs (REST), HTML, CSS, JavaScript, SQL, JSON, and browser dev tools. Comfortable using tools like Postman, Grafana, Jira, Confluence or similar systems. Prior experience supporting US-based customers and working US time zone hours (minimum 1 year). Customer-first mindset with excellent consultative and advocacy skills. Ability to manage multiple priorities and deliver under pressure in a fast-paced support environment. Experience in writing or reviewing runbooks, playbooks, and RCA documents. Preferred Qualifications: Exposure to marketing automation platforms, customer data platforms (CDPs), or personalization engines. Experience with SQL-based investigation and understanding of event/data pipelines. Familiarity with tools like Honeycomb, AWS, Snowflake or similar platforms is a plus. Experience in incident management or working with on-call rotations using PagerDuty. Experience in GenAI tools like OpenAI, MS Co-Pilot or Deepseek. Soft Skills: Self-starter who can work independently with minimal supervision. Strong collaboration skills and a positive attitude in cross-team environments. Detail-oriented with a passion for problem-solving and continuous learning.

Posted 1 week ago

4.0 - 8.0 years

13 - 18 Lacs

Bengaluru

Work from Office

Project description We've been engaged by a large Australian financial institution to provide resources to manage the production support activities along with their existing team in Sydney & India. Responsibilities Carry out enhancements to maintenance/housekeeping scripts as required and monitor the DB growth periodically. Handles cloud Environment preparation, refresh, rebuild, upkeep, maintenance, and upgrade activities. Ensure cloud cost optimisation. Troubleshooting of Murex environment-specific issues including Infrastructure related issues and update pipelines for a permanent fix. Handling EOD execution and troubleshooting of issues related to it. Participate in analysis, solutioning, and deployment of solution for production issues during EoD. Participate in the release activity and coordinate with QA/Release teams. Participate in AWS stack deployment, AWS AMI patching, and stack configuration to ensure optimal performance and cost-efficiency. Address requests like warehouse rebuild, maintenance, Perform Health/sanity checks, create XVA engine, environment restores & backup in AWS as per project need. Perform Weekend maintenance and perform health checks in the production environment during the weekend. Support working in shifts (max end time will be 12.30 AM IST) and available for weekend & on-call support. Have to work out of client location on a need basis. Flexible to work in a Hybrid model. 
Skills Must have 4 to 8 Years of experience in Murex Production Support Murex End of Day support Troubleshooting batch-related issues, including date moves and processing adjustments Murex Env Management & Troubleshooting Experienced in SQL Unix shell scripting, Monitoring tools, Web development Experienced in the Release and CI/CD process Linux/Unix server and Oracle RDS knowledge Working experience with automation/job scheduling tools such as Autosys, GitHub Actions Working experience with monitoring tools like Grafana, Splunk, Obstack, PagerDuty Good communication and organization skills working within a DevOps team supporting a wider IT delivery team Nice to have PL/SQL, Scripting languages (Python) Advanced troubleshooting experience with Shell scripting and Python Experience with CICD tools like Git, flows, Ansible, and AWS including CDK Exposure to AWS Cloud environment Willing to learn and obtain AWS certification

Posted 2 weeks ago

5.0 - 9.0 years

0 Lacs

haryana

On-site

Job Title: Production Support Lead. Location: Gurgaon, India. Reports to: Head of Prod Support. About FNZ - Who we are: FNZ Group is an established and rapidly growing company in the financial technology sector. We partner with the entire industry to make wealth management accessible to more people. Today, we partner with over 650 financial institutions and 8,000 wealth management firms, enabling over 20 million people across all wealth segments to invest in the things they care the most about, on their own terms. We have 20+ offices globally with 4500 employees (and growing!). To learn more about us and our journey, check out our careers site. Role Description - What would you accomplish as a Lead Production Support? As Production Support Lead, you will be the go-to person for our client. Your responsibilities extend to overseeing the intricate landscape of issue management, addressing concerns from both external and internal clients to meet key performance indicators (KPIs) and service level agreements (SLAs). A core aspect of your role involves managing the workflow, ensuring the seamless functioning of the application as deployed, emphasizing proactive and reactive measures to champion continuous service improvement. Your expertise comes to the forefront in Incident & Problem Management, where you lead the analysis, investigation, diagnosis, and problem-solving efforts to identify, troubleshoot, and resolve production issues. Additionally, your involvement in Release & Change Management is crucial, as you support the testing and release processes for production fixes. Facilitating the transition between project support and production support during Service Transition is a key responsibility, ensuring a smooth flow of operations. The Responsibilities Will Include: Analyse incidents, recommend solutions, and contribute to service improvement. Ensure that all requests, incidents and problems are dealt with according to set standards and procedures. 
Direct daily operations, allocate resources, and plan to meet service levels. Proactively address system and service problems, ensuring timely resolution actions. Facilitate development of documented problem solutions and corrective actions. Educate and train internal and external application users. Guide team members, monitor progress, and prioritize quality improvement. Initiate process improvements aligned with business objectives and audits. Drive enhancements aligning with procedural, regulatory, and security requirements. Draft and maintain meticulous documentation for application support procedures. Contribute to audits and reviews, collecting evidence for process evaluation. Undertake diverse projects and tasks to ensure smooth production operations. Experience Required What we are looking for: Degree preferable in either Commerce/IT or a related field; or equivalent. Expert SQL skills. Independent, self-directing and delivery focused working style. Superior analytical thinking and keen attention to detail. Good communication skills, confident in dealing with internal and external clients. Passionate about providing an excellent service experience for our clients. Demonstrable ability to provide leadership and direction in incident management, to effectively prioritize and execute tasks in a high-pressure environment. Builds relationships with senior internal and external stakeholders. Experience in support and incident management, ITIL preferably. For Technical skills, SQL, Application monitoring tools New Relic, Datadog, APM, Splunk, PagerDuty. Experience Preferred Beneficial but not essential. Interest / familiarity with financial markets and products. Some experience with Microsoft .NET development products, including C#, VB.NET and SQL Server, beneficial but not essential. Open to the variance of work hours, including the flexibility to start earlier or later than standard work hours. 
Opportunities - What We Offer: We are mission led - work at the heart of a purpose-led organization, where you can be proud of the impact you make, every day, and where you'll transform the way over 20 million people invest, making wealth management more accessible, sustainable and transparent to more people. Rapid career growth - encouraged to take on responsibility, play a part in the evolution of the company and rapidly drive your career development working on real projects that directly impact our clients and their customers. Market leading technology - Build, create and evolve innovative solutions for the world's most trusted brands using the latest technologies to help change the face of investing for the future. Learning & development - Placing emphasis on a willingness to learn, to think differently, to be creative and to help drive innovation. Inclusion - We want to ensure accessibility needs are well supported; if you require specific support, please advise us. About FNZ: FNZ is committed to opening up wealth so that everyone, everywhere can invest in their future on their terms. We know the foundation to do that already exists in the wealth management industry, but complexity holds firms back. We created wealth's growth platform to help. We provide a global, end-to-end wealth management platform that integrates modern technology with business and investment operations. All in a regulated financial institution. We partner with over 650 financial institutions and 12,000 wealth managers, with US$1.5 trillion in assets under administration (AUA). Together with our customers, we help over 20 million people from all wealth segments to invest in their future.

Posted 2 weeks ago

5.0 - 9.0 years

0 Lacs

noida, uttar pradesh

On-site

The client's product enables the utilization of customer data through cutting-edge technologies to: - Enhance understanding of customer behavior to a previously unattainable level. - Determine the exact impact of advertising and promotions. - Create real-time profiles of customer segments. - Uncover the relationship between team member performance and customer loyalty. You should have: - Over 5 years of commercial experience as a DevOps professional. - Practical experience in cloud infrastructure provisioning, deployment, and monitoring on Azure for at least 2 years. - Strong familiarity with best DevOps practices and methodologies. - Good understanding of Computer Science and Computing Theory, including network interactions, protocols, deployment patterns, security patterns, software architecture (e.g., microservices, event-driven design), orchestration, and containerization (Docker, Kubernetes). - Hands-on experience with Infrastructure as Code (IaC), especially with ARM templates/Terraform. - Knowledge of logging and monitoring technologies like Zabbix, NewRelic, PagerDuty, Prometheus, and ELK stack. - Experience with CI/CD processes using AzureDevOps, Docker, Kubernetes (AKS), and product services written in .NET. - Proficiency in different delivery methodologies such as SCRUM, Agile, and Kanban. - Upper-Intermediate English language skills. Desirable qualifications include certifications in Azure and Kubernetes, along with practical experience in data engineering, Big Data stack, high-load systems, and microservices in a production environment. As part of the DevOps team, your responsibilities will include: - Collaborating on the creation of Azure infrastructure and setting up K8s clusters (AKS). - Managing CI/CD pipelines and automation processes. - Overseeing release management and infrastructure maintenance. - Participating in decision-making regarding infrastructure design. - Creating and managing dashboards for environments/builds. 
- Ensuring security controls do not adversely affect production by working with architects and developers. - Communicating effectively with various stakeholders including PM, PO, software developers, architects, and QA. GlobalLogic offers a stimulating work environment with diverse projects in industries like High-Tech, communication, media, healthcare, retail, and telecom. You will have the opportunity to collaborate with a talented team and enjoy work-life balance, professional development programs, competitive benefits, and fun perks. About GlobalLogic: GlobalLogic is a digital engineering leader that helps brands worldwide design and develop innovative products and digital experiences. Headquartered in Silicon Valley, GlobalLogic operates globally, assisting clients across various industries to envision and realize digital transformations.

Posted 2 weeks ago

6.0 - 10.0 years

0 Lacs

karnataka

On-site

The Senior Developer / Technical Lead Java Full stack position based in Bangalore requires an experienced professional with 6-10 years of experience in software development and architecture. In this role, you will be responsible for providing solutions to technical issues that may impact product delivery. Your key responsibilities will include facilitating requirement analyses, conducting peer reviews, defining processes for technical platforms, and enhancing frameworks. The ideal candidate should possess hands-on experience in Java and ReactJS, with a minimum of 6 years of experience in Java backend/frontend technologies and building distributed enterprise software. Strong expertise in Core & Advanced Java, including threading, design patterns, and data structures, is essential. A good understanding of OOAD, design patterns, and software architecture is also required. Proficiency in Spring Boot, Microservices, Hibernate, MVC, RestAPI, collection, and frameworks is necessary for this role. Additionally, hands-on experience in working with/setting up CI and CD environments, writing SQL queries, and familiarity with collaboration tools like GitHub and DevOps/JIRA are important skills. The successful candidate should have good expertise with JavaScript frameworks like ReactJS, Graph API, and PagerDuty. Experience working in an agile development environment and tools is preferred. The ability to quickly learn and adapt to new business and technical concepts, along with excellent communication, organizational, and problem-solving skills, will be beneficial in this role.

Posted 2 weeks ago

3.0 - 5.0 years

4 - 8 Lacs

Hyderabad

Work from Office

3-5 years of experience in IT operations and maintenance. Hands-on experience with Grafana, Zabbix, Azure Monitor, and ELK Log Management. Experience with large-scale monitoring system setup and maintenance. Good exposure to commonly used ITSM tools, including PagerDuty and ServiceNow. Basic understanding of public cloud knowledge, including IaaS, PaaS, and SaaS. Proactive approach to identifying problems, performance bottlenecks, and areas for improvement. Primary Skills Configure and implement end-to-end monitoring solutions for applications and infrastructure. Configure and maintain log analytic tools for applications and infrastructure. Develop mock-up views and build workable dashboards following a defined methodology based on briefings from various stakeholders. Short Description Open to work in 24*7 Shift. Microsoft Azure Monitor PagerDuty ELK Log Management

Posted 3 weeks ago

5.0 - 10.0 years

3 - 7 Lacs

Mumbai

Work from Office

We are looking for a skilled Java Backend Developer with 5 to 12 years of experience to develop and maintain backend services using Java Spring and JavaScript. The ideal candidate will have hands-on experience as a backend developer, proficiency in Java Spring framework and JavaScript, and experience with at least one cloud provider. Roles and Responsibility Develop and maintain scalable and efficient backend systems using Java Spring and JavaScript. Design, implement, and optimize cloud-based solutions on AWS, GCP, or Azure. Work with SQL and NoSQL databases such as PostgreSQL, MySQL, and MongoDB for data persistence. Architect and develop Kubernetes-based microservices caching solutions and messaging systems like Kafka. Implement monitoring, logging, and alerting using tools like Grafana, CloudWatch, Kibana, and PagerDuty. Participate in on-call rotations, handle incident response, and contribute to operational playbooks. Job Hands-on experience as a backend developer with strong understanding of data structures, algorithms, and software design principles. Proficiency in Java Spring framework and JavaScript, with experience in developing scalable and efficient backend systems. Experience with at least one cloud provider, preferably AWS, GCP, or Azure, and knowledge of cloud-based solutions and containerization. Familiarity with microservice architectures, caching solutions, and event-driven architectures using Kafka. Strong communication skills with an emphasis on technical documentation and the ability to work in a globally distributed environment. Ability to contribute to high availability services and participate in on-call rotations.

Posted 3 weeks ago

8.0 - 13.0 years

15 - 25 Lacs

Hyderabad

Work from Office

Role Summary Akrivia HCM is seeking an experienced Site Reliability Engineer to safeguard the performance, scalability, and availability of our global HR tech platform. You will define service-level objectives, automate infrastructure, lead incident response, and partner with engineering squads to deliver reliable releases at high velocity. Key Responsibilities Define and track SLIs/SLOs for latency, availability, and error budgets. Build and maintain Terraform/Helm/ArgoCD stacks; convert manual toil into code. Instrument services with Prometheus, Grafana, Datadog, and OpenTelemetry; create actionable alerts & dashboards. Serve in the on-call rotation, lead rapid mitigation, run blameless post-mortems, and close action items. Model load growth, tune autoscaling policies, run load tests, and drive cost-optimisation reviews. Design chaos game-days and fault-injection experiments to validate fail-over and recovery paths. Review designs/PRs for reliability anti-patterns and coach development teams on SRE best practices. Must-Have Qualifications 5+ years operating large-scale, user-facing SaaS systems on AWS, GCP, or Azure (Kubernetes/EKS preferred). Proficiency with Infrastructure-as-Code (Terraform, Helm, Pulumi, or CloudFormation) and GitOps (ArgoCD/Flux). Hands-on experience building observability stacks (Prometheus, Grafana, Datadog, New Relic, etc.). Proven track record reducing MTTR and change-failure rate through automation and robust incident processes. Strong scripting or programming skills in Go, Python, or TypeScript. Deep debugging skills across Linux, networking, containers, databases, and web/API layers. Excellent written and verbal communication skills. Good-to-Have Skills Exposure to AWS Well-Architected reviews, FinOps, or cost-optimisation initiatives. Experience with service mesh (Istio/Linkerd), event-driven systems (Kafka/NATS), or serverless (Lambda). Familiarity with SOC 2 / ISO 27001 controls and secrets management (AWS KMS, Vault). 
Chaos engineering tools (ChaosMesh, Gremlin) and performance testing (k6, Gatling). Certifications such as AWS DevOps Pro, CKA/CKAD, or Google Cloud SRE.

Posted 3 weeks ago

Apply

6.0 - 10.0 years

12 - 16 Lacs

Pune

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities: Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps). Manage observability tools: logs, metrics, traces, and alerts. Tune backend services & GKE workloads (Node.js, Django, Go, Java). Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets). Lead incident responses & perform root cause analyses. Standardize secrets, tagging & infra consistency across environments. Enhance CI/CD pipelines & collaborate on better rollout strategies. Must-Have Skills: 5-10 years in DevOps / SRE / Infra roles. Kubernetes (GKE preferred). IaC with Terraform & Helm. CI/CD: GitHub Actions + GitOps (ArgoCD / Flux). Cloud architecture expertise (IAM, VPC, Secrets). Strong scripting/coding & backend debugging skills (Node.js, Django, etc.). Incident management with tools like Datadog & PagerDuty. Excellent communicator & documenter. Tech Stack: GKE, Kubernetes, Terraform, Helm. GitHub Actions, ArgoCD / Flux. Datadog, PagerDuty. CloudSQL, Cloudflare, IAM, Secrets. You're: A proactive team player & strong individual contributor. Confident yet humble. Curious, driven & always learning. Not afraid to solve deep infrastructure challenges. (ref:hirist.tech)

Posted 4 weeks ago

Apply

0.0 years

0 Lacs

Hyderabad, Telangana, India

On-site

Genpact (NYSE: G) is a global professional services and solutions firm delivering outcomes that shape the future. Our 125,000+ people across 30+ countries are driven by our innate curiosity, entrepreneurial agility, and desire to create lasting value for clients. Powered by our purpose - the relentless pursuit of a world that works better for people - we serve and transform leading enterprises, including the Fortune Global 500, with our deep business and industry knowledge, digital operations services, and expertise in data, technology, and AI. Inviting applications for the role of NOC & IM Engineer (L2) - Technical Associate. In this role, you will be responsible for monitoring and managing the detection, correction, and prevention of incidents, maintaining SLAs, and ensuring that critical issues are resolved promptly to reduce production downtime. Responsibilities: Monitor the production site and services across the enterprise. Detect issues and execute runbooks or escalate to the appropriate service owners for investigation. Manage ServiceNow tickets. Facilitate the resolution of major incidents across the enterprise to bring services back to a normal state and mitigate impact. Ensure incident communication to stakeholders is consistent, clear, concise, and timely. Perform a postmortem for every major incident once it is resolved to find the root cause, permanently fix the problem, and support continuous improvement. Ensure standardized methods, processes, and procedures are used for all changes. Facilitate efficient and prompt handling of all changes. Perform change governance and run CAB meetings. Oversee processes and tools for Incident, Problem, and Change Management. Define, generate, and publish KPIs/metrics for transparency into incidents and teams affected, problems and root causes, and change requests to the production environment. Hands-on experience with Incident, Problem, and Change Management.
Qualifications we seek in you! Minimum Qualifications / Skills: Bachelor's degree required, preferably in Computer Science, Information Systems, or a related field. Excellent communication skills. Preferred Tool Skills: BigPanda, SLAM / Neustar - Vercara, Tardis, SNOW (INCIDENT/INCIDENTTASK/REQ/SCTASK/PROBLEM/PTASK/CHANGE), PagerDuty, Splunk, Zabbix, JIRA, Teams, Reporting. Genpact is an Equal Opportunity Employer and considers applicants for all positions without regard to race, color, religion or belief, sex, age, national origin, citizenship status, marital status, military/veteran status, genetic information, sexual orientation, gender identity, physical or mental disability or any other characteristic protected by applicable laws. Genpact is committed to creating a dynamic work environment that values respect and integrity, customer focus, and innovation. Get to know us at genpact.com and on LinkedIn, X, YouTube, and Facebook. Furthermore, please note that Genpact does not charge fees to process job applications, and applicants are not required to pay to participate in our hiring process in any other way. Examples of such scams include purchasing a 'starter kit,' paying to apply, or purchasing equipment or training.

Posted 4 weeks ago

Apply

3.0 - 7.0 years

13 - 18 Lacs

Bengaluru

Work from Office

Project description We've been engaged by a large Australian financial institution to provide resources to manage the production support activities along with their existing team in Sydney & India. Responsibilities Carry out enhancements to maintenance/housekeeping scripts as required and monitor the DB growth periodically. Handles cloud Environment preparation, refresh, rebuild, upkeep, maintenance, and upgrade activities. Ensure cloud cost optimisation. Troubleshooting of Murex environment-specific issues including Infrastructure related issues and update pipelines for a permanent fix. Handling EOD execution and troubleshooting of issues related to it. Participate in analysis, solutioning, and deployment of solution for production issues during EoD. Participate in the release activity and coordinate with QA/Release teams. Participate in AWS stack deployment, AWS AMI patching, and stack configuration to ensure optimal performance and cost-efficiency. Address requests like warehouse rebuild, maintenance, Perform Health/sanity checks, create XVA engine, environment restores & backup in AWS as per project need. Perform Weekend maintenance and perform health checks in the production environment during the weekend. Support working in shifts (max end time will be 12.30 AM IST) and available for weekend & on-call support. Have to work out of client location on a need basis. Flexible to work in a Hybrid model. 
Skills. Must have: 4 to 8 years of experience in Murex Production Support; Murex End of Day support; troubleshooting batch-related issues, including date moves and processing adjustments; Murex environment management and troubleshooting; experience in SQL, Unix shell scripting, monitoring tools, and web development; experience in the release and CI/CD process; Linux/Unix server and Oracle RDS knowledge; working experience with automation/job scheduling tools such as Autosys and GitHub Actions; working experience with monitoring tools like Grafana, Splunk, Obstack, and PagerDuty; good communication and organization skills working within a DevOps team supporting a wider IT delivery team. Nice to have: PL/SQL and scripting languages (Python); advanced troubleshooting experience with shell scripting and Python; experience with CICD tools like Git, flows, Ansible, and AWS including CDK; exposure to AWS Cloud environment; willingness to learn and obtain AWS certification. Other. Languages: English, C1 Advanced. Seniority: Regular.

Posted 4 weeks ago

Apply

3.0 - 5.0 years

9 - 11 Lacs

Bengaluru

Hybrid

Dear Professional, We are excited to present a unique opportunity at Cognizant, a leading IT firm renowned for fostering growth and innovation. We are seeking talented professionals with 3 to 5 years of experience in Major Incident Management, Critical Incident Handling, Incident Response, ITIL Incident Management, Root Cause Analysis, Incident Escalation, Service Restoration, War Room Coordination, ServiceNow, BMC Remedy, Jira Service Management, PagerDuty, ISO 20000, COBIT, Major Incident Manager, Incident Response Lead to join our dynamic team. Your expertise in these areas is highly sought after, and we believe your contributions will be instrumental in driving our projects to new heights. We offer a collaborative environment where your skills will be valued and nurtured. To proceed to the next step of the recruitment process, please send the following details along with your updated resume to sathish.kumarmr@cognizant.com. Please share the details below (mandatory): Full name (as per PAN card): Contact number: Email: Current location: Interested locations: Total years of experience: Relevant years of experience: Current company: Notice period: Is the notice period negotiable? If yes, by how many days: If you are serving a notice period, please mention your last working date: Current CTC: Expected CTC: Availability for interview on weekdays: Highest qualification: Additionally, we would like to schedule a virtual interview with you on 26th June 2025. Kindly confirm your availability for the same. We look forward to the possibility of you bringing your valuable experience to Cognizant. Please respond at your earliest convenience. Thanks & Regards, Sathish Kumar M R HR-Cognizant Sathish.KumarMR@cognizant.com

Posted 1 month ago

Apply

8.0 - 12.0 years

0 Lacs

Hyderabad, Telangana, India

On-site

About Zeta: Zeta is a Next-Gen Banking Tech company that empowers banks and fintechs to launch banking products for the future. It was founded by and Ramki Gaddipati in 2015. Our flagship processing platform - Zeta Tachyon - is the industry's first modern, cloud-native, and fully API-enabled stack that brings together issuance, processing, lending, core banking, fraud & risk, and many more capabilities as a single-vendor stack. 20M+ cards have been issued on our platform globally. Zeta is actively working with the largest banks and fintechs in multiple global markets, transforming customer experience for multi-million card portfolios. Zeta has 1700+ employees - with over 70% of roles in R&D - across locations in the US, EMEA, and Asia. We raised $280 million at a $1.5 billion valuation from Softbank, Mastercard, and other investors in 2021. The Site Delivery Manager is responsible for end-to-end service delivery and operational excellence for a specific site. This role ensures the stability, performance, and continuous improvement of IT services, while managing key performance indicators (KPIs), incident and change management, cost governance, and customer satisfaction. The individual will serve as the primary liaison between business stakeholders, SRE/infra teams, and other technology units to drive operational maturity and service reliability. 
Responsibilities: Service Delivery & Operations Management Own and manage site-level SLAs for incidents, problems, and changes Ensure adherence to MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve) metrics for Alerts & Incidents Oversee incident lifecycle and ensure timely Root Cause Analysis (RCA) Track problem ticket aging and drive problem resolution Manage service delivery reviews, post-incident reviews, and escalations Change Management Lead the Change Advisory Board (CAB) process at the site level Review and approve changes ensure minimal service disruption during deployments Validate and document post-deployment summaries and outcomes Monitoring & Governance Oversee handover of SaaS product monitoring responsibilities to Zeta command center (ZCC) Monitor alerts, dashboards, and performance trends to proactively prevent incidents Maintain high security posture by coordinating with InfoSec and Compliance teams Customer and Stakeholder Engagement Act as the primary point of contact for internal and external stakeholders at the site Own customer-facing RCA communication and service quality improvements Facilitate cross-functional collaboration across product, SRE, infrastructure, and customer teams Cost & Resource Management Own and manage the site's technology budget ensure cost adherence Conduct monthly/quarterly cost anomaly analysis and optimizations Work with platform and finance team for infrastructure/resource planning People & Process Drive process improvements and operational maturity Foster a culture of accountability, resilience, and continuous improvement Skills: Strong operational and delivery management Excellent communication, stakeholder, and conflict-resolution skills Data-driven decision-making and analytical thinking Budgeting, cost analysis, and resource planning Familiarity with cloud platforms (AWS) Experience & Qualifications: Bachelor's degree in computer science, Engineering, or a related field (master's preferred) 8-12 
years of experience in IT Service Management, SRE, or infrastructure operations Strong understanding of ITIL framework, site reliability principles, and cloud operations Experience with monitoring tools (e.g., Datadog, Prometheus, Grafana), incident platforms (e.g., OpsGenie/PagerDuty, Jira Service Management / ServiceNow), and change management tools Proven leadership skills in managing cross-functional teams and engaging with senior stakeholders

Posted 1 month ago

Apply

5.0 - 7.0 years

25 - 40 Lacs

Pune

Work from Office

Our world is transforming, and PTC is leading the way. Our software brings the physical and digital worlds together, enabling companies to improve operations, create better products, and empower people in all aspects of their business. Our people make all the difference in our success. Today, we are a global team of nearly 7,000 and our main objective is to create opportunities for our team members to explore, learn, and grow – all while seeing their ideas come to life and celebrating the differences that make us who we are and the work we do possible. Job Details: As a senior SRE / Observability Engineer, you will be part of the Atlas Platform Engineering team and will: Create and maintain observability standards and best practices. Review the current observability platform, identify areas for improvement, and guide the team in enhancing monitoring, logging, tracing, and alerting capabilities. Expand the observability stack across multiple clouds, regions, and clusters, managing all observability data. Design and implement monitoring solutions for complex distributed systems to provide deep insights into systems and services, aiming at complete visibility of digital operations. Support the ongoing evaluation of new capabilities in the observability stack, conducting proof of concepts, pilots, and tests to validate their suitability. Assist teams in creating clear, informative, and actionable dashboards to improve system visibility. Automate monitoring and alerting processes, including enrichment strategies and ML-driven anomaly detection where applicable. Provide technical leadership to the observability team with clear priorities, ensuring agreed outcomes are achieved in a timely manner. Work closely with R&D and product development teams (understand their requirements and challenges) to ensure seamless visibility into system and service performance. 
Work closely with the Traffic Management team to identify and standardise on existing and new observability tools as part of a holistic solution. Conduct training sessions and create documentation for internal teams. Support the definition of SLIs (service level indicators) and SLOs (service level objectives) for the Atlas services. Keep track of the error budget of each service. Participate in the emergency response process. Conduct RCAs (root cause analyses). Help to automate repetitive tasks and reduce toil. Qualifications: People and communication qualifications: Be a strong team player. Have good collaboration and communication skills. Ability to translate technical concepts for non-technical audiences. Problem-solving and analytical thinking. Technical qualifications - general: Familiarity with cloud platforms (ideally Azure). Familiarity with Kubernetes and Istio as the architecture on which the observability and Atlas services run, and how they integrate and scale. Experience with infrastructure as code and automation. Knowledge of common programming languages and debugging techniques. Have a strong technical background and be hands on. Linux and scripting languages (Bash, Python, Golang). Significant understanding of DevOps principles. Technical qualifications - observability: Strong understanding of observability principles (metrics, logs, traces). Experience with APM tools and distributed tracing. Proficiency in log aggregation and analysis. Knowledge and hands-on experience with monitoring, logging, and tracing tools such as Prometheus, Grafana, Datadog, New Relic, Sumologic, ELK Stack, or others. Knowledge of OpenTelemetry, including the OTEL collector and code instrumentation. Experience designing and building unified observability platforms that enable the use of data (metrics, logs, and traces) to determine quickly if their application or service is operating as desired. 
Technical qualifications – SRE: Understanding of the Google SRE principles. Experience in defining SLIs and SLOs. Experience in performing RCAs (root cause analysis). Experience in system performance. Experience in incident response. Knowledge of status tools, such as Atlassian Status Page or similar. Knowledge of incident management and paging tools, such as PagerDuty or similar. Knowledge of ITIL (Information Technology Infrastructure Library) processes. Life at PTC is about more than working with today’s most cutting-edge technologies to transform the physical world. 
It’s about showing up as you are and working alongside some of today’s most talented industry leaders to transform the world around you. If you share our passion for problem-solving through innovation, you’ll likely become just as passionate about the PTC experience as we are. Are you ready to explore your next career move with us? We respect the privacy rights of individuals and are committed to handling Personal Information responsibly and in accordance with all applicable privacy and data protection laws. Review our Privacy Policy here.

Posted 1 month ago

Apply

5.0 - 10.0 years

40 - 100 Lacs

Pune, Bengaluru, Delhi / NCR

Hybrid

Experience in Site Reliability Engineering, DevOps, and managing teams, including mentoring and developing engineers. Prometheus, Grafana, ELK Stack, Splunk, Datadog, New Relic, AWS, GCP, Azure, Docker, Kubernetes, Python, Go, Bash, or similar.

Posted 1 month ago

Apply

4.0 - 8.0 years

6 - 10 Lacs

Coimbatore

Work from Office

The Opportunity: Avantor is looking for a dynamic, forward-thinking, and experienced Engineer L0 - Scheduling and Alerting, who will be responsible for delivering results against some of the most complex business and technology initiatives. This role will be a full-time position based out of IND - Coimbatore. If you are passionate about solving complex challenges and driving innovation, let's talk! Reporting to the Sr. Manager of IT Services, the IT Engineer Associate is responsible for supporting multiple applications across the organization, which include Windows, AD, Citrix, Office 365, and additional applications like PagerDuty and Redwood CPP, on which you will be trained. As a member of our well-respected IT team, you will enjoy a wide variety of self-directed work within a supportive team environment. A positive attitude with the desire to focus on and please the customer, including the ability to quickly understand the customer's point of view, will be key success factors in your role. MAJOR JOB DUTIES AND RESPONSIBILITIES (List in order of importance): Monitor event alerts, acknowledge and, when appropriate, escalate to the next level support team(s). Perform in-depth monitoring for P1 and P2 critical applications and basic monitoring for P3 and P4 applications. Notify the Outage Management Team as the first point of contact for critical P1 and P2 alerts to ensure timely escalation and resolution. Schedule jobs in the SAP tool for different systems, ensure successful runs, and restart when required. Clean up NAS backup server files. Prepare the weekly error report and ensure tickets are created for all failed jobs. Prepare weekly & monthly task performance/aging reports, drive aging calls with the wider team, and ensure tickets are closed on time or a justification is recorded if required. Support IT changes, prioritizing change requests, assessing impact, and accepting changes which meet requirements. Maintain the internal knowledge repository. 
Manage the ticketed query system and ensure queries and resolutions are tracked and kept up to date. QUALIFICATIONS (Education/Training, Experience and Certifications): Bachelor's degree or equivalent experience within an enterprise-level corporate IT environment is required. Experience in IT monitoring is highly desirable. Direct experience with Jenkins, Nprinting, Cloudwatch, Qlikview, SolarWinds, Redwood, OpManager and/or PagerDuty is highly desirable. Certifications in AWS or ITIL are a plus. KNOWLEDGE, SKILLS AND ABILITIES (Those necessary to perform the job competently): Knowledge of ITIL-based Incident, Problem and Change Management processes. Strong problem solving and analytical skills. Ability to self-start and to effectively participate in a team environment. Ability to be an on-call escalation point for production support and scheduled off-hours/weekend work if/when required. Ability to focus on the customer and to adhere to processes defined for customer issue handling. Ability to examine, summarize, and effectively present data when required. Commitment to high professional and ethical standards in a diverse workplace. Disclaimer: The above statements are intended to describe the general nature and level of work being performed by employees assigned to this classification. They are not intended to be construed as an exhaustive list of all responsibilities, duties and skills required of employees assigned to this position. Avantor is proud to be an equal opportunity employer. Why Avantor: Dare to go further in your career. Join our global team of 14,000+ associates whose passion for discovery and determination to overcome challenges relentlessly advances life-changing science. The work we do changes people's lives for the better. It brings new patient treatments and therapies to market, giving a cancer survivor the chance to walk his daughter down the aisle. It enables medical devices that help a little boy hear his mom's voice for the first time. 
Outcomes such as these create unlimited opportunities for you to contribute your talents, learn new skills and grow your career at Avantor. We are committed to helping you on this journey through our diverse, equitable and inclusive culture which includes learning experiences to support your career growth and success. At Avantor, dare to go further and see how the impact of your contributions set science in motion to create a better world. Apply today! EEO Statement: We are an Equal Employment/Affirmative Action employer and VEVRAA Federal Contractor. We do not discriminate in hiring on the basis of sex, gender identity, sexual orientation, race, color, religious creed, national origin, physical or mental disability, protected Veteran status, or any other characteristic protected by federal, state/province, or local law. If you need a reasonable accommodation for any part of the employment process, please contact us by email at recruiting@avantorsciences.com and let us know the nature of your request and your contact information. Requests for accommodation will be considered on a case-by-case basis. Please note that only inquiries concerning a request for reasonable accommodation will be responded to from this email address. 3rd party non-solicitation policy:

Posted 1 month ago

Apply

1.0 - 3.0 years

10 - 15 Lacs

Bengaluru

Work from Office

SRE 1 (Cloud Ops). Locations: B'lore & Pune. Exp: 1 to 3 yrs. Candidates only from B2C product companies. Exp: GCP, Prometheus, Grafana, ELK, New Relic, Pingdom, or PagerDuty; Kubernetes. Experience with CI/CD tools. 5 days a week. Rotational shift.

Posted 1 month ago

Apply

10.0 - 12.0 years

30 - 37 Lacs

Bengaluru

Work from Office

We need immediate joiners or those who are serving their notice period and can join within 10-15 days. Candidates who are on the bench or have an official 2- or 3-month notice period will not be considered. Strong working experience in design and development of RESTful APIs using Java, Spring Boot and Spring Cloud. Technical hands-on experience to support development, automated testing, infrastructure and operations. Fluency with relational databases or alternatively NoSQL databases. Excellent pull request review skills and attention to detail. Experience with streaming platforms (real-time data at massive scale, like Confluent Kafka). Working experience in AWS services like EC2, ECS, RDS, S3 etc. Understanding of DevOps as well as experience with CI/CD pipelines. Industry experience in the Retail domain is a plus. Exposure to Agile Methodology and project tools: Jira, Confluence, SharePoint. Working knowledge of Docker containers/Kubernetes. Excellent team player, with the ability to work independently and as part of a team. Experience in mentoring junior developers and providing technical leadership. Familiarity with monitoring & reporting tools (Prometheus, Grafana, PagerDuty etc). Ability to learn, understand, and work quickly with new emerging technologies, methodologies, and solutions in the Cloud/IT technology space. Knowledge of a front-end framework such as React or Angular and any other programming languages like JavaScript/TypeScript or Python is a plus.

Posted 1 month ago

Apply

4.0 - 6.0 years

6 - 8 Lacs

Hyderabad

Work from Office

What you will do: In this vital role you will be responsible for managing the organization's global observability service. The role includes planning, implementation, performance tuning, and maintenance of enterprise server platforms with a focus on reliability, security, and automation. Lead a high-performing team managing monitoring and observability services. Develop and maintain the Dynatrace observability environment. Contribute to infrastructure design and architecture planning. Implement automation using PowerShell, Python, or Ansible. Monitor systems and proactively address performance bottlenecks. Collaborate with multi-functional teams on infrastructure needs. Document system configurations and operational procedures. What we expect of you: We are all different, yet we all use our unique contributions to serve patients. The ideal candidate will have a consistent record in monitoring and observability practices in large environments and in Infrastructure Operations, and a passion for fostering innovation and excellence in the biotechnology industry. Basic Qualifications: Master's degree and 4 to 6 years of IT related field experience OR Bachelor's degree and 6 to 8 years of IT related field experience OR Diploma and 10 to 12 years of IT related field experience. Preferred Qualifications: Must-Have Skills: Advanced knowledge of monitoring and notification technologies and observability concepts using Dynatrace and PagerDuty. Experience with infrastructure and application monitoring, Windows, Linux, OpenTelemetry, and integrations. Experience with TypeScript, React, Ansible, and Python scripting. Good understanding of networking, storage integration, and similar infrastructure services. 
Change management expertise Hands-on expertise in scripting and automation tools Knowledge of containers and K8 environment Good-to-Have Skills: Experience with cloud services (AWS, Azure, GCP) Experience with ITIL processes and frameworks Experience with CI/CD and DevOps practices Understanding of configuration management and automation tools (Ansible and Terraform) Professional Certifications: Associate or Specialist Certification from Dynatrace ITIL Foundation (preferred) Soft Skills: Excellent troubleshooting and analytical abilities Strong communication skills, both written and verbal Ability to work in a fast-paced environment Shift Information: This position requires you to be onsite and participate in 24/5 and weekend on call in rotation fashion and may require you to work a later shift. Candidates must be willing and able to work off hours, as required based on business requirements.

Posted 1 month ago

Apply

1.0 - 3.0 years

10 - 15 Lacs

Pune, Bengaluru

Work from Office

Must have a minimum of 1 yr exp in SRE (CloudOps), Google Cloud Platform (GCP), and monitoring, APM, and alerting tools like Prometheus, Grafana, ELK, New Relic, Pingdom, or PagerDuty. Hands-on experience with Kubernetes for orchestration and container mgt. Required Candidate profile: Mandatory experience working in B2C product companies. Must have experience with CI/CD tools (e.g. Jenkins, GitLab CI/CD, CircleCI, TravisCI).

Posted 1 month ago

Apply

10.0 - 19.0 years

13 - 22 Lacs

Hyderabad, India

Hybrid

Department: Information Technology Employment Type: Full Time Location: India Description V3locity, Vitech’s cloud-native administration, engagement, and analytics platform, is a transformative suite of complementary applications that offers full life cycle business functionality and robust enterprise capabilities. It marries core administration with superior digital experience and augmented analytics. Its modular design enables flexible, agile deployment strategies. V3locity employs an advanced, cloud-native architecture that leverages the unique capabilities of AWS to deliver a solution with unparalleled security, scalability, and resiliency. Senior Manager– IT Service Management (ITSM) Location: Hyderabad - Hybrid We are seeking a dynamic and experienced IT Service Management (ITSM) leader to lead and enhance our global IT and Cloud operations. The ideal candidate will oversee core ITSM functions, including Service Desk, Incident Management, Problem Management, Change Management, and Service Request Fulfillment in a 24/7, fast-paced software product environment. This leader will play a strategic role in driving continuous improvement, implementing best practices in ITSM, and maturing overall service delivery practices. What you will do: ITSM: Define and drive the ITSM strategy aligned with organizational goals and customer satisfaction. Lead and develop the ITSM function, including Service Desk, Incident, Problem, and Change Management teams based out of our Hyderabad Office. Drive adoption and maturity of ITIL practices across the IT organization. Service Desk Operations: Oversee global service desk operations, ensuring high-quality and timely technical support. Establish and monitor SLAs, KPIs, and customer satisfaction metrics. Ensure timely delivery of customer monthly SLA reporting, leveraging tools like New Relic. Manage on-call rotation for all Service Teams using tools like PagerDuty. 
Incident & Problem Management: Lead major incident response and communication processes, ensuring minimal impact and quick resolution. Drive root cause analysis, problem identification, and long-term resolution strategies. Maintain high availability and performance of business-critical services. Change & Release Management: Establish and govern change control procedures ensuring safe, secure, and timely releases. Collaborate with DevOps and engineering teams to align change processes with agile product development, deployment, and releases. ITSM Tools & Reporting: Own and optimize the ITSM platform (e.g., ServiceNow, Jira Service Management). Own and deliver our monthly client SLA reporting cadence. Deliver regular operational reports, dashboards, and executive summaries leveraging Jira Service Management. Identify and implement continuous improvement opportunities based on data insights. Governance & Compliance: Ensure compliance with internal policies, external regulations (e.g., ISO, SOC2), and audit requirements. Maintain clear documentation and process alignment with industry standards (ITIL v4, COBIT). Team Development & Leadership: Lead, mentor, and develop a high-performing team of ITSM professionals. Foster a culture of accountability, collaboration, and service excellence. Manage vendor relationships and third-party service providers as needed. What We're Looking For: 12–15+ years of ITSM experience, with 5+ years in a Service Management role. Proven experience managing global service desk operations and ITIL processes in a product or SaaS environment. ITIL v4 certification; certifications in Agile/Scrum, COBIT, or PMP are a plus. High-level technical knowledge of, or certification in, AWS or another cloud platform. Hands-on experience with ITSM tools like ServiceNow, Jira Service Management, or similar. Working experience with monitoring and service management tools such as New Relic, PagerDuty, Honeycomb, and Splunk.
Proven experience managing the incident lifecycle, problem, and change processes. Excellent communication, stakeholder management, and crisis management skills. Experience working with global teams across time zones. Prior experience in a software product or SaaS company is highly desirable. Strong business acumen and ability to align IT services with organizational goals. Able to work in shifts and provide technical leadership on the tasks and issues that arise during the shift. Join Us at Vitech! At Vitech, you’ll be part of a forward-thinking team that values collaboration, innovation, and continuous improvement. We provide a supportive and inclusive environment where you can grow as a leader while helping shape the future of our organization.

Posted 1 month ago

Apply

6.0 - 9.0 years

18 - 20 Lacs

Pune

Work from Office

Notice Period: Immediate joiners only. Duration: 6 months (possible extension). Shift Timing: 11:30 AM - 9:30 PM IST. About the Role: We are looking for a highly skilled and experienced DevOps / Site Reliability Engineer to join on a contract basis. The ideal candidate will be hands-on with Kubernetes (preferably GKE), Infrastructure as Code (Terraform/Helm), and cloud-based deployment pipelines. This role demands deep system understanding, proactive monitoring, and infrastructure optimization skills. Key Responsibilities: Design and implement resilient deployment strategies (Blue-Green, Canary, GitOps). Configure and maintain observability tools (logs, metrics, traces, alerts). Optimize backend service performance through code and infra reviews (Node.js, Django, Go, Java). Tune and troubleshoot GKE workloads, HPA configs, ingress setups, and node pools. Build and manage Terraform modules for infrastructure (VPC, CloudSQL, Pub/Sub, Secrets). Lead or participate in incident response and root cause analysis using logs, traces, and dashboards. Reduce configuration drift and standardize secrets, tagging, and infra consistency across environments. Collaborate with engineering teams to enhance CI/CD pipelines and rollout practices. Required Skills & Experience: 5-10 years in DevOps, SRE, Platform, or Backend Infrastructure roles. Strong coding/scripting skills and ability to review production-grade backend code. Hands-on experience with Kubernetes in production, preferably on GKE. Proficient in Terraform, Helm, GitHub Actions, and GitOps tools (ArgoCD or Flux). Deep knowledge of Cloud architecture (IAM, VPCs, Workload Identity, CloudSQL, Secret Management). Systems thinking: understands failure domains, cascading issues, timeout limits, and recovery strategies. Strong communication and documentation skills; capable of driving improvements through PRs and design reviews.
Tech Stack & Tools: Cloud & Orchestration: GKE, Kubernetes. IaC & CI/CD: Terraform, Helm, GitHub Actions, ArgoCD/Flux. Monitoring & Alerting: Datadog, PagerDuty. Databases & Networking: CloudSQL, Cloudflare. Security & Access Control: Secret Management, IAM. Driving Results: A good individual contributor and a good team player. Flexible attitude towards work, as needed. Proactively identifies and communicates issues and risks. Other Personal Characteristics: Dynamic, engaging, self-reliant developer. Able to deal with ambiguity. Maintains a collaborative and analytical approach. Self-confident and humble. Open to continuous learning. Intelligent, rigorous thinker who can operate successfully amongst bright people.

Posted 1 month ago

Apply

Featured Companies