Senior Associate - Reliability Operations

2 - 4 years

0 Lacs

Posted: 1 month ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

About the Role:


Responsibilities:

  • 24x7 Monitoring and Support: Oversee the health, performance, and availability of cloud-based SaaS infrastructure and applications using monitoring tools like Prometheus and Grafana, and respond to alerts during assigned shifts. Align with and adhere to organizational processes to maintain the SLA.
  • Incident Management: Act as the first responder in a 24x7 rotation, managing and mitigating service disruptions, following standard incident procedures, and escalating issues to SMEs as needed.
  • Deployments and Change Management: Manage the deployment lifecycle of the applications. Proactively engage with SMEs to resolve deployment process issues or challenges.
  • Troubleshooting and Resolution: Use diagnostic tools and scripts to resolve common issues in real time, and collaborate with cross-functional teams to analyze and address root causes.
  • Service Health and Reliability: Assist in defining and refining SLAs, SLOs, and SLIs; perform routine checks and follow established runbooks to maintain consistent service reliability.
  • Analysis and Reporting: Regularly review incident data to identify patterns, improve service resilience, and produce shift reports summarizing system health and resolved incidents.
  • Documentation and Knowledge Base: Document incident resolutions, update runbooks, and contribute to an internal knowledge base to improve team response and efficiency.
  • Continuous Improvement Initiatives: Participate in reliability enhancement projects, including automation, configuration management, and tooling improvements.
  • Collaboration: Communicate effectively with SMEs to relay critical incident information, insights, and preventive recommendations.
  • Mentorship: Work closely with team members to provide guidance during shifts and share insights on improving incident response.


Experience and Qualifications:

  • Education: B.Sc IT, B.Sc Computers, BCA, or equivalent.
  • Experience: 2-4 years of experience in reliability operations or a related 24x7 support role within SaaS or cloud environments.


Skills:

  • Proficiency in monitoring and alerting tools, such as Prometheus, Grafana, Datadog, or Splunk.
  • Ability to remain composed in high-stakes situations and resolve incidents promptly.
  • Strong verbal and written communication skills to document and relay incident information effectively.


Shift Information:

  • 24x7 Rotational Shifts: This role requires availability to work rotating shifts, including nights, weekends, and holidays, to ensure 24x7 support coverage.

Zeta

Fintech

Menlo Park
