Posted: 3 weeks ago
Platform:
On-site
Part Time
Your Mission as SRE Manager

As an SRE manager, you are responsible for the availability and reliability of Calix’s cloud. At Calix, Site Reliability Engineering combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. You will lead a team of Site Reliability Engineers and oversee the reliability, scalability, and maintainability of Calix's critical infrastructure. This includes building and maintaining automation tools, managing on-call rotations, collaborating with development teams, and ensuring systems meet service level objectives (SLOs). Throughout, you will prioritize continuous improvement and keep a strong focus on infrastructure health and stability within the Calix platform, leveraging tools like Terraform, observability frameworks from the Grafana Labs ecosystem, and Google Cloud Platform.

Key Responsibilities

SRE Leadership: Manage and mentor a team of SREs by running weekly sprints, providing technical guidance, fostering a collaborative environment to achieve team goals, and building a culture of high performance. Collaborate with your peers in Platform Engineering and Application Development to ensure the reliability of what gets deployed to production. This is a hands-on role that requires coding, code reviews, and strong technical guidance.

Monitoring and Alerting: Use monitoring systems to proactively identify potential issues and act on them immediately, before they become disruptive. Eliminate red blindness and ensure high-fidelity, actionable alerts by adhering to best practices for alert implementations and thresholds. Continually optimize for better observability and actionable alerting.

Reliability Engineering: Build a culture of reliability by collaborating with Platform Engineering and Application Development teams from design time through implementation and testing.
Enforce reliability and resilience by ensuring systems are built to be highly available (HA) through proper design, code reviews, and rigorous testing of failure modes.

Performance and Scalability Optimization: Identify bottlenecks using profilers and distributed tracing frameworks. Implement performance improvements across Calix's infrastructure. Work cross-functionally with development teams to guide them toward better performance, scalability, and cost efficiency.

Capacity Planning: Proactively monitor system performance and capacity, identifying potential bottlenecks and scaling systems as needed. Calix is constantly growing, so making sure that we scale appropriately is an area of constant focus.

Automation Development: Drive the development and implementation of automation tools to streamline operations, including deployment pipelines, monitoring, and self-healing mechanisms.

Incident Management: Participate in an on-call incident manager rotation. Lead incident response, drive root cause analysis, hold blameless post-mortem reviews, and work cross-functionally to implement corrective and preventative actions. Ensure that incidents never repeat.

Implement and Enforce SLIs/SLOs: Ensure that all service endpoints and critical user journeys are monitored, visualized, covered by alerts, and tied to associated SLOs. Work closely with development teams, product owners, and other stakeholders to ensure alignment on and enforcement of SLOs and error budgets.

On-Call Management: Establish and manage on-call rotations for the SRE team, ensuring timely response to and resolution of system alerts and incidents. You will blend skills and experience levels to ensure a well-rounded team of responders capable of handling a diverse range of production issues. Clearly define the duties of on-call staff.
This includes outlining their responsibilities for monitoring alerts, maintaining playbooks, eliminating toil, following handover protocols, troubleshooting incidents, escalating issues, and collaborating with other teams.

Qualifications: Strong experience as an SRE manager with a proven track record of managing large-scale, highly available systems. Expertise in cloud computing platforms (preferably Google Cloud Platform). Knowledge of core operating system principles, networking fundamentals, and systems management. Programming skills in languages like Python and Go. Proven experience building and leading SRE teams, including hiring, coaching, and performance management. Deep understanding of and expertise in building and maintaining scalable open-source monitoring tools and backend storage. Experience with incident management processes and best practices. Excellent communication and collaboration skills to work with cross-functional teams. Knowledge of SRE principles, including error budgets, fault analysis, and reliability engineering concepts.

Education: B.S. or M.S. in Computer Science or an equivalent field.

About Us

PLEASE NOTE: All emails from Calix will come from a '@calix.com' email address. Please verify and confirm any communication from Calix prior to disclosing any personal or financial information. If you receive a communication that you think may not be from Calix, please report it to us at talentandculture@calix.com.

Calix delivers a broadband platform and managed services that enable our customers to improve life one community at a time. We’re at the forefront of a once-in-a-generation change in the broadband industry. Join us as we innovate, help our customers reach their potential, and connect underserved communities with unrivaled digital experiences. This is the Calix mission: to enable CSPs of all sizes to Simplify. Innovate. Grow.
If you are a person with a disability needing assistance with the application process, please email us at calix.interview@calix.com or call us at +1 (408) 514-3000.

Calix is a Drug Free Workplace.
Calix