About the Role:
This role is responsible for managing and maintaining complex, distributed big data ecosystems.
It ensures the reliability, scalability, and security of large-scale production infrastructure. Key
responsibilities include automating processes, optimizing workflows, troubleshooting
production issues, and driving system improvements across multiple business verticals.
Roles and Responsibilities:
- Manage and maintain Linux/Unix environments, and support incremental changes to them.
- Lead on-call rotations and incident responses, conducting root cause analysis and driving postmortem processes.
- Design and implement automation systems for managing big data infrastructure, including provisioning, scaling, upgrades, and patching clusters.
- Troubleshoot and resolve complex production issues, identifying root causes and implementing mitigation strategies.
- Design and review scalable and reliable system architectures.
- Collaborate with teams to optimize overall system/cluster performance.
- Enforce security standards across systems and infrastructure.
- Set technical direction, drive standardization, and operate independently.
- Ensure availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.
- Respond to, analyze, and resolve system outages and disruptions, and implement measures to prevent similar incidents from recurring.
- Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience (a minimal illustrative sketch follows this list).
- Monitor and optimize system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning.
- Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle.
- Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities.
- Develop and enforce SRE best practices and principles.
- Align across functional teams on priorities and deliverables.
- Drive automation to enhance operational efficiency.
- Adopt new technologies as needs arise and define architectural recommendations for new tech stacks.
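To illustrate the kind of operational automation described above (an illustration only, not part of the posting): a minimal Python sketch that flags high HDFS capacity usage via the `hdfs dfsadmin -report` CLI. It assumes an `hdfs` client is on PATH, and the 90% threshold and script structure are hypothetical choices.

    #!/usr/bin/env python3
    """Minimal, illustrative sketch only: flag high HDFS capacity usage.

    Assumes an `hdfs` CLI on PATH; the 90% threshold is a hypothetical value.
    """
    import re
    import subprocess
    import sys

    CAPACITY_THRESHOLD_PCT = 90.0  # hypothetical alerting threshold

    def hdfs_used_pct() -> float:
        """Parse the cluster-wide 'DFS Used%' figure from `hdfs dfsadmin -report`."""
        report = subprocess.run(
            ["hdfs", "dfsadmin", "-report"],
            capture_output=True, text=True, check=True,
        ).stdout
        match = re.search(r"DFS Used%:\s*([\d.]+)%", report)
        if not match:
            raise RuntimeError("could not find 'DFS Used%' in dfsadmin report")
        return float(match.group(1))

    def main() -> int:
        used = hdfs_used_pct()
        if used >= CAPACITY_THRESHOLD_PCT:
            print(f"WARNING: HDFS usage {used:.1f}% >= {CAPACITY_THRESHOLD_PCT}%")
            return 1
        print(f"OK: HDFS usage {used:.1f}%")
        return 0

    if __name__ == "__main__":
        sys.exit(main())

In practice a check like this would typically run from cron or a scheduler and feed an alerting pipeline rather than print to stdout.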
Preferred Candidate Profile:
- Over 6 years of experience managing and maintaining distributed big data ecosystems.
- Strong expertise in Linux, including IP networking, iptables, and IPsec.
- Proficiency in scripting/programming with languages like Perl, Golang, or Python.
- Hands-on experience with the Hadoop stack (HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot).
- Familiarity with open-source configuration management and deployment tools such as Puppet, Salt, Chef, or Ansible.
- Solid understanding of networking, open-source technologies, and related tools.
- Excellent communication and collaboration skills.
- DevOps tools: SaltStack, Ansible, Docker, Git.
- SRE logging and monitoring tools: ELK Stack, Grafana, Prometheus, OpenTSDB, OpenTelemetry (see the illustrative sketch after this list).
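As an illustration only (not a requirement of the posting): a minimal Python sketch of exposing a custom operational metric for Prometheus to scrape, assuming the prometheus_client library is installed; the metric name and port are hypothetical choices.

    """Illustrative sketch only: expose a custom gauge for Prometheus to scrape.

    Assumes the prometheus_client library (pip install prometheus-client);
    the metric name and port below are hypothetical, not part of this posting.
    """
    import shutil
    import time

    from prometheus_client import Gauge, start_http_server

    # Hypothetical gauge: free bytes on the root filesystem.
    root_free_bytes = Gauge(
        "node_root_disk_free_bytes",
        "Free bytes on the root filesystem (illustrative metric)",
    )

    if __name__ == "__main__":
        start_http_server(9105)   # hypothetical scrape port
        while True:
            root_free_bytes.set(shutil.disk_usage("/").free)
            time.sleep(15)        # refresh roughly once per scrape interval

A gauge exposed this way can then be graphed in Grafana or used in Prometheus alerting rules.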
Good to Have:
- Experience managing infrastructure on public cloud platforms (AWS, Azure, GCP).
- Experience in designing and reviewing system architectures for scalability and reliability.
- Experience with observability tools to visualize and alert on system performance.
- Experience with petabyte-scale data migrations and large-scale upgrades.