Staff Engineer/Tech Lead AI/ML [Natural Language Processing, Transformers, Gen AI, LLM, Neural Networks]
The Opportunity
Staff Engineer (MTS-6)
About the Team
Panacea
Why Join Us
- Build AI-first observability tools that redefine how engineers triage and troubleshoot.
- Own systems that reduce hours of manual work in engineering and SRE workflows.
- Collaborate with a tight-knit team of high-ownership engineers who are passionate about impact and innovation.
- Hybrid work model that supports flexibility and deep focus.
- Help shape the central AI charter at Nutanix and influence future AI products across the company.
Your Role
AI-Powered Observability Platform: Own the vision, architecture, and delivery of Panacea's ML-based log and metrics analyzer that reduces triage time and improves engineering efficiency.
AI/ML-powered Log Analyzer Tool: Use deep learning (e.g., ModernBERT) to represent log messages, detect anomalies, group patterns, and surface actionable insights to users.
Metrics Anomaly Detection Engine: Build robust ML models to detect anomalies in time-series metrics like CPU, memory, disk I/O, network traffic, service health, and more, automatically identifying performance degradation or system regressions across distributed environments.
Auto-RCA Engine: Combine log and metrics signals with graph-based correlation and LLM-powered summarization to automatically diagnose the root cause of system failures.
Feedback Loop & Continuous Learning: Build infrastructure for incorporating user feedback to continuously retrain and improve anomaly detection systems.
LLM Integration: Integrate LLMs for user queries, problem summarization, anomaly explanation, and contextual recommendations.
Central AI Charter: Contribute to Nutanix's foundational AI platform by defining shared tooling, datasets, governance, and reusable ML components across products.
Responsibilities
- Architect and scale ML pipelines for real-time and batch-based anomaly detection in both logs and time-series metrics.
- Build and fine-tune ModernBERT and other transformer-based models for log understanding, anomaly classification, and summarization.
- Develop unsupervised and semi-supervised ML models for detecting anomalies in system metrics (CPU, memory, network throughput, latency, etc.).
- Implement correlation models to connect anomalies across logs and metrics to form a cohesive RCA narrative.
- Own the entire ML lifecycle: data ingestion, feature extraction, model training, evaluation, deployment, and monitoring.
- Build explainable AI systems that increase adoption and trust within engineering, QA, and support teams.
- Collaborate with cross-functional stakeholders (SRE, QA, Dev) to deeply understand pain points and translate them into intelligent tooling.
- Drive technical excellence through code and design reviews, mentoring, and setting engineering best practices.
What You Will Bring
Educational Background: B.Tech/M.Tech in Computer Science, Machine Learning, AI, or related fields.
Experience: 12+ years of engineering experience, including designing, developing, and deploying AI/ML systems at scale.
ML Expertise:
- Strong in time-series anomaly detection, statistical modeling, and supervised/unsupervised learning.
- Experience building ML models for metrics data (CPU, memory, IOPS, network, etc.) using models like Isolation Forest, Prophet, LSTM, or deep autoencoders.
- Expertise in NLP using ModernBERT or BERT for log classification, clustering, and summarization.
- Experience with LLMs for downstream tasks like summarization, root cause reasoning, or intelligent Q&A.
Engineering Skills: Strong Python background, hands-on with ML libraries (PyTorch, TensorFlow, Scikit-learn), time-series frameworks, and MLOps tools. Familiar with data pipelines and serving models.
Observability Knowledge: Hands-on with logs, metrics, traces, and popular monitoring tools (e.g., Prometheus, Grafana, ELK).
Leadership: Ability to independently drive projects from requirements to delivery, mentor junior engineers, and deliver business impact.
Work Arrangement
Hybrid: This role operates in a hybrid capacity, blending the benefits of remote work with the advantages of in-person collaboration. For most roles, that will mean coming into an office a minimum of 2-3 days per week; however, certain roles and/or teams may require more frequent in-office presence. Additional team-specific guidance and norms will be provided by your manager.