Posted: 1 week ago
On-site
Full Time
Company: Indian / Global Engineering & Manufacturing Organization

Key Skills: Machine Learning (ML), Artificial Intelligence (AI), TensorFlow, Python, PyTorch

Roles and Responsibilities:
- Design, build, and rigorously optimize the complete stack for large-scale model training, fine-tuning, and inference (including data loading, distributed training, and model deployment) to maximize Model FLOPs Utilization (MFU) on compute clusters.
- Collaborate closely with research scientists to translate state-of-the-art models and algorithms into production-grade, high-performance code and scalable infrastructure.
- Implement, integrate, and test advancements from recent research publications and open-source contributions in enterprise-grade systems.
- Profile training workflows to identify and resolve bottlenecks across all layers of the stack, from input pipelines to inference, to improve speed and resource efficiency.
- Contribute to the evaluation and selection of the hardware, software, and cloud platforms that will define the future AI infrastructure stack.
- Use MLOps tools (e.g., MLflow, Weights & Biases) to establish best practices across the entire AI model lifecycle, including development, validation, deployment, and monitoring.
- Maintain extensive documentation of infrastructure architecture, pipelines, and training processes to ensure reproducibility and smooth knowledge transfer.
- Continuously research and implement improvements in large-scale training strategies and data engineering workflows to keep the organization at the cutting edge.
- Demonstrate initiative and ownership in developing rapid prototypes and production-scale systems for AI applications in the energy sector.

Experience Requirement:
- 5-9 years of experience building and optimizing large-scale machine learning infrastructure, including distributed training and data pipelines.
- Proven hands-on expertise with deep learning frameworks such as PyTorch, JAX, or PyTorch Lightning in multi-node GPU environments.
- Experience scaling models trained on large datasets across distributed computing systems.
- Familiarity with writing and optimizing CUDA, Triton, or CUTLASS kernels for performance is preferred.
- Hands-on experience with AI/ML lifecycle management using MLOps frameworks and performance profiling tools.
- Demonstrated collaboration with AI researchers and data scientists to integrate models into production environments.
- A track record of open-source contributions in AI infrastructure or data engineering is a significant plus.

Education: M.E., B.Tech M.Tech (Dual), BCA, B.E., B.Tech, M.Tech, MCA.
MyCareernet