3.0 - 7.0 years
0 Lacs
Hyderabad, Telangana
On-site
The Computer Vision Engineer position in Hyderabad requires someone with 3 to 5 years of experience in developing, implementing, and optimizing deep learning models. As a Deep Learning Engineer, you will drive advanced AI solutions across various industries by leveraging your expertise in neural networks, data processing, and model deployment. Your main responsibilities will include owning product development milestones, ensuring delivery to the architecture, and identifying challenges. You will drive innovation in the product, contribute to successful initiatives, and establish engineering best practices for core product development teams within the company.

In this role, you will develop, port, and optimize computer vision algorithms and data structures on proprietary cores. You will also engage in research and development focused on advanced, product-critical computer vision components such as feature extraction, object tracking, and sensor calibration.

Solid programming skills in Python and C/C++, as well as experience with TensorFlow, PyTorch, ONNX, MXNet, Caffe, OpenCV, Keras, and various neural network frameworks and platforms, are essential. Previous exposure to GPU computing, HPC, cloud services such as AWS/Azure/Google Cloud, and NoSQL databases will be beneficial. You should have hands-on experience in deploying efficient computer vision products, implementing research papers, using Dockerized containers with microservices, and optimizing models for TensorRT. Familiarity with NVIDIA Jetson Nano, TX1, TX2, Xavier NX, AGX Xavier, Raspberry Pi, and other edge devices is required. Understanding of computer vision concepts such as photogrammetry, multi-view geometry, visual SLAM, detection and recognition, and 3D reconstruction is crucial.

You will need to write maintainable, reusable code, apply test-driven principles, and develop high-quality computer vision and machine learning modules. Experience with object detection, tracking, classification, recognition, scene understanding, and deep neural networks is important, and knowledge of image classification, object detection, and semantic segmentation using deep learning algorithms is desirable. You should be able to evaluate and advise on new technologies, vendors, products, and competitors. Initiative, independence, teamwork, and hands-on technical expertise are key attributes for this role. If you are ready to contribute to cutting-edge AI solutions and drive innovation in computer vision, apply now and join our awesome squad.
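For a concrete flavor of the deployment pipeline this posting alludes to, here is a minimal, hedged sketch of exporting a PyTorch model to ONNX, a common precursor to building a TensorRT engine for Jetson-class edge devices. The model choice, input shape, and file name are illustrative assumptions, not requirements from the posting.

```python
# Minimal sketch: export a PyTorch vision model to ONNX for later TensorRT conversion.
# The resnet18 backbone, input size, and output path are placeholders.
import torch
import torchvision

model = torchvision.models.resnet18()          # stand-in for any detection/classification backbone
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)      # assumed NCHW input shape
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",                              # hypothetical output path
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)
```

The exported file could then be fed to TensorRT tooling (for example trtexec) on the target device; the exact optimization steps vary by platform.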
Posted 3 days ago
5.0 - 9.0 years
0 Lacs
Thane, Maharashtra
On-site
You will play a pivotal role in the design and implementation of cutting-edge GPU computers optimized for demanding deep learning, high-performance computing, and computationally intensive workloads. Your expertise will be essential in identifying architectural enhancements and innovative approaches to accelerate our deep learning models. Addressing strategic challenges related to compute, networking, and storage design for large-scale, high-performance workloads will be a key responsibility. Additionally, you will contribute to effective resource utilization in a heterogeneous computing environment, evolve our cloud strategy, perform capacity modeling, and plan for growth across our products and services.

As an architect, you are tasked with translating business requirements for AI/ML algorithms into a comprehensive set of product objectives covering workload scenarios, end-user expectations, compute infrastructure, and execution timelines. This translation should culminate in a plan to operationalize the algorithms efficiently. You will also be responsible for benchmarking and optimizing computer vision algorithms and hardware accelerators against performance and quality KPIs. Your role will involve fine-tuning algorithms for optimal performance on GPU tensor cores and collaborating with cross-functional teams to streamline workflows spanning data curation, training, optimization, and deployment. Providing technical leadership and expertise for project deliverables is a core aspect of this position, along with leading, mentoring, and managing the technical team to ensure successful outcomes. Your contributions will be instrumental in driving innovation and achieving project milestones effectively.

Key Qualifications:
- Possess an MS or PhD in Computer Science, Electrical Engineering, or a related field.
- Demonstrated expertise in deploying complex deep learning architectures.
- Minimum of 5 years of relevant experience in areas such as Machine Learning (with a focus on Deep Neural Networks), DNN adaptation and training, code development for DNN training frameworks (e.g., Caffe, TensorFlow, Torch), numerical analysis, performance analysis, model compression, optimization, and computer architecture.
- Strong proficiency in data structures, algorithms, and C/C++ programming.
- Hands-on experience with PyTorch, TensorRT, cuDNN, GPU computing (CUDA, OpenCL, OpenACC), and HPC (MPI, OpenMP).
- Thorough understanding of container technologies like Docker, Singularity, Shifter, and Charliecloud.
- Proficiency in Python programming, bash scripting, and operating systems including Windows, Ubuntu, and CentOS.
- Excellent communication, collaboration, and problem-solving skills.

Good To Have:
- Practical experience with HPC cluster job schedulers such as Kubernetes, SLURM, and LSF.
- Familiarity with cloud computing architectures.
- Hands-on exposure to Software Defined Networking and HPC cluster networking.
- Working knowledge of cluster configuration management tools like Ansible, Puppet, and Salt.
- Understanding of fast, distributed storage systems and Linux file systems for HPC workloads.

This role offers an exciting opportunity to contribute to cutting-edge technology solutions and make a significant impact in the field of deep learning and high-performance computing. If you are a self-motivated individual with a passion for innovation and a track record of delivering results, we encourage you to apply.
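To make the benchmarking responsibility above concrete, here is a small illustrative sketch, assuming PyTorch on a CUDA-capable GPU and not tied to any particular employer's stack, that measures average inference latency with CUDA events, one of the KPIs such a role would track.

```python
# Illustrative latency benchmark using CUDA events; the model and batch are assumed
# to already live on the GPU, and the iteration counts are arbitrary.
import torch

def benchmark(model, batch, warmup=10, iters=100):
    model.eval()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(warmup):                # warm-up to stabilize clocks and caches
            model(batch)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(batch)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters     # average latency in milliseconds
```

Throughput, memory footprint, and accuracy under reduced precision would typically be tracked alongside latency.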
Posted 1 week ago
4.0 - 8.0 years
0 Lacs
Chennai, Tamil Nadu
On-site
You will be responsible for architecting and leading a team to develop the distributed software infrastructure that powers image computing clusters across the LS division. Your role is crucial in enabling scalable, high-performance platforms that support advanced image processing and AI workloads.

Your key responsibilities will include defining and driving the long-term vision and roadmap for distributed HPC software infrastructure supporting image computing clusters. You will also build, mentor, and grow a high-performing team of software engineers and technical leaders. Collaborating with product, hardware, and algorithm teams to align infrastructure capabilities with evolving image processing and AI requirements will be essential. Additionally, you will oversee the design and implementation of scalable, fault-tolerant distributed systems optimized for hybrid CPU/GPU workloads. You will lead the end-to-end development of image computing platforms, from requirements gathering through deployment and maintenance, using best-in-class project management practices. Delivering robust software platforms and tools that empower engineers to develop, test, and deploy new image processing and deep learning algorithms efficiently will also be part of your role. Furthermore, you will spearhead the integration of traditional image processing and AI/DL techniques into a unified hybrid computing architecture, leveraging modern HPC technologies.

Your qualifications should include a Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related technical field, along with 10+ years of experience in software engineering, including at least 4 years in technical leadership or management roles. You must have a proven track record in building and scaling distributed systems, preferably in HPC or cloud-native environments. Experience with image processing, computer vision, or AI/ML infrastructure is highly desirable. Technically, you should have a deep understanding of distributed computing frameworks and Linux systems programming, proficiency in C++, Python, and/or other systems programming languages, and familiarity with GPU computing and hybrid CPU/GPU architectures. A strong grasp of software development best practices, CI/CD, and DevOps principles is also required. Demonstrated ability to lead and drive functional teams, excellent communication and stakeholder management skills, and a passion for mentoring and developing engineering talent are crucial for this role.

We offer a competitive, family-friendly total rewards package designed to reflect our commitment to an inclusive environment while meeting the diverse needs of our employees. KLA is proud to be an equal opportunity employer. Please be cautious of potentially fraudulent job postings and suspicious recruiting activities, and confirm legitimacy through KLA's Careers website.
Posted 1 week ago
5.0 - 8.0 years
5 - 8 Lacs
Gurgaon, Haryana, India
On-site
NVIDIA has continuously reinvented itself. Our invention of the GPU sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. Today, research in artificial intelligence is booming worldwide, calling for the highly scalable and massively parallel computation horsepower at which NVIDIA GPUs excel. NVIDIA is a learning machine that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life's work: to amplify human creativity and intelligence. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join our diverse team and see how you can make a lasting impact on the world!

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance, and to drive foundational improvements and automation that improve researchers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. Our SRE culture of diversity, intellectual curiosity, problem solving, and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What You'll Be Doing
In this role, you will build and improve our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions. You will also maintain and build deep learning AI/HPC GPU clusters at scale and support our researchers in running their workflows on our clusters, including performance analysis and optimization of deep learning workflows. You will design, implement, and support operational and reliability aspects of large-scale distributed systems with a focus on performance at scale, real-time monitoring, logging, and alerting. Design and implement state-of-the-art GPU compute clusters. Optimize cluster operations for maximum reliability, efficiency, and performance. Drive foundational improvements and automation to enhance researcher productivity. Troubleshoot, diagnose, and root-cause system failures, and isolate the components and failure scenarios while working with internal and external partners. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems. Write and review code, develop documentation and capacity plans, and debug the hardest problems, live, on some of the largest and most complex systems in the world. Implement remediations across the software and hardware stack according to plan, keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters.

What We Need To See
Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience, with a minimum of 5 years of experience designing and operating large-scale compute infrastructure. Proven experience in site reliability engineering for high-performance computing environments, with operational experience on clusters of at least 2K GPUs. Deep understanding of GPU computing and AI infrastructure. Passion for solving complex technical challenges and optimizing system performance. Experience with advanced AI/HPC job schedulers, ideally including Slurm. Working knowledge of cluster configuration management tools such as BCM or Ansible, and infrastructure-level applications such as Kubernetes, Terraform, MySQL, etc. In-depth understanding of container technologies like Docker, Enroot, etc. Experience programming in Python and Bash scripting.

Ways To Stand Out From The Crowd
Interest in crafting, analyzing, and fixing large-scale distributed systems. Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, and InfiniBand with IPoIB and RDMA. Experience with cloud deployment, BCM, and Terraform. Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads. Multi-cloud experience.
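For a purely illustrative sense of the scheduler-driven workflows this role supports, the following hedged Python sketch generates and submits a multi-node GPU batch job through Slurm. The job name, resource counts, and training script are assumptions, not details from the posting.

```python
# Hypothetical sketch: write a Slurm batch script that requests GPUs and submit it.
import subprocess
import textwrap

job_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=train-demo        # assumed job name
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=8
    #SBATCH --gres=gpu:8                 # request 8 GPUs per node
    #SBATCH --time=04:00:00
    srun python train.py                 # hypothetical training entry point
""")

with open("job.sbatch", "w") as f:
    f.write(job_script)

result = subprocess.run(["sbatch", "job.sbatch"], capture_output=True, text=True, check=True)
print(result.stdout.strip())             # e.g. "Submitted batch job 12345"
```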
Posted 1 week ago
5.0 - 8.0 years
5 - 8 Lacs
Pune, Maharashtra, India
On-site
NVIDIA has continuously reinvented itself. Our invention of the GPU sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. Today, research in artificial intelligence is booming worldwide, calling for the highly scalable and massively parallel computation horsepower at which NVIDIA GPUs excel. NVIDIA is a learning machine that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life's work: to amplify human creativity and intelligence. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join our diverse team and see how you can make a lasting impact on the world!

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance, and to drive foundational improvements and automation that improve researchers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. Our SRE culture of diversity, intellectual curiosity, problem solving, and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What You'll Be Doing
In this role, you will build and improve our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions. You will also maintain and build deep learning AI/HPC GPU clusters at scale and support our researchers in running their workflows on our clusters, including performance analysis and optimization of deep learning workflows. You will design, implement, and support operational and reliability aspects of large-scale distributed systems with a focus on performance at scale, real-time monitoring, logging, and alerting. Design and implement state-of-the-art GPU compute clusters. Optimize cluster operations for maximum reliability, efficiency, and performance. Drive foundational improvements and automation to enhance researcher productivity. Troubleshoot, diagnose, and root-cause system failures, and isolate the components and failure scenarios while working with internal and external partners. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems. Write and review code, develop documentation and capacity plans, and debug the hardest problems, live, on some of the largest and most complex systems in the world. Implement remediations across the software and hardware stack according to plan, keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters.

What We Need To See
Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience, with a minimum of 5 years of experience designing and operating large-scale compute infrastructure. Proven experience in site reliability engineering for high-performance computing environments, with operational experience on clusters of at least 2K GPUs. Deep understanding of GPU computing and AI infrastructure. Passion for solving complex technical challenges and optimizing system performance. Experience with advanced AI/HPC job schedulers, ideally including Slurm. Working knowledge of cluster configuration management tools such as BCM or Ansible, and infrastructure-level applications such as Kubernetes, Terraform, MySQL, etc. In-depth understanding of container technologies like Docker, Enroot, etc. Experience programming in Python and Bash scripting.

Ways To Stand Out From The Crowd
Interest in crafting, analyzing, and fixing large-scale distributed systems. Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, and InfiniBand with IPoIB and RDMA. Experience with cloud deployment, BCM, and Terraform. Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads. Multi-cloud experience.
Posted 1 week ago
5.0 - 8.0 years
5 - 8 Lacs
Hyderabad, Telangana, India
On-site
NVIDIA has continuously reinvented itself. Our invention of the GPU sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. Today, research in artificial intelligence is booming worldwide, calling for the highly scalable and massively parallel computation horsepower at which NVIDIA GPUs excel. NVIDIA is a learning machine that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life's work: to amplify human creativity and intelligence. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join our diverse team and see how you can make a lasting impact on the world!

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance, and to drive foundational improvements and automation that improve researchers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. Our SRE culture of diversity, intellectual curiosity, problem solving, and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What You'll Be Doing
In this role, you will build and improve our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions. You will also maintain and build deep learning AI/HPC GPU clusters at scale and support our researchers in running their workflows on our clusters, including performance analysis and optimization of deep learning workflows. You will design, implement, and support operational and reliability aspects of large-scale distributed systems with a focus on performance at scale, real-time monitoring, logging, and alerting. Design and implement state-of-the-art GPU compute clusters. Optimize cluster operations for maximum reliability, efficiency, and performance. Drive foundational improvements and automation to enhance researcher productivity. Troubleshoot, diagnose, and root-cause system failures, and isolate the components and failure scenarios while working with internal and external partners. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems. Write and review code, develop documentation and capacity plans, and debug the hardest problems, live, on some of the largest and most complex systems in the world. Implement remediations across the software and hardware stack according to plan, keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters.

What We Need To See
Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience, with a minimum of 5 years of experience designing and operating large-scale compute infrastructure. Proven experience in site reliability engineering for high-performance computing environments, with operational experience on clusters of at least 2K GPUs. Deep understanding of GPU computing and AI infrastructure. Passion for solving complex technical challenges and optimizing system performance. Experience with advanced AI/HPC job schedulers, ideally including Slurm. Working knowledge of cluster configuration management tools such as BCM or Ansible, and infrastructure-level applications such as Kubernetes, Terraform, MySQL, etc. In-depth understanding of container technologies like Docker, Enroot, etc. Experience programming in Python and Bash scripting.

Ways To Stand Out From The Crowd
Interest in crafting, analyzing, and fixing large-scale distributed systems. Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, and InfiniBand with IPoIB and RDMA. Experience with cloud deployment, BCM, and Terraform. Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads. Multi-cloud experience.
Posted 1 week ago
5.0 - 8.0 years
5 - 8 Lacs
Bengaluru, Karnataka, India
On-site
NVIDIA has continuously reinvented itself. Our invention of the GPU sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. Today, research in artificial intelligence is booming worldwide, calling for the highly scalable and massively parallel computation horsepower at which NVIDIA GPUs excel. NVIDIA is a learning machine that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life's work: to amplify human creativity and intelligence. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join our diverse team and see how you can make a lasting impact on the world!

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance, and to drive foundational improvements and automation that improve researchers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. Our SRE culture of diversity, intellectual curiosity, problem solving, and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What You'll Be Doing
In this role, you will build and improve our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions. You will also maintain and build deep learning AI/HPC GPU clusters at scale and support our researchers in running their workflows on our clusters, including performance analysis and optimization of deep learning workflows. You will design, implement, and support operational and reliability aspects of large-scale distributed systems with a focus on performance at scale, real-time monitoring, logging, and alerting. Design and implement state-of-the-art GPU compute clusters. Optimize cluster operations for maximum reliability, efficiency, and performance. Drive foundational improvements and automation to enhance researcher productivity. Troubleshoot, diagnose, and root-cause system failures, and isolate the components and failure scenarios while working with internal and external partners. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems. Write and review code, develop documentation and capacity plans, and debug the hardest problems, live, on some of the largest and most complex systems in the world. Implement remediations across the software and hardware stack according to plan, keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters.

What We Need To See
Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience, with a minimum of 5 years of experience designing and operating large-scale compute infrastructure. Proven experience in site reliability engineering for high-performance computing environments, with operational experience on clusters of at least 2K GPUs. Deep understanding of GPU computing and AI infrastructure. Passion for solving complex technical challenges and optimizing system performance. Experience with advanced AI/HPC job schedulers, ideally including Slurm. Working knowledge of cluster configuration management tools such as BCM or Ansible, and infrastructure-level applications such as Kubernetes, Terraform, MySQL, etc. In-depth understanding of container technologies like Docker, Enroot, etc. Experience programming in Python and Bash scripting.

Ways To Stand Out From The Crowd
Interest in crafting, analyzing, and fixing large-scale distributed systems. Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, and InfiniBand with IPoIB and RDMA. Experience with cloud deployment, BCM, and Terraform. Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads. Multi-cloud experience.
Posted 1 week ago
2.0 - 6.0 years
0 Lacs
Thiruvananthapuram, Kerala
On-site
As a Computer Vision Engineer, you will play a crucial role in transforming business challenges into data-driven machine learning solutions. Your primary responsibility will be to design, develop, and implement computer vision algorithms and models for image and video analysis, object detection, recognition, and tracking. Collaborating with cross-functional teams, you will translate business requirements into technical specifications for computer vision solutions. Staying updated with the latest advancements in computer vision and deep learning, you will identify opportunities for innovation and optimization.

To excel in this role, you should have at least 2 years of experience in computer vision and/or deep learning for object detection and tracking, as well as semantic or instance segmentation, in academic or industrial domains. Proficiency in Python and related packages such as NumPy, scikit-image, PIL, OpenCV, Matplotlib, and Seaborn is essential. Additionally, you must possess a strong foundation in data structures and algorithms in Python or C++, along with experience in training models through GPU computing using NVIDIA CUDA or on cloud platforms.

Your responsibilities will also include working with large datasets, applying data preprocessing techniques, and optimizing computer vision models for efficiency, accuracy, and real-time performance. Hands-on experience with machine/deep learning frameworks like TensorFlow, Keras, and PyTorch is required, along with expertise in identifying data imbalance, data compatibility, data privacy, data normalization, and data encoding issues. You will be involved in model selection, evaluation, training, validation, and testing, exploring different methods and scenarios for effective outcomes.

In this role, you will have the opportunity to work in a dynamic environment with a focus on innovation and collaboration. The position offers a 5-day working week with flexible timings. While we are not currently hiring for the Computer Vision Engineer role, please check back later for potential opportunities to join our team.
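As a small, hedged illustration of the preprocessing work described above (library choices, target resolution, and normalization constants are assumptions, not requirements from the posting), a typical image-loading and normalization step might look like this:

```python
# Illustrative preprocessing: load an image with OpenCV and prepare an NCHW float batch.
import cv2
import numpy as np

def preprocess(path, size=(640, 640)):
    img = cv2.imread(path)                      # BGR uint8 image from disk
    if img is None:
        raise FileNotFoundError(path)
    img = cv2.resize(img, size)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # convert to RGB channel order
    img = img.astype(np.float32) / 255.0        # scale to [0, 1]
    img = (img - np.array([0.485, 0.456, 0.406])) / np.array([0.229, 0.224, 0.225])
    return np.transpose(img, (2, 0, 1))[None]   # NCHW batch of one
```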
Posted 1 week ago
3.0 - 7.0 years
0 Lacs
Karnataka
On-site
As a GPU Kernel Developer specializing in AI models at AMD, your primary responsibility is to develop high-performance GPU kernels for cutting-edge and upcoming GPU hardware. You will collaborate with a team of industry experts, leveraging the latest hardware and software technologies to drive innovation in the field.

To excel in this role, you must possess significant experience in GPU kernel development and optimization for AI/HPC applications. Your expertise should include a deep understanding of GPU computing and hardware architecture, along with proficiency in HIP, CUDA, OpenCL, and Triton development. Effective communication skills are essential, as you will be required to work within a team environment and convey complex technical concepts to both technical and non-technical audiences.

Key responsibilities include developing high-performance GPU kernels for essential AI operators on AMD GPUs, optimizing GPU code through structured methodologies, and supporting critical workloads in the NLP/LLM, Recommendation, Vision, and Audio domains. You will collaborate closely with system-level performance architects, GPU hardware specialists, and various validation and marketing teams to analyze and enhance training and inference processes for AI applications. Furthermore, you will engage with open-source framework maintainers to align with their requirements and integrate code changes upstream, debug, maintain, and optimize GPU kernels, and drive AI operator performance improvements. Your expertise in software engineering best practices will be crucial in ensuring the quality and efficiency of the developed solutions.

Preferred qualifications for this role include knowledge of GPU computing technologies such as HIP, CUDA, OpenCL, and Triton, experience in optimizing GPU kernels, proficiency with profiling and debugging tools, a strong foundation in GPU hardware, and excellent programming skills in C/C++/Python. Additionally, a Master's or PhD in Computer Science, Computer Engineering, or a related field is preferred.

Join us at AMD, where we are dedicated to pushing the boundaries of innovation and solving the world's most significant challenges through transformative technology.
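To give a concrete, purely illustrative sense of the kernel-level work described, here is a minimal Triton vector-add kernel in Python. It assumes PyTorch and Triton are installed and is a textbook-style sketch rather than anything resembling AMD's production kernels.

```python
# Minimal Triton kernel: element-wise addition of two GPU tensors.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                           # one program per block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                           # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# x = torch.rand(98432, device="cuda")   # the "cuda" device string also maps to ROCm builds of PyTorch
# y = torch.rand(98432, device="cuda")
# print(add(x, y))
```

Production AI operators such as GEMM or attention involve far more elaborate tiling, memory-hierarchy, and scheduling decisions, but they build on the same grid/block structure.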
Posted 2 weeks ago
7.0 - 11.0 years
1 - 24 Lacs
Bengaluru, Karnataka, India
On-site
Job description
About The Role
We are looking for an experienced Software Engineer eager to work on 3D driver development for games, workstation applications, and media. As a GPU Software Development Engineer, you will play a crucial role in developing and optimizing software solutions for Intel's cutting-edge GPU technologies. You will work closely with hardware engineers, software developers, and other cross-functional teams to deliver high-performance and innovative GPU solutions.

Responsibilities:
Design, develop, and optimize GPU software solutions for Intel's GPU products. Collaborate with hardware engineers to ensure seamless integration of software and hardware components. Conduct performance analysis and optimization to ensure high efficiency and performance of GPU software. Debug and resolve software issues related to GPU functionality. Participate in code reviews and provide constructive feedback to team members. Stay up-to-date with the latest advancements in GPU technologies and industry trends. Develop and maintain GPU drivers, libraries, and tools.

Qualifications
5+ years of programming and debugging experience in C/C++. Knowledge of graphics APIs such as DirectX, Vulkan, and OpenGL. Strong analytical and problem-solving skills, with the ability to work methodically on complex issues. Strong verbal and written communication skills to collaborate effectively with team members and stakeholders. Master's degree in Software Engineering, Computer Engineering, Computer Science, or a related field.

Nice to have:
Familiarity with scripting languages (e.g., Python). Familiarity with GPU driver development and debugging tools (e.g., Visual Studio, WinDbg, GPUView). Understanding of display graphics drivers, media, and related areas. Expertise in the analysis and optimization of GPU and CPU performance. Solid understanding of GPU architecture and parallel computing concepts.
Posted 3 weeks ago
2.0 - 7.0 years
2 - 24 Lacs
Bengaluru, Karnataka, India
On-site
Job description
We are looking for exceptionally smart people who believe that AI will change the world and would like to join our exciting journey: https://habana.ai/

You will be responsible for pre-silicon modeling of AI hardware accelerator architecture, and for enabling the AI software stack on simulation models. This is an opportunity to learn and contribute to the complete AI software stack and to future GPU/AI architecture.

Qualifications
Computer Science/software-focused graduates with a can-do attitude; this role is for the 2-7 years' experience range. Strong C++ programming skills, computer architecture knowledge, and preferably Python knowledge. Experience and knowledge of SystemC, architecture simulation models, pre-silicon virtual platform development, or GPU architecture is a plus. We're working on Intel Gaudi AI accelerators and AI GPUs in the niche domain of simulating super-complex architecture for hardware-software co-design. No AI/ML experience required, only strong Computer Science fundamentals. The team excels in diversity, innovation, and technical leadership.
Posted 3 weeks ago
2.0 - 6.0 years
0 Lacs
Noida, Uttar Pradesh
On-site
You should have a Bachelor's degree in Computer Science, Electrical Engineering, or a related field. A strong understanding of computer vision fundamentals, including image processing and feature extraction, is required. Experience with machine learning frameworks like OpenCV, TensorFlow, PyTorch, or Keras is essential. Proficiency in Python and relevant libraries such as NumPy, SciPy, and Matplotlib is expected. Basic knowledge of deep learning models like CNNs, RNNs, and transformers is necessary. Familiarity with model architectures such as ResNet, YOLO, SSD, and U-Net is preferred. You should have an understanding of algorithms, data structures, and relevant mathematical concepts. Knowledge of cloud platforms like AWS and GCP, as well as edge computing frameworks, is a plus. Experience with GPU computing using CUDA and cuDNN is beneficial. Previous hands-on experience in computer vision projects or open-source contributions would be desirable for this role.
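For illustration only (the layer sizes and class count below are arbitrary assumptions, not from the posting), here is a minimal PyTorch CNN of the kind that underlies the ResNet/YOLO-style architectures listed above:

```python
# Tiny illustrative CNN classifier; not a production architecture.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                     # global average pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.classifier(x)

logits = TinyCNN()(torch.randn(2, 3, 64, 64))            # batch of two RGB images
print(logits.shape)                                      # torch.Size([2, 10])
```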
Posted 3 weeks ago
2.0 - 7.0 years
4 - 8 Lacs
Bengaluru, Karnataka, India
On-site
THE ROLE:
AMD is looking for a GPU kernel development engineer who is talented in developing high-performance kernels for state-of-the-art and upcoming GPU hardware. You will be a member of a core team of incredibly talented industry specialists and will work with the very latest hardware and software technology.

THE PERSON:
Experienced in GPU kernel development and optimization for AI/HPC applications. Strong technical and analytical skills in GPU computing and hardware architecture, with a deep understanding of HIP/CUDA/OpenCL/Triton development. Ability to work as part of a team, deliver to project scope, and communicate to a technical/non-technical audience.

KEY RESPONSIBILITIES:
Develop high-performance GPU kernels for key AI operators on AMD GPUs. Optimize GPU code using a structured and disciplined methodology: profiling to identify gaps, roofline analysis on hardware, identifying the key set of optimizations, establishing uplift and line-of-sight, and prototyping and developing optimizations. Support mission-critical workloads in NLP/LLM, Recommendation, Vision, and Audio. Collaborate and interact with system-level performance architects, GPU hardware specialists, power/clock tuning teams, performance validation teams, and performance marketing teams to analyze and optimize training and inference for AI. Work with open-source framework maintainers to understand their requirements and have your code changes integrated upstream. Debug, maintain, and optimize GPU kernels, and understand and drive AI operator performance (GEMM, Attention, distributed scale-up/out communication, etc.). Apply your knowledge of software engineering best practices.

PREFERRED EXPERIENCE:
Knowledge of GPU computing (HIP, CUDA, OpenCL, Triton). Knowledge and experience in optimizing GPU kernels. Expertise in using profiling and debugging tools. Core understanding of GPU hardware. Excellent C/C++/Python programming and software design skills, including debugging, performance analysis, and test design.

ACADEMIC CREDENTIALS:
Master's or PhD or equivalent experience in Computer Science, Computer Engineering, or a related field
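As a hedged illustration of the profiling-first, roofline-style methodology described above (PyTorch with a GPU backend is assumed; the matrix size, dtype, and iteration count are arbitrary), measuring achieved GEMM throughput is often the first data point collected:

```python
# Rough sketch: estimate achieved GEMM throughput in TFLOP/s.
import torch

def gemm_tflops(n=4096, iters=20, dtype=torch.float16, device="cuda"):
    a = torch.randn(n, n, dtype=dtype, device=device)
    b = torch.randn(n, n, dtype=dtype, device=device)
    torch.matmul(a, b)                       # warm-up
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000 / iters
    return 2 * n**3 / seconds / 1e12         # a square GEMM performs 2*N^3 FLOPs

print(f"~{gemm_tflops():.1f} TFLOP/s achieved")
```

Comparing the achieved figure against the hardware's peak for that dtype indicates how much headroom a kernel optimization effort has.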
Posted 3 weeks ago
13.0 - 17.0 years
0 Lacs
Sonipat, Haryana
On-site
As a Software Engineer & Instructor specializing in Computer Architecture at Newton School of Technology, located in Sonipat, you will play a pivotal role in redefining how engineers are trained by bridging the gap between academia and the real world. With a focus on industry-aligned, project-based curriculum development, you will have a unique opportunity to kickstart your academic career while creating a meaningful impact.

Your responsibilities will include delivering project-based, hands-on sessions in Computer Architecture covering areas such as ISA design, processor fundamentals, memory systems, pipelining, and performance optimization. Additionally, you will collaborate on updating and evolving the curriculum to ensure its industry relevance, provide mentorship to students working on architecture-focused capstone projects, offer career guidance in system-level software, embedded systems, and hardware optimization, and work closely with experienced faculty and industry professionals to enhance the learning experience.

To excel in this role, you should hold a Bachelor's or Master's degree in Computer Engineering, Computer Science, Electrical Engineering, or related fields, along with 13 years of practical experience in system-level software, embedded development, or performance-critical programming. A strong understanding of C/C++, assembly programming fundamentals, memory systems, pipelining, ISA concepts, and OS-hardware interactions is essential. Basic knowledge of architecture simulators or profiling tools is also required.

While exposure to Verilog/VHDL or hardware-level programming, interest or experience in compiler design, firmware, or GPU computing, and knowledge of Linux internals or kernel-level programming are considered advantageous, what truly sets the ideal candidate apart is a passion for teaching and mentoring, clear communication skills, a collaborative mindset, and enthusiasm for working in a dynamic academic environment.

In return, you will receive competitive compensation, access to state-of-the-art labs and tools, and the opportunity to contribute to pioneering, practice-led tech education within a supportive, impact-driven academic culture. Join us at Newton School of Technology to embark on your academic journey and help shape the engineers of tomorrow.
Posted 1 month ago
13.0 - 17.0 years
0 Lacs
Sonipat, Haryana
On-site
As a Software Engineer & Instructor in Computer Architecture at Newton School of Technology, you will be based in Sonipat at Rishihood University. With 13 years of experience, you will play a crucial role in the mission to redefine engineer training by bridging the gap between academia and the real world.

Your responsibilities will include teaching core concepts in Computer Architecture, such as ISA design, processor fundamentals, memory systems, pipelining, and performance optimization, through project-based, hands-on sessions. You will also contribute to updating the industry-aligned curriculum, mentor students in architecture-focused projects, provide career guidance in system-level software and hardware optimization, and collaborate with experienced faculty and industry professionals to deliver impactful learning experiences.

To excel in this role, you should hold a Bachelor's or Master's degree in Computer Engineering, Computer Science, Electrical Engineering, or related fields. Your solid practical experience in system-level software, embedded development, or performance-critical programming, along with a strong understanding of C/C++, assembly programming, memory systems, pipelining, ISA concepts, and OS-hardware interactions, will be valuable assets. Basic knowledge of architecture simulators or profiling tools is essential. Exposure to Verilog/VHDL, interest in compiler design, firmware, GPU computing, Linux internals, or kernel-level programming will be advantageous.

The ideal candidate will have a passion for teaching, clear communication skills, a collaborative mindset, and enthusiasm for working in a dynamic academic environment. In return, you will receive competitive compensation, access to state-of-the-art labs and tools, the opportunity to contribute to practice-led tech education, and be part of a supportive, impact-driven academic culture. Join us at Newton School of Technology and embark on a journey to shape the engineers of tomorrow.
Posted 1 month ago
3.0 - 7.0 years
0 Lacs
Karnataka
On-site
You should have proven experience as a Linux Systems Administrator, with a focus on HPC environments. Your understanding of Linux operating systems such as CentOS, Ubuntu, and Red Hat should be strong, and you should have intermediate knowledge of the SLURM resource scheduler. Hands-on experience with AWS services related to HPC, such as EC2, S3, FSx for Lustre, AWS Batch, and AWS ParallelCluster, is required. Familiarity with parallel file systems like Lustre and GPFS and with network storage solutions is essential. Knowledge of GPU computing and experience working with GPU-enabled HPC systems on AWS is a plus. Experience with configuration management tools such as Ansible, Puppet, and Chef is desired. Moreover, experience with cloud-based HPC solutions and hybrid HPC environments will be beneficial for this role.
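As a small, hedged example of the day-to-day SLURM administration this role involves (it assumes the Slurm CLI is installed and on PATH; the format string is one common choice, not a site requirement), a quick node-state summary can be scripted as follows:

```python
# Summarize Slurm node states, one line per node, using sinfo.
import subprocess
from collections import Counter

out = subprocess.run(
    ["sinfo", "--Node", "--noheader", "--format=%N %T"],  # node name and state per line
    capture_output=True, text=True, check=True,
).stdout

states = Counter(line.split()[1] for line in out.splitlines() if line.strip())
for state, count in states.items():
    print(f"{state}: {count}")
```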
Posted 1 month ago
10.0 - 20.0 years
40 - 50 Lacs
Bengaluru
Work from Office
Lead HPC delivery teams, define strategy, mentor staff, align capacity with business goals, and manage performance.
Required Candidate profile: Strong leadership and technical experience in HPC, with the ability to manage teams and drive innovation.
Posted 1 month ago
7.0 - 12.0 years
27 - 37 Lacs
Bengaluru
Work from Office
Develop and optimize HPC applications using MPI, OpenMP, CUDA, and cloud platforms; collaborate across R&D teams.
Required Candidate profile: Experienced in scientific computing, parallel programming, performance engineering, and software design.
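Purely as an illustration of the MPI programming model this role centers on (shown with mpi4py for brevity; production HPC codes are more often C/C++ or Fortran), a minimal all-reduce looks like this:

```python
# Minimal mpi4py sketch: each rank contributes a value, all ranks receive the sum.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local = float(rank + 1)                      # each rank contributes one value
total = comm.allreduce(local, op=MPI.SUM)    # collective sum across all ranks

if rank == 0:
    print(f"sum over {size} ranks = {total}")
```

It would be launched with a runner such as `mpirun -n 4 python allreduce_demo.py` (the script name is hypothetical).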
Posted 1 month ago
1.0 - 4.0 years
1 - 4 Lacs
Hyderabad, Telangana, India
On-site
Develop and train deep learning models for various use cases. Optimize model performance and ensure scalability. Collaborate with the data science and engineering teams to integrate deep learning models into production systems.
Required Qualifications: 3+ years of experience in deep learning and machine learning. Expertise in deep learning frameworks such as TensorFlow, PyTorch, or Keras. Strong programming skills in Python and experience with GPU computing.
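As a minimal, hedged illustration of the training work described (the linear model, batch shapes, and hyperparameters are arbitrary placeholders, and the posting does not prescribe a framework), a single PyTorch training step looks like this:

```python
# One illustrative optimization step on a dummy batch.
import torch
import torch.nn as nn

model = nn.Linear(128, 2)                                  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(32, 128), torch.randint(0, 2, (32,))    # dummy batch
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")
```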
Posted 1 month ago
9.0 - 12.0 years
11 - 15 Lacs
Bengaluru
Work from Office
This opportunity is ideal for those passionate about enhancing IT operational services. As a Senior IT Operations Engineer - HPC, you will be accountable for the end-to-end operations of digital landscapes and daily application services. This role will not only develop your skills but also help IDT drive key shifts to boost Shell's competitiveness and adaptability, especially as the energy sector transitions to cleaner energy forms. You will be part of the HPC team under Subsurface & Wells Service and Operations (SOM), which focuses on delivering differentiated IT services securely, reliably, and affordably, enabling business value in collaboration with Projects & Technology at Shell.

What you'll be doing
End-to-end responsibility for the operations of the HPC landscape and delivery of day-to-day end-to-end application services according to the agreed Service Levels and/or Operate Level Agreements. Act as the day-to-day service integrator (end-to-end) to ensure application landscapes remain compliant. Drive incident/problem resolution by assisting in key operational activities in terms of delivery, fixes, and supportability with operations staff and/or suppliers, and ensure that regulatory and compliance controls are embedded in landscape operations and assist with evidence collection. Carry out operational readiness activities for enhancements and project solutions, including analyzing and understanding business functional and non-functional requirements. Comply with the relevant information security, IT controls, and legal and regulatory requirements. Identify systemic issues on landscapes, supporting the drive towards continuous improvement.

What you bring
9-12 years of total experience in IT Service Operations. At least 8 years of experience in HPC/supercomputing. Strong skills in Linux environments, with preferred experience in storage solutions and InfiniBand. Proven ability to work within interconnected enterprise landscapes. Experience in cloud, GPU computing, and networking technology is advantageous. ITIL certification is a plus. Excellent communication, influencing, negotiation, and presentation skills across all levels of the Business and IT hierarchy. Demonstrated ability to deliver SLA commitments through collaboration across organizational boundaries, including ecosystem partners (suppliers, IT team members, business leaders, and team members). Strong interpersonal skills and confidence in engaging with Business stakeholders. Good understanding of the software engineering life cycle for development and the concepts and practices required to implement effective information systems.

Progress as a person as we work on the energy transition together. Continuously grow the transferable skills you need to get ahead. Work at the forefront of technology, trends, and practices. Collaborate with experienced colleagues with unique expertise. Achieve your balance in a values-led culture that encourages you to be the best version of yourself. Benefit from flexible working hours and the possibility of remote/mobile working. Perform at your best with a competitive starting salary and annual performance-related salary increase; our pay and benefits packages are considered to be among the best in the world. Take advantage of paid parental leave, including for non-birthing parents. Join an organisation working to become one of the most diverse and inclusive in the world. We strongly encourage applicants of all genders, ages, ethnicities, cultures, abilities, sexual orientation, and life experiences to apply. Grow as you progress through diverse career opportunities in national and international teams. Gain access to a wide range of training and development programmes.
Posted 2 months ago
5.0 - 8.0 years
6 - 10 Lacs
Bengaluru
Work from Office
What We Expect
4+ years of experience in C++ development, specializing in high-performance, low-latency systems. Deep expertise in modern C++ (C++14/17/20), multithreading, and concurrency. Strong Qt development experience for building real-time, high-performance trading UIs. Experience building ultra-fast order execution engines, market data feeds, and real-time risk management tools. Strong understanding of networking protocols (TCP/IP, UDP, FIX) and interprocess communication (IPC, shared memory, message queues). Hands-on experience with latency optimization, performance tuning, and profiling tools (perf, Valgrind, gprof, etc.). Proficiency in memory management, lock-free programming, and CPU cache optimization. A deep understanding of exchange connectivity, order matching engines, and algorithmic trading systems. A hacker mentality: you love solving problems that seem impossible.

What You Will Do
Architect, develop, and optimize ultra-low-latency C++ trading applications that handle millions of transactions per second. Build high-performance market data processing solutions with microsecond-level latencies. Develop real-time, intuitive, and high-speed trading interfaces using Qt. Work on exchange connectivity, FIX protocol integrations, and risk management systems. Profile and optimize code to achieve maximum throughput and minimal latency. Solve some of the hardest engineering problems in fintech alongside an elite team. Experiment with new technologies to stay ahead of the competition. Own your work end to end, from concept to deployment, pushing the limits of what's possible.

Must-Have Skills
4+ years of experience in C++ development, specializing in high-performance, low-latency systems. Deep expertise in modern C++ (C++14/17/20), multithreading, and concurrency. Strong Qt development experience for building real-time, high-performance trading UIs. Experience building ultra-fast order execution engines, market data feeds, and real-time risk management tools. Strong understanding of networking protocols (TCP/IP, UDP, FIX) and interprocess communication (IPC, shared memory, message queues). Hands-on experience with latency optimization, performance tuning, and profiling tools (perf, Valgrind, gprof, etc.).

Nice-to-Have Skills
Experience in high-frequency trading (HFT), market-making, or ultra-low-latency environments. Knowledge of exchange matching algorithms, order routing strategies, and market microstructure. Contributions to open-source C++ and Qt projects or performance-critical software. Expertise in hardware acceleration (FPGA, SIMD, AVX, GPU computing). Familiarity with cloud-based trading infrastructure and hybrid on-prem/cloud systems.
Posted 2 months ago