GPU Engineer

5 - 10 years

6 - 10 Lacs

Posted: 4 days ago | Platform: Naukri

Work Mode

Remote

Job Type

Full Time

Job Description

Role & responsibilities

Job Summary

We are seeking a highly skilled GPU Infrastructure Engineer to join our team. This role focuses on the design, implementation, and management of enterprise network and cloud-based infrastructure to support evolving Azure cloud needs. The ideal candidate will have a strong background in software, network, or systems engineering, along with hands-on experience in managing large-scale cloud and data center operations.

Responsibilities

  • Respond to incidents during regular on-call rotations and resolve issues efficiently to minimize downtime.
  • Design and plan scalable GPU infrastructure solutions to meet organizational capacity and performance needs.
  • Collaborate with cross-functional teams to define and implement GPU infrastructure architecture that aligns with business objectives.
  • Evaluate GPU technologies and recommend the best hardware and software configurations.
  • Configure and deploy GPU servers, including installation and setup of hardware, software, and networking components.
  • Coordinate with vendors for procurement and installation of GPUs and related infrastructure.
  • Implement and manage GPU clustering setups for compute-intensive tasks.
  • Utilize monitoring tools to assess GPU performance metrics and system health.
  • Conduct benchmarking tests and analyze the results to identify performance bottlenecks.
  • Optimize workload distribution across GPU resources to ensure maximum efficiency.
  • Provide expert troubleshooting support to help team members report and resolve GPU-related issues.
  • Maintain incident response protocols to address hardware and software failures swiftly and effectively.
  • Develop FAQs and knowledge base articles to streamline support processes for internal users.
  • Infrastructure maintenance: schedule and perform routine maintenance, including updates to software, firmware, and drivers related to GPU systems.
  • Plan and execute capacity upgrades and expansions as needed, ensuring minimal disruption to services.
  • Conduct post-mortem analyses on significant incidents to improve overall system reliability.
  • Write scripts for automation of deployment, configuration management, and system monitoring tasks (e.g., Python, Bash).
  • Develop tools that increase productivity for engineering and data science teams using GPUs.
  • Implement Infrastructure as Code (IaC) practices for efficient and repeatable deployments.

Requirements

  • Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.

Technical Experience:

  • Proven expertise in software engineering, network engineering, or systems administration.
  • Hands-on experience with managing and debugging cloud backend server and networking infrastructure and services.
  • Strong understanding of enterprise network and cloud-based architectures, including experience working with Cisco and Azure.
  • Experience with cloud platforms providing GPU services (e.g., AWS, Google Cloud, Azure).
  • Understanding of virtualization and container technologies (e.g., Docker, Kubernetes) and server orchestration tools.
  • Knowledge of network configurations and storage solutions used in GPU environments.
  • Strong understanding of GPU architectures (NVIDIA CUDA, AMD ROCm, etc.).
  • Experience with AI/ML workloads, HPC, or rendering applications.
  • Familiarity with PCIe, memory subsystems (DDR, HBM), and high-speed I/O.
  • Understanding of Azure Pipelines and Azure DevOps.
  • Demonstrated knowledge in deploying servers and network infrastructure equipment at scale.

Specialized Skills:

  • Experience working with GPU hardware or related system engineering.
  • Experience with:
      • Data center architecture and cloud infrastructure.
      • Network infrastructure design and management in hybrid environments.
  • Certifications in relevant technologies such as:
      • Cisco (e.g., CCNA/CCNP).
      • AZ-900 (Mandatory), AZ-104 (Optional).
      • OCI Foundations Associate (Optional).
      • ITIL or equivalent certifications (Optional).

Aptly Tech

Software Development

Tech City
