Posted: 2 weeks ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Contractual

Job Description

Overview:

Seeking an engineer to build and optimize high-throughput, low-latency LLM inference infrastructure using open-source models (Qwen, LLaMA, Mixtral) on multi-GPU systems (A100/H100). You’ll own performance tuning, model hosting, routing logic, speculative decoding, and cost-efficiency tooling.

Must-Have Skills:

- Deep experience with vLLM, tensor/pipeline parallelism, and KV cache management
- Strong grasp of CUDA-level inference bottlenecks, FlashAttention-2, and quantization
- Familiarity with FP8, INT4, and speculative decoding (e.g., TwinPilots, PowerInfer)
- Proven ability to scale LLMs across multi-GPU nodes (TP, DDP, inference routing)
- Python (systems-level), containerized deployments (Docker, GCP/AWS), load testing (Locust)

Bonus:

- Experience with any-to-any model routing (e.g., text2sql, speech2text)
- Exposure to LangGraph, Triton kernels, or custom inference engines
- Has tuned models to <$0.50 per million tokens of inference at scale

Highlight: Very good rate card for the best candidate fit.
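Illustrative sketches (not part of the original posting): the two snippets below show, in minimal form, the kind of stack the description names. The model name (Qwen/Qwen2.5-7B-Instruct), the tensor-parallel degree of 4, the host/port, and all sampling values are assumptions for illustration, not requirements taken from the role.

A minimal vLLM serving sketch combining tensor parallelism, FP8 quantization, and KV-cache-aware settings:

# Sketch only: model choice, TP degree, and sampling values are assumed,
# not taken from the posting (which names Qwen, LLaMA, and Mixtral).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # assumed model pick
    tensor_parallel_size=4,            # shard weights across 4 GPUs (TP)
    gpu_memory_utilization=0.90,       # leave headroom for the paged KV cache
    enable_prefix_caching=True,        # reuse KV cache across shared prompt prefixes
    quantization="fp8",                # FP8 weights, suited to H100-class hardware
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV cache paging in one paragraph."], params)
print(outputs[0].outputs[0].text)

And a hedged Locust load-test sketch against an OpenAI-compatible completions endpoint, such as the one vLLM's server exposes (host, model name, and prompt are placeholders):

# Run with: locust -f loadtest.py --host http://localhost:8000
from locust import HttpUser, task, between

class InferenceUser(HttpUser):
    wait_time = between(0.5, 2.0)  # simulated think time between requests

    @task
    def complete(self):
        self.client.post(
            "/v1/completions",
            json={
                "model": "Qwen/Qwen2.5-7B-Instruct",  # placeholder model name
                "prompt": "Summarize tensor parallelism in two sentences.",
                "max_tokens": 128,
            },
        )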
