Posted: 2 weeks ago
On-site | Contractual
Overview: We are seeking an engineer to build and optimize high-throughput, low-latency LLM inference infrastructure using open-source models (Qwen, LLaMA, Mixtral) on multi-GPU systems (A100/H100). You will own performance tuning, model hosting, routing logic, speculative decoding, and cost-efficiency tooling. (A minimal sketch of this stack follows the skills list below.)

Must-Have Skills:
- Deep experience with vLLM, tensor/pipeline parallelism, and KV cache management
- Strong grasp of CUDA-level inference bottlenecks, FlashAttention-2, and quantization
- Familiarity with FP8, INT4, and speculative decoding (e.g., TwinPilots, PowerInfer)
- Proven ability to scale LLMs across multi-GPU nodes (TP, DDP, inference routing)
- Python (systems-level), containerized deployments (Docker, GCP/AWS), load testing (Locust)

Bonus:
- Experience with any-to-any model routing (e.g., text2sql, speech2text)
- Exposure to LangGraph, Triton kernels, or custom inference engines
- Has tuned models for inference at under $0.50 per million tokens at scale

Highlight: A very competitive rate card for the best candidate fit.
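For context on the stack named above, here is a minimal sketch of serving an open-source model with vLLM using tensor parallelism and FP8 quantization. The model checkpoint, parallel degree, and sampling settings are illustrative assumptions, not requirements from this posting:

    # Minimal vLLM serving sketch (illustrative; all settings are assumptions).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct",  # placeholder open-source checkpoint (Qwen/LLaMA/Mixtral)
        tensor_parallel_size=2,            # shard weights across 2 GPUs (e.g., A100/H100)
        quantization="fp8",                # FP8 quantization, where the model/hardware support it
        gpu_memory_utilization=0.90,       # leave headroom for the KV cache
    )

    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Explain speculative decoding in one sentence."], params)
    print(outputs[0].outputs[0].text)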
Constient Global Solutions
Chennai, Tamil Nadu, India
Experience: Not specified
Salary: Not disclosed