Posted: 2 days ago
On-site | Full Time
*Who you are*
You’re the person whose fingertips know the difference between spinning up a GPU cluster and spinning down a stale inference node. You love the “infrastructure behind the magic” of LLMs. You’ve built CI/CD pipelines that automatically version models, log inference metrics, and alert on drift. You’ve containerized GenAI services in Docker, deployed them on Kubernetes clusters (AKS or EKS), and used Terraform or ARM to manage infrastructure as code. You monitor cloud costs like a hawk, optimize GPU workloads, and will sometimes trade cost for performance, but never the other way around. You’re fluent in Python and Bash, can script tests for REST endpoints, and build automated feedback loops for model retraining. You’re comfortable working in Azure (OpenAI, Azure ML, Azure DevOps Pipelines), but cloud-agnostic enough to cover AWS or GCP if needed. You read MLOps/LLMOps blog posts or arXiv summaries on the weekend and implement improvements on Monday. You think of yourself as a self-driven engineer: no playbooks, no spoon-feeding, just solid automation, reliability, and a hunger to scale GenAI from prototype to production.
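To make that “version, log, alert on drift” loop concrete, here is a minimal Python sketch. The metrics file, baseline latency, alert factor, and model tag are all illustrative assumptions, not details taken from this role:

```python
# Minimal sketch: log per-inference metrics and alert on latency drift.
# METRICS_FILE, BASELINE_P95_MS, and DRIFT_FACTOR are hypothetical.
import json
import time
from pathlib import Path

METRICS_FILE = Path("metrics.jsonl")   # hypothetical metrics sink
BASELINE_P95_MS = 450.0                # assumed latency baseline
DRIFT_FACTOR = 1.5                     # alert when p95 exceeds 1.5x baseline

def log_inference(model_version: str, latency_ms: float, ok: bool) -> None:
    """Append one inference record as a JSON line."""
    record = {"ts": time.time(), "model": model_version,
              "latency_ms": latency_ms, "ok": ok}
    with METRICS_FILE.open("a") as f:
        f.write(json.dumps(record) + "\n")

def p95_latency_drifted(window: int = 200) -> bool:
    """True when the rolling p95 latency drifts above the baseline."""
    lines = METRICS_FILE.read_text().splitlines()[-window:]
    latencies = sorted(json.loads(line)["latency_ms"] for line in lines)
    if len(latencies) < 20:  # too little data to judge drift
        return False
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return p95 > DRIFT_FACTOR * BASELINE_P95_MS

log_inference("summarizer-v3", 512.0, ok=True)  # hypothetical model tag
if p95_latency_drifted():
    print("ALERT: p95 latency drifted above baseline")  # stand-in for paging
```

In a real pipeline the print would be a pager or Azure Monitor alert, and the JSON-lines file a proper metrics store; the shape of the loop is the point.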
---
*What you will actually do*
You’ll architect and build deployment platforms for internal LLM services, starting with containerizing models and building CI/CD pipelines for inference microservices.
You’ll write IaC (Terraform or ARM) to spin up clusters, endpoints, GPUs, storage, and logging infrastructure.
You’ll integrate Azure OpenAI and Azure ML endpoints, pushing models through pipelines, versioning them, and enabling automatic retraining triggers.
You’ll build monitoring and observability around latency, cost, error rates, drift, and prompt-health metrics.
You’ll optimize deployments (autoscaling, spot/GPU nodes, invalidation policies) to balance cost and performance.
You’ll set up automated QA pipelines that validate model outputs (e.g., semantic similarity, hallucination detection) before merging; a minimal sketch of such a gate follows this list.
You’ll collaborate with ML, backend, and frontend teams to package components into release-ready backend services.
You’ll manage alerts and rollbacks on failure, and keep uptime at 99% or better.
You’ll create reusable tooling (CI templates, deployment scripts, infra modules) to make future projects plug-and-play.
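The QA gate mentioned above could be as small as this Python sketch, assuming the open-source sentence-transformers package; the encoder name and the 0.80 threshold are illustrative choices, not requirements from this posting:

```python
# Minimal sketch of a pre-merge QA gate: block the release when model
# outputs drift semantically from approved reference answers.
# The model name and THRESHOLD are assumptions, not part of this posting.
from sentence_transformers import SentenceTransformer, util

THRESHOLD = 0.80  # assumed minimum cosine similarity to pass QA
model = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly encoder

def qa_gate(candidates: list[str], references: list[str]) -> bool:
    """Return True only if every candidate stays close to its reference."""
    cand_emb = model.encode(candidates, convert_to_tensor=True)
    ref_emb = model.encode(references, convert_to_tensor=True)
    sims = util.cos_sim(cand_emb, ref_emb).diagonal()  # pairwise scores
    return bool((sims >= THRESHOLD).all())

if not qa_gate(["Paris is the capital of France."],
               ["France's capital city is Paris."]):
    raise SystemExit("QA gate failed: semantic drift in model outputs")
```

The non-zero exit code is what lets a CI/CD pipeline (Azure DevOps or GitHub Actions) fail the merge automatically when outputs regress.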
---
*Skills and knowledge*
Strong scripting skills in Python and Bash for automation and pipelines
Fluent in Docker and Kubernetes (especially AKS), including containerizing LLM workloads
Infrastructure-as-code expertise: Terraform (Azure provider) or ARM templates
Experience with Azure DevOps or GitHub Actions for CI/CD of models and services
Knowledge of Azure OpenAI, Azure ML, or equivalent cloud LLM endpoints
Familiar with setting up monitoring (Azure Monitor, Prometheus/Grafana) to track latency, errors, drift, and costs; see the sketch after this list
Cost-optimization tactics: spot nodes, autoscaling, GPU utilization tracking
Basic LLM understanding: inference latency/cost, deployment patterns, model versioning
Ability to build lightweight QA checks or integrate with QA pipelines
Cloud-agnostic awareness: experience covering AWS or GCP as a fallback when needed
Comfortable establishing production-grade Ops pipelines, automating deployments end-to-end
Self-starter mentality: no playbooks required, ability to pick up new tools and drive infrastructure independently
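As promised in the monitoring item above, here is a minimal Python sketch of exporting latency and a drift statistic for Prometheus/Grafana to alert on. It assumes the prometheus_client and scipy packages; the port, metric names, and the suggested alert threshold are illustrative, and the random traffic stands in for real inference data:

```python
# Minimal sketch: expose rolling p95 latency and a KS drift statistic
# as Prometheus gauges. Port, metric names, and thresholds are assumed.
import random
import time

from prometheus_client import Gauge, start_http_server
from scipy.stats import ks_2samp

latency_p95 = Gauge("llm_latency_p95_ms", "Rolling p95 inference latency")
drift_stat = Gauge("llm_latency_drift_ks", "KS statistic vs. reference window")

reference = [random.gauss(400, 40) for _ in range(500)]  # stand-in baseline

def publish(window: list[float]) -> None:
    """Push current-window metrics; Grafana alerts on the gauge values."""
    window = sorted(window)
    latency_p95.set(window[int(0.95 * (len(window) - 1))])
    stat, _pvalue = ks_2samp(reference, window)
    drift_stat.set(stat)  # e.g., alert in Grafana when stat > 0.2

if __name__ == "__main__":
    start_http_server(9100)  # assumed scrape port
    while True:
        publish([random.gauss(430, 60) for _ in range(200)])  # fake traffic
        time.sleep(15)
```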
Serenovolante Software Services Private Limited
Experience: Not specified
Salary: Not disclosed