Job Title: Associate Director | Hybrid Cloud | Bengaluru | Engineering | Hybrid Cloud Engineering
Job Title: Associate Director – SRE & Observability Engineer (AI Infrastructure)
Role Overview
We are seeking a seasoned Site Reliability Engineering (SRE) and Observability leader to design, build, and scale reliability frameworks for AI/GenAI platforms and data-intensive workloads.
This role will focus on ensuring high availability, performance, scalability, and cost-efficiency across AI infrastructure (LLMs, model training/inference, vector databases, pipelines) by embedding SRE principles, observability, and automation into the platform lifecycle.
Key Responsibilities
1. SRE Strategy for AI Infrastructure
- Define and lead the SRE strategy and operating model for AI platforms across cloud (Azure, AWS, GCP) and hybrid environments
- Establish SLIs, SLOs, and SLAs tailored to:
  - LLM inference latency and throughput
  - Model training performance and job success rates
  - Pipeline reliability (RAG, orchestration frameworks, agents)
- Drive adoption of error budgets and reliability engineering practices across AI and platform teams
2. Observability Architecture for AI Workloads
- Design and implement end-to-end observability frameworks for AI systems, including:
  - Metrics (latency, throughput, GPU utilization, token usage)
  - Logs (model behavior, system failures, prompt traces)
  - Traces (distributed AI workflows, API calls, orchestration flows)
- Build observability for:
  - LLM pipelines and agent-based systems
  - Vector databases and retrieval layers
  - Data ingestion and feature pipelines
- Enable deep visibility into model performance, drift, and degradation
3. Reliability Engineering & Automation
- Implement self-healing systems, auto-remediation, and resiliency patterns
- Design fault tolerance strategies:
  - Multi-region deployment
  - Model fallback and routing strategies
  - Graceful degradation in GenAI systems
- Lead adoption of:
  - Chaos engineering for AI workloads
  - Canary deployments and A/B testing for models
- Drive automation-first SRE practices using IaC and policy-as-code
4. AI System Performance Optimization
- Optimize:
  - Inference latency and throughput
  - GPU/accelerator utilization
  - Distributed training efficiency
- Work with engineering teams to:
  - Fine-tune model serving infrastructure
  - Implement caching, batching, and async processing
- Drive performance benchmarking frameworks for AI workloads
5. Incident Management & Reliability Operations
- Establish incident response frameworks tailored for AI platforms
- Lead root cause analysis (RCA) for:
  - Model failures
  - Pipeline breakdowns
  - Infrastructure bottlenecks
- Define and track MTTR, MTBF, availability, and reliability KPIs
- Build runbooks, playbooks, and operational dashboards
6. Tooling & Platform Enablement
- Implement and manage observability and SRE tooling such as:
  - Monitoring: Prometheus, Grafana, Datadog, Azure Monitor, CloudWatch
  - Logging & tracing: ELK stack, OpenTelemetry, Jaeger
  - AI observability: Langfuse, Weights & Biases, Arize, WhyLabs (preferred)
- Develop custom telemetry pipelines for AI-specific metrics (token usage, prompt traces, response quality signals)
- Integrate observability into CI/CD and MLOps pipelines
7. Governance & Risk Management
- Define reliability guardrails and governance policies
- Ensure AI systems meet compliance, security, and availability requirements
- Implement controls for:
  - Model drift and degradation detection
  - Data pipeline integrity
  - Responsible AI monitoring
8. Stakeholder Leadership & Advisory
- Act as a trusted advisor to:
  - Platform engineering
  - Data science teams
  - Enterprise architecture and leadership
- Translate reliability metrics into business impact (customer experience, revenue risk)
- Drive enterprise adoption of SRE practices for AI
9. Thought Leadership & Innovation
- Develop POVs, frameworks, and accelerators for:
  - AI SRE maturity models
  - Observability patterns for GenAI
- Stay ahead of trends in:
  - AI reliability engineering
  - Observability tooling and standards
- Lead internal capability building and external client workshops
Required Qualifications
Experience
- 12–15+ years of experience in:
  - Site Reliability Engineering / DevOps / Platform Engineering
  - Cloud infrastructure and distributed systems
- 4–6+ years working with AI/ML platforms, MLOps, or data-intensive systems
- Proven experience in designing high-scale, highly reliable systems
Core Skills
- Deep expertise in:
  - SRE principles (SLI/SLO, error budgets, incident management)
  - Observability (metrics, logs, tracing)
  - Distributed system design and failure modes
- Strong understanding of:
  - AI/ML workloads (training, inference, pipelines)
  - LLM architectures and GenAI systems
Technical Skills
- Cloud Platforms: Azure, AWS, GCP
- Infrastructure: Kubernetes, containers, serverless architectures
- Observability stack: OpenTelemetry, Prometheus, Grafana, ELK
- Programming / scripting: Python, Go, or similar
- CI/CD & IaC: Terraform, ARM templates, CloudFormation, GitOps
Leadership & Consulting Skills
- Executive communication and stakeholder management
- Ability to lead cross-functional, global teams
- Strong problem-solving and analytical mindset
- Experience in client-facing advisory and transformation programs
Preferred Qualifications
- Certifications:
  - Kubernetes (CKA/CKAD)
  - Cloud Architect (Azure/AWS/GCP)
- Exposure to:
  - AI observability platforms (Arize, WhyLabs, Langfuse, etc.)
  - FinOps alignment for AI workloads
- Experience with multi-cloud and hybrid deployment strategies