Job Title: Associate Director | Hybrid Cloud | Bengaluru | Engineering | Hybrid Cloud Engineering

Job requisition ID: 107228

Date: Jun 22, 2026

Location: Bengaluru

Designation: Associate Director

Entity: Deloitte Touche Tohmatsu India LLP

Job Title: Associate Director – SRE & Observability Engineer (AI Infrastructure)

Role Overview

We are seeking a seasoned Site Reliability Engineering (SRE) and Observability leader to design, build, and scale reliability frameworks for AI/GenAI platforms and data-intensive workloads.

This role will focus on ensuring high availability, performance, scalability, and cost-efficiency across AI infrastructure (LLMs, model training/inference, vector databases, pipelines) by embedding SRE principles, observability, and automation into the platform lifecycle.

Key Responsibilities

1. SRE Strategy for AI Infrastructure

Define and lead the SRE strategy and operating model for AI platforms across cloud (Azure, AWS, GCP) and hybrid environments
Establish SLIs, SLOs, and SLAs tailored to:
LLM inference latency and throughput
Model training performance and job success rates
Pipeline reliability (RAG, orchestration frameworks, agents)
Drive adoption of error budgets and reliability engineering practices across AI and platform teams

2. Observability Architecture for AI Workloads

Design and implement end-to-end observability frameworks for AI systems, including:
Metrics (latency, throughput, GPU utilization, token usage)
Logs (model behavior, system failures, prompt traces)
Traces (distributed AI workflows, API calls, orchestration flows)
Build observability for:
LLM pipelines and agent-based systems
Vector databases and retrieval layers
Data ingestion and feature pipelines
Enable deep visibility into model performance, drift, and degradation

3. Reliability Engineering & Automation

Implement self-healing systems, auto-remediation, and resiliency patterns
Design fault tolerance strategies:
Multi-region deployment
Model fallback and routing strategies
Graceful degradation in GenAI systems
Lead adoption of:
Chaos engineering for AI workloads
Canary deployments and A/B testing for models
Drive automation-first SRE practices using IaC and policy-as-code

4. AI System Performance Optimization

Optimize:
Inference latency and throughput
GPU/accelerator utilization
Distributed training efficiency
Work with engineering teams to:
Fine-tune model serving infrastructure
Implement caching, batching, and async processing
Drive performance benchmarking frameworks for AI workloads

5. Incident Management & Reliability Operations

Establish incident response frameworks tailored for AI platforms
Lead root cause analysis (RCA) for:
Model failures
Pipeline breakdowns
Infrastructure bottlenecks
Define and track MTTR, MTBF, availability, and reliability KPIs
Build runbooks, playbooks, and operational dashboards

6. Tooling & Platform Enablement

Implement and manage observability and SRE tooling such as:
Monitoring: Prometheus, Grafana, Datadog, Azure Monitor, CloudWatch
Logging & tracing: ELK stack, OpenTelemetry, Jaeger
AI observability: Langfuse, Weights & Biases, Arize, WhyLabs (preferred)
Develop custom telemetry pipelines for AI-specific metrics (token usage, prompt traces, response quality signals)
Integrate observability into CI/CD and MLOps pipelines

7. Governance & Risk Management

Define reliability guardrails and governance policies
Ensure AI systems meet compliance, security, and availability requirements
Implement controls for:
Model drift and degradation detection
Data pipeline integrity
Responsible AI monitoring

8. Stakeholder Leadership & Advisory

Act as a trusted advisor to:
Platform engineering
Data science teams
Enterprise architecture and leadership
Translate reliability metrics into business impact (customer experience, revenue risk)
Drive enterprise adoption of SRE practices for AI

9. Thought Leadership & Innovation

Develop POVs, frameworks, and accelerators for:
AI SRE maturity models
Observability patterns for GenAI
Stay ahead of trends in:
AI reliability engineering
Observability tooling and standards
Lead internal capability building and external client workshops

Required Qualifications

Experience

12–15+ years of experience in:
Site Reliability Engineering / DevOps / Platform Engineering
Cloud infrastructure and distributed systems
4–6+ years working with AI/ML platforms, MLOps, or data-intensive systems
Proven experience in designing high-scale, highly reliable systems

Core Skills

Deep expertise in:
SRE principles (SLI/SLO, error budgets, incident management)
Observability (metrics, logs, traces)
Distributed system design and failure modes
Strong understanding of:
AI/ML workloads (training, inference, pipelines)
LLM architectures and GenAI systems

Technical Skills

Cloud platforms: Azure, AWS, GCP
Infrastructure:
Kubernetes, containers, serverless architectures
Observability stack:
OpenTelemetry, Prometheus, Grafana, ELK
Programming / scripting:
Python, Go, or similar
CI/CD & IaC:
Terraform, ARM templates, CloudFormation, GitOps

Leadership & Consulting Skills

Executive communication and stakeholder management
Ability to lead cross-functional, global teams
Strong problem-solving and analytical mindset
Experience in client-facing advisory and transformation programs

Preferred Qualifications

Certifications:
Kubernetes (CKA/CKAD)
Cloud Architect (Azure/AWS/GCP)
Exposure to:
AI observability platforms (Arize, WhyLabs, Langfuse, etc.)
FinOps alignment for AI workloads
Experience with:
Multi-cloud and hybrid deployment strategies