Job Title:  Associate Director | Hybrid cloud | Bengaluru | Engineering | Hybrid Cloud Engineering

Job requisition ID:  107228
Date:  Jun 22, 2026
Location:  Bengaluru
Designation:  Associate Director
Entity:  Deloitte Touche Tohmatsu India LLP

Job Title: Associate Director – SRE & Observability Engineer (AI Infrastructure)

Role Overview

We are seeking a seasoned Site Reliability Engineering (SRE) and Observability leader to design, build, and scale reliability frameworks for AI/GenAI platforms and data-intensive workloads.

This role will focus on ensuring high availability, performance, scalability, and cost-efficiency across AI infrastructure (LLMs, model training/inference, vector databases, pipelines) by embedding SRE principles, observability, and automation into the platform lifecycle.


Key Responsibilities

1. SRE Strategy for AI Infrastructure

  • Define and lead SRE strategy and operating model for AI platforms across cloud (Azure, AWS, GCP) and hybrid environments
  • Establish SLIs, SLOs, and SLAs tailored to:
      • LLM inference latency and throughput
      • Model training performance and job success rates
      • Pipeline reliability (RAG, orchestration frameworks, agents)
  • Drive adoption of error budgets and reliability engineering practices across AI and platform teams


2. Observability Architecture for AI Workloads

  • Design and implement end-to-end observability frameworks for AI systems, including:
      • Metrics (latency, throughput, GPU utilization, token usage)
      • Logs (model behavior, system failures, prompt traces)
      • Traces (distributed AI workflows, API calls, orchestration flows)
  • Build observability for:
      • LLM pipelines and agent-based systems
      • Vector databases and retrieval layers
      • Data ingestion and feature pipelines
  • Enable deep visibility into model performance, drift, and degradation


3. Reliability Engineering & Automation

  • Implement self-healing systems, auto-remediation, and resiliency patterns
  • Design fault tolerance strategies:
      • Multi-region deployment
      • Model fallback and routing strategies
      • Graceful degradation in GenAI systems
  • Lead adoption of:
      • Chaos engineering for AI workloads
      • Canary deployments and A/B testing for models
  • Drive automation-first SRE practices using IaC and policy-as-code


4. AI System Performance Optimization

  • Optimize:
      • Inference latency and throughput
      • GPU/accelerator utilization
      • Distributed training efficiency
  • Work with engineering teams to:
      • Fine-tune model serving infrastructure
      • Implement caching, batching, and async processing
  • Drive performance benchmarking frameworks for AI workloads


5. Incident Management & Reliability Operations

  • Establish incident response frameworks tailored for AI platforms
  • Lead root cause analysis (RCA) for:
      • Model failures
      • Pipeline breakdowns
      • Infrastructure bottlenecks
  • Define and track MTTR, MTBF, availability, and reliability KPIs
  • Build runbooks, playbooks, and operational dashboards


6. Tooling & Platform Enablement

  • Implement and manage observability and SRE tooling such as:
      • Monitoring: Prometheus, Grafana, Datadog, Azure Monitor, CloudWatch
      • Logging & tracing: ELK stack, OpenTelemetry, Jaeger
      • AI observability: Langfuse, Weights & Biases, Arize, WhyLabs (preferred)
  • Develop custom telemetry pipelines for AI-specific metrics (token usage, prompt traces, response quality signals)
  • Integrate observability into CI/CD and MLOps pipelines


7. Governance & Risk Management

  • Define reliability guardrails and governance policies
  • Ensure AI systems meet compliance, security, and availability requirements
  • Implement controls for:
      • Model drift and degradation detection
      • Data pipeline integrity
      • Responsible AI monitoring


8. Stakeholder Leadership & Advisory

  • Act as a trusted advisor to:
      • Platform engineering
      • Data science teams
      • Enterprise architecture and leadership
  • Translate reliability metrics into business impact (customer experience, revenue risk)
  • Drive enterprise adoption of SRE practices for AI


9. Thought Leadership & Innovation

  • Develop POVs, frameworks, and accelerators for:
      • AI SRE maturity models
      • Observability patterns for GenAI
  • Stay ahead of trends in:
      • AI reliability engineering
      • Observability tooling and standards
  • Lead internal capability building and external client workshops


Required Qualifications

Experience

  • 12–15+ years of experience in:
      • Site Reliability Engineering / DevOps / Platform Engineering
      • Cloud infrastructure and distributed systems
  • 4–6+ years working with AI/ML platforms, MLOps, or data-intensive systems
  • Proven experience in designing high-scale, highly reliable systems


Core Skills

  • Deep expertise in:
      • SRE principles (SLI/SLO, error budgets, incident management)
      • Observability (metrics, logs, tracing)
      • Distributed system design and failure modes
  • Strong understanding of:
      • AI/ML workloads (training, inference, pipelines)
      • LLM architectures and GenAI systems


Technical Skills

  • Cloud Platforms: Azure, AWS, GCP
  • Infrastructure: Kubernetes, containers, serverless architectures
  • Observability stack: OpenTelemetry, Prometheus, Grafana, ELK
  • Programming / scripting: Python, Go, or similar
  • CI/CD & IaC: Terraform, ARM, CloudFormation, GitOps


Leadership & Consulting Skills

  • Executive communication and stakeholder management
  • Ability to lead cross-functional, global teams
  • Strong problem-solving and analytical mindset
  • Experience in client-facing advisory and transformation programs


Preferred Qualifications

  • Certifications:
      • Kubernetes (CKA/CKAD)
      • Cloud Architect (Azure/AWS/GCP)
  • Exposure to:
      • AI observability platforms (Arize, WhyLabs, Langfuse, etc.)
      • FinOps alignment for AI workloads
  • Experience with:
      • Multi-cloud and hybrid deployment strategies