Job Title:  Manager | Hybrid cloud | Bengaluru | Engineering | Hybrid Cloud Engineering

Job requisition ID:  107230
Date:  Jun 22, 2026
Location:  Bengaluru
Designation:  Manager
Entity:  Deloitte Touche Tohmatsu India LLP

Job Title: AI Infrastructure Architect / Operate Lead (Manager)

Role Summary

The AI Infrastructure Architect / Operate Lead is responsible for operationalizing, managing, and optimizing AI/ML platforms and infrastructure at scale. This role focuses on ensuring high availability, reliability, performance, security, and cost efficiency of AI workloads across multi-cloud and hybrid environments.

The role bridges AI engineering, cloud platform operations, MLOps, DevOps, and SRE practices, enabling organizations to run production-grade AI systems with strong governance and operational excellence.


Key Responsibilities

1. AI Platform Operations & Service Reliability

  • Own end-to-end operations of AI platforms and infrastructure, including:
      • Model serving platforms (batch & real-time)
      • AI pipelines and orchestration frameworks
      • Data ingestion and processing layers
  • Ensure:
      • 99.9%+ availability and resilience
      • Defined SLOs/SLIs for AI services
  • Lead incident, problem, and change management processes
  • Conduct root cause analysis (RCA) and implement preventive measures


2. MLOps & Model Lifecycle Management

  • Lead operationalization of the end-to-end ML lifecycle:
      • Model training, validation, deployment, monitoring, and retraining
  • Implement and manage:
      • ML pipelines (CI/CD for models)
      • Model registry and versioning
  • Ensure:
      • Model reproducibility and traceability
      • Model performance tracking (latency, accuracy)
      • Drift detection (data drift / concept drift)
  • Integrate automated retraining and feedback loops


3. Cloud & Platform Engineering

  • Oversee deployment and operations across Azure, AWS, GCP, and hybrid environments
  • Manage:
      • Kubernetes clusters (On-prem/AKS/EKS/GKE)
      • Serverless and container-based AI workloads
  • Drive:
      • Infrastructure-as-Code (IaC) adoption (Terraform, Bicep, CloudFormation)
      • Platform standardization and reusable components
  • Ensure scalable infrastructure for training (high compute) and inference (low latency)


4. GPU & High-Performance Compute Optimization

  • Manage and optimize GPU/TPU-based workloads
  • Ensure efficient:
      • Workload scheduling
      • Resource allocation and bin-packing
  • Optimize infrastructure for:
      • Distributed training (e.g., Horovod, DeepSpeed)
      • Cost-performance trade-offs
  • Monitor GPU utilization and improve efficiency metrics


5. Observability & Intelligent Monitoring

  • Implement end-to-end observability across:
      • Infrastructure (CPU, GPU, memory)
      • Platform services
      • AI models
  • Define metrics for:
      • Model drift, bias, latency, throughput
  • Deploy monitoring tools:
      • Prometheus, Grafana, ELK, Azure Monitor, Datadog
  • Enable predictive alerting and AIOps capabilities


6. Security, Compliance & Responsible AI

  • Ensure secure operation of AI systems:
      • Identity & access management (IAM/RBAC)
      • Data encryption (at rest & in transit)
      • Secure model endpoints
  • Enforce:
      • Data privacy regulations (GDPR, HIPAA, etc.)
      • Responsible AI policies (bias detection, explainability)
  • Maintain:
      • Audit trails for models and data
      • Governance frameworks for model lifecycle


7. FinOps & Cost Optimization

  • Drive cost efficiency for AI workloads:
      • GPU and compute optimization
      • Storage and data transfer optimization
  • Implement:
      • Autoscaling and workload scheduling strategies
      • Spot/preemptible usage
  • Build:
      • Cost dashboards and chargeback models
  • Align AI infrastructure spend with business outcomes


8. Service Delivery & Operations Management

  • Lead 24x7 operations support (if applicable)
  • Manage SLAs, OLAs, and KPIs
  • Implement ITIL-based processes:
      • Incident, problem, change, release management
  • Drive continuous service improvement initiatives


9. Team Leadership & Talent Development

  • Lead and mentor a team of:
      • MLOps engineers
      • Cloud/platform engineers
      • SREs / AI Ops specialists
  • Responsibilities include:
      • Workforce planning and hiring
      • Capability development and certifications
      • Performance management
  • Foster a culture of:
      • Automation-first mindset
      • Reliability engineering
      • DevOps practices


10. Stakeholder & Program Management

  • Partner with:
      • Data science and AI engineering teams
      • Enterprise architects
      • Security and governance teams
  • Translate business requirements into scalable AI infrastructure solutions
  • Provide leadership updates on:
      • Platform health
      • Cost metrics
      • Operational KPIs


11. Continuous Improvement & Innovation

  • Introduce:
      • Self-healing infrastructure
      • Autonomous operations using AI (AIOps)
  • Evaluate new technologies:
      • LLMOps (vector DBs, prompt pipelines, inference optimization)
      • Edge AI and distributed inference
  • Improve platform maturity across:
      • Automation
      • Standardization
      • Reliability


Required Qualifications

Education

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field


Experience

  • 12+ years in cloud/platform engineering or infrastructure operations
  • At least 3-5 years in AI/ML infrastructure or MLOps
  • Proven team management experience (Manager level)


Technical Skills

Cloud & Infrastructure

  • Azure, AWS, GCP (multi-cloud preferred)
  • Kubernetes, Docker
  • Infrastructure as Code (Terraform, ARM/Bicep, CloudFormation)

AI/ML & MLOps

  • Platforms: Azure ML, SageMaker, Vertex AI, MLflow, Kubeflow
  • Model lifecycle management and pipeline orchestration

Data & Processing

  • Apache Spark, Kafka, Airflow
  • Data pipelines and feature stores

Observability & Monitoring

  • Prometheus, Grafana, ELK stack, Datadog

Programming

  • Python, Bash, or other scripting languages


Leadership & Functional Skills

  • Strong people leadership and delivery management
  • Experience in SRE / DevOps transformations
  • Knowledge of ITIL-based service management
  • Strong stakeholder communication and executive reporting


Preferred Qualifications

  • Certifications:
      • Azure/AWS/GCP Architect
      • Certified Kubernetes Administrator (CKA)
      • AI/ML certifications (Azure ML, AWS ML Specialty)
  • Experience with:
      • Generative AI / LLMOps ecosystems
      • Vector databases (FAISS, Pinecone, etc.)
      • Responsible AI frameworks