Job Title: Manager | Hybrid cloud | Bengaluru | Engineering | Hybrid Cloud Engineering

Job requisition ID: 107230

Date: Jun 22, 2026

Location: Bengaluru

Designation: Manager

Entity: Deloitte Touche Tohmatsu India LLP

Job Title: AI Infrastructure Architect / Operate Lead (Manager)

Role Summary

The AI Infrastructure Architect / Operate Lead is responsible for operationalizing, managing, and optimizing AI/ML platforms and infrastructure at scale. This role focuses on ensuring high availability, reliability, performance, security, and cost efficiency of AI workloads across multi-cloud and hybrid environments.

The role bridges AI engineering, cloud platform operations, MLOps, DevOps, and SRE practices, enabling organizations to run production-grade AI systems with strong governance and operational excellence.

Key Responsibilities

1. AI Platform Operations & Service Reliability

• Own end-to-end operations of AI platforms and infrastructure, including:
  • Model serving platforms (batch & real-time)
  • AI pipelines and orchestration frameworks
  • Data ingestion and processing layers
• Ensure:
  • 99.9%+ availability and resilience
  • Defined SLOs/SLIs for AI services
• Lead incident, problem, and change management processes
• Conduct root cause analysis (RCA) and implement preventive measures

2. MLOps & Model Lifecycle Management

• Lead operationalization of the end-to-end ML lifecycle:
  • Model training, validation, deployment, monitoring, and retraining
• Implement and manage:
  • ML pipelines (CI/CD for models)
  • Model registry and versioning
• Ensure:
  • Model reproducibility and traceability
  • Model performance tracking (latency, accuracy)
  • Drift detection (data drift / concept drift)
• Integrate automated retraining and feedback loops

3. Cloud & Platform Engineering

• Oversee deployment and operations across Azure, AWS, GCP, and hybrid environments
• Manage:
  • Kubernetes clusters (on-prem/AKS/EKS/GKE)
  • Serverless and container-based AI workloads
• Drive:
  • Infrastructure-as-Code (IaC) adoption (Terraform, Bicep, CloudFormation)
  • Platform standardization and reusable components
• Ensure scalable infrastructure for training (high compute) and inference (low latency)

4. GPU & High-Performance Compute Optimization

• Manage and optimize GPU/TPU-based workloads
• Ensure efficient:
  • Workload scheduling
  • Resource allocation and bin-packing
• Optimize infrastructure for:
  • Distributed training (e.g., Horovod, DeepSpeed)
  • Cost-performance trade-offs
• Monitor GPU utilization and improve efficiency metrics

5. Observability & Intelligent Monitoring

• Implement end-to-end observability across:
  • Infrastructure (CPU, GPU, memory)
  • Platform services
  • AI models
• Define metrics for:
  • Model drift, bias, latency, throughput
• Deploy monitoring tools:
  • Prometheus, Grafana, ELK, Azure Monitor, Datadog
• Enable predictive alerting and AIOps capabilities

6. Security, Compliance & Responsible AI

• Ensure secure operation of AI systems:
  • Identity & access management (IAM/RBAC)
  • Data encryption (at rest & in transit)
  • Secure model endpoints
• Enforce:
  • Data privacy regulations (GDPR, HIPAA, etc.)
  • Responsible AI policies (bias detection, explainability)
• Maintain:
  • Audit trails for models and data
  • Governance frameworks for model lifecycle

7. FinOps & Cost Optimization

• Drive cost efficiency for AI workloads:
  • GPU and compute optimization
  • Storage and data transfer optimization
• Implement:
  • Autoscaling and workload scheduling strategies
  • Spot/preemptible usage
• Build:
  • Cost dashboards and chargeback models
• Align AI infrastructure spend with business outcomes

8. Service Delivery & Operations Management

• Lead 24x7 operations support (if applicable)
• Manage SLAs, OLAs, and KPIs
• Implement ITIL-based processes:
  • Incident, problem, change, release management
• Drive continuous service improvement initiatives

9. Team Leadership & Talent Development

• Lead and mentor a team of:
  • MLOps engineers
  • Cloud/platform engineers
  • SREs / AI Ops specialists
• Responsibilities include:
  • Workforce planning and hiring
  • Capability development and certifications
  • Performance management
• Foster a culture of:
  • Automation-first mindset
  • Reliability engineering
  • DevOps practices

10. Stakeholder & Program Management

• Partner with:
  • Data science and AI engineering teams
  • Enterprise architects
  • Security and governance teams
• Translate business requirements into scalable AI infrastructure solutions
• Provide leadership updates on:
  • Platform health
  • Cost metrics
  • Operational KPIs

11. Continuous Improvement & Innovation

• Introduce:
  • Self-healing infrastructure
  • Autonomous operations using AI (AIOps)
• Evaluate new technologies:
  • LLMOps (vector DBs, prompt pipelines, inference optimization)
  • Edge AI and distributed inference
• Improve platform maturity across:
  • Automation
  • Standardization
  • Reliability

Required Qualifications

Education

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field

Experience

• 12+ years in cloud/platform engineering or infrastructure operations
• At least 3-5 years in AI/ML infrastructure or MLOps
• Proven team management experience (Manager level)

Technical Skills

Cloud & Infrastructure

Azure, AWS, GCP (multi-cloud preferred)
Kubernetes, Docker
Infrastructure as Code (Terraform, ARM/Bicep, CloudFormation)

AI/ML & MLOps

Platforms: Azure ML, SageMaker, Vertex AI, MLflow, Kubeflow
Model lifecycle management and pipeline orchestration

Data & Processing

Apache Spark, Kafka, Airflow
Data pipelines and feature stores

Observability & Monitoring

Prometheus, Grafana, ELK stack, Datadog

Programming

Python, Bash, or other scripting languages

Leadership & Functional Skills

Strong people leadership and delivery management
Experience in SRE / DevOps transformations
Knowledge of ITIL-based service management
Strong stakeholder communication and executive reporting

Preferred Qualifications

• Certifications:
  • Azure/AWS/GCP Architect
  • Certified Kubernetes Administrator (CKA)
  • AI/ML certifications (Azure ML, AWS ML Specialty)
• Experience with:
  • Generative AI / LLMOps ecosystems
  • Vector databases (FAISS, Pinecone, etc.)
  • Responsible AI frameworks
