Job Title: Manager | Hybrid cloud | Bengaluru | Engineering | Hybrid Cloud Engineering

• Job requisition ID: 107230
• Location: Bengaluru
• Entity: Deloitte Touche Tohmatsu India LLP
Job Title: AI Infrastructure Architect / Operate Lead (Manager)
Role Summary
The AI Infrastructure Architect / Operate Lead is responsible for operationalizing, managing, and optimizing AI/ML platforms and infrastructure at scale. This role focuses on ensuring high availability, reliability, performance, security, and cost efficiency of AI workloads across multi-cloud and hybrid environments.
The role bridges AI engineering, cloud platform operations, MLOps, DevOps, and SRE practices, enabling organizations to run production-grade AI systems with strong governance and operational excellence.
Key Responsibilities
1. AI Platform Operations & Service Reliability
- Own end-to-end operations of AI platforms and infrastructure, including:
  - Model serving platforms (batch & real-time)
  - AI pipelines and orchestration frameworks
  - Data ingestion and processing layers
- Ensure:
  - 99.9%+ availability and resilience
  - Defined SLOs/SLIs for AI services
- Lead incident, problem, and change management processes
- Conduct root cause analysis (RCA) and implement preventive measures
2. MLOps & Model Lifecycle Management
- Lead operationalization of the end-to-end ML lifecycle:
  - Model training, validation, deployment, monitoring, and retraining
- Implement and manage:
  - ML pipelines (CI/CD for models)
  - Model registry and versioning
- Ensure:
  - Model reproducibility and traceability
  - Model performance tracking (latency, accuracy)
  - Drift detection (data drift / concept drift)
- Integrate automated retraining and feedback loops
3. Cloud & Platform Engineering
- Oversee deployment and operations across Azure, AWS, GCP, and hybrid environments
- Manage:
  - Kubernetes clusters (on-prem/AKS/EKS/GKE)
  - Serverless and container-based AI workloads
- Drive:
  - Infrastructure-as-Code (IaC) adoption (Terraform, Bicep, CloudFormation)
  - Platform standardization and reusable components
- Ensure scalable infrastructure for training (high compute) and inference (low latency)
4. GPU & High-Performance Compute Optimization
- Manage and optimize GPU/TPU-based workloads
- Ensure efficient:
  - Workload scheduling
  - Resource allocation and bin-packing
- Optimize infrastructure for:
  - Distributed training (e.g., Horovod, DeepSpeed)
  - Cost-performance trade-offs
- Monitor GPU utilization and improve efficiency metrics
5. Observability & Intelligent Monitoring
- Implement end-to-end observability across:
  - Infrastructure (CPU, GPU, memory)
  - Platform services
  - AI models
- Define metrics for model drift, bias, latency, and throughput
- Deploy monitoring tools: Prometheus, Grafana, ELK, Azure Monitor, Datadog
- Enable predictive alerting and AIOps capabilities
6. Security, Compliance & Responsible AI
- Ensure secure operation of AI systems:
  - Identity & access management (IAM/RBAC)
  - Data encryption (at rest & in transit)
  - Secure model endpoints
- Enforce:
  - Data privacy regulations (GDPR, HIPAA, etc.)
  - Responsible AI policies (bias detection, explainability)
- Maintain:
  - Audit trails for models and data
  - Governance frameworks for model lifecycle
7. FinOps & Cost Optimization
- Drive cost efficiency for AI workloads:
  - GPU and compute optimization
  - Storage and data transfer optimization
- Implement:
  - Autoscaling and workload scheduling strategies
  - Spot/preemptible instance usage
- Build cost dashboards and chargeback models
- Align AI infrastructure spend with business outcomes
8. Service Delivery & Operations Management
- Lead 24x7 operations support (if applicable)
- Manage SLAs, OLAs, and KPIs
- Implement ITIL-based processes:
  - Incident, problem, change, and release management
- Drive continuous service improvement initiatives
9. Team Leadership & Talent Development
- Lead and mentor a team of:
  - MLOps engineers
  - Cloud/platform engineers
  - SREs / AIOps specialists
- Responsibilities include:
  - Workforce planning and hiring
  - Capability development and certifications
  - Performance management
- Foster a culture of:
  - An automation-first mindset
  - Reliability engineering
  - DevOps practices
10. Stakeholder & Program Management
- Partner with:
  - Data science and AI engineering teams
  - Enterprise architects
  - Security and governance teams
- Translate business requirements into scalable AI infrastructure solutions
- Provide leadership updates on:
  - Platform health
  - Cost metrics
  - Operational KPIs
11. Continuous Improvement & Innovation
- Introduce:
  - Self-healing infrastructure
  - Autonomous operations using AI (AIOps)
- Evaluate new technologies:
  - LLMOps (vector DBs, prompt pipelines, inference optimization)
  - Edge AI and distributed inference
- Improve platform maturity across:
  - Automation
  - Standardization
  - Reliability
Required Qualifications
Education
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field
Experience
- 12+ years in cloud/platform engineering or infrastructure operations
- At least 3-5 years in AI/ML infrastructure or MLOps
- Proven team management experience (Manager level)
Technical Skills
Cloud & Infrastructure
- Azure, AWS, GCP (multi-cloud preferred)
- Kubernetes, Docker
- Infrastructure as Code (Terraform, ARM/Bicep, CloudFormation)
AI/ML & MLOps
- Platforms: Azure ML, SageMaker, Vertex AI, MLflow, Kubeflow
- Model lifecycle management and pipeline orchestration
Data & Processing
- Apache Spark, Kafka, Airflow
- Data pipelines and feature stores
Observability & Monitoring
- Prometheus, Grafana, ELK stack, Datadog
Programming
- Python, Bash, or other scripting languages
Leadership & Functional Skills
- Strong people leadership and delivery management
- Experience in SRE / DevOps transformations
- Knowledge of ITIL-based service management
- Strong stakeholder communication and executive reporting
Preferred Qualifications
- Certifications:
  - Azure/AWS/GCP Architect
  - Certified Kubernetes Administrator (CKA)
  - AI/ML certifications (Azure ML, AWS ML Specialty)
- Experience with:
  - Generative AI / LLMOps ecosystems
  - Vector databases (FAISS, Pinecone, etc.)
  - Responsible AI frameworks
