Job Title: Director | AI / ML | Bengaluru | Engineering | Hybrid Cloud Engineering
Your work profile
AI Data Center Architecture & Solution Design
· Design and implement AI-focused Data Center architectures aligned with Tier II, Tier III, and Tier IV standards.
· Develop end-to-end AI Data Center solutions, including retrofitting traditional CPU-based data centers into AI Factories.
· Create advisory documents, RFPs, technical proposals, and commercial proposals for AI Data Center engagements.
· Design AI infrastructure solutions across hyperscalers (AWS, Azure, GCP, OCI) and NVIDIA Cloud Partners.
· Prepare HLDs, LLDs, network diagrams, rack layouts, BOQs, and TCO models.
AI Networking & Fabric Architecture
· Architect and deploy InfiniBand and NVIDIA Spectrum Ethernet fabrics for AI workloads.
· Design and implement Spine-Leaf network architectures using EVPN-VXLAN overlays.
· Configure and optimize BGP, ECMP, and RoCE in high-performance networking environments.
· Lead Cumulus Linux-based deployments and network automation initiatives.
· Optimize network performance, latency, throughput, and congestion management for AI environments.
AI Compute & GPU Infrastructure
· Design and size GPU clusters using NVIDIA H100, H200, B200, B300, DGX, and AI Factory platforms.
· Perform GPU capacity planning and workload profiling for AI and ML use cases.
· Implement GPU virtualization and Multi-Instance GPU (MIG) architectures.
· Support AI training and inference infrastructure deployments.
AI Storage & Platform Engineering
· Design AI storage solutions utilizing NAS, SAN, NVMe, Object Storage, NFS, iSCSI, Fibre Channel, and parallel file systems.
· Implement and manage Kubernetes-based AI platforms, including OpenShift and VMware Tanzu.
· Deploy and integrate Run:ai and Slurm workload schedulers for GPU orchestration.
· Ensure seamless integration of AI platforms with existing enterprise infrastructure.
Monitoring, Observability & Operations
· Implement NVIDIA UFM, NVIDIA Mission Control, and NetQ for infrastructure monitoring and observability.
· Configure telemetry, validation, troubleshooting, and fabric management workflows.
· Drive infrastructure benchmarking, performance optimization, and capacity planning initiatives.
· Support POCs, design validation exercises, production rollouts, and operational readiness activities.
Cloud & AI Services
· Design AI infrastructure solutions across AWS, Azure, GCP, and OCI.
· Enable AI services integration across hybrid and multi-cloud environments.
· Provide guidance on AI platform adoption, scalability, and operational best practices.
Key skills required
Data Center Infrastructure
- Strong understanding of Data Center power infrastructure, including UPS, PDU, ATS, switchgear, transformers, and generators.
- Knowledge of Data Center cooling technologies such as CRAC, CRAH, liquid cooling, immersion cooling, and chiller systems.
- Experience in rack design, cabling architecture, white space planning, and physical infrastructure design.
- Understanding of raised floors, fire suppression systems, plenum design, and facility infrastructure.
AI Networking
- Strong expertise in InfiniBand (HDR/NDR), RoCE, and Ethernet fabrics.
- Hands-on experience with NVIDIA Spectrum switches.
- Deep understanding of EVPN-VXLAN, BGP, ECMP, Spine-Leaf architecture, and network automation.
- Experience with Cumulus Linux environments.
AI Compute & Platforms
- Expertise in NVIDIA GPU platforms including DGX, H100, H200, B200, and B300.
- Experience with GPU virtualization, MIG, and AI workload optimization.
- Strong understanding of AI training and inference infrastructure.
AI Storage
- Knowledge of AI storage architectures and parallel file systems such as Lustre and GPFS.
- Experience with NAS, SAN, Fibre Channel, NVMe, NFS, iSCSI, and Object Storage technologies.
Orchestration & Container Platforms
- Experience with Kubernetes ecosystems.
- Hands-on expertise with OpenShift and VMware Tanzu.
- Experience with Run:ai and Slurm workload management platforms.
- Understanding of container networking for AI workloads.
AI Software Stack
- Understanding of AI infrastructure software layers, including:
  - LLMs
  - MLOps Platforms
  - Training and Inference Frameworks
  - Agentic AI
  - NVIDIA AI Enterprise
  - NVIDIA Licensing
  - NVIDIA NVIS
Cloud Technologies
- Strong understanding of AWS, Azure, GCP, and OCI services.
- Experience designing AI and cloud-native solutions in hyperscaler environments.