Job Title: Lead Senior Associate | Engineering Foundry & Managed Services | Bengaluru | Engineering as a Servic
Job Description:
We are looking for a Senior Kubeflow Developer with 8+ years of experience in software engineering or platform engineering, including substantial Kubernetes experience and 4+ years working directly with Kubeflow or related MLOps tooling, who will design, build, and maintain Kubeflow-based AI/ML platforms and applications.
This role focuses on customizing Kubeflow components (Jupyter integrations, Knative/KServe), managing Kubeflow install/upgrade/lifecycle, and implementing secure Kubernetes authentication and authorization. The ideal candidate partners closely with data scientists and platform engineers to deliver production-grade MLOps pipelines and scalable, secure AI services.
Overview
We are seeking a senior-level engineer with deep hands-on experience in Kubeflow, Kubernetes, and cloud-native MLOps to lead customization, deployment, and lifecycle management of Kubeflow deployments. You will be responsible for integrating Jupyter notebook services, extending Knative/KServe for model serving, implementing robust Kubernetes authN/authZ patterns, and ensuring reliable install/upgrade processes across environments (development, staging, production, private cloud). This is both a developer and platform-owner role: you will build AI/ML applications and operate the underlying Kubeflow platform.
Key responsibilities
Development and customization
- Customize and extend Kubeflow applications and components (Kubeflow Pipelines (KFP), Katib, Profiles, Metadata).
- Integrate and harden Jupyter Notebook / JupyterHub environments for interactive data science workflows.
- Implement and extend Knative and KServe components to support custom model-serving runtimes and autoscaling patterns.
- Create reusable manifests, Kustomize overlays, Helm charts, or Kubernetes operators for repeatable deployments.
Deployment and lifecycle management
- Design and own the install, upgrade and rollback processes for Kubeflow across clusters and environments.
- Manage manifests and configuration (versioning, parameterization) to enable repeatable, auditable deployments.
- Automate bootstrap and cluster lifecycle tasks, including preflight checks, dependency validation, and post-deploy verification.
- Troubleshoot and resolve complex deployment/install issues across control plane and data plane components.
Security (authN/authZ)
- Implement Kubernetes authentication (OIDC, ServiceAccounts, Vault integration, short-lived credentials) and authorization policies (RBAC) for secure multi-tenant Kubeflow deployments.
- Design and enforce least-privilege access models for data scientists, pipelines, and model-serving endpoints.
- Integrate cluster security controls (namespace isolation, PSP/PSA or equivalent, network policies, admission controllers) with Kubeflow components.
CI/CD and automation
- Build CI/CD pipelines to validate, test, and release Kubeflow manifests, application code, and model-serving images.
- Integrate test automation for functional, security, and smoke tests as part of deployment pipelines.
- Create git-driven workflows (GitOps) for manifests and environment promotion.
Operations, observability, and reliability
- Instrument and monitor Kubeflow and Kubernetes control/data planes (logs, metrics, tracing).
- Implement alerting and runbook documentation for common failure modes and operational tasks.
- Lead post-mortems and continuous improvement of platform reliability and deployment practices.
Collaboration and enablement
- Work closely with data scientists to translate model training and serving requirements into platform capabilities.
- Collaborate with platform, security, and cross-functional teams to align on architecture, policy, and operational standards.
Required skills and experience
- Strong experience with Kubeflow: customization, components, Pipelines, Profiles, Notebook integration, and operational management.
- Familiarity with AI tooling on Kubernetes, including one or more of: LangChain, LangFlow, Spark, Airflow, Kubeflow, MLflow, KServe, Ray.
- Good to have: open-source contributions, particularly in the Kubeflow and Knative communities.
- Deep Kubernetes expertise: cluster architecture, resource management, controllers, CRDs, operators, networking, and storage.
- Proven experience implementing Kubernetes authentication (OIDC, webhook token auth, service accounts) and authorization (RBAC, ABAC, policy enforcement).
- Practical experience with Knative and KServe: custom predictors, scaling behavior, revisions, and annotations for serving models.
- MLOps knowledge: model training, reproducible pipelines, model versioning, deployment patterns, inference scaling and A/B testing.
- CI/CD tooling: building pipelines for build/test/deploy of manifests and container images (Jenkins, GitHub Actions, GitLab CI, Tekton, ArgoCD, etc.).
- Strong troubleshooting and debugging skills for distributed systems and Kubernetes-native apps.
- Excellent communication and collaboration skills for cross-functional teams.
Preferred qualifications
- Experience designing cloud-native architectures and microservices patterns.
- Familiarity with GitOps workflows and tools (ArgoCD, Flux).
- Experience with Helm, Kustomize, and Kubernetes operators for managing manifests at scale.
- Knowledge of container registries, image promotion, and secure image supply chains.
- Experience with monitoring, logging, and tracing stacks (Prometheus, Grafana, etc.).
- Familiarity with secrets management solutions (Vault, Kubernetes ServiceAccounts, External Secrets).
- Prior experience maintaining or contributing to open-source Kubeflow manifests or distributions.
- Desired experience with our repositories
Additional attributes
- Senior-level mindset: proactive, ownership-oriented, and driven to improve platform reliability and developer productivity.
- Comfortable working in ambiguous environments and balancing short-term fixes with long-term platform investments.
- Willingness to mentor and grow the team’s Kubeflow and Kubernetes capabilities.