---
toc: true
layout: post
comments: true
title: "Introducing the Kubeflow SDK: A Python API for Scalable AI Workloads"
hide: false
categories: [sdk, trainer, optimizer]
permalink: /sdk/intro/
author: "Kubeflow SDK Team"
---

# Unified SDK Concept

Scaling AI workloads shouldn't require deep expertise in distributed systems and container orchestration. Whether you are prototyping on local hardware or deploying to a production Kubernetes cluster, you need a unified API that abstracts infrastructure complexity while preserving flexibility. That's exactly what the Kubeflow Python SDK delivers.

As a data scientist, you've probably experienced this frustrating journey: you start by prototyping locally, writing your training script on your laptop. When you need more compute power, it's time to rewrite everything for distributed training. You containerize your code and rebuild images for every change. You write Kubernetes YAML, struggle with kubectl, and switch between different SDKs - one for training, another for hyperparameter tuning, another for pipelines. Each step in this process requires different tools, different APIs, and different mental models. What if there was a better way?

To address these challenges, the Kubeflow community started the Kubeflow SDK & ML Experience Working Group (WG). You can find more information about this WG and the Kubeflow community [here](https://www.kubeflow.org/docs/about/community/).

# Introducing Kubeflow SDK

The SDK sits on top of the Kubeflow ecosystem as a unified interface layer. When you write Python code, the SDK translates it into the appropriate Kubernetes resources — generating CRDs, handling orchestration, and managing distributed coordination. You get all the power of Kubeflow without needing to understand Kubernetes.

Getting started is simple:

```shell
pip install kubeflow
```
```python
from kubeflow.trainer import TrainerClient

def train_model():
    import torch

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.Adam(model.parameters())

    # Training loop
    for epoch in range(10):
        # Your training logic
        pass

    torch.save(model.state_dict(), "model.pt")

# Create a client and train
client = TrainerClient()
client.train(train_func=train_model)
```
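
Notice that `torch` is imported inside `train_model` rather than at the top of the module: the SDK ships the function itself to remote workers, so the function has to be self-contained. As a rough conceptual sketch (not the SDK's actual mechanism), this is similar to sending a function's source to another process and executing it in a fresh namespace:

```python
# Source of a self-contained training function, written the way the SDK
# could ship it: every import lives inside the function body.
source = """
def train_model():
    # Imports live inside the function so the shipped code is self-contained
    import math
    return math.sqrt(16)
"""

# Conceptual sketch: a remote worker receives only this source and executes
# it in a fresh namespace, without your local module context.
namespace = {}
exec(source, namespace)
result = namespace["train_model"]()
print(result)  # 4.0
```

If `train_model` referenced a module-level import instead, the fresh namespace above would raise a `NameError` — which is exactly why the convention is to import inside the function.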

The following principles are the foundation that guides the design and implementation of the SDK:

- **Unified Experience**: Single SDK to interact with multiple Kubeflow projects through consistent Python APIs
- **Rapid Iteration**: Reduced friction between development and production environments
- **Simplified AI Workloads**: Abstract away Kubernetes complexity, allowing AI practitioners to work in familiar Python environments
- **Seamless Integration**: Built to work effortlessly with all Kubeflow projects, enabling end-to-end AI model development at any scale
- **Local Development**: First-class support for local development without a Kubernetes cluster, requiring only a pip installation

## Role in the Kubeflow Ecosystem

The SDK doesn't replace any Kubeflow projects — it provides a unified way to use them. Kubeflow Trainer, Katib, Pipelines, and the other projects still handle the actual workload execution. The SDK makes them easier to interact with through consistent Python APIs, letting you work entirely in the language you already use for ML development.

This creates a clear separation:
- **AI Practitioners** use the SDK to submit jobs and manage workflows through Python, without touching YAML or Kubernetes directly
- **Platform Administrators** continue managing infrastructure — installing components, configuring runtimes, setting resource quotas. Nothing changes on the infrastructure side.

The Kubeflow SDK works with your existing Kubeflow deployment. If you already have Kubeflow Trainer and Katib installed, just `pip install kubeflow` and start using them through the unified interface. As Kubeflow evolves with new components and features, the SDK provides a stable Python layer that adapts alongside the ecosystem.

| Project | Status | Description |
|---------|--------|-------------|
| Kubeflow Trainer | Available ✅ | Train and fine-tune AI models with various frameworks |
| Kubeflow Optimizer | Available ✅ | Hyperparameter optimization |
| Kubeflow Pipelines | Planned 🚧 | Build, run, and track AI workflows |
| Kubeflow Model Registry | Planned 🚧 | Manage model artifacts, versions, and ML artifact metadata |
| Kubeflow Spark Operator | Planned 🚧 | Manage Spark applications for data processing and feature engineering |

# Key Features

## Unified Python Interface

The SDK provides a consistent experience across all Kubeflow components. Whether you're training models or optimizing hyperparameters, the APIs follow the same patterns:

```python
from kubeflow.trainer import TrainerClient
from kubeflow.optimizer import OptimizerClient

# Initialize clients
trainer = TrainerClient()
optimizer = OptimizerClient()

# List jobs
trainer.list_jobs()
optimizer.list_jobs()
```

## Trainer Client

The TrainerClient provides the easiest way to run distributed training on Kubernetes, built on top of [Kubeflow Trainer v2](https://blog.kubeflow.org/trainer/intro/). Whether you're training custom models with PyTorch or fine-tuning LLMs, the client provides a Python API for submitting and monitoring training jobs at scale.

The client works with pre-configured runtimes that Platform Administrators set up. These runtimes define the container images, resource policies, and infrastructure settings. As an AI Practitioner, you reference these runtimes and focus on your training code:

```python
from kubeflow.trainer import TrainerClient, CustomTrainer

def get_torch_dist():
    """Your PyTorch training code runs on each node."""
    import os
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")
    print("PyTorch Distributed Environment")
    print(f"WORLD_SIZE: {dist.get_world_size()}")
    print(f"RANK: {dist.get_rank()}")
    print(f"LOCAL_RANK: {os.environ['LOCAL_RANK']}")

# Create the TrainJob
job_id = TrainerClient().train(
    runtime=TrainerClient().get_runtime("torch-distributed"),
    trainer=CustomTrainer(
        func=get_torch_dist,
        num_nodes=3,
        resources_per_node={
            "cpu": 2,
        },
    ),
)

# Wait for the TrainJob to complete
TrainerClient().wait_for_job_status(job_id)

# Print the TrainJob logs
print("\n".join(TrainerClient().get_job_logs(name=job_id)))
```

The TrainerClient supports `CustomTrainer` for your own training logic and [`BuiltinTrainer`](https://www.kubeflow.org/docs/components/trainer/user-guides/builtin-trainer/torchtune/) for pre-packaged training patterns like LLM fine-tuning. For example, fine-tuning LLMs with TorchTune:

```python
from kubeflow.trainer import TrainerClient, BuiltinTrainer, TorchTuneConfig
from kubeflow.trainer import Initializer, HuggingFaceDatasetInitializer, HuggingFaceModelInitializer
from kubeflow.trainer import TorchTuneInstructDataset, LoraConfig, DataFormat

client = TrainerClient()

client.train(
    runtime=client.get_runtime(name="torchtune-llama3.2-1b"),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data"
        ),
        model=HuggingFaceModelInitializer(
            storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
            access_token="hf_...",
        )
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET,
            ),
            peft_config=LoraConfig(
                apply_lora_to_mlp=True,
                lora_attn_modules=["q_proj", "k_proj", "v_proj", "output_proj"],
                quantize_base=True,
            ),
            resources_per_node={
                "gpu": 1,
            }
        )
    )
)
```

One-click fine-tuning can also be accomplished without setting any configuration. The default model, dataset, and training configurations are pre-baked into the runtime:

```python
TrainerClient().train(
    runtime=TrainerClient().get_runtime("torchtune-qwen2.5-1.5b"),
)
```

Initializers download datasets and models once to shared storage, then all training pods access the data from there — reducing startup time and network usage.
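
The download-once idea can be sketched in plain Python. This is a hypothetical illustration, not the SDK's initializer code: the first caller populates a shared cache directory and drops a marker file, and later callers see the marker and skip the download entirely.

```python
import os
import tempfile

def init_dataset(storage_uri, cache_dir):
    """Hypothetical initializer: download once, then reuse the shared copy."""
    marker = os.path.join(cache_dir, "READY")
    if os.path.exists(marker):
        # Another pod already initialized the cache; skip the download.
        return cache_dir
    # ... download storage_uri into cache_dir here ...
    with open(marker, "w") as f:
        f.write(storage_uri)
    return cache_dir

shared = tempfile.mkdtemp()
first = init_dataset("hf://tatsu-lab/alpaca/data", shared)   # performs the "download"
second = init_dataset("hf://tatsu-lab/alpaca/data", shared)  # cache hit
print(first == second)  # True
```

In the real system the cache directory lives on shared storage mounted into every training pod, which is what saves the repeated network transfers.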

For more details about Kubeflow Trainer capabilities, including gang-scheduling, fault tolerance, and MPI support, check out the [Kubeflow Trainer v2 blog post](https://blog.kubeflow.org/trainer/intro/).

## Optimizer Client

The OptimizerClient manages hyperparameter optimization for models of any size on Kubernetes. It integrates with TrainerClient — you define your training job template once, specify which parameters to optimize, and the client orchestrates multiple trials to find the best hyperparameter configuration.

The client launches trials in parallel according to your resource constraints, tracks metrics across experiments, and identifies optimal parameters.

First, define your training job template:

```python
from kubeflow.trainer import TrainerClient, CustomTrainer
from kubeflow.optimizer import OptimizerClient, TrainJobTemplate, Search, TrialConfig

def train_func(learning_rate: float, batch_size: int):
    """Training function with hyperparameters."""
    # Your training code here
    import time
    import random

    for i in range(10):
        time.sleep(1)
        print(f"Training {i}, lr: {learning_rate}, batch_size: {batch_size}")

    print(f"loss={round(random.uniform(0.77, 0.99), 2)}")

# Create a reusable template
template = TrainJobTemplate(
    trainer=CustomTrainer(
        func=train_func,
        func_args={"learning_rate": "0.01", "batch_size": "16"},
        num_nodes=2,
        resources_per_node={"gpu": 1},
    ),
    runtime=TrainerClient().get_runtime("torch-distributed"),
)

# Verify that your TrainJob works with test hyperparameters.
TrainerClient().train(**template)
```

Then optimize hyperparameters with a single call:

```python
optimizer = OptimizerClient()

job_name = optimizer.optimize(
    # The same template can be used for hyperparameter optimization
    trial_template=template,
    search_space={
        "learning_rate": Search.loguniform(0.001, 0.1),
        "batch_size": Search.choice([16, 32, 64, 128]),
    },
    trial_config=TrialConfig(
        num_trials=20,
        parallel_trials=4,
        max_failed_trials=5,
    ),
)

# Verify the OptimizationJob was created
optimizer.get_job(job_name)

# Wait for the OptimizationJob to complete
optimizer.wait_for_job_status(job_name)

# Get the best hyperparameters and metrics from the OptimizationJob
optimizer.get_best_results(job_name)

# See all the TrainJobs
TrainerClient().list_jobs()
```

This creates multiple TrainJob instances (trials) with different hyperparameter combinations, executes them in parallel based on available resources, and tracks which parameters produce the best results. Each trial is a full training job managed by Kubeflow Trainer. Using the [Katib UI](https://www.kubeflow.org/docs/components/katib/user-guides/katib-ui/), you can visualize your optimization with an interactive graph that shows metric performance against hyperparameter values across all trials.
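
To build intuition for the search space above: `Search.loguniform(0.001, 0.1)` samples values uniformly in log space, so each order of magnitude is equally likely — a sensible choice for learning rates. Here is a plain-Python sketch of that sampling behavior (an illustration, not the SDK's implementation):

```python
import math
import random

def loguniform(low, high):
    """Sample uniformly in log space between low and high."""
    return math.exp(random.uniform(math.log(low), math.log(high)))

random.seed(0)
samples = [loguniform(0.001, 0.1) for _ in range(10_000)]

# Every sample stays inside the requested range...
print(min(samples) >= 0.001 and max(samples) <= 0.1)  # True
# ...and roughly half fall in each decade (0.001-0.01 vs 0.01-0.1),
# whereas a plain uniform draw would put ~90% of samples above 0.01.
below = sum(s < 0.01 for s in samples) / len(samples)
print(0.45 < below < 0.55)  # True
```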

For more details about hyperparameter optimization, check out the [OptimizerClient KEP](https://github.com/kubeflow/sdk/blob/main/docs/proposals/optimizer-client.md).

## Local Execution Mode

Local Execution Mode provides backend flexibility while maintaining full API compatibility with the Kubernetes backend, substantially reducing friction for AI practitioners when developing and iterating.

Choose the right execution environment for your stage of development:

### Local Process Backend: Fastest Iteration

The Local Process Backend is your starting point for ML development - offering the fastest possible iteration cycle with zero infrastructure overhead. This backend executes your training code directly as a Python subprocess on your local machine, bypassing containers, orchestration, and network complexity entirely.

```python
from kubeflow.trainer import TrainerClient
from kubeflow.trainer.backends.localprocess import LocalProcessBackendConfig

config = LocalProcessBackendConfig()
client = TrainerClient(config)

# Runs directly on your machine - no containers, no cluster
client.train(train_func=train_model)
```
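
Conceptually, "executing your training code as a subprocess" looks like the following plain-Python sketch (a simplification, not the backend's actual code): the function's source is written to a temporary script and run with the same interpreter.

```python
import subprocess
import sys
import tempfile

# Source of a tiny, self-contained training function plus a call to it.
script = """
def train_model():
    print("epoch done")

train_model()
"""

# Write the script to a temp file and execute it with the current
# interpreter, the way a local-process backend might launch training.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(script)
    path = f.name

result = subprocess.run([sys.executable, path], capture_output=True, text=True)
print(result.stdout.strip())  # epoch done
```

Because the child process is a plain interpreter, startup is near-instant — which is what makes this backend the fastest iteration loop of the three.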

### Container Backend: Production-Like Environment

The Container Backend bridges the gap between local development and production deployment by bringing production parity to your laptop. This backend executes your training code inside containers (using Docker or Podman), ensuring that your development environment matches your production environment - same dependencies, same Python version, same system libraries, same everything.

Docker example:

```python
from kubeflow.trainer import TrainerClient
from kubeflow.trainer.backends.container import ContainerBackendConfig

config = ContainerBackendConfig(
    container_runtime="docker",
    auto_remove=True  # Clean up containers after completion
)

client = TrainerClient(config)

# Launch 2-node distributed training locally
client.train(train_func=train_model, num_nodes=2)
```

Podman example:

```python
from kubeflow.trainer import TrainerClient
from kubeflow.trainer.backends.container import ContainerBackendConfig

config = ContainerBackendConfig(
    container_runtime="podman",
    auto_remove=True
)

client = TrainerClient(config)
client.train(train_func=train_model, num_nodes=2)
```

### Kubernetes Backend: Production Scale

The Kubernetes Backend takes the Kubeflow SDK to production scale - letting you deploy the exact same training code you developed locally to a production Kubernetes cluster with massive computational resources. This backend transforms your simple `client.train()` call into a full-fledged distributed training job managed by Kubeflow Trainer, complete with fault tolerance, resource scheduling, and cluster-wide orchestration.

Kubernetes example:

```python
from kubeflow.trainer import TrainerClient
from kubeflow.trainer.backends.kubernetes import KubernetesBackendConfig

config = KubernetesBackendConfig(
    namespace="ml-training",
)

client = TrainerClient(config)

# Scales to hundreds of nodes - the same code you tested locally
client.train(
    train_func=train_model,
    num_nodes=100,
    packages_to_install=["torch", "transformers"]
)
```

# What's Next?

We're just getting started. The Kubeflow SDK currently supports Trainer and Optimizer, but the vision is much bigger — a unified Python interface for the entire [Cloud Native AI Lifecycle](https://www.kubeflow.org/docs/started/architecture/#kubeflow-projects-in-the-ai-lifecycle).

Here's what's on the horizon:

- [**Pipelines Integration**](https://github.com/kubeflow/sdk/issues/125): A PipelinesClient to build end-to-end ML workflows. Pipelines will reuse the core Kubeflow SDK primitives for training, optimization, and deployment in a single pipeline. The Kubeflow SDK will also power [KFP core components](https://github.com/kubeflow/pipelines-components)
- [**Model Registry Integration**](https://github.com/kubeflow/sdk/issues/59): Seamlessly manage model artifacts and versions across the training and serving lifecycle
- [**Spark Operator Integration**](https://github.com/kubeflow/sdk/issues/107): Data processing and feature engineering through a SparkClient interface
- [**Documentation**](https://github.com/kubeflow/sdk/issues/50): Full Kubeflow SDK documentation with guides, examples, and API references
- [**Local Execution for Optimizer**](https://github.com/kubeflow/sdk/issues/153): Run hyperparameter optimization experiments locally before scaling to Kubernetes
- [**Workspace Snapshots**](https://github.com/kubeflow/sdk/issues/48): Capture your entire development environment and reproduce it in distributed training jobs
- [**Multi-Cluster Support**](https://github.com/kubeflow/sdk/issues/23): Manage training jobs across multiple Kubernetes clusters from a single SDK interface
- [**Distributed Data Cache**](https://github.com/kubeflow/trainer/issues/2655): In-memory caching for large datasets via initializer SDK configuration
- [**Additional Built-in Trainers**](https://github.com/kubeflow/trainer/issues/2752): Support for more fine-tuning frameworks beyond TorchTune — Unsloth, Axolotl, and others

The community is driving these features forward. If you have ideas, feedback, or want to contribute, we'd love to hear from you!

# Get Involved

The Kubeflow SDK is built by and for the community. We welcome contributions, feedback, and participation from everyone!

**Resources**:
- [GitHub](https://github.com/kubeflow/sdk)
- [Design Proposal: Review our Kubeflow SDK KEP](https://github.com/kubeflow/sdk/blob/main/docs/proposals/kubeflow-sdk.md)

**Connect with the Community**:
- Join [#kubeflow-ml-experience](https://kubeflow.slack.com/archives/C078ZMRQPB6) on [CNCF Slack](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels)
- Attend the [Kubeflow SDK and ML Experience WG](https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit) meetings
- Check out [good first issues](https://github.com/kubeflow/sdk/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) to get started