Add Kubeflow SDK intro #185
Signed-off-by: Fiona Waters <fiwaters6@gmail.com> Co-authored-by: Anya Kramar <kramaranya15@gmail.com>
astefanutti left a comment
```python
TrainerClient().train(
    runtime=TrainerClient().get_runtime("torchtune-qwen2.5-1.5b"),
```
We may still want to show how to pass the dataset users would want to fine-tune the model with?
What about https://github.com/kubeflow/blog/pull/185/files#diff-23eeb9ef1f6735818ea1636add68885b739791a248df8372663d3cd133dfb724R150-R157, or would you like to have more details on initializers?
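For readers skimming this thread, a minimal sketch of what passing a dataset could look like, assuming the initializer API from Kubeflow Trainer v2; the `HuggingFaceDatasetInitializer` type and the `tatsu-lab/alpaca` dataset below are illustrative assumptions, not taken from this PR:

```python
from kubeflow.trainer import (
    TrainerClient,
    Initializer,
    HuggingFaceDatasetInitializer,  # assumed initializer type
)

client = TrainerClient()

# Fine-tune the runtime's default model on a user-provided dataset.
client.train(
    runtime=client.get_runtime("torchtune-qwen2.5-1.5b"),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca",  # illustrative dataset
        )
    ),
)
```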
Co-authored-by: Antonin Stefanutti <astefanutti@users.noreply.github.com> Signed-off-by: Anya Kramar <akramar@redhat.com>
```yaml
toc: true
layout: post
comments: true
title: "Introducing the Kubeflow SDK: A Python API for Scalable AI Workloads"
```
I would prefer this title, since it is more engaging. What do you think @Fiona-Waters @kubeflow/kubeflow-sdk-team @franciscojavierarceo ?

Introducing the Kubeflow SDK: A Python API to Run AI Workloads at Scale
I like it too! We also could be more specific:

Introducing the Kubeflow SDK: A Python API to Manage Distributed AI Workloads at Scale
I would say "Manage" is more of an "SRE" word (e.g. configuring, monitoring, setup); if we are speaking about scientists, they usually use "Run". What do you think?

Introducing the Kubeflow SDK: A Python API to Run Distributed AI Workloads at Scale
I think we found the golden mean!
It's very good, a small variation:
Introducing the Kubeflow SDK: A Pythonic API to Run AI Workloads at Scale
Introducing the Kubeflow SDK: A Pythonic API to Run AI Workloads at Scale
Sounds good to me, updated in 08e672a, let me know if anyone has any other suggestions
Scaling AI workloads shouldn't require deep expertise in distributed systems and container orchestration. Whether you are prototyping on local hardware or deploying to a production Kubernetes cluster, you need a unified API that abstracts infrastructure complexity while preserving flexibility. That's exactly what the Kubeflow Python SDK delivers.
As a data scientist, you've probably experienced this frustrating journey: you start by prototyping locally, writing your training script on your laptop. When you need more compute power, it's time to rewrite everything for distributed training. You containerize your code and rebuild images for every change. You write Kubernetes YAMLs, struggle with kubectl and switch between different SDKs - one for training, another for hyperparameter tuning, another for pipelines. Each step in this process requires different tools, different APIs, and different mental models. What if there was a better way?
I think we should also mention here how this slows down productivity. @Fiona-Waters @astefanutti @kramaranya What do we think about this paragraph?
As a data scientist, you’ve probably experienced this frustrating journey: you start by prototyping locally, training your model on your laptop. When you need more compute power, you have to rewrite everything for distributed training. You containerize your code, rebuild images for every small change, write Kubernetes YAMLs, wrestle with kubectl, and juggle multiple SDKs — one for training, another for hyperparameter tuning, and yet another for pipelines. Each step demands different tools, APIs, and mental models.
All this complexity slows down productivity, drains focus, and ultimately holds back AI innovation. What if there was a better way?
That's a good point, thank you! Added in 08e672a
The Kubeflow community started the Kubeflow SDK & ML Experience Working Group (WG) in order to address these challenges. You can find more information about this WG and the [Kubeflow Community here](https://www.kubeflow.org/docs/about/community/).
Maybe it would be nice to link our YouTube playlist instead of the community page, so readers can directly get up to speed: https://youtu.be/VkbVVk2OGUI?list=PLmzRWLV1CK_wSO2IMPnzChxESmaoXNfrY
- **Unified Experience**: Single SDK to interact with multiple Kubeflow projects through consistent Python APIs
- **Rapid iteration**: Reduced friction between development and production environments
- **Simplified AI Workloads**: Abstract away Kubernetes complexity, allowing AI practitioners to work in familiar Python environments
- **Seamless Integration**: Built to work effortlessly with all Kubeflow projects, enabling end-to-end AI model development at any scale
- **Local Development**: First-class support for local development without a Kubernetes cluster, requiring only a pip installation
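As a rough illustration of that last principle, a sketch of the pip-only local workflow; the `LocalProcessBackendConfig` name is an assumption about how the SDK selects its local backend, not something confirmed in this thread:

```python
# Installed with just `pip install kubeflow`; no Kubernetes cluster required.
from kubeflow.trainer import TrainerClient, CustomTrainer
from kubeflow.trainer import LocalProcessBackendConfig  # assumed local backend name

def train_fn():
    # Any plain Python training function works here.
    print("Training locally with the same API as on a cluster")

# Point the same client API at local processes instead of a cluster.
client = TrainerClient(backend_config=LocalProcessBackendConfig())
client.train(trainer=CustomTrainer(func=train_fn))
```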
I think we should have a dedicated paragraph which says that the Kubeflow SDK is built for scale.
WDYT @franciscojavierarceo @Fiona-Waters @astefanutti @kramaranya ?
Built for Scale: Seamlessly scale any AI workload — from local laptop to large-scale production cluster with thousands of GPUs using the same APIs.
+1
- **Simplified AI Workloads**: Abstract away Kubernetes complexity, allowing AI practitioners to work in familiar Python environments
- **Seamless Integration**: Built to work effortlessly with all Kubeflow projects, enabling end-to-end AI model development at any scale
Shall we combine these into a single principle?
Agree, updated in 08e672a
andreyvelich left a comment
Thanks!
Overall looks great, I just left a few remaining small comments.
One-click fine-tuning can also be accomplished without any configuration being set. The default model, dataset, and training configurations are pre-baked into the runtime:
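(The code block that followed isn't captured in this excerpt; presumably it is just the call quoted earlier with no overrides, something like the sketch below.)

```python
from kubeflow.trainer import TrainerClient

# Model, dataset and training configuration all come from the runtime defaults.
TrainerClient().train(runtime=TrainerClient().get_runtime("torchtune-qwen2.5-1.5b"))
```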
Can we start with the simple one-line fine-tuning example, and after it show the advanced one where the user can modify the configuration?
Sure, added in 27f0d3d
## Optimizer Client

The OptimizerClient manages hyperparameter optimization for models of any size on Kubernetes. It integrates with TrainerClient — you define your training job template once, specify which parameters to optimize, and the client orchestrates multiple trials to find the best hyperparameter configuration.
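For context, a rough sketch of that flow, mirroring the TrainerClient style; only `wait_for_job_status()` and `get_best_results()` appear later in this thread, so the `optimize()` call, the `Search` helper, and the module path below are assumptions about the API shape:

```python
from kubeflow.optimizer import OptimizerClient, Search  # module path assumed

optimizer = OptimizerClient()

# Declare which hyperparameters to search over; the training job template
# itself is defined once, exactly as it would be for TrainerClient.
job_name = optimizer.optimize(
    runtime=optimizer.get_runtime("torchtune-qwen2.5-1.5b"),  # assumed, mirrors TrainerClient
    search_space={
        "learning_rate": Search.loguniform(1e-5, 1e-2),  # assumed helper
    },
)

optimizer.wait_for_job_status(job_name)
print(optimizer.get_best_results(job_name))
```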
We should emphasize that the API is almost identical between the Optimizer and Trainer clients, which significantly enhances the user experience during AI development.
Makes sense, updated in ba4dc92
```python
optimizer.wait_for_job_status(job_name)

# Get the best hyperparameters and metrics from an OptimizationJob
optimizer.get_best_results(job_name)
```
Could we show a simple output here for the best result?
```python
# See all the TrainJobs
TrainerClient().list_jobs()
```
You can get this information from the get_job() API:

```python
job = optimizer.get_job(job_name)
print(job.trials)
```
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Anya Kramar <akramar@redhat.com>
@andreyvelich I've addressed the comments, let me know what you think

Thanks for this!
# Introducing Kubeflow SDK
The SDK sits on top of the Kubeflow ecosystem as a unified interface layer. When you write Python code, the SDK translates it into the appropriate Kubernetes resources — generating CRDs, handling orchestration, and managing distributed communication. You get all the power of Kubeflow and distributed AI compute without needing to understand Kubernetes.
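To make that concrete, a small sketch; the `get_job()` call mirrors the one shown earlier in this thread, while the `.status` field is an assumption:

```python
from kubeflow.trainer import TrainerClient

client = TrainerClient()

# One Python call; under the hood the SDK generates the corresponding
# Kubernetes custom resources (e.g. a TrainJob CR for Kubeflow Trainer)
# and handles orchestration, so no YAML or kubectl is needed.
job_name = client.train(runtime=client.get_runtime("torchtune-qwen2.5-1.5b"))

# Inspect the resulting job through the same Python API.
job = client.get_job(job_name)
print(job.status)  # field name assumed
```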
Should we say CRs (custom resources) instead of CRDs?
Yes, thank you!
Signed-off-by: kramaranya <kramaranya15@gmail.com>
/assign @andreyvelich

Thanks everyone!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andreyvelich. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
Adding a blog post introducing the Kubeflow SDK
@kramaranya @andreyvelich @astefanutti
closes #184