Conversation

@Fiona-Waters

Adding a blog post introducing the Kubeflow SDK

@kramaranya @andreyvelich @astefanutti

closes #184

Signed-off-by: Fiona Waters <fiwaters6@gmail.com>

Co-authored-by: Anya Kramar <kramaranya15@gmail.com>

@astefanutti left a comment


Thanks @Fiona-Waters awesome write-up!

Only left a few nits.

/lgtm


```python
TrainerClient().train(
    runtime=TrainerClient().get_runtime("torchtune-qwen2.5-1.5b"),
)
```

We may still want to show how to pass the dataset users would want to fine-tune the model with?


Co-authored-by: Antonin Stefanutti <astefanutti@users.noreply.github.com>
Signed-off-by: Anya Kramar <akramar@redhat.com>
@google-oss-prow google-oss-prow bot removed the lgtm label Nov 7, 2025
toc: true
layout: post
comments: true
title: "Introducing the Kubeflow SDK: A Python API for Scalable AI Workloads"

@andreyvelich Nov 7, 2025


I would prefer this title. What do you think @Fiona-Waters @kubeflow/kubeflow-sdk-team @franciscojavierarceo ? Since it is more engaging.

Introducing the Kubeflow SDK: A Python API to Run AI Workloads at Scale


I like it too! We also could be more specific:

Introducing the Kubeflow SDK: A Python API to Manage Distributed AI Workloads at Scale


I would say Manage is more of an "SRE" word (e.g. configuring, monitoring, setup); when we speak about scientists, they usually use Run. What do you think?

Introducing the Kubeflow SDK: A Python API to Run Distributed AI Workloads at Scale


I think we found that golden mean!


It's very good, a small variation:

Introducing the Kubeflow SDK: A Pythonic API to Run AI Workloads at Scale


Introducing the Kubeflow SDK: A Pythonic API to Run AI Workloads at Scale

Sounds good to me, updated in 08e672a, let me know if anyone has any other suggestions


Scaling AI workloads shouldn't require deep expertise in distributed systems and container orchestration. Whether you are prototyping on local hardware or deploying to a production Kubernetes cluster, you need a unified API that abstracts infrastructure complexity while preserving flexibility. That's exactly what the Kubeflow Python SDK delivers.

As a data scientist, you've probably experienced this frustrating journey: you start by prototyping locally, writing your training script on your laptop. When you need more compute power, it's time to rewrite everything for distributed training. You containerize your code and rebuild images for every change. You write Kubernetes YAMLs, struggle with kubectl and switch between different SDKs - one for training, another for hyperparameter tuning, another for pipelines. Each step in this process requires different tools, different APIs, and different mental models. What if there was a better way?

@andreyvelich Nov 7, 2025


I think we should also mention here that this slows down productivity. @Fiona-Waters @astefanutti @kramaranya What do we think about this paragraph?

As a data scientist, you’ve probably experienced this frustrating journey: you start by prototyping locally, training your model on your laptop. When you need more compute power, you have to rewrite everything for distributed training. You containerize your code, rebuild images for every small change, write Kubernetes YAMLs, wrestle with kubectl, and juggle multiple SDKs — one for training, another for hyperparameter tuning, and yet another for pipelines. Each step demands different tools, APIs, and mental models.

All this complexity slows down productivity, drains focus, and ultimately holds back AI innovation. What if there was a better way?


That's a good point, thank you! Added in 08e672a



The Kubeflow community started the Kubeflow SDK & ML Experience Working Group (WG) in order to address these challenges. You can find more information about this WG and the [Kubeflow Community here](https://www.kubeflow.org/docs/about/community/).

Maybe it would be nice to link our YouTube playlist instead of community page, so readers can directly get up to speed: https://youtu.be/VkbVVk2OGUI?list=PLmzRWLV1CK_wSO2IMPnzChxESmaoXNfrY

Comment on lines 54 to 58
- **Unified Experience**: Single SDK to interact with multiple Kubeflow projects through consistent Python APIs
- **Rapid iteration**: Reduced friction between development and production environments
- **Simplified AI Workloads**: Abstract away Kubernetes complexity, allowing AI practitioners to work in familiar Python environments
- **Seamless Integration**: Built to work effortlessly with all Kubeflow projects, enabling end-to-end AI model development at any scale
- **Local Development**: First-class support for local development without a Kubernetes cluster, requiring only a pip install

@andreyvelich Nov 7, 2025


I think we should have a dedicated paragraph saying that the Kubeflow SDK is built for scale.
WDYT @franciscojavierarceo @Fiona-Waters @astefanutti @kramaranya ?

Built for Scale: Seamlessly scale any AI workload — from local laptop to large-scale production cluster with thousands of GPUs using the same APIs.


+1

Comment on lines 56 to 57
- **Simplified AI Workloads**: Abstract away Kubernetes complexity, allowing AI practitioners to work in familiar Python environments
- **Seamless Integration**: Built to work effortlessly with all Kubeflow projects, enabling end-to-end AI model development at any scale

Shall we combine this to a single principle ?


Agree, updated in 08e672a


@andreyvelich left a comment


Thanks!
Overall looks great, I just left remaining small comments.


One-click fine-tuning can also be accomplished without setting any configuration. The default model, dataset, and training configurations are pre-baked into the runtime:
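A plain-Python sketch of the pre-baked-defaults idea (all names here are hypothetical placeholders for illustration, not the SDK's actual internals): the runtime carries default model, dataset, and training settings, and a zero-config call simply falls back to them.

```python
# Illustrative only: how runtime defaults might back a zero-config call.
# Runtime contents and field names are hypothetical, not the real SDK's.
RUNTIME_DEFAULTS = {
    "torchtune-qwen2.5-1.5b": {
        "model": "qwen2.5-1.5b",
        "dataset": "example/dataset",
        "epochs": 1,
    }
}

def resolve_config(runtime, overrides=None):
    """Merge user overrides over the runtime's pre-baked defaults."""
    config = dict(RUNTIME_DEFAULTS[runtime])
    config.update(overrides or {})
    return config

# Zero configuration: everything comes from the runtime.
defaults = resolve_config("torchtune-qwen2.5-1.5b")
# Advanced: override only what you need.
tuned = resolve_config("torchtune-qwen2.5-1.5b", {"epochs": 3})
```

The same pattern explains why both the one-line and the advanced calls can share a single API surface: the user config is strictly additive over the runtime's defaults.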

Can we start with the simple one-line fine-tuning example, and after it show the advanced one where the user can modify the configuration?


Sure, added in 27f0d3d


## Optimizer Client

The OptimizerClient manages hyperparameter optimization for models of any size on Kubernetes. It integrates with TrainerClient — you define your training job template once, specify which parameters to optimize, and the client orchestrates multiple trials to find the best hyperparameter configuration.
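The trial orchestration described above can be sketched in plain Python (a toy random search over a fake objective; none of this is the real OptimizerClient, it only illustrates the sample-run-compare loop the client automates):

```python
import random

# Toy stand-in for a trial's training run: pretend validation score
# that peaks near lr=0.01 and batch_size=32.
def toy_objective(lr, batch_size):
    return 1.0 - abs(lr - 0.01) * 10 - abs(batch_size - 32) / 100

# Search space: each entry samples one hyperparameter for a trial.
search_space = {
    "lr": lambda: random.uniform(0.001, 0.1),
    "batch_size": lambda: random.choice([16, 32, 64]),
}

random.seed(0)  # deterministic for the example
trials = []
for _ in range(10):  # the client would launch these as separate jobs
    params = {name: sample() for name, sample in search_space.items()}
    trials.append((toy_objective(**params), params))

# "get_best_results" reduced to its essence: pick the top-scoring trial.
best_score, best_params = max(trials, key=lambda t: t[0])
```

In the real client the trials run as Kubernetes jobs rather than a local loop, but the define-once, sample, and compare flow is the same.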

We should emphasize that the API is almost identical between the Optimizer and Trainer clients, which significantly enhances the user experience during AI development.


Makes sense, updated ba4dc92

```python
optimizer.wait_for_job_status(job_name)

# Get the best hyperparameters and metrics from an OptimizationJob
optimizer.get_best_results(job_name)
```

Could we show a simple output here for the best result?

Comment on lines 257 to 258
```python
# See all the TrainJobs
TrainerClient().list_jobs()
```

You can get this information from the get_job() API:

```python
job = optimizer.get_job(job_name)
print(job.trials)
```

kramaranya and others added 12 commits November 7, 2025 21:44
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Anya Kramar <akramar@redhat.com>
@kramaranya

@andreyvelich I've addressed the comments, let me know what you think
/hold to add Kubeflow survey link

@andreyvelich

Thanks for this!
/lgtm
/approve
/hold for survey as you said


# Introducing Kubeflow SDK

The SDK sits on top of the Kubeflow ecosystem as a unified interface layer. When you write Python code, the SDK translates it into the appropriate Kubernetes resources — generating CRDs, handling orchestration, and managing distributed communication. You get all the power of Kubeflow and distributed AI compute without needing to understand Kubernetes.
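The translation step can be pictured with a simplified sketch. The field names below are loosely modeled on the Kubeflow Trainer TrainJob resource but should be treated as assumptions for illustration, not the SDK's actual output:

```python
# Simplified sketch: how a high-level train() call might be turned into
# a TrainJob custom resource the SDK submits to Kubernetes.
# apiVersion and field names are assumptions, shown for illustration.
def build_trainjob(name, runtime, num_nodes=1):
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            "runtimeRef": {"name": runtime},
            "trainer": {"numNodes": num_nodes},
        },
    }

# One Python call's worth of YAML the user never has to write by hand.
cr = build_trainjob("fine-tune-demo", "torchtune-qwen2.5-1.5b", num_nodes=2)
```

The point is the shape of the mapping: a few Python arguments on one side, a fully specified custom resource on the other, with the SDK owning everything in between.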


Should we say CRs (custom resources) instead of CRDs?


Yes, thank you!

Signed-off-by: kramaranya <kramaranya15@gmail.com>
@google-oss-prow google-oss-prow bot removed the lgtm label Nov 9, 2025
Signed-off-by: kramaranya <kramaranya15@gmail.com>
@kramaranya

/assign @andreyvelich
I've added a survey link to the post

@andreyvelich

Thanks everyone!
/lgtm
/approve
/hold cancel

@google-oss-prow google-oss-prow bot merged commit f6e66d6 into kubeflow:master Nov 9, 2025
6 checks passed
@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Development

Successfully merging this pull request may close these issues.

Create Blog Post Introducing Kubeflow SDK

5 participants