Conversation

@Fiona-Waters

Adding a blog post introducing the Kubeflow SDK

@kramaranya @andreyvelich @astefanutti

closes #184

Signed-off-by: Fiona Waters <fiwaters6@gmail.com>

Co-authored-by: Anya Kramar <kramaranya15@gmail.com>

@astefanutti left a comment


Thanks @Fiona-Waters awesome write-up!

Only left a few nits.

/lgtm


```python
TrainerClient().train(
    runtime=TrainerClient().get_runtime("torchtune-qwen2.5-1.5b"),
)
```

We may still want to show how to pass the dataset users would want to fine-tune the model with?


Co-authored-by: Antonin Stefanutti <astefanutti@users.noreply.github.com>
Signed-off-by: Anya Kramar <akramar@redhat.com>
@google-oss-prow google-oss-prow bot removed the lgtm label Nov 7, 2025
toc: true
layout: post
comments: true
title: "Introducing the Kubeflow SDK: A Python API for Scalable AI Workloads"

@andreyvelich Nov 7, 2025


I would prefer this title. What do you think @Fiona-Waters @kubeflow/kubeflow-sdk-team @franciscojavierarceo ? Since it is more engaging.

Introducing the Kubeflow SDK: A Python API to Run AI Workloads at Scale


I like it too! We also could be more specific:

Introducing the Kubeflow SDK: A Python API to Manage Distributed AI Workloads at Scale


I would say Manage is more of an "SRE" word (e.g. configuring, monitoring, setup); when we speak about scientists, they usually use Run. What do you think?

Introducing the Kubeflow SDK: A Python API to Run Distributed AI Workloads at Scale


I think we found that golden mean!


It's very good, a small variation:

Introducing the Kubeflow SDK: A Pythonic API to Run AI Workloads at Scale


Introducing the Kubeflow SDK: A Pythonic API to Run AI Workloads at Scale

Sounds good to me, updated in 08e672a, let me know if anyone has any other suggestions


Scaling AI workloads shouldn't require deep expertise in distributed systems and container orchestration. Whether you are prototyping on local hardware or deploying to a production Kubernetes cluster, you need a unified API that abstracts infrastructure complexity while preserving flexibility. That's exactly what the Kubeflow Python SDK delivers.

As a data scientist, you've probably experienced this frustrating journey: you start by prototyping locally, writing your training script on your laptop. When you need more compute power, it's time to rewrite everything for distributed training. You containerize your code and rebuild images for every change. You write Kubernetes YAMLs, struggle with kubectl and switch between different SDKs - one for training, another for hyperparameter tuning, another for pipelines. Each step in this process requires different tools, different APIs, and different mental models. What if there was a better way?

@andreyvelich Nov 7, 2025


I think we should also mention here that this slows down productivity. @Fiona-Waters @astefanutti @kramaranya What do we think about this paragraph?

As a data scientist, you’ve probably experienced this frustrating journey: you start by prototyping locally, training your model on your laptop. When you need more compute power, you have to rewrite everything for distributed training. You containerize your code, rebuild images for every small change, write Kubernetes YAMLs, wrestle with kubectl, and juggle multiple SDKs — one for training, another for hyperparameter tuning, and yet another for pipelines. Each step demands different tools, APIs, and mental models.

All this complexity slows down productivity, drains focus, and ultimately holds back AI innovation. What if there was a better way?


That's a good point, thank you! Added in 08e672a



The Kubeflow community started the Kubeflow SDK & ML Experience Working Group (WG) in order to address these challenges. You can find more information about this WG and the [Kubeflow Community here](https://www.kubeflow.org/docs/about/community/).

Maybe it would be nice to link our YouTube playlist instead of community page, so readers can directly get up to speed: https://youtu.be/VkbVVk2OGUI?list=PLmzRWLV1CK_wSO2IMPnzChxESmaoXNfrY

Comment on lines 54 to 58
- **Unified Experience**: Single SDK to interact with multiple Kubeflow projects through consistent Python APIs
- **Rapid iteration**: Reduced friction between development and production environments
- **Simplified AI Workloads**: Abstract away Kubernetes complexity, allowing AI practitioners to work in familiar Python environments
- **Seamless Integration**: Built to work effortlessly with all Kubeflow projects, enabling end-to-end AI model development at any scale
- **Local Development**: First-class support for local development without a Kubernetes cluster, requiring only a pip install

@andreyvelich Nov 7, 2025


I think we should have a dedicated paragraph saying that the Kubeflow SDK is built for scale.
WDYT @franciscojavierarceo @Fiona-Waters @astefanutti @kramaranya ?

Built for Scale: Seamlessly scale any AI workload — from local laptop to large-scale production cluster with thousands of GPUs using the same APIs.


+1

Comment on lines 56 to 57
- **Simplified AI Workloads**: Abstract away Kubernetes complexity, allowing AI practitioners to work in familiar Python environments
- **Seamless Integration**: Built to work effortlessly with all Kubeflow projects, enabling end-to-end AI model development at any scale

Shall we combine this to a single principle ?


Agree, updated in 08e672a


@andreyvelich left a comment


Thanks!
Overall looks great, I just left remaining small comments.


One-click fine-tuning can also be accomplished without setting any configuration. The default model, dataset, and training configurations are pre-baked into the runtime:
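A plain-Python sketch of the pre-baked-defaults idea (all names here are hypothetical placeholders for illustration, not the SDK's actual internals): the runtime carries default model, dataset, and training settings, and a zero-config call simply falls back to them.

```python
# Illustrative only: how runtime defaults might back a zero-config call.
# Runtime contents and field names are hypothetical, not the real SDK's.
RUNTIME_DEFAULTS = {
    "torchtune-qwen2.5-1.5b": {
        "model": "qwen2.5-1.5b",
        "dataset": "example/dataset",
        "epochs": 1,
    }
}

def resolve_config(runtime, overrides=None):
    """Merge user overrides over the runtime's pre-baked defaults."""
    config = dict(RUNTIME_DEFAULTS[runtime])
    config.update(overrides or {})
    return config

# Zero configuration: everything comes from the runtime.
defaults = resolve_config("torchtune-qwen2.5-1.5b")
# Advanced: override only what you need.
tuned = resolve_config("torchtune-qwen2.5-1.5b", {"epochs": 3})
```

The same pattern explains why both the one-line and the advanced calls can share a single API surface: the user config is strictly additive over the runtime's defaults.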

Can we start with the simple one-line fine-tuning example, and after it show the advanced one where the user can modify the configuration?


Sure, added in 27f0d3d


## Optimizer Client

The OptimizerClient manages hyperparameter optimization for models of any size on Kubernetes. It integrates with TrainerClient — you define your training job template once, specify which parameters to optimize, and the client orchestrates multiple trials to find the best hyperparameter configuration.
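The trial orchestration described above can be sketched in plain Python (a toy random search over a fake objective; none of this is the real OptimizerClient, it only illustrates the sample-run-compare loop the client automates):

```python
import random

# Toy stand-in for a trial's training run: pretend validation score
# that peaks near lr=0.01 and batch_size=32.
def toy_objective(lr, batch_size):
    return 1.0 - abs(lr - 0.01) * 10 - abs(batch_size - 32) / 100

# Search space: each entry samples one hyperparameter for a trial.
search_space = {
    "lr": lambda: random.uniform(0.001, 0.1),
    "batch_size": lambda: random.choice([16, 32, 64]),
}

random.seed(0)  # deterministic for the example
trials = []
for _ in range(10):  # the client would launch these as separate jobs
    params = {name: sample() for name, sample in search_space.items()}
    trials.append((toy_objective(**params), params))

# "get_best_results" reduced to its essence: pick the top-scoring trial.
best_score, best_params = max(trials, key=lambda t: t[0])
```

In the real client the trials run as Kubernetes jobs rather than a local loop, but the define-once, sample, and compare flow is the same.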

We should emphasize that the API is almost identical between the Optimizer and Trainer clients, which significantly enhances the user experience during AI development.


Makes sense, updated ba4dc92

```python
optimizer.wait_for_job_status(job_name)

# Get the best hyperparameters and metrics from an OptimizationJob
optimizer.get_best_results(job_name)
```

Could we show a simple output here for the best result?

Comment on lines 257 to 258
```python
# See all the TrainJobs
TrainerClient().list_jobs()
```

You can get this information from the get_job() API:

```python
job = optimizer.get_job(job_name)
print(job.trials)
```

kramaranya and others added 12 commits November 7, 2025 21:44
Signed-off-by: kramaranya <kramaranya15@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Anya Kramar <akramar@redhat.com>
@kramaranya

@andreyvelich I've addressed the comments, let me know what you think
/hold to add Kubeflow survey link

@andreyvelich

Thanks for this!
/lgtm
/approve
/hold for survey as you said


# Introducing Kubeflow SDK

The SDK sits on top of the Kubeflow ecosystem as a unified interface layer. When you write Python code, the SDK translates it into the appropriate Kubernetes resources — generating CRDs, handling orchestration, and managing distributed communication. You get all the power of Kubeflow and distributed AI compute without needing to understand Kubernetes.
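The translation step can be pictured with a simplified sketch. The field names below are loosely modeled on the Kubeflow Trainer TrainJob resource but should be treated as assumptions for illustration, not the SDK's actual output:

```python
# Simplified sketch: how a high-level train() call might be turned into
# a TrainJob custom resource the SDK submits to Kubernetes.
# apiVersion and field names are assumptions, shown for illustration.
def build_trainjob(name, runtime, num_nodes=1):
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            "runtimeRef": {"name": runtime},
            "trainer": {"numNodes": num_nodes},
        },
    }

# One Python call's worth of YAML the user never has to write by hand.
cr = build_trainjob("fine-tune-demo", "torchtune-qwen2.5-1.5b", num_nodes=2)
```

The point is the shape of the mapping: a few Python arguments on one side, a fully specified custom resource on the other, with the SDK owning everything in between.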


Should we say CRs (custom resources) instead of CRDs?


Yes, thank you!

Signed-off-by: kramaranya <kramaranya15@gmail.com>
@google-oss-prow google-oss-prow bot removed the lgtm label Nov 9, 2025
Signed-off-by: kramaranya <kramaranya15@gmail.com>
@kramaranya

/assign @andreyvelich
I've added a survey link to the post

@andreyvelich

Thanks everyone!
/lgtm
/approve
/hold cancel

@google-oss-prow google-oss-prow bot merged commit f6e66d6 into kubeflow:master Nov 9, 2025
6 checks passed
@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Development

Successfully merging this pull request may close these issues.

Create Blog Post Introducing Kubeflow SDK

5 participants