Skip to content

Conversation

peteski22
Copy link
Contributor

What's changing

After discussions with @njbrake we decided that the structure of the compiled.json file probably needs to be adjusted in order to allow users to see metrics/parameters 'per job' that runs.

This PR changes the way this is done so that the compiled results store the job (run) ID as the key along with the results for that job as the values.

  • Updated get_job, _compile_metrics and _compile_parameters
  • Add exception for workflow run not found
  • Added unit tests

TODO: Still work in progress, we will need to make changes to the front end etc. to support this.

If this PR is related to an issue or closes one, please link it here.

Refs #...
Closes #...

How to test it

Steps to test the changes:

Additional notes for reviewers

Anything you'd like to add to help the reviewer understand the changes you're proposing.

I already...

  • Tested the changes in a working environment to ensure they work as expected
  • Added some tests for any new functionality
  • Updated the documentation (both comments in code and product documentation under /docs)
  • Checked if a (backend) DB migration step was required and included it if required

Updated get_job, _compile_metrics and _compile_parameters

Add exception for workflow run not found

Added unit tests
@peteski22 peteski22 requested a review from njbrake March 21, 2025 18:49
Copy link
Contributor

@njbrake njbrake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you for the logic simplification and improvement 💪 . I think the frontend code may need to be adjusted to accommodate this change, but I don't think anything needs to change in the SDK 👍

Copy link
Contributor

@njbrake njbrake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about why the get_job was turned into an async function, but other than that no issues. 😄

"""Delete an experiment. Although Mflow has a delete_experiment method,
We will use the functions of this class instead, so that we make sure we correctly
clean up all the artifacts/runs/etc. associated with the experiment.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What code calls get_job, and what is the impact of changing this to async?

Copy link
Contributor Author

@peteski22 peteski22 Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey hey 😄, good question, let me try to explain the reasoning/history:

This is to align with main as the interface has been updated to make all the methods async.

Question: "Why do all the tracking interface methods need to be async?"

Answer: Currently, they don't ... but ...

Background:

In changing the implementation of the S3 provider from boto3 to s3fs, we had to call some async library methods to have similar functionality.

Initially we could just change the implementation of MLflowTrackingClient to use an async method, but the idea of the tracking client interface is that code can accept it as a parameter for dependency injection and 'not care' about implementation details.

In this case, the TrackingClient wouldn't suggest that consuming code should await anything it does, which will lead to some interesting return values.

Because we didn't want that, we updated the interface and the implementation to mark the methods in question as async, this then meant that the methods that made calls to the interface method, became async, and so on up the chain.

(Side note: an alternative is trying to semi-manually handle the async stuff in sync code using asyncio calls (like run) but I think they come with their own foot guns, and run is usually to bootstrap the entry-point into the async app, so I say just use the native language syntax to specify intention in the functions even if that means it pervades the calls up the stack)

Question: "So what does that have to do with non-async methods now being async?"

Answer: TL;DR:

  • Future proofing code refactoring
  • Allows underlying code to be sync or async
  • Type hinting for the IDE to remind people to await things
  • Interface doesn't lie about implementations

There's a small overhead of adding async/await to code/calls but it's negligible vs the benefits. (I believe)

Because we saw some of the pain in making that change, having to find all the call sites and update them to be async, up the stack.. It makes sense to pre-emptively mark the other methods as async.

Hope this helps!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants