
Conversation

Contributor

@sicoyle sicoyle commented Sep 23, 2025

This PR adds tracing instrumentation to the 05-multi-agent-workflows quickstart so I can investigate orchestrator behavior. While implementing workflow improvements, I discovered that removing global chat history in a previous PR changed orchestrator semantics and broke some orchestrator flows.

I removed global chat history to let in-flight workflows resume after app restarts and to avoid incorrectly interleaving messages (which produced invalid chat history and LLM provider errors). I ran the quickstarts after that change, but for orchestrators I only checked for workflow completion logs and killed the run — that was not sufficient and masked context-related failures that we see now.

Previously, orchestrators used a two-step approach:

  1. Broadcast a message to all agents (updating a global conversation history).

  2. Send a TriggerAction to the specific agent that should respond.

With global chat history gone, we need to decide how orchestrators should handle context going forward from an event-driven perspective:

I went with a two-step approach:

  • First event: orchestrator broadcasts context to all agents (agents store/update their local view of the workflow/conversation). This does not trigger the entire workflow, so agents do not respond at this point.

  • Second event: orchestrator sends a TriggerAction to the target agent to act.
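The two events above can be sketched roughly like this (the class and method names, e.g. `on_broadcast` and `on_trigger`, are illustrative, not the actual dapr-agents API):

```python
# Illustrative sketch of the two-step orchestrator flow: broadcast context
# first (no response), then trigger exactly one agent to act.
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    history: list = field(default_factory=list)  # agent's local view of the conversation

    def on_broadcast(self, message: str) -> None:
        # First event: store/update context only -- do not respond.
        self.history.append(message)

    def on_trigger(self) -> str:
        # Second event: only the targeted agent acts, using its local history.
        return f"{self.name} responding with context: {self.history[-1]}"

def orchestrate(agents: dict, message: str, target: str) -> str:
    for agent in agents.values():       # first event: broadcast to all agents
        agent.on_broadcast(message)
    return agents[target].on_trigger()  # second event: trigger the target agent

agents = {n: Agent(n) for n in ("hobbit", "wizard", "elf")}
reply = orchestrate(agents, "Plan step 1.1", "wizard")
```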

This PR so far:

  • Improves parts of the orchestrator workflow definition.

  • Updates prompts where the orchestrator was skipping steps or producing unexpected behavior.

  • Adds tracing setup/fixes for orchestrators (still WIP).

  • Fixes orchestrators to behave as expected.

  • Adds an internal trigger action while still supporting the external TriggerAction, so orchestrators that trigger agents do not also trigger themselves, and users can still trigger agents via pub/sub as expected.

  • Adds a session_id to orchestrator context, so if a run is stopped before the plan completes and rerun with the same session, it picks up where it left off.

(Screenshot: the run starts with substep 1.1.)

Then I let it run; it completes 1.1, 1.2, and 1.3, so step 2 is next. When I restart the orchestrator, it picks up where it left off.
(Screenshot: the restarted run resumes at step 2.)

Note: This only fixes the LLM orchestrator. The random and round-robin orchestrators still need the same review.

Signed-off-by: Samantha Coyle <sam@diagrid.io>
@sicoyle sicoyle marked this pull request as ready for review October 1, 2025 20:30
@sicoyle sicoyle changed the title [WIP - pls ignore!] feat: wip on orchestrator state fixing + tracing feat: wip on orchestrator state fixing + tracing Oct 1, 2025
@sicoyle sicoyle changed the title feat: wip on orchestrator state fixing + tracing feat: fix llm orchestrator + tracing + sessions for long-term memory Oct 1, 2025
Collaborator

@Cyb3rWard0g Cyb3rWard0g left a comment


LGTM. Thank you @sicoyle!

if isinstance(msg, dict):
long_term_memory_messages.append(msg)
elif hasattr(msg, "model_dump"):
long_term_memory_messages.append(msg.model_dump())
Contributor


Is it possible for this to be any other type? Maybe we should log it if we get something that is neither a dict nor a Pydantic model; again, not sure if that's possible.

Contributor Author


Given our current setup, I think this covers it. The state layer maintains a dictionary, so most of the time it's a dict; everything else is a Pydantic model.
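The reviewer's logging suggestion could look roughly like this (a hypothetical helper for illustration, not the merged code):

```python
# Sketch: normalize a message to a dict, and log anything that is neither
# a dict nor a Pydantic-style object instead of silently dropping it.
import logging

logger = logging.getLogger(__name__)

def normalize_message(msg):
    if isinstance(msg, dict):
        return msg
    if hasattr(msg, "model_dump"):  # Pydantic v2 models
        return msg.model_dump()
    logger.warning("Skipping message of unexpected type %s", type(msg).__name__)
    return None

messages = [{"role": "user", "content": "hi"}, 42]  # 42 simulates a bad entry
normalized = [m for m in (normalize_message(x) for x in messages) if m is not None]
```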

asyncio.set_event_loop(loop)
return loop.run_until_complete(context_wrapped_coro())
except Exception as e:
logger.warning(
Contributor


This section, and maybe the whole context wrapper, needs to be analyzed in the context of where it is executed in workflows and/or activities. It is strange for new_event_loop to fail; it might indicate another subtle bug.

Contributor Author


This part of the code manages OpenTelemetry context across thread boundaries. I'll add a TODO comment on this, but I think once we get to the point of using trace context propagation through Dapr, all of this will be resolved.
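For illustration, here is a stdlib-only sketch of why such a wrapper exists: OpenTelemetry keeps its active context in `contextvars`, which do not automatically follow a coroutine onto a new event loop in another thread, so the caller's context has to be captured and the coroutine run inside it (the helper names here are hypothetical):

```python
# Sketch: carry the caller's ContextVars (where OTel stores its active
# span) into a fresh event loop running on a worker thread.
import asyncio
import contextvars
import threading

trace_ctx = contextvars.ContextVar("trace_ctx", default=None)

def run_in_thread(coro_factory):
    captured = contextvars.copy_context()  # snapshot the caller's context
    result = {}

    def worker():
        loop = asyncio.new_event_loop()
        try:
            asyncio.set_event_loop(loop)
            # Running inside the captured context means the Task created by
            # run_until_complete inherits the caller's ContextVars.
            result["value"] = captured.run(loop.run_until_complete, coro_factory())
        finally:
            loop.close()

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["value"]

async def read_ctx():
    return trace_ctx.get()

trace_ctx.set("span-123")
value = run_in_thread(read_ctx)  # sees "span-123" despite the thread hop
```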

)

# Parse the response - now we get a Pydantic model directly
if hasattr(response, "choices") and response.choices:
Contributor

@filintod filintod Oct 2, 2025


Do we need this choices check? Isn't this deprecated by now?

Contributor Author


Yes, we need choices: it is part of the Conversation API response, matching the OpenAI tool-calling format.

)

# Parse the response - now we get a Pydantic model directly
if hasattr(progress_response, "choices") and progress_response.choices:
Contributor


also here "if" not using raw data anymore

Contributor Author


The if depends on the LLM provider / API version. Most support the choices field in the OpenAI tool-calling format, but not all do, which is why I use the if.
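A hedged sketch of that defensive parsing (the response shapes and the `extract_content` helper are illustrative, not the actual code):

```python
# Sketch: handle both OpenAI-style response objects (with .choices) and
# providers that return a plain dict without that field.
from types import SimpleNamespace

def extract_content(response):
    # Providers following the OpenAI tool-calling shape expose .choices.
    if hasattr(response, "choices") and response.choices:
        return response.choices[0].message.content
    # Fallback for providers that return a raw dict.
    if isinstance(response, dict):
        return response.get("content")
    return None

openai_style = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="hello"))]
)
from_object = extract_content(openai_style)
from_dict = extract_content({"content": "raw"})
```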

Contributor

@filintod filintod left a comment


LGTM to make it work. We should get together at some point and redesign some aspects of this, and we definitely need to add comprehensive tests for the coordination mechanism.

@yaron2 yaron2 merged commit d4b46a6 into dapr:main Oct 2, 2025
6 checks passed
@sicoyle sicoyle deleted the feat-orchestrator-wf-improvements branch October 2, 2025 17:02