feat: fix llm orchestrator + tracing + sessions for long-term memory #213
Conversation
LGTM. Thank you @sicoyle!
```python
if isinstance(msg, dict):
    long_term_memory_messages.append(msg)
elif hasattr(msg, "model_dump"):
    long_term_memory_messages.append(msg.model_dump())
```
Is it possible for this to be any other type? Or maybe we should log it if we get something that is neither a dict nor a pydantic model. Again, not sure if that's even possible.
Given our current setup, I think this is it. The state layer maintains a dictionary, so most of the time it's that; everything else is a pydantic model.
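A minimal sketch of the normalization with the suggested logging fallback; the helper name and logger here are illustrative, not the actual code in this PR:

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

def normalize_messages(messages: list[Any]) -> list[dict]:
    """Coerce stored messages to plain dicts for long-term memory."""
    long_term_memory_messages: list[dict] = []
    for msg in messages:
        if isinstance(msg, dict):
            # The state layer already stores plain dicts most of the time.
            long_term_memory_messages.append(msg)
        elif hasattr(msg, "model_dump"):
            # Everything else in the current setup is a pydantic model.
            long_term_memory_messages.append(msg.model_dump())
        else:
            # Not expected today, but log it so a new type doesn't pass silently.
            logger.warning("Skipping message of unexpected type: %s", type(msg).__name__)
    return long_term_memory_messages
```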
```python
    asyncio.set_event_loop(loop)
    return loop.run_until_complete(context_wrapped_coro())
except Exception as e:
    logger.warning(
```
This section, and maybe the whole context wrapping, needs to be analyzed in the context of where it is executed in workflows and/or activities. It is strange for new_event_loop to fail; it might indicate another subtle bug.
This part of the code is managing OpenTelemetry context across thread boundaries. I'll add a TODO comment on this, but I think once we get to the point of using trace context propagation through Dapr, all of this will be resolved.
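For reference, a minimal sketch of the pattern described above: capture the OpenTelemetry context on the calling thread and re-attach it inside the coroutine running on a fresh event loop (the helper name `run_coro_in_new_loop` is illustrative, not the function in this PR):

```python
import asyncio
import logging
from opentelemetry import context as otel_context

logger = logging.getLogger(__name__)

def run_coro_in_new_loop(coro):
    """Run a coroutine on a new event loop while preserving the caller's OTel context."""
    captured = otel_context.get_current()  # capture on the calling thread

    async def context_wrapped_coro():
        token = otel_context.attach(captured)  # re-attach inside the new loop
        try:
            return await coro
        finally:
            otel_context.detach(token)

    loop = asyncio.new_event_loop()
    try:
        asyncio.set_event_loop(loop)
        return loop.run_until_complete(context_wrapped_coro())
    except Exception as e:
        # TODO: revisit once trace context propagation goes through Dapr.
        logger.warning("Failed to run coroutine with propagated context: %s", e)
        raise
    finally:
        asyncio.set_event_loop(None)
        loop.close()
```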
```python
)

# Parse the response - now we get a Pydantic model directly
if hasattr(response, "choices") and response.choices:
```
Do we need this `choices`? Isn't it deprecated by now?
Yes, we need `choices`: it is part of the conversation API response matching the OpenAI tool-calling format.
```python
)

# Parse the response - now we get a Pydantic model directly
if hasattr(progress_response, "choices") and progress_response.choices:
```
Also here: is the `if` still needed now that we're not using raw data anymore?
The `if` depends on the LLM provider / API version in use. Most do support the `choices` field with the OpenAI tool-calling format, but not all do, so that's why I use the `if`.
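A hedged sketch of that defensive parsing: tolerate responses without `choices` while handling the OpenAI-style tool-calling shape when it is present (the fallback attribute names are assumptions for illustration, not the exact ones in this PR):

```python
def extract_reply(response):
    """Pull the assistant message content and any tool calls from an LLM response."""
    # OpenAI-style tool-calling format: response.choices[0].message
    if hasattr(response, "choices") and response.choices:
        message = response.choices[0].message
        content = getattr(message, "content", None)
        tool_calls = getattr(message, "tool_calls", None) or []
        return content, tool_calls
    # Providers that don't expose `choices`: fall back to a plain content field
    # (attribute name assumed here for illustration).
    return getattr(response, "content", None), []
```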
LGTM to make it work. We should get together and redesign some aspects of this at some point, and we definitely need to add comprehensive tests for the coordination mechanism.
This PR adds tracing instrumentation to the 05-multi-agent-workflows quickstart so I can investigate orchestrator behavior. While implementing workflow improvements, I discovered that removing global chat history in a previous PR changed orchestrator semantics and broke some orchestrator flows.
I removed global chat history to let in-flight workflows resume after app restarts and to avoid incorrectly interleaving messages (which produced invalid chat history and LLM provider errors). I ran the quickstarts after that change, but for orchestrators I only checked for workflow completion logs and killed the run — that was not sufficient and masked context-related failures that we see now.
Previously, orchestrators used a two-step approach:
1. Broadcast a message to all agents (updating a global conversation history).
2. Send a TriggerAction to the specific agent that should respond.
With global chat history gone, we need to decide how orchestrators should handle context going forward from an event-driven perspective:
This is what I went with, another two-step approach (sketched below):
1. First event: the orchestrator broadcasts context to all agents (agents store/update their local view of the workflow/conversation). This does not trigger the whole workflow, so agents do not respond at this point.
2. Second event: the orchestrator sends a TriggerAction to the target agent to act.
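A rough sketch of what those two events could look like over Dapr pub/sub; the topic names and payload shapes here are assumptions for illustration, not the exact ones dapr-agents uses:

```python
import json
from dapr.clients import DaprClient

def broadcast_then_trigger(pubsub: str, instance_id: str, task: str, target_agent: str):
    """First event: broadcast context to all agents. Second event: trigger one agent to act."""
    with DaprClient() as client:
        # Event 1: agents update their local view of the conversation; nobody acts yet.
        client.publish_event(
            pubsub_name=pubsub,
            topic_name="beacon_channel",  # assumed broadcast topic name
            data=json.dumps({"workflow_instance_id": instance_id, "content": task}),
            data_content_type="application/json",
        )
        # Event 2: only the chosen agent receives a TriggerAction and responds.
        client.publish_event(
            pubsub_name=pubsub,
            topic_name=target_agent,  # assumed per-agent topic name
            data=json.dumps({"task": task, "workflow_instance_id": instance_id}),
            data_content_type="application/json",
        )
```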
This PR so far:
- Improves parts of the orchestrator workflow definition.
- Updates prompts where the orchestrator was skipping steps or producing unexpected behavior.
- Adds tracing setup/fixes for orchestrators (still WIP).
- Fixes orchestrators to behave as expected.
- Creates an internal trigger action and keeps supporting the external TriggerAction, so when orchestrators trigger agents they do not also trigger themselves, and users can still trigger agents via pub/sub as expected.
- Adds a context session_id to orchestrators, so if you start one, let it run, and the plan is not complete, rerunning with the same session picks up where it left off.
Then I let it run and it completes steps 1.1, 1.2, and 1.3, so step 2 is next. Now, when I restart my orchestrator, I see it pick up where it left off:

Note: this only fixes the LLM orchestrator. I still need to look at the random and round-robin ones.