feat: fix llm orchestrator + tracing + sessions for long-term memory #213
Conversation
LGTM. Thank you @sicoyle!
```python
if isinstance(msg, dict):
    long_term_memory_messages.append(msg)
elif hasattr(msg, "model_dump"):
    long_term_memory_messages.append(msg.model_dump())
```
Is it possible for this to be any other type? Or maybe we should log it if we get something that is neither a dict nor a pydantic model. Again, not sure if that's even possible.
Given our current setup, I think this is it. The state layer maintains a dictionary, so most of the time it's that; everything else is a pydantic model.
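A minimal sketch of the normalization with the suggested logging fallback; the helper name and logger here are illustrative, not the actual code in this PR:

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

def normalize_messages(messages: list[Any]) -> list[dict]:
    """Coerce stored messages to plain dicts for long-term memory."""
    long_term_memory_messages: list[dict] = []
    for msg in messages:
        if isinstance(msg, dict):
            # The state layer already stores plain dicts most of the time.
            long_term_memory_messages.append(msg)
        elif hasattr(msg, "model_dump"):
            # Everything else in the current setup is a pydantic model.
            long_term_memory_messages.append(msg.model_dump())
        else:
            # Not expected today, but log it so a new type doesn't pass silently.
            logger.warning("Skipping message of unexpected type: %s", type(msg).__name__)
    return long_term_memory_messages
```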
```python
    asyncio.set_event_loop(loop)
    return loop.run_until_complete(context_wrapped_coro())
except Exception as e:
    logger.warning(
```
This section, and maybe the whole context wrapping, needs to be analyzed in the context of where it is executed in workflows and/or activities. It is strange for new_event_loop to fail; it might indicate another subtle bug.
This part of the code is managing OpenTelemetry context across thread boundaries. I'll add a TODO comment on this, but I think once we get to the point of using trace context propagation through Dapr, all of this will be resolved.
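For reference, a minimal sketch of the pattern described above: capture the OpenTelemetry context on the calling thread and re-attach it inside the coroutine running on a fresh event loop (the helper name `run_coro_in_new_loop` is illustrative, not the function in this PR):

```python
import asyncio
import logging
from opentelemetry import context as otel_context

logger = logging.getLogger(__name__)

def run_coro_in_new_loop(coro):
    """Run a coroutine on a new event loop while preserving the caller's OTel context."""
    captured = otel_context.get_current()  # capture on the calling thread

    async def context_wrapped_coro():
        token = otel_context.attach(captured)  # re-attach inside the new loop
        try:
            return await coro
        finally:
            otel_context.detach(token)

    loop = asyncio.new_event_loop()
    try:
        asyncio.set_event_loop(loop)
        return loop.run_until_complete(context_wrapped_coro())
    except Exception as e:
        # TODO: revisit once trace context propagation goes through Dapr.
        logger.warning("Failed to run coroutine with propagated context: %s", e)
        raise
    finally:
        asyncio.set_event_loop(None)
        loop.close()
```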
```python
)

# Parse the response - now we get a Pydantic model directly
if hasattr(response, "choices") and response.choices:
```
Do we need this `choices`? Isn't it deprecated by now?
Yes, we need `choices`: it is part of the conversation API response matching the OpenAI tool-calling format.
```python
)

# Parse the response - now we get a Pydantic model directly
if hasattr(progress_response, "choices") and progress_response.choices:
```
Also here: is the `if` still needed now that we're not using raw data anymore?
The `if` depends on the LLM provider / API version in use. Most do support the `choices` field with the OpenAI tool-calling format, but not all do, so that's why I use the `if`.
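A hedged sketch of that defensive parsing: tolerate responses without `choices` while handling the OpenAI-style tool-calling shape when it is present (the fallback attribute names are assumptions for illustration, not the exact ones in this PR):

```python
def extract_reply(response):
    """Pull the assistant message content and any tool calls from an LLM response."""
    # OpenAI-style tool-calling format: response.choices[0].message
    if hasattr(response, "choices") and response.choices:
        message = response.choices[0].message
        content = getattr(message, "content", None)
        tool_calls = getattr(message, "tool_calls", None) or []
        return content, tool_calls
    # Providers that don't expose `choices`: fall back to a plain content field
    # (attribute name assumed here for illustration).
    return getattr(response, "content", None), []
```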
LGTM to make it work. We should get together and redesign some aspects of this at some point, and we definitely need to add comprehensive tests for the coordination mechanism.
This PR adds tracing instrumentation to the 05-multi-agent-workflows quickstart so I can investigate orchestrator behavior. While implementing workflow improvements, I discovered that removing global chat history in a previous PR changed orchestrator semantics and broke some orchestrator flows.
I removed global chat history to let in-flight workflows resume after app restarts and to avoid incorrectly interleaving messages (which produced invalid chat history and LLM provider errors). I ran the quickstarts after that change, but for orchestrators I only checked for workflow completion logs and killed the run — that was not sufficient and masked context-related failures that we see now.
Previously, orchestrators used a two-step approach:
1. Broadcast a message to all agents (updating a global conversation history).
2. Send a TriggerAction to the specific agent that should respond.
With global chat history gone, we need to decide how orchestrators should handle context going forward from an event-driven perspective:
This is what I went with, another two-step approach (sketched below):
1. First event: the orchestrator broadcasts context to all agents (agents store/update their local view of the workflow/conversation). This does not trigger the whole workflow, so agents do not respond at this point.
2. Second event: the orchestrator sends a TriggerAction to the target agent to act.
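A rough sketch of what those two events could look like over Dapr pub/sub; the topic names and payload shapes here are assumptions for illustration, not the exact ones dapr-agents uses:

```python
import json
from dapr.clients import DaprClient

def broadcast_then_trigger(pubsub: str, instance_id: str, task: str, target_agent: str):
    """First event: broadcast context to all agents. Second event: trigger one agent to act."""
    with DaprClient() as client:
        # Event 1: agents update their local view of the conversation; nobody acts yet.
        client.publish_event(
            pubsub_name=pubsub,
            topic_name="beacon_channel",  # assumed broadcast topic name
            data=json.dumps({"workflow_instance_id": instance_id, "content": task}),
            data_content_type="application/json",
        )
        # Event 2: only the chosen agent receives a TriggerAction and responds.
        client.publish_event(
            pubsub_name=pubsub,
            topic_name=target_agent,  # assumed per-agent topic name
            data=json.dumps({"task": task, "workflow_instance_id": instance_id}),
            data_content_type="application/json",
        )
```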
This PR so far:
- Improves parts of the orchestrator workflow definition.
- Updates prompts where the orchestrator was skipping steps or producing unexpected behavior.
- Adds tracing setup/fixes for orchestrators (still WIP).
- Fixes orchestrators to behave as expected.
- Creates an internal trigger action and keeps supporting the external TriggerAction, so when orchestrators trigger agents they do not also trigger themselves, and users can still trigger agents via pub/sub as expected.
- Adds a context session_id to orchestrators, so if you start one, let it run, and the plan is not complete, rerunning with the same session picks up where it left off.
Then I let it run and it completes steps 1.1, 1.2, and 1.3, so step 2 is next. Now, when I restart my orchestrator, I see it pick up where it left off:

Note: this only fixes the LLM orchestrator. I still need to look at the random and round-robin ones.