hannw
Contributor

@hannw hannw commented Sep 26, 2025

Add the game engine and visualizer to Kaggle environments.

hannw and others added 30 commits September 16, 2025 16:48
add werewolf game, prototype runs e2e

werewolf env stabilized.

1. fixed renderer to return input and output of agent at each step
2. add a better random agent
3. add get_human_readable to WerewolfObservationModel

add unit tests for the WerewolfEnv

add werewolf basic html rendering

add default max days

add dummy llm agent

initialized new werewolf game engine with more principled and modular design
add game init logic

werewolf integration tests working e2e, resolved minor bugs

refactored add history entry

Enable debug agents for debug mode

fixed action parsing logic and random agent

Night stage stable

refactor __run_interpreter to respect debug mode

refactored werewolf players to receive observations in message queue. game interpreter working.

werewolf game engine working end to end

werewolf rendering working e2e

1. Resolved a subtle werewolf voting bug: SimultaneousMajority voting needs to log votes even if the action is invalid. Also, tie breaking now extends to the case of no valid votes, where one target is selected randomly.
2. Resolved a HistoryEntry serialization issue with the DataEntry subfield (which has multiple subclasses). Now, we serialize all subclasses manually to dict.
3. Resolved random werewolf agent issues in parsing observations (HistoryEntry) with isinstance, since the agent now only has access to JSON objects.
4. refactored werewolf.js. Now it works with new game engine.
add more number of valid agents to the werewolf game config

fix the debug mode of agent.py to allow manual debugging

Add two panel frontend for werewolf

Also, add doctor seer action data entry
added day and night voting to werewolf.js

Add doctor save information to werewolf.js

Use action from steps to render input actions.

add logos to test_werewolf.py

add player_thumbnails to config schema

delete obsolete observation prep

temp commit test_engine.py

day of actions in event log rendered correctly now

separate day and night event logs into two shades in werewolf.js

Event log rendering order completed correctly in werewolf.js

add moderator announcement to event log in werewolf.js

UI related fixes

1. add moderator observations to env.info
2. refactor action to record game phase timestamps
3. change some entry to moderator announcement type so UI will render those
add llm harness for werewolf

fix the bug that reasoning and voting details were revealed to all players.

1. refactor DataEntry to have public_view method, so reasoning trace is hidden from public.
2. fix the bug of voting action add_history_entry calls in SimultaneousMajority.
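The public_view idea in item 1 can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `VoteDataEntry` class and its field names are hypothetical, and a plain dataclass stands in for whatever model type the engine actually uses.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class VoteDataEntry:
    """Hypothetical stand-in for a DataEntry subclass; field names are illustrative."""
    actor_id: str
    target_id: str
    reasoning: Optional[str] = None  # private: the agent's reasoning trace

    def public_view(self) -> "VoteDataEntry":
        # Return a copy with the private reasoning stripped out.
        return replace(self, reasoning=None)


entry = VoteDataEntry(actor_id="p1", target_id="p3", reasoning="p3 dodged my question")
public = entry.public_view()  # what other players would be allowed to see
```

Returning a copy rather than mutating the entry keeps the full record available to the moderator log while other players only ever receive the stripped version.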

moderator announce roles and their abilities

support better end game logging

fix supported llm model names

refactor player schema in config

1. modify werewolf.json
2. fix Enum value representation for model_dump
3. refactor player representation

add try except in llm werewolf harness

also added vertex ai credential for litellm

add display name to agent schema

add more supported models to llm harness

refactor llm harness to have better instruction template to delineate sections

pass agent_config to llm harness

fixed infinite voting loop problem and elected not eliminated problems

1. infinite voting bug: if certain players are timed out by Kaggle, the result is an infinite voting loop because their votes are never cast.
2. elected-not-eliminated problem: the default-setting logic needs to happen after the player's vote is cast; otherwise the duplicate-vote check prevents the voter's new vote from being cast.
3. tally vote had a subtle bug of not excluding abstain ("-1") votes while counting.

resolve infinite voting loop bug

1. use collect_votes method to set default for all expected voters.
2. fix winning condition bug for villagers.
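The voting fixes described in the two commits above (default ballots for voters who never respond, and abstentions excluded from the count) might look something like this. It is a sketch under assumptions: only the name `collect_votes` and the "-1" abstain marker come from the commit messages; the signatures and the tie handling are invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

ABSTAIN = "-1"  # abstain marker named in the commit message


def collect_votes(expected_voters: Iterable[str], cast: Dict[str, str]) -> Dict[str, str]:
    """Give every expected voter a ballot; missing ones (e.g. timeouts) abstain."""
    return {voter: cast.get(voter, ABSTAIN) for voter in expected_voters}


def tally(ballots: Dict[str, str]) -> Optional[str]:
    """Return the unique top target, or None on a tie or when everyone abstains."""
    counts = Counter(target for target in ballots.values() if target != ABSTAIN)
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: a real engine would apply its configured tie-break here
    return ranked[0][0]


# p3 never votes (simulated timeout) and p4 abstains explicitly.
ballots = collect_votes(["p1", "p2", "p3", "p4"], {"p1": "p3", "p2": "p3", "p4": "-1"})
winner = tally(ballots)
```

Because every expected voter always ends up with a ballot, the voting phase can terminate even when an agent times out.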

add retry logic

minor fixes

add event truncation to LLMWerewolfAgent

add pydantic to requirements

fix prompt template json fences and wording

record timestamp during history entry creation

change default voting scheme to sequential voting

This is for better dynamics and visualization for viewers during game replay
simplify visuals in werewolf.js

add player capsules in werewolf rendering

fix night action reasoning and actor id rendering

improve line spacing of moderator and game over message
fix issues in SequentialVoting protocol

refactor action queue to use action specific queue

adjust phase separator style

add perceived threat level to backend and frontend

improve json schema export for pydantic basemodel
refactor __run_interpreter to remove duplicated code

add comments to address debug mode branching

Refactoring interpreter() for readability and adding some high level documentation

minor new line improvement
add checks to ensure all player ids are unique

refactor phase transition to be more robust

make sure allow doctor self save is configurable
Add actor id capsule to event log updates

fix sequential voting bug

add voter capsule to day votes

add timeout display in werewolf.js
adding additional llm models with quota for testing

adding snippet to track token usage

fixing bug where single element instead of list is being returned

Adding additional packages that seem to be required from base image

fixing issue with undefined 2d array for moderator and player roles not getting set

making some options required

Adding README.md to werewolf with bare bones example
adding branch to werewolf_harness

adding branch to werewolf_harness

adding a flat() call as moderator results can be 2d, kind of a hack for now

adding a flat() call as moderator results can be 2d, kind of a hack for now, another missed call

add werewolf runner scripts for simple testing and experimentation

remove print statements and use logger instead

revert the voting index change

sequential voting should only have one actor a time
add TTS to werewolf game replay

Add continual audio to front end

Add voices and association with model

fix werewolf.js black screen issue reporting "Waiting for game data..."

making the audio speed bar persistent across playbacks

minor fixes for dump_audio.py

add announcements to audio event

add action announcements to audio event
add 3d background

add 3d background rendering and portable replay folder

refactor dump_audio.py to be more modularized

added wolf and windmill

added wolf and windmill

loaded 8 stickmans and idle animation

adjusted stickman directions and numbers

add nameplate on top of the stickman

add local image into asset for serving thumbnail

nameplate text aligned center

fix z placement of nameplate

add moon to the scene

add file dump
render audio script

refactor dump_audio.py to enable debug path to play simple audio

refactor werewolf.js to have hover effect and select event to playback

change text color during audio playback event

refactor base.py instructions

fix finding audio key issues

add gpt-oss model

add new audio generation instructions and examples

Change the tts instruction format to gemini-2.5-pro's suggestions

Improve on the audio prompts
add 3d werewolf script

add assets

resolve moderator obs bug

use logger for self_play.py

resolve sequential voting protocol bug

refactor dump_audio.py

add instructions for audio generation in README.md

add instructions for audio generation in README.md

add flask and gym back

remove redundant files

clean up tests

add agent action error code handling, default to game end if there is agent error

fix out and err context

fix input args of interpreter loop

improve exemplar formatting

Change the debug mode for tests to False for testing prod paths

These tests are not really debugging, they are testing side effects of prod path which catches error and update agent status.

add game rules

# Conflicts:
#	requirements.txt
add violent language filter to discussion protocols

enforce violent language filter in action initializer level

add optional reveal night elimination or day exile role

Fix bid driven discussion state machine and add turn by turn bidding discussion

refactor action queue to a single class and refine bidding driven discussion

add bid data entry and bid result data entry

use deterministic tie breaking mechanisms for UrgencyBidProtocol

add bid_result_public to properties

add test_turn_by_turn_bidding_discussion
3d view changes

Improved 3d layout

Player list css improvements

Fix for emojis

3d scene effects

Player logo fading

Adjusted panel margins

fix for flipped players

Day/night transitions

Nameplate adjustments

Status panel

fix animation so backward replay restore past state

fix event log so backward replay restore past log state

add moderator announcement back to event log

fix play bar event loop step alignment

disable key control of scene

add block experiment code for role balanced sampling

add spotlight logic

add more supported models
report cost of inference for LLMs

enable configuring LLMWerewolfAgent and add text mode prompts

add script to measure cost

add bid action to llm harness

enable harness bid reasoning config options and add cost tracker

refactor measure_cost.py to use new cost trackers

fix day addition bug in engine.py

fix token trajectory plot

add agent reset mechanism to solve global agent state carry over across episode issue and cost accumulation bug

Load usage directly from litellm for cost analysis

fix trajectory loading in measure_cost.py
add run.py script and refactor run_block.py to use it

rearrange config files

add toggle for reasoning traces
Fix pulse and spotlight animation misalignment with actor

add targeting arcs

Improved vote target visualizer

Change color of arcs for different actions

Fixed day/night timing and werewolf turns red during night time

fix day voting arcs

add phase divider logic to cleanly segment events

separate allEvents and visibleEvents to remove undisplayed events from the replay bar

fix night werewolf red light glitches

Fix night eliminated werewolf resurrection issues

fix phase indicator on left panel

remove redundant code, change event log phase indicator text

Resolve event log duplication bug

Resolve game end phase indicator to no day count

change thumbnail size on left panel, clean up console logs

Add display_name rendering and adjust thumbnail background to be white

Add optional display name to 3d player nameplate
refactor run_block.py to use run.py as subprocess and parallelize the operations

add option to append timestamp to output dir

add option to shuffle player ids

revise run_block.py instructions

add LogExecutionTime context manager to keep track of task time

add instructions to configure agents

fix generate unique role permutations

Add register agent method

Make register agent the responsibility of the user.
This way we can register llm harnesses on the fly.

add litellm_models.yaml config to register litellm model for cost provider etc

remove capital P from player logs

Fix reveal day and night elimination bug

Fix reveal text for no reveal situation in visualizer

Update player capsule to show display name (placeholder for model name)

Handle capsule parsing better

1. introduce a caching function for player name parser
2. better formatting for list parsing in system messages

add doctor and seer night glow

add options to choose random first actor for roundrobin discussion and voting

fix sequential voting protocol voter bug and history visibility bug

Minor format improvement to llm harness

Minor format improvement

Fix capsule regex for parsing "... <player_name>."

correct threat level text

use orb as self assessed threat level detector

1. repurpose orb as threat level detector
2. decouple is_active animation with role

add sound wave animation to speaker

change random message.

fix role_msg

add censor words, refactor censor pattern compile to global

add llm query logs, fix no cost logging

add flag to shuffle roles

refactor llm harness for better error handling and logging

1. refactored action parsing to be dispatched pattern.
2. introduce retry mechanism and callbacks for dealing with different errors (rate limit, context window, parsing error)
3. record completion and prompt in resulting actions.

fix tally exposing player role bug

add global toggle for turning reasoning on/off

add reveal rule messages in moderator announcement

refactor action schema for player

improve role specification formatting

fix schema_for_player return None bug

add action to ActionDataMixin

block action from public view of data entry

add player first person view on click of player card at left panel

add reset button for 3d camera position

fix query_parse bug, remove observation from action obj
remove observations from actions.py which is causing recursive memory growth

optimize imports

[critical] fix public view bug resulted from Mixin method resolution order (MRO) bug

1. The ActionDataMixin was wrongly placed at the end of the parent class list; it should come first so that it has the highest MRO priority.
2. from_history_entry had the wrong signature: it never took dict and DataEntry as input; it has always been called with a HistoryEntry.
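The MRO issue in item 1 can be reproduced with a small, self-contained example. The class bodies below are hypothetical stand-ins (only the names `ActionDataMixin`, `DataEntry`, and `public_view` appear in the commit log); the point is only how base-class order decides which `public_view` wins.

```python
class DataEntry:
    """Hypothetical base entry whose default public view exposes everything."""

    def __init__(self, **fields):
        self.fields = fields

    def public_view(self) -> dict:
        return dict(self.fields)


class ActionDataMixin:
    """Hypothetical mixin that hides private keys from the public view."""

    PRIVATE_KEYS = ("reasoning", "action")

    def public_view(self) -> dict:
        return {k: v for k, v in self.fields.items() if k not in self.PRIVATE_KEYS}


class LeakyVote(DataEntry, ActionDataMixin):  # mixin last: DataEntry.public_view wins
    pass


class SafeVote(ActionDataMixin, DataEntry):  # mixin first: its override wins
    pass


payload = {"target": "p3", "reasoning": "private thoughts"}
leaky = LeakyVote(**payload).public_view()  # still contains "reasoning"
safe = SafeVote(**payload).public_view()    # "reasoning" is filtered out
```

Python resolves methods left to right across the listed bases, so a mixin meant to override behavior has to be listed before the class it overrides.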

add missing error prompt

Refactor history entry access patterns

1. add const for observation key
2. agents (random and llm harness) use a more principled way to get action request history entry
3. refactor history entry type to be more detailed
4. update records.py to have access control for action data.
5. decouple history entry and player history entry view
6. remove unused history entries in observations in werewolf.py
7. control history entry access completely from state
8. clean up protocols.py and engine.py add history entry

Add visible history entry type control

fix raw observation key bug

make actor data observable to the actor by specifying source

cache schema_for_player for better latency

refactor raw_observation consumer to use getter and setter

refactor llm harness to record self action and reasoning

Introduce StrEnum to be better json serializable and clearer types
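As a rough illustration of why a string-valued enum serializes cleanly, here is a sketch using the `str` plus `Enum` mixin pattern. The `RoleName` class is hypothetical, and the PR may define its StrEnum differently (for example via `enum.StrEnum` on Python 3.11+).

```python
import json
from enum import Enum


class RoleName(str, Enum):
    """Hypothetical enum; mixing in str lets json treat members as plain strings."""
    WEREWOLF = "werewolf"
    VILLAGER = "villager"


encoded = json.dumps({"role": RoleName.WEREWOLF})  # no custom encoder needed
decoded = RoleName(json.loads(encoded)["role"])    # round-trips back to the member
```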
add configs for llm experiment

add packages to requirements.txt

add pairwise zero sum game tournament

add task shuffling to reduce LLM api load

reuse run_single_game_cli code

add display name to the right panel player name tag

revise self_play.py to run n games from a given config

add "random" and "no exile" tie exile options in SimultaneousMajority voting

Add RoundByRoundBiddingDiscussion

refactor _handle_night_await_actions() into smaller methods for readability.

Remove check for GOOGLE_APPLICATION_CREDENTIALS as sdk looks for them rather than application

Updating docs to remove internal OCTO project

Refactor to use event driven architecture for Role extensibility

1. Introduce EventBus in GameState to control fan-in and fan-out of game events.
2. refactor role specific event handlers to roles. Use decorator to register event handler.
3. action confirmation centrally handled.
4. Introduce PlayerID.
5. general improvement on symbol annotations.
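The event-driven design in items 1 and 2 could be sketched along these lines. Everything here apart from the name `EventBus` is an assumption made for illustration: the `on` decorator, the `publish` method, the event name, and the payload keys are not taken from the PR.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class EventBus:
    """Minimal publish/subscribe hub; names and signatures are illustrative."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def on(self, event_type: str):
        """Decorator that registers a handler for one event type."""
        def register(handler):
            self._handlers[event_type].append(handler)
            return handler
        return register

    def publish(self, event_type: str, payload: dict) -> int:
        """Fan the event out to every registered handler; return how many ran."""
        handlers = self._handlers.get(event_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


bus = EventBus()
protected = set()
eliminated = []


@bus.on("night_end")
def doctor_saves(payload: dict) -> None:
    # Runs first because it was registered first.
    protected.add(payload["saved"])


@bus.on("night_end")
def resolve_attack(payload: dict) -> None:
    if payload["attacked"] not in protected:
        eliminated.append(payload["attacked"])


handled = bus.publish("night_end", {"saved": "p2", "attacked": "p2"})
```

With this shape, adding a new role means registering more handlers rather than editing a central night-resolution method.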
Refactor protocols to be modules

1. provide factory methods and registry for proper configurable protocols.
2. refactor the protocols to be multiple modules.

Improve on player id annotations

standardize variable names from history entry to event

fix harness phase error

add cost to litellm models

remove general confirmation of action event to player

The action event will clutter llm prompt

Fix StrEnum repr

Improve on prompting

1. Providing human friendly name of protocols.
2. Improve on LLM prompts and wording of rule sets.

Improve the state machine transition to be explicit in engine.py

Add phase category as attribute of DetailedPhase

Resolve detailed_phase naming issue

add utility to log git hash

use wait random exponential to avoid "thundering herd" VM crash
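Randomized exponential backoff, the technique this commit title refers to, can be sketched with the standard library alone. The commit does not say which library or parameters the PR uses, so the function names, `base`, `cap`, and `max_attempts` below are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def random_exponential_wait(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Pick a delay uniformly in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(fn: Callable[[], T], max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry fn on any exception, sleeping a randomized, growing delay between tries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(random_exponential_wait(attempt))
    raise RuntimeError("max_attempts must be at least 1")


attempts = []


def flaky() -> str:
    # Simulated API call that fails twice before succeeding.
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("simulated rate limit")
    return "ok"


result = call_with_retry(flaky, sleep=lambda seconds: None)  # skip real sleeping in the demo
```

The random component spreads retries from many parallel games over time, so they do not all hit the API again at the same instant.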

Use threadpool instead

refactor voting to use Ballot

Refactor Role initialization to use role_params dict

1. remove allow_doctor_self_save flag
2. set params directly in Role subclasses.
3. update agent config to use role_params.
4. fix run scripts role shuffling logic.
5. fix minor public announcement in night elimination manager.
6. change no elimination night announcement.

Refactor day/night elimination reveal to use RevealLevel

1. moderator and night elimination manager config changes
2. general config schema and existing config updates.
3. Fix werewolf observation model reveal level.
4. Fix LLM harness observation model reveal level.

Fix seer team reveal level result blob

Add option to disallow doctor consecutive saving of same player

This option is mainly to prevent infinite Seer save and provide more strategic game play
Audios are added to existing replay to decouple voice over from game generation
@SohierDane
Contributor

@hannw I'm trying to look for ways to break this into additional smaller PRs. Does this sound like a reasonable list of smaller PRs we could review in isolation? Chaining them should be fine if necessary.

  • The game logic & visualizers.
  • The harness.
  • The other experiment configs & runner scripts.
  • The LLM logos.

@hannw hannw requested review from develra and lipovetz September 30, 2025 16:44
parent: ref.current,
preact,
styled,
__mainContext: context,
Contributor

All the changes to this file read really strangely to me - generally it's a bad idea to link context and state like this. Can you tell me a bit more about what you are trying to accomplish with these changes?

Contributor Author

@hannw hannw Sep 30, 2025

We are trying here to overwrite the behavior of the functions of the control interface. Do you have suggestions on how to better approach this?

This is needed because Kaggle environments make assumptions about game loops that break for some game engines like werewolf (specifically, one Kaggle step may map to many werewolf steps). Therefore, we can only overwrite the play function and the steps in order to make a "step" more intuitive for werewolf (say, one moderator announcement step that results in no player action).

One of the main learnings and suggestions for future 3rd-party game integrations is that some assumptions need to be relaxed or redesigned to fit a wider set of game engines.

Contributor

> We are trying here to overwrite the behavior of the functions of the control interface. Do you have suggestions on how to better approach this?

Ya fair enough, I think my preference here would be to have a more well-defined interface of 'overridable controls' that we allow passing into the game renderer, and the game can choose to override those controls if they want. I'll try to think a bit about how to update the interface to support this.

Contributor

> One of the main learnings and suggestions for future 3rd-party game integrations is that some assumptions need to be relaxed or redesigned to fit a wider set of game engines.

Just out of curiosity, why does werewolf need its own set of 'steps' instead of relying on the 'steps' defined in the replay interface?

Contributor

@hannw - maybe for the time being we can do overrides similar to what setStep is doing here? https://github.com/Kaggle/kaggle-environments/blob/master/kaggle_environments/static/player.html#L536 - then can just pass in that with the context and override it as needed. Not my favorite pattern but I think it should work.

Contributor Author

There will be a state reset issue with direct overrides (I've tried). Once we override setStep, the original copy is lost from the top-level context; that's why the setters and the global cache pattern are necessary here for the monkey patching to work correctly. We need a global cache, window.wwOriginals, to store the original setStep.

Contributor

@develra develra left a comment

A few requests for cleaning up the HTML and JavaScript. I left some comments about this, but given the length of the file there are a number of other places where they should also be applied.

  1. Figure out a debug strategy for the console.logs and console.warnings so they aren't always spamming the console output unless you specifically ask for it.

  2. Clean up a lot of the extraneous comments and commented-out dead code.

  3. I'm pretty nervous about needing to maintain this over time given the complexity involved. Did you use a PLAN.MD or something for development? Might be nice to include a context doc here in case we need to dump it in for iterations in the future.

  4. I'm not crazy about the global changes to player.html and other top-level things. Can we discuss what you need these for and figure out a more sustainable way to make the changes? Happy to assist.

Thanks!

step,
parent,
height = 700,
width = 1100
Contributor

I think in a doc I got some feedback that ideally this would be height of 1000ish and width of 1500ish - does something break down with those as initials?

Contributor Author

Some of the 3D rendering would be blocked or too small to read as a result. I think it might improve if we further simplify the UX like Yuting suggested. With the current setup, 1000 x 1500 will give a better viewing experience.

const dataType = dataEntry.data_type;
const visibleInUI = event.visible_in_ui ?? true;

console.log(`[RAW SEEN] Kaggle Step: ${kaggleStep}`, { dataType: dataType, event: event });
Contributor

Do you want this to always log out? Might be nice to condition it on a debugger setting, but nbd.

Contributor Author

Will clean these up afterward. Thanks for catching. I was experimenting heavily with optimizing the voice over experience.

const processedPhaseEvents = new Set(); // This will track events within a single phase.
(environment.info?.MODERATOR_OBSERVATION || []).forEach((stepEvents, kaggleStep) => {
(stepEvents || []).flat().forEach(dataEntry => {
const event = JSON.parse(dataEntry.json_str);
Contributor

any possibility that json_str here is malformed?

Contributor Author

Good question. The output json would be handled deterministically from game engine (including when agent failed), so we should be relatively safe here.

return;
}

// (Your existing code to push events to player.allEvents, player.displayEvents, etc.)
Contributor

can probably clean this up.

Contributor Author

Will do.

});
});

console.log(`[FINAL STEP LIST]`, player.displayEvents);
Contributor

same question around always display vs a debug

Contributor Author

Ditto.

}

// We patch the functions on the 'context' object directly.
if (context.__mainContext && !window.customPlayerControlsInjected) {
Contributor

not crazy about this mechanism for overwriting context - would be nice to figure out a more correct way to pass down whatever you need - happy to discuss.

Contributor Author

Yes, let's schedule some time to go over possibilities. I am open to using a cleaner design.

const mainContext = context.__mainContext;

if (!window.wwOriginals) {
console.log("DEBUG: Storing original controls for the first time.");
Contributor

cleanup

// --- Patch Play ---
mainContext.setPlay(() => (continuing) => {
console.log(`DEBUG: [setPlay] Play button clicked. Continuing: ${continuing}`);
// if (!audioState.isAudioEnabled) {
Contributor

remove commented out code

if (audioState.isAudioEnabled) {
// --- AUDIO-DRIVEN PLAYBACK ---
console.log("DEBUG: [setPlay] Audio is ON. Using audio-driven playback.");
window.wwOriginals.setPlaying(true);
Contributor

Would be nicer to just pass down setPlaying in the 'real' context as opposed to doing this hack with wwOriginals.

Contributor Author

Ditto

@SohierDane SohierDane mentioned this pull request Sep 30, 2025
"day_exile_reveal_level": "no_reveal",
},
)
agents = ["random"] * 7
Contributor

@hannw I'd love to see more tests and more asserts in the tests. The existing suite seems reasonable solely for confirming that the game runs to completion. Refactoring kaggle-environments seems to pose a high risk of silently altering game behavior at the moment.


from kaggle_environments.envs.werewolf.game.actions import filter_language

test_data = [
Contributor

Please add tests confirming the behavior of truly invalid or unrelated inputs.

"agents": agents_config,
},
)
agents = ["random"] * 7
Contributor

Similarly, please add some tests with non-random agents so we can have a larger set of expected game properties to validate.

5 participants