
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 11, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Nov 13, 2025 10:49pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 13, 2025 10:49pm
rivet-inspector Ignored Ignored Preview Nov 13, 2025 10:49pm
rivet-site Ignored Ignored Preview Nov 13, 2025 10:49pm

Contributor Author

MasterPtato commented Nov 11, 2025


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 13, 2025

PR Review: WebSocket Hibernation Improvements

Summary

This PR introduces important improvements to WebSocket hibernation handling, including:

  • Preventing duplicate open messages when reconnecting after hibernation
  • Adding keepalive functionality for hibernating WebSocket requests
  • Simplifying message resend logic
  • Improving timeout handling with drop_rx watch channel

Code Quality & Best Practices

Strengths ✅

  1. Good separation of concerns - The hibernation keepalive logic is cleanly separated into handle_websocket_hibernation wrapper and handle_websocket_hibernation_inner (pegboard-gateway/src/lib.rs:577-628)

  2. Proper resource cleanup - The keepalive task is correctly aborted and hibernating request is deleted on error/close (pegboard-gateway/src/lib.rs:612-624)

  3. Jitter implementation - Good practice adding jitter to prevent thundering herd (pegboard-gateway/src/lib.rs:599-600)

  4. Consistent error handling - Follows the project's custom error patterns using RivetError derive macros

  5. Import cleanup - Good fix removing unused glob import in gasoline/src/db/kv/keys/worker.rs:1

Areas for Improvement 🔧

  1. Simplified message resend logic (pegboard-gateway/src/shared_state.rs:325-345)

    • The removal of last_msg_index parameter and acknowledgment logic simplifies the code
    • However, this means ALL pending messages are resent on reconnection, which could be inefficient
    • Question: Is there a reason we removed the ability to skip already-received messages? This could lead to duplicate message processing on the actor side.
  2. Magic number in jitter (pegboard-gateway/src/lib.rs:599)

    let jitter = { rand::thread_rng().gen_range(0..128) };
    • Consider defining JITTER_MAX_MS as a constant with a comment explaining the value choice
  3. Timeout handling refactor (pegboard-gateway/src/shared_state.rs:472-476)

    • Changed from sending TunnelMessageData::Timeout to using drop_tx.send(())
    • The error handling if req.drop_tx.send(()).is_err() logs debug but continues - this is fine, but consider if warn level is more appropriate
  4. Database transaction isolation levels (pegboard/src/ops/actor/hibernating_request/*.rs)

    • All operations use Serializable isolation level, which is good for consistency
    • Verify this doesn't create performance bottlenecks under high hibernation request load

Potential Issues 🐛

  1. Race condition in hibernating request deletion (pegboard/src/ops/actor/hibernating_request/delete.rs:24-29)

    if let Some(last_ping_ts) = tx.read_opt(&last_ping_ts_key, Serializable).await? {
        tx.delete(&keys::actor::HibernatingRequestKey::new(
            input.actor_id,
            last_ping_ts,
            input.request_id,
        ));
    }
    • If last_ping_ts_key doesn't exist, the HibernatingRequestKey won't be deleted
    • This could leave orphaned entries in the actor hibernating request index
    • Recommendation: Consider always attempting deletion or adding cleanup logic
  2. Upsert race condition (pegboard/src/ops/actor/hibernating_request/upsert.rs:24-29)

    • Similar issue: if read returns None initially but entry was just created by another transaction
    • The old entry won't be deleted before creating new one
    • Could lead to duplicate entries temporarily (though eventually consistent)
  3. Error propagation in keepalive task (pegboard-gateway/src/lib.rs:602-606)

    ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
        actor_id,
        request_id: unique_request_id,
    })
    .await?;
    • If this operation fails, the entire keepalive task crashes silently (spawned task)
    • The hibernation will eventually timeout, but errors should be logged
    • Recommendation: Add error logging before propagating
  4. Missing validation after hibernation reconnect (pegboard-gateway/src/lib.rs:296-297)

    let can_hibernate = if after_hibernation {
        true
    • Assumes hibernation is still allowed without checking actor state
    • Question: Should we verify the actor still supports hibernation on reconnect?

Performance Considerations ⚡

  1. Keepalive interval calculation (pegboard-gateway/src/lib.rs:586-591)

    let mut ping_interval = tokio::time::interval(Duration::from_millis(
        (ctx.config()
            .pegboard()
            .hibernating_request_eligible_threshold()
            / 2)
        .try_into()?,
    ));
    • Default threshold is 90s, so ping every ~45s
    • This seems reasonable, but could create database write load with many hibernating connections
    • Consider if a longer interval would suffice (e.g. threshold / 3 or threshold - buffer); a sketch of these alternatives follows this list
  2. Database writes on every keepalive (pegboard/src/ops/actor/hibernating_request/upsert.rs:32-36)

    • Each keepalive deletes old entry and writes new one
    • With many hibernating connections, this could be significant write load
    • Suggestion: Consider a write-through cache or batch updates
  3. Message resend on every reconnection (pegboard-gateway/src/shared_state.rs:336-340)

    • All pending messages are republished without deduplication
    • For connections with many pending messages, this could be inefficient
    • The removed last_msg_index acknowledgment would have prevented this
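
A minimal sketch of the interval alternatives mentioned in item 1 above (threshold minus a safety buffer, with a fractional fallback); the buffer value and function name are illustrative assumptions, not code from this PR:

use std::time::Duration;

/// Sketch only: derive the keepalive interval from the eligibility threshold.
/// SAFETY_BUFFER_MS and the 1/3 fallback are arbitrary illustration values.
fn keepalive_interval(threshold_ms: u64) -> Duration {
    const SAFETY_BUFFER_MS: u64 = 15_000;
    // Ping as rarely as possible while still landing before eligibility;
    // for small thresholds fall back to a third of the threshold.
    let ms = threshold_ms
        .saturating_sub(SAFETY_BUFFER_MS)
        .max(threshold_ms / 3);
    Duration::from_millis(ms)
}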

Security Concerns 🔒

  1. Resource exhaustion via hibernating requests

    • No apparent limit on number of hibernating requests per actor
    • Consider adding max hibernating requests config to prevent resource exhaustion
    • Each hibernating request spawns a keepalive task and holds database entries
  2. Validation of request_id

    • The unique_request_id is passed through but not validated for ownership
    • Consider verifying the request_id belongs to the expected actor/session

Test Coverage 🧪

  1. Missing test updates

    • The after_hibernation parameter was added to handle_websocket, but the test files only prefix it with an underscore (i.e. leave it unused) rather than exercising it
    • Should add tests for:
      • Reconnection after hibernation (verify no duplicate open message)
      • Keepalive task behavior
      • Cleanup on hibernation failure
      • List operation filtering by ping threshold
  2. Protocol changes (runner-protocol/v3.bare)

    • Large protocol update (403 additions) but no corresponding test file changes visible
    • Ensure integration tests cover the new hibernating_request_ids field in CommandStartActor

Documentation 📚

  1. Config documentation (config/src/config/pegboard.rs:53-58)

    • Good addition of hibernating_request_eligible_threshold with doc comments
    • Consider adding example values or recommendations
  2. Missing function documentation

    • handle_websocket_hibernation_inner lacks doc comment explaining its purpose
    • The hibernating request ops modules lack module-level documentation

Minor Issues

  1. Logging level change (cache-purge/src/lib.rs:12)

    • Changed from info to debug - this is fine but ensure it doesn't hide important startup info
  2. Unused last_msg_index field removed (runner-protocol changes)

    • Make sure this doesn't break backward compatibility if older runners exist

Recommendations

  1. High Priority:

    • Fix potential orphaned entries in delete operation
    • Add error logging in keepalive task
    • Add max hibernating requests limit
    • Add integration tests for hibernation flow
  2. Medium Priority:

    • Consider performance impact of unconditional message resend
    • Extract magic numbers to constants
    • Add module-level documentation
    • Verify protocol backward compatibility
  3. Low Priority:

    • Consider write-through cache for keepalive updates
    • Add metrics for hibernating request count/duration

Overall Assessment

This is a solid improvement to the WebSocket hibernation system. The core logic is sound, but there are some edge cases and potential race conditions that should be addressed. The simplified message resend logic is cleaner but may have performance implications that should be evaluated.

Recommendation: Request changes to address the high-priority items, particularly around error handling and potential orphaned database entries.

🤖 Generated with Claude Code

@claude

claude bot commented Nov 13, 2025

PR Review: Fix Guard WebSocket Hibernation Behavior

Overview

This PR addresses two critical issues in WebSocket hibernation:

  1. Prevents duplicate open messages when reconnecting after hibernation
  2. Implements keepalive mechanism for hibernating requests to prevent premature disconnection

The changes span 30 files with 1,982 additions and 460 deletions. Overall, the implementation is solid with good architectural decisions.


Positive Aspects

1. Improved Timeout Handling Architecture ✅

The refactoring from TunnelMessageData enum to using drop_rx watch channel is excellent. Cleaner separation of concerns and more reliable timeout detection in pegboard-gateway/src/lib.rs:316-346.
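
For readers unfamiliar with the pattern, a minimal sketch of a watch-based drop signal is shown below; the channel names follow the review text (msg_rx, drop_rx) and are assumptions about the shape of the code, not an excerpt from it:

use tokio::sync::{mpsc, watch};

// Sketch: the timeout side calls drop_tx.send(()) once; every clone of drop_rx
// wakes here, so no dedicated Timeout message needs to flow through the tunnel.
async fn recv_or_dropped(
    msg_rx: &mut mpsc::Receiver<Vec<u8>>,
    drop_rx: &mut watch::Receiver<()>,
) -> Option<Vec<u8>> {
    tokio::select! {
        msg = msg_rx.recv() => msg,
        _ = drop_rx.changed() => None, // treated as a timeout by the caller
    }
}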

2. Hibernation Keepalive System ✅

The keepalive implementation in pegboard-gateway/src/lib.rs:582-628 is well-designed with half of hibernating_request_eligible_threshold as ping interval, jitter to prevent thundering herd, and proper cleanup on exit.

3. Clean Database Key Design ✅

The HibernatingRequestKey structure enables efficient range queries with composite key (actor_id, last_ping_ts, request_id) and separate LastPingTsKey for updates.
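
Conceptually, the index key can be pictured as the tuple below; this struct is only an illustration of why the field order matters (tuple-encoded keys sort lexicographically, so entries group by actor and then order by last ping time), not the crate's actual type:

use uuid::Uuid;

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct HibernatingRequestIndexKey {
    actor_id: Uuid,    // groups all hibernating requests of one actor
    last_ping_ts: i64, // millisecond timestamp; enables "older than X" range scans
    request_id: Uuid,  // disambiguates requests that pinged at the same instant
}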

4. Protocol Enhancement ✅

The after_hibernation parameter properly prevents redundant open messages during reconnection.


Issues & Concerns

CRITICAL: Removed Message Acknowledgment Logic

Location: pegboard-gateway/src/shared_state.rs:325-345

The simplification of resend_pending_websocket_messages removed important logic: it now resends ALL pending messages unconditionally, which could cause message duplication on reconnect and loses the ability to track what the runner has already received.

Recommendation: Verify if the actor-side acknowledgment system handles duplicates correctly, or restore acknowledgment-based filtering. Add integration tests for message deduplication during reconnection.
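
If the filtering is restored, the core of it is small; the sketch below assumes each pending message carries a monotonically increasing index and that the runner reports the highest index it received (the struct and function names are hypothetical):

/// Hypothetical pending-message shape; the real type in shared_state.rs differs.
struct PendingMsg {
    index: u64,
    payload: Vec<u8>,
}

/// Resend only messages the runner has not acknowledged yet.
fn msgs_to_resend(pending: &[PendingMsg], last_msg_index: Option<u64>) -> Vec<&PendingMsg> {
    pending
        .iter()
        .filter(|m| last_msg_index.map_or(true, |acked| m.index > acked))
        .collect()
}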


MEDIUM: Import Cleanup Inconsistency

CLAUDE.md states to not glob import from anyhow. The change in engine/packages/gasoline/src/db/kv/keys/worker.rs follows this correctly with use anyhow::Result, but other files may still have use anyhow::* violations.


MEDIUM: Configuration Missing Documentation

Location: engine/packages/config/src/config/pegboard.rs:53-58

The hibernating_request_eligible_threshold field needs better documentation including:

  • Default value 90,000ms not in docstring
  • No tuning guidance
  • Relationship to keepalive interval (threshold/2) not explained

MEDIUM: Magic Number in Jitter

Location: pegboard-gateway/src/lib.rs:599

The value 128 for jitter has no explanation or constant. Recommend extracting to KEEPALIVE_JITTER_MAX_MS constant.
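
A sketch of the suggested extraction, keeping the same 0-128ms behavior (the doc comment wording is an assumption about the value's intent):

use rand::Rng;

/// Upper bound for random keepalive jitter, in milliseconds. Small relative to
/// the ~45s ping interval; exists only to spread out simultaneous pings.
const KEEPALIVE_JITTER_MAX_MS: u64 = 128;

fn keepalive_jitter_ms() -> u64 {
    rand::thread_rng().gen_range(0..KEEPALIVE_JITTER_MAX_MS)
}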


MINOR: Logging Level Change

Location: engine/packages/cache-purge/src/lib.rs:12

Changed from info to debug, which seems unrelated to the PR's purpose but is probably correct for reducing log noise.


Testing Recommendations

  1. Message Deduplication Test - verify actor handles duplicate messages during reconnection
  2. Keepalive Timeout Test - ensure hibernating requests cleaned up after threshold
  3. Race Condition Test - test concurrent hibernation/reconnection scenarios
  4. Load Test - verify jitter prevents stampeding herd

Security & Performance

Security: No obvious security issues. Keepalive mechanism properly cleans up state on all exit paths.

Performance:

  • Positive: Jitter prevents load spikes, efficient database keys, proper cleanup
  • Watch: Resending all pending messages on reconnect could cause bandwidth spikes with large backlogs

Code Quality Score: 8/10

Strengths: Clean architecture, good error handling, proper resource cleanup, well-structured database keys

Areas for improvement: Critical message acknowledgment issue, import consistency, documentation, magic numbers


Recommendation: Approve with follow-up

This PR significantly improves the hibernation system, but please address the critical concern about message acknowledgment before merging, or provide clarification on why the simplified approach is safe.

🤖 Generated with Claude Code

@claude

claude bot commented Nov 13, 2025

PR Review: Hibernating WebSocket Keepalive & Protocol Updates

This PR implements hibernating WebSocket request tracking with keepalive pings to prevent premature eviction, and fixes issues with duplicate open messages on reconnection. Overall, the implementation is solid with good architectural decisions, but there are some concerns to address.


Strengths

1. Well-Designed Hibernation State Management

  • The new HibernatingRequestKey structure in packages/pegboard/src/keys/actor.rs:317-424 properly indexes by (actor_id, last_ping_ts, request_id), enabling efficient time-based queries
  • Dual storage pattern (indexed key + metadata key) provides both efficient range queries and direct lookups
  • The hibernating_request_eligible_threshold configuration at packages/config/src/config/pegboard.rs:53-92 is properly documented and has sensible defaults (90 seconds)

2. Proper Concurrency Handling

  • The keepalive task at packages/pegboard-gateway/src/lib.rs:585-608 implements jitter (0-128ms) to prevent thundering herd issues - good defensive programming
  • Uses tokio::select! properly to handle concurrent events (message arrival vs timeout) in multiple places
  • Proper task cleanup with keepalive_handle.abort() at line 612

3. Improved Message Flow Control

  • Removal of the complex last_msg_index acknowledgment logic simplifies the message resend mechanism in packages/pegboard-gateway/src/shared_state.rs:325-342
  • The after_hibernation flag properly prevents duplicate open messages at packages/pegboard-gateway/src/lib.rs:275-295
  • Timeout signaling now uses drop_rx.changed() instead of sending timeout messages, which is cleaner

Issues & Concerns

Critical Issues

1. Race Condition in Keepalive Cleanup (packages/pegboard-gateway/src/lib.rs:612-625)

keepalive_handle.abort();

match &res {
    Ok(HibernationResult::Continue) => {}
    Ok(HibernationResult::Close) | Err(_) => {
        self.ctx
            .op(pegboard::ops::actor::hibernating_request::delete::Input {
                actor_id: self.actor_id,
                request_id: unique_request_id,
            })
            .await?;
    }
}

Problem: The keepalive task is aborted before the conditional delete. If the task is in the middle of an upsert operation when aborted, it could leave stale data. The delete operation might fail to clean up if the abort happens between the upsert starting and completing.

Recommendation: Either:

  • Move the abort after the delete operation
  • Use a cancellation token instead of abort to allow graceful shutdown (see the sketch below)
  • Add a small delay between abort and delete to ensure the upsert completes
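
A sketch of the cancellation-token option, assuming the tokio_util crate; the names and the ping body are placeholders, not the PR's code:

use tokio_util::sync::CancellationToken;

// The keepalive loop exits cooperatively between ticks, so an in-flight upsert
// finishes before the caller deletes the hibernating request entry.
async fn run_keepalive(cancel: CancellationToken) {
    loop {
        tokio::select! {
            _ = cancel.cancelled() => break,
            _ = tokio::time::sleep(std::time::Duration::from_secs(45)) => {
                // perform the upsert here; once this branch is chosen it runs to
                // completion and is not interrupted by cancellation
            }
        }
    }
}

// Caller ordering (illustrative):
//   let cancel = CancellationToken::new();
//   let handle = tokio::spawn(run_keepalive(cancel.clone()));
//   let res = handle_websocket_hibernation_inner(...).await;
//   cancel.cancel();      // request shutdown instead of abort()
//   let _ = handle.await; // wait for the loop to drain
//   ... then delete the hibernating request entry ...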

2. Error Handling in Keepalive Loop (packages/pegboard-gateway/src/lib.rs:602-607)

ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
    actor_id,
    request_id: unique_request_id,
})
.await?;

Problem: If the upsert operation fails, the entire loop terminates with ?, but there's no logging or handling. This could cause silent failures where hibernating connections are not kept alive, leading to premature eviction.

Recommendation:

if let Err(err) = ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
    actor_id,
    request_id: unique_request_id,
}).await {
    tracing::error!(?err, request_id=?unique_request_id, "failed to update hibernating request keepalive");
    // Consider whether to continue the loop or break
}

Medium Priority Issues

3. Missing Import Cleanup (packages/gasoline/src/db/kv/keys/worker.rs:1-3)

The change from use anyhow::*; to use anyhow::Result; is good (follows CLAUDE.md guidelines), but this appears unrelated to the PR's main purpose. While it's a good cleanup, it should be mentioned in the commit message or done in a separate commit.

4. Incomplete Error Context in Database Operations

In packages/pegboard/src/ops/actor/hibernating_request/upsert.rs:24-30, the delete operation is performed without checking if it succeeds:

if let Some(last_ping_ts) = tx.read_opt(&last_ping_ts_key, Serializable).await? {
    tx.delete(&keys::actor::HibernatingRequestKey::new(
        input.actor_id,
        last_ping_ts,
        input.request_id,
    ));
}

While UniversalDB's transactional semantics may guarantee this works, there's no error handling or logging if the delete fails silently.

5. Potential Integer Overflow in Ping Interval (packages/pegboard-gateway/src/lib.rs:586-591)

let mut ping_interval = tokio::time::interval(Duration::from_millis(
    (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
    .try_into()?,
));

The try_into() could fail if the threshold is set to a very large value. While unlikely with the default of 90 seconds, this should be handled more gracefully with a clear error message about invalid configuration.
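
One way to do that is to attach context to the conversion, as sketched below; the i64 threshold type and the error wording are assumptions:

use anyhow::{Context, Result};
use std::time::Duration;

/// Sketch: derive the keepalive interval, failing with a descriptive error
/// instead of an opaque TryFromIntError when the configured value is invalid.
fn keepalive_interval(threshold_ms: i64) -> Result<Duration> {
    let half: u64 = (threshold_ms / 2)
        .try_into()
        .context("hibernating_request_eligible_threshold must be a valid millisecond duration")?;
    Ok(Duration::from_millis(half))
}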

6. Logging Level Change Seems Arbitrary (packages/cache-purge/src/lib.rs:12)

-	tracing::info!(subject=?CACHE_PURGE_TOPIC, "subscribed to cache purge updates");
+	tracing::debug!(subject=?CACHE_PURGE_TOPIC, "subscribed to cache purge updates");

This change is unrelated to the PR and should be explained. If this is intentional noise reduction, it should be in a separate commit or mentioned in the PR description.

Low Priority / Style Issues

7. Inconsistent Error Handling Patterns

In packages/pegboard-gateway/src/lib.rs:186-223, the error handling mixes break and return Err(...) in different branches of the select statement. While functionally correct, it would be more consistent to use one pattern throughout.

8. Magic Number for Jitter

let jitter = { rand::thread_rng().gen_range(0..128) };

The 128ms maximum jitter should be a named constant or configuration value to make it easier to tune.


Performance Considerations

Positive

  • The time-based indexing in HibernatingRequestKey enables efficient range scans to find stale requests
  • Jitter prevents synchronized load spikes
  • The use of StreamingMode::WantAll in list operation is appropriate for expected data volumes

Concerns

  1. Database Write Amplification: Each hibernating connection now performs periodic upserts (every 45 seconds with default config). With many hibernating connections, this could create significant database load. Consider:

    • Batching updates if multiple connections hibernate simultaneously
    • Adjusting the ping interval based on number of hibernating connections
    • Monitoring database write metrics after deployment
  2. Memory Usage: The pending_ws_msgs vector in HibernationState can grow up to 1 MiB per request. With many concurrent hibernating requests, this could consume significant memory. The limit is good, but should be monitored.


Security Considerations

Good Practices

  • Request IDs use UUIDs, preventing prediction attacks
  • The hibernation threshold prevents indefinite resource consumption
  • Proper cleanup on errors prevents resource leaks

Minor Concern

  • No rate limiting on hibernation state creation - a malicious client could create many hibernating connections to consume database resources. Consider adding rate limits per actor or globally.
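
The simplest form of such a limit is a per-actor cap checked before registering a new hibernating request; the constant, map, and function below are purely illustrative assumptions:

use std::collections::HashMap;
use uuid::Uuid;

/// Illustrative cap; a real limit would likely live in config.
const MAX_HIBERNATING_REQUESTS_PER_ACTOR: usize = 1024;

fn may_register_hibernating_request(counts: &HashMap<Uuid, usize>, actor_id: Uuid) -> bool {
    counts.get(&actor_id).copied().unwrap_or(0) < MAX_HIBERNATING_REQUESTS_PER_ACTOR
}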

Test Coverage

Concern: The PR only updates a test stub at packages/guard-core/tests/custom_serve.rs:72-119 without adding meaningful tests for the new hibernation keepalive functionality.

Recommendations:

  • Add integration tests that verify keepalive updates occur at expected intervals
  • Test hibernation cleanup on various error conditions
  • Test the race condition scenarios mentioned above
  • Verify behavior when hibernating_request_eligible_threshold is configured to edge case values

Protocol Changes

The changes to engine/sdks/schemas/runner-protocol/v3.bare and generated code are substantial (+927/-211 lines in versioned.rs). The addition of hibernating_request_ids to CommandStartActor is a breaking change that needs:

  1. Migration strategy: How will older runners handle this new field?
  2. Version compatibility: Is this properly versioned in the protocol?
  3. Documentation: The new field should be documented in the protocol schema

Recommendations Summary

Must Fix Before Merge

  1. Fix race condition in keepalive cleanup (move abort or add cancellation token)
  2. Add error handling in keepalive loop with proper logging
  3. Add meaningful integration tests for hibernation keepalive

Should Fix

  1. Document protocol breaking changes and migration strategy
  2. Handle try_into() failure gracefully with better error message
  3. Add monitoring/metrics for hibernating connection count and database writes

Nice to Have

  1. Extract magic number (128) to named constant
  2. Explain unrelated changes (cache-purge logging, anyhow import)
  3. Consider rate limiting on hibernation state creation
  4. Add database write load testing for many concurrent hibernating connections

Conclusion

The core implementation is architecturally sound and solves the hibernation keepalive problem effectively. However, the error handling gaps and potential race condition in cleanup need to be addressed before merging. The lack of comprehensive tests is also concerning for such a critical reliability feature.

Recommendation: Request changes - primarily for error handling improvements and test coverage.

@NathanFlurry force-pushed the 11-11-fix_guard_dont_send_open_msg_for_hibernating_ws_hibernating_ws_keepalive branch from f45c96f to 769c3f8 on November 13, 2025 at 21:39
@NathanFlurry force-pushed the 11-10-fix_guard_handle_websocket_tasks_during_shutdown branch from 97572fd to 3362252 on November 13, 2025 at 21:39
@claude

claude bot commented Nov 13, 2025

PR Review: WebSocket Hibernation Keepalive Implementation

I've reviewed this PR which adds hibernating WebSocket keepalive functionality and fixes the issue where open messages were being sent for hibernating connections. Overall, the implementation is solid with good architectural decisions. Here are my findings:


Strengths

  1. Well-designed keepalive mechanism: The periodic ping task with jitter (lines 585-608 in pegboard-gateway/src/lib.rs) prevents thundering herd issues and efficiently tracks active hibernating connections.

  2. Proper resource cleanup: The keepalive task is correctly aborted and hibernating request entries are deleted on closure/error (lines 612-624).

  3. Good use of database operations: The hibernating request ops (upsert, delete, list) follow existing patterns with proper transaction handling and isolation levels.

  4. Fixed the open message issue: Correctly skips sending ToClientWebSocketOpen when reconnecting after hibernation using the after_hibernation flag.

  5. Improved timeout handling: Replaced TunnelMessageData::Timeout with drop_rx watcher pattern, which is cleaner and more idiomatic.

  6. Backward compatibility: Protocol v3 changes maintain compatibility through converter functions.


🔍 Issues & Concerns

Critical Issues

  1. Potential memory leak in keepalive task (pegboard-gateway/src/lib.rs:585-608)

    • The keepalive task runs in an infinite loop with no error handling
    • If ctx.op() fails (line 602-606), the error propagates but the task continues looping
    • The Result is never checked, so errors are silently ignored
    • Recommendation: Add error handling and consider breaking/logging on persistent failures:
    loop {
        ping_interval.tick().await;
        let jitter = { rand::thread_rng().gen_range(0..128) };
        tokio::time::sleep(Duration::from_millis(jitter)).await;
        
        if let Err(e) = ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
            actor_id,
            request_id: unique_request_id,
        }).await {
            tracing::warn!(?e, "failed to upsert hibernating request");
            // Consider breaking after N consecutive failures
        }
    }
  2. Race condition in message acknowledgment (pegboard-gateway/src/shared_state.rs:322-342)

    • The resend_pending_websocket_messages function now resends ALL pending messages without checking last_msg_index
    • This removes the deduplication logic that was previously present (deleted lines 345-381)
    • Could lead to duplicate message delivery if messages are still in flight during reconnection
    • Recommendation: Clarify if this is intentional or if the ack logic needs to be restored
  3. Missing hibernation state check (pegboard-gateway/src/lib.rs:348-350)

    • After hibernation, resend_pending_websocket_messages is called but can_hibernate might be false
    • Should verify hibernation state exists before attempting to resend
    • Recommendation: Add a check:
    if can_hibernate {
        self.shared_state
            .resend_pending_websocket_messages(request_id)
            .await?;
    }

Medium Priority Issues

  1. Inconsistent error handling in tunnel message loop (pegboard-gateway/src/lib.rs:382-419)

    • The msg_rx.recv() returns None when channel closes, triggering hibernation
    • However, this might not always be the desired behavior if the channel closes due to an error
    • Recommendation: Add logging to distinguish between expected closure and unexpected errors
  2. Ping interval calculation (pegboard-gateway/src/lib.rs:586-591)

    • Uses hibernating_request_eligible_threshold / 2 for ping interval
    • If threshold is very small (e.g., <1000ms), this could cause excessive database writes
    • Recommendation: Add a minimum ping interval (e.g., 5 seconds):
    let ping_interval_ms = (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
        .max(5_000);
  3. Database key design concern (pegboard/src/keys/actor.rs:311-383)

    • The HibernatingRequestKey uses last_ping_ts in the key structure
    • This means every ping requires a delete + write operation (lines 25-29 in ops/actor/hibernating_request/upsert.rs)
    • This is inefficient for high-frequency updates
    • Recommendation: Consider separating the timestamp into a value rather than part of the key, or using a single-key-per-request structure with atomic updates
  4. Missing test coverage

    • No tests added for the new hibernating request operations
    • No tests for the keepalive mechanism
    • Recommendation: Add integration tests for hibernation lifecycle

Minor Issues

  1. Jitter range hardcoded (pegboard-gateway/src/lib.rs:599)

    • Jitter range of 0-128ms is reasonable but hardcoded
    • Recommendation: Make it configurable or document why 128ms was chosen
  2. Log level change (cache-purge/src/lib.rs:12)

    • Changed from info! to debug! - likely intentional but not mentioned in PR description
    • Recommendation: Confirm this was intended
  3. Import cleanup (gasoline/src/db/kv/keys/worker.rs:1-3)

    • Good cleanup removing unused imports and the unnecessary Result::Ok import
    • ✅ No issues here

📋 Testing Recommendations

  1. Test hibernation keepalive with network partitions
  2. Test behavior when actor becomes ready during hibernation with pending messages
  3. Test cleanup behavior when client disconnects during hibernation
  4. Load test the keepalive mechanism with many concurrent hibernating connections
  5. Test the database delete + write transaction in upsert for race conditions

🔒 Security Considerations

  • No obvious security issues
  • The jitter prevents timing attacks and stampeding herds ✅
  • Request IDs are properly UUIDs preventing prediction ✅

📊 Performance Considerations


📝 Documentation & Code Quality

  • Configuration comments are clear and helpful ✅
  • Good use of structured logging with ?x format ✅
  • Follows Rivet error handling patterns ✅
  • Code follows the repository's style guide ✅

Summary

This is a solid implementation with good design choices. The main concerns are:

  1. Error handling in the keepalive task (critical)
  2. Potential race condition in message acknowledgment (critical)
  3. Database key design efficiency (medium)

Recommendation: Address the critical issues before merging, particularly the keepalive error handling and the message acknowledgment logic.

@NathanFlurry force-pushed the 11-11-fix_guard_dont_send_open_msg_for_hibernating_ws_hibernating_ws_keepalive branch from 769c3f8 to fc8f1fc on November 13, 2025 at 22:27
@NathanFlurry force-pushed the 11-10-fix_guard_handle_websocket_tasks_during_shutdown branch from 3362252 to 67f0a65 on November 13, 2025 at 22:27
@claude

claude bot commented Nov 13, 2025

Code Review - PR #3449

I've reviewed this PR which implements WebSocket hibernation improvements and keepalive functionality. Here's my comprehensive feedback:

Summary

This PR addresses two main issues:

  1. Prevents duplicate open messages for hibernating WebSocket connections after reconnection
  2. Implements keepalive mechanism for hibernating WebSocket requests with database tracking

Positive Aspects ✅

1. Clean Database Schema Design

  • The new hibernating request tracking system uses a well-structured composite key design (actor_id + last_ping_ts + request_id) in engine/packages/pegboard/src/keys/actor.rs:312
  • Efficient range queries using timestamp as part of the key for garbage collection
  • Proper separation of concerns between data storage and index keys

2. Proper Error Handling

  • Good use of tokio::select! to handle multiple async branches in engine/packages/pegboard-gateway/src/lib.rs:355-415
  • Cleanup logic properly handles both success and error cases in handle_websocket_hibernation (lib.rs:614-625)
  • The keepalive task is properly aborted on all exit paths

3. Jitter Implementation

  • Smart addition of jitter (0-128ms) in engine/packages/pegboard-gateway/src/lib.rs:599 to prevent thundering herd problems when multiple hibernating requests ping simultaneously

4. Removed Complex Deduplication Logic

  • Simplified message resending by removing the last_msg_index parameter and associated deduplication logic in engine/packages/pegboard-gateway/src/shared_state.rs:325-345
  • This is cleaner and shifts acknowledgment responsibility appropriately

Issues & Concerns ⚠️

1. Potential Race Condition in Database Operations 🔴
In engine/packages/pegboard/src/ops/actor/hibernating_request/upsert.rs:24-30:

if let Some(last_ping_ts) = tx.read_opt(&last_ping_ts_key, Serializable).await? {
    tx.delete(&keys::actor::HibernatingRequestKey::new(
        input.actor_id,
        last_ping_ts,
        input.request_id,
    ));
}
  • This read-then-delete pattern could have issues under concurrent upserts for the same request_id
  • The transaction uses Serializable isolation, which should help, but the delete operation doesn't have .await? - is this intentional?
  • Consider using a stronger isolation level or document why Serializable is sufficient here

2. Missing Error Propagation 🔴
In engine/packages/pegboard-gateway/src/lib.rs:585-608, the keepalive task:

tokio::spawn(async move {
    // ... error handling inside
    ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
        actor_id,
        request_id: unique_request_id,
    })
    .await?;
})
  • Errors in the keepalive loop are silently dropped since the JoinHandle result isn't checked
  • If database operations fail, the hibernating request won't be marked as alive, but the connection continues
  • Recommendation: Add logging for keepalive failures or consider surfacing critical errors

3. Inconsistent Logging Levels 🟡
In engine/packages/cache-purge/src/lib.rs:12:

-tracing::info!(subject=?CACHE_PURGE_TOPIC, "subscribed to cache purge updates");
+tracing::debug!(subject=?CACHE_PURGE_TOPIC, "subscribed to cache purge updates");
  • This change downgrades log level from info to debug
  • While this may be correct, ensure it's intentional and document why in the commit message

4. Message Resend Behavior Change 🟡
The removal of last_msg_index in engine/packages/pegboard-gateway/src/shared_state.rs:325-345 means:

  • Before: Only resent messages the runner hadn't acknowledged
  • After: Always resends ALL pending messages
  • This could lead to duplicate message delivery if the runner had partially processed messages before hibernation
  • Question: Is the runner expected to deduplicate? This behavioral change should be documented.

5. Keepalive Interval Calculation 🟡
In engine/packages/pegboard-gateway/src/lib.rs:586-591:

let mut ping_interval = tokio::time::interval(Duration::from_millis(
    (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
    .try_into()?,
));
  • Using threshold / 2 is good for safety margin
  • However, if threshold is very small (e.g., < 2ms), this could cause issues
  • Consider adding validation or a minimum interval threshold

6. Protocol Breaking Change Not Versioned 🟡
The addition of hibernating_request_ids: list<data> to CommandStartActor in engine/sdks/schemas/runner-protocol/v3.bare is a breaking protocol change, but:

  • It's added to v3 directly rather than creating a v4
  • Ensure backward compatibility is maintained or document the breaking change clearly

7. Type Safety Concern 🟡
In engine/packages/pegboard-runner/src/tunnel_to_ws_task.rs:66-76:

*hibernating_request_ids =
    ids.into_iter().map(|x| x.into_bytes().to_vec()).collect();
  • Converting Uuid to Vec<u8> via into_bytes().to_vec() works but loses type safety
  • Consider if the protocol should use a more strongly-typed representation (see the sketch after this list)

8. Missing Documentation 🟡

  • The new after_hibernation parameter in CustomServeTrait::handle_websocket lacks documentation explaining:
    • What should change in behavior when this is true
    • Why the gateway shouldn't send an open message in this case
    • Contract between gateway and runner for hibernation state
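
Regarding the type-safety point in item 7, parsing the wire form back into UUIDs at the boundary keeps the rest of the code typed; the function below is a hypothetical sketch, not an existing helper:

use anyhow::{anyhow, Result};
use uuid::Uuid;

/// Convert the protocol's list<data> of 16-byte blobs into UUIDs, rejecting
/// malformed entries instead of carrying raw bytes around.
fn parse_hibernating_request_ids(raw: &[Vec<u8>]) -> Result<Vec<Uuid>> {
    raw.iter()
        .map(|bytes| {
            let arr: [u8; 16] = bytes
                .as_slice()
                .try_into()
                .map_err(|_| anyhow!("hibernating request id must be exactly 16 bytes"))?;
            Ok(Uuid::from_bytes(arr))
        })
        .collect()
}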

Testing Concerns 🧪

Missing Test Coverage:

  1. No tests for the new hibernating request database operations (upsert/list/delete)
  2. No integration tests for the keepalive mechanism
  3. No tests for race conditions in concurrent hibernation state updates
  4. The test file engine/packages/guard-core/tests/custom_serve.rs was updated but no new test cases were added

Recommendations:

  • Add unit tests for pegboard::ops::actor::hibernating_request::*
  • Add integration tests for WebSocket hibernation flow with keepalive
  • Test edge cases: rapid hibernation/wake cycles, database failures during keepalive

Performance Considerations 🚀

1. Database Write Amplification

  • Each keepalive ping performs 2 writes (delete old timestamp entry, write new one) in upsert.rs
  • For many hibernating connections, this could create significant database load
  • Recommendation: Monitor database performance and consider batch operations or longer intervals

2. Jitter Range

  • Current jitter is 0-128ms
  • With many concurrent hibernating requests, this might not spread load enough
  • Consider making jitter range configurable or proportional to the ping interval
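
A proportional variant could look like the sketch below; the 1% factor is an arbitrary illustration, not a recommendation from the PR:

use rand::Rng;
use std::time::Duration;

/// Sketch: jitter scaled to the ping interval instead of a fixed 0-128ms window.
fn proportional_jitter(ping_interval: Duration) -> Duration {
    let max_ms = (ping_interval.as_millis() as u64 / 100).max(1);
    Duration::from_millis(rand::thread_rng().gen_range(0..max_ms))
}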

Security Considerations 🔒

1. Request ID Exposure

  • The unique_request_id is now passed through more layers
  • Ensure this doesn't leak in logs or expose internal state inappropriately

2. Resource Exhaustion

  • No apparent limit on number of hibernating requests per actor
  • Consider adding limits to prevent resource exhaustion attacks

Code Style & Best Practices 📝

Follows CLAUDE.md conventions:

  • ✅ Uses custom error system correctly
  • ✅ Proper structured logging with ? syntax
  • ✅ Lowercase log messages
  • ✅ Uses workspace dependencies
  • ⚠️ Some glob imports from anyhow might exist (couldn't verify all files)

Minor style issues:

  • Some inconsistent spacing in match arms
  • Consider extracting magic numbers (e.g., 128ms jitter, channel size 128) to named constants

Summary & Recommendations

Overall Assessment: This is a solid implementation of an important feature, but has several areas that need attention before merging.

Must Fix Before Merge:

  1. ❌ Verify race condition safety in hibernating_request::upsert
  2. ❌ Add error logging/handling for keepalive task failures
  3. ❌ Document message resend behavior change and ensure runner handles duplicates
  4. ❌ Add test coverage for new database operations

Should Consider:

  1. 🔶 Add validation for minimum keepalive interval
  2. 🔶 Document protocol breaking changes clearly
  3. 🔶 Add integration tests for hibernation flow
  4. 🔶 Monitor and optimize database write amplification
  5. 🔶 Add limits on hibernating requests per actor

Nice to Have:

  1. 💭 Extract magic numbers to named constants
  2. 💭 Improve documentation for after_hibernation parameter
  3. 💭 Consider making jitter range configurable

Great work on improving the WebSocket hibernation system! The core design is sound, but please address the critical issues above.

@NathanFlurry force-pushed the 11-11-fix_guard_dont_send_open_msg_for_hibernating_ws_hibernating_ws_keepalive branch from fc8f1fc to e5ca471 on November 13, 2025 at 22:35
@NathanFlurry force-pushed the 11-10-fix_guard_handle_websocket_tasks_during_shutdown branch from 67f0a65 to 4039392 on November 13, 2025 at 22:35
@claude

claude bot commented Nov 13, 2025

Code Review: WebSocket Hibernation & Keepalive

This PR implements hibernating WebSocket request tracking and keepalive functionality for the Pegboard gateway. I've reviewed the changes and have the following feedback:

Architecture & Design ✅

Strengths:

  • Well-structured separation of concerns with new hibernating_request ops module
  • Smart use of timestamp-based indexing in HibernatingRequestKey for efficient range queries
  • Keepalive mechanism with jitter (0-128ms) to prevent stampeding herd problem
  • Protocol v3 schema updates properly integrate hibernating request IDs into CommandStartActor

Code Quality Observations

1. Import Cleanup ✅ (engine/packages/gasoline/src/db/kv/keys/worker.rs:1-5)

Good cleanup removing the redundant use std::result::Result::Ok; and simplifying anyhow imports. This follows the CLAUDE.md guideline about not glob importing from anyhow.

2. Message Resending Logic Simplified ⚠️ (engine/packages/pegboard-gateway/src/shared_state.rs:325-345)

The removal of last_msg_index parameter and the complex acknowledgment logic in resend_pending_websocket_messages is a significant change.

Before: Selective resending based on what the runner last received
After: Always resend all pending messages

Questions:

  • Is there a potential for duplicate message delivery now?
  • How does this interact with the existing ack_pending_websocket_messages flow?
  • Consider documenting why this simplification is safe

3. Timeout Handling Refactored ✅ (engine/packages/pegboard-gateway/src/shared_state.rs:470-476)

Changed from sending TunnelMessageData::Timeout enum to using the drop_tx watch channel. This is cleaner but:

  • The error message "failed to send timeout msg" on line 475 could be more descriptive
  • Consider logging the request_id in the error path for debugging

4. WebSocket Open Message Conditional ⚠️ (engine/packages/pegboard-gateway/src/lib.rs:286-362)

The new after_hibernation flag prevents sending duplicate ToClientWebSocketOpen messages when reconnecting.

Potential Issue:

let can_hibernate = if after_hibernation {
    true  // Line 297
} else {
    // ... send open message and get can_hibernate from response
}

This assumes that if we're reconnecting after hibernation, the actor still supports hibernation. Is it possible for an actor's hibernation capability to change between disconnects? If so, we might need to re-verify.

5. Database Operations ✅ (engine/packages/pegboard/src/ops/actor/hibernating_request/)

upsert.rs (lines 24-30): Good pattern - read old timestamp, delete old key, write new key with updated timestamp. This ensures the timestamp index stays current.

delete.rs (lines 24-30): Properly cleans up both the timestamped actor key and the request data key.

list.rs (lines 28-38): Smart range query from ping_threshold_ts to end of actor subspace to find stale requests.

Concern: All three operations use Serializable isolation level. Given that upsert is called frequently (every 45 seconds per hibernating connection), consider if ReadCommitted would be sufficient for better performance.

6. Keepalive Task ✅ (engine/packages/pegboard-gateway/src/lib.rs:585-608)

Strengths:

  • Interval set to half of threshold (45s default) provides good margin
  • Jitter prevents thundering herd
  • Proper cleanup with keepalive_handle.abort() on line 612
  • Clean error handling with ? propagation

Consideration:
The keepalive task will fail silently if database operations error. The Result<()> is never checked. While this is okay (connection will eventually time out), consider logging errors:

// keepalive_loop is a hypothetical extraction of the ping loop body; since
// errors are handled here, the task returns () and no Result annotation is needed.
let keepalive_handle = tokio::spawn(async move {
    if let Err(e) = keepalive_loop(&ctx, actor_id, unique_request_id).await {
        tracing::warn!(?e, "keepalive task failed");
    }
});

7. Protocol Schema Changes ✅ (engine/sdks/schemas/runner-protocol/v3.bare:175)

Adding hibernatingRequestIds: list<data> to CommandStartActor is straightforward and backwards compatible (assuming proper version handling).

8. Dynamic Population ✅ (engine/packages/pegboard-runner/src/tunnel_to_ws_task.rs:62-79)

Smart injection of hibernating request IDs right before sending to runner. This ensures the list is always current without needing to store it.

9. Logging Level Change ✅ (engine/packages/cache-purge/src/lib.rs:12)

Changed from info! to debug! - appropriate for reducing log noise on routine operations.

Security Considerations

No Critical Issues Found ✅

  • UUID generation for request IDs is secure
  • No SQL injection vectors (using typed database keys)
  • Timeout values are configurable via config, not user input
  • Rate limiting via MAX_PENDING_MSGS_SIZE_PER_REQ is preserved

Performance Considerations

Database Load ⚠️

With many hibernating connections, you'll have:

  • Frequent upserts (every ~45s per connection)
  • Timestamp updates requiring delete + insert of index key

Recommendation: Monitor database write load in production. Consider batching updates or using a less frequent keepalive interval if needed.

Memory Usage ✅

The pending_ws_msgs size limit (MAX_PENDING_MSGS_SIZE_PER_REQ = 1MB) prevents unbounded growth.

Testing Considerations

Missing Test Coverage ⚠️

No new tests were added for:

  1. Hibernating request lifecycle (upsert, list, delete)
  2. Keepalive task behavior
  3. Message resending after hibernation
  4. after_hibernation flag handling

Recommendation: Add integration tests covering the full hibernation cycle.

Minor Issues

  1. engine/packages/pegboard/src/keys/actor.rs:320 - The request_id field is public while others are private. Is this intentional?

  2. engine/packages/config/src/config/pegboard.rs:89-92 - Consider documenting what happens if a request doesn't ping within the threshold (is it eligible for cleanup? will it fail?).

  3. Typo in commit message: "kee…palive" appears truncated

Recommendations Summary

  1. Accept the import cleanup and logging changes
  2. ⚠️ Document why removing last_msg_index is safe - potential duplicate message implications
  3. ⚠️ Consider adding error logging to keepalive task
  4. ⚠️ Add test coverage for hibernation lifecycle
  5. ⚠️ Verify the after_hibernation assumption about persistent hibernation capability
  6. ℹ️ Consider if Serializable isolation is necessary for all hibernation ops
  7. ℹ️ Monitor database write load from frequent keepalive updates in production

Overall Assessment

This is a solid implementation of WebSocket hibernation keepalive functionality. The code follows the repository's conventions (per CLAUDE.md), uses appropriate error handling patterns, and has good separation of concerns. The main areas for improvement are around test coverage and documenting the message resending simplification.

Recommendation: Approve with suggestions for follow-up testing and documentation improvements.

@claude

claude bot commented Nov 13, 2025

Pull Request Review: Hibernating WebSocket Fix & Keepalive

Summary

This PR implements a keepalive mechanism for hibernating WebSocket connections and fixes issues with the WebSocket open message handling during reconnection. The changes are well-structured and address important reliability concerns in the pegboard gateway.


Code Quality & Best Practices

Strengths:

  • ✅ Well-organized module structure for hibernating request operations (upsert/delete/list)
  • ✅ Proper use of database transactions with appropriate isolation levels
  • ✅ Good separation of concerns between gateway logic and state management
  • ✅ Consistent error handling patterns following the codebase's custom error system
  • ✅ Appropriate use of tracing for debugging and observability

Areas for Improvement:

  1. Import Organization (engine/packages/gasoline/src/db/kv/keys/worker.rs:1-3)

    • ✅ Good: Cleaned up redundant imports
    • The change from use std::result::Result::Ok; use anyhow::*; to use anyhow::Result; follows CLAUDE.md guidelines to avoid glob imports
  2. Logging Level (engine/packages/cache-purge/src/lib.rs:12)

    • Changed from info! to debug! - appropriate for reducing log noise, but ensure this won't hide important subscription status during debugging

Potential Bugs & Issues

  1. Race Condition in Keepalive Task (engine/packages/pegboard-gateway/src/lib.rs:585-608)

    let keepalive_handle: JoinHandle<Result<()>> = tokio::spawn(async move {
        // ... keepalive loop
    });
    let res = self.handle_websocket_hibernation_inner(client_ws).await;
    keepalive_handle.abort();
    • ⚠️ Issue: If the keepalive task errors, the error is silently discarded
    • Recommendation: Consider logging errors from the keepalive task or handling them more explicitly
  2. Message Resend Logic Simplified (engine/packages/pegboard-gateway/src/shared_state.rs:325-345)

    • The removal of last_msg_index acknowledgment logic means ALL pending messages are resent unconditionally
    • ⚠️ Concern: This could lead to duplicate message delivery if the actor has already processed some messages
    • Question: Is idempotency guaranteed on the actor side? If not, this could cause issues
  3. Missing Await (engine/packages/pegboard/src/ops/actor/hibernating_request/upsert.rs:29)

    if let Some(last_ping_ts) = tx.read_opt(&last_ping_ts_key, Serializable).await? {
        tx.delete(&keys::actor::HibernatingRequestKey::new(
            input.actor_id,
            last_ping_ts,
            input.request_id,
        ));  // Missing .await?
    }
    • ✅ Actually correct - tx.delete() returns () not a future in this DB API
  4. Protocol Version Migration (engine/sdks/rust/runner-protocol/src/versioned.rs)

    • The protocol changes are extensive (1982 additions)
    • ⚠️ Recommendation: Ensure backward compatibility is maintained with older runners. The versioned protocol system should handle this, but worth testing

Performance Considerations

  1. Jitter Implementation (engine/packages/pegboard-gateway/src/lib.rs:598-600)

    let jitter = { rand::thread_rng().gen_range(0..128) };
    tokio::time::sleep(Duration::from_millis(jitter)).await;
    • ✅ Good: Prevents thundering herd problem
    • ✅ Reasonable jitter range (0-128ms) relative to ping interval (~45 seconds)
  2. Database Query in List Operation (engine/packages/pegboard/src/ops/actor/hibernating_request/list.rs:28-38)

    • Range query with WantAll streaming mode
    • ⚠️ Potential Issue: If an actor has many hibernating requests, this could be expensive
    • Question: What's the expected max number of hibernating requests per actor? Consider pagination if unbounded
  3. Keepalive Frequency (engine/packages/pegboard-gateway/src/lib.rs:586-591)

    let mut ping_interval = tokio::time::interval(Duration::from_millis(
        (ctx.config().pegboard().hibernating_request_eligible_threshold() / 2).try_into()?,
    ));
    • ✅ Sensible: Ping at half the timeout threshold (default 90s / 2 = 45s)
    • Database writes every 45 seconds per hibernating connection could add up at scale
    • Recommendation: Monitor database load if you expect many concurrent hibernating connections

Security Concerns

  1. WebSocket State Tracking (engine/packages/pegboard/src/keys/actor.rs:312-382)

    • New database keys track request IDs and ping timestamps
    • ✅ No obvious security issues
    • ✅ Proper cleanup in delete operation prevents stale data accumulation
  2. Request ID Handling

    • UUIDs are used for request IDs (good)
    • ✅ No obvious replay attack vectors
    • The after_hibernation flag prevents duplicate open messages on reconnect

Test Coverage

⚠️ Missing Test Coverage:

  • No new tests added for hibernating request operations (upsert/delete/list)
  • No tests for the keepalive mechanism
  • No tests for the after_hibernation flag behavior

Recommendations:

  1. Add unit tests for pegboard::ops::actor::hibernating_request::* operations
  2. Add integration tests for:
    • WebSocket reconnection after hibernation
    • Keepalive preventing premature timeout
    • Proper cleanup on connection close
  3. Test edge cases:
    • Rapid hibernation/wake cycles
    • Keepalive task failure scenarios
    • Multiple concurrent hibernating connections for same actor

Specific Code Comments

  1. engine/packages/pegboard-gateway/src/lib.rs:295-297

    // If we are reconnecting after hibernation, don't send an open message
    let can_hibernate = if after_hibernation {
        true
    • ✅ Logic is correct but could be simplified to let can_hibernate = after_hibernation || /* existing logic */
    • Actually on second look, the existing structure is clearer for readability
  2. engine/packages/guard-core/src/custom_serve.rs:37-38

    • ✅ Good: Added documentation for new parameter
    • Consider adding examples to doc comments
  3. engine/packages/universaldb/src/utils/keys.rs (line 1 change)

    • Curious what changed here? The diff was truncated

Database Schema Concerns

New Keys Introduced:

  • ACTOR / HIBERNATING_REQUEST / {actor_id} / {last_ping_ts} / {request_id} → ()
  • HIBERNATING_REQUEST / DATA / {request_id} / LAST_PING_TS → i64

Observations:

  • ✅ Good: Two-way lookup (by actor and by request)
  • ✅ Good: Timestamp in key allows range queries for timeout detection
  • ⚠️ Consideration: The upsert operation deletes old key and creates new one on timestamp change. This works but creates write amplification. Consider if this is necessary or if a simple timestamp update would suffice.

Overall Assessment

Strengths:

  • Addresses a real operational issue (hibernating connections timing out)
  • Well-structured code following repository conventions
  • Good use of existing patterns and infrastructure

Risks:

  • Limited test coverage for new critical functionality
  • Potential performance impact at scale (database writes per keepalive)
  • Message resend logic simplification may cause duplicate delivery

Recommendation: Approve with suggestions

  • The core implementation is solid
  • Add tests before merging
  • Monitor database load after deployment
  • Consider documenting the idempotency requirements for actor message handlers

Questions for Author

  1. Is message delivery idempotency guaranteed on the actor side since we removed the last_msg_index logic?
  2. What's the expected number of concurrent hibernating connections? Have you load-tested the keepalive mechanism?
  3. Should we add metrics/telemetry for hibernating connection counts and keepalive failures?

@claude

claude bot commented Nov 13, 2025

Code Review - PR #3449

Summary

This PR implements important fixes for WebSocket hibernation in the Guard service, specifically:

  1. Prevents sending duplicate open messages when reconnecting after hibernation
  2. Adds keepalive mechanism for hibernating WebSocket requests
  3. Improves timeout handling by switching from message-based to watch-based signaling
  4. Simplifies pending message resend logic by removing complex index tracking

Positive Observations

Well-Structured Changes

  • Clean separation of concerns: The after_hibernation flag cleanly distinguishes between initial connections and reconnections
  • Proper resource cleanup: Keepalive task is correctly aborted and database entries cleaned up in all code paths (pegboard-gateway/src/lib.rs:612-625)
  • Improved timeout mechanism: Switching from TunnelMessageData::Timeout enum to drop_rx watch channel is more idiomatic and efficient

Good Architecture Decisions

  • Hibernating request tracking: Using composite keys (actor_id, last_ping_ts, request_id) enables efficient range queries by timestamp (pegboard/src/keys/actor.rs:312-383)
  • Jitter in keepalive: Adding random jitter (0-128ms) prevents thundering herd problems (pegboard-gateway/src/lib.rs:599-600)
  • Proper transaction isolation: Using Serializable isolation level for hibernating request operations

Issues & Concerns

🔴 Critical: Potential Race Condition in Keepalive Task

Location: engine/packages/pegboard-gateway/src/lib.rs:585-608

The keepalive task spawns but doesn't wait for the first successful upsert before returning. If the connection fails immediately after hibernation starts, the cleanup in lines 618-623 might execute before the keepalive task inserts the first entry, leading to a leaked database entry.

Recommendation:

// Perform initial upsert before spawning background task
ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
    actor_id,
    request_id: unique_request_id,
})
.await?;

// Then spawn keepalive for subsequent updates
let keepalive_handle: JoinHandle<Result<()>> = tokio::spawn(async move {
    let mut ping_interval = tokio::time::interval(Duration::from_millis(
        (ctx.config()
            .pegboard()
            .hibernating_request_eligible_threshold()
            / 2)
        .try_into()?,
    ));
    // ... rest of implementation
});

🟡 Medium: Missing Error Handling in Keepalive Loop

Location: engine/packages/pegboard-gateway/src/lib.rs:595-607

The keepalive loop will panic on the first database error (via ? operator), potentially leaving the connection in an inconsistent state. The outer task won't know about the failure.

Recommendation:

loop {
    ping_interval.tick().await;
    
    let jitter = { rand::thread_rng().gen_range(0..128) };
    tokio::time::sleep(Duration::from_millis(jitter)).await;
    
    if let Err(err) = ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
        actor_id,
        request_id: unique_request_id,
    })
    .await {
        tracing::warn!(?err, request_id=?unique_request_id, "failed to upsert hibernating request keepalive");
        // Continue trying - connection might still be valid
    }
}

🟡 Medium: Simplified Resend Logic May Cause Duplicate Messages

Location: engine/packages/pegboard-gateway/src/shared_state.rs:325-345

The PR removes the last_msg_index parameter and always resends ALL pending messages. While simpler, this could cause:

  1. Duplicate message delivery if the actor already processed some messages
  2. Increased network traffic on reconnection
  3. Potential message ordering issues if messages are processed out of order

Previous behavior: Only resent messages with index > last_msg_index
New behavior: Always resends everything in pending_ws_msgs

Questions:

  • Is message deduplication handled at the actor level?
  • What's the expected behavior if duplicate messages are delivered?
  • Should we add logging to track when duplicates might occur?

🟡 Medium: Inconsistent Error Handling Pattern

Location: Multiple files

The PR mixes timeout signaling patterns:

  • HTTP requests use drop_rx.changed() (new pattern)
  • WebSocket messages use drop_rx.changed() (new pattern)
  • But still maintains separate timeout branches in some places

Example - pegboard-gateway/src/lib.rs:186-223:

tokio::select! {
    res = msg_rx.recv() => { /* ... */ }
    _ = drop_rx.changed() => {
        tracing::warn!("tunnel message timeout");
        return Err(ServiceUnavailable.build());
    }
}

This is actually good! Just noting the pattern changed from the old TunnelMessageData::Timeout enum.

🟢 Minor: Unused Import in gasoline

Location: engine/packages/gasoline/src/db/kv/keys/worker.rs:1-2

-use std::result::Result::Ok;
-
-use anyhow::*;
+use anyhow::Result;

Good cleanup! This follows the CLAUDE.md guideline to avoid glob imports from anyhow.

🟢 Minor: Missing Documentation

Location: engine/packages/config/src/config/pegboard.rs:53-58

The new hibernating_request_eligible_threshold config option has doc comments marked as "Experimental" but could benefit from more detail:

  • What happens if this is set too low?
  • What happens if it's set too high?
  • What's the relationship to the keepalive interval (which is threshold / 2)?

Recommendation:

/// How long after last ping before considering a hibernating request disconnected.
///
/// Hibernating requests send keepalive pings at half this interval to prevent
/// being considered disconnected. Setting this too low may cause premature
/// disconnections during network hiccups. Setting it too high increases memory
/// usage for disconnected clients.
///
/// Unit is in milliseconds.
///
/// **Experimental**
pub hibernating_request_eligible_threshold: Option<i64>,

🟢 Minor: Logging Level Change Seems Appropriate

Location: engine/packages/cache-purge/src/lib.rs:12

-tracing::info!(subject=?CACHE_PURGE_TOPIC, "subscribed to cache purge updates");
+tracing::debug!(subject=?CACHE_PURGE_TOPIC, "subscribed to cache purge updates");

This is a good change - subscription events are typically debug-level, not info.

Security Considerations

No security issues identified

  • Proper cleanup of resources prevents DoS via resource exhaustion
  • UUIDs used for request IDs prevent enumeration attacks
  • No SQL injection or similar issues (using typed keys)

Performance Considerations

Positive:

  • Removing complex index tracking simplifies the code and reduces CPU overhead
  • Watch channels are more efficient than polling/message passing for timeouts
  • Jitter prevents thundering herd when many connections hibernate simultaneously

Concerns:

  • Resending all pending messages could increase network usage on reconnection
  • Database upserts every ~45 seconds (threshold/2) per hibernating connection could add load
    • Mitigation: Consider batching upserts or using a longer interval if many concurrent hibernations are expected; see the batching sketch below
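
A hypothetical batching sketch: a single shared task flushes keepalive pings for all hibernating requests in one write. The flush closure stands in for whatever batched op would be added (no such op exists in the codebase today):

use std::{collections::HashSet, sync::Arc, time::Duration};
use tokio::sync::Mutex;
use uuid::Uuid;

async fn keepalive_flusher<F, Fut>(pending: Arc<Mutex<HashSet<(Uuid, Uuid)>>>, mut flush: F)
where
    F: FnMut(Vec<(Uuid, Uuid)>) -> Fut,
    Fut: std::future::Future<Output = ()>,
{
    // One shared ticker instead of one timer per hibernating connection.
    let mut interval = tokio::time::interval(Duration::from_secs(45));
    loop {
        interval.tick().await;
        // Snapshot the (actor_id, request_id) pairs registered elsewhere.
        let batch: Vec<(Uuid, Uuid)> = pending.lock().await.iter().copied().collect();
        if !batch.is_empty() {
            // Single round trip updates last_ping_ts for every tracked request.
            flush(batch).await;
        }
    }
}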

Test Coverage

⚠️ Not visible in this PR: The changes to test files (custom_serve.rs) only update method signatures to match the new after_hibernation parameter. Consider adding specific tests for:

  1. Reconnection after hibernation doesn't send duplicate open messages
  2. Keepalive task updates database correctly
  3. Cleanup happens on all error paths
  4. Edge case: hibernation start followed immediately by connection close

Recommendations Summary

Must Fix (Before Merge)

  1. Address keepalive race condition - Ensure first database upsert completes before returning from hibernation handler
  2. Handle keepalive errors gracefully - Don't panic the task on database errors

Should Consider

  1. Document the duplicate message behavior and confirm it's intentional
  2. Add integration tests for the new hibernation reconnection flow
  3. Monitor database load from keepalive upserts in production

Nice to Have

  1. Improve documentation for the new config option
  2. Add metrics for tracking hibernation reconnections and duplicate message delivery

Overall Assessment

Verdict: ✅ Approve with minor concerns

This is a well-structured PR that fixes important issues in WebSocket hibernation. The code quality is generally high, following Rust best practices and the repository's conventions. The main concerns are around edge cases in the keepalive task that should be addressed before merge.

The simplification of the pending message resend logic is a trade-off between code complexity and potential duplicate messages - this seems reasonable if the actor layer handles deduplication, but should be verified.

Great work on improving the timeout mechanism and properly tracking hibernating connections!

@NathanFlurry NathanFlurry force-pushed the 11-10-fix_guard_handle_websocket_tasks_during_shutdown branch from 4039392 to 6546966 Compare November 13, 2025 22:42
@NathanFlurry NathanFlurry force-pushed the 11-11-fix_guard_dont_send_open_msg_for_hibernating_ws_hibernating_ws_keepalive branch from e5ca471 to 5b5466a Compare November 13, 2025 22:42
@graphite-app
Contributor

graphite-app bot commented Nov 13, 2025

Merge activity

@claude

claude bot commented Nov 13, 2025

Code Review

Summary

This PR implements hibernating WebSocket support and keepalive functionality to prevent disconnected hibernating requests. The changes span multiple packages including pegboard-gateway, pegboard, guard-core, and runner-protocol.

Positive Aspects

  1. Good architectural approach: The hibernation state management is well-structured with proper separation of concerns between gateway and pegboard packages.

  2. Proper database key design: The HibernatingRequestKey with last_ping_ts index enables efficient queries for stale connections in pegboard/src/ops/actor/hibernating_request/list.rs.

  3. Consistent error handling: Follows the project's custom error system pattern correctly with RivetError derive macros.

  4. Configuration management: Adds configurable hibernating_request_eligible_threshold with sensible defaults (90s) in config/pegboard.rs:86-92.

Issues and Concerns

1. Critical: Keepalive task not properly awaited (pegboard-gateway/src/lib.rs:577-627)

If the keepalive task panics or encounters an error, it is silently ignored. The JoinHandle result is never checked after abort().

Recommendation: Check the join handle result after abort and log any non-cancellation errors.
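
A minimal sketch of that check as a small helper; the JoinHandle<Result<()>> type matches the keepalive task spawned above:

use tokio::task::JoinHandle;

async fn shutdown_keepalive(handle: JoinHandle<anyhow::Result<()>>) {
    handle.abort();
    match handle.await {
        Ok(Ok(())) => {}
        Ok(Err(err)) => tracing::warn!(?err, "keepalive task exited with error"),
        // Cancellation is the expected outcome after abort().
        Err(join_err) if join_err.is_cancelled() => {}
        Err(join_err) => tracing::error!(?join_err, "keepalive task panicked"),
    }
}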

2. Potential race condition in upsert operation (pegboard/src/ops/actor/hibernating_request/upsert.rs:24-30)

Uses the Serializable isolation level, which should prevent phantom reads, but it is worth confirming that the read-delete-write pattern cannot let one of two racing upserts delete the wrong entry.

Recommendation: Consider using Snapshot isolation level or document why Serializable is safe here.

3. Missing bounds check on jitter (pegboard-gateway/src/lib.rs:600-601)

Magic number (128ms) without explanation. Why this specific range?

Recommendation: Add a constant with documentation explaining the jitter bounds choice.
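
A small sketch of how that could look; the constant name and doc comment are suggestions, and 128 ms simply mirrors the currently hardcoded range:

use std::time::Duration;

use rand::Rng;

/// Maximum random delay added before each keepalive upsert so that many
/// hibernating connections do not all write at the same instant. Kept small
/// relative to the ~45 s ping interval.
const KEEPALIVE_JITTER_MAX_MS: u64 = 128;

async fn jitter_sleep() {
    let jitter = rand::thread_rng().gen_range(0..KEEPALIVE_JITTER_MAX_MS);
    tokio::time::sleep(Duration::from_millis(jitter)).await;
}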

4. Inconsistent after_hibernation handling (pegboard-gateway/src/lib.rs:295-298)

When reconnecting after hibernation, can_hibernate is hardcoded to true without verifying the actor still supports hibernation. The actor state might have changed.

Recommendation: Consider fetching current hibernation capability or add a comment explaining why it is safe to assume true.

5. Unbounded loop in keepalive task (pegboard-gateway/src/lib.rs:587-609)

No error handling if upsert repeatedly fails. Task will keep retrying forever, potentially spamming logs.

Recommendation: Add retry limit or exponential backoff on persistent failures.
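
One possible shape for that, sketched with an injected closure standing in for the hibernating_request upsert; the retry limit and backoff constants are illustrative only:

use std::time::Duration;

use anyhow::Result;

async fn keepalive_with_backoff<F, Fut>(mut upsert: F, interval: Duration)
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<()>>,
{
    let mut ticker = tokio::time::interval(interval);
    let mut consecutive_failures: u32 = 0;
    loop {
        ticker.tick().await;
        match upsert().await {
            Ok(()) => consecutive_failures = 0,
            Err(err) => {
                consecutive_failures += 1;
                tracing::warn!(?err, consecutive_failures, "keepalive upsert failed");
                if consecutive_failures >= 5 {
                    tracing::error!("stopping keepalive after repeated failures");
                    break;
                }
                // Bounded exponential backoff before the next attempt.
                let backoff = Duration::from_millis(500 * 2u64.pow(consecutive_failures.min(6)));
                tokio::time::sleep(backoff).await;
            }
        }
    }
}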

6. Missing test coverage

No tests found for the new hibernating request operations or keepalive functionality. This is a complex feature with race conditions and timing concerns.

Recommendation: Add integration tests covering:

  • Keepalive preventing stale request cleanup
  • Reconnection after hibernation
  • Concurrent upsert/delete operations
  • Stale request identification and cleanup

7. Import cleanup (gasoline/src/db/kv/keys/worker.rs:1-4)

The PR includes an unrelated import cleanup that should be in a separate commit. This is good hygiene but pollutes the diff.

8. Large protocol schema changes (runner-protocol/v3.bare)

Added 403 lines to the protocol schema. Ensure:

  • Backward compatibility is maintained
  • Version migration is handled
  • Schema changes are documented

Performance Considerations

  1. Database hotspot: All keepalive tasks will write to the same last_ping_ts key every 45 seconds. With many hibernating connections, this could become a write hotspot. Consider batching updates, using a write-ahead log pattern, or monitoring write contention.

  2. Jitter implementation: Current jitter (0-128ms) is small relative to the ~45s interval. Consider increasing it if a thundering herd is observed.

Security Considerations

  1. Request ID validation: Ensure unique_request_id cannot be spoofed or reused maliciously to hijack connections.

  2. Hibernation timeout: The 90s threshold seems reasonable but should be documented as a security parameter - longer timeouts increase resource holding potential.

Style and Convention Notes

  1. Follows Rust formatting with hard tabs per rustfmt.toml
  2. Uses structured logging correctly
  3. Lowercase log messages
  4. Workspace dependencies used correctly
  5. Debug log downgraded unnecessarily (cache-purge/src/lib.rs:12) - may hinder debugging

Recommendations Summary

Must Fix:

  1. Handle keepalive task errors properly
  2. Add test coverage for hibernation flows
  3. Document or fix isolation level in upsert

Should Fix:

  4. Add bounds constant for jitter with documentation
  5. Add retry logic to keepalive loop
  6. Verify after_hibernation assumption safety

Nice to Have:

  7. Split unrelated import cleanup to separate commit
  8. Monitor database write contention in production
  9. Consider increasing jitter range if needed

Overall Assessment

The implementation is solid and follows project conventions well. The main concerns are around error handling in the keepalive task and missing test coverage. The database operations look correct but would benefit from explicit isolation level justification. Once the critical issues are addressed, this should be good to merge.

Recommendation: Request changes for items 1-3, then approve after fixes.

@NathanFlurry NathanFlurry changed the base branch from 11-10-fix_guard_handle_websocket_tasks_during_shutdown to graphite-base/3449 November 13, 2025 22:47
@NathanFlurry NathanFlurry force-pushed the 11-11-fix_guard_dont_send_open_msg_for_hibernating_ws_hibernating_ws_keepalive branch from 5b5466a to 73b2789 Compare November 13, 2025 22:48
@NathanFlurry NathanFlurry changed the base branch from graphite-base/3449 to main November 13, 2025 22:48
@NathanFlurry NathanFlurry merged commit 450d9e5 into main Nov 13, 2025
6 of 14 checks passed
@NathanFlurry NathanFlurry deleted the 11-11-fix_guard_dont_send_open_msg_for_hibernating_ws_hibernating_ws_keepalive branch November 13, 2025 22:49
@claude

claude bot commented Nov 13, 2025

PR Review: WebSocket Hibernation & Keepalive Improvements

Thank you for this comprehensive PR! This adds important functionality for handling hibernating WebSocket connections. Here's my detailed review:


Strengths

1. Well-Designed Architecture

  • The separation of concerns between gateway, shared state, and database operations is clean
  • The keepalive mechanism using periodic pings with jitter (lines 598-600 in pegboard-gateway/src/lib.rs) is a good pattern to prevent thundering herd
  • The use of a separate handle_websocket_hibernation_inner method improves testability

2. Protocol Evolution Handled Correctly

  • Protocol v3 adds hibernatingRequestIds field to CommandStartActor appropriately
  • Backward compatibility maintained through version converters setting last_msg_index: -1 for older versions
  • Clean separation between protocol versions

3. Database Schema

  • Two-way indexing (by request_id and by actor_id+timestamp) is smart for efficient queries
  • Using timestamp-based ordering for eligible hibernating requests is a good pattern
  • Proper cleanup in delete operations

🔍 Issues & Concerns

1. Critical: Race Condition in Keepalive Task ⚠️

Location: pegboard-gateway/src/lib.rs:585-608

The keepalive task doesn't handle errors properly - if the upsert operation fails, the task silently exits and keepalive stops working. The main code only calls .abort() but never checks if the task panicked or errored.

Recommendation: Wrap the operation in if let Err(e) and log errors without exiting the loop.

2. Critical: Removal of last_msg_index Logic ⚠️

Location: pegboard-gateway/src/shared_state.rs:325-345

The previous implementation had selective message resending based on last_msg_index. The new code unconditionally resends ALL pending messages.

Issue: This could cause duplicate message delivery if the actor already received some messages before disconnecting. The ToServerWebSocketMessageAck system is still in place (lines 347-380), suggesting messages should be tracked.

Questions:

  1. Is duplicate message delivery acceptable in your protocol?
  2. If not, why was the selective resending removed?
  3. The last_msg_index field is still being converted in protocol versions - is it deprecated?

Recommendation: Either document that duplicate delivery is acceptable and remove the acknowledgment system entirely, or restore the selective resending logic.

3. Medium: Inconsistent Error Handling

Location: pegboard-gateway/src/lib.rs:265-424

The websocket message handling has mixed error patterns - one branch breaks, the other returns. Makes control flow harder to follow.

Recommendation: Make error handling consistent - either both return or both break.

4. Medium: Missing Validation

Location: pegboard/src/ops/actor/hibernating_request/list.rs:28

No validation that hibernating_request_eligible_threshold won't cause underflow or invalid timestamps. While unlikely with the default 90 seconds, runtime config could be misconfigured.

Recommendation: Add validation or use saturating_sub.
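
For example, a minimal sketch using saturating arithmetic; the function and variable names are illustrative, not the actual list.rs code:

/// Compute the cutoff timestamp for eligible hibernating requests without
/// risking underflow if the configured threshold is misconfigured.
fn eligible_cutoff(now_ms: i64, threshold_ms: i64) -> i64 {
    // Clamp negative thresholds to zero, then subtract without wrapping.
    now_ms.saturating_sub(threshold_ms.max(0))
}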

5. Low: Logging Levels

Location: cache-purge/src/lib.rs:12

Changed from tracing::info! to tracing::debug! - this is unrelated to the PR's main purpose. Consider separating cleanup changes into separate commits.


🧪 Test Coverage

Concern: No new tests were added for the hibernation keepalive functionality. The only test file modified (guard-core/tests/custom_serve.rs) just updates function signatures.

Recommendations:

  1. Add integration test for keepalive task functionality
  2. Add test for hibernating request lifecycle (upsert → list → delete)
  3. Add test for reconnection after hibernation with after_hibernation=true
  4. Add test for the eligible threshold filtering in list operation

📝 Code Style & Conventions

Following CLAUDE.md Guidelines: ✅

  • ✅ Structured logging used correctly
  • ✅ Lowercase log messages
  • ✅ Workspace dependencies used properly
  • ✅ Error handling with anyhow::Result
  • ✅ No glob imports from anyhow
  • ✅ Imports at top of files

🔒 Security Considerations

  1. DOS Risk: The keepalive mechanism could potentially be abused if many clients enter hibernation simultaneously. The jitter helps but consider rate limiting or maximum hibernating connections per actor.

  2. Memory Growth: pending_ws_msgs has a size limit (MAX_PENDING_MSGS_SIZE_PER_REQ = 1 MiB) which is good, but there's no limit on the number of hibernating requests per actor. Consider adding a maximum.


Performance Considerations

  1. Database Queries: The list operation does a range scan which is efficient given the timestamp-based indexing.

  2. Keepalive Frequency: Ping interval is set to hibernating_request_eligible_threshold / 2 (45 seconds default). This seems reasonable but could generate significant database load with many hibernating connections.

  3. Resending All Messages: As mentioned in issue 2 above, unconditionally resending all pending messages could cause bandwidth waste if the list is large.


🎯 Summary

High Priority:

  1. ⚠️ Fix error handling in keepalive task
  2. ⚠️ Clarify/fix the removed last_msg_index logic

Medium Priority:

  3. Make error handling patterns consistent
  4. Add validation for timestamp calculations
  5. Add test coverage

Low Priority:

  6. Consider separating unrelated logging changes
  7. Document DOS/memory growth mitigations

Overall, this is a solid implementation with good architectural decisions. The main concerns are around error handling and the message resending logic change. Once those are addressed, this should be good to merge!


Generated via Claude Code review

@claude

claude bot commented Nov 13, 2025

PR Review: WebSocket Hibernation & Keep-Alive Implementation

I've reviewed this PR which implements hibernating WebSocket request tracking and keep-alive functionality. Here's my feedback:

✅ Strengths

  1. Well-structured implementation: The hibernating request tracking system with upsert/list/delete operations follows good patterns
  2. Proper database indexing: Using composite keys (actor_id, last_ping_ts, request_id) for efficient range queries
  3. Good error handling: Proper use of the custom error system throughout
  4. Configuration management: Threshold made configurable via hibernating_request_eligible_threshold
  5. Proper cleanup: Keep-alive task properly aborted and entries deleted when no longer needed

🔍 Code Quality Observations

Positive

  • Follows project conventions (hard tabs, structured logging with tracing)
  • Uses workspace dependencies correctly
  • Proper transaction isolation levels (Serializable)
  • Good separation of concerns with dedicated ops modules

Minor Issues

  1. Import cleanup: Good fix in gasoline/src/db/kv/keys/worker.rs removing unnecessary imports ✓
  2. Logging level change: Changed cache-purge subscription from info to debug - appropriate for reducing noise

🐛 Potential Issues

  1. Race condition in upsert operation (pegboard/src/ops/actor/hibernating_request/upsert.rs:24-30):

    • Read-delete-write pattern could have issues if multiple upserts happen concurrently for the same request_id
    • The old entry deletion uses last_ping_ts from read, but another concurrent upsert might have already changed it
    • Consider if this needs additional locking or if the transaction isolation level handles this adequately
  2. Keep-alive task error handling (pegboard-gateway/src/lib.rs:590):

    • The spawned task runs in an infinite loop but errors are not logged before the loop exits
    • If hibernating_request::upsert fails, the error is silently lost
    • Recommendation: Add error logging in the loop before propagating errors
  3. Message replay logic removed (pegboard-gateway/src/shared_state.rs:334):

    • The last_msg_index parameter was removed from resend_pending_websocket_messages
    • Previous logic had acknowledgment handling to avoid re-sending already processed messages
    • Now all pending messages are resent without checking if they were already received
    • Concern: Could this cause duplicate message processing? Verify this is intentional and actors handle duplicate messages properly
  4. Protocol breaking change:

    • Removed lastMsgIndex field from ToServerWebSocketOpen in the protocol
    • This is a v3 protocol change - ensure all runners are updated before deploying
    • No migration path for v2 runners
  5. Timeout mechanism changed (pegboard-gateway/src/shared_state.rs:475):

    • Changed from sending TunnelMessageData::Timeout to dropping drop_tx
    • All code paths now use drop_rx.changed() to detect timeouts
    • Verify all select branches handle this correctly

🔒 Security Considerations

  1. Resource exhaustion: The keep-alive mechanism pings every threshold/2 ms. With many hibernating connections, this could generate significant database traffic

    • Consider: Is there a limit on hibernating requests per actor?
    • The jitter (0-128ms) helps prevent stampeding herd, which is good
  2. UUID handling: Request IDs are properly typed as Uuid and serialized correctly ✓

🧪 Test Coverage

Concern: No new tests added for:

  • Hibernating request CRUD operations
  • Keep-alive task functionality
  • Message replay after hibernation
  • Concurrent hibernation scenarios

Recommendation: Add integration tests covering:

  • Hibernating request lifecycle (create, keep-alive, reconnect, delete)
  • Timeout scenarios when keep-alive stops
  • Message replay correctness
  • Concurrent connections to same actor

📝 Additional Notes

  1. Configuration default (config/pegboard.rs:91): 90 second threshold seems reasonable, but document why threshold/2 for ping interval
  2. Protocol schema: The v3.bare schema looks correct with hibernatingRequestIds: list<data> added to CommandStartActor
  3. TypeScript SDK: Changes look aligned with protocol updates

Recommendations

High Priority:

  • Add error logging in keep-alive task
  • Verify message replay deduplication is handled correctly or add test coverage
  • Add tests for hibernating request operations

Medium Priority:

  • Document the race condition behavior in upsert (or add locking if needed)
  • Document migration path for protocol v3
  • Consider adding a limit on hibernating requests per actor

Low Priority:

  • Document the threshold/2 ping interval rationale

Overall, this is a solid implementation of a complex feature. The main concerns are around test coverage and the message replay logic change. Great work on the structured implementation! 🚀

@claude

claude bot commented Nov 13, 2025

PR Review: WebSocket Hibernation and Keepalive Implementation

Summary

This PR implements WebSocket hibernation request tracking and keepalive functionality for the pegboard-gateway. The changes allow hibernating WebSocket connections to be properly tracked and resumed when actors restart.

Positive Aspects

  1. Well-structured hibernation tracking using a composite key (actor_id, last_ping_ts, request_id) for efficient timestamp-based querying
  2. Protocol changes are backward compatible - hibernatingRequestIds and after_hibernation are additive
  3. Proper cleanup - keepalive task is aborted and hibernating request entries are cleaned up appropriately
  4. Improved message timeout handling using drop_rx watch channel instead of enum variant

Critical Issues

  1. Race condition in hibernating request upsert (upsert.rs:24-29) - The read-delete-write pattern could have issues with concurrent upserts; confirm that the Serializable isolation level rules this out, or add explicit coordination.

  2. Missing error handling for keepalive task (lib.rs:608) - Task result not checked when aborted. Should log errors.

  3. Removed message acknowledgment logic (shared_state.rs:322-342) - Previous implementation tracked already-acknowledged messages using last_msg_index. Now all pending messages are resent. Document why this is safe or re-implement deduplication.

Medium Priority Issues

  1. Magic number for jitter (lib.rs:598) - Hardcoded 0..128 range. Extract to named constant.
  2. Missing validation for hibernating_request_eligible_threshold - Could be negative/zero, breaking calculation.
  3. Unclear cleanup logic for pending WebSocket messages when the size limit is exceeded.

Testing Concerns

  • No new tests for hibernating request functionality
  • Need integration tests for: actor restart during keepalive, rapid hibernation/wake cycles, message replay
  • Edge cases need coverage

Performance

  • Database writes every ~45 seconds per hibernating connection could be significant load
  • Monitor in production, consider batching if needed

Recommendations

Must Fix: (1) Address race condition (2) Document/fix message ack removal (3) Add config validation

Should Fix: (4) Add integration tests (5) Extract magic numbers (6) Add protocol documentation

Overall Assessment

Solid implementation with good design, but edge cases and potential race conditions should be addressed before merging. Lack of tests is concerning for critical infrastructure.

Recommendation: Request changes for critical issues, then approve after fixes.
