-
Notifications
You must be signed in to change notification settings - Fork 131
fix(guard): dont send open msg for hibernating ws, hibernating ws keepalive #3449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(guard): dont send open msg for hibernating ws, hibernating ws keepalive #3449
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
3 Skipped Deployments
|
67aa81c to
5d18828
Compare
380ccdd to
4c115b8
Compare
PR Review: WebSocket Hibernation ImprovementsSummaryThis PR introduces important improvements to WebSocket hibernation handling, including:
Code Quality & Best PracticesStrengths ✅
Areas for Improvement 🔧
Potential Issues 🐛
Performance Considerations ⚡
Security Concerns 🔒
Test Coverage 🧪
Documentation 📚
Minor Issues
Recommendations
Overall AssessmentThis is a solid improvement to the WebSocket hibernation system. The core logic is sound, but there are some edge cases and potential race conditions that should be addressed. The simplified message resend logic is cleaner but may have performance implications that should be evaluated. Recommendation: Request changes to address the high-priority items, particularly around error handling and potential orphaned database entries. 🤖 Generated with Claude Code |
PR Review: Fix Guard WebSocket Hibernation BehaviorOverviewThis PR addresses two critical issues in WebSocket hibernation:
The changes span 30 files with 1,982 additions and 460 deletions. Overall, the implementation is solid with good architectural decisions. Positive Aspects1. Improved Timeout Handling Architecture ✅The refactoring from TunnelMessageData enum to using drop_rx watch channel is excellent. Cleaner separation of concerns and more reliable timeout detection in pegboard-gateway/src/lib.rs:316-346. 2. Hibernation Keepalive System ✅The keepalive implementation in pegboard-gateway/src/lib.rs:582-628 is well-designed with half of hibernating_request_eligible_threshold as ping interval, jitter to prevent thundering herd, and proper cleanup on exit. 3. Clean Database Key Design ✅The HibernatingRequestKey structure enables efficient range queries with composite key (actor_id, last_ping_ts, request_id) and separate LastPingTsKey for updates. 4. Protocol Enhancement ✅The after_hibernation parameter properly prevents redundant open messages during reconnection. Issues & ConcernsCRITICAL: Removed Message Acknowledgment LogicLocation: pegboard-gateway/src/shared_state.rs:325-345 The simplification of resend_pending_websocket_messages removed important logic. Now resends ALL pending messages unconditionally which could cause message duplication on reconnect and lost tracking capability. Recommendation: Verify if the actor-side acknowledgment system handles duplicates correctly, or restore acknowledgment-based filtering. Add integration tests for message deduplication during reconnection. MEDIUM: Import Cleanup InconsistencyCLAUDE.md states to not glob import from anyhow. The change in engine/packages/gasoline/src/db/kv/keys/worker.rs follows this correctly with use anyhow::Result, but other files may still have use anyhow::* violations. MEDIUM: Configuration Missing DocumentationLocation: engine/packages/config/src/config/pegboard.rs:53-58 The hibernating_request_eligible_threshold field needs better documentation including:
MEDIUM: Magic Number in JitterLocation: pegboard-gateway/src/lib.rs:599 The value 128 for jitter has no explanation or constant. Recommend extracting to KEEPALIVE_JITTER_MAX_MS constant. MINOR: Logging Level ChangeLocation: engine/packages/cache-purge/src/lib.rs:12 Changed from info to debug which seems unrelated to PR purpose but is probably correct for reducing log noise. Testing Recommendations
Security & PerformanceSecurity: No obvious security issues. Keepalive mechanism properly cleans up state on all exit paths. Performance:
Code Quality Score: 8/10Strengths: Clean architecture, good error handling, proper resource cleanup, well-structured database keys Areas for improvement: Critical message acknowledgment issue, import consistency, documentation, magic numbers Recommendation: Approve with follow-upThis PR significantly improves the hibernation system, but please address the critical concern about message acknowledgment before merging, or provide clarification on why the simplified approach is safe. 🤖 Generated with Claude Code |
5d18828 to
f45c96f
Compare
PR Review: Hibernating WebSocket Keepalive & Protocol UpdatesThis PR implements hibernating WebSocket request tracking with keepalive pings to prevent premature eviction, and fixes issues with duplicate open messages on reconnection. Overall, the implementation is solid with good architectural decisions, but there are some concerns to address. Strengths1. Well-Designed Hibernation State Management
2. Proper Concurrency Handling
3. Improved Message Flow Control
Issues & ConcernsCritical Issues1. Race Condition in Keepalive Cleanup (packages/pegboard-gateway/src/lib.rs:612-625)keepalive_handle.abort();
match &res {
Ok(HibernationResult::Continue) => {}
Ok(HibernationResult::Close) | Err(_) => {
self.ctx
.op(pegboard::ops::actor::hibernating_request::delete::Input {
actor_id: self.actor_id,
request_id: unique_request_id,
})
.await?;
}
}Problem: The keepalive task is aborted before the conditional delete. If the task is in the middle of an Recommendation: Either:
2. Error Handling in Keepalive Loop (packages/pegboard-gateway/src/lib.rs:602-607)ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
actor_id,
request_id: unique_request_id,
})
.await?;Problem: If the upsert operation fails, the entire loop terminates with Recommendation: if let Err(err) = ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
actor_id,
request_id: unique_request_id,
}).await {
tracing::error!(?err, request_id=?unique_request_id, "failed to update hibernating request keepalive");
// Consider whether to continue the loop or break
}Medium Priority Issues3. Missing Import Cleanup (packages/gasoline/src/db/kv/keys/worker.rs:1-3)The change from 4. Incomplete Error Context in Database OperationsIn if let Some(last_ping_ts) = tx.read_opt(&last_ping_ts_key, Serializable).await? {
tx.delete(&keys::actor::HibernatingRequestKey::new(
input.actor_id,
last_ping_ts,
input.request_id,
));
}While UniversalDB's transactional semantics may guarantee this works, there's no error handling or logging if the delete fails silently. 5. Potential Integer Overflow in Ping Interval (packages/pegboard-gateway/src/lib.rs:586-591)let mut ping_interval = tokio::time::interval(Duration::from_millis(
(ctx.config()
.pegboard()
.hibernating_request_eligible_threshold()
/ 2)
.try_into()?,
));The 6. Logging Level Change Seems Arbitrary (packages/cache-purge/src/lib.rs:12)- tracing::info!(subject=?CACHE_PURGE_TOPIC, "subscribed to cache purge updates");
+ tracing::debug!(subject=?CACHE_PURGE_TOPIC, "subscribed to cache purge updates");This change is unrelated to the PR and should be explained. If this is intentional noise reduction, it should be in a separate commit or mentioned in the PR description. Low Priority / Style Issues7. Inconsistent Error Handling PatternsIn 8. Magic Number for Jitterlet jitter = { rand::thread_rng().gen_range(0..128) };The 128ms maximum jitter should be a named constant or configuration value to make it easier to tune. Performance ConsiderationsPositive
Concerns
Security ConsiderationsGood Practices
Minor Concern
Test CoverageConcern: The PR only updates a test stub at Recommendations:
Protocol ChangesThe changes to
Recommendations SummaryMust Fix Before Merge
Should Fix
Nice to Have
ConclusionThe core implementation is architecturally sound and solves the hibernation keepalive problem effectively. However, the error handling gaps and potential race condition in cleanup need to be addressed before merging. The lack of comprehensive tests is also concerning for such a critical reliability feature. Recommendation: Request changes - primarily for error handling improvements and test coverage. |
f45c96f to
769c3f8
Compare
97572fd to
3362252
Compare
PR Review: WebSocket Hibernation Keepalive ImplementationI've reviewed this PR which adds hibernating WebSocket keepalive functionality and fixes the issue where open messages were being sent for hibernating connections. Overall, the implementation is solid with good architectural decisions. Here are my findings: ✅ Strengths
🔍 Issues & ConcernsCritical Issues
Medium Priority Issues
Minor Issues
📋 Testing Recommendations
🔒 Security Considerations
📊 Performance Considerations
📝 Documentation & Code Quality
SummaryThis is a solid implementation with good design choices. The main concerns are:
Recommendation: Address the critical issues before merging, particularly the keepalive error handling and the message acknowledgment logic. |
769c3f8 to
fc8f1fc
Compare
3362252 to
67f0a65
Compare
Code Review - PR #3449I've reviewed this PR which implements WebSocket hibernation improvements and keepalive functionality. Here's my comprehensive feedback: SummaryThis PR addresses two main issues:
Positive Aspects ✅1. Clean Database Schema Design
2. Proper Error Handling
3. Jitter Implementation
4. Removed Complex Deduplication Logic
Issues & Concerns
|
fc8f1fc to
e5ca471
Compare
67f0a65 to
4039392
Compare
Code Review: WebSocket Hibernation & KeepaliveThis PR implements hibernating WebSocket request tracking and keepalive functionality for the Pegboard gateway. I've reviewed the changes and have the following feedback: Architecture & Design ✅Strengths:
Code Quality Observations1. Import Cleanup ✅ (engine/packages/gasoline/src/db/kv/keys/worker.rs:1-5)Good cleanup removing the redundant 2. Message Resending Logic Simplified
|
Pull Request Review: Hibernating WebSocket Fix & KeepaliveSummaryThis PR implements a keepalive mechanism for hibernating WebSocket connections and fixes issues with the WebSocket open message handling during reconnection. The changes are well-structured and address important reliability concerns in the pegboard gateway. Code Quality & Best PracticesStrengths:
Areas for Improvement:
Potential Bugs & Issues
Performance Considerations
Security Concerns
Test Coverage
Recommendations:
Specific Code Comments
Database Schema ConcernsNew Keys Introduced:
Observations:
Overall AssessmentStrengths:
Risks:
Recommendation: ✅ Approve with suggestions
Questions for Author
|
Code Review - PR #3449SummaryThis PR implements important fixes for WebSocket hibernation in the Guard service, specifically:
Positive ObservationsWell-Structured Changes
Good Architecture Decisions
Issues & Concerns🔴 Critical: Potential Race Condition in Keepalive TaskLocation: The keepalive task spawns but doesn't wait for the first successful upsert before returning. If the connection fails immediately after hibernation starts, the cleanup in lines 618-623 might execute before the keepalive task inserts the first entry, leading to a leaked database entry. Recommendation: // Perform initial upsert before spawning background task
ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
actor_id,
request_id: unique_request_id,
})
.await?;
// Then spawn keepalive for subsequent updates
let keepalive_handle: JoinHandle<Result<()>> = tokio::spawn(async move {
let mut ping_interval = tokio::time::interval(Duration::from_millis(
(ctx.config()
.pegboard()
.hibernating_request_eligible_threshold()
/ 2)
.try_into()?,
));
// ... rest of implementation
});🟡 Medium: Missing Error Handling in Keepalive LoopLocation: The keepalive loop will panic on the first database error (via Recommendation: loop {
ping_interval.tick().await;
let jitter = { rand::thread_rng().gen_range(0..128) };
tokio::time::sleep(Duration::from_millis(jitter)).await;
if let Err(err) = ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
actor_id,
request_id: unique_request_id,
})
.await {
tracing::warn!(?err, request_id=?unique_request_id, "failed to upsert hibernating request keepalive");
// Continue trying - connection might still be valid
}
}🟡 Medium: Simplified Resend Logic May Cause Duplicate MessagesLocation: The PR removes the
Previous behavior: Only resent messages with index > Questions:
🟡 Medium: Inconsistent Error Handling PatternLocation: Multiple files The PR mixes timeout signaling patterns:
Example - pegboard-gateway/src/lib.rs:186-223: tokio::select! {
res = msg_rx.recv() => { /* ... */ }
_ = drop_rx.changed() => {
tracing::warn!("tunnel message timeout");
return Err(ServiceUnavailable.build());
}
}This is actually good! Just noting the pattern changed from the old 🟢 Minor: Unused Import in gasolineLocation: -use std::result::Result::Ok;
-
-use anyhow::*;
+use anyhow::Result;Good cleanup! This follows the CLAUDE.md guideline to avoid glob imports from anyhow. 🟢 Minor: Missing DocumentationLocation: The new
Recommendation: /// How long after last ping before considering a hibernating request disconnected.
///
/// Hibernating requests send keepalive pings at half this interval to prevent
/// being considered disconnected. Setting this too low may cause premature
/// disconnections during network hiccups. Setting it too high increases memory
/// usage for disconnected clients.
///
/// Unit is in milliseconds.
///
/// **Experimental**
pub hibernating_request_eligible_threshold: Option<i64>,🟢 Minor: Logging Level Change Seems AppropriateLocation: -tracing::info!(subject=?CACHE_PURGE_TOPIC, "subscribed to cache purge updates");
+tracing::debug!(subject=?CACHE_PURGE_TOPIC, "subscribed to cache purge updates");This is a good change - subscription events are typically debug-level, not info. Security Considerations✅ No security issues identified
Performance ConsiderationsPositive:
Concerns:
Test Coverage
Recommendations SummaryMust Fix (Before Merge)
Should Consider
Nice to Have
Overall AssessmentVerdict: ✅ Approve with minor concerns This is a well-structured PR that fixes important issues in WebSocket hibernation. The code quality is generally high, following Rust best practices and the repository's conventions. The main concerns are around edge cases in the keepalive task that should be addressed before merge. The simplification of the pending message resend logic is a trade-off between code complexity and potential duplicate messages - this seems reasonable if the actor layer handles deduplication, but should be verified. Great work on improving the timeout mechanism and properly tracking hibernating connections! |
4039392 to
6546966
Compare
e5ca471 to
5b5466a
Compare
Merge activity
|
Code ReviewSummaryThis PR implements hibernating WebSocket support and keepalive functionality to prevent disconnected hibernating requests. The changes span multiple packages including pegboard-gateway, pegboard, guard-core, and runner-protocol. Positive Aspects
Issues and Concerns1. Critical: Keepalive task not properly awaited (pegboard-gateway/src/lib.rs:577-627)If the keepalive task panics or encounters an error, it is silently ignored. The JoinHandle result is never checked after abort(). Recommendation: Check the join handle result after abort and log any non-cancellation errors. 2. Potential race condition in upsert operation (pegboard/src/ops/actor/hibernating_request/upsert.rs:24-30)Uses Serializable isolation level which may allow phantom reads. If two upsert operations race, one might delete the wrong entry. Recommendation: Consider using Snapshot isolation level or document why Serializable is safe here. 3. Missing bounds check on jitter (pegboard-gateway/src/lib.rs:600-601)Magic number (128ms) without explanation. Why this specific range? Recommendation: Add a constant with documentation explaining the jitter bounds choice. 4. Inconsistent after_hibernation handling (pegboard-gateway/src/lib.rs:295-298)When reconnecting after hibernation, can_hibernate is hardcoded to true without verifying the actor still supports hibernation. The actor state might have changed. Recommendation: Consider fetching current hibernation capability or add a comment explaining why it is safe to assume true. 5. Unbounded loop in keepalive task (pegboard-gateway/src/lib.rs:587-609)No error handling if upsert repeatedly fails. Task will keep retrying forever, potentially spamming logs. Recommendation: Add retry limit or exponential backoff on persistent failures. 6. Missing test coverageNo tests found for the new hibernating request operations or keepalive functionality. This is a complex feature with race conditions and timing concerns. Recommendation: Add integration tests covering:
7. Import cleanup (gasoline/src/db/kv/keys/worker.rs:1-4)The PR includes an unrelated import cleanup that should be in a separate commit. This is good hygiene but pollutes the diff. 8. Large protocol schema changes (runner-protocol/v3.bare)Added 403 lines to the protocol schema. Ensure:
Performance Considerations
Security Considerations
Style and Convention Notes
Recommendations SummaryMust Fix:
Should Fix: Nice to Have: Overall AssessmentThe implementation is solid and follows project conventions well. The main concerns are around error handling in the keepalive task and missing test coverage. The database operations look correct but would benefit from explicit isolation level justification. Once the critical issues are addressed, this should be good to merge. Recommendation: Request changes for items 1-3, then approve after fixes. |
5b5466a to
73b2789
Compare
6546966 to
da324bd
Compare
PR Review: WebSocket Hibernation & Keepalive ImprovementsThank you for this comprehensive PR! This adds important functionality for handling hibernating WebSocket connections. Here's my detailed review: ✅ Strengths1. Well-Designed Architecture
2. Protocol Evolution Handled Correctly
3. Database Schema
🔍 Issues & Concerns1. Critical: Race Condition in Keepalive Task
|
PR Review: WebSocket Hibernation & Keep-Alive ImplementationI've reviewed this PR which implements hibernating WebSocket request tracking and keep-alive functionality. Here's my feedback: ✅ Strengths
🔍 Code Quality ObservationsPositive
Minor Issues
🐛 Potential Issues
🔒 Security Considerations
🧪 Test CoverageConcern: No new tests added for:
Recommendation: Add integration tests covering:
📝 Additional Notes
RecommendationsHigh Priority:
Medium Priority:
Low Priority:
Overall, this is a solid implementation of a complex feature. The main concerns are around test coverage and the message replay logic change. Great work on the structured implementation! 🚀 |
PR Review: WebSocket Hibernation and Keepalive ImplementationSummaryThis PR implements WebSocket hibernation request tracking and keepalive functionality for the pegboard-gateway. The changes allow hibernating WebSocket connections to be properly tracked and resumed when actors restart. Positive Aspects
Critical Issues
Medium Priority Issues
Testing Concerns
Performance
RecommendationsMust Fix: (1) Address race condition (2) Document/fix message ack removal (3) Add config validation Should Fix: (4) Add integration tests (5) Extract magic numbers (6) Add protocol documentation Overall AssessmentSolid implementation with good design, but edge cases and potential race conditions should be addressed before merging. Lack of tests is concerning for critical infrastructure. Recommendation: Request changes for critical issues, then approve after fixes. |

No description provided.