Add timing instrumentation to debug River contract timeouts #1683
Closed
- Update freenet-stdlib to 0.1.9 (includes panic fix + NodeQuery APIs)
- Fix compilation error in node.rs for release builds
- Bump freenet and fdev versions to 0.1.14
- Add timing logs for contract PUT/GET execution in contract/mod.rs
- Warn when contract operations take >10ms (blocking message pipeline)
- Add timing for overall packet processing in peer_connection.rs
- This will help identify WASM execution bottlenecks causing channel overflow
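
The timing-and-warn pattern described in this commit can be sketched roughly as follows. This is a minimal, std-only illustration, not the code in contract/mod.rs: the helper name `timed`, the `label` parameter, and the use of `eprintln!` in place of the project's logging macros are all assumptions. Only the 10 ms threshold comes from the commit message.

```rust
use std::time::{Duration, Instant};

/// Threshold above which an operation is flagged as slow.
/// The 10 ms value is taken from the commit message; the constant name is hypothetical.
const SLOW_OP_THRESHOLD: Duration = Duration::from_millis(10);

/// Runs `op`, measures its wall-clock duration, and returns the result together
/// with the elapsed time and a flag saying whether the threshold was exceeded.
/// Hypothetical helper for illustration; not part of the Freenet codebase.
fn timed<T>(label: &str, op: impl FnOnce() -> T) -> (T, Duration, bool) {
    let start = Instant::now();
    let result = op();
    let elapsed = start.elapsed();
    let slow = elapsed > SLOW_OP_THRESHOLD;
    if slow {
        // Assumption: the real code would emit a structured log warning here;
        // eprintln! keeps this sketch free of external crates.
        eprintln!(
            "WARN: {} took {:?} (threshold {:?}); may be blocking the message pipeline",
            label, elapsed, SLOW_OP_THRESHOLD
        );
    }
    (result, elapsed, slow)
}
```

Wrapping a contract call as `timed("contract PUT", || execute_put(..))` (with `execute_put` standing in for whatever the real entry point is) would then surface any execution that holds the pipeline for longer than the threshold.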
- Track channel overflow and dropped packets immediately
- Monitor PUT operation start/end timing
- Log message routing through NetworkBridge
- Track UDP send performance and channel backlogs
- Add queue depth monitoring for outbound packets

This instrumentation will help identify:

1. Channel buffer overflows causing packet drops
2. Message routing failures
3. UDP send performance issues
4. Queue buildup locations
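
One way to track channel overflow and dropped packets is to send without blocking and count every rejected item. The sketch below assumes a bounded `std::sync::mpsc` channel purely to stay self-contained; the actual code may use a different channel type, and `MonitoredSender`, `send_or_drop`, and `dropped` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical wrapper around a bounded outbound queue that counts dropped items.
struct MonitoredSender<T> {
    tx: SyncSender<T>,
    dropped: AtomicU64,
}

impl<T> MonitoredSender<T> {
    /// Creates a bounded channel holding at most `capacity` queued items.
    fn new(capacity: usize) -> (Self, Receiver<T>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx, dropped: AtomicU64::new(0) }, rx)
    }

    /// Tries to enqueue `item` without blocking.
    /// Returns `true` if it was queued, `false` if it was dropped.
    fn send_or_drop(&self, item: T) -> bool {
        match self.tx.try_send(item) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) => {
                // Record the overflow immediately so drops are visible in the logs.
                let total = self.dropped.fetch_add(1, Ordering::Relaxed) + 1;
                eprintln!("WARN: channel full, packet dropped (total dropped: {})", total);
                false
            }
            Err(TrySendError::Disconnected(_)) => {
                eprintln!("ERROR: channel receiver has been dropped");
                false
            }
        }
    }

    /// Number of items dropped so far because the channel was full.
    fn dropped(&self) -> u64 {
        self.dropped.load(Ordering::Relaxed)
    }
}
```

Because the sender never blocks, a full queue shows up as a rising drop count instead of a stalled sender loop.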
- Track SuccessfulPut message reception and generation
- Log PUT state transitions to understand completion flow
- Add debug info to trace when operations move between states
- Focus on identifying why PUT completes with false status

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit addresses critical issues preventing stable connections to remote gateways, which has been blocking River functionality for weeks.

## Issues Fixed

### 1. Connecting Map Race Condition

**Problem**: When multiple connection attempts were made to the same gateway, subsequent attempts would fail with "connection attempt already in progress". The error handler would then remove the gateway from the connecting map, causing the successful connection to fail lookup with "No connecting entry found".

**Fix**: Modified handshake error handling to NOT remove entries for duplicate connection attempts. Only genuine failures now remove entries from the connecting map.

### 2. Gateway Channel Buffer Overflow

**Problem**: The new_connection_notifier channel had a buffer of only 10 and used blocking send(). Once 10 connections were established, the entire UDP packet processing loop would block, preventing all packet processing including keep-alives.

**Fix**:
- Increased buffer size from 10 to 1000 for gateways
- Changed from blocking send() to non-blocking try_send()
- Added logging to detect when channel is full

## Current Status

With these fixes, connections now successfully establish and the connecting map race condition is resolved. However, a different issue persists:

**Remaining Issue**: Remote gateways (specifically ziggy/Raspberry Pi) stop responding to keep-alive packets after ~20 seconds, causing connections to timeout at 30 seconds.

Pattern observed:
- 0-10s: Keep-alive sent & response received ✓
- 10-20s: Keep-alive sent & response received ✓
- 20-30s: Keep-alive sent but NO response ✗
- 30s: Connection timeout (as designed)

This appears to be a gateway-side issue where packet processing stops after 20 seconds.

## Next Steps

1. **Investigate Gateway-Side Issues**: Need to understand why remote gateways stop processing packets. Possible causes:
   - Thread starvation on Raspberry Pi
   - Resource exhaustion
   - Another blocking operation on gateway side
2. **Local Gateway Testing**: Set up a local gateway to reproduce and debug the issue in a controlled environment.
3. **Additional Instrumentation**: Add more detailed logging on the gateway side to identify where packet processing stalls.

Note: These fixes improve the situation significantly but don't fully resolve the River invitation hang issue, which requires fixing the gateway keep-alive problem.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
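
The connecting-map fix described in this commit can be illustrated with a small model. This is a sketch under stated assumptions, not the actual handshake code: the `Connecting` struct, the `HandshakeError` variants, the method names, and the `u64` attempt id stored as the map value are all hypothetical.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Hypothetical error type distinguishing duplicate attempts from real failures.
#[derive(Debug, PartialEq)]
enum HandshakeError {
    /// Corresponds to "connection attempt already in progress".
    AlreadyInProgress,
    /// Any genuine handshake failure.
    Failed(String),
}

/// Hypothetical stand-in for the connecting map; the value is a placeholder attempt id.
struct Connecting {
    pending: HashMap<SocketAddr, u64>,
}

impl Connecting {
    fn new() -> Self {
        Self { pending: HashMap::new() }
    }

    /// Registers a new attempt, rejecting duplicates for the same address.
    fn start(&mut self, addr: SocketAddr, attempt_id: u64) -> Result<(), HandshakeError> {
        if self.pending.contains_key(&addr) {
            return Err(HandshakeError::AlreadyInProgress);
        }
        self.pending.insert(addr, attempt_id);
        Ok(())
    }

    /// Error handler: only genuine failures remove the entry.
    /// A duplicate-attempt error must leave the original attempt's entry in place.
    fn on_error(&mut self, addr: SocketAddr, err: &HandshakeError) {
        match err {
            HandshakeError::AlreadyInProgress => {}
            HandshakeError::Failed(_) => {
                self.pending.remove(&addr);
            }
        }
    }

    /// Called when a connection is established; looks up and clears the entry.
    fn on_connected(&mut self, addr: SocketAddr) -> Option<u64> {
        self.pending.remove(&addr)
    }
}
```

With the pre-fix behavior, the duplicate-attempt branch would also remove the entry, so the later `on_connected` lookup for the first attempt would return `None`, which matches the "No connecting entry found" symptom.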
Closing old debugging PR for River contract timeouts. If this issue resurfaces, we can create a new PR with fresh investigation.
Summary
Investigation Context
River PUT/GET operations are timing out after 30 seconds. The working hypothesis is that WASM contract execution blocks the message pipeline, which in turn causes connection failures.
This instrumentation will help identify:
Test Plan
🤖 Generated with Claude Code