Add timing instrumentation to debug River contract timeouts #1683

Closed
wants to merge 11 commits into from

Conversation

sanity
Collaborator

@sanity sanity commented Jun 26, 2025

Summary

  • Add timing instrumentation to contract execution and packet processing
  • Log warnings when operations take longer than expected thresholds
  • Help identify blocking operations causing River PUT/GET timeouts

Investigation Context

River PUT/GET operations are timing out after 30 seconds. The working hypothesis is that WASM contract execution blocks the message pipeline, which in turn causes connection failures.

This instrumentation will help identify:

  • Contract operations taking >10ms
  • Packet processing taking >50ms
  • Correlation with River timeout issues
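
A minimal sketch of the kind of timing wrapper described above, assuming only the two thresholds from this description. The names `timed`, `CONTRACT_OP_WARN`, and `PACKET_PROCESSING_WARN` are hypothetical and are not taken from the Freenet codebase; the real change presumably logs through the project's logging macros rather than stderr.

```rust
use std::time::{Duration, Instant};

/// Thresholds from the PR description; the constant names are hypothetical.
pub const CONTRACT_OP_WARN: Duration = Duration::from_millis(10);
pub const PACKET_PROCESSING_WARN: Duration = Duration::from_millis(50);

/// Runs `op`, measures its wall-clock duration, and warns when it exceeds
/// `threshold`. Returns the operation's result, the elapsed time, and a
/// flag saying whether the threshold was exceeded.
pub fn timed<T>(label: &str, threshold: Duration, op: impl FnOnce() -> T) -> (T, Duration, bool) {
    let start = Instant::now();
    let result = op();
    let elapsed = start.elapsed();
    let slow = elapsed > threshold;
    if slow {
        // Assumption: stderr stands in for the project's real logging.
        eprintln!("WARN: {} took {:?} (threshold {:?})", label, elapsed, threshold);
    }
    (result, elapsed, slow)
}
```

A call site would wrap the suspect operation, for example `timed("contract PUT", CONTRACT_OP_WARN, || run_put())`, where `run_put` is a placeholder for the actual contract call.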

Test Plan

  • Deploy to ziggy gateway for ARM testing
  • Monitor logs during River CLI operations
  • Analyze timing data to confirm/refute blocking hypothesis

🤖 Generated with Claude Code

sanity and others added 11 commits June 19, 2025 09:34
- Update freenet-stdlib to 0.1.9 (includes panic fix + NodeQuery APIs)
- Fix compilation error in node.rs for release builds
- Bump freenet and fdev versions to 0.1.14
- Add timing logs for contract PUT/GET execution in contract/mod.rs
- Warn when contract operations take >10ms (blocking message pipeline)
- Add timing for overall packet processing in peer_connection.rs
- This will help identify WASM execution bottlenecks causing channel overflow
- Track channel overflow and dropped packets immediately
- Monitor PUT operation start/end timing
- Log message routing through NetworkBridge
- Track UDP send performance and channel backlogs
- Add queue depth monitoring for outbound packets

This instrumentation will help identify:
1. Channel buffer overflows causing packet drops
2. Message routing failures
3. UDP send performance issues
4. Queue buildup locations
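
For illustration, queue depth and dropped-packet tracking of the sort listed above could be kept in a few atomic counters. This is a sketch under assumptions, not the code in these commits: `QueueStats` and its methods are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

/// Hypothetical counters for one outbound queue.
#[derive(Default)]
pub struct QueueStats {
    depth: AtomicUsize,
    max_depth: AtomicUsize,
    dropped: AtomicU64,
}

impl QueueStats {
    /// Records an enqueue and returns the new depth.
    pub fn on_enqueue(&self) -> usize {
        let depth = self.depth.fetch_add(1, Ordering::Relaxed) + 1;
        // Remember the high-water mark so backlog spikes show up in logs.
        self.max_depth.fetch_max(depth, Ordering::Relaxed);
        depth
    }

    /// Records a dequeue. Callers must pair this with an earlier enqueue,
    /// otherwise the counter wraps around.
    pub fn on_dequeue(&self) {
        self.depth.fetch_sub(1, Ordering::Relaxed);
    }

    /// Records a dropped packet and returns the running total.
    pub fn on_drop(&self) -> u64 {
        self.dropped.fetch_add(1, Ordering::Relaxed) + 1
    }

    /// Returns (current depth, maximum depth seen, total dropped).
    pub fn snapshot(&self) -> (usize, usize, u64) {
        (
            self.depth.load(Ordering::Relaxed),
            self.max_depth.load(Ordering::Relaxed),
            self.dropped.load(Ordering::Relaxed),
        )
    }
}
```

Relaxed ordering is enough here because the counters are only read for diagnostics and do not guard any other data.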
- Track SuccessfulPut message reception and generation
- Log PUT state transitions to understand completion flow
- Add debug info to trace when operations move between states
- Focus on identifying why PUT completes with false status

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit addresses critical issues preventing stable connections to remote gateways,
which have been blocking River functionality for weeks.

## Issues Fixed

### 1. Connecting Map Race Condition
**Problem**: When multiple connection attempts were made to the same gateway, subsequent
attempts would fail with "connection attempt already in progress". The error handler
would then remove the gateway from the connecting map, causing the successful connection
to fail lookup with "No connecting entry found".

**Fix**: Modified handshake error handling to NOT remove entries for duplicate connection
attempts. Only genuine failures now remove entries from the connecting map.
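
The idea behind this fix can be sketched as follows. The types below (`ConnectingMap`, `HandshakeError`) are hypothetical stand-ins, not the actual Freenet types; the point is only that a duplicate-attempt error leaves the in-flight entry in place, so the original attempt can still find it on completion.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Hypothetical error type distinguishing duplicates from real failures.
#[derive(Debug, PartialEq)]
pub enum HandshakeError {
    /// A second attempt arrived while the first is still in flight.
    AlreadyInProgress,
    /// The attempt itself failed (illustrative catch-all).
    Failed,
}

/// Hypothetical map of in-flight connection attempts, keyed by gateway address.
#[derive(Default)]
pub struct ConnectingMap {
    pending: HashMap<SocketAddr, u64>, // value: id of the in-flight attempt
}

impl ConnectingMap {
    /// Registers an attempt, rejecting duplicates for the same address.
    pub fn begin(&mut self, addr: SocketAddr, attempt_id: u64) -> Result<(), HandshakeError> {
        if self.pending.contains_key(&addr) {
            return Err(HandshakeError::AlreadyInProgress);
        }
        self.pending.insert(addr, attempt_id);
        Ok(())
    }

    /// Error handler: only genuine failures remove the entry.
    pub fn on_error(&mut self, addr: &SocketAddr, err: &HandshakeError) {
        match err {
            // The original attempt still owns the entry; leave it alone.
            HandshakeError::AlreadyInProgress => {}
            HandshakeError::Failed => {
                self.pending.remove(addr);
            }
        }
    }

    /// Called when a handshake succeeds; returns the attempt id if found.
    pub fn complete(&mut self, addr: &SocketAddr) -> Option<u64> {
        self.pending.remove(addr)
    }
}
```

With the pre-fix behavior, `on_error` would have removed the entry in both arms, and the later `complete` call would have returned `None`.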

### 2. Gateway Channel Buffer Overflow
**Problem**: The new_connection_notifier channel had a buffer of only 10 and used
blocking send(). Once 10 connections were established, the entire UDP packet processing
loop would block, preventing all packet processing including keep-alives.

**Fix**:
- Increased buffer size from 10 to 1000 for gateways
- Changed from blocking send() to non-blocking try_send()
- Added logging to detect when channel is full
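
A minimal sketch of the non-blocking notification pattern, using the standard library's bounded channel so the example stays dependency-free. This is an assumption for illustration: the actual code presumably uses an async channel, and `notify_new_connection`, `Notify`, and `GATEWAY_NOTIFIER_CAPACITY` are hypothetical names.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Buffer size for gateways, per the fix above; the constant name is hypothetical.
pub const GATEWAY_NOTIFIER_CAPACITY: usize = 1000;

/// Outcome of a notification attempt.
#[derive(Debug, PartialEq)]
pub enum Notify {
    Delivered,
    DroppedFull,
    ReceiverGone,
}

/// Tries to hand a new connection to the consumer without ever blocking
/// the caller (the packet-processing loop in this scenario).
pub fn notify_new_connection<T>(tx: &SyncSender<T>, conn: T) -> Notify {
    match tx.try_send(conn) {
        Ok(()) => Notify::Delivered,
        Err(TrySendError::Full(_)) => {
            // Assumption: stderr stands in for the project's real logging.
            eprintln!("WARN: new_connection_notifier channel is full, dropping notification");
            Notify::DroppedFull
        }
        Err(TrySendError::Disconnected(_)) => Notify::ReceiverGone,
    }
}
```

The trade-off is that a full channel now drops the notification instead of stalling packet processing, which is why the larger buffer and the full-channel log line go together with the `try_send()` change.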

## Current Status

With these fixes, connections now successfully establish and the connecting map race
condition is resolved. However, a different issue persists:

**Remaining Issue**: Remote gateways (specifically ziggy/Raspberry Pi) stop responding
to keep-alive packets after ~20 seconds, causing connections to time out at 30 seconds.

Pattern observed:
- 0-10s: Keep-alive sent & response received ✓
- 10-20s: Keep-alive sent & response received ✓
- 20-30s: Keep-alive sent but NO response ✗
- 30s: Connection timeout (as designed)

This appears to be a gateway-side issue where packet processing stops after 20 seconds.

## Next Steps

1. **Investigate Gateway-Side Issues**: Need to understand why remote gateways stop
   processing packets. Possible causes:
   - Thread starvation on Raspberry Pi
   - Resource exhaustion
   - Another blocking operation on gateway side

2. **Local Gateway Testing**: Set up a local gateway to reproduce and debug the issue
   in a controlled environment.

3. **Additional Instrumentation**: Add more detailed logging on the gateway side to
   identify where packet processing stalls.

Note: These fixes improve the situation significantly but don't fully resolve the
River invitation hang issue, which requires fixing the gateway keep-alive problem.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@sanity
Collaborator Author

sanity commented Jul 16, 2025

Closing old debugging PR for River contract timeouts. If this issue resurfaces, we can create a new PR with fresh investigation.

@sanity sanity closed this Jul 16, 2025