
fix(web): batch findMany queries fixes #360 #361


Draft · wants to merge 1 commit into main

Conversation


@24601 24601 commented Jun 20, 2025

Fix: Implement pagination for unbounded database queries to prevent memory issues

Problem

In large installations with thousands of repositories, Sourcebot becomes unresponsive during large queries due to unbounded prisma.repo.findMany() calls. This causes the error:

Failed to convert rust String into napi string
PrismaClientKnownRequestError:
Invalid prisma.repo.findMany() invocation:

This is caused by a known Prisma issue (#13864) where loading large datasets into memory exhausts available resources.

NOTE: While this fix addresses the failure on large fetches and lets Sourcebot continue to operate (a significant improvement), the web container still suffers from other limitations. For example, code that consumes the results of this now-working findMany can run out of heap, so other call sites are either over-selecting or also not batching/paginating/using cursors. That points to a broader architectural issue in the UI, which I consider a separate issue. I was able to work around it with NODE_OPTIONS="--max-old-space-size=16000", but heap usage will likely scale roughly linearly with dataset size, so this is something that should be optimized.

I am keeping this PR in draft because, now that this is fixed, there are further problems well beyond the heap that cause freezes, etc. I will fix those to get things to a usable state and then move the PR out of draft.

Solution

This PR implements pagination for all repository database queries to prevent memory exhaustion. The implementation includes:

1. Batched Query Functions

  • Created packages/web/src/lib/repoBatchQueries.ts with specialized batch query functions:
    • batchedFindReposByIds() - Fetches repos by IDs in configurable batches
    • batchedFindReposByNames() - Fetches repos by names in configurable batches
    • batchedGetAllRepos() - Fetches all repos with pagination support
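The batching pattern behind these helpers can be sketched roughly as follows. This is a simplified, synchronous mock: the real functions in repoBatchQueries.ts await prisma.repo.findMany() for each batch and also filter by orgId; the names and signatures below are illustrative, not the actual API.

```typescript
// Hypothetical sketch of the batched-lookup pattern used by batchedFindReposByIds().
interface Repo {
    id: number;
    name: string;
}

const BATCH_SIZE = 500; // mirrors the DB_QUERY_BATCH_SIZE default

// Split a list of IDs into chunks no larger than batchSize.
function chunk<T>(items: T[], batchSize: number): T[][] {
    const batches: T[][] = [];
    for (let i = 0; i < items.length; i += batchSize) {
        batches.push(items.slice(i, i + batchSize));
    }
    return batches;
}

// Fetch repos batch-by-batch instead of one unbounded findMany().
function batchedFindReposByIdsSketch(
    ids: number[],
    findMany: (idBatch: number[]) => Repo[], // stands in for prisma.repo.findMany
    batchSize: number = BATCH_SIZE,
): Repo[] {
    if (ids.length === 0) {
        return []; // short-circuit: no query for an empty ID list
    }
    const results: Repo[] = [];
    for (const idBatch of chunk(ids, batchSize)) {
        // Each query touches at most batchSize rows, keeping memory bounded.
        results.push(...findMany(idBatch));
    }
    return results;
}
```

With a batch size of 2 and three IDs, the helper issues two queries and accumulates all rows, rather than materializing the whole result set in one call.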

2. Configurable Batch Size

The batch size is now configurable through two methods:

Environment Variable:

DB_QUERY_BATCH_SIZE=1000  # Default: 500, Range: 100-10000

Configuration File (config.json):

{
  "settings": {
    "dbQueryBatchSize": 1000
  }
}

3. Updated Queries

  • Replaced unbounded findMany() calls in searchApi.ts with batched queries
  • Updated getRepos() in actions.ts to use pagination with progress tracking
  • All queries now process data in chunks to maintain constant memory usage
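The skip/take loop used for pagination can be sketched as follows. This is a simplified, synchronous mock of the pattern; the actual code in getRepos() awaits prisma.repo.findMany({ where, skip, take, include }) on each iteration, and the function names here are illustrative.

```typescript
// Hypothetical sketch of skip/take pagination over a mock table.
interface Row {
    id: number;
}

// Stands in for prisma.repo.findMany({ skip, take }).
function findManyPage(table: Row[], skip: number, take: number): Row[] {
    return table.slice(skip, skip + take);
}

// Accumulate all rows in fixed-size pages so each query stays bounded.
function getAllRowsPaginated(table: Row[], batchSize: number): Row[] {
    const results: Row[] = [];
    let skip = 0;
    while (true) {
        const page = findManyPage(table, skip, batchSize);
        results.push(...page);
        // A short page means we have reached the end of the table.
        if (page.length < batchSize) {
            break;
        }
        skip += batchSize;
    }
    return results;
}
```

Note that the accumulated result still grows with the dataset; the batching bounds the size of each individual query, which is what avoids the Prisma napi-string error, but (as noted above) downstream consumers of the full result can still exhaust the heap.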

4. Documentation

  • Added DB_QUERY_BATCH_SIZE to environment variables documentation
  • Updated schema documentation with the new dbQueryBatchSize setting
  • Added detailed comments explaining the Prisma workaround

Testing

  • All existing tests pass
  • Manual testing with large datasets confirms memory usage remains stable
  • The default batch size of 500 provides a good balance between performance and memory usage
  • Tested against the original environment that prompted [bug] Failed to convert rust String into napi string on "large" installs #360; the fix works against that large, real dataset (see the note above about the other, separate issues this uncovers in the web app's design).

Breaking Changes

None. The changes are backward compatible with a sensible default value.

Notes

  • This is a workaround for Prisma issue #13864
  • The batch size can be tuned based on available memory and dataset size
  • Future versions can remove this workaround once Prisma fixes the underlying issue, although it may be more performant/efficient to keep batched/cursor-based querying regardless.

Fixes #360

Summary by CodeRabbit

  • New Features
    • Introduced a configurable batch size setting for database queries to improve performance and reliability when handling large datasets.
    • Added a new environment variable and configuration option (dbQueryBatchSize) with a default value of 500, adjustable between 100 and 10,000.
  • Documentation
    • Updated documentation and schema references to include the new batch size configuration option.
  • Bug Fixes
    • Improved database query handling to prevent memory issues with large datasets by batching queries.
  • Chores
    • Added utility functions for batched database queries.


coderabbitai bot commented Jun 20, 2025

Walkthrough

This change introduces a configurable batch size for database queries to address memory issues with large datasets, specifically as a workaround for a known Prisma bug. It adds the dbQueryBatchSize setting to configuration schemas, environment variables, and backend constants, and refactors repository-fetching logic to use batched queries.

Changes

File(s) Change Summary
docs/docs/configuration/environment-variables.mdx Documents new DB_QUERY_BATCH_SIZE environment variable.
docs/snippets/schemas/v3/index.schema.mdx
schemas/v3/index.json
Adds dbQueryBatchSize property to configuration JSON schema with constraints and description.
packages/schemas/src/v3/index.schema.ts
packages/schemas/src/v3/index.type.ts
Adds dbQueryBatchSize to TypeScript schema and types for settings.
packages/backend/src/constants.ts Adds dbQueryBatchSize with default value to backend default settings.
packages/web/src/env.mjs Adds DB_QUERY_BATCH_SIZE to server environment variables with default and type.
packages/web/src/lib/repoBatchQueries.ts Introduces new utility functions for batched repo queries by IDs and names.
packages/web/src/actions.ts Refactors repo fetching logic to use batched queries instead of single large queries.
packages/web/src/features/search/searchApi.ts Refactors repo fetching in search to use new batched query utilities.
packages/web/src/auth.ts Renames a parameter in an event handler (no functional impact).

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant WebServer
    participant RepoBatchQueries
    participant Database

    Client->>WebServer: Request repositories (large dataset)
    WebServer->>RepoBatchQueries: batchedFindReposByIds/names(...)
    loop For each batch
        RepoBatchQueries->>Database: Query batch (size = dbQueryBatchSize)
        Database-->>RepoBatchQueries: Return batch results
    end
    RepoBatchQueries-->>WebServer: Combined results
    WebServer-->>Client: Aggregated repository data

Assessment against linked issues

Objective Addressed Explanation
Refactor unbounded prisma.repo.findMany() calls to use batching/pagination for large datasets (#360)
Make batch size for queries configurable via environment/config (#360)
Document new configuration/environment variable for batch size (#360)

Assessment against linked issues: Out-of-scope changes

Code Change (file_path) Explanation
Parameter rename from account to _account (packages/web/src/auth.ts) This is a minor refactor unrelated to batching or Prisma memory issues.

Suggested reviewers

  • msukkari

Poem

In the warren where code does hop,
Batches now bound, no memory flop!
Prisma’s bug, we leap right past—
Querying repos, no longer aghast.
With settings to tweak, and docs to guide,
The rabbit’s code runs smooth and wide.
🐇✨


@24601 24601 marked this pull request as ready for review June 20, 2025 21:27

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
packages/web/src/auth.ts (1)

144-144: Consider simplifying handler signature by removing unused account parameter
The underscore-prefixed _account correctly suppresses unused-variable linting. For clarity, you could destructure only { user } since account isn’t used in this handler.

docs/docs/configuration/environment-variables.mdx (1)

23-23: Clarify passive sentence in description.

Consider adding a subject:

It can also be configured via the dbQueryBatchSize setting in the configuration file.
This will form a complete sentence.

packages/web/src/lib/repoBatchQueries.ts (2)

60-87: Consistent implementation with good API design.

This function properly mirrors batchedFindReposByIds with the same secure and efficient batching approach. The consistency between the two functions provides a clean, predictable API.

Both functions could potentially be refactored to reduce duplication by extracting a generic batching function, but the current approach is clear and maintainable.


36-47: Consider error handling and execution strategy enhancements.

While the current sequential execution is safe for memory management, consider these potential enhancements for future iterations:

  1. Error resilience: Currently, if any batch fails, the entire operation fails. Consider whether partial results or retry logic would be beneficial.

  2. Include support: The functions don't support Prisma's include option, limiting reusability compared to the batched approach in actions.ts.

  3. Parallel execution option: For deployments with adequate memory, a configurable parallel execution mode could improve performance.

The current implementation is solid and appropriate for the immediate need.

Also applies to: 72-83
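A configurable parallel execution mode like the one suggested here could look roughly like this. This is hypothetical and not part of this PR, which intentionally runs batches sequentially to bound memory; all names below are illustrative.

```typescript
// Hypothetical sketch of a parallel batch-execution mode.
// Issues all batch queries concurrently, trading memory for latency.
async function batchedParallel<T>(
    ids: number[],
    batchSize: number,
    fetchBatch: (batch: number[]) => Promise<T[]>, // stands in for an async Prisma query
): Promise<T[]> {
    const batches: number[][] = [];
    for (let i = 0; i < ids.length; i += batchSize) {
        batches.push(ids.slice(i, i + batchSize));
    }
    // Promise.all preserves batch order, so results come back deterministic.
    const results = await Promise.all(batches.map(fetchBatch));
    return results.flat();
}
```

In practice this would want a concurrency cap (e.g. only N batches in flight at once); unbounded Promise.all over many batches would reintroduce the memory pressure the sequential design avoids.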

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb2ef05 and 87a92ba.

📒 Files selected for processing (11)
  • docs/docs/configuration/environment-variables.mdx (1 hunks)
  • docs/snippets/schemas/v3/index.schema.mdx (2 hunks)
  • packages/backend/src/constants.ts (1 hunks)
  • packages/schemas/src/v3/index.schema.ts (2 hunks)
  • packages/schemas/src/v3/index.type.ts (1 hunks)
  • packages/web/src/actions.ts (1 hunks)
  • packages/web/src/auth.ts (1 hunks)
  • packages/web/src/env.mjs (1 hunks)
  • packages/web/src/features/search/searchApi.ts (2 hunks)
  • packages/web/src/lib/repoBatchQueries.ts (1 hunks)
  • schemas/v3/index.json (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/docs/configuration/environment-variables.mdx

[style] ~23-~23: To form a complete sentence, be sure to include a subject.
Context: ...github.com/prisma/prisma/issues/13864). Can also be configured via the `dbQueryBatc...

(MISSING_IT_THERE)

🔇 Additional comments (13)
packages/backend/src/constants.ts (1)

19-19: Confirm default batch size aligns with schema constraints.

The default 500 matches the 100–10000 range defined in the JSON schema. Looks good.

packages/web/src/env.mjs (1)

22-24: Validate environment variable default.

The new DB_QUERY_BATCH_SIZE is correctly added with a default of 500. Ensure this is destructured in experimental__runtimeEnv if client access is required in the future.

schemas/v3/index.json (1)

69-76: JSON schema addition looks correct.

The dbQueryBatchSize property is defined with the proper type, description, and constraints (100–10000, default 500). Good job.

packages/web/src/features/search/searchApi.ts (2)

15-15: LGTM: Import statement correctly brings in the batched query functions.

The import properly references the new batched query utilities that implement the pagination solution for the Prisma memory issue.


202-214: Excellent implementation of batched queries to resolve memory exhaustion.

The code properly addresses the core issue by:

  1. Type-safe separation: Correctly filters repoIdentifiers into numericIds and stringNames arrays
  2. Conditional querying: Only executes database queries when the respective arrays contain data, avoiding unnecessary operations
  3. Memory efficiency: Replaces unbounded findMany() calls with batched operations that prevent the "Failed to convert rust String into napi string" error
  4. Data consistency: Maintains the same Map<string | number, Repo> structure, ensuring downstream code continues to work unchanged

This implementation directly addresses Prisma issue #13864 while maintaining functional compatibility.

docs/snippets/schemas/v3/index.schema.mdx (2)

72-77: Well-designed configuration property with appropriate constraints.

The dbQueryBatchSize property is properly designed with:

  • Sensible bounds: Minimum of 100 prevents inefficient micro-batches, maximum of 10000 prevents oversized batches
  • Reasonable default: 500 provides a good balance between memory usage and query efficiency
  • Clear documentation: Explicitly references Prisma issue #13864 and explains the purpose
  • Proper validation: Uses JSON schema constraints to enforce valid values

The configuration will allow administrators to tune batch sizes based on their specific database and memory constraints.


193-198: Consistent schema duplication maintains structural integrity.

The property is correctly duplicated in both the definitions.Settings section and the main properties.settings section, which is the expected pattern for this schema structure. This ensures the configuration is available in both contexts where the Settings type is referenced.

packages/schemas/src/v3/index.schema.ts (2)

71-77: Consistent schema definition with proper TypeScript typing.

The dbQueryBatchSize property definition matches the documentation schema perfectly, ensuring consistency between:

  • Runtime validation: The schema will properly validate configuration values
  • Type safety: TypeScript will enforce the correct property type
  • Documentation: Generated docs will accurately reflect the implementation

The constraint values (min: 100, max: 10000, default: 500) are identical to the documentation, maintaining consistency across the schema system.


192-198: Proper schema structure with consistent property duplication.

The property is correctly defined in both schema locations, maintaining the established pattern for configuration properties. This ensures the dbQueryBatchSize setting will be available wherever the Settings schema is used throughout the application.

packages/web/src/actions.ts (2)

665-705: LGTM! Well-implemented batched query solution.

The refactoring successfully addresses the Prisma memory issue while preserving all existing functionality. The implementation correctly:

  • Maintains the same filtering logic via whereClause
  • Uses standard skip/take pagination for batching
  • Preserves the include relationships for connections
  • Accumulates results properly across batches

The approach is memory-efficient and maintains backward compatibility.


685-686: Good use of configurable batch size.

Using the environment variable DB_QUERY_BATCH_SIZE makes the batch size configurable as described in the PR objectives, providing flexibility for different deployment sizes.

packages/web/src/lib/repoBatchQueries.ts (2)

1-16: Excellent documentation and module setup.

The comprehensive documentation clearly explains the purpose, references the specific Prisma issue, and documents configuration options. The imports and constant setup are appropriate.


24-51: Solid implementation with good security practices.

The function correctly implements batching with proper edge case handling (empty arrays) and includes essential orgId filtering for security. The sequential execution approach is appropriate for memory management, trading some performance for stability.

The batching logic using Math.ceil and slice is standard and correct.

Comment on lines +86 to +89
/**
* The batch size for database queries to prevent memory issues with large datasets. This is a workaround for Prisma issue #13864. Defaults to 500.
*/
dbQueryBatchSize?: number;

⚠️ Potential issue

Avoid manual edits in auto-generated file.

This interface is marked as auto-generated; manual modifications will be overwritten. Confirm the schema generator or codegen configuration was updated to include dbQueryBatchSize so that this change persists.

🤖 Prompt for AI Agents
In packages/schemas/src/v3/index.type.ts around lines 86 to 89, the
dbQueryBatchSize property was manually added to an auto-generated file, which
will be overwritten. To fix this, revert the manual change here and instead
update the schema generator or codegen configuration to include dbQueryBatchSize
so that it is generated automatically and persists across regenerations.


Successfully merging this pull request may close these issues.

[bug] Failed to convert rust String into napi string on "large" installs