yuyu1815
Owner

No description provided.

- Removed all files in deepwiki_to_md/ directory except setup.py
- Deleted Selenium-dependent chat functionality
- Removed old scraping strategies and utilities
- Preserved setup.py, requirements.txt, tests/, and documentation
- Preparing for new Next.js-based architecture implementation
- Created new modular directory structure:
  - core/ (core processing layer)
  - strategies/ (data acquisition strategy layer)
  - extractors/ (data extraction system)
  - processors/ (processing and output layer)
  - utils/ (utility functions)
- Added __init__.py files to all modules with proper documentation
- Created config/default.yaml with comprehensive configuration
- Updated requirements.txt to remove Selenium dependencies
- Ready for Phase 1: Core functionality implementation
✅ Implemented core features:
- NextJSExtractor: __NEXT_DATA__ parsing and Markdown extraction
- JSONPathNavigator: Advanced JSON path navigation and content search
- UnifiedErrorHandler: Retry strategies and error management system

🧪 Comprehensive testing:
- All modules tested with mock data
- JSON path resolution verified
- Error handling and retry logic confirmed
- Markdown validation and content extraction working

📊 Test Results:
- JSONPathNavigator: ✅ Path parsing, content search, fallback extraction
- UnifiedErrorHandler: ✅ Success cases, retry mechanisms
- NextJSExtractor: ✅ HTML parsing, content extraction, validation
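
The `__NEXT_DATA__` parsing and JSON path navigation described above could look roughly like the sketch below. It is not the PR's implementation: `NextDataParser`, `extract_next_data`, `get_path`, and the `props.pageProps.markdown` path used in the usage note are hypothetical names chosen for illustration, and only the standard library is used.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ payload, or None if the tag is absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    return json.loads(raw) if raw else None


def get_path(data, path, default=None):
    """Follow a dotted path such as 'a.b.0.c' through dicts and lists."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]
        else:
            return default
    return current
```

With a page that embeds its content this way, `get_path(extract_next_data(html), "props.pageProps.markdown")` would return the Markdown string, and a missing key falls back to `default` instead of raising.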

Ready for Phase 2: Strategy system implementation
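- Removed all implemented modules, configuration files, and utilities from `deepwiki_to_md/`.
- Removed `debug_vscode_data.py` and its dependencies.
- Cleared all existing functionality in preparation for the new architecture. Ready for Phase 3 implementation.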
@yuyu1815
Owner Author

/gemini review

Contributor

Summary of Changes

Hello @yuyu1815, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a comprehensive re-architecture of the deepwiki-to-md project, transforming it into a more robust, maintainable, and dependency-minimal content extraction solution. The changes streamline the core logic, modernize the build process, and enhance the user experience through a unified command-line interface and improved documentation. This foundational update prepares the project for future extensibility and easier collaboration.

Highlights

  • Core Re-architecture: The entire content extraction logic has been rewritten using a flexible Strategy Pattern, making it highly modular and extensible for various content sources.
  • Zero External Dependencies: The core library now aims for zero external dependencies, relying primarily on Python's standard library for HTTP and parsing, with optional features using lazy-loaded external libraries.
  • Modern Python Packaging: The project has transitioned to pyproject.toml for package metadata and build configuration, replacing the older setup.py and requirements.txt.
  • Enhanced CLI & Features: A new unified command-line interface (deepwiki-to-md) has been introduced, integrating content extraction, repository search, and Devin API chat functionalities.
  • Comprehensive Documentation: New, detailed development guidelines and library usage documentation have been added in both English and Japanese to improve maintainability and onboarding.
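
As a rough illustration of the Strategy pattern and lazy loading mentioned in the highlights, the sketch below tries strategies in order and imports optional modules only on demand. All class and function names here are hypothetical and are not taken from the PR's source.

```python
import importlib
from abc import ABC, abstractmethod


class ExtractionStrategy(ABC):
    """One way of turning fetched content into Markdown (hypothetical interface)."""

    @abstractmethod
    def can_handle(self, html):
        ...

    @abstractmethod
    def extract(self, html):
        ...


class NextDataStrategy(ExtractionStrategy):
    def can_handle(self, html):
        return "__NEXT_DATA__" in html

    def extract(self, html):
        # Placeholder: a real strategy would parse the embedded JSON here.
        return "markdown from __NEXT_DATA__"


class PlainTextStrategy(ExtractionStrategy):
    def can_handle(self, html):
        return True  # last-resort fallback

    def extract(self, html):
        return html.strip()


def extract(html, strategies):
    """Use the first strategy that claims it can handle the content."""
    for strategy in strategies:
        if strategy.can_handle(html):
            return strategy.extract(html)
    raise ValueError("no strategy could handle this content")


def load_optional(module_name):
    """Import an optional dependency only when a feature needs it."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None
```

New content sources can then be supported by adding another `ExtractionStrategy` subclass to the list, without touching the dispatch code.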
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/publish.yml


socket-security bot commented Sep 30, 2025

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

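| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Added | setuptools@80.9.0 | 74 | 100 | 100 | 100 | 70 |
| Added | mypy@1.18.2 | 82 | 100 | 100 | 100 | 100 |
| Added | pytest@8.4.2 | 90 | 100 | 100 | 100 | 100 |
| Added | mkdocs@1.6.1 | 92 | 100 | 100 | 100 | 100 |
| Added | chat@1.0.7.dev136 | 92 | 100 | 100 | 100 | 100 |
| Added | black@25.9.0 | 98 | 100 | 100 | 100 | 100 |
| Added | isort@6.0.1 | 98 | 100 | 100 | 100 | 100 |
| Added | wheel@0.45.1 | 98 | 100 | 100 | 100 | 100 |
| Added | flake8@7.3.0 | 100 | 100 | 100 | 100 | 100 |
| Added | deepwiki-to-md@2.0.3 | 100 | 100 | 100 | 100 | 100 |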


Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a major and impressive refactoring of the entire project. It modernizes the packaging by adopting pyproject.toml, eliminates external dependencies for the core library, flattens the source structure into src/, and significantly improves the documentation and developer tooling (like Makefile and example scripts). The overall changes are excellent and greatly enhance the project's maintainability and usability. I've found a few issues, including a critical bug in the CLI argument parsing and some minor maintainability and documentation points, which I've detailed in the comments.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a major and impressive refactoring of the entire project. It modernizes the packaging by adopting pyproject.toml and a src layout, removes legacy files and unused dependencies, and introduces a much cleaner, more extensible architecture based on the Strategy pattern. The documentation has also been significantly improved and restructured. Overall, these are excellent changes that greatly enhance the project's maintainability and clarity. I have a few suggestions, including one critical security fix, to further polish this great work.

…ore`, and bump version to 2.0.3

- Introduced `--chat` mode for DeepWiki-to-CLI interactions with additional options (`--deep-research`, `--devlog`, etc.).
- Improved input validation and handling for CLI commands, including URL parsing and fallbacks.
- Appended `.gitignore` entries for logs, `.pypirc`, and cache directories like `.ruff_cache/`.
- Updated `README.md` and `README_JP.md` to include chat usage examples.
- Added `config.json` to project data files for distribution.
- Adjusted Makefile for streamlined PyPI uploads.
- Skipped detailed `<details>` and `<summary>` blocks during content extraction to improve Markdown clarity.
- Incremented version to 2.0.3 to reflect new functionality.
@yuyu1815
Owner Author

yuyu1815 commented Oct 1, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a major and impressive refactoring of the entire project. It successfully transitions the codebase from a complex, multi-dependency package to a streamlined, zero-dependency library with a much clearer focus and structure. The adoption of pyproject.toml for packaging, the clean separation of concerns into modules like cli.py, chat.py, and deepwiki_to_md.py, and the use of a Strategy pattern for content extraction are all excellent improvements that significantly enhance maintainability and extensibility. The documentation has also been massively overhauled, with new English and Japanese guides that are well-structured and informative.

My review focuses on a few critical points to ensure the new structure is robust:

  • A syntax error in the new Makefile that would prevent it from running.
  • A potential silent failure in the HTTP client when handling Brotli-compressed content, which also conflicts with the zero-dependency goal.
  • The presence of a redundant and conflicting setup.py file alongside the new pyproject.toml.

Addressing these points will help solidify this excellent refactoring effort.
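
To make the Brotli point concrete: a standard-library-only client can decode `gzip` and `deflate`, but not `br`. The sketch below (with the hypothetical names `decode_body` and `SAFE_ACCEPT_ENCODING`, not the PR's actual HTTP client) fails loudly on an unsupported encoding instead of returning undecoded bytes, and advertises only what it can decode.

```python
import gzip
import zlib

# Advertise only encodings the standard library can decode (no "br").
SAFE_ACCEPT_ENCODING = "gzip, deflate"


def decode_body(body, content_encoding):
    """Decode a response body according to its Content-Encoding header."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding in ("", "identity"):
        return body
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)  # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(body, -zlib.MAX_WBITS)  # raw deflate
    # Raising here avoids silently handing compressed bytes to the parser.
    raise ValueError(f"unsupported Content-Encoding: {encoding!r}")
```

If Brotli support is wanted, it would have to come from an optional third-party module imported lazily, which is a design decision the review leaves to the author.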

@yuyu1815
Owner Author

yuyu1815 commented Oct 1, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a major and impressive refactoring of the entire project. It modernizes the packaging from setup.py to pyproject.toml, flattens the project structure into a src layout, and completely rewrites the core scraping logic to be zero-dependency and strategy-based. The documentation has also been significantly improved and reorganized. Overall, these are excellent changes that improve maintainability and usability. I've found a few issues, including a critical one in the Makefile and a bug in a README.md example, along with some suggestions for improving consistency and clarity.

@yuyu1815
Owner Author

yuyu1815 commented Oct 1, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a major and impressive refactoring of the entire project. It modernizes the packaging from setup.py and requirements.txt to a pyproject.toml-based setup. The core logic has been completely rewritten with a clean, extensible, and zero-dependency architecture using a Strategy pattern for content extraction. The old, complex, and scattered scraper implementations have been replaced with a single, coherent deepwiki_to_md.py module. The CLI has also been streamlined. Documentation has been significantly improved with new README.md files in English and Japanese, and detailed development guidelines. The addition of example scripts in wiki_tests/ is also a great help for developers.

My review has found a few minor issues: a critical syntax error in the new Makefile, a small configuration error in pyproject.toml for code coverage, and a misleading function name in the new chat.py module. Once these are addressed, this will be an excellent update.

@yuyu1815
Owner Author

yuyu1815 commented Oct 1, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and impressive refactoring of the project. It modernizes the structure by adopting pyproject.toml and a src layout, which is a great move for maintainability. The codebase has been thoughtfully modularized, and the shift to zero core dependencies with lazy imports for optional features is a solid design choice. The documentation has also been substantially improved and reorganized. My review focuses on potential improvements in the new modules, configuration clarity, and documentation consistency.

…I formatting in `cli.py`.

- Removed caller-relative path fallback in favor of CWD-based resolution in `chat.py`.
- Extracted search results formatting into `format_search_results_devlog` function for better modularity in `cli.py`.
- Updated Markdown section splitting in `deepwiki_to_md.py` with state machine-based parsing and dedicated filtering logic for improved readability and reusability.
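
A state machine-based section splitter of the kind described in the last item might look like the sketch below. It tracks a single piece of state (inside a fenced code block or not) so that `#` lines inside code are not mistaken for headings. `split_sections` is a hypothetical name, and the actual parsing in `deepwiki_to_md.py` may differ.

```python
FENCE = "`" * 3  # the Markdown code-fence marker


def split_sections(markdown):
    """Split Markdown into (heading, body) pairs, ignoring '#' inside code fences."""
    sections = []
    heading, body = None, []
    in_fence = False  # state: are we inside a fenced code block?

    def flush():
        # Emit the current section unless it is an empty, heading-less preamble.
        if heading is not None or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle state on every fence line
            body.append(line)
        elif not in_fence and line.startswith("#"):
            flush()
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)

    flush()
    return sections
```

Filtering (for example, dropping sections by heading) can then operate on the returned pairs rather than on raw lines.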
@yuyu1815
Owner Author

yuyu1815 commented Oct 1, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request represents a significant and impressive refactoring of the entire project. The move to a modern pyproject.toml-based packaging system, the adoption of a src layout, and the redesign of the core logic using a Strategy pattern are all excellent changes that greatly improve maintainability and code quality. The documentation has also been substantially rewritten and is much clearer.

My review focuses on a few areas to further polish this new structure, including improving the packaging of data files, clarifying developer documentation, and increasing the robustness of some of the new modules. Overall, this is a fantastic update.

Comment on lines +248 to +256
    # If config is None, load src/config.json (= next to this file)
    if config is None:
        from pathlib import Path
        default_path = str((Path(__file__).parent / "config.json").resolve())
        loaded = load_config(default_path)
        if not loaded:
            raise RuntimeError(f"Default config not found or failed to load at '{default_path}'")
        config = loaded

Contributor

high

The fallback logic for loading the default config.json relies on a hardcoded relative path from the __file__ location. This is brittle and can break if the package is installed in certain ways (e.g., as a zip archive or with some package managers).

A more robust approach is to embed the default configuration as a multi-line string within this module and parse it with json.loads(). This removes any dependency on the filesystem layout of the installed package.

    if config is None:
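        # Embed default config as a string to avoid filesystem issues.
        # Use a raw string: the JSON contains \" escapes, which a normal
        # Python string would collapse into bare quotes and break json.loads().
        default_config_str = r"""{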
    "headers": {
        "Host": "api.devin.ai",
        "Sec-Ch-Ua-Platform": "\"macOS\"",
        "Accept-Language": "ja",
        "Sec-Ch-Ua": "\"Chromium\";v=\"139\", \"Not;A=Brand\";v=\"99\"",
        "Content-Type": "application/json",
        "Sec-Ch-Ua-Mobile": "?0",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36",
        "Accept": "*/*",
        "Origin": "https://deepwiki.com",
        "Sec-Fetch-Site": "cross-site",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Dest": "empty",
        "Referer": "https://deepwiki.com/",
        "Accept-Encoding": "gzip, deflate, br",
        "Priority": "u=4, i"
    },
    "body_template": {
        "engine_id": "multihop",
        "keywords": [],
        "additional_context": "",
        "use_notes": false,
        "generate_summary": false
    }
}"""
        try:
            config = json.loads(default_config_str)
        except json.JSONDecodeError as exc:
            raise RuntimeError("Default config is malformed and could not be loaded.") from exc

Comment on lines +1 to +32
{
    "headers": {
        "Host": "api.devin.ai",
        "Content-Length": "349",
        "Sec-Ch-Ua-Platform": "\"macOS\"",
        "Accept-Language": "ja",
        "Sec-Ch-Ua": "\"Chromium\";v=\"139\", \"Not;A=Brand\";v=\"99\"",
        "Content-Type": "application/json",
        "Sec-Ch-Ua-Mobile": "?0",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36",
        "Accept": "*/*",
        "Origin": "https://deepwiki.com",
        "Sec-Fetch-Site": "cross-site",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Dest": "empty",
        "Referer": "https://deepwiki.com/",
        "Accept-Encoding": "gzip, deflate, br",
        "Priority": "u=4, i"
    },
    "body_template": {
        "engine_id": "multihop",
        "user_query": "<relevant_context>This query was sent from the wiki page: VS Code Architecture Overview.</relevant_context>pluginの仕組みを教えて",
        "keywords": [],
        "repo_names": [
            "microsoft/vscode"
        ],
        "additional_context": "",
        "query_id": "plugin_3b6b0fb5-14ff-4d76-a8b9-758425e82f4d",
        "use_notes": false,
        "generate_summary": false
    }
}
\ No newline at end of file
Contributor

high

This configuration file contains hardcoded values that should be dynamic or removed.

  1. "Content-Length": "349": This header should be removed. HTTP client libraries like requests calculate this automatically. A hardcoded value will be incorrect if the request body changes, causing requests to fail.
  2. The user_query and query_id in the body_template are specific to one example. While the code in chat.py correctly overwrites these, it's confusing and bad practice to have such specific values in a template. They should be removed to make it clear they are dynamically generated.
{
    "headers": {
        "Host": "api.devin.ai",
        "Sec-Ch-Ua-Platform": "\"macOS\"",
        "Accept-Language": "ja",
        "Sec-Ch-Ua": "\"Chromium\";v=\"139\", \"Not;A=Brand\";v=\"99\"",
        "Content-Type": "application/json",
        "Sec-Ch-Ua-Mobile": "?0",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36",
        "Accept": "*/*",
        "Origin": "https://deepwiki.com",
        "Sec-Fetch-Site": "cross-site",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Dest": "empty",
        "Referer": "https://deepwiki.com/",
        "Accept-Encoding": "gzip, deflate, br",
        "Priority": "u=4, i"
    },
    "body_template": {
        "engine_id": "multihop",
        "keywords": [],
        "repo_names": [],
        "additional_context": "",
        "use_notes": false,
        "generate_summary": false
    }
}
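
One way to keep the template generic, sketched here with only the standard library, is to fill in the per-request fields at call time and drop any hardcoded `Content-Length` so that `urllib` computes it when the request is sent. `build_request`, the `query_` id prefix, and the URL in the usage note are illustrative assumptions, not the code in `chat.py`.

```python
import copy
import json
import urllib.request
import uuid


def build_request(url, config, user_query, repo_name):
    """Build a POST request from a config template without mutating the template."""
    body = copy.deepcopy(config["body_template"])
    body["user_query"] = user_query
    body["repo_names"] = [repo_name]
    body["query_id"] = f"query_{uuid.uuid4()}"  # assumed id format, for illustration

    data = json.dumps(body).encode("utf-8")

    # Never forward a hardcoded Content-Length; urllib sets it from len(data).
    headers = {
        name: value
        for name, value in config["headers"].items()
        if name.lower() != "content-length"
    }
    return urllib.request.Request(url, data=data, headers=headers, method="POST")
```

For example, `build_request("https://example.invalid/query", config, "How do plugins work?", "microsoft/vscode")` returns a request whose body carries the dynamic fields, while `config` itself stays unchanged and reusable.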

Comment on lines +71 to +74
[tool.setuptools.data-files]
"share/deepwiki_to_md" = [
    "src/config.json",
]
Contributor

medium

Using [tool.setuptools.data-files] to install src/config.json into share/deepwiki_to_md is not recommended. This approach can pollute the user's global Python environment (sys.prefix) and makes the file difficult for the package to locate reliably.

A more robust practice for packaging default data is to include it within the package itself and access it using importlib.resources. Given the current flat module structure, the simplest solution would be to embed the default configuration as a multi-line string directly within the src/chat.py module. This would remove the need for this data-files entry and make the default config loading more reliable.
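
For reference, the `importlib.resources` alternative mentioned above would look roughly like this. It assumes the modules live in a real importable package that ships `config.json` as package data (which is not the current flat layout), and `load_packaged_config` is a hypothetical helper name. `importlib.resources.files` requires Python 3.9 or later.

```python
import json
from importlib import resources


def load_packaged_config(package, filename="config.json"):
    """Read a JSON file bundled inside an installed package.

    Works regardless of where the package is installed, because the lookup
    goes through the import system rather than a path built from __file__.
    """
    text = resources.files(package).joinpath(filename).read_text(encoding="utf-8")
    return json.loads(text)
```

With the flat module structure the reviewer describes, embedding the default as a string remains the simpler option; this sketch only applies if the code is reorganized into a package.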
