Fix/all code refactoring #16
base: main
- Removed all files in deepwiki_to_md/ directory except setup.py
- Deleted Selenium-dependent chat functionality
- Removed old scraping strategies and utilities
- Preserved setup.py, requirements.txt, tests/, and documentation
- Preparing for new Next.js-based architecture implementation
- Created new modular directory structure:
  - core/ (core processing layer)
  - strategies/ (data acquisition strategy layer)
  - extractors/ (data extraction system)
  - processors/ (processing and output layer)
  - utils/ (utility functions)
- Added __init__.py files to all modules with proper documentation
- Created config/default.yaml with comprehensive configuration
- Updated requirements.txt to remove Selenium dependencies
- Ready for Phase 1: Core functionality implementation
✅ Implemented core features:
- NextJSExtractor: __NEXT_DATA__ parsing and Markdown extraction
- JSONPathNavigator: Advanced JSON path navigation and content search
- UnifiedErrorHandler: Retry strategies and error management system

🧪 Comprehensive testing:
- All modules tested with mock data
- JSON path resolution verified
- Error handling and retry logic confirmed
- Markdown validation and content extraction working

📊 Test Results:
- JSONPathNavigator: ✅ Path parsing, content search, fallback extraction
- UnifiedErrorHandler: ✅ Success cases, retry mechanisms
- NextJSExtractor: ✅ HTML parsing, content extraction, validation

Ready for Phase 2: Strategy system implementation
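- Removed all implemented modules, configuration files, and utilities from `deepwiki_to_md/`.
- Removed `debug_vscode_data.py` and its dependencies.
- Cleared all existing functionality in preparation for the new architecture. Ready for Phase 3 implementation.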
…r new implementation.
/gemini review
Summary of Changes
Hello @yuyu1815, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request delivers a comprehensive re-architecture of the
Highlights
Code Review
This pull request is a major and impressive refactoring of the entire project. It modernizes the packaging by adopting `pyproject.toml`, eliminates external dependencies for the core library, flattens the source structure into `src/`, and significantly improves the documentation and developer tooling (like `Makefile` and example scripts). The overall changes are excellent and greatly enhance the project's maintainability and usability. I've found a few issues, including a critical bug in the CLI argument parsing and some minor maintainability and documentation points, which I've detailed in the comments.
Code Review
This pull request is a major and impressive refactoring of the entire project. It modernizes the packaging by adopting `pyproject.toml` and a `src` layout, removes legacy files and unused dependencies, and introduces a much cleaner, more extensible architecture based on the Strategy pattern. The documentation has also been significantly improved and restructured. Overall, these are excellent changes that greatly enhance the project's maintainability and clarity. I have a few suggestions, including one critical security fix, to further polish this great work.
…ore`, and bump version to 2.0.3
- Introduced `--chat` mode for DeepWiki-to-CLI interactions with additional options (`--deep-research`, `--devlog`, etc.).
- Improved input validation and handling for CLI commands, including URL parsing and fallbacks.
- Appended `.gitignore` entries for logs, `.pypirc`, and cache directories like `.ruff_cache/`.
- Updated `README.md` and `README_JP.md` to include chat usage examples.
- Added `config.json` to project data files for distribution.
- Adjusted Makefile for streamlined PyPI uploads.
- Skipped detailed `<details>` and `<summary>` blocks during content extraction to improve Markdown clarity.
- Incremented version to 2.0.3 to reflect new functionality.
/gemini review
Code Review
This pull request is a major and impressive refactoring of the entire project. It successfully transitions the codebase from a complex, multi-dependency package to a streamlined, zero-dependency library with a much clearer focus and structure. The adoption of `pyproject.toml` for packaging, the clean separation of concerns into modules like `cli.py`, `chat.py`, and `deepwiki_to_md.py`, and the use of a Strategy pattern for content extraction are all excellent improvements that significantly enhance maintainability and extensibility. The documentation has also been massively overhauled, with new English and Japanese guides that are well-structured and informative.
My review focuses on a few critical points to ensure the new structure is robust:
- A syntax error in the new `Makefile` that would prevent it from running.
- A potential silent failure in the HTTP client when handling Brotli-compressed content, which also conflicts with the zero-dependency goal.
- The presence of a redundant and conflicting `setup.py` file alongside the new `pyproject.toml`.

Addressing these points will help solidify this excellent refactoring effort.
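To illustrate the Brotli point, here is a minimal sketch of how a standard-library-only client could avoid the silent failure. The function name `decode_body` and the constant `ACCEPT_ENCODING` are hypothetical and do not come from this pull request. The idea is to advertise only encodings the standard library can decode, and to raise an error instead of returning undecoded bytes when an unsupported encoding such as `br` arrives.

```python
import gzip
import zlib

# Hypothetical constant: advertise only encodings the standard library can decode.
ACCEPT_ENCODING = "gzip, deflate"


def decode_body(data: bytes, content_encoding: str = "") -> bytes:
    """Decode an HTTP response body using only the standard library."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "identity":
        return data
    if encoding == "gzip":
        return gzip.decompress(data)
    if encoding == "deflate":
        try:
            # Most servers send zlib-wrapped deflate data.
            return zlib.decompress(data)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            return zlib.decompress(data, -zlib.MAX_WBITS)
    # Brotli ("br") has no standard-library decoder, so fail loudly
    # rather than handing compressed bytes to the parser.
    raise ValueError(f"Unsupported Content-Encoding: {encoding!r}")
```

With this shape, a zero-dependency client would send `ACCEPT_ENCODING` in its `Accept-Encoding` header, so a well-behaved server should not reply with Brotli in the first place.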
…s, and metadata adjustments.
/gemini review
Code Review
This pull request is a major and impressive refactoring of the entire project. It modernizes the packaging from `setup.py` to `pyproject.toml`, flattens the project structure into a `src` layout, and completely rewrites the core scraping logic to be zero-dependency and strategy-based. The documentation has also been significantly improved and reorganized. Overall, these are excellent changes that improve maintainability and usability. I've found a few issues, including a critical one in the `Makefile` and a bug in a `README.md` example, along with some suggestions for improving consistency and clarity.
/gemini review
…s, and metadata adjustments.
Code Review
This pull request is a major and impressive refactoring of the entire project. It modernizes the packaging from `setup.py` and `requirements.txt` to a `pyproject.toml`-based setup. The core logic has been completely rewritten with a clean, extensible, and zero-dependency architecture using a Strategy pattern for content extraction. The old, complex, and scattered scraper implementations have been replaced with a single, coherent `deepwiki_to_md.py` module. The CLI has also been streamlined. Documentation has been significantly improved with new `README.md` files in English and Japanese, and detailed development guidelines. The addition of example scripts in `wiki_tests/` is also a great help for developers.
My review has found a few minor issues: a critical syntax error in the new `Makefile`, a small configuration error in `pyproject.toml` for code coverage, and a misleading function name in the new `chat.py` module. Once these are addressed, this will be an excellent update.
/gemini review
Code Review
This pull request introduces a significant and impressive refactoring of the project. It modernizes the structure by adopting `pyproject.toml` and a `src` layout, which is a great move for maintainability. The codebase has been thoughtfully modularized, and the shift to zero core dependencies with lazy imports for optional features is a solid design choice. The documentation has also been substantially improved and reorganized. My review focuses on potential improvements in the new modules, configuration clarity, and documentation consistency.
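For readers unfamiliar with the lazy-import pattern mentioned above, the following is a small generic sketch, not the project's actual code. The helper name `optional_import` is hypothetical. It defers importing an optional dependency until a feature needs it and reports absence with `None` rather than failing at module import time.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Import a module on demand; return None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None
```

A caller would typically check the result and raise a clear error that names the missing extra only when the optional feature is actually used.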
…I formatting in `cli.py`.
- Removed caller-relative path fallback in favor of CWD-based resolution in `chat.py`.
- Extracted search results formatting into `format_search_results_devlog` function for better modularity in `cli.py`.
- Updated Markdown section splitting in `deepwiki_to_md.py` with state machine-based parsing and dedicated filtering logic for improved readability and reusability.
/gemini review
Code Review
This pull request represents a significant and impressive refactoring of the entire project. The move to a modern `pyproject.toml`-based packaging system, the adoption of a `src` layout, and the redesign of the core logic using a Strategy pattern are all excellent changes that greatly improve maintainability and code quality. The documentation has also been substantially rewritten and is much clearer.
My review focuses on a few areas to further polish this new structure, including improving the packaging of data files, clarifying developer documentation, and increasing the robustness of some of the new modules. Overall, this is a fantastic update.
    # If config is None, load src/config.json (= next to this file)
    if config is None:
        from pathlib import Path
        default_path = str((Path(__file__).parent / "config.json").resolve())
        loaded = load_config(default_path)
        if not loaded:
            raise RuntimeError(f"Default config not found or failed to load at '{default_path}'")
        config = loaded
The fallback logic for loading the default `config.json` relies on a hardcoded relative path from the `__file__` location. This is brittle and can break if the package is installed in certain ways (e.g., as a zip archive or with some package managers).
A more robust approach is to embed the default configuration as a multi-line string within this module and parse it with `json.loads()`. This removes any dependency on the filesystem layout of the installed package.
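if config is None:
    # Embed default config as a raw string to avoid filesystem issues.
    # The r-prefix matters: in a plain literal, Python would turn \" into a bare
    # quote and the embedded JSON would no longer parse.
    default_config_str = r"""{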
"headers": {
"Host": "api.devin.ai",
"Sec-Ch-Ua-Platform": "\"macOS\"",
"Accept-Language": "ja",
"Sec-Ch-Ua": "\"Chromium\";v=\"139\", \"Not;A=Brand\";v=\"99\"",
"Content-Type": "application/json",
"Sec-Ch-Ua-Mobile": "?0",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36",
"Accept": "*/*",
"Origin": "https://deepwiki.com",
"Sec-Fetch-Site": "cross-site",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Dest": "empty",
"Referer": "https://deepwiki.com/",
"Accept-Encoding": "gzip, deflate, br",
"Priority": "u=4, i"
},
"body_template": {
"engine_id": "multihop",
"keywords": [],
"additional_context": "",
"use_notes": false,
"generate_summary": false
}
}"""
    try:
        # Requires `import json` at the top of the module.
        config = json.loads(default_config_str)
    except json.JSONDecodeError as exc:
        raise RuntimeError("Default config is malformed and could not be loaded.") from exc
{
    "headers": {
        "Host": "api.devin.ai",
        "Content-Length": "349",
        "Sec-Ch-Ua-Platform": "\"macOS\"",
        "Accept-Language": "ja",
        "Sec-Ch-Ua": "\"Chromium\";v=\"139\", \"Not;A=Brand\";v=\"99\"",
        "Content-Type": "application/json",
        "Sec-Ch-Ua-Mobile": "?0",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36",
        "Accept": "*/*",
        "Origin": "https://deepwiki.com",
        "Sec-Fetch-Site": "cross-site",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Dest": "empty",
        "Referer": "https://deepwiki.com/",
        "Accept-Encoding": "gzip, deflate, br",
        "Priority": "u=4, i"
    },
    "body_template": {
        "engine_id": "multihop",
        "user_query": "<relevant_context>This query was sent from the wiki page: VS Code Architecture Overview.</relevant_context>pluginの仕組みを教えて",
        "keywords": [],
        "repo_names": [
            "microsoft/vscode"
        ],
        "additional_context": "",
        "query_id": "plugin_3b6b0fb5-14ff-4d76-a8b9-758425e82f4d",
        "use_notes": false,
        "generate_summary": false
    }
}
No newline at end of file
This configuration file contains hardcoded values that should be dynamic or removed.
- `"Content-Length": "349"`: This header should be removed. HTTP client libraries like `requests` calculate this automatically. A hardcoded value will be incorrect if the request body changes, causing requests to fail.
- The `user_query` and `query_id` in the `body_template` are specific to one example. While the code in `chat.py` correctly overwrites these, it's confusing and bad practice to have such specific values in a template. They should be removed to make it clear they are dynamically generated.
{
"headers": {
"Host": "api.devin.ai",
"Sec-Ch-Ua-Platform": "\"macOS\"",
"Accept-Language": "ja",
"Sec-Ch-Ua": "\"Chromium\";v=\"139\", \"Not;A=Brand\";v=\"99\"",
"Content-Type": "application/json",
"Sec-Ch-Ua-Mobile": "?0",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36",
"Accept": "*/*",
"Origin": "https://deepwiki.com",
"Sec-Fetch-Site": "cross-site",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Dest": "empty",
"Referer": "https://deepwiki.com/",
"Accept-Encoding": "gzip, deflate, br",
"Priority": "u=4, i"
},
"body_template": {
"engine_id": "multihop",
"keywords": [],
"repo_names": [],
"additional_context": "",
"use_notes": false,
"generate_summary": false
}
}
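As a rough sketch of how the per-request fields could be filled in from a template like the one above: the function name `build_request` is hypothetical, and the `query_id` format (a plain UUID) is an assumption, since the real format used by `chat.py` is not shown here. The sketch also drops any stale `Content-Length` header so the HTTP client can compute it.

```python
import copy
import uuid


def build_request(config: dict, user_query: str, repo_names: list[str]) -> tuple[dict, dict]:
    """Return (headers, body) for one request without mutating the config."""
    # Drop Content-Length: the HTTP client computes it from the actual body.
    headers = {
        key: value
        for key, value in config.get("headers", {}).items()
        if key.lower() != "content-length"
    }
    # Copy the template so per-request values never leak back into it.
    body = copy.deepcopy(config.get("body_template", {}))
    body["user_query"] = user_query
    body["repo_names"] = list(repo_names)
    body["query_id"] = str(uuid.uuid4())  # assumed format, for illustration only
    return headers, body
```

Because the template is deep-copied, the same loaded config can be reused across requests without one query's values appearing in the next.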
[tool.setuptools.data-files]
"share/deepwiki_to_md" = [
    "src/config.json",
]
Using `[tool.setuptools.data-files]` to install `src/config.json` into `share/deepwiki_to_md` is not recommended. This approach can pollute the user's global Python environment (`sys.prefix`) and makes the file difficult for the package to locate reliably.
A more robust practice for packaging default data is to include it within the package itself and access it using `importlib.resources`. Given the current flat module structure, the simplest solution would be to embed the default configuration as a multi-line string directly within the `src/chat.py` module. This would remove the need for this `data-files` entry and make the default config loading more reliable.
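For comparison, the `importlib.resources` route mentioned above might look like the sketch below. It assumes the config ships inside an importable package. The function name `load_packaged_config` and the package name in the usage note are placeholders, because the project currently uses flat modules rather than a package. `importlib.resources.files` requires Python 3.9 or newer.

```python
import json
from importlib import resources


def load_packaged_config(package: str, resource: str = "config.json") -> dict:
    """Read a JSON data file that is bundled inside an installed package."""
    # files() returns a Traversable, so this does not depend on __file__
    # pointing at a regular directory on disk.
    text = resources.files(package).joinpath(resource).read_text(encoding="utf-8")
    return json.loads(text)
```

If the modules were moved into a package (say, a hypothetical `deepwiki_to_md_pkg`), the call would be `load_packaged_config("deepwiki_to_md_pkg")`, with `config.json` declared as package data.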
No description provided.