A Python-based web scraper that collects GitHub developer information, their followers, and repository details using Selenium and stores the data in a MySQL database.
- 🔥 Scrapes trending developers across multiple programming languages
- 👥 Collects follower information (up to 1000 per developer)
- 📦 Gathers repository details including name, URL, description, language, stars, and forks
- 🔐 Supports authentication via cookies or username/password
- 🗄️ Stores data in a MySQL database with automatic schema creation
- ⚠️ Includes error handling and logging
- 🧩 Follows clean architecture principles
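The cookie path might look roughly like the sketch below inside `infrastructure/auth/auth_service.py`; the function name and `cookies.json` file are assumptions, only the Selenium calls themselves are real:

```python
# Hypothetical sketch of cookie-based login; the function name and cookie
# file layout are assumptions, not the project's actual API.
import json

from selenium import webdriver

def login_with_cookies(driver: webdriver.Chrome, cookie_path: str = "cookies.json") -> None:
    """Load saved GitHub session cookies into an open browser session."""
    driver.get("https://github.com")  # the domain must be loaded before cookies can be set
    with open(cookie_path) as f:
        for cookie in json.load(f):
            cookie.pop("sameSite", None)  # Selenium can reject unexpected cookie keys
            driver.add_cookie(cookie)
    driver.refresh()  # reload so the authenticated session takes effect
```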
github-toolkit/
├── config/
│ └── settings.py # Configuration and environment variables
├── core/
│ ├── entities.py # Domain entities
│ └── exceptions.py # Custom exceptions
├── infrastructure/
│ ├── database/ # Database-related code
│ │ ├── connection.py
│ │ └── models.py
│ └── auth/ # Authentication service
│ └── auth_service.py
├── services/
│ └── scraping/ # Scraping services
│ ├── github_developer_scraper.py
│ └── github_repo_scraper.py
├── utils/
│ └── helpers.py # Utility functions
├── controllers/
│ └── github_scraper_controller.py # Main controller
├── main.py # Entry point
└── README.md
- 🐍 Python 3.8+
- 🗄️ MySQL database
- 🌐 Chrome browser
- 🧰 ChromeDriver (matching your installed Chrome version)
- Clone the repository:
git clone git@github.com:trinhminhtriet/github-toolkit.git
cd github-toolkit
- Create a virtual environment and activate it:
python3 -m venv .venv
source .venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
- Create a `.env` file in the root directory with the following variables:
GITHUB_USERNAME=your_username
GITHUB_PASSWORD=your_password
DB_USERNAME=your_db_username
DB_PASSWORD=your_db_password
DB_HOST=your_db_host
DB_NAME=your_db_name
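One plausible way for `config/settings.py` to pick these up with python-dotenv (the connection-URL shape, and the PyMySQL driver it implies, are assumptions):

```python
# Sketch of config/settings.py; variable names match .env above,
# everything else is an assumption about how the project wires them up.
import os

from dotenv import load_dotenv

load_dotenv()  # read .env from the project root

GITHUB_USERNAME = os.getenv("GITHUB_USERNAME")
GITHUB_PASSWORD = os.getenv("GITHUB_PASSWORD")

# SQLAlchemy connection URL; "mysql+pymysql" assumes PyMySQL is installed
DATABASE_URL = (
    f"mysql+pymysql://{os.getenv('DB_USERNAME')}:{os.getenv('DB_PASSWORD')}"
    f"@{os.getenv('DB_HOST')}/{os.getenv('DB_NAME')}"
)
```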
- Create a `config` directory:
mkdir config
- Create a `requirements.txt` file with:
selenium
sqlalchemy
python-dotenv
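SQLAlchemy also needs a MySQL driver (for example PyMySQL or mysqlclient) to connect; if the project does not already include one, add it to `requirements.txt` as well.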
- Run the scraper:
cd src
python main.py
The scraper will:
- 🔑 Authenticate with GitHub
- 🌟 Scrape trending developers for specified languages
- 👥 Collect their followers (up to 1000 per developer)
- 📦 Scrape their repositories
- 💾 Store all data in the MySQL database
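Put together, the controller's flow plausibly resembles this sketch; every class, method, and parameter name here is hypothetical:

```python
# Hypothetical orchestration of the five steps above; the scraper, store,
# and method names are illustrative, not the project's real API.
def run_pipeline(auth_service, dev_scraper, repo_scraper, store, languages):
    auth_service.login()                                   # 1. authenticate
    for language in languages:
        for dev in dev_scraper.scrape_trending(language):  # 2. trending developers
            followers = dev_scraper.scrape_followers(dev, limit=1000)  # 3. followers
            repos = repo_scraper.scrape_repos(dev)         # 4. repositories
            store.save(dev, followers, repos)              # 5. persist to MySQL
```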
- Modify `config/settings.py` (sketched below) to change:
  - `LANGUAGES`: the list of programming languages to scrape
  - `USE_COOKIE`: toggle between cookie-based and credential-based authentication
- ⏱️ Adjust sleep times in the scraping services if needed for rate limiting
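The relevant part of `config/settings.py` might then look like this (values are examples; `SLEEP_SECONDS` is a hypothetical name for the pacing constant):

```python
# Illustrative configuration; adjust values to your needs.
LANGUAGES = ["python", "javascript", "go"]  # trending pages to scrape
USE_COOKIE = True      # True: reuse saved session cookies; False: log in with credentials
SLEEP_SECONDS = 2      # hypothetical delay between page loads, for rate limiting
```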
Developers table:

- 🆔 id (PK)
- 👤 username (unique)
- 🔗 profile_url
- 🕒 created_at
- 🕒 updated_at
- 📅 published_at
Repositories table:

- 🆔 id (PK)
- 👤 username
- 📦 repo_name
- 📝 repo_intro
- 🔗 repo_url (unique)
- 🏷️ repo_lang
- ⭐ repo_stars
- 🍴 repo_forks
- 🕒 created_at
- 🕒 updated_at
- 📅 published_at
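With SQLAlchemy's declarative mapping, `infrastructure/database/models.py` plausibly defines the two tables along these lines; the column names come from the lists above, while types, lengths, and table names are assumptions:

```python
# Sketch of the two models; column names follow the schema lists above,
# column types and table names are assumptions.
from sqlalchemy import Column, DateTime, Integer, String, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Developer(Base):
    __tablename__ = "developers"
    id = Column(Integer, primary_key=True)
    username = Column(String(100), unique=True, nullable=False)
    profile_url = Column(String(255))
    created_at = Column(DateTime, server_default=func.now())
    updated_at = Column(DateTime, server_default=func.now(), onupdate=func.now())
    published_at = Column(DateTime)

class Repository(Base):
    __tablename__ = "repositories"
    id = Column(Integer, primary_key=True)
    username = Column(String(100), nullable=False)
    repo_name = Column(String(255))
    repo_intro = Column(String(500))
    repo_url = Column(String(255), unique=True)
    repo_lang = Column(String(50))
    repo_stars = Column(Integer)
    repo_forks = Column(Integer)
    created_at = Column(DateTime, server_default=func.now())
    updated_at = Column(DateTime, server_default=func.now(), onupdate=func.now())
    published_at = Column(DateTime)
```

A single `Base.metadata.create_all(engine)` call would then provide the automatic schema creation mentioned in the features.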
- ❗ Custom exceptions for authentication, scraping, and database operations
- 📝 Logging configured at INFO level
- 🛑 Graceful shutdown of browser instance
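Combined, these three points usually reduce to a pattern like the following; the exception names are taken from the list, and the `...` placeholder stands in for the actual scraping flow:

```python
# Sketch of the error-handling pattern; exception names come from the
# list above, the rest is illustrative.
import logging

from selenium import webdriver

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class AuthenticationError(Exception): ...
class ScrapingError(Exception): ...
class DatabaseError(Exception): ...

def main() -> None:
    driver = webdriver.Chrome()  # requires Chrome + ChromeDriver from the prerequisites
    try:
        ...  # authenticate, scrape, persist (see the flow above)
        logger.info("Scrape finished")
    except (AuthenticationError, ScrapingError, DatabaseError) as exc:
        logger.error("Run aborted: %s", exc)
    finally:
        driver.quit()  # graceful shutdown of the browser instance
```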
- Fork the repository.
- Create a feature branch (`git checkout -b feature/your-feature`).
- Commit changes (`git commit -m "Add your feature"`).
- Push to the branch (`git push origin feature/your-feature`).
- Open a pull request.
This project is licensed under the MIT License - see the LICENSE file for details (create one if needed).
- Built with Selenium, SQLAlchemy, and Python.
- Inspired by the need to automate GitHub data collection.