A comprehensive web scraping and data analysis platform that extracts job listings from theprotocol.it to analyze technology demand trends for developers in the Polish job market. The system provides automated data collection, intelligent analysis, and interactive visualizations to help developers understand market requirements and prioritize their learning efforts.
- Multi-Position Support: Configurable job position targeting (Python, Java, JavaScript, etc.)
- Automated Scraping: Scheduled data collection using Celery and Redis
- Intelligent Data Processing: Natural language processing with NLTK and fuzzy matching
- Interactive Visualizations: Comprehensive charts and graphs using Matplotlib
- RESTful API: Flask-based web service for data access
- Containerized Deployment: Full Docker and Docker Compose support
- Advanced Web Scraping: Scrapy + Selenium integration for JavaScript-rendered content
- Robust Data Pipeline: Custom ItemPipelines and Middlewares
- Database Integration: MySQL with SQLAlchemy ORM
- Task Queue Management: Celery with Redis broker
- Production-Ready: Nginx reverse proxy configuration
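The fuzzy-matching step mentioned above (used to normalize raw skill strings from listings) can be sketched as follows. This is a minimal illustration using the standard library's `difflib` as a stand-in for fuzzywuzzy; the canonical skill list and threshold are hypothetical, not taken from the project's actual code.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical canonical skill names to normalize raw listing text against.
CANONICAL_SKILLS = ["Python", "Django", "PostgreSQL", "Docker"]

def match_skill(raw: str, threshold: float = 0.8) -> Optional[str]:
    """Return the canonical skill whose name best matches `raw`,
    or None if no candidate clears the similarity threshold."""
    best, best_score = None, 0.0
    for skill in CANONICAL_SKILLS:
        score = SequenceMatcher(None, raw.lower(), skill.lower()).ratio()
        if score > best_score:
            best, best_score = skill, score
    return best if best_score >= threshold else None

print(match_skill("pythonn"))  # a typo still resolves to a canonical skill
print(match_skill("COBOL"))    # nothing close enough, so no match
```

fuzzywuzzy's `fuzz.ratio` plays the same role as `SequenceMatcher.ratio()` here, just with Levenshtein-based scoring.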
| Category | Technologies |
|---|---|
| Web Scraping | Scrapy, Selenium, scrapy-selenium4 |
| Data Processing | Pandas, NLTK, fuzzywuzzy |
| Visualization | Matplotlib, Jupyter |
| Web Framework | Flask, SQLAlchemy |
| Task Queue | Celery, Redis |
| Database | MySQL |
| Infrastructure | Docker, Docker Compose, Nginx |
The platform generates comprehensive visualizations including:
- Skills Analysis: Required vs. optional technical skills
- Experience Levels: Distribution of seniority requirements
- Employment Types: Contract types and arrangements
- Geographic Distribution: Location-based job distribution
- Market Trends: Technology demand over time
- Ukraine Support: Companies supporting Ukrainian developers
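The skills analysis above (required vs. optional skills per listing) boils down to counting skill occurrences and plotting grouped bars. Here is a minimal sketch with toy stand-in data; the real listings come from the scraping pipeline and the field names are assumptions:

```python
from collections import Counter
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt

# Toy stand-in for scraped listings (real data comes from the pipeline).
listings = [
    {"required": ["Python", "Django"], "optional": ["Docker"]},
    {"required": ["Python", "SQL"], "optional": ["Docker", "AWS"]},
    {"required": ["Python"], "optional": ["AWS"]},
]

required = Counter(s for job in listings for s in job["required"])
optional = Counter(s for job in listings for s in job["optional"])
skills = sorted(set(required) | set(optional))

fig, ax = plt.subplots()
x = range(len(skills))
ax.bar([i - 0.2 for i in x], [required[s] for s in skills],
       width=0.4, label="Required")
ax.bar([i + 0.2 for i in x], [optional[s] for s in skills],
       width=0.4, label="Optional")
ax.set_xticks(list(x))
ax.set_xticklabels(skills, rotation=45, ha="right")
ax.set_ylabel("Number of listings")
ax.legend()
fig.tight_layout()
fig.savefig("skills.png")
```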
- Docker and Docker Compose installed
- Git
- Clone the repository:

  ```bash
  git clone https://github.com/danieladdisonorg/Jobsite-Scraper-and-Analyzer.git
  cd Jobsite-Scraper-and-Analyzer
  ```

- Configure environment variables:

  ```bash
  cp .env.sample .env
  ```

  Edit the `.env` file with your configuration settings.

- Launch the application:

  ```bash
  docker-compose up --build
  ```

- Access the application:

  Navigate to http://localhost:8000/scraping/diagrams
| Endpoint | Method | Description |
|---|---|---|
| `/scraping/diagrams` | GET | Initiates data analysis and returns a Celery task ID |
| `/scraping/diagrams/<task_id>` | GET | Checks task status and retrieves results |
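A typical client kicks off the analysis, then polls the status endpoint with the returned task ID. The sketch below factors the polling loop around an injected `fetch` callable so it works with any HTTP client; the JSON field names (`state`, etc.) are assumptions about the payload, not confirmed from the project's code.

```python
import time
from typing import Callable

def poll_task(fetch: Callable[[str], dict], task_id: str,
              interval: float = 2.0, timeout: float = 60.0) -> dict:
    """Poll /scraping/diagrams/<task_id> via `fetch` until the Celery
    task reaches a terminal state or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        payload = fetch(f"/scraping/diagrams/{task_id}")
        if payload.get("state") in ("SUCCESS", "FAILURE"):
            return payload
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish in {timeout}s")
```

With the `requests` library, `fetch` could be `lambda path: requests.get("http://localhost:8000" + path).json()`.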
Modify `config.py` to customize scraping parameters:

```python
# Target job position (lowercase)
POSITION = "python"  # Options: "java", "javascript", "dev", etc.

# Scraping frequency (days between runs)
SCRAPING_EVERY_DAYS = 7
```
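`SCRAPING_EVERY_DAYS` feeds the periodic schedule that Celery beat uses to trigger scraping runs. The fragment below is a hypothetical sketch of that wiring; the actual app name, broker URL, and task path in `main_celery/` may differ.

```python
from datetime import timedelta
from celery import Celery

SCRAPING_EVERY_DAYS = 7  # mirrors config.py

# Hypothetical names: adjust to the project's real module and task paths.
app = Celery("main_celery", broker="redis://redis:6379/0")
app.conf.beat_schedule = {
    "scrape-job-listings": {
        "task": "main_celery.tasks.scrape_listings",
        "schedule": timedelta(days=SCRAPING_EVERY_DAYS),
    },
}
```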
```
├── analyzing/           # Data analysis and visualization modules
├── web_server/          # Flask web application
├── main_celery/         # Celery configuration and tasks
├── static/              # Static assets and scraping results
├── config.py            # Application configuration
├── docker-compose.yml   # Container orchestration
└── requirements.txt     # Python dependencies
```
- Web Scraping Mastery: Advanced techniques with Scrapy and Selenium
- Data Pipeline Development: ETL processes and data transformation
- Asynchronous Task Processing: Celery and Redis implementation
- Containerization: Docker and microservices architecture
- Data Visualization: Statistical analysis and chart generation
- Cloud Storage Integration: Migrate from local file storage to cloud solutions (AWS S3, Google Cloud Storage)
- Database Optimization: Implement proper data warehousing for scraped content
- Container Optimization: Reduce Docker image sizes and improve build efficiency
- Code Architecture: Refactor to follow SOLID principles and improve modularity
- Comprehensive Testing: Unit, integration, and end-to-end test coverage
- CI/CD Pipeline: Automated testing and deployment workflows
- API Documentation: OpenAPI/Swagger integration
- Real-time Analytics: WebSocket integration for live data updates
- Machine Learning: Predictive analytics for job market trends
- Multi-language Support: Expand beyond Polish job market
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
Daniel Addison
- GitHub: @danieladdisonorg
- Project Link: https://github.com/danieladdisonorg/Jobsite-Scraper-and-Analyzer