
Jobsite Scraper and Analyzer


Overview

A comprehensive web scraping and data analysis platform that extracts job listings from theprotocol.it to analyze technology demand trends for developers in the Polish job market. The system provides automated data collection, intelligent analysis, and interactive visualizations to help developers understand market requirements and prioritize their learning efforts.

πŸš€ Features

Core Functionality

  • Multi-Position Support: Configurable job position targeting (Python, Java, JavaScript, etc.)
  • Automated Scraping: Scheduled data collection using Celery and Redis
  • Intelligent Data Processing: Natural language processing with NLTK and fuzzy matching
  • Interactive Visualizations: Comprehensive charts and graphs using Matplotlib
  • RESTful API: Flask-based web service for data access
  • Containerized Deployment: Full Docker and Docker Compose support
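The "intelligent data processing" step above pairs NLTK with fuzzy matching to fold noisy skill names from listings into canonical ones. A minimal, dependency-free sketch of the same idea (the project itself uses fuzzywuzzy; the canonical skill list here is hypothetical):

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical canonical skill names; the real project normalizes with fuzzywuzzy.
CANONICAL_SKILLS = ["python", "javascript", "java", "docker", "sql"]

def normalize_skill(raw: str, cutoff: float = 0.8) -> Optional[str]:
    """Map a noisy skill string from a listing to a canonical name, or None."""
    matches = get_close_matches(raw.strip().lower(), CANONICAL_SKILLS, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A fuzzywuzzy equivalent would call `process.extractOne(raw, CANONICAL_SKILLS)` and apply a score threshold.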

Technical Capabilities

  • Advanced Web Scraping: Scrapy + Selenium integration for JavaScript-rendered content
  • Robust Data Pipeline: Custom ItemPipelines and Middlewares
  • Database Integration: MySQL with SQLAlchemy ORM
  • Task Queue Management: Celery with Redis broker
  • Production-Ready: Nginx reverse proxy configuration
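With the Scrapy + Selenium integration, the spider yields Selenium-backed requests so JavaScript-rendered pages reach the parse callback as full HTML; the extraction itself then reduces to pulling fields out of that markup. A dependency-free sketch of that last step (illustrative markup and field names, not theprotocol.it's real structure; the actual spider would use Scrapy's CSS/XPath selectors):

```python
import re

def parse_listing(fragment: str) -> dict:
    """Pull title and location from a rendered listing fragment,
    the way a Scrapy parse callback would with selectors."""
    title = re.search(r'class="title">([^<]+)<', fragment)
    loc = re.search(r'class="loc">([^<]+)<', fragment)
    return {
        "title": title.group(1) if title else None,
        "location": loc.group(1) if loc else None,
    }

# Illustrative rendered HTML, not the site's real markup.
html = '<h2 class="title">Python Developer</h2><span class="loc">Warszawa</span>'
item = parse_listing(html)
```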

πŸ›  Technology Stack

| Category | Technologies |
| --- | --- |
| Web Scraping | Scrapy, Selenium, scrapy-selenium4 |
| Data Processing | Pandas, NLTK, fuzzywuzzy |
| Visualization | Matplotlib, Jupyter |
| Web Framework | Flask, SQLAlchemy |
| Task Queue | Celery, Redis |
| Database | MySQL |
| Infrastructure | Docker, Docker Compose, Nginx |

πŸ“Š Analytics Dashboard

The platform generates comprehensive visualizations including:

  • Skills Analysis: Required vs. optional technical skills
  • Experience Levels: Distribution of seniority requirements
  • Employment Types: Contract types and arrangements
  • Geographic Distribution: Location-based job distribution
  • Market Trends: Technology demand over time
  • Ukraine Support: Companies supporting Ukrainian developers
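Each of these charts starts from simple frequency counts over the scraped listings. A minimal sketch of the aggregation that would feed the required/optional skills bars (toy rows standing in for real scraped data):

```python
from collections import Counter

# Toy listings standing in for scraped rows.
listings = [
    {"required": ["python", "sql"], "optional": ["docker"]},
    {"required": ["python"], "optional": ["aws", "docker"]},
]

required_counts = Counter(s for row in listings for s in row["required"])
optional_counts = Counter(s for row in listings for s in row["optional"])
# Counters like these are what a Matplotlib bar chart would plot, e.g.:
# plt.bar(required_counts.keys(), required_counts.values())
```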

πŸš€ Quick Start

Prerequisites

  • Docker and Docker Compose installed
  • Git

Installation

  1. Clone the repository

```shell
git clone https://github.com/danieladdisonorg/Jobsite-Scraper-and-Analyzer.git
cd Jobsite-Scraper-and-Analyzer
```

  2. Configure environment variables

```shell
cp .env.sample .env
```

Edit the .env file with your configuration settings.

  3. Launch the application

```shell
docker-compose up --build
```

  4. Access the application

Navigate to http://localhost:8000/scraping/diagrams

πŸ“– API Documentation

Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /scraping/diagrams | GET | Initiates data analysis and returns Celery task ID |
| /scraping/diagrams/<task_id> | GET | Checks task status and retrieves results |
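Client side, the two endpoints support a submit-then-poll pattern: the first call returns a task ID, and the second is polled until the Celery task finishes. A small sketch of the URL building and status check (the `state` field and terminal states are assumptions based on standard Celery task states, not a documented response schema):

```python
BASE_URL = "http://localhost:8000"  # host/port from the Quick Start section

def status_url(task_id: str) -> str:
    """URL for polling a previously submitted analysis task."""
    return f"{BASE_URL}/scraping/diagrams/{task_id}"

def is_finished(payload: dict) -> bool:
    """True once the task reaches a terminal Celery state (assumed field name)."""
    return payload.get("state") in {"SUCCESS", "FAILURE"}
```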

Configuration

Modify config.py to customize scraping parameters:

```python
# Target job position (lowercase)
POSITION = "python"  # Options: "java", "javascript", "dev", etc.

# Scraping frequency
SCRAPING_EVERY_DAYS = 7
```
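SCRAPING_EVERY_DAYS presumably drives the periodic Celery task. One plausible wiring, sketched with an assumed task path (the real schedule lives in main_celery/):

```python
from datetime import timedelta

SCRAPING_EVERY_DAYS = 7  # mirrors config.py

# Hypothetical Celery beat entry; the task name is an assumption.
beat_schedule = {
    "scrape-job-listings": {
        "task": "main_celery.tasks.scrape",
        "schedule": timedelta(days=SCRAPING_EVERY_DAYS),
    },
}
```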

πŸ— Architecture

β”œβ”€β”€ analyzing/          # Data analysis and visualization modules
β”œβ”€β”€ web_server/         # Flask web application
β”œβ”€β”€ main_celery/        # Celery configuration and tasks
β”œβ”€β”€ static/             # Static assets and scraping results
β”œβ”€β”€ config.py           # Application configuration
β”œβ”€β”€ docker-compose.yml  # Container orchestration
└── requirements.txt    # Python dependencies

πŸ“ˆ Sample Visualizations

The repository includes example charts (images omitted here):

  • Skills by Experience Level
  • Required Technical Skills
  • Optional Technical Skills
  • Experience Level Distribution
  • Employment Types
  • Geographic Distribution

πŸ”§ Development

Key Learning Outcomes

  • Web Scraping Mastery: Advanced techniques with Scrapy and Selenium
  • Data Pipeline Development: ETL processes and data transformation
  • Asynchronous Task Processing: Celery and Redis implementation
  • Containerization: Docker and microservices architecture
  • Data Visualization: Statistical analysis and chart generation

Future Enhancements

High Priority

  • Cloud Storage Integration: Migrate from local file storage to cloud solutions (AWS S3, Google Cloud Storage)
  • Database Optimization: Implement proper data warehousing for scraped content
  • Container Optimization: Reduce Docker image sizes and improve build efficiency

Medium Priority

  • Code Architecture: Refactor to follow SOLID principles and improve modularity
  • Comprehensive Testing: Unit, integration, and end-to-end test coverage
  • CI/CD Pipeline: Automated testing and deployment workflows
  • API Documentation: OpenAPI/Swagger integration

Low Priority

  • Real-time Analytics: WebSocket integration for live data updates
  • Machine Learning: Predictive analytics for job market trends
  • Multi-language Support: Expand beyond Polish job market

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ“ž Contact

Daniel Addison


⭐ Star this repository if you find it helpful!
