Notice π£ : Works with EA FC 25 π! This library has been migrated from Puppeteer (JavaScript) to Playwright (Python). This change has proven to be more effective against Cloudflare bot protections implemented by SoFIFA.
Puppeteer code is kept in the legacy branch.
A comprehensive Playwright-based web scraper that extracts player URLs and detailed statistics from sofifa.com and saves them to CSV files.
- Install dependencies
pip3 install -r requirements.txt playwright install chromium
- Validate the scraper
python3 tests/test_scraper.py
- Collect player URLs
python3 src/scrape_player_urls.py
- Scrape player stats (with optional CLI arguments)
Omit an argument to use its default value (e.g., remove
python3 src/sofifa_scraper.py --max-players 100 \ --player-urls-file player_urls.csv \ --output-file player_stats.csv--max-playersto scrape all players).
sofifa-web-scraper/
βββ src/ # Source code
β βββ scrape_player_urls.py # URL collector
β βββ player_scraper.py # Modular scraper class
β βββ sofifa_scraper.py # Main stats scraper
βββ tests/ # Test files
β βββ test_scraper.py # Single player test
βββ player_urls.csv # Input: Player URLs
βββ player_stats.csv # Output: Player stats
βββ README.md # This file
The scraper uses a modular design:
src/player_scraper.py (PlayerScraper class)
β
src/sofifa_scraper.py (SoFIFAScraper class)
β
player_stats.csv (output)
- Basic Info: player_id, version, name, full_name, description, image, height_cm, weight_kg, dob, positions
- Ratings & Value: overall_rating, potential, value, wage
- Profile Attributes: preferred_foot, weak_foot, skill_moves, international_reputation, body_type, real_face, release_clause, specialities
- Club Information: club_id, club_name, club_league_id, club_league_name, club_logo, club_rating, club_position, club_kit_number, club_joined, club_contract_valid_until
- National Team: country_id, country_name, country_league_id, country_league_name, country_flag, country_rating, country_position, country_kit_number
- Attacking Stats: crossing, finishing, heading_accuracy, short_passing, volleys
- Skill Stats: dribbling, curve, fk_accuracy, long_passing, ball_control
- Movement Stats: acceleration, sprint_speed, agility, reactions, balance
- Power Stats: shot_power, jumping, stamina, strength, long_shots
- Mentality Stats: aggression, interceptions, positioning, vision, penalties, composure
- Defending Stats: defensive_awareness, standing_tackle, sliding_tackle
- Goalkeeping Stats: gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes
- Special: play_styles (comma-separated list), url
- Headless mode - Runs without opening browser window
- Resource blocking - Blocks images, CSS, fonts, and media for faster loading
- Cloudflare bypass - Automatic retry with 10s backoff on bot challenges (5 retries)
- Async implementation - Better performance with async/await
- Incremental saves - Progress saved after each player is scraped
All CSV headers use lowercase_snake_case format with no spaces for consistency.
- Python 3.8 or higher
- pip (Python package manager)
-
Install Python dependencies:
pip3 install -r requirements.txt
-
Install Playwright browsers:
playwright install chromium
Test the scraper on a single player (Lionel Messi) to verify everything works:
python3 tests/test_scraper.pyExpected output:
- β Extraction summary in console
- β
test_output.jsonfile with all 70+ fields - β Field presence/absence report
First, run the URL scraper to get all player URLs:
python3 src/scrape_player_urls.pyThis script will:
- Navigate to
https://sofifa.com/players?col=oa&sort=desc - Extract all player URLs from the current page
- Click through pagination (offset increments of 60)
- Continue until no "Next" button exists
- Save all URLs to
player_urls.csvafter each page - Automatically retry on Cloudflare challenges (up to 5 times with 10s backoff)
Output: player_urls.csv with all player profile URLs
After you have player_urls.csv, run the stats scraper:
python3 src/sofifa_scraper.py [--max-players N] [--player-urls-file PATH] [--output-file PATH]This script will:
- Load player URLs from
player_urls.csv - Visit each player page and extract detailed stats (runs in headless mode)
- Save progress incrementally after each player to
player_stats.csv - Automatically retry on Cloudflare challenges (up to 5 times with 10s backoff)
- Display a summary with extracted data
Notes:
- Omit CLI arguments to use default values (
player_urls.csv,player_stats.csv, and scraping all players). - Use
--max-playersto limit runs for testing (for example,--max-players 50). - Provide alternate input/output paths with
--player-urls-fileand--output-file.
Comprehensive CSV with 70+ columns in lowercase_snake_case format:
player_id,version,name,full_name,description,image,height_cm,weight_kg,dob,positions,overall_rating,potential,value,wage,preferred_foot,weak_foot,skill_moves,international_reputation,body_type,real_face,release_clause,specialities,club_id,club_name,club_league_id,club_league_name,club_logo,club_rating,club_position,club_kit_number,club_joined,club_contract_valid_until,country_id,country_name,country_league_id,country_league_name,country_flag,country_rating,country_position,country_kit_number,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_att_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_defensive_awareness,defending_standing_tackle,defending_sliding_tackle,goalkeeping_gk_diving,goalkeeping_gk_handling,goalkeeping_gk_kicking,goalkeeping_gk_positioning,goalkeeping_gk_reflexes,play_styles,url
158023,FC 26,L. Messi,Lionel AndrΓ©s Messi Cuccitini,...
Key Features:
- All headers in lowercase_snake_case (no spaces)
- 70+ comprehensive fields per player
- Incremental saves (progress preserved after each player)
You can modify the scraper by editing src/sofifa_scraper.py:
- Limit players: Set
max_players = 10to scrape specific amount instead of all - Change input file: Pass different filename to
SoFIFAScraper(player_urls_file="custom.csv") - Change output filename: Modify the filename in the scraper initialization
- Adjust retry settings: Modify
max_retriesand sleep duration in the scraping loop - Modify extracted fields: Edit the
PlayerScraperclass insrc/player_scraper.py
- The scraper runs in headless mode (no visible browser window)
- Resource blocking is enabled to skip loading images, CSS, fonts, and media for faster performance
- Cloudflare detection: Automatically retries up to 5 times with 10-second pauses if bot challenge is detected
- Incremental saves: Progress is saved after each player, so you can resume if interrupted
- The scraper respects the website's structure as of the implementation date
- Website structure changes may require updates to the selectors in
player_scraper.py - Be respectful of the website's terms of service and rate limits
Browser not found:
playwright install chromiumCloudflare challenges:
- The scraper automatically retries up to 5 times with 10s backoff
- If challenges persist, the website may have stricter bot detection
- Consider adding longer delays between requests
Timeout errors:
- Check your internet connection
- The website might be slow or down
- Increase timeout values in the script
No stats extracted:
- Run
python3 tests/test_scraper.pyfirst to verify extraction works - Check
test_output.jsonfor raw extracted data - The website structure may have changed
- Check if player URLs in CSV are valid and accessible
- Verify the selectors in
src/player_scraper.py
Missing fields in output:
- Some players may not have all fields (e.g., no club)
- This is expected and handled gracefully
- Run
python3 tests/test_scraper.pyto see which fields are typically missing
Before running full scrape:
- Run
python3 tests/test_scraper.pysuccessfully - Check
test_output.jsonhas expected data - Verify CSV headers match your schema
- Test with
max_players=5first insrc/sofifa_scraper.py - Then run full scrape
MIT License - Feel free to use and modify as needed.
