Data Analytics Internship Task 5 | Data Cleaning & Preprocessing: From Raw Data to Refined Intelligence
In the vast digital world, data is the new oil, but only when it's clean, structured, and ready to drive insight. Through this project, I embarked on a data cleaning and preprocessing journey where raw and unstructured information was transformed into accurate, insightful, and analysis-ready datasets. This project focuses on mastering one of the most crucial stages in the data analytics lifecycle: ensuring data accuracy, consistency, and quality through systematic cleaning, transformation, and standardization.
The Data Cleaning & Preprocessing Project is an end-to-end initiative designed to prepare real-world data for analysis and visualization. Using Python (Pandas, NumPy, Matplotlib), this project demonstrates how data inconsistencies, missing values, and outliers are systematically handled to enhance data reliability and analytical depth. The dataset chosen for this project, Data_Salaries.csv, provides a fascinating view of job salaries, company scores, and professional attributes, serving as a perfect foundation for demonstrating preprocessing techniques in data analytics.
Dataset: Data_Salaries.csv (source: Kaggle), a real-world dataset capturing company names, job titles, locations, company scores, and salary details.
- Company: The organization offering a position
- Job Title: The role or designation held
- Location: The geographical location of the role
- Salary Estimate: Compensation figures (yearly/hourly)
- Company Score: Company performance or reputation rating
This dataset mirrors the real-world challenges analysts face (inconsistent text, missing values, messy salary formats, and outliers), making it an ideal playground to practice comprehensive data cleaning and preprocessing.
Step 1: Data Inspection and Understanding
The dataset was first examined to understand its structure, column types, and missing values. Duplicates, nulls, and irrelevant fields were identified and summarized for systematic cleaning.
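The inspection step described above might look roughly like the following sketch. The small DataFrame is a hypothetical stand-in for Data_Salaries.csv. The column names follow the dataset description, but the sample values and salary formats are assumptions made for illustration, not taken from the real file.

```python
import pandas as pd

# Hypothetical sample standing in for Data_Salaries.csv; in the project the
# data would be loaded with pd.read_csv("Data_Salaries.csv") instead.
df = pd.DataFrame({
    "Company": ["Acme", "Acme", "Globex", None],
    "Job Title": ["Data Analyst", "Data Analyst", "Data Engineer", "Data Scientist"],
    "Location": ["Austin, TX", "Austin, TX", None, "Remote"],
    "Salary Estimate": ["$70K - $90K", "$70K - $90K", "$45.00 Per Hour", None],
    "Company Score": [4.1, 4.1, None, 3.8],
})

# Structure and column types
df.info()

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(df.duplicated().sum())
print("Duplicate rows:", duplicate_rows)
```

A summary like this makes it easier to decide, column by column, whether to fill, drop, or leave missing entries before any transformation is applied.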
Missing or incomplete entries were treated carefully:
- Filled company scores with median values.
- Replaced missing locations with "Not Specified".
- Removed or imputed incomplete salary records.
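A minimal sketch of these three treatments is shown below, using hypothetical sample rows. For incomplete salary records it shows only the removal option; the imputation variant mentioned above is not shown, since the project text does not say which rule was used.

```python
import pandas as pd

# Hypothetical sample rows; the values are illustrative only.
df = pd.DataFrame({
    "Location": ["Austin, TX", None, "Remote"],
    "Company Score": [4.0, None, 3.0],
    "Salary Estimate": ["$70K - $90K", "$45.00 Per Hour", None],
})

# Fill missing company scores with the column median
median_score = df["Company Score"].median()
df["Company Score"] = df["Company Score"].fillna(median_score)

# Replace missing locations with a placeholder label
df["Location"] = df["Location"].fillna("Not Specified")

# Drop rows that have no salary information at all (removal option)
df = df.dropna(subset=["Salary Estimate"]).reset_index(drop=True)

print(df)
```

The median is used for the score because it is less sensitive to a few extreme ratings than the mean.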
To achieve consistency:
- Salary fields were standardized (converted hourly/monthly values into annual estimates).
- Text fields (company names, job titles, locations) were normalized by stripping stray whitespace and fixing inconsistent capitalization.
- Location fields were further split into City and State for finer analysis.
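The standardization steps above could be sketched as follows. The helper `to_annual_salary`, the salary string formats, and the conversion factors (40 hours per week, 52 weeks per year) are assumptions made for illustration; the real file may use different formats, and the project may have used different rules.

```python
import re
import pandas as pd

HOURS_PER_YEAR = 40 * 52   # assumption: full-time, 2,080 hours per year
MONTHS_PER_YEAR = 12

def to_annual_salary(text):
    """Hypothetical helper: midpoint of a salary string as an annual figure."""
    if not isinstance(text, str):
        return None
    # Capture numbers such as 70, 45.00 or 6,000 with an optional K suffix
    matches = re.findall(r"(\d[\d,]*(?:\.\d+)?)\s*([kK]?)", text)
    values = [float(num.replace(",", "")) * (1000 if k else 1) for num, k in matches]
    if not values:
        return None
    midpoint = sum(values) / len(values)
    lowered = text.lower()
    if "hour" in lowered:
        return midpoint * HOURS_PER_YEAR
    if "month" in lowered:
        return midpoint * MONTHS_PER_YEAR
    return midpoint

# Hypothetical sample rows; the values are illustrative only.
df = pd.DataFrame({
    "Company": ["  acme corp ", "GLOBEX"],
    "Location": ["austin, tx", "Remote"],
    "Salary Estimate": ["$70K - $90K", "$45.00 Per Hour"],
})

# Normalize text: trim whitespace and apply consistent capitalization
df["Company"] = df["Company"].str.strip().str.title()

# Split location into City and State; State is empty when there is no comma
parts = df["Location"].str.strip().str.partition(",")
df["City"] = parts[0].str.strip().str.title()
df["State"] = parts[2].str.strip().str.upper()

# Convert every salary string to an annual estimate
df["Annual Salary"] = df["Salary Estimate"].apply(to_annual_salary)

print(df[["Company", "City", "State", "Annual Salary"]])
```

Title-casing is a blunt tool: it would turn an acronym such as "IBM" into "Ibm", so a real pipeline would likely need a short list of exceptions.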
Outliers in salary data were identified using the Interquartile Range (IQR) and handled through winsorization, ensuring that extreme values didn't distort the overall insights.
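One way to read this step is to cap values at the usual IQR fences (1.5 times the IQR below the first quartile and above the third). The sketch below takes that reading; the 1.5 multiplier and the sample salaries are assumptions, and classical winsorization is often defined with fixed percentiles instead.

```python
import pandas as pd

# Hypothetical annual salaries with one extreme value
salaries = pd.Series(
    [50_000, 55_000, 60_000, 65_000, 70_000, 75_000, 80_000, 500_000],
    dtype="float64",
)

# Quartiles and interquartile range
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Fences: values outside [lower, upper] are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outlier_mask = (salaries < lower) | (salaries > upper)

# Cap outliers at the fences instead of dropping the rows
winsorized = salaries.clip(lower=lower, upper=upper)

print("Outliers found:", int(outlier_mask.sum()))
print(winsorized.tolist())
```

Capping keeps every record in the dataset, which matters when the row count is small and each listing still carries useful information in its other columns.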
After rigorous transformations, the cleaned dataset was saved as cleaned_data_salaries.csv, a structured, reliable dataset ready for analysis and visualization.
Visualization is where cleaned data becomes insightful. Using Matplotlib with a bright background and rich color palette, eight compelling visualizations were created to highlight salary trends, job distributions, and company performances:
- Top 10 Companies by Job Listings: Bar chart of the most active employers.
- Mean Annual Salary by Job Title: Reveals which roles command higher compensation.
- Salary Distribution Histogram: Shows how salaries spread across roles.
- Salary Boxplot by Top Companies: Identifies variation and outliers.
- Company Score vs. Salary Scatter Plot: Displays the relationship between company reputation and pay.
- Top Hiring Locations (Pie Chart): Shows where most opportunities are concentrated.
- Missing Values Heatmap: Highlights data completeness.
- Salary Unit Comparison: Analyzes distribution between hourly, yearly, and unspecified salaries.
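As an illustration of the first chart in the list, the sketch below counts listings per company and draws a bar chart with Matplotlib. The sample data, figure size, and color are assumptions; the project's actual styling is not reproduced here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical cleaned data; the values are illustrative only.
df = pd.DataFrame({
    "Company": ["Acme", "Acme", "Globex", "Initech", "Acme", "Globex"],
})

# Count listings per company and keep the ten most frequent
top_companies = df["Company"].value_counts().head(10)

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(top_companies.index, top_companies.values, color="tab:blue")
ax.set_title("Top 10 Companies by Job Listings")
ax.set_xlabel("Company")
ax.set_ylabel("Number of listings")
fig.tight_layout()
```

Calling `fig.savefig("top_companies.png")` afterwards would write the chart to disk; the file name is only an example.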
Clean data enables meaningful visualization: patterns once hidden in noise now tell clear stories about pay scales, organizational behavior, and job market dynamics.
- Annual salary ranges exhibit strong clustering around mid-tier roles, with a few high-paying outliers.
- Some companies consistently offer higher salaries aligned with strong company scores.
- Certain job titles show high salary variance, hinting at skill-based compensation structures.
- The dataset's original inconsistencies, once cleaned, revealed distinct trends across company types and regions.
Clean, standardized data is not just a preparatory step; it's the foundation of trustworthy analytics. Every insight depends on the accuracy achieved during preprocessing.
- Pandas: For cleaning, transformation, and handling missing data
- NumPy: For numerical and statistical operations
- Matplotlib: For bright, visually engaging charts
The integration of these tools ensured a smooth transition from raw, inconsistent data to a refined, analytics-ready dataset with clear, actionable insights.
This project demonstrates how effective preprocessing turns raw data into a structured foundation for analytics. By identifying inconsistencies, standardizing formats, and handling missing or extreme values, the dataset was transformed into an asset that drives reliable and visually rich insights.
Cleaning data is not just a mechanical task; it's an art of discovery. It teaches patience, precision, and problem-solving. Every inconsistency tells a story about real-world data challenges, and every fix builds the credibility of analysis.
"Good analysis starts with clean data, because behind every chart, there's a story that only clean data can tell."
- Abdullah Umar
- Data Analytics Intern at Internee.pk








