This project involves performing comprehensive Exploratory Data Analysis (EDA) on a simulated retail data warehouse modeled in a star schema format. Using SQL Server Management Studio (SSMS) and T-SQL, the project aims to uncover valuable business insights by querying fact and dimension tables.
The goal was to derive actionable intelligence around customer behavior, product performance, regional trends, and sales timelines — setting the stage for more advanced analytics such as BI dashboarding, forecasting, and segmentation.
- Understand the structure and volume of the retail data warehouse.
- Analyze key business entities: customers, products, sales transactions, and time.
- Generate analytical views of business magnitude and performance.
- Apply ranking and aggregation techniques to identify top/bottom contributors.
- Enable data-driven decision-making by providing clear metrics.
- SQL Server 2019
- T-SQL
- SSMS (SQL Server Management Studio)
- Star Schema Modeling
- Aggregate and Window Functions
The database is designed in a star schema, featuring:
-
Fact Table:
fact_sales
Contains transactional records of orders with measures likesales
,quantity
,price
, andorder_date
. -
Dimension Tables:
dim_customers
– Customer demographics, location, birthdate, etc.dim_products
– Product hierarchy including category, subcategory, product name, and cost.- Additional date-related fields (e.g.,
order_date
,ship_date
) embedded in the fact table.
The analysis begins with exploring the structure of the database using:
SELECT * FROM INFORMATION_SCHEMA.TABLES;
SELECT * FROM INFORMATION_SCHEMA.COLUMNS;
Findings:
- 6+ key tables
- ~100+ columns
- Clear star-schema relationships with proper foreign key design.
Purpose: Understand the static attributes of the business (customers, products, geography).
-
Identified unique countries customers are located in:
SELECT DISTINCT country FROM dim_customers;
-
Explored product hierarchy:
- 3 main categories: Accessories, Bikes, Clothing, and Components.
- 295 distinct products across subcategories.
-
Breakdown of products by category:
SELECT category, COUNT(product_id) FROM dim_products GROUP BY category;
Used MIN
, MAX
, and DATEDIFF
to analyze order timelines:
- First Order Date: Earliest transaction in the dataset.
- Last Order Date: Most recent transaction.
- Time Span Covered: Over 4 years of transactional data.
SELECT MIN(order_date), MAX(order_date), DATEDIFF(YEAR, MIN(order_date), MAX(order_date)) FROM fact_sales;
Also analyzed customer age range:
SELECT MIN(birth_date), MAX(birth_date) FROM dim_customers;
- Oldest customer age: ~70+ years
- Youngest customer age: ~18 years
Analyzed the core quantitative metrics using SQL aggregation functions:
SELECT SUM(sales), SUM(quantity), AVG(price), COUNT(DISTINCT order_number) FROM fact_sales;
- Total Sales: ₹23M+
- Total Quantity Sold: 400K+
- Average Price: ₹58.3
- Total Orders (Distinct): ~27K
- Total Products: 295
- Total Customers: 18K
- Active (ordering) Customers: 100%
Generated a consolidated business summary report using UNION ALL
:
SELECT 'TOTAL SALES', SUM(sales) FROM fact_sales
UNION ALL
SELECT 'TOTAL QUANTITY', SUM(quantity) ...
Purpose: Compare aggregate metrics across dimensions (e.g., revenue by category, products by gender, etc.).
-
Revenue by Product Category
SELECT category, SUM(sales) FROM fact_sales JOIN dim_products GROUP BY category;
-
Customer Count by Country/Gender
SELECT country, COUNT(*) FROM dim_customers GROUP BY country;
-
Average Cost by Product Category
SELECT category, AVG(product_cost) FROM dim_products GROUP BY category;
-
Quantity Distribution by Country
SELECT country, SUM(quantity) FROM fact_sales JOIN dim_customers GROUP BY country;
- Bikes dominate revenue contributions.
- United States leads in customer count and order volume.
- Male customers slightly outnumber female customers.
- Some categories have higher average costs but lower sales volumes.
Used SQL window functions to identify top and bottom performers in a business context.
SELECT product_name, ROW_NUMBER() OVER (ORDER BY SUM(sales) DESC) AS rank FROM fact_sales ...
... OVER (ORDER BY SUM(sales) ASC)
SELECT TOP 10 customer_id, SUM(sales) ...
SELECT TOP 3 customer_id, COUNT(DISTINCT order_number) ...
- The top 5 products contribute disproportionately to revenue.
- A few customers have placed very high-value orders.
- Some customers exist with only one or two purchases — useful for churn modeling.
Identified which dimensions are high or low in cardinality to aid future OLAP modeling:
-
Low cardinality:
gender
,country
,category
(Useful for slicing data in dashboards)
-
High cardinality:
product_name
,customer_id
(Important for indexing and efficient storage)
- Complete understanding of data warehouse structure and content.
- Quantitative profiles of products, customers, and time-based activity.
- Ranking insights to identify sales leaders and underperformers.
- Built a solid foundation for advanced use cases.
- Business Intelligence (BI): Create Power BI/Tableau dashboards using pre-aggregated views.
- Predictive Analytics: Feed insights into ML models for churn, upselling, and demand forecasting.
- OLAP Cubes: Create aggregations across time and region dimensions for fast reporting.
Pre-Requisite Medallion Architecture Project Access full SQL code and documentation: 🔗 GitHub: SQL Data Warehouse Project
This project demonstrates the real-world application of advanced SQL in a data warehousing environment. By leveraging analytical thinking, dimensional modeling, and SQL optimization techniques, we produced valuable insights for decision-makers.
If you're interested in data modeling, warehousing, or SQL performance tuning — this project provides a strong foundation in those areas.