🍔 NutriScore Analyzer: Unwrapping the Secrets of U.S. Packaged Foods Additives

👉 Click Here to See HTML Report | ipynb notebook

🎯 Project Overview

This project analyzes the nutrition facts of U.S. packaged foods and tests how well Nutri‑Score labeling reflects them. The ML/NLP pipeline covers data cleaning, EDA, feature engineering, PCA/UMAP, clustering, regression, and classification. The goal is to uncover the relationship between food additives and Nutri‑Score.

💡 Why it’s interesting

As consumers become increasingly health-conscious, it is worth testing whether Nutri‑Score grades capture what matters (sugar, fat, salt, fiber) and whether they miss what also matters (additive load, brand practices). The findings can inform healthier choices and show what the label does not tell you.

🔍 Key Research Questions

Does Nutri-Score reflect the presence of food additives, or are there potential inconsistencies?
Do certain brands consistently use more additives and tend to receive lower Nutri-Scores?
How do food additives and healthiness vary across different brands of packaged foods in the U.S.?

🤔 Tasks

Goal: Predict packaged food healthiness and analyze drivers (ingredients, additives, brand info).
- Multiclass classification → Nutri-Score: grade A to E.
- Binary classification → Healthy vs. Unhealthy.
- Exploratory & Unsupervised → PCA/UMAP for visualization, KMeans for cluster profiling.

🛎️ Key Findings

Big picture: U.S. packaged foods are dominated by fortified nutrients, sugars, stabilizers, and oils.
For Consumers: Nutri-Score alone is not sufficient for additive-conscious choices. "Clean label" ≠ low additive use.
For Regulators: Consider an additive/processing penalty in next-gen front-of-pack labeling.
Label gap: Nutri-Score emphasizes nutrients but under-captures the complexity of food processing and additives.

📊 Dataset

Source: Open Food Facts (Kaggle)
Size: 173k+ product records
Features: 163 features

🛠️ Pipeline Methodology:

Cleaning & Standardization

Numerical: unit normalization (per 100 g), outlier caps, missing‑value handling
Data deduplication: drop duplicates by barcode/product key
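
The cleaning steps above could look roughly like the following pandas sketch. The column names (`code`, `sugars_100g`), the 1st/99th percentile caps, and median imputation are assumptions chosen for illustration; the notebook's actual thresholds and imputation strategy are not stated here.

```python
import pandas as pd


def per_100g(value_per_serving: float, serving_size_g: float) -> float:
    """Rescale a per-serving amount to a per-100 g amount."""
    return value_per_serving * 100.0 / serving_size_g


def clean_nutrition(df: pd.DataFrame, value_cols: list[str]) -> pd.DataFrame:
    """Deduplicate by barcode, cap outliers, and fill missing values."""
    # Keep the first record per barcode ("code" is an assumed column name).
    out = df.drop_duplicates(subset="code", keep="first").copy()
    for col in value_cols:
        # Cap extremes at the 1st and 99th percentiles (assumed thresholds).
        low, high = out[col].quantile([0.01, 0.99])
        out[col] = out[col].clip(lower=low, upper=high)
        # Fill what is still missing with the column median (assumed strategy).
        out[col] = out[col].fillna(out[col].median())
    return out


# Tiny made-up sample: one duplicated barcode and one missing value.
raw = pd.DataFrame(
    {
        "code": ["001", "001", "002", "003"],
        "sugars_100g": [10.0, 10.0, None, 30.0],
    }
)
cleaned = clean_nutrition(raw, ["sugars_100g"])
print(cleaned)
```

Capping by percentile keeps implausible label values from dominating later scaling steps, at the cost of slightly compressing genuinely extreme products.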

NLP Text Preprocessing

Normalization: lowercase, punctuation handling, strip non-English/noisy chars.
Stopwords: prune English (NLTK) and multilingual stopword lists; remove domain/measurement noise (e.g., organic, mg, g, oz).
Synonyms map: create a dictionary for frequent variants (e.g., "ascorbic acid" ↔ "vitamin C"; brand aliases).
Tokenization: Lemmatization & Stemming ("starches" → "starch"; "sugary" → "sugar").
Standardize E-numbers: internationally used codes for food additives, map codes to names (e.g., E150d → "Caramel IV").
TF-IDF vectorization: convert cleaned text into numeric features. Informative tokens (e.g., unigrams: "aspartame", "sodium", "E150d"; bigrams: "palm oil", "citric acid") receive high TF-IDF weights, while generic terms (e.g., water, organic) are down-weighted.
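
As a minimal sketch of the text steps listed above, the snippet below normalizes ingredient strings, maps E-numbers and synonyms, drops measurement noise, and builds unigram and bigram TF-IDF features with scikit-learn. The lookup tables and sample ingredient lists are hypothetical stand-ins for the project's own dictionaries, and lemmatization/stemming is left out to keep the example dependency-light.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical lookup tables; the project's real dictionaries are not shown.
E_NUMBERS = {"e150d": "caramel iv", "e330": "citric acid"}
SYNONYMS = {"vitamin c": "ascorbic acid"}
NOISE = {"organic", "mg", "g", "oz"}


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, map codes and synonyms, drop noise tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    for code, name in E_NUMBERS.items():
        text = re.sub(rf"\b{re.escape(code)}\b", name, text)
    for variant, canonical in SYNONYMS.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    return " ".join(tok for tok in text.split() if tok not in NOISE)


# Made-up ingredient lists for illustration.
docs = [
    "Water, Sugar, E150d, Vitamin C (10 mg), Organic Palm Oil",
    "Sugar, palm oil, citric acid (E330)",
    "Whole grain oats, salt",
]
cleaned_docs = [normalize(d) for d in docs]

# ngram_range=(1, 2) keeps bigrams such as "palm oil" alongside unigrams.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(cleaned_docs)
print(cleaned_docs[0])
print(tfidf.shape)
```

Mapping codes and synonyms before vectorization means that "E330" and "citric acid" contribute to the same feature instead of splitting their weight.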
Figure: word cloud of the 50 most common ingredients in U.S. packaged foods.

Classification Modeling

Dimensionality Reduction: Principal Component Analysis (PCA)
Unsupervised Clustering: K-means clustering and visualization using UMAP:
- Elbow method to choose k=4 clusters
- Cluster profiling by nutrition features
- Visualize four clusters in 2D UMAP space
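
The elbow step can be sketched as follows, using synthetic data in place of the nutrition features. This is an illustration only: the four Gaussian blobs are made up, PCA stands in for the 2D projection, and the UMAP step is omitted because it needs the separate `umap-learn` package.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def elbow_inertias(X, k_values, seed=0):
    """Return the K-means inertia for each k after scaling and a 2-D PCA."""
    X_scaled = StandardScaler().fit_transform(X)
    X_2d = PCA(n_components=2, random_state=seed).fit_transform(X_scaled)
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_2d)
        inertias[k] = float(model.inertia_)
    return inertias


# Synthetic stand-in for nutrition features: four blobs in five dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(4, 5))
X = np.vstack([c + rng.normal(size=(50, 5)) for c in centers])

inertias = elbow_inertias(X, range(1, 8))
for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting these inertias against k and looking for the point where the curve flattens is the usual way to read the elbow.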

Prediction Modeling

Multiclass: Nutri-Score from A to E.
Binary: Healthy vs. Unhealthy (A, B vs. C, D, E).
Logistic Regression
Random Forest
Evaluate with Accuracy, F1, ROC-AUC, confusion matrix
Sugars and saturated fat are the top predictors: the model shows that products low in sugar, saturated fat, and salt are far more likely to be classified as healthy.
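
A minimal sketch of the binary setup, assuming scikit-learn and synthetic features in place of the real dataset: grades A and B map to Healthy, and both models are scored with the metrics listed above. The data generator, hyperparameters, and split ratio are illustrative choices, not the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def to_binary(grade: str) -> int:
    """1 = Healthy (A or B), 0 = Unhealthy (C, D, or E)."""
    return int(grade.lower() in {"a", "b"})


def evaluate(model, X, y, seed=0):
    """Fit on a stratified train split and score on the held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    return {
        "accuracy": accuracy_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "roc_auc": roc_auc_score(y_te, proba),
        "confusion_matrix": confusion_matrix(y_te, pred),
    }


# Synthetic stand-in: three standardized features (sugar, saturated fat, salt).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
score = X.sum(axis=1) + rng.normal(scale=0.5, size=400)
grades = np.array(list("ABCDE"))[np.digitize(score, [-1.5, -0.5, 0.5, 1.5])]
y = np.array([to_binary(g) for g in grades])

results = {
    "logistic_regression": evaluate(LogisticRegression(max_iter=1000), X, y),
    "random_forest": evaluate(
        RandomForestClassifier(n_estimators=100, random_state=0), X, y
    ),
}
for name, metrics in results.items():
    print(name, round(metrics["accuracy"], 3), round(metrics["roc_auc"], 3))
```

Stratifying the split keeps the Healthy/Unhealthy ratio the same in train and test, which matters when the grades are imbalanced.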

✍️ Results Summary

Model Fit:

OLS R² ≈ 0.83 (Adj R² = 0.83)
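
That R² and adjusted R² agree to two decimals is what the standard adjustment formula predicts when the sample is large relative to the number of predictors. The sketch below applies that formula; the sample size and predictor count are hypothetical placeholders, since the exact values used in the regression are not stated.

```python
def adjusted_r2(r2: float, n_samples: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical inputs: n is loosely based on the dataset size, p is made up.
adj = adjusted_r2(0.83, n_samples=173_000, n_predictors=50)
print(round(adj, 4))
```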

Primary Drivers:

Sugar: +7.06 → higher score (worse healthiness)
Total Fat: +5.05
Saturated Fat: +2.90
Fiber: –1.66 (improves healthiness)
Carbohydrates: –0.19

Nonlinear & Interaction Effects:

Diminishing returns: sugars² (–2.995) & fat² (–2.…)
Salt synergy: sugar×salt (+0.38), fat×salt (+1.26)
Sugar–fat offset: sugar×fat (–1.74)
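
One way to read these terms together is through the marginal effect of sugar, which is the derivative of the fitted score with respect to sugar. The sketch below assumes all listed coefficients come from a single model, that the sugar×fat term uses total fat, and that predictors are on whatever scale the model used (not stated here), so the outputs are in model units only.

```python
# Coefficients copied from the results summary; predictor scaling is unknown.
B_SUGAR = 7.06
B_SUGAR_SQ = -2.995
B_SUGAR_SALT = 0.38
B_SUGAR_FAT = -1.74


def sugar_marginal_effect(sugar: float, fat: float = 0.0, salt: float = 0.0) -> float:
    """d(score)/d(sugar) = b1 + 2*b11*sugar + b_salt*salt + b_fat*fat."""
    return B_SUGAR + 2.0 * B_SUGAR_SQ * sugar + B_SUGAR_SALT * salt + B_SUGAR_FAT * fat


# The slope shrinks as sugar rises (diminishing returns) ...
print(sugar_marginal_effect(0.0))
print(sugar_marginal_effect(1.0))
# ... grows with salt (synergy) and shrinks with fat (offset).
print(sugar_marginal_effect(1.0, salt=1.0))
print(sugar_marginal_effect(1.0, fat=0.5))
```

With fat and salt held at zero, the slope reaches zero near sugar ≈ 7.06 / 5.99 in model units, which is where the quadratic term cancels the linear one.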

Additive Impact:

Log(additives) +0.35 (p < 0.001), with tapering returns
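
The tapering follows from the log transform: each doubling of the additive count adds the same amount to the score, so going from 1 to 2 additives matters as much as going from 10 to 20. The sketch below assumes a natural log applied directly to the count; the notebook may use a different base or a log(1 + n) variant.

```python
import math

B_LOG_ADDITIVES = 0.35  # coefficient from the summary above


def additive_contribution(n_additives: int) -> float:
    """Score contribution of the log-additives term (assumes natural log, n >= 1)."""
    return B_LOG_ADDITIVES * math.log(n_additives)


# Each doubling adds 0.35 * ln(2), regardless of the starting count.
step_low = additive_contribution(2) - additive_contribution(1)
step_high = additive_contribution(20) - additive_contribution(10)
print(round(step_low, 4), round(step_high, 4))
```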

Palm-Oil Flags:

“May-be from palm oil” (–0.06)
“Definitely from palm oil” not significant

Brand Effects:

Food Club (+1.50): higher score
Great Value (+1.00): higher score
Shoprite (–0.41): marginally lower score

🎯 Research Question 1: Does Nutri-Score reflect the presence of food additives?

Conclusion: Not reliably.

Some products with a high number of additives still receive a good Nutri-Score
This suggests Nutri-Score may overlook additive information when assessing overall healthfulness.

🎯 Research Question 2: Which brands use more additives and have lower Nutri-Scores?

Conclusion: Often; additive load works as a brand-level marker.

Additives alone don’t determine the score, but heavy additive usage frequently coincides with lower overall nutrition quality when aggregated at the brand level.
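
A brand-level aggregation of this kind might be sketched as below with pandas. The column names (`brand`, `additives_n`, `nutri_score_points`), the minimum-product filter, and the sample rows are all hypothetical; higher points are taken to mean a worse grade, in line with the regression summary.

```python
import pandas as pd


def brand_profile(df: pd.DataFrame, min_products: int = 2) -> pd.DataFrame:
    """Average additive count and score per brand, dropping tiny brands."""
    profile = df.groupby("brand").agg(
        n_products=("additives_n", "count"),
        mean_additives=("additives_n", "mean"),
        mean_score=("nutri_score_points", "mean"),
    )
    # Brands with very few products give unstable averages, so filter them out.
    profile = profile[profile["n_products"] >= min_products]
    return profile.sort_values("mean_additives", ascending=False)


# Made-up products for three placeholder brands.
products = pd.DataFrame(
    {
        "brand": ["A", "A", "B", "B", "C"],
        "additives_n": [6, 4, 1, 1, 3],
        "nutri_score_points": [18, 14, 2, 4, 10],
    }
)
profile = brand_profile(products)
print(profile)
```

Sorting by mean additive count puts the heaviest users first, which makes it easy to check whether they also carry the highest mean scores.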

🎯 Research Question 3: How do food additives and healthiness vary across different brands?

Conclusion: Additive usage varies by brand type.

The most frequently used additives, such as citric acid, riboflavin, and nicotinic acid, are widely recognized as safe and even beneficial.
Because Nutri-Score doesn’t factor additive types, two products can share the same grade yet have very different additive compositions.

🛠️ Technologies Used

Python, Jupyter, Pandas, NumPy, scikit-learn, Matplotlib/Seaborn, NLP (TF-IDF), PCA/UMAP, K-Means, Logistic Regression, Random Forest
