🍔 End-to-end ML pipeline analyzing 173K+ U.S. packaged foods (163 features): data cleaning → EDA → PCA/UMAP clustering → Nutri-Score classification, with model interpretability and insights.

Chengyuli33/nutriScore-analyzer

🍔 NutriScore Analyzer: Unwrapping the Secrets of U.S. Packaged Food Additives

👉 Click Here to See HTML Report | ipynb notebook

🎯 Project Overview

This project analyzes the nutrition facts of U.S. packaged foods and tests how well they are reflected in Nutri-Score labeling. The ML/NLP pipeline covers data cleaning, EDA, feature engineering, PCA/UMAP, clustering, regression, and classification. The goal is to uncover the relationship between food additives and Nutri-Score.

💡 Why it’s interesting

As consumers become increasingly health-conscious, it is worth testing whether Nutri-Score grades capture what matters (sugar, fat, salt, fiber) and whether they miss other things that also matter (additive load, brand practices). The findings can inform healthier choices and show what the label does not tell you.

[Image: Nutri-Score]

🔍 Key Research Questions

  • Does Nutri-Score reflect the presence of food additives, or are there potential inconsistencies?
  • Do certain brands consistently use more additives, and do they tend to receive lower Nutri-Scores?
  • How do food additives and healthiness vary across different brands of packaged foods in the U.S.?

🤔 Tasks

  • Goal: Predict packaged food healthiness and analyze drivers (ingredients, additives, brand info).
    • Multiclass classification → Nutri-Score: grade A to E.
    • Binary classification → Healthy vs. Unhealthy.
    • Exploratory & Unsupervised → PCA/UMAP for visualization, KMeans for cluster profiling.

🛎️ Key Findings

  • Big picture: U.S. packaged foods are dominated by fortified nutrients, sugars, stabilizers, and oils.
  • For Consumers: Nutri-Score alone is not sufficient for additive-conscious choices. "Clean label" ≠ low additive use.
  • For Regulators: Consider an additive/processing penalty in next-gen front-of-pack labeling.
  • Label gap: Nutri-Score emphasizes nutrients but under-captures the complexity of food processing and additives.

📊 Dataset

  • Source: Open Food Facts (Kaggle)
  • Size: 173k+ product records
  • Features: 163 features

🛠️ Pipeline Methodology:

Cleaning & Standardization

  • Numerical: unit normalization (per 100g), outlier caps, missing-value handling
  • Data deduplication: drop duplicates by barcode/product key
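
The cleaning steps above could look roughly like the following pandas sketch. It is a minimal illustration, not the project's actual code: the column names (`code`, `sugars_100g`), the 1st/99th percentile caps, and median imputation are all assumptions, and unit normalization is left out.

```python
import numpy as np
import pandas as pd

def clean_nutrition(df: pd.DataFrame, key: str, numeric_cols: list[str]) -> pd.DataFrame:
    """Deduplicate by a product key, cap outliers, and impute missing numeric values."""
    # Keep the first record for each product key (e.g. a barcode column).
    out = df.drop_duplicates(subset=key, keep="first").copy()
    for col in numeric_cols:
        # Cap extreme values at the 1st and 99th percentiles (an assumed choice).
        lo, hi = out[col].quantile([0.01, 0.99])
        out[col] = out[col].clip(lower=lo, upper=hi)
        # Fill remaining gaps with the column median (also an assumption).
        out[col] = out[col].fillna(out[col].median())
    return out.reset_index(drop=True)

# Tiny illustrative frame; column names and values are hypothetical.
raw = pd.DataFrame({
    "code": ["001", "001", "002", "003", "004"],
    "sugars_100g": [10.0, 10.0, np.nan, 5.0, 900.0],
})
clean = clean_nutrition(raw, key="code", numeric_cols=["sugars_100g"])
```

Percentile caps are used here instead of hard limits because they adapt to each column's scale; a fixed physical bound (for example, at most 100 g per 100 g) would be a reasonable alternative.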

NLP Text Preprocessing

  • Normalization: lowercase, punctuation handling, strip non-English/noisy chars.

  • Stopwords: prune English (NLTK) and multilingual stopword lists; remove domain/measurement noise (e.g., organic, mg, g, oz).

  • Synonyms map: create a dictionary for frequent variants (e.g., "ascorbic acid" ↔ "vitamin C"; brand aliases).

  • Tokenization: Lemmatization & Stemming ("starches" → "starch"; "sugary" → "sugar").

  • Standardize E-numbers: map these internationally used food additive codes to names (e.g., E150d → "Caramel IV").

  • TF-IDF vectorization: convert cleaned texts into numeric features. Informative tokens (e.g., unigrams: "aspartame", "sodium", "E150d"; bigrams: "palm oil", "citric acid") receive high TF-IDF weights, while generic terms (e.g., water, organic) are down-weighted.

  • Word cloud of the 50 most common ingredients in U.S. packaged foods:

    [Figures: common-ingredient, 50-common]

Dimensionality Reduction & Clustering

  • Dimensionality Reduction: Principal Component Analysis (PCA)

  • Unsupervised Clustering: K-means clustering and visualization using UMAP:

    • Elbow method to choose k=4 clusters

    • Cluster profiling by nutrition features

    • Visualize four clusters in 2D UMAP space

      [Figures: cluster_profile, kmeans-umap]
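
The PCA and elbow-method steps above can be sketched with scikit-learn as follows. The data is synthetic (four Gaussian blobs standing in for the nutrition feature matrix), and the number of components and the range of k are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the nutrition features: 4 blobs in 6 dimensions.
centers = rng.normal(scale=6.0, size=(4, 6))
X = np.vstack([c + rng.normal(size=(100, 6)) for c in centers])

# Standardize first so that no single nutrient dominates the components.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=3, random_state=0).fit_transform(X_scaled)

# Elbow method: record within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca)
    inertias[k] = km.inertia_

# Fit the final model at the k suggested by the elbow (k=4 in this README).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_pca)
```

Plotting `inertias` against k shows where the curve flattens. For the 2D view, the `umap-learn` package's `UMAP(n_components=2).fit_transform(X_pca)` would be one way to get coordinates to color by `labels`; it is left out here to avoid the extra dependency.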

Prediction Modeling

  • Multiclass: Nutri-Score from A to E.

  • Binary: Healthy vs. Unhealthy (A, B vs. C, D, E).

  • Logistic Regression

  • Random Forest

  • Evaluate with Accuracy, F1, ROC-AUC, confusion matrix

  • Sugar and saturated fat are the top predictors: the models indicate that products low in sugar, saturated fat, and salt are far more likely to be classified as healthy:

    [Figures: feature-importance, binary, multiclass]
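
A minimal sketch of the binary task described above, using the two listed models and metrics. Everything here runs on synthetic data: the three features merely stand in for sugar, saturated fat, and salt, and the grade thresholds are invented for illustration, not taken from the Nutri-Score algorithm.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def to_binary(grade: str) -> int:
    """Map Nutri-Score grades to the binary target: A/B -> 1 (healthy), C/D/E -> 0."""
    return 1 if grade.lower() in {"a", "b"} else 0

rng = np.random.default_rng(42)
# Synthetic features standing in for sugar, saturated fat, and salt.
X = rng.uniform(0, 30, size=(600, 3))
# Synthetic grades: lower nutrient totals get better letters (made-up cutoffs).
burden = X.sum(axis=1) + rng.normal(scale=3.0, size=600)
grades = np.array(list("abcde"))[np.digitize(burden, [25, 40, 50, 60])]
y = np.array([to_binary(g) for g in grades])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the healthy class
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
        "confusion": confusion_matrix(y_test, pred),
    }
```

After fitting, `models["rf"].feature_importances_` gives the impurity-based importances that a feature-importance chart would typically be built from.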

✍️ Results Summary

Model Fit:

  • OLS R² ≈ 0.83 (Adj R² = 0.83)

Primary Drivers:

  • Sugar: +7.06 → higher score (worse healthiness)
  • Total Fat: +5.05
  • Saturated Fat: +2.90
  • Fiber: –1.66 (improves healthiness)
  • Carbohydrates: –0.19

Nonlinear & Interaction Effects:

  • Diminishing returns: sugars² (–2.995) & fat² (–2.…)
  • Salt synergy: sugar×salt (+0.38), fat×salt (+1.26)
  • Sugar–fat offset: sugar×fat (–1.74)

Additive Impact:

  • Log(additives) +0.35 (p < 0.001), with tapering returns
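
To make the model structure behind these coefficients concrete, here is a small NumPy sketch of an OLS fit with main effects, squared terms, pairwise interactions, and a log additive count. The data and the coefficients in `true_beta` are synthetic and arbitrary; they are not the values reported above. Using `log1p` rather than a plain log is an assumption made so that products with zero additives are defined.

```python
import numpy as np

def design_matrix(sugar, fat, salt, n_additives):
    """Columns: intercept, main effects, squares, interactions, log additive count."""
    return np.column_stack([
        np.ones_like(sugar),
        sugar, fat, salt,
        sugar**2, fat**2,                       # diminishing-returns terms
        sugar * salt, fat * salt, sugar * fat,  # interaction terms
        np.log1p(n_additives),                  # log(1 + count), assumed form
    ])

rng = np.random.default_rng(1)
n = 2000
sugar, fat, salt = rng.normal(size=(3, n))  # standardized nutrients (assumed)
n_additives = rng.poisson(3, size=n).astype(float)

# Arbitrary made-up coefficients, used only to generate synthetic targets.
true_beta = np.array([1.0, 2.0, 1.5, 0.5, -0.8, -0.6, 0.3, 0.4, -0.5, 0.2])
X = design_matrix(sugar, fat, salt, n_additives)
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# Ordinary least squares via a least-squares solve.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
r2 = 1 - resid.var() / y.var()
```

A negative coefficient on a squared term combined with a positive main effect is what produces the "diminishing returns" shape: the score still rises with the nutrient, but more slowly at high values.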

Palm-Oil Flags:

  • “May be from palm oil” (–0.06)
  • “Definitely from palm oil” not significant

Brand Effects:

  • Food Club (+1.50): higher score
  • Great Value (+1.00): higher score
  • Shoprite (–0.41): marginally lower score

🎯 Research Question 1: Does Nutri-Score reflect the presence of food additives?

Conclusion: Not reliably.

  • Some products with a high number of additives still receive a good Nutri-Score.
  • This suggests Nutri-Score may overlook additive information when assessing overall healthfulness.

🎯 Research Question 2: Which brands use more additives and have lower Nutri-Scores?

Conclusion: Often, at the brand level.

  • Additives alone don’t determine the score, but heavy additive usage frequently coincides with lower overall nutrition quality when aggregated at the brand level.

    [Figures: highest-additives.png, worst-score.png]
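
The brand-level aggregation behind this comparison could be sketched with a pandas `groupby`, as below. The brands, column names (`additives_n`, `nutriscore_score`), and values are hypothetical placeholders, not figures from the dataset.

```python
import pandas as pd

# Hypothetical product rows; higher nutriscore_score means a worse grade.
products = pd.DataFrame({
    "brand": ["Brand X", "Brand X", "Brand Y", "Brand Y", "Brand Z"],
    "additives_n": [6, 4, 1, 0, 3],
    "nutriscore_score": [18, 14, 2, -1, 9],
})

# One row per brand: product count, mean additive count, mean score.
brand_summary = (
    products.groupby("brand")
    .agg(
        n_products=("additives_n", "size"),
        mean_additives=("additives_n", "mean"),
        mean_score=("nutriscore_score", "mean"),
    )
    .sort_values("mean_additives", ascending=False)
)

# Spearman rank correlation, computed as Pearson correlation of the ranks.
rank_corr = brand_summary["mean_additives"].rank().corr(
    brand_summary["mean_score"].rank()
)
```

On real data it would be sensible to filter on `n_products` before ranking, since brands with only a handful of products give noisy averages.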

🎯 Research Question 3: How do food additives and healthiness vary across different brands?

Conclusion: Additive usage varies by brand type.

  • The most frequently used additives, such as citric acid, riboflavin, and nicotinic acid, are widely recognized as safe and even beneficial.
  • Because Nutri-Score doesn’t factor in additive types, two products can share the same grade yet have very different additive compositions.

[Figure: healthyfood.png]

🛠️ Technologies Used

Python, Jupyter, Pandas, NumPy, scikit-learn, Matplotlib/Seaborn, NLP (TF-IDF), PCA/UMAP, K-Means, Logistic Regression, Random Forest
