👉 Click Here to See HTML Report | ipynb notebook
This project analyzes the nutrition facts of U.S. packaged foods and tests how well they are reflected in Nutri-Score labeling. The ML/NLP pipeline covers data cleaning, EDA, feature engineering, PCA/UMAP, clustering, regression, and classification. The goal is to uncover the relationship between food additives and Nutri-Score.
As consumers become increasingly health-conscious, this project tests whether Nutri-Score grades capture what matters (sugar, fat, salt, fiber) and whether they miss what also matters (additive load, brand practices). The findings can inform healthier choices and show what the label does not tell you.
- Does Nutri-Score reflect the presence of food additives, or are there potential inconsistencies?
- Do certain brands consistently use more additives and tend to receive worse Nutri-Score grades?
- How do food additives and healthiness vary across different brands of packaged foods in the U.S.?
- Goal: Predict packaged food healthiness and analyze drivers (ingredients, additives, brand info).
- Multiclass classification → Nutri-Score: grade A to E.
- Binary classification → Healthy vs. Unhealthy.
- Exploratory & Unsupervised → PCA/UMAP for visualization, KMeans for cluster profiling.
- Big picture: U.S. packaged foods are dominated by fortified nutrients, sugars, stabilizers, and oils.
- For Consumers: Nutri-Score alone is not sufficient for additive-conscious choices. "Clean label" does not imply low additive use.
- For Regulators: Consider an additive/processing penalty in next-gen front-of-pack labeling.
- Label gap: Nutri-Score emphasizes nutrients but under-captures the complexity of food processing and additives.
- Source: Open Food Facts (Kaggle)
- Size: 173k+ product records
- Features: 163 features
- Numerical: unit normalization (per 100 g), outlier caps, missing-value handling
- Data deduplication: drop duplicates by barcode/product key
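The numerical cleaning steps above can be sketched as follows. This is a minimal, dependency-free illustration and not the project's actual code: the field names (`barcode`, `serving_size_g`, `sugars_per_serving_g`), the 0 to 100 g cap, and the zero fill for missing values are all assumptions chosen for the example. In pandas, `drop_duplicates` and `clip` would be the natural equivalents.

```python
# Illustrative sketch only: field names, caps, and the fill value are assumptions.

def per_100g(value, serving_size_g):
    """Rescale a per-serving amount to a per-100 g basis."""
    return value * 100.0 / serving_size_g

def cap(value, low, high):
    """Clip a value into [low, high] (outlier capping)."""
    return max(low, min(high, value))

def clean(records, fill_value=0.0):
    """Deduplicate by barcode, fill missing sugar, normalize to 100 g, and cap."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec["barcode"] in seen:  # keep only the first record per barcode
            continue
        seen.add(rec["barcode"])
        sugar = rec.get("sugars_per_serving_g")
        if sugar is None:  # placeholder missing-value rule
            sugar = fill_value
        sugar_100g = per_100g(sugar, rec["serving_size_g"])
        cleaned.append({
            "barcode": rec["barcode"],
            "sugars_100g": cap(sugar_100g, 0.0, 100.0),
        })
    return cleaned

records = [  # made-up rows
    {"barcode": "001", "serving_size_g": 30.0, "sugars_per_serving_g": 12.0},
    {"barcode": "001", "serving_size_g": 30.0, "sugars_per_serving_g": 12.0},  # duplicate
    {"barcode": "002", "serving_size_g": 50.0, "sugars_per_serving_g": None},  # missing
    {"barcode": "003", "serving_size_g": 10.0, "sugars_per_serving_g": 15.0},  # 150 g per 100 g
]
cleaned = clean(records)
for row in cleaned:
    print(row)
```

The cap at 100 g reflects a physical limit (a nutrient cannot exceed the product's own mass); a percentile-based cap would be an equally reasonable choice.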
- Normalization: lowercase, punctuation handling, strip non-English/noisy chars.
- Stopwords: remove English (NLTK) and multilingual stopwords, plus domain/measurement noise (e.g., organic, mg, g, oz).
- Synonyms map: create a dictionary for frequent variants (e.g., "ascorbic acid" ↔ "vitamin C"; brand aliases).
- Tokenization: lemmatization & stemming ("starches" → "starch"; "sugary" → "sugar").
- Standardize E-numbers: map these internationally used food-additive codes to names (e.g., E150d → "Caramel IV").
- TF-IDF vectorization: convert the cleaned text into numeric features. Informative tokens (e.g., unigrams "aspartame", "sodium", "E150d"; bigrams "palm oil", "citric acid") get a high TF-IDF weight, while generic terms (e.g., water, organic) are down-weighted.
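The text steps above can be sketched end to end in plain Python. The lookup tables here are tiny hypothetical stand-ins for the real stopword, synonym, and E-number lists, and lemmatization/stemming and bigrams are left out to keep the example short. The weighting uses term frequency times a smoothed inverse document frequency, which is one common TF-IDF variant and not necessarily the exact one used in the notebook; scikit-learn's `TfidfVectorizer` (with `ngram_range=(1, 2)` for bigrams) would be the usual choice in practice.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny lookup tables for illustration.
STOPWORDS = {"and", "organic", "mg", "g", "oz"}
SYNONYMS = {"ascorbic acid": "vitamin c"}
E_NUMBERS = {"e150d": "caramel iv"}

def normalize(text):
    """Lowercase, map synonyms and E-numbers, strip punctuation, drop stopwords."""
    text = text.lower()
    for variant, canonical in {**SYNONYMS, **E_NUMBERS}.items():
        text = re.sub(r"\b" + re.escape(variant) + r"\b", canonical, text)
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOPWORDS]

def tf_idf(docs):
    """Return one {token: weight} dict per document (tf * smoothed idf)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            tok: (cnt / len(doc)) * (math.log((1 + n) / (1 + df[tok])) + 1)
            for tok, cnt in tf.items()
        })
    return weights

raw = [  # made-up ingredient lists
    "Water, Sugar, Ascorbic Acid, E150d",
    "Water, Organic Palm Oil, Salt",
    "Water, Aspartame, Citric Acid",
]
docs = [normalize(r) for r in raw]
weights = tf_idf(docs)
print(docs[0])
print(weights[2])
```

Because "water" appears in every list, its idf term is the minimum possible, so a rarer token such as "aspartame" ends up with a larger weight in the same document.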
- Word cloud of the 50 most common ingredients in U.S. packaged foods:
- Dimensionality Reduction: Principal Component Analysis (PCA)
- Unsupervised Clustering: K-means clustering and visualization using UMAP:
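As a rough illustration of the unsupervised steps, the sketch below implements PCA via SVD and Lloyd's K-means algorithm in NumPy on synthetic data. Several things here are assumptions rather than the project's method: the data is made up, clustering is run on the PCA projection, and the centroids are seeded with a deterministic farthest-first rule so the example is reproducible. UMAP is omitted because it needs the separate `umap-learn` package. In practice scikit-learn's `PCA` and `KMeans` would replace both helper functions.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)  # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)  # explained variance ratio
    return Xc @ Vt[:n_components].T, explained[:n_components]

def init_farthest_first(X, k):
    """Deterministic seeding: start at X[0], then add the farthest point each time."""
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    return np.array(centroids)

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's algorithm; returns labels and centroids."""
    centroids = init_farthest_first(X, k)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # assign each point to its nearest centroid
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic stand-in: two well-separated groups in 5 dimensions.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 5))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 5))
X = np.vstack([group_a, group_b])

Z, explained = pca(X, n_components=2)
labels, centroids = kmeans(Z, k=2)
print(Z.shape, explained)
print(np.bincount(labels))
```

Cluster profiling then amounts to summarizing the original features (sugar, fat, additive counts, and so on) within each label.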
- Multiclass: Nutri-Score from A to E.
- Binary: Healthy vs. Unhealthy (A, B vs. C, D, E).
- Logistic Regression
- Random Forest
- Evaluate with Accuracy, F1, ROC-AUC, confusion matrix
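To make the binary target and the evaluation metrics concrete, here is a small dependency-free sketch. The grade lists are made up, and the helper names (`to_binary`, `binary_metrics`) are hypothetical. ROC-AUC is left out because it needs predicted probabilities, not hard labels. With scikit-learn, `accuracy_score`, `f1_score`, `roc_auc_score`, and `confusion_matrix` from `sklearn.metrics` cover the same ground.

```python
from collections import Counter

def to_binary(grade):
    """Map a Nutri-Score grade to the binary target: A/B healthy (1), C/D/E unhealthy (0)."""
    return 1 if grade.upper() in {"A", "B"} else 0

def binary_metrics(y_true, y_pred):
    """Confusion counts, accuracy, and F1 for the positive (healthy) class."""
    counts = Counter(zip(y_true, y_pred))
    tp, fp = counts[(1, 1)], counts[(0, 1)]
    fn, tn = counts[(1, 0)], counts[(0, 0)]
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn, "accuracy": accuracy, "f1": f1}

# Made-up true and predicted grades for eight products.
true_grades = ["A", "B", "C", "D", "E", "B", "C", "A"]
pred_grades = ["A", "C", "C", "D", "B", "B", "E", "A"]

y_true = [to_binary(g) for g in true_grades]
y_pred = [to_binary(g) for g in pred_grades]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Note that a multiclass error such as predicting E for a true C does not count as an error in the binary view, since both grades fall on the unhealthy side.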
- Sugars and saturated fat are the top predictors: the model shows that products low in sugar, saturated fat, and salt are far more likely to be classified as healthy.
Model Fit:
- OLS R² ≈ 0.83 (Adj R² = 0.83)
Primary Drivers:
- Sugar: +7.06 → higher score (worse healthiness)
- Total Fat: +5.05
- Saturated Fat: +2.90
- Fiber: –1.66 (improves healthiness)
- Carbohydrates: –0.19
Nonlinear & Interaction Effects:
- Diminishing returns: sugars² (–2.995) & fat² (–2.…)
- Salt synergy: sugar×salt (+0.38), fat×salt (+1.26)
- Sugar–fat offset: sugar×fat (–1.74)
Additive Impact:
- Log(additives) +0.35 (p < 0.001), with tapering returns
Palm-Oil Flags:
- “May-be from palm oil” (–0.06)
- “Definitely from palm oil” not significant
Brand Effects:
- Food Club (+1.50) and Great Value (+1.00): positive premiums (higher, i.e. worse, scores)
- Shoprite (–0.41): marginally lower
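The regression structure summarized above (linear terms, squared terms, pairwise interactions, and a log-transformed additive count) can be sketched with NumPy least squares. Everything numeric here is synthetic: the feature ranges, the `true_beta` values, and the noise level are made up for the demonstration and are not the project's coefficients. Using `log1p` of the additive count is also an assumption about how "Log(additives)" was computed.

```python
import numpy as np

def design_matrix(sugar, fat, salt, n_additives):
    """Columns: intercept, linear terms, squares, interactions, log additive count."""
    return np.column_stack([
        np.ones_like(sugar),
        sugar, fat, salt,
        sugar ** 2, fat ** 2,                    # diminishing-returns terms
        sugar * salt, fat * salt, sugar * fat,   # interaction terms
        np.log1p(n_additives),                   # assumed form of Log(additives)
    ])

# Synthetic features on a [0, 1] scale; additive counts from 0 to 14.
rng = np.random.default_rng(0)
n = 500
sugar = rng.uniform(0.0, 1.0, size=n)
fat = rng.uniform(0.0, 1.0, size=n)
salt = rng.uniform(0.0, 1.0, size=n)
n_additives = rng.integers(0, 15, size=n).astype(float)

X = design_matrix(sugar, fat, salt, n_additives)

# Made-up coefficients used only to generate the synthetic target.
true_beta = np.array([0.5, 2.0, 1.5, 1.0, -0.8, -0.6, 0.3, 0.4, -0.5, 0.2])
y = X @ true_beta + rng.normal(0.0, 0.05, size=n)

# Ordinary least squares fit and R^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat
r_squared = 1.0 - residuals.var() / y.var()
print(np.round(beta_hat, 2), round(float(r_squared), 3))
```

A negative coefficient on a squared term, as in the sketch, is what produces the "diminishing returns" reading: each additional unit of sugar or fat raises the score by less than the previous one.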
Conclusion: Not reliably.
- Some products with a high number of additives still receive a good Nutri-Score
- This suggests Nutri-Score may overlook additive information when assessing overall healthfulness.
Conclusion: Often as a brand-level marker.
- Additives alone don’t determine the score, but heavy additive usage frequently coincides with lower overall nutrition quality when aggregated at the brand level.
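A brand-level aggregation of this kind can be sketched as below. The rows, brand names, and the ordinal coding of grades (A = 1 through E = 5) are assumptions for illustration only; with pandas, a `groupby` on the brand column followed by `mean` would do the same job.

```python
from collections import defaultdict
from statistics import mean

# Assumed ordinal coding: a higher number means a worse grade.
GRADE_TO_NUM = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

def brand_profile(products):
    """Average additive count and numeric grade per brand."""
    by_brand = defaultdict(list)
    for p in products:
        by_brand[p["brand"]].append(p)
    return {
        brand: {
            "n_products": len(items),
            "mean_additives": mean(p["n_additives"] for p in items),
            "mean_grade": mean(GRADE_TO_NUM[p["grade"]] for p in items),
        }
        for brand, items in by_brand.items()
    }

products = [  # made-up rows with hypothetical brand names
    {"brand": "Brand X", "n_additives": 6, "grade": "D"},
    {"brand": "Brand X", "n_additives": 4, "grade": "E"},
    {"brand": "Brand Y", "n_additives": 1, "grade": "B"},
    {"brand": "Brand Y", "n_additives": 0, "grade": "A"},
    {"brand": "Brand Y", "n_additives": 2, "grade": "B"},
]
profile = brand_profile(products)
for brand, stats in profile.items():
    print(brand, stats)
```

Comparing `mean_additives` against `mean_grade` across brands is what exposes the brand-level pattern, even when individual products break it.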
Conclusion: Additive usage varies by brand type.
- The most frequently used additives, such as citric acid, riboflavin, and nicotinic acid, are widely recognized as safe and even beneficial.
- Because Nutri-Score doesn’t factor in additive types, two products can share the same grade yet have very different additive compositions.
Python, Jupyter, Pandas, NumPy, scikit-learn, Matplotlib/Seaborn, NLP (TF-IDF), PCA/UMAP, K-Means, Logistic Regression, Random Forest










