---
layout: review
title: "Delving into Deep Imbalanced Regression"
tags: imbalanced-learning regression
author: "Clara Cousteix"
cite:
    authors: "Yuzhe Yang, Kaiwen Zha, Ying-Cong Chen, Hao Wang, Dina Katabi"
    title: "Delving into Deep Imbalanced Regression"
    venue: "38th International Conference on Machine Learning"
    pdf: "https://proceedings.mlr.press/v139/yang21m/yang21m.pdf"
---

# Highlights

* Defines **Deep Imbalanced Regression (DIR)** as learning from imbalanced data with continuous targets.
* Introduces two distribution smoothing methods: **Label Distribution Smoothing (LDS)** and **Feature Distribution Smoothing (FDS)**.
* Benchmarks performance on several real-world datasets.
* [Code available on GitHub](https://github.com/YyzHarry/imbalanced-regression).

# Related Work

## Imbalanced Classification

Many prior works have focused on imbalanced **classification** problems. Two main approaches emerged:

* **Data-based methods**: under-sampling majority classes or over-sampling minority classes (e.g., SMOTE).
* **Model-based methods**: modifying loss functions via re-weighting to compensate for imbalance.

Recent studies also show that semi-supervised and self-supervised learning can improve performance under imbalance. While these methods can partially transfer to regression, they have **intrinsic limitations** due to the continuous nature of regression targets.

## Imbalanced Regression

Imbalanced **regression** has been less explored. Most existing studies adapt SMOTE to regression by generating synthetic samples via input-target interpolation or Gaussian noise augmentation. However, these methods, inspired by classification strategies, fail to leverage the **continuity** in the label space. Moreover, linear interpolation may not produce meaningful synthetic samples.

The methods proposed in this paper differ fundamentally and can **complement** these prior approaches.

# Methods: Label and Feature Distribution Smoothing (LDS & FDS)

## Problem Setting

Let $$\{ (x_i, y_i) \}_{i=1}^N$$ be the training set, where $$x_i \in \mathbb{R}^d$$ is the input and $$y_i \in \mathbb{R}$$ is the continuous target. The label space $$\mathcal{Y}$$ is divided into $$B$$ bins with equal intervals: $$[y_0, y_1), [y_1, y_2), \dots, [y_{B-1}, y_B)$$. The bins reflect the minimum resolution we care about when grouping data in a regression task (e.g., in age estimation, $$\delta y = 1$$ year). Finally, $$z = f(x; \theta)$$ denotes the feature for $$x$$, where $$f(\cdot; \theta)$$ is parameterized by a deep neural network with parameters $$\theta$$. The final prediction $$\hat{y}$$ is given by a regression function $$g(z)$$.

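To make the binning concrete, here is a minimal sketch (Python/NumPy; the function name, value range and resolution are illustrative assumptions, not taken from the paper) of mapping continuous targets to bin indices:

```python
import numpy as np

def assign_bins(y, y_min=0.0, y_max=100.0, delta=1.0):
    """Map continuous targets y to the half-open bins [y_0, y_1), ..., [y_{B-1}, y_B)."""
    edges = np.arange(y_min, y_max + delta, delta)   # bin edges y_0, ..., y_B
    idx = np.digitize(y, edges[1:-1])                # bin index of each target
    return idx, edges

# Example: age estimation with a 1-year resolution
ages = np.array([0.4, 29.7, 30.2, 85.0])
bin_idx, edges = assign_bins(ages)                   # -> bin_idx = [0, 29, 30, 85]
```
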
## Label Distribution Smoothing (LDS)

#### Motivation Example

To motivate LDS, the authors compare **categorical** labels (CIFAR-100) with **continuous** labels (IMDB-WIKI). Both datasets are subsampled to simulate imbalance, as illustrated at the top of Figure 1.

<div style="text-align:center">
<img src="/collections/images/Deep_Imbalanced_Regression/01_motivation_example.jpg" width=600></div>
<p style="text-align: center;font-style:italic">Figure 1. Comparison of the test error distribution (bottom) using the same training label distribution (top) on two different datasets: (a) CIFAR-100, a classification task with a categorical label space; (b) IMDB-WIKI, a regression task with a continuous label space.</p>

A ResNet-50 trained on CIFAR-100 yields a test error distribution strongly (negatively) correlated with the label density (high negative Pearson correlation). For IMDB-WIKI, the test error distribution is smoother and less correlated with the label density (Pearson correlation = -0.47).

This example shows that, because the labels are continuous, the network can learn from neighboring labels and perform well on interpolation. The imbalance seen by the network therefore differs from the empirical class imbalance, and compensating for data imbalance based on the empirical label density is inaccurate for a continuous label space.

#### LDS Formulation

Label Distribution Smoothing (LDS) applies **kernel density estimation** to smooth the empirical label distribution:

<div style="text-align:center">
<img src="/collections/images/Deep_Imbalanced_Regression/02_LDS.jpg" width=800></div>
<p style="text-align: center;font-style:italic">Figure 2. Label distribution smoothing (LDS) convolves a symmetric kernel with the empirical label density to estimate the effective label density distribution that accounts for the continuity of labels.</p>

A symmetric kernel is any kernel that satisfies $$k(y, y') = k(y', y)$$ and $$\nabla_y k(y, y') + \nabla_{y'} k(y', y) = 0, \quad \forall y, y' \in \mathcal{Y}$$ (e.g., a Gaussian or Laplacian kernel). The symmetric kernel characterizes the similarity between a target value $$y'$$ and any $$y$$ w.r.t. their distance in the target space. LDS then computes the effective label density distribution as:

$$
\tilde{p}(y') = \int_{\mathcal{Y}} k(y, y') p(y) \, dy
$$

where $$p(y)$$ is the number of appearances of label $$y$$ in the training data, and $$\tilde{p}(y')$$ is the effective density of label $$y'$$.

The smoothed distribution correlates better with the error distribution (Pearson correlation = -0.83). Standard imbalance mitigation methods (e.g., loss re-weighting) can then be applied using $$\tilde{p}(y')$$.

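As a minimal sketch (Python with NumPy/SciPy; the Gaussian kernel width and function names are illustrative assumptions, not the authors' exact implementation), the effective label density and the corresponding loss re-weighting factors could be computed as follows:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def lds_weights(bin_idx, num_bins, sigma=2.0):
    """Per-sample re-weighting factors derived from the LDS-smoothed label density.

    bin_idx: integer bin index of each training target (see the binning sketch above).
    sigma:   standard deviation of the symmetric Gaussian kernel, in bin units.
    """
    emp_density = np.bincount(bin_idx, minlength=num_bins).astype(float)  # p(y): samples per bin
    eff_density = gaussian_filter1d(emp_density, sigma=sigma)             # ~p(y'): kernel-smoothed density
    weights = 1.0 / np.clip(eff_density[bin_idx], 1e-8, None)             # inverse effective frequency
    return weights * len(weights) / weights.sum()                         # normalize to mean 1
```

Each sample's loss (e.g., the L1 error) is then multiplied by its weight, which is how LDS is combined with the vanilla model, Focal-R, and the re-weighting baselines below.
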
## Feature Distribution Smoothing (FDS)

FDS is motivated by the intuition that continuity in the target space should create a corresponding continuity in the feature space: if the model works properly and the data is balanced, the feature statistics corresponding to nearby targets are expected to be close to each other.

#### Motivation Example

The authors trained a plain model on the IMDB-WIKI images to infer a person's age from visual appearance. They focused on the feature space, i.e., **z**, grouped the features by target age bin, and computed the feature statistics (mean and variance) of each bin, denoted $$\{\mu_b, \sigma_b\}$$. To visualize the similarity between feature statistics, they select an anchor bin $$b_0$$ and calculate the **cosine similarity** of the feature statistics between $$b_0$$ and all other bins.

<div style="text-align:center">
<img src="/collections/images/Deep_Imbalanced_Regression/03_FDS_motivation.jpg" width=800></div>
<p style="text-align: center;font-style:italic">Figure 3. Feature statistics similarity for anchor age 30. Top: cosine similarity of the feature mean at a particular age w.r.t. its value at the anchor age. Bottom: cosine similarity of the feature variance at a particular age w.r.t. its value at the anchor age. The color of the background refers to the data density in a particular target range.</p>

The figure shows that the feature statistics of the bins around the anchor $$b_0 = 30$$ are highly similar to those at $$b_0$$, confirming the intuition that, with enough data and continuous targets, the feature statistics of nearby bins are similar. Interestingly, the figure also shows a high similarity between ages 0 to 6, which have very few samples, and $$b_0 = 30$$. This unjustified similarity is due to data imbalance: since there are not enough images for ages 0 to 6, this range inherits its priors from the range with the largest amount of data, which is the range around age 30.

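For illustration, the cosine-similarity analysis behind Figure 3 can be sketched as follows (NumPy; it assumes the per-bin feature means or variances have already been stacked into a matrix, which is an assumption of this sketch):

```python
import numpy as np

def similarity_to_anchor(stats, anchor_bin):
    """Cosine similarity of each bin's statistic vector (mean or variance) w.r.t. an anchor bin.

    stats: array of shape (B, D), one feature-statistic vector per target bin.
    """
    anchor = stats[anchor_bin]
    num = stats @ anchor
    den = np.linalg.norm(stats, axis=1) * np.linalg.norm(anchor) + 1e-12
    return num / den

# e.g., similarity of the per-bin feature means w.r.t. the anchor age 30:
# sim_mean = similarity_to_anchor(bin_means, anchor_bin=30)
```
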
#### FDS Formulation

FDS transfers feature statistics between nearby target bins, calibrating the potentially biased estimates of the feature distribution, especially for under-represented target values. It proceeds as follows:

* Compute the feature mean and covariance per bin:

$$\mu_b = \frac{1}{N_b} \sum_{i=1}^{N_b} z_i$$

$$\Sigma_b = \frac{1}{N_b - 1} \sum_{i=1}^{N_b} (z_i - \mu_b)(z_i - \mu_b)^\top$$

* Smooth these statistics across bins using a symmetric kernel $$k(y_b, y_{b'})$$:

$$\tilde{\mu}_b = \sum_{b' \in \mathcal{B}} k(y_b, y_{b'}) \, \mu_{b'}$$

$$\tilde{\Sigma}_b = \sum_{b' \in \mathcal{B}} k(y_b, y_{b'}) \, \Sigma_{b'}$$

* Re-calibrate the features:

$$z = \tilde{\Sigma}_b^{\frac{1}{2}} \, \Sigma_b^{-\frac{1}{2}} (z - \mu_b) + \tilde{\mu}_b$$

The FDS algorithm is integrated into the network through a feature calibration layer placed after the final feature map. During training, it employs a *momentum update* (exponential moving average) of the running statistics $$\{\mu_b, \Sigma_b\}$$ across each epoch. Correspondingly, the smoothed statistics $$\{\tilde{\mu}_b, \tilde{\Sigma}_b\}$$ are updated across epochs but kept fixed within each training epoch.

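A minimal PyTorch-style sketch of this procedure is given below. For simplicity it assumes diagonal covariances (variance vectors) instead of full covariance matrices, and the class name, kernel matrix, and momentum value are illustrative assumptions rather than the authors' exact implementation:

```python
import torch

class FDSCalibration:
    """Sketch of FDS: smooth per-bin feature statistics and re-calibrate features."""

    def __init__(self, num_bins, feat_dim, kernel, momentum=0.9):
        self.kernel = kernel                          # (B, B) symmetric kernel matrix k(y_b, y_b')
        self.momentum = momentum
        self.mu = torch.zeros(num_bins, feat_dim)     # running per-bin feature means
        self.var = torch.ones(num_bins, feat_dim)     # running per-bin feature variances (diagonal Sigma_b)

    @torch.no_grad()
    def update_running_stats(self, z, bin_idx):
        """Momentum (EMA) update of the running statistics, performed across epochs."""
        for b in bin_idx.unique():
            zb = z[bin_idx == b]
            self.mu[b] = self.momentum * self.mu[b] + (1 - self.momentum) * zb.mean(0)
            if zb.shape[0] > 1:
                self.var[b] = self.momentum * self.var[b] + (1 - self.momentum) * zb.var(0)

    def calibrate(self, z, bin_idx):
        """Whiten each feature with its bin's running stats, then re-color with the smoothed stats."""
        mu_s = self.kernel @ self.mu                  # smoothed means ~mu_b
        var_s = self.kernel @ self.var                # smoothed (diagonal) covariances ~Sigma_b
        std, std_s = self.var[bin_idx].sqrt(), var_s[bin_idx].sqrt()
        return std_s * (z - self.mu[bin_idx]) / (std + 1e-8) + mu_s[bin_idx]
```

The calibrated features are then fed to the regressor $$g(\cdot)$$; as discussed in the analysis section below, the running and smoothed statistics converge during training, so the calibration layer can be dropped at inference.
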
<div style="text-align:center">
<img src="/collections/images/Deep_Imbalanced_Regression/04_FDS_principle.jpg" width=800></div>
<p style="text-align: center;font-style:italic">Figure 4. Feature distribution smoothing (FDS) introduces a feature calibration layer that uses kernel smoothing to smooth the distributions of the feature mean and covariance over the target space.</p>

Note that FDS can be integrated with any neural network model, as well as with any past work on improving label imbalance.

# Results

## Benchmark DIR

#### Datasets

**Age Prediction**. IMDB-WIKI-DIR & AgeDB-DIR: imbalanced training sets with manually constructed, balanced validation and test sets.

**Health Condition Score**. SHHS-DIR, based on the SHHS dataset, contains night-time EEG and ECG signals as inputs, with a general health score as the output.

#### Baselines

* *Vanilla Model*: a model that does not include any technique for dealing with imbalanced data. To combine the vanilla model with LDS, the authors re-weight the loss function by multiplying it by the inverse of the LDS-estimated density of each target bin.

* *Synthetic Samples*: SMOTER and SMOGN are used as baselines. SMOTER first defines frequent and rare regions using the original label density, and creates synthetic samples for rare regions by linearly interpolating both inputs and targets. SMOGN further adds Gaussian noise to SMOTER. Note that LDS can be used directly for a better estimation of the label density when dividing the target space.

* *Error Aware Loss*: Inspired by the focal loss, the authors introduce the **Focal-R** loss $$\frac{1}{n} \sum_{i=1}^{n} \sigma(\lvert \beta e_i \rvert)^{\gamma} e_i$$, where $$e_i$$ is the L1 error of the i-th sample, $$\sigma(\cdot)$$ is the sigmoid function, and $$\beta, \gamma$$ are hyper-parameters. To combine Focal-R with LDS, the loss is multiplied by the inverse frequency of the estimated label density.

* *Two-stage training*: The authors propose a regression version called regressor re-training (RRT): in the first stage, the encoder is trained normally; in the second stage, the encoder is frozen and the regressor g(·) is re-trained with inverse re-weighting. When adding LDS, the re-weighting in the second stage is based on the label density estimated through LDS.

* *Cost-sensitive re-weighting*: Since the target space is divided into finite bins, classic re-weighting methods can be directly plugged in. The authors adopt two re-weighting schemes based on the label distribution: inverse-frequency weighting (**INV**) and its square-root variant (**SQINV**). When combined with LDS, the LDS-estimated target density is used instead of the original label density (a short sketch combining Focal-R and these weighting schemes follows this list).

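As a short sketch of the last two points (PyTorch; hyper-parameter values and function names are illustrative assumptions, not the authors' exact implementation), Focal-R scales each sample's L1 error with a sigmoid factor, and the INV/SQINV weights are simply the inverse (or inverse square-root) of the per-bin label density, empirical or LDS-smoothed:

```python
import torch

def density_weights(density, bin_idx, scheme="inv"):
    """INV or SQINV weights from a per-bin label density (empirical or LDS-estimated)."""
    d = density[bin_idx].clamp(min=1e-8)
    w = 1.0 / d if scheme == "inv" else 1.0 / d.sqrt()    # INV vs. SQINV
    return w * len(w) / w.sum()                           # normalize to mean 1

def focal_r_loss(pred, target, beta=0.2, gamma=1.0, weights=None):
    """Focal-R loss: sigmoid-scaled L1 error, optionally combined with density weights."""
    e = (pred - target).abs()                              # per-sample L1 error e_i
    loss = torch.sigmoid(beta * e) ** gamma * e            # sigma(|beta * e_i|)^gamma * e_i
    if weights is not None:
        loss = loss * weights                              # e.g., inverse LDS-estimated frequency
    return loss.mean()
```
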
#### Benchmarks

<div style="text-align:center">
<img src="/collections/images/Deep_Imbalanced_Regression/05_results_DB_age.jpg" width=800></div>

SMOTER and SMOGN can actually degrade performance compared to the vanilla model. Moreover, within each group, adding either LDS, FDS, or both leads to performance gains, and LDS + FDS often achieves the best results. Finally, compared to the vanilla model, using LDS and FDS maintains or slightly improves the performance overall and on the many-shot regions, while substantially boosting the performance on the medium-shot and few-shot regions.

<div style="text-align:center">
<img src="/collections/images/Deep_Imbalanced_Regression/06_results_DB_health_score.jpg" width=800></div>

## Further Analysis

#### Interpolation & Extrapolation

To simulate a scenario where certain target values have no samples, the authors curated the training set of the age dataset, as shown in the figure below.

<div style="text-align:center">
<img src="/collections/images/Deep_Imbalanced_Regression/07_extrapolation_interpolation.jpg" width=800></div>
<p style="text-align: center;font-style:italic">Figure 5. Absolute MAE gains of LDS + FDS over the vanilla model on a curated subset of IMDB-WIKI-DIR in which certain target values have no training data. Notable performance gains are obtained w.r.t. all regions, especially for extrapolation and interpolation.</p>

#### Understanding FDS

<div style="text-align:center">
<img src="/collections/images/Deep_Imbalanced_Regression/08_understanding_FDS.jpg" width=800></div>
<p style="text-align: center;font-style:italic">Figure 6. <b>(a)</b> Feature statistics similarity for anchor age 0, without FDS. <b>(b)</b> Feature statistics similarity for anchor age 0, with FDS. <b>(c)</b> L1 distance between the running statistics {μb, Σb} and the smoothed statistics {˜μb, ˜Σb} during training.</p>

As the figure indicates, since age 0 lies in the few-shot region, its feature statistics can have a large bias: age 0 shares a large similarity with the 40-80 region in Figure 6(a). In contrast, when FDS is added, the statistics are better calibrated, resulting in a high similarity only in its neighborhood and a gradually decreasing similarity score as the target value becomes larger.

The L1 distance between the running statistics $$\{\mu_b, \Sigma_b\}$$ and the smoothed statistics $$\{\tilde{\mu}_b, \tilde{\Sigma}_b\}$$ during training is plotted in Figure 6(c). Interestingly, the average L1 distance gradually diminishes as training evolves, indicating that the model learns to generate features that are accurate even without smoothing, so that the smoothing module can be removed at inference.

# Conclusion

The paper introduces the DIR task, which learns from naturally imbalanced data with continuous targets and generalizes to the entire target range. It proposes two simple and effective algorithms for DIR, LDS and FDS, that exploit the similarity between nearby targets in both the label and feature spaces.