BACS - Tackling Background Ambiguity in Continual Semantic Segmentation
Semantic Segmentation, the task of assigning a class label to every pixel in an image, is fundamental for detailed scene understanding, especially in applications like autonomous driving and robotics. However, creating these pixel-perfect annotations is laborious. Furthermore, real-world systems often need to learn new object categories incrementally as their operational environment expands, without forgetting previously learned ones. This challenge is known as Continual Semantic Segmentation (CSS).
In this research, we address a critical but often overlooked issue in CSS: the ambiguity of the ‘background’ class. We propose a novel method, BACS (Background Aware Continual Semantic Segmentation), to explicitly handle this ambiguity and improve the performance of continually learning segmentation models (ElAraby et al., 2024).
What is Continual Semantic Segmentation (CSS)?
Imagine an autonomous car initially trained to recognize ‘roads’, ‘cars’, and ‘pedestrians’. Later, we want it to also recognize ‘bicycles’ and ‘traffic lights’ using new data, without having to retrain from scratch on all data and without forgetting how to identify the original classes. This is the goal of CSS.
A major hurdle in any continual learning setup is catastrophic forgetting, where the model’s performance on old tasks degrades significantly when learning new ones. CSS adds another layer of complexity due to how annotations are often handled.
CSS Scenarios
As highlighted by early work in the field (e.g., Michieli & Zanuttigh, ICCV-W 2019; Cermelli et al., CVPR 2020), CSS differs significantly from continual learning in image classification. Because semantic segmentation involves classifying every pixel, multiple object classes (both old and new) can co-exist within the same image. This leads to different ways the learning process can be structured, often categorized into these scenarios:
- Sequential: In a given learning task T focusing on current classes C, the model only sees images containing at least one pixel labeled with a class from C; images containing classes yet to be learned (future classes) are excluded. Importantly, in the ideal sequential setting, all visible pixels belonging to either the current classes C or previously learned classes are correctly labeled, and the remaining areas are marked as background (label 0).
- Disjoint: This scenario is common due to annotation efficiency. Like Sequential, the model focuses on learning current classes C. However, if an object belonging to a previously learned class appears in an image during training for task T, its pixels are simply labeled as ‘background’ (label 0). Only pixels belonging to the current classes C are specifically labeled. During testing, however, the model is expected to correctly identify all classes learned so far (old and current).
- Overlap: This is similar to the Disjoint scenario, where old classes are labeled as background during training for new classes. The key difference is that the training data for task T can include images that contain future classes (classes not yet formally introduced), provided at least one current class C is also present in that image. Like Disjoint, testing requires identifying all learned classes.
The Disjoint and Overlap scenarios, while practical for reducing annotation effort, introduce significant challenges because the meaning of the ‘background’ label changes over time, encompassing previously learned classes. This is the core issue BACS aims to solve.
The “Background Shift” Challenge in CSS
In CSS, to save annotation effort, datasets for new tasks often only label the new classes, marking everything else, including objects from old classes, as ‘background’. This creates two problems:
- Forward Background Shift: Pixels labeled as ‘background’ in early training stages might actually belong to classes that will be introduced later.
- Backward Background Shift: Pixels labeled as ‘background’ in later stages might belong to classes learned previously but not labeled in the current data increment.
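As a concrete illustration of the backward shift, here is a minimal sketch (our own illustration, assuming a simple NumPy label-map pipeline, not code from the paper) of how step annotations are typically constructed: every pixel outside the current classes, including previously learned ones, collapses to label 0.

```python
import numpy as np

BACKGROUND = 0

def make_step_labels(full_labels, current_classes):
    """Simulate disjoint/overlap annotation for one CSS step: keep only the
    current classes and collapse everything else, including previously
    learned classes, to the background label."""
    step_labels = np.full_like(full_labels, BACKGROUND)
    for c in current_classes:
        step_labels[full_labels == c] = c
    return step_labels

# Fully annotated map: 0 = background, 1 = 'car' (old class), 2 = 'bicycle' (new).
full = np.array([[1, 1, 0],
                 [2, 2, 0]])
print(make_step_labels(full, {2}))
# [[0 0 0]   <- the 'car' pixels now read as background: the backward shift
#  [2 2 0]]
```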
This constantly shifting meaning of ‘background’ confuses the learning process, leading standard continual learning techniques to fail or perform poorly. Existing methods often struggle because they treat the background class like any other, failing to account for its ambiguous and dynamic nature.
Proposed Framework: BACS
BACS is designed to specifically mitigate the negative effects of background shift, particularly the backward shift, while also addressing catastrophic forgetting.
The core components of BACS are:
- Transformer Decoder: We replace the standard Atrous Spatial Pyramid Pooling (ASPP) module found in many segmentation networks (like DeepLabV3+) with a transformer decoder. This allows for richer, global feature representations which are beneficial for distinguishing subtle class differences, including those hidden in the background.
- Backward Background Shift Detector: This module identifies pixels currently labelled as ‘background’ but likely belonging to a previously learned class. It operates by comparing the feature embeddings of current background pixels to stored prototypes (e.g., class centroids in feature space) of old classes using the Mahalanobis distance. Pixels close to an old class prototype are flagged as potential backward-shifted pixels.
- Masked Knowledge Distillation (KD): To combat catastrophic forgetting, we use knowledge distillation, encouraging the current model’s predictions to stay consistent with the previous model’s outputs. Crucially, we mask the distillation loss for background pixels identified by the BACS detector. This prevents the model from incorrectly learning to classify old classes as background based on the ambiguous labels.
- Background-Aware Losses: We introduce specialized loss functions:
- Background Detector Focal Loss: Trains the Backward Background Shift Detector module effectively, focusing on hard-to-classify examples.
- Background Aware Cross-Entropy: A modified cross-entropy loss that incorporates the background detector’s output into how background pixels are handled, so that ambiguous pixels are not blindly pushed toward the background class.
- Foreground vs. Background Loss: Penalizes the model more heavily when it misclassifies true foreground pixels (especially those flagged by the detector), weighted by the detector’s confidence.
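Putting these pieces together, the overall training objective combines the segmentation, detector, and distillation terms. A generic form (the \(\lambda\) weights are hyperparameters, and the exact combination used is given in the paper) is:

\[\mathcal{L}_{total} = \mathcal{L}_{BACE} + \lambda_{kd}\,\mathcal{L}_{MaskedKD} + \lambda_{f}\,\mathcal{L}_{Focal} + \lambda_{fb}\,\mathcal{L}_{FG/BG}\]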
BACS Components in Detail
Backward Background Shift Detector
At each learning step \(t\), for a pixel \(i\) labeled as background (\(y_i^t = c_{bg}\)), we compute its feature embedding \(f_{\theta_t}(x_t)_i\). We then calculate the Mahalanobis distance between this embedding and the stored feature centroid \(\mu_c\) and covariance \(\Sigma_c\) of each previously learned class \(c \in C_{1:t-1}\):

\[D_M(f_{\theta_t}(x_t)_i, c) = \sqrt{(f_{\theta_t}(x_t)_i - \mu_c)^T \Sigma_c^{-1} (f_{\theta_t}(x_t)_i - \mu_c)}\]

If the minimum distance across all old classes falls below a threshold, the pixel is flagged as a likely backward-shifted background pixel. The detector outputs a probability \(P(\text{old class} \mid y_i^t = c_{bg})\) for each background pixel.
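A minimal PyTorch sketch of this test is below; the tensor shapes, prototype storage, and threshold \(\tau\) are our assumptions for illustration, not the paper’s exact implementation.

```python
import torch

def flag_backward_shift(embeddings, bg_mask, prototypes, inv_covs, tau):
    """Flag background-labeled pixels whose embedding lies close, in
    Mahalanobis distance, to the prototype of a previously learned class.

    embeddings: (N, D) per-pixel features f_theta(x)_i.
    bg_mask:    (N,) bool, True where the current label is background.
    prototypes: dict {class_id: (D,) mean mu_c} for old classes.
    inv_covs:   dict {class_id: (D, D) inverse covariance Sigma_c^{-1}}.
    tau:        distance threshold below which a pixel is flagged.
    """
    flags = torch.zeros_like(bg_mask)                     # bool, all False
    bg_feats = embeddings[bg_mask]                        # (B, D)
    min_dist = torch.full((bg_feats.shape[0],), float("inf"))
    for c, mu in prototypes.items():
        diff = bg_feats - mu                              # (B, D)
        # Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu).
        d2 = torch.einsum("bd,de,be->b", diff, inv_covs[c], diff)
        min_dist = torch.minimum(min_dist, d2.clamp_min(0).sqrt())
    flags[bg_mask] = min_dist < tau                       # likely an old class
    return flags
```

Here the prototypes \(\mu_c\) and covariances \(\Sigma_c\) would be estimated from the features of pixels labeled with class \(c\) while that class was being learned, and frozen afterwards.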
Masked Knowledge Distillation
The standard KD loss matches the logits \(z\) of the previous model \(\theta_{t-1}\) and the current model \(\theta_t\):

\[\mathcal{L}_{KD} = \sum_{i} D_{KL}\!\left( \sigma(z_{\theta_{t-1}}(x)_i / T) \,\big\|\, \sigma(z_{\theta_t}(x)_i / T) \right)\]

where \(\sigma\) is the softmax and \(T\) is the distillation temperature.
In BACS, we apply a per-pixel mask \(M_i\) derived from the background detector’s output: if pixel \(i\) is identified as potentially backward-shifted background, \(M_i = 0\); otherwise \(M_i = 1\).

\[\mathcal{L}_{MaskedKD} = \sum_{i} M_i \cdot D_{KL}\!\left( \sigma(z_{\theta_{t-1}}(x)_i / T) \,\big\|\, \sigma(z_{\theta_t}(x)_i / T) \right)\]

This prevents enforcing the old model’s incorrect “background” output on pixels that are actually instances of old classes.
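A hedged PyTorch sketch of this masked term follows; it assumes the current model’s logits have already been restricted to the old model’s classes, and the function name and reduction are ours.

```python
import torch
import torch.nn.functional as F

def masked_kd_loss(logits_old, logits_new, keep_mask, T=2.0):
    """Distillation loss KL(old || new) per pixel, zeroed where the detector
    flags a likely backward-shifted background pixel (there, M_i = 0).

    logits_old: (N, C) per-pixel logits from the frozen previous model.
    logits_new: (N, C) current-model logits restricted to the same C classes.
    keep_mask:  (N,) float mask M_i in {0, 1}.
    """
    log_p_new = F.log_softmax(logits_new / T, dim=1)      # student
    p_old = F.softmax(logits_old / T, dim=1)              # frozen teacher
    # Per-pixel KL(p_old || p_new), summed over the class dimension.
    kl = (p_old * (p_old.clamp_min(1e-8).log() - log_p_new)).sum(dim=1)
    # Average only over pixels that keep the distillation signal.
    return (keep_mask * kl).sum() / keep_mask.sum().clamp_min(1.0)
```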
Background Aware Losses
These losses (\(\mathcal{L}_{Focal}\), \(\mathcal{L}_{BACE}\), \(\mathcal{L}_{FG/BG}\)) jointly train the detector and the main segmentation network, making the model explicitly aware of potential background ambiguities and penalizing the associated errors appropriately (see the paper and presentation for the full formulations).
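As one concrete possibility, a standard binary focal loss of the kind typically used to train such a detector looks as follows (a generic formulation; the paper’s exact variant may differ):

```python
import torch

def detector_focal_loss(p_old, target, gamma=2.0, alpha=0.25):
    """Binary focal loss for the backward-shift detector.

    p_old:  (N,) predicted probability that a background-labeled pixel
            actually belongs to an old class.
    target: (N,) float in {0, 1}; 1 marks pixels known to be old classes.
    The (1 - p_t)^gamma factor down-weights easy pixels so training
    concentrates on hard, ambiguous background regions.
    """
    p_t = torch.where(target > 0.5, p_old, 1.0 - p_old)
    alpha_t = torch.where(target > 0.5,
                          torch.full_like(p_old, alpha),
                          torch.full_like(p_old, 1.0 - alpha))
    return (-alpha_t * (1.0 - p_t) ** gamma * p_t.clamp_min(1e-8).log()).mean()
```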
Experiments
We evaluated BACS on standard CSS benchmarks:
- Datasets: Pascal VOC 2012, ADE20K, Cityscapes
- Protocols:
- Disjoint: Training images for each step contain only the current and previously seen classes; old classes are labeled as background.
- Overlap: Training images may additionally contain future classes; everything outside the current classes is labeled as background.
- Backbone: DeepLabV3+ with ResNet-101 and the proposed Transformer Decoder.
- Baselines: We compared BACS against state-of-the-art CSS methods like MiB, PLOP, SDR, SSUL, and RECALL.
Results
BACS consistently outperformed previous state-of-the-art methods across different datasets and continual learning setups, including both the Disjoint and the often more challenging Overlap scenarios. The table below summarizes the key results for the Overlap protocol (a step configuration such as 15-1 denotes an initial step with 15 classes followed by one new class per subsequent step):
| Dataset | Protocol | Steps | Method | mIoU (%) |
|---|---|---|---|---|
| Pascal VOC | Overlap | 10-1 | SDR (Michieli & Zanuttigh, 2021) | 59.8 |
| Pascal VOC | Overlap | 10-1 | BACS (Ours) | 63.4 |
| Pascal VOC | Overlap | 15-1 | RECALL (Maracani et al., 2021) | 74.5 |
| Pascal VOC | Overlap | 15-1 | BACS (Ours) | 77.8 |
| ADE20K | Overlap | 50-50 | SSUL (Cha et al., 2021) | 31.5 |
| ADE20K | Overlap | 50-50 | BACS (Ours) | 33.1 |
| ADE20K | Overlap | 100-50 | SSUL (Cha et al., 2021) | 26.6 |
| ADE20K | Overlap | 100-50 | BACS (Ours) | 28.4 |
(Table 1: Summary of main results (mIoU) for the Overlap protocol, compared to previous SOTA. BACS shows significant improvements.)
Qualitative results also demonstrate that BACS produces more coherent segmentation maps, effectively identifying old classes even when they are labeled as background in the current training data.
Discussion
The experiments validate that explicitly addressing the background shift problem is crucial for robust Continual Semantic Segmentation. BACS provides an effective mechanism to detect and mitigate backward background shift using a dedicated detector module and tailored loss functions, integrated with masked knowledge distillation to prevent catastrophic forgetting. The use of a transformer decoder further enhances feature representation for this challenging task.
Our approach sets a new state-of-the-art on several CSS benchmarks, demonstrating its potential for real-world applications where semantic segmentation models must adapt to evolving environments.
Check the paper for more technical details, ablation studies, and analysis (ElAraby et al., 2024).
References
- ElAraby, M., Harakeh, A., & Paull, L. (2024). BACS: Background Aware Continual Semantic Segmentation. Proceedings of the Conference on Robots and Vision.
- Michieli, U., & Zanuttigh, P. (2021). Continual semantic segmentation via repulsion-attraction of sparse and disentangled latent representations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Maracani, A., Michieli, U., Toldo, M., & Zanuttigh, P. (2021). RECALL: Replay-based continual learning in semantic segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision, 7026–7035.
- Cha, S., Kim, B., Yoo, Y. J., & Moon, T. (2021). SSUL: Semantic segmentation with unknown label for exemplar-based class-incremental learning. Advances in Neural Information Processing Systems, 34, 10919–10930.
- Michieli, U., & Zanuttigh, P. (2019). Incremental learning techniques for semantic segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.
- Cermelli, F., Mancini, M., Rota Bulò, S., Ricci, E., & Caputo, B. (2020). Modeling the background for incremental learning in semantic segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.