VarifocalNet (VFNet): An IoU-Aware Dense Object Detector

💡 Overview

VFNet is one of the best anchor-free single-stage models, yet it remains under the radar. VFNet is simply an aggregation of several clever ideas, each of which improves the overall performance of the original FCOS model.

🧠 KEY TAKEAWAYS

  • VFNet = FCOS + ATSS + GIoU Loss + VariFocal Loss + IACS + Box Refinement

  • VFNet is one of the best anchor-free single-stage models

  • It selects high-quality bounding box (bbox) candidates

  • VFNet uses the IoU-aware Classification Score (IACS), one of its biggest performance boosters

  • It introduces the VariFocal Loss, which outperforms the original Focal Loss

  • VFNet offers strong performance even for partially trained models: only the backbone is trained while both the FPN and the heads are left untrained.

[Figure: VFNet architecture]

IACS helps produce high-quality bounding box (bbox) candidates. It combines IoU and the classification score, and it addresses the problem of low-quality bounding box candidates surviving the Non-Maximum Suppression (NMS) algorithm.

Misclassification has a negative impact on a model's performance because NMS relies on classification scores to determine which bounding boxes are the most relevant. NMS works by:

  • Selecting the bounding boxes with the highest classification scores,

  • and then, discarding duplicates using the IoU threshold

So when a bbox has better localization (high IoU with the ground-truth bbox) but a low classification score, it will be discarded if another bbox has a higher classification score but a lower IoU with the ground truth (i.e., less accurate localization).
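The greedy NMS procedure described above can be sketched in a few lines of plain Python (the helper names are mine, not from any library):

```python
# Minimal sketch of classification-score-driven NMS, assuming boxes
# are (x1, y1, x2, y2) tuples. It shows how a well-localized box can
# be suppressed by a higher-scoring but worse-localized one.

def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

Note that `nms` only ever looks at `scores` to decide the ranking: if the better-localized box happens to have the lower classification score, it is the one that gets dropped. This is exactly the failure mode IACS is designed to fix.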

Here is a comparison of the impact of the different classification techniques on the model's performance.

[Figure: ablation study comparing classification ranking techniques]

🔍 Background

  • Accurately ranking candidate detections is crucial for dense object detectors to achieve high performance

  • Prior work uses the classification score or a combination of classification and predicted localization scores (centerness) to rank candidates.

  • Neither of these two ranking signals is optimal

✍️ Novelty

  • VFNet proposes to learn an IoU-Aware Classification Score (IACS) as a joint representation of object presence confidence and localization accuracy using IoU

  • VFNet introduces VariFocal Loss

  • The VariFocal Loss down-weights only negative easy samples for addressing the class imbalance problem during training

  • The VariFocal Loss up-weights high-quality positive examples for generating prime detections

As a reminder, the Focal Loss down-weights both the positive and the negative easy samples to address the class imbalance problem.
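Per the paper, the VariFocal Loss treats positives and negatives asymmetrically: positives are weighted by their target score q (the IoU with the ground truth), while only negatives get the Focal-Loss-style down-weighting. A minimal per-prediction sketch (a scalar illustration of the formula; MMDetection's `VarifocalLoss` operates on tensors):

```python
import math

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """VariFocal Loss for a single prediction.

    p: predicted IACS (sigmoid output)
    q: target score -- IoU with the ground-truth box for positives, 0 for negatives
    """
    eps = 1e-12  # numerical safety for log
    if q > 0:
        # Positive sample: binary cross-entropy weighted by q itself,
        # so high-quality (high-IoU) positives are up-weighted.
        return -q * (q * math.log(p + eps) + (1 - q) * math.log(1 - p + eps))
    # Negative sample: down-weighted by alpha * p**gamma, as in Focal Loss,
    # so easy negatives contribute little to the total loss.
    return -alpha * p ** gamma * math.log(1 - p + eps)
```

Since the weight on a negative is `alpha * p**gamma`, confident false positives (large p) are penalized heavily, while the sea of easy negatives (small p) is nearly ignored.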

💠 VFNet Architecture

  • VFNet is built on FCOS+ATSS with the centerness branch removed

  • It has three new components:

    1. The VariFocal Loss,
    2. The star-shaped bounding box feature representation (see the figure above)
    3. The bounding box refinement
  • VFNet also uses GIoU Loss for both bounding box branches

  • VariFocal Loss consistently improved RetinaNet, FoveaBox, and ATSS by 0.9 AP, and RepPoints by 1.4 AP
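As a reminder of what the GIoU Loss computes, here is a minimal scalar sketch (MMDetection's `GIoULoss` operates on batched tensors; this standalone version is mine). GIoU extends IoU with a penalty based on the smallest enclosing box, so even non-overlapping boxes get a useful gradient:

```python
def giou_loss(a, b):
    """GIoU loss, 1 - GIoU, for two (x1, y1, x2, y2) boxes."""
    # Intersection area
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union area
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    # GIoU = IoU - |C \ (A ∪ B)| / |C|, ranging over (-1, 1]
    giou = iou - (c_area - union) / c_area
    return 1.0 - giou
```

For identical boxes the loss is 0; for disjoint boxes it exceeds 1, growing as the boxes move further apart inside their enclosing box.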

🎯 Actionable resources for VFNet

IceVision allows you to easily train SOTA object detection models in a few lines of code. You can train one of the many VFNet models using this notebook: Getting Started in Object Detection Notebook

👨‍💻 Code snippet

# VFNet Model
if selection == 0:
    model_type = models.mmdet.vfnet
    backbone = model_type.backbones.resnet50_fpn_mstrain_2x

...

If you want to have a peek at how the FPN is used in VFNet in the MMDetection library, check out this code snippet:

Source: MMDetection VFNet Configuration File

# VFNet model settings
model = dict(
    type='VFNet',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs='on_output',  # use P5
        num_outs=5,
        relu_before_extra_convs=True),
    bbox_head=dict(
        type='VFNetHead',
        num_classes=80,
        in_channels=256,
        stacked_convs=3,
        feat_channels=256,
        strides=[8, 16, 32, 64, 128],
        center_sampling=False,
        dcn_on_last_conv=False,
        use_atss=True,
        use_vfl=True,
        loss_cls=dict(
            type='VarifocalLoss',
            use_sigmoid=True,
            alpha=0.75,
            gamma=2.0,
            iou_weighted=True,
            loss_weight=1.0),
        loss_bbox=dict(type='GIoULoss', loss_weight=1.5),
        loss_bbox_refine=dict(type='GIoULoss', loss_weight=2.0)),

    ...
  • In the above code snippet, the backbone is a ResNet-50 and the neck is an FPN

  • in_channels (List[int]): Number of input channels per scale. in_channels=[256, 512, 1024, 2048]

  • out_channels (int): Number of output channels (used at each scale). out_channels=256

  • start_level (int): Index of the start input backbone level used to build the feature pyramid. start_level=1

  • num_outs (int): Number of output scales (P3 to P7). num_outs=5

  • VFNetHead is a modified head that uses VariFocal Loss and star-shaped bounding box features

  • GIoU Loss is used for both the bounding box branch and the box refinement branch
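A hypothetical helper (mine, not part of MMDetection's API) sketching how `out_indices`, `start_level`, and `num_outs` combine: the four ResNet stages emit C2-C5, `start_level=1` drops C2, C3-C5 become P3-P5, and the two remaining outputs are added as extra stride-2 convs (P6, P7), matching `strides=[8, 16, 32, 64, 128]` in the head:

```python
def fpn_levels(out_indices, start_level, num_outs):
    """Return the pyramid level names implied by an FPN config."""
    cs = [i + 2 for i in out_indices]   # ResNet stage outputs: C2..C5
    used = cs[start_level:]             # start_level=1 skips C2 -> C3..C5
    ps = list(used)                     # lateral connections: C3..C5 -> P3..P5
    while len(ps) < num_outs:           # extra convs on the last output
        ps.append(ps[-1] + 1)           # -> P6, P7
    return [f"P{p}" for p in ps]
```

With the config above, `fpn_levels((0, 1, 2, 3), 1, 5)` yields the five levels P3 through P7, whose strides relative to the input image are 8, 16, 32, 64, and 128.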

📚 References

📰 Paper for more details.

MMDetection Repo

MMDetection Documentation

IceVision Repo

IceVision Documentation

IceVision Documentation 2