- Published on
VarifocalNet (VFNet)- An IoU-aware Dense Object Detector
- Authors
- Name
- Farid Hasainia, PhD
- @ai_fast_track
💡 Overview
VFNet is one of the best anchor-free single-stage models, yet it still flies under the radar. VFNet is simply an aggregation of several clever ideas, each of which contributes to improving the overall performance of the original FCOS model.
Table of Contents
🧠 KEY TAKEAWAYS
VFNet = FCOS + ATSS + GIoU Loss + VariFocal Loss + IACS + Box Refinement
VFNet is one of the best anchor-free single-stage models
It improves detection quality by selecting high-quality bounding box (bbox) candidates
VFNet uses IoU-aware Classification Score (IACS), one of the big boosters of VFNet
It introduces the VariFocal Loss, which outperforms the original Focal Loss
VFNet offers strong performance even for partially trained models: only the backbone is trained, while both the FPN and the heads are left untrained.
IACS helps produce high-quality bounding box (bbox) candidates. It is a combination of the IoU and the classification score, and it addresses the problem of low-quality bounding box candidates surviving the Non-Maximum Suppression (NMS) algorithm.
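As a rough sketch (not the paper's or MMDetection's actual code), the IACS training target for a positive sample can be thought of as the IoU between the predicted box and its ground-truth box, placed at the ground-truth class index, with zeros elsewhere:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def iacs_target(pred_box, gt_box, gt_class, num_classes):
    """IACS target: the IoU at the ground-truth class index, 0 elsewhere."""
    target = [0.0] * num_classes
    target[gt_class] = iou(pred_box, gt_box)
    return target

# A well-localized prediction gets a high target at its class index
print(iacs_target([0, 0, 10, 10], [1, 1, 10, 10], gt_class=2, num_classes=4))
```

The key point is that the classification target is no longer a binary 0/1 label but a soft score that encodes localization quality.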
Misclassification has a negative impact on a model's performance because NMS relies on classification scores to determine which bounding boxes are the most relevant. NMS works by:
Selecting the bounding boxes with the highest classification scores,
and then, discarding duplicates using the IoU threshold
So a bbox with better localization (high IoU with the ground-truth bbox) but a low classification score will be discarded whenever another bbox has a higher classification score but a lower IoU with the ground-truth bbox (less accurate localization).
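The failure mode above can be reproduced with a minimal greedy NMS sketch (the boxes and scores are made-up values for illustration):

```python
def iou(a, b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thr]
    return keep

gt = [0, 0, 10, 10]
boxes = [[0, 0, 10, 10],   # perfect localization, low class score
         [2, 0, 12, 10]]   # worse localization, higher class score
scores = [0.4, 0.9]
print(nms(boxes, scores))  # only index 1 survives: the worse box wins
```

Because NMS ranks purely by classification score, the perfectly localized box (IoU of 1.0 with the ground truth) is suppressed in favor of the poorly localized one. This is exactly the ranking problem IACS is designed to fix.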
Here is a comparison of the impact of the different classification techniques on the model's performance.
🔍 Background
Accurately ranking candidate detections is crucial for dense object detectors to achieve high performance
Prior work uses the classification score or a combination of classification and predicted localization scores (centerness) to rank candidates.
Both ranking signals are still suboptimal
✍️ Novelty
VFNet proposes to learn an IoU-Aware Classification Score (IACS) as a joint representation of object presence confidence and localization accuracy using IoU
VFNet introduces VariFocal Loss
The VariFocal Loss down-weights only negative easy samples for addressing the class imbalance problem during training
The VariFocal Loss up-weights high-quality positive examples for generating prime detections
As a reminder, the Focal Loss down-weights both the positive and negative easy samples to address the class imbalance problem
💠 VFNet Architecture
VFNet is based on the FCOS+ATSS with the centerness branch removed
It has three new components:
- The VariFocal Loss,
- The star-shaped bounding box feature representation (see the figure above)
- The bounding box refinement
VFNet also uses the GIoU Loss for both bounding box branches
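For reference, here is a minimal sketch of the GIoU loss (1 - GIoU) for two axis-aligned boxes, following the standard definition; it is not the MMDetection implementation:

```python
def giou_loss(a, b):
    """GIoU loss between two [x1, y1, x2, y2] boxes: 1 - GIoU."""
    # intersection
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box C
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / area_c
    return 1.0 - giou

# Unlike a plain IoU loss, GIoU still yields a signal for disjoint boxes
print(giou_loss([0, 0, 2, 2], [4, 0, 6, 2]))
```

The enclosing-box penalty term is what gives GIoU a useful gradient even when the predicted and ground-truth boxes do not overlap at all.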
The VariFocal Loss consistently improved RetinaNet, FoveaBox, and ATSS by 0.9 AP, and RepPoints by 1.4 AP
🎯 Actionable resources for VFNet
IceVision allows you to easily train SOTA object detection models in a few lines of code. You can train one of the many VFNet models using this notebook: Getting Started in Object Detection Notebook
👨💻 Code snippet
```python
# VFNet Model
if selection == 0:
    model_type = models.mmdet.vfnet
    backbone = model_type.backbones.resnet50_fpn_mstrain_2x
    ...
```
If you want a peek at how the FPN is used in VFNet in the MMDetection library, check out this code snippet:
Source: MMDetection VFNet Configuration File
```python
# VFNet model settings
model = dict(
    type='VFNet',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs='on_output',  # use P5
        num_outs=5,
        relu_before_extra_convs=True),
    bbox_head=dict(
        type='VFNetHead',
        num_classes=80,
        in_channels=256,
        stacked_convs=3,
        feat_channels=256,
        strides=[8, 16, 32, 64, 128],
        center_sampling=False,
        dcn_on_last_conv=False,
        use_atss=True,
        use_vfl=True,
        loss_cls=dict(
            type='VarifocalLoss',
            use_sigmoid=True,
            alpha=0.75,
            gamma=2.0,
            iou_weighted=True,
            loss_weight=1.0),
        loss_bbox=dict(type='GIoULoss', loss_weight=1.5),
        loss_bbox_refine=dict(type='GIoULoss', loss_weight=2.0)),
    ...
```
In the above code snippet, the backbone is a ResNet50 and the neck is an FPN with the following parameters:
- in_channels (List[int]): number of input channels per scale. in_channels=[256, 512, 1024, 2048]
- out_channels (int): number of output channels (used at each scale). out_channels=256
- start_level (int): index of the first backbone level used to build the feature pyramid. start_level=1
- num_outs (int): number of output scales (P3 to P7). num_outs=5
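To make the strides concrete, here is a small sketch of the spatial size of each pyramid level implied by the config's strides=[8, 16, 32, 64, 128], assuming a hypothetical 800x800 input (the input size is an illustrative assumption, not part of the config):

```python
def fpn_level_sizes(img_h, img_w, strides=(8, 16, 32, 64, 128)):
    """Spatial size of each pyramid level P3..P7 for the given strides."""
    return [(img_h // s, img_w // s) for s in strides]

# For a hypothetical 800x800 input, P3..P7 have these feature-map sizes
for level, (h, w) in zip(range(3, 8), fpn_level_sizes(800, 800)):
    print(f"P{level}: {h}x{w}, 256 channels")
```

Each level halves the resolution of the previous one, while out_channels keeps the channel count fixed at 256 across all five scales.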
VFNetHead is a modified head that uses VariFocal Loss and star-shaped bounding box features
The GIoU Loss is used for both the bounding box and the box refinement branches
📚 References
📰 Paper for more details.