DyCo3D: Robust Instance Segmentation of 3D Point Clouds through Dynamic Convolution

论文信息

作者	作者单位	年份	会议/期刊名	引用量	研究方向
Tong He；沈春华	Adelaide	2021	CVPR	2	3D实例分割

0. TL;DR

We propose instead a dynamic , proposal-free , data-driven approach (labeled DyCo3D) that generates the appropriate convolution kernels to apply in
response to the nature of the instances (Abs )

1. Motivation

Previous top-performing approaches for point cloud instance segmentation involve a bottom-up strategy
includes inefficient operations or complex pipelines , such as grouping over-segmented components , introducing additional steps for refining , or designing complicated loss
functions.
The inevitable variation in the instance scales can lead bottom-up methods to become particularly sensitive to hyper-parameter values

2. Previous work

Drawbacks
- Previous top-performing approaches for point cloud instance segmentation involve a bottom-up strategy
- Including inefficient operations or complex pipelines, such as grouping over-segmented components , introducing additional steps for refining , or designing complicated loss functions .
- The inevitable variation in the instance scales can lead bottom-up methods to become particularly sensitive to hyper-parameter values (the performance is sensitive to values of the pre-defined hyper-parameters, which require manual tuning )
- heavily reliant on the quality of the proposals , which limits their robustness and can lead to joint/fragmented instances in practice.

3D-MPA: extracts proposals from the predicted instance centroids. Instances are then generated by aggregating proposal-wise embeddings.
PointGroup: generates instances proposals by gradually merging neighbouring points that share the same category label. Both original and centroid-shifted points are explored with a manually specified search radius. A separate model (labelled ScoreNet) is used to estimate the objectness of the proposals.
Dynamic Convolution
主要参考两篇论文：Dynamic filter networks. (NIPS 2016) 、Conditional convolutions for instance segmentation. (ECCV 2020)
Conditional convolutions for instance segmentation 直接从2D迁移到3D效果不好有如下原因：
1. it introduces a large amount of computation, resulting in optimization difficulties. （优化难度提升很大）
2. the performance is constrained by the limited receptive field and representation capability due to the sparse convolution （感受野大小被稀疏卷积限制了）

3. Method

Summary

Features
1. propose a novel pipeline tailored to 3D point cloud instance segmentation using dynamic convolution, that we label DyCo3D
2. propose to encode category-specific context by deploying a lightweight sub-network to explore homogenous points that have close votes for instance centroids and share the semantic
  labels.
3. propose to introduce a small transformer to capture a long-range dependency and build high-level interactions among different regions.
4. Robustness; 对于超参数不敏感
5. More efficiency
Point of View
- To make the kernels discriminative , we explore a large context by gathering homogeneous points that share identical semantic categories and have close votes for the geometric centroids . （聚合属于同一个语义类别且具有相近实例中心的同源点）
- Due to the** limited receptive field introduced by the sparse convolution** , a small light-weight transformer is also devised to capture the long-range dependencies and
  high-level interactions among point samples.

Overall Architecture

DyCo3D is comprised of three primary components（整体结构）:
1. a backbone network , which is** based on sparse convolution ** for feature extraction, and **contains a light-weighted ** transformer .
2. A weight generator that responds to the individual characteristics of each instance to dynamically generate the appropriate filter parameters . To make the filters discriminative, a large category-specific context is introduced.
3. An instance decoder . Instances are separated in parallel , using only three convolution layers, by convolving the generated class-aware filters with position embedded features.

produce instance masks using only a small number of simple convolutional layers.

The associated convolution filters are dynamically generated , conditioned on both spatial distribution of the data and the semantic predictions. （卷积核动态生成，受数据空间分布以及语义分割预测控制）

Backbone

U-Net + Transformer

Transformer
except for the position embedding layer, where the position-sensitive information is **encoded ** as the mean of the pairwise direction vector or relative position.
论文中没有具体写这个transformer加在哪里以及具体细节，推测是在U-Net bottleneck位置上的self-attention

Output by the backbone

语义分割$\mathbf{F}_{\mathrm{seg}} \in \mathbb{R}^{N \times C}$ (apply traditional cross entropy loss)
中心偏移量$\mathbf{O}_{\mathrm{off}} \in \mathbb{R}^{N \times 3}$

$$
\mathcal{L}{\mathrm{ctr}}=\frac{1}{N{v}} \sum_{i=0}^{N}\left|p^{i}+o_{\mathrm{off}}^{i}-c t r_{\mathrm{gt}}^{i}\right| \cdot \mathbb{1}\left(p^{i}\right)
$$

instance masking $ \mathbf{F}_{\mathrm{mask}} \in \mathbb{R}^{N \times D}$, where D is the dimension of the output channel

Dynamic Weight Generator

To **generate discriminative filters for distinguishing different instances ** we propose to group homogenous points that have close votes for the geometric centroids and
share the category predictions.

instance-aware filters are dynamically generated by applying a small sub-network for large context aggregation

引入这个模块的动机还是稀疏卷积造成的感受野偏小 (impair the method’s ability to exploit large-scale context )，提取cluster级的feature

Grouping points by using a similar strategy to PointGroup

applying a breadth-first searching algorithm to group homogenous points that have identical semantic labels and close centroids predictions.
**Integrates ** large context to generate filters for instances decoding

Due to the removal of the reliance on the quality of the instance proposals, the performance of our method is robust to the pre-defined hyper-parameters
extract instance feature by 3-layer network $G_w(\cdot)$
- First, voxelize instance points (14x14x14). The features of each grid is calculated as the average of the point feature $F_b$ within the grid, where $F_b$ is the output of the backbone
- aggregate context for cluster. It contains two sparse convolutional layers with a kernel size of 3, a global pooling layer, and an MLP layer. $\mathcal{W}_C^z$

Instance Decoder

directly append position embeddings in the feature space . (instance cluster中每一个点的位置相对于质心做归一化)

Given a specific category, position representation is critical to separate different instances. (作者认为位置信息对于分隔两个实例来说很关键)

$$
f_{\mathrm{pos}}^{i}=p^{i}-\mathcal{C}_{\mathrm{ctr}}^{z}
$$

每个Cluster中每个point的input feature $f_z^i$由$f_{pos}^i$和$f_{mask}^i$拼接（注意$f_z$的维度是$N\times(D+3)$）

decode binary segmentations of instances
Network: The whole decoder contains** three convolution** layers with a kernel size of 1 × 1 . Each layer uses **ReLU ** as the activation function without normalization . $m_{z} \in \mathbb{R}^{N}$

这里的意思是我们将$\mathcal{W}_{\mathcal{C}}^{z}$作为卷积层的参数

$$
177=\underbrace{(8+3) \times 8+8}{\text {conv } 1}+\underbrace{8 \times 8+8}{\text {conv } 2}+\underbrace{8 \times 1+1}_{\text {conv3 }}
$$

$$
m_{z}=\operatorname{Conv}\left(\mathcal{W}{\mathcal{C}}^{z}, f{z}\right)
$$

mask prediction (Before that, dice loss is also utilized)

$$
\mathcal{L}{\text {mask }}=\frac{1}{Z} \sum{z=1}^{Z} \frac{1}{N_{z}} \sum_{j=1}^{N} \mathbb{1}{l{\mathrm{seg}}^{j}=l_{\mathrm{c}}^{z}} \cdot L_{\mathrm{BCE}}\left(m_{z}^{j}, \hat{m}_{z}^{j}\right)
$$

Training & Post-process

Total loss

$$
\mathcal{L}=\mathcal{L}{\text {seg }}+\mathcal{L}{\text {ctr }}+\mathcal{L}{\text {mask }}+\mathcal{L}{\text {dice }}
$$

NMS on the instance binary mask

4. Experiment

Metric

mConv: mConv is defined as the mean instance-wise IoU
mWConv: mWConv denotes the weighted version of mConv
mPrec
mRec

Ablation Study

Efficiency

DyCo3D 0.28s

PointGroup 0.39s

Result

5. Remark

PointGroup基础上提升了效率、精度、鲁棒性
Backbone有改动，引入小的Transformer模块
动态卷积CondInst
instance embedding + mask + NMS方式解决over-seg问题
PointGroup+他们团队之前CondInst

DyCo3D: Robust Instance Segmentation of 3D Point Clouds through Dynamic Convolution

DyCo3D: Robust Instance Segmentation of 3D Point Clouds through Dynamic Convolution

论文信息

0. TL;DR

1. Motivation

2. Previous work

3. Method

Summary

Overall Architecture

Backbone

Dynamic Weight Generator

Instance Decoder

Training & Post-process

4. Experiment

Metric

Ablation Study

Result

5. Remark

glob&shutil学习

OccuSeg: Occupancy-aware 3D Instance Segmentation

唐川