CN112115951A

CN112115951A - RGB-D image semantic segmentation method based on spatial relationship

Info

Publication number: CN112115951A
Application number: CN202011301588.6A
Authority: CN
Inventors: 张健; 费哲遥; 李月华; 谢天; 朱世强
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2020-11-19
Filing date: 2020-11-19
Publication date: 2020-12-22
Anticipated expiration: 2040-11-19
Also published as: CN112115951B

Abstract

The invention discloses a RGB-D image semantic segmentation method based on spatial relationship, which constructs a semantic segmentation network by taking Deeplab-v3 as a basic model and comprises a feature extraction module, a spatial relationship similarity loss module, a decoder module and a loss function module. Semantic segmentation is carried out on an RGB-D image of an indoor scene, RGB information and Depth information are effectively fused through a deep learning network, and spatial relationship similarity is introduced into a backbone network. On the basis of parallel design of a network structure, the depth information and RGB information fusion effect is assisted to be improved by calculating regional characteristic values and similarity degree measurement of the depth information and the RGB information. The method is simple and convenient only depending on sensor equipment capable of providing RGB data and depth data, and is an effective method based on image matching in Kinect, Xtion and other somatosensory equipment applications.

Description

RGB-D image semantic segmentation method based on spatial relationship

Technical Field

The invention belongs to the field of computer image processing, and particularly relates to an RGB-D image semantic segmentation method based on a spatial relationship.

Background

Semantic segmentation is an important application in computer vision, and is widely applied to the fields of robots, automatic driving, security monitoring and the like.

Compared to conventional RGB solutions, RGB-D sensors can provide multi-mode information including color, depth. In scenes with unobvious color boundaries, weak texture features, inconsistent target depths and the like, the depth information has a strong guiding function on semantic segmentation. Based on the principle, the semantic segmentation method utilizing the RGB-D information can obtain a segmentation effect superior to that of the traditional method.

The existing RGB-D fusion scheme can be mainly divided into three types, namely 2D multi-mode semantic fusion, network structure parallel design and 3D point cloud space mapping. The 2D multi-mode semantic fusion and the network structure parallel design respectively guide the fusion of depth and RGB information in a manual excavation and network extraction mode, and the fusion effect is limited; the 3D point cloud space mapping method incurs a large amount of computational overhead.

Disclosure of Invention

The invention aims to provide an RGB-D image semantic segmentation method based on a spatial relationship aiming at the defects of the prior art.

The purpose of the invention is realized by the following technical scheme: a RGB-D image semantic segmentation method based on spatial relationship comprises the following steps:

(1) constructing a semantic segmentation network by taking Deeplab-v3 as a basic model, wherein the semantic segmentation network comprises a feature extraction module, a spatial relationship similarity loss module, a decoder module and a loss function module; inputting an RGB-D image and outputting a semantic classification score map;

(2) training the semantic segmentation network constructed in the step (1);

(3) and (3) inputting the RGB-D image to be tested to the semantic segmentation network trained in the step (2), and taking the maximum score category in the output semantic classification score map as each pixel point category to obtain a semantic segmentation image.

Further, the feature extraction module is: the Resnet101 is used as a backbone network of the feature extraction module, parallel RGB and depth branches are constructed, and the structure is kept consistent.

Further, in the training process of the step (2), data enhancement is performed by using a random inversion, cutting and gamma value transformation method; the pre-training parameters of ImageNet are loaded on the RGB in the model and the main network corresponding to the deep branch; and the model is trained using a back propagation algorithm.

Further, the construction of the spatial relationship similarity loss module comprises the following sub-steps:

(a1) respectively extracting output characteristics of b sub-modules in RGB and deep branch networks, and constructing multiple groups of pairwise relationsf _i：

f _i={f _i,rgb,f _i,dep}

Wherein the content of the first and second substances,

b represents the number of selected sub-modules;

is the RGB branch ofiThe output characteristics of the individual modules are,

is a deep branchiOutput characteristics of the individual modules;

(a2) respectively mixing each groupf _iInner RGB, depth feature transformation into feature regions

：

Wherein the function

Representing a global pooling operation based on an original feature scale downsampling;

、

is that

、

A corresponding feature region;

(a3) computing paired feature regions

Corresponding autocorrelation spatial features

：

Wherein the content of the first and second substances,

、

is that

、

Corresponding autocorrelation spatial features;

representing an autocorrelation spatial matrix;

，

；

、

to represent

Any two of the regions m, n in the region,

、

to represent

Any two of the regions m and n; the dst (x, y) function represents the distance operation;

(a4) calculating the distance between RGB and depth autocorrelation spatial features and generating a spatial relationship similarity loss

：

。

Further, the dst (x, y) function is dst (x, y) = cos (norm (x), norm (y)), where norm represents a norm.

Further, the decoder module is configured to: final set of feature maps for RGB and depth branch output

Performing feature splicing through a feature weighting module; features after splicing

Generating a characteristic diagram through a multi-scale cavity convolution module, and comparing the characteristic diagram with the original characteristic diagram

And overlapping the channels to finally obtain a semantic classification score map.

Further, the construction of the decoder module comprises the sub-steps of:

(b1) will be provided with

、

Respectively inputting into the global average pooling layer, subsequently connecting with two full-connection layers with same-ratio compression and expansion of channels, activating function, and outputting

；

(b2) Outputting step (b 1)

Adding to obtain a feature map after feature splicing

；

(b3) Splicing the step (b 2)Later feature map

Inputting a multi-scale cavity convolution module, parallelly passing through 4 cavity convolution layers with different scales and 1 mean value pooling layer, superposing the 5 types of outputs on a channel, compressing the outputs by convolution of 1 x 1, and outputting

；

(b4) Will be provided with

And

after the features are overlapped on the channels, inputting 3 × 3 convolution layers and 1 × 1 convolution layers, and finally outputting a semantic classification score map.

Further, the loss function module is: and fitting the semantic classification score map and the real label by using the cross entropy loss as a loss function, and using a random gradient descent method as an optimization method.

The invention has the beneficial effects that: the invention relates to an image fusion descriptor method based on an RGB-D sensor, which is used for performing semantic segmentation on an RGB-D image of an indoor scene, effectively fusing RGB information and Depth information through a deep learning network and introducing spatial relationship similarity in a backbone network. On the basis of parallel design of a network structure, the depth information and RGB information fusion effect is assisted to be improved by calculating regional characteristic values and similarity degree measurement of the depth information and the RGB information. The method is simple and convenient only depending on sensor equipment capable of providing RGB data and depth data, and is an effective method based on image matching in Kinect, Xtion and other somatosensory equipment applications.

Drawings

FIG. 1 is a diagram of the overall architecture of a network;

FIG. 2 is a block diagram of spatial relationship similarity loss;

FIG. 3 is a schematic diagram illustrating the effect of the present invention; wherein, a is an image schematic diagram of an RGB-D to-be-tested image of an indoor scene, and b is a semantic classification score map.

Detailed Description

The invention relates to a RGB-D image semantic segmentation method based on spatial relationship, which comprises the following steps as shown in figure 1:

step one, constructing a semantic segmentation network:

the overall network architecture design is based on an open-source deep learning framework pytorch, and is transformed on the basis of the public Deeplab-v3 network architecture, so that three parts, namely a feature extraction module, a spatial relationship similarity loss module and a decoder module, are realized.

(1) Building feature extraction module

The module selects a backbone network of Resnet101 as a basic framework of the feature extraction module, and two parallel branches of RGB and Depth (Depth) are synchronously constructed.

(2) Building spatial relationship similarity loss module

The RGB and the depth branch structure are kept consistent, the output characteristics of four sub-modules in the network of the RGB and the depth branch structure are extracted, and four groups of pairwise relations are constructedf _iAnd is recorded as:

f _i={f _i,rgb,f _i,dep}

wherein the content of the first and second substances,

i belongs to {1,2,3,4}, and corresponds to 4 groups of characteristics;

is the RGB branch ofiThe output characteristics of the individual modules are,

is a deep branchiOutput characteristics of the individual modules; w, h, c refer to feature map dimensions.

Pairwise relationships for each groupf _iConverting the RGB, depth features within the set into feature regions, respectively

And is recorded as:

wherein, the function p (x) = maxporoling (x,5), representing a global maximum pooling operation of down-sampling 5 times based on the original feature scale; then it is corresponding to

，h’=h/5，w’=w/5。

Computing paired feature regions

Respectively corresponding autocorrelation spatial features

The autocorrelation is related to the distance of different regions in the same feature map, and is expressed as:

wherein the content of the first and second substances,

、

is an auto-correlated spatial feature of RGB and depth,

representing an autocorrelation spatial matrix;

，

；

、

to represent

Any two of the regions m, n in the region,

、

to represent

Any two of the regions m and n; such as

Corresponding region m is

A point position element set of each channel in a third dimension corresponding to a point in the first and second dimensions; and a function of distancedstCosine distance formula dst (x, y) = cos (norm (x), norm (y)) is selected, and function norm (x) represents norm sampling.

As shown in FIG. 2, the distances between each set of RGB and depth autocorrelation spatial features are calculated and a spatial relationship similarity loss is generated

：

Where b =4 represents a 4-component-pair feature map of RGB and depth branch outputs.

(3) Building decoder modules

Final set of feature maps for RGB and depth branching output

Inputting a feature weighting module and completing feature splicing; features after splicing

Generating a new characteristic diagram by a multi-scale void convolution (ASPP) module, and comparing the new characteristic diagram with the original characteristic diagram

And (4) overlapping channels, and finally generating a semantic classification score map with the channel number of 40 by a decoder module. The characteristic weighting module adopts 16 times of compression and expansion rate, and the activation function selects sigmod (x); the multiscale void convolution (ASPP) module selects (1,6,12,18) different expansion coefficients.

(3.1) performing feature splicing on the RGB and the output feature map of the last module in the depth branch, and inputting in a feature weighting and adding mode, wherein the specific process is as follows:

a) the characteristic weight is that

Respectively inputting the global average pooling layer to obtain two data corresponding to B × C × 1 × 1 scale (B, C respectively refer to batch and feature map of training process

The corresponding number of channels); subsequently, two full-connection layers with the same compression and expansion of the channels are connected, and the full-connection layers are output after the full-connection layers are activated by a function

。

b) The feature summation is to add the weighted RGB and the depth branch feature value, and the calculated feature map value after feature splicing is

。

(3.2) feature map after stitching

Inputting the semantic classification score map into a decoder network corresponding to the Deeplab-v3, and finally outputting the semantic classification score map, wherein the specific flow is as follows:

a) characteristic diagram

Inputting the data into a multi-scale hole convolution module (ASPP),

the input passes through 4 void convolution layers of different scales and 1 mean pooling layer mechanism in parallel. Superposing the above 5 kinds of outputs on the channel, compressing with convolution of 1 × 1, and outputting

。

b)

And

and after the features are superposed on the channels, inputting a standard 3 × 3 convolution layer and a standard 1 × 1 convolution layer, and outputting a final semantic classification score map.

(4) Loss function module

And fitting the semantic classification score map and a real label by using cross-entropy loss (cross-entropy loss) as a loss function, and reversely propagating the whole semantic segmentation network by using a random-gradient descent (mini-batch SGD) as an optimization method, so that the whole model framework is constructed.

Step two, selecting an open source NYU-depth v2 semantic segmentation data set as a task sample; the data lump meter comprises 1449 marked RGB-D images, 795 images are divided to be used as training sets, and 654 images are divided to be used as testing sets. In the training process, data enhancement is carried out on line by using a random overturning, cutting and gamma value transformation method. The pre-training parameters of ImageNet are loaded on the RGB in the model and the trunk network corresponding to the deep branch; and the model is trained using a back propagation algorithm.

And step three, in the task verification process, as shown in fig. 3, inputting an RGB-D image to be tested (a in fig. 3) of an indoor scene, outputting a semantic segmentation image by taking the maximum score category as each pixel point category according to an output final output semantic classification score map (b in fig. 3), and finishing the visualization process.

Claims

1. A RGB-D image semantic segmentation method based on spatial relationship is characterized by comprising the following steps:

(2) training the semantic segmentation network constructed in the step (1);

2. The RGB-D image semantic segmentation method based on spatial relationship as claimed in claim 1, wherein the feature extraction module is: the Resnet101 is used as a backbone network of the feature extraction module, parallel RGB and depth branches are constructed, and the structure is kept consistent.

3. The RGB-D image semantic segmentation method based on spatial relationship as claimed in claim 2, wherein in the training process of step (2), data enhancement is performed by using random inversion, clipping and gamma value transformation methods; the pre-training parameters of ImageNet are loaded on the RGB in the model and the main network corresponding to the deep branch; and the model is trained using a back propagation algorithm.

4. The RGB-D image semantic segmentation method based on spatial relationship as claimed in claim 2, wherein the construction of the spatial relationship similarity loss module comprises the following sub-steps:

f _i={f _i,rgb,f _i,dep}

Wherein the content of the first and second substances,

b represents the number of selected sub-modules;

is the RGB branch ofiThe output characteristics of the individual modules are,

is a deep branchiOutput characteristics of the individual modules;

(a2) respectively mixing each group

Inner RGB, depth feature transformation into feature regions

：

Wherein the function p (x) represents a global pooling operation based on the original feature scale down-sampling;

、

is that

、

A corresponding feature region;

(a3) computing paired feature regions

Corresponding autocorrelation spatial features

：

Wherein the content of the first and second substances,

、

is that

、

Corresponding autocorrelation spatial features; d (x) represents an autocorrelation spatial matrix;

，

；

、

to represent

Any two of the regions m, n in the region,

、

to represent

：

。

5. The RGB-D image semantic segmentation method based on spatial relationship as claimed in claim 4, wherein the dst (x, y) function is dst (x, y) = cos (norm (x), norm (y)), and norm represents norm.

6. The method for semantic segmentation of RGB-D images based on spatial relationships according to claim 4, wherein the decoder module is configured to: final set of feature maps for RGB and depth branch output

7. The method for semantic segmentation of RGB-D images based on spatial relationships according to claim 6, wherein the construction of the decoder module includes the sub-steps of:

(b1) will be provided with

、

；

(b2) Outputting step (b 1)

Adding to obtain a feature map after feature splicing

；

(b3) Splicing the characteristic graph obtained in the step (b 2)

；

(b4) Will be provided with

And

8. The RGB-D image semantic segmentation method based on spatial relationship as claimed in claim 1, wherein the loss function module is: and fitting the semantic classification score map and the real label by using the cross entropy loss as a loss function, and using a random gradient descent method as an optimization method.