CN113222003B - Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D - Google Patents
- Publication number
- CN113222003B (application CN202110498856.6A)
- Authority
- CN
- China
- Prior art keywords
- rgb
- pixel
- depth
- module
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to an RGB-D-based construction method and system for an indoor scene pixel-by-pixel semantic classifier. The method comprises the following steps. S1: acquire images of an indoor scene to obtain RGB data and Depth data. S2: define the object categories in the images and label each pixel with its category. S3: input the RGB data and Depth data into respective feature extraction modules, and simultaneously input the RGB data into a depth estimation module that supervises the RGB feature extraction process, obtaining the corresponding features f_rgb and f_depth. S4: input f_rgb and f_depth into a scale perception module, obtaining the scale-aware features f^s_rgb and f^s_depth. S5: input f^s_rgb and f^s_depth respectively into self-attention mechanism modules, obtaining the features f^a_rgb and f^a_depth. S6: input f^a_rgb and f^a_depth into a modality-adaptive module, compute the modality-adaptive weights, and fuse f^a_rgb and f^a_depth with those weights to obtain a pixel-by-pixel semantic classification of the image. The method can be applied to indoor scene understanding; the pixel-by-pixel semantic information of the collected RGB-D images can effectively support indoor automatic navigation and other applications.
Description
Technical Field
The invention relates to the fields of computer vision and machine learning, in particular to an RGB-D-based construction method and system for an indoor scene pixel-by-pixel semantic classifier.
Background
In recent years, research results on indoor scene semantic understanding have been widely applied in different fields, including emergency rehearsal in the security field and the positioning, obstacle avoidance, and target search functions of intelligent robots. The topic has become a hot research problem in fields such as virtual reality and augmented reality, and brings convenience to people's daily life and work. However, indoor scenes pose great challenges for semantic understanding owing to conditions such as dim lighting and mutually occluding objects, and the problem has long been a fundamental and classical one in computer graphics, virtual reality, computer vision, and related fields.
Scene semantic segmentation is a pixel-level image classification task: given an image I of size w x h, the goal of a scene semantic segmentation algorithm is to output a result map of the same size w x h, in which each pixel x_{i,j} corresponds to the pixel at position (i, j) in I and the value of x_{i,j} is the label class of that pixel. Image-based semantic segmentation is a necessary prerequisite for semantic understanding of indoor scenes. Semantic segmentation of images has been studied extensively and intensively, in particular through methods based on hand-crafted features and methods based on deep learning. Hand-crafted-feature methods extract robust artificial features from the input image data; depending on how the features are used, there are generally several types. Threshold-based image segmentation applies thresholding (the mean method, histogram techniques, and the like) to the input image, but handles homogeneous regions in the image poorly. Watershed-based methods can obtain good segmentation results by applying the watershed algorithm to the gradient image, but they require the gradient map to be computed in advance and need relatively complex preprocessing. With the development of deep learning, a large number of algorithms based on deep neural networks have been proposed. For example, the FCN algorithm, by replacing the last layer of the classical AlexNet network with a 1 x 1 convolution, achieves pixel-by-pixel classification of the input image.
The vast majority of subsequent work on semantic segmentation of images builds on FCN. However, these algorithms cannot effectively extract RGB features and Depth features and exploit them to achieve better indoor scene semantic segmentation.
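The FCN idea described above — replacing the final classification layer with a 1 x 1 convolution — amounts to applying one linear classifier independently at every spatial position of the feature map. A minimal NumPy sketch with random weights (all sizes and weights here are hypothetical, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
c, h, w, num_classes = 16, 8, 8, 5              # hypothetical sizes

features = rng.standard_normal((c, h, w))        # stand-in for an encoder output
weights = rng.standard_normal((num_classes, c))  # 1x1 conv kernel = per-pixel linear map
bias = np.zeros(num_classes)

# A 1x1 convolution is a matrix multiply over the channel axis at each pixel.
logits = np.einsum('kc,chw->khw', weights, features) + bias[:, None, None]
labels = logits.argmax(axis=0)  # pixel-by-pixel class map, same h x w as the features

assert logits.shape == (num_classes, h, w)
assert labels.shape == (h, w)
```

Because the 1 x 1 convolution carries no spatial extent, the output retains the spatial resolution of the input features, which is exactly what makes dense pixel-wise prediction possible.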
Disclosure of Invention
In order to solve the technical problems, the invention provides an indoor scene pixel-by-pixel semantic classifier construction method and system based on RGB-D.
The technical scheme of the invention is as follows: an indoor scene pixel-by-pixel semantic classifier construction method based on RGB-D comprises the following steps:
step S1: image acquisition is carried out on an indoor scene, and RGB data and Depth data are obtained;
step S2: defining the object categories in the image, and labeling each pixel with its category;
step S3: inputting the RGB data and Depth data into respective feature extraction modules, and simultaneously inputting the RGB data into a depth estimation module which supervises the RGB feature extraction process, to obtain the corresponding features f_rgb and f_depth;
step S4: inputting f_rgb and f_depth into a scale perception module to select feature information at appropriate scales, obtaining the scale-aware features f^s_rgb and f^s_depth;
step S5: inputting f^s_rgb and f^s_depth respectively into self-attention mechanism modules to expand the receptive field, obtaining the features f^a_rgb and f^a_depth;
step S6: inputting f^a_rgb and f^a_depth into a modality-adaptive module, computing the modality-adaptive weights, and fusing f^a_rgb and f^a_depth with those weights to obtain a pixel-by-pixel semantic classification of the image.
Compared with the prior art, the invention has the following advantages:
1. The invention provides a construction method for an indoor scene pixel-by-pixel classifier based on RGB-D data. Exploiting the interaction between the RGB and Depth data collected in the indoor scene, it uses Depth as a ground-truth depth signal for the RGB data, prompting the RGB branch to extract more effective and targeted features. Several modules are designed to fuse features from the RGB data with features from the Depth data, and finally an organic way of combining the score maps of the different modalities is proposed, providing an effective solution for pixel-by-pixel classification of indoor scenes.
2. The invention provides an end-to-end network architecture in which the network modules used for feature extraction may be generic or designed later, without significant impact on the modules presented here. By integrating the modules and optimizing end to end, the invention can serve applications in indoor scenes; in particular, for tasks that require fine-grained semantic understanding, effective semantic segmentation of the currently captured scene can substantially help indoor robot navigation and other related applications.
Drawings
FIG. 1 is a flow chart of a construction method of an indoor scene pixel-by-pixel semantic classifier based on RGB-D in an embodiment of the invention;
fig. 2 is a flow chart of step S4 of the RGB-D-based indoor scene pixel-by-pixel semantic classifier construction method according to an embodiment of the present invention: inputting f_rgb and f_depth into the scale perception module to select feature information at appropriate scales and obtain the scale-aware features f^s_rgb and f^s_depth;
fig. 3 is a flow chart of step S6 of the RGB-D-based indoor scene pixel-by-pixel semantic classifier construction method according to an embodiment of the present invention: inputting the features f^a_rgb and f^a_depth into the modality-adaptive module, computing the modality-adaptive weights, and fusing f^a_rgb and f^a_depth with those weights to obtain a pixel-by-pixel semantic classification of the image;
fig. 4 is a structural block diagram of an indoor scene pixel-by-pixel semantic classifier construction system based on RGB-D according to an embodiment of the present invention.
Detailed Description
The invention provides an RGB-D-based construction method for an indoor scene pixel-by-pixel semantic classifier. It can serve applications in indoor scenes, especially tasks requiring fine-grained semantic understanding: by effectively performing semantic segmentation on the currently captured scene, it can substantially help indoor robot navigation and other related applications.
The present invention will be further described in detail below with reference to the accompanying drawings by way of specific embodiments in order to make the objects, technical solutions and advantages of the present invention more apparent.
The execution environment of the embodiment of the invention is a computer with a 4-core 4.0 GHz central processing unit and 128 GB of memory. To accelerate the training and inference of the semantic segmentation network model, eight NVIDIA GeForce GTX 1080 Ti GPU graphics cards are used for accelerated computation. The construction program of the indoor scene pixel-by-pixel classifier, based on a convolutional neural network operating on RGB-D input data, is written in languages such as Python and C++. Provided the computer's memory and video memory allow, the invention can also run in other execution environments, which are not described here.
Example 1
As shown in fig. 1, the construction method of the indoor scene pixel-by-pixel semantic classifier based on RGB-D provided by the embodiment of the invention comprises the following steps:
step S1: image acquisition is carried out on an indoor scene, and RGB data and Depth data are obtained;
step S2: defining the object categories in the image, and labeling each pixel with its category;
step S3: inputting the RGB data and Depth data into respective feature extraction modules, and simultaneously inputting the RGB data into a depth estimation module which supervises the RGB feature extraction process, to obtain the corresponding features f_rgb and f_depth;
step S4: inputting f_rgb and f_depth into the scale perception module to select feature information at appropriate scales, obtaining the scale-aware features f^s_rgb and f^s_depth;
step S5: inputting f^s_rgb and f^s_depth respectively into self-attention mechanism modules to expand the receptive field, obtaining the features f^a_rgb and f^a_depth;
step S6: inputting f^a_rgb and f^a_depth into the modality-adaptive module, computing the modality-adaptive weights, and fusing f^a_rgb and f^a_depth with those weights to obtain a pixel-by-pixel semantic classification of the image.
In one embodiment, step S1 described above: image acquisition is carried out on an indoor scene to obtain RGB data and Depth data, and the method specifically comprises the following steps:
and acquiring data of an indoor scene by using a consumer-level Depth camera such as Microsoft Kinect and the like to acquire RGB data and Depth data.
And converting the point coordinates of the RGB and Depth images into a camera coordinate system, converting by utilizing the infrared and the optical heart external parameters of the RGB camera, and finding the relation between the points on the RGB image and the Depth image by utilizing the method so as to finish data alignment.
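The alignment above can be sketched as a standard pinhole reprojection: back-project a depth pixel into 3D with the depth camera intrinsics, transform it into the RGB camera frame with the extrinsics, and project it with the RGB intrinsics. The intrinsic matrices and the 2.5 cm baseline below are hypothetical placeholder values, not calibration data from the patent:

```python
import numpy as np

# Hypothetical pinhole intrinsics for the depth (infrared) and RGB cameras,
# and extrinsics (R, t) taking points from the depth frame to the RGB frame.
K_d = np.array([[580.0, 0, 320.0], [0, 580.0, 240.0], [0, 0, 1.0]])
K_rgb = np.array([[525.0, 0, 319.5], [0, 525.0, 239.5], [0, 0, 1.0]])
R, t = np.eye(3), np.array([0.025, 0.0, 0.0])  # e.g. a 2.5 cm horizontal baseline

def depth_pixel_to_rgb(u, v, d):
    """Map a depth-image pixel (u, v) with metric depth d to RGB-image coordinates."""
    p_d = d * np.linalg.inv(K_d) @ np.array([u, v, 1.0])  # back-project into the depth frame
    p_rgb = R @ p_d + t                                   # transform into the RGB camera frame
    uvw = K_rgb @ p_rgb                                   # project with the RGB intrinsics
    return uvw[:2] / uvw[2]

u_rgb, v_rgb = depth_pixel_to_rgb(320.0, 240.0, 2.0)
```

Applying this mapping to every valid depth pixel yields, for each RGB pixel, a corresponding depth value, which is what the labeling and training steps below assume.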
In one embodiment, step S2 above: defining the object categories in the image and labeling each pixel with its category, specifically comprises the following steps:
The total set of object categories in the current indoor scene is counted; the number of these categories is the number of possible classes for each pixel in the indoor scene. The pixels of the RGB and Depth images captured in the indoor scene are labeled manually. Since RGB and Depth describe the same scene, the label values of RGB and Depth are identical.
In one embodiment, the step S3: inputting the RGB data and Depth data into respective feature extraction modules while inputting the RGB data into a depth estimation module that supervises the RGB feature extraction process, to obtain the corresponding features f_rgb and f_depth, specifically comprises:
The RGB data and Depth data are input into an RGB feature extraction module and a Depth feature extraction module respectively for feature extraction. While the RGB features are extracted, the RGB data are simultaneously input into a depth estimation module, which supervises the RGB feature extraction process; the corresponding features f_rgb and f_depth are obtained.
In this step, a classical convolutional neural network structure such as ResNet extracts features from the input RGB and Depth data. In the embodiment of the invention, ResNet-18 is used as the encoder, with a B1-B2-B3-B4 stage structure. Since RGB and Depth are image data of two different modalities, two separate feature extraction modules are needed. In the RGB feature extraction branch, depth estimation serves as an additional supervision signal; the depth estimation module of the embodiment adopts a B1-B2-B3-B4-B4-B3-B2-B1 structure. The downsampling layers are replaced with atrous (dilated) convolutions to prevent the resolution of the extracted features from becoming too small; in general, the resolution after feature extraction is required to be no less than 1/8 of the original input image. Finally, this step extracts the features f_rgb and f_depth corresponding to the input RGB data and Depth data.
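As a quick sanity check on the 1/8-resolution requirement above: a standard ResNet-18 encoder downsamples by 32x overall, while keeping the stride-1 dilated variant in the last two stages holds the output stride at 8. The stage strides below follow the standard ResNet-18 layout (stem conv, max-pool, then stages B1-B4); which stages are dilated is an assumption based on the description:

```python
import math

input_size = 480                       # hypothetical input resolution
standard_strides = [2, 2, 1, 2, 2, 2]  # stem conv, maxpool, B1, B2, B3, B4
dilated_strides  = [2, 2, 1, 2, 1, 1]  # B3/B4 stride-2 replaced by dilated convs

def output_size(size, strides):
    """Spatial size after applying each stage's stride in sequence."""
    for s in strides:
        size = math.ceil(size / s)
    return size

std_out = output_size(input_size, standard_strides)  # 1/32 of the input
dil_out = output_size(input_size, dilated_strides)   # 1/8 of the input
```

Dilation enlarges the receptive field of each convolution without further downsampling, which is why it can stand in for stride-2 layers while satisfying the resolution constraint.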
As shown in FIG. 2, in one embodiment, the above step S4: inputting f_rgb and f_depth into the scale perception module to select feature information at appropriate scales and obtain the scale-aware features f^s_rgb and f^s_depth, specifically comprises:
step S41: extracting multi-scale features from f_rgb and f_depth with a feature pyramid, obtaining {f^k_rgb} and {f^k_depth}, k = 1, ..., 4;
In the embodiment of the invention, an atrous spatial pyramid pooling (ASPP) module extracts the multi-scale features: atrous convolutions with dilation rates {6, 12, 18, 24} extract feature information at 4 scales for RGB and for Depth respectively, yielding {f^k_rgb} and {f^k_depth}.
step S42: fusing {f^k_rgb} and {f^k_depth} into a fusion feature f_cat, and obtaining a multichannel scale weight map through a convolution network;
In this step, a 3-layer network with the structure cat-conv-sm (concatenation layer - convolution layer - softmax layer) is constructed. The cat layer concatenates the multi-scale features {f^k_rgb} of the RGB modality and {f^k_depth} of the Depth modality into the fusion feature f_cat. Because the feature pyramid used here has 4 scales — 4 scales for the RGB features and 4 for the Depth features — the 1 x 1 convolution of the conv layer yields an 8-channel scale weight map.
step S43: performing feature selection with the multichannel scale weight map, obtaining the scale-aware features f^s_rgb and f^s_depth respectively;
In this step, the 8-channel weight map is fed into the sm layer for weight normalization. The first 4 channels are used to select the RGB scale features and the last 4 channels to select the Depth scale features; after this 3-layer network, the scale-aware features f^s_rgb and f^s_depth are obtained as shown in formulas (1) and (2):
f^s_rgb = sum_{k=1}^{4} w^k_rgb ⊙ f^k_rgb (1)
f^s_depth = sum_{k=1}^{4} w^k_depth ⊙ f^k_depth (2)
where w^k_rgb and w^k_depth are the 4-channel weight maps that select the RGB scale features and the Depth scale features respectively, and ⊙ denotes the Hadamard product.
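The cat-conv-sm selection described for step S4 can be sketched in NumPy. The 1 x 1 convolution weights here are random stand-ins for the trained layer, and all tensor sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
c, h, w, n_scales = 8, 6, 6, 4  # hypothetical sizes; 4 ASPP scales per modality

# Multi-scale features for each modality (stand-ins for the ASPP outputs).
f_rgb = rng.standard_normal((n_scales, c, h, w))
f_depth = rng.standard_normal((n_scales, c, h, w))

# cat -> 1x1 conv -> softmax: an 8-channel scale weight map, normalized per pixel.
fused = np.concatenate([f_rgb, f_depth]).reshape(2 * n_scales * c, h, w)
conv_w = rng.standard_normal((2 * n_scales, 2 * n_scales * c)) * 0.1  # random 1x1 conv
logits = np.einsum('oc,chw->ohw', conv_w, fused)
weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # softmax over 8 channels

# First 4 channels select RGB scales, last 4 select Depth scales;
# the Hadamard-weighted sums have the form of formulas (1)-(2).
w_rgb, w_depth = weights[:n_scales], weights[n_scales:]
fs_rgb = (w_rgb[:, None] * f_rgb).sum(axis=0)
fs_depth = (w_depth[:, None] * f_depth).sum(axis=0)
```

Each pixel thus ends up with its own soft choice among the 8 modality-scale combinations, rather than a single fixed scale for the whole image.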
In one embodiment, the step S5 is as follows: inputting f^s_rgb and f^s_depth respectively into self-attention mechanism modules to expand the receptive field and obtain the features f^a_rgb and f^a_depth, which specifically comprises:
In this step, a self-attention mechanism module is constructed following formulas (3) to (5), and f^s_rgb and f^s_depth are input into it to obtain the features f^a_rgb and f^a_depth. For an input feature x of size c x h x w, reshaped to c x N with N = h x w, three copies Q, K, V of the feature are produced, and:
f = softmax(K^T Q) (3)
y = V f (4)
f^a = λ y + β x (5)
where c denotes the number of feature channels and λ and β are weight parameters.
Each element of f in formula (3) characterizes the self-correlation of the feature: by computing the similarity between each element of the feature map and all the remaining elements, the relationship between the current element and all the others is determined. In this way, the problem of the small receptive field of the extracted features is alleviated. So that the network does not rely entirely on the attention-mechanism features during training, the weight parameters λ and β are introduced to adjust the degree to which the attention features participate at different stages of training the RGB and Depth networks. In this way, f^a_rgb and f^a_depth are obtained.
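The self-attention computation of this step can be sketched directly in NumPy. The three projections producing Q, K, V are random matrices here standing in for the trained 1 x 1 convolutions, and the λ, β values are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
c, h, w = 8, 5, 5
N = h * w
x = rng.standard_normal((c, N))  # flattened feature map, one column per position

# Three "copies" of the input feature (random projections, purely illustrative).
Q = rng.standard_normal((c, c)) @ x
K = rng.standard_normal((c, c)) @ x
V = rng.standard_normal((c, c)) @ x

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

f = softmax(K.T @ Q, axis=0)  # N x N similarity between every pair of positions
y = V @ f                     # each position aggregates information from all others
lam, beta = 0.5, 1.0          # weight parameters lambda and beta
fa = lam * y + beta * x       # attention feature, same shape as the input
```

Because f relates every position to every other position, the effective receptive field of the output covers the whole feature map in a single step, which is the point of this module.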
As shown in fig. 3, in one embodiment, step S6 described above: inputting the features f^a_rgb and f^a_depth into the modality-adaptive module, computing the modality-adaptive weights, and fusing f^a_rgb and f^a_depth with those weights to obtain a pixel-by-pixel semantic classification of the image, specifically comprises the following steps:
step S61: constructing the modality-adaptive module as a 4-layer network with the structure cat'-conv'-sm'-mul' (concatenation layer - convolution layer - softmax layer - matrix multiplication layer), where the cat' layer fuses the two features f^a_rgb and f^a_depth into the fusion feature f'_cat;
step S62: inputting the fusion feature f'_cat into the conv' layer to obtain a weight mask map of size 2 x h x w;
step S63: regularizing the weight mask map with the sm' layer and separating it by channel, obtaining w_rgb and w_depth;
step S64: using the mul' layer, applying the weight mask of one channel to f^a_rgb to obtain P_rgb according to formula (6), and applying the weight mask of the other channel to f^a_depth to obtain P_depth according to formula (7); adding the two according to formula (8) yields the pixel-by-pixel classification result P of the image:
P_rgb = w_rgb ⊙ f^a_rgb (6)
P_depth = w_depth ⊙ f^a_depth (7)
P = P_rgb + P_depth (8)
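Steps S61-S64 can be sketched as follows. The conv' weights are random stand-ins for the trained layer, and the feature sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
c, h, w = 8, 6, 6
fa_rgb = rng.standard_normal((c, h, w))   # attention features of the two modalities
fa_depth = rng.standard_normal((c, h, w))

# cat' -> conv': a 2-channel weight mask from the fused features.
fused = np.concatenate([fa_rgb, fa_depth])      # 2c x h x w
conv_w = rng.standard_normal((2, 2 * c)) * 0.1  # random stand-in for the conv' layer
mask = np.einsum('oc,chw->ohw', conv_w, fused)  # 2 x h x w weight mask map

# sm': normalize across the two channels so the modality weights sum to 1 per pixel.
mask = np.exp(mask) / np.exp(mask).sum(axis=0, keepdims=True)
w_rgb, w_depth = mask[0], mask[1]

# mul' and the final sum: modality-adaptive fusion of the two branches.
P_rgb = w_rgb * fa_rgb
P_depth = w_depth * fa_depth
P = P_rgb + P_depth
```

Because the softmax ties the two channels together, each pixel trades off the RGB branch against the Depth branch adaptively, rather than using a fixed global mixing ratio.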
Example two
As shown in fig. 4, an embodiment of the present invention provides an indoor scene pixel-by-pixel semantic classifier construction system based on RGB-D, which includes the following modules:
the image acquisition module 71 is used for acquiring images of indoor scenes and acquiring RGB data and Depth data;
a category labeling module 72, configured to define a category of an object in the image and label each pixel thereof with a category;
the feature extraction module 73, used for inputting the RGB data and Depth data into respective feature extraction modules, inputting the RGB data into a depth estimation module at the same time, and supervising the RGB feature extraction process with that module to obtain the corresponding features f_rgb and f_depth;
a scale-aware feature extraction module 74, for inputting f_rgb and f_depth into the scale perception module to select feature information at appropriate scales and obtain the scale-aware features f^s_rgb and f^s_depth;
an attention feature extraction module 75, for inputting f^s_rgb and f^s_depth respectively into self-attention mechanism modules to expand the receptive field and obtain the features f^a_rgb and f^a_depth;
an image classification module 76, for inputting f^a_rgb and f^a_depth into the modality-adaptive module, computing the modality-adaptive weights, and fusing f^a_rgb and f^a_depth with those weights to obtain a pixel-by-pixel semantic classification of the image.
The above examples are provided for the purpose of describing the present invention only and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalents and modifications that do not depart from the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (4)
1. An indoor scene pixel-by-pixel semantic classifier construction method based on RGB-D is characterized by comprising the following steps:
step S1: image acquisition is carried out on an indoor scene, and RGB data and Depth data are obtained;
step S2: defining the object categories in the image, and labeling each pixel with its category;
step S3: inputting the RGB data and Depth data into respective feature extraction modules, and simultaneously inputting the RGB data into a depth estimation module which supervises the RGB feature extraction process, to obtain the corresponding features f_rgb and f_depth;
step S4: inputting f_rgb and f_depth into a scale perception module to select feature information at appropriate scales and obtain the scale-aware features f^s_rgb and f^s_depth, specifically comprising:
step S41: extracting multi-scale features from f_rgb and f_depth with a feature pyramid, obtaining {f^k_rgb} and {f^k_depth};
step S42: fusing {f^k_rgb} and {f^k_depth} into a fusion feature f_cat, and obtaining a multichannel scale weight map through a convolution network;
step S43: performing feature selection with the multichannel scale weight map, obtaining the scale-aware features f^s_rgb and f^s_depth respectively;
step S5: inputting f^s_rgb and f^s_depth respectively into self-attention mechanism modules to expand the receptive field, obtaining the features f^a_rgb and f^a_depth;
step S6: inputting f^a_rgb and f^a_depth into a modality-adaptive module, computing the modality-adaptive weights, and fusing f^a_rgb and f^a_depth with those weights to obtain a pixel-by-pixel semantic classification of the image, specifically comprising:
step S61: constructing the modality-adaptive module as a 4-layer network with the structure cat'-conv'-sm'-mul', where the cat' layer fuses the two features f^a_rgb and f^a_depth into a fusion feature f'_cat;
step S62: inputting the fusion feature f'_cat into the conv' layer to obtain a weight mask map of size 2 x h x w;
step S63: regularizing the weight mask map with the sm' layer and separating it by channel, obtaining w_rgb and w_depth;
step S64: using the mul' layer, applying the weight mask of one channel to f^a_rgb to obtain P_rgb, and applying the weight mask of the other channel to f^a_depth to obtain P_depth; adding the two yields the pixel-by-pixel classification result of the image.
2. The method for constructing an RGB-D based indoor scene pixel-by-pixel semantic classifier according to claim 1, wherein the step S3: inputting the RGB data and Depth data into respective feature extraction modules while inputting the RGB data into a depth estimation module that supervises the RGB feature extraction process, to obtain the corresponding features f_rgb and f_depth, specifically comprises:
inputting the RGB data and Depth data into an RGB feature extraction module and a Depth feature extraction module respectively for feature extraction; while the RGB features are extracted, simultaneously inputting the RGB data into a depth estimation module, which supervises the RGB feature extraction process, to obtain the corresponding features f_rgb and f_depth respectively.
3. The RGB-D based indoor scene pixel-by-pixel semantic classifier construction method of claim 1, characterized in that said step S5: inputting the scale-aware features f'_rgb and f'_depth into a self-attention mechanism module to expand the receptive field and obtain the features f''_rgb and f''_depth respectively, specifically comprises:
inputting f'_rgb and f'_depth into the self-attention mechanism module respectively to obtain the features f''_rgb and f''_depth, the self-attention mechanism module being given by formulas (3) to (5);
wherein C denotes the number of feature channels, n = h*w is the number of spatial positions, and λ and β are weight parameters.
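Formulas (3) to (5) are not reproduced in this text, so the sketch below assumes the standard non-local self-attention formulation over a (C, h, w) feature map, with `lambda_` and `beta` standing in for the weight parameters λ and β named in the claim; it is an illustration of the general mechanism, not the patent's exact equations.

```python
# Generic non-local self-attention sketch over a (C, h, w) feature map.
# The exact formulas (3)-(5) of the patent are not available here; this
# assumes the common pairwise-similarity + softmax formulation, with
# lambda_ and beta as residual weight parameters.
import numpy as np

def self_attention(f, lambda_=1.0, beta=1.0):
    c, h, w = f.shape
    n = h * w                                # n = h*w, as in the claim
    x = f.reshape(c, n)                      # flatten spatial positions
    attn = x.T @ x                           # (n, n) pairwise similarity
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)  # softmax over positions
    out = x @ attn.T                         # aggregate context, (c, n)
    return (lambda_ * out + beta * x).reshape(c, h, w)
```

Each output position aggregates features from every other position, which is how the module expands the receptive field beyond the convolutional kernel size.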
4. An indoor scene pixel-by-pixel semantic classifier construction system based on RGB-D is characterized by comprising the following modules:
the image acquisition module is used for acquiring images of indoor scenes and acquiring RGB data and Depth data;
the class labeling module is used for defining the object classes in the image and labeling each pixel with its class;
the feature extraction module is used for inputting the RGB data and the Depth data into the respective feature extraction branches while inputting the RGB data into the depth estimation module, and supervising the RGB feature extraction process with that module to obtain the corresponding features f_rgb and f_depth;
the scale-aware feature extraction module is used for inputting f_rgb and f_depth into the scale-aware module, selecting feature information of the appropriate scales, and obtaining the scale-aware features f'_rgb and f'_depth, specifically comprising:
Step S41: extracting multi-scale features from f_rgb and f_depth with a feature pyramid to obtain m_rgb and m_depth;
Step S42: fusing m_rgb and m_depth to obtain a fused feature f_ms, and passing it through a convolutional network to obtain a multi-channel scale weight map;
Step S43: performing feature selection with the multi-channel scale weight map to obtain the scale-aware features f'_rgb and f'_depth respectively;
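Steps S41-S43 can be sketched for one modality as follows; this is a hedged NumPy illustration in which the number of scales, the resampling back to a common resolution, and the 1x1 convolution weights are all stand-in assumptions, not the patented design.

```python
# Hypothetical NumPy sketch of steps S41-S43 for one modality: pyramid
# features at several scales are fused, a per-scale weight map is computed,
# and a per-pixel weighted sum selects the appropriate scale.
import numpy as np

rng = np.random.default_rng(1)
c, h, w, s = 4, 8, 8, 3                 # channels, size, number of scales

# S41: multi-scale pyramid features (assumed already resampled to h x w)
pyramid = rng.standard_normal((s, c, h, w))

# S42: fuse the scales and derive an s-channel scale weight map (1x1 conv)
fused = pyramid.reshape(s * c, h, w)
w_conv = rng.standard_normal((s, s * c))
weights = np.einsum('ok,khw->ohw', w_conv, fused)        # (s, h, w)
weights = np.exp(weights - weights.max(axis=0, keepdims=True))
weights /= weights.sum(axis=0, keepdims=True)            # softmax over scales

# S43: per-pixel weighted combination of the scale features
f_scale_aware = (weights[:, None] * pyramid).sum(axis=0)  # (c, h, w)
```

The softmax over the scale axis means each pixel softly selects which pyramid level contributes most, which is one plausible reading of "selecting proper scale feature information".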
the attention feature extraction module is used for inputting f'_rgb and f'_depth into the self-attention mechanism modules respectively to expand the receptive field and obtain the features f''_rgb and f''_depth;
the image classification module is used for inputting the features f''_rgb and f''_depth into the modality-adaptive module, calculating the modality-adaptive weights, and fusing f''_rgb and f''_depth with those weights to obtain the pixel-by-pixel semantic classification of the image, specifically comprising:
Step S61: constructing the modality-adaptive module as a 4-layer network structure cat'-conv'-sm'-mul', wherein the cat' layer fuses the two features f''_rgb and f''_depth together into a fused feature f_cat;
Step S62: inputting the fused feature f_cat into the conv' layer to obtain a 2*h*w weight mask map;
Step S63: normalizing the weight mask map with the sm' layer and splitting it by channel to obtain the masks M_rgb and M_depth respectively;
Step S64: using the mul' layer, applying the weight mask M_rgb of one channel to f''_rgb to obtain P_rgb, and applying the weight mask M_depth of the other channel to f''_depth to obtain P_depth; adding the two gives the pixel-by-pixel classification result of the image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110498856.6A CN113222003B (en) | 2021-05-08 | 2021-05-08 | Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113222003A CN113222003A (en) | 2021-08-06 |
CN113222003B (en) | 2023-08-01 |
Family
ID=77091864
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110498856.6A Active CN113222003B (en) | 2021-05-08 | 2021-05-08 | Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113222003B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108596102A (en) * | 2018-04-26 | 2018-09-28 | 北京航空航天大学青岛研究院 | Indoor scene object segmentation grader building method based on RGB-D |
CN108985247A (en) * | 2018-07-26 | 2018-12-11 | 北方工业大学 | Multispectral image urban road identification method |
CN110298361A (en) * | 2019-05-22 | 2019-10-01 | 浙江省北大信息技术高等研究院 | A kind of semantic segmentation method and system of RGB-D image |
WO2019232836A1 (en) * | 2018-06-04 | 2019-12-12 | 江南大学 | Multi-scale sensing pedestrian detection method based on improved full convolutional network |
CN111191650A (en) * | 2019-12-30 | 2020-05-22 | 北京市新技术应用研究所 | Object positioning method and system based on RGB-D image visual saliency |
CN111563418A (en) * | 2020-04-14 | 2020-08-21 | 浙江科技学院 | Asymmetric multi-mode fusion significance detection method based on attention mechanism |
CN111582316A (en) * | 2020-04-10 | 2020-08-25 | 天津大学 | RGB-D significance target detection method |
CN111967477A (en) * | 2020-07-02 | 2020-11-20 | 北京大学深圳研究生院 | RGB-D image saliency target detection method, device, equipment and storage medium |
Non-Patent Citations (6)
Title |
---|
Depth-Aware CNN for RGB-D Segmentation; Weiyue Wang et al.; Computer Vision - ECCV 2018; 144-161 *
Scale-aware network with modality-awareness for RGB-D indoor semantic segmentation; Feng Zhou et al.; Neurocomputing, vol. 492; 464-473 *
TSNet: Three-Stream Self-Attention Network for RGB-D Indoor Semantic Segmentation; Wujie Zhou et al.; IEEE Intelligent Systems, vol. 36, no. 4; 73-78 *
RGB-D based reverse-fusion instance segmentation algorithm; Wang Dandan et al.; Journal of Graphics, vol. 42, no. 5; 767-774 *
Human action recognition model based on an improved deep neural network; He Bingqian et al.; Application Research of Computers, vol. 36, no. 10; 3107-3111 *
RGB-D image semantic segmentation network based on a channel attention mechanism; Wu Zihan et al.; Electronic Design Engineering, vol. 28, no. 13; 147-153, 159 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298262B (en) | Object identification method and device | |
CN111052126B (en) | Pedestrian attribute identification and positioning method and convolutional neural network system | |
CN109559320B (en) | Method and system for realizing visual SLAM semantic mapping function based on hole convolution deep neural network | |
CN111814661B (en) | Human body behavior recognition method based on residual error-circulating neural network | |
CN109948475B (en) | Human body action recognition method based on skeleton features and deep learning | |
CN112990211B (en) | Training method, image processing method and device for neural network | |
CN109902548B (en) | Object attribute identification method and device, computing equipment and system | |
CN112800903B (en) | Dynamic expression recognition method and system based on space-time diagram convolutional neural network | |
Cadena et al. | Pedestrian graph: Pedestrian crossing prediction based on 2d pose estimation and graph convolutional networks | |
CN111696110B (en) | Scene segmentation method and system | |
CN110222718B (en) | Image processing method and device | |
CN113344932B (en) | Semi-supervised single-target video segmentation method | |
CN112329525A (en) | Gesture recognition method and device based on space-time diagram convolutional neural network | |
CN111428664B (en) | Computer vision real-time multi-person gesture estimation method based on deep learning technology | |
CN112801015A (en) | Multi-mode face recognition method based on attention mechanism | |
Khan et al. | A deep survey on supervised learning based human detection and activity classification methods | |
CN111832592A (en) | RGBD significance detection method and related device | |
Grigorev et al. | Depth estimation from single monocular images using deep hybrid network | |
CN111209873A (en) | High-precision face key point positioning method and system based on deep learning | |
Manssor et al. | Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network | |
Jiang et al. | Unsupervised monocular depth perception: Focusing on moving objects | |
Yang et al. | [Retracted] A Method of Image Semantic Segmentation Based on PSPNet | |
CN116740419A (en) | Target detection method based on graph regulation network | |
CN114067273A (en) | Night airport terminal thermal imaging remarkable human body segmentation detection method | |
Jiang et al. | An end-to-end human segmentation by region proposed fully convolutional network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||