CN110096961A - Superpixel-level indoor scene semantic annotation method - Google Patents

Superpixel-level indoor scene semantic annotation method Download PDF

Info

Publication number
CN110096961A
CN110096961A CN201910269599.1A
Authority
CN
China
Prior art keywords
pixel
super
superpixel
seg
color
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910269599.1A
Other languages
Chinese (zh)
Other versions
CN110096961B (en)
Inventor
王立春
陆建霖
王少帆
孔德慧
李敬华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201910269599.1A priority Critical patent/CN110096961B/en
Publication of CN110096961A publication Critical patent/CN110096961A/en
Application granted granted Critical
Publication of CN110096961B publication Critical patent/CN110096961B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/35Categorising the entire scene, e.g. birthday party or wedding scene
    • G06V20/36Indoor scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed is a superpixel-level indoor scene semantic annotation method, which avoids the huge computation cost of applying deep networks to pixel-level indoor scene annotation and enables a deep network to accept a set of superpixels as input. The superpixel-level indoor scene semantic annotation method comprises the following steps: (1) perform superpixel segmentation on the indoor scene color image using the simple linear iterative clustering (SLIC) segmentation algorithm; (2) combine the superpixels obtained in step (1) with the indoor scene depth image to extract superpixel kernel descriptor features (primary features); (3) construct the neighborhood of each superpixel; (4) construct the superpixel deep network SuperPixelNet and learn multi-modal superpixel features; for each superpixel to be annotated, combine its multi-modal features with those of its neighborhood superpixels to produce the superpixel-level semantic annotation of the indoor scene RGB-D image.

Description

Indoor scene semantic annotation method at super-pixel level
Technical Field
The invention relates to the technical field of multimedia technology and computer graphics, in particular to a super-pixel level indoor scene semantic annotation method.
Background
Indoor scene semantic annotation, a necessary task in computer vision research, has long been a research hotspot and a difficulty in the field of image processing. Compared with outdoor scenes, indoor scenes have the following characteristics: 1. the variety of objects is complex; 2. occlusion between objects is more severe; 3. scenes differ greatly from one another; 4. illumination is non-uniform; 5. strongly discriminative features are lacking. Therefore, indoor scenes are more difficult to label than outdoor scenes. Semantic annotation of indoor scenes is the core of indoor scene understanding and has wide applications in fields such as service and fire fighting, for example mobile positioning and environment interaction of robots, and event detection in security monitoring.
Scene semantic labeling (or scene semantic segmentation) labels each pixel in an image with the object class to which it belongs. It is a challenging task because it combines the traditional problems of detection, segmentation, and multi-label recognition in a single process. An RGB-D image consists of a color image and a depth image acquired synchronously, containing the color and depth information of the scene. A depth image, or range image, is a special image in which each pixel records the depth of the corresponding point in the actual scene. Compared with an RGB image, the depth image is less affected by illumination, shadows and the like and better conveys the original appearance of a scene, so it is widely used for indoor scenes. Studies by Silberman and Fergus show that, for semantic segmentation of indoor scenes, experimental results using RGB-D data are significantly better than results using RGB data alone.
In the research on semantic annotation of indoor scene RGB(-D) images, scene semantic annotation methods can be roughly divided into two categories according to their input: one uses pixels as the basic annotation unit (pixel-level annotation), and the other uses superpixels as the basic annotation unit (superpixel-level annotation).
Since the advent of AlexNet in 2012, deep networks have achieved great success and wide application in image classification and image detection, and have proven to be a powerful tool for extracting image features. Deep networks are also widely used for semantic segmentation; deep-network-based indoor scene semantic annotation is pixel-level annotation, and related methods fall roughly into two types. The first type is based on the fully convolutional network (FCN). Jonathan Long et al. proposed the FCN for image semantic segmentation in 2015, enabling end-to-end training for image semantic segmentation. However, the FCN is poor at describing object boundaries and shape structure. To learn more contextual information, Liang-Chieh Chen et al. used conditional random fields (CRF) to integrate global context and object structure information into the FCN.
The second type is based on encoder-decoder network architectures. SegNet and DeconvNet are typical encoder-decoder structures. The decoder is the key network structure of SegNet: its decoders proceed layer by layer, and each decoder has a corresponding encoder. DeconvNet treats semantic segmentation as an instance-wise segmentation problem: it performs pixel-wise semantic segmentation on every object proposal in the picture and finally integrates the segmentation results of all object proposals into the semantic segmentation result of the whole picture. DeconvNet has a large number of network parameters, which makes training very difficult and the time overhead of the testing phase very large, and object proposal extraction and network semantic segmentation are two separate steps.
A superpixel-level indoor scene semantic annotation method first segments the indoor scene image into superpixels according to pixel similarity, then extracts superpixel features, and finally classifies the superpixels. Liefeng Bo and Xiaofeng Ren proposed four types of feature representations for indoor scene recognition in 2011, namely a size kernel descriptor (extracting the physical size of an object), a shape kernel descriptor (extracting its three-dimensional shape), a gradient kernel descriptor (extracting its depth information) and a local binary kernel descriptor (extracting its local texture); these are superior to traditional 3D features (such as Spin Image) and greatly improve the accuracy of object recognition in RGB-D indoor scenes. Xiaofeng Ren et al. used depth kernel descriptors to describe superpixel features in 2012 and modeled the context between superpixels over a segmentation tree with a Markov random field, raising the indoor scene semantic annotation accuracy on the NYU v1 dataset from 56.6% to 76.1%.
The drawback of pixel-level indoor scene semantic annotation methods is the high computation cost caused by the large number of pixels. Dividing the image into superpixels reduces the computation cost, but the positional relationship among superpixels is no longer regular: the superpixels obtained from one image form an unordered set, whereas most deep networks require input data in a regular matrix form, hence the contradiction.
Disclosure of Invention
In order to overcome the defects of the prior art, the technical problem to be solved by the invention is to provide a superpixel-level indoor scene semantic annotation method, which can avoid the problem of huge calculation cost when a deep network is applied to pixel-level indoor scene annotation, and can enable the deep network to accept a superpixel set as input.
The technical scheme of the invention is as follows: the method for labeling the indoor scene semantics at the superpixel level comprises the following steps:
(1) performing super-pixel segmentation on the color image of the indoor scene by using a simple linear iterative clustering segmentation algorithm;
(2) extracting superpixel kernel descriptor features (primary features) by combining the superpixels obtained in the step (1) with the indoor scene depth images;
(3) constructing a neighborhood of the superpixel;
(4) constructing the superpixel deep network SuperPixelNet and learning multi-modal superpixel features; for each superpixel to be annotated, combining its multi-modal features with those of its neighborhood superpixels to produce the superpixel-level semantic annotation of the indoor scene RGB-D image.
The invention constructs a deep network structure, SuperPixelNet, for processing unordered superpixel sets. The network takes an unordered superpixel and its neighborhood superpixels as input and produces superpixel-level semantic annotation of the indoor scene RGB-D image, so it avoids the huge computation cost of applying a deep network to pixel-level indoor scene annotation and enables the deep network to accept a superpixel set as input.
Drawings
FIG. 1 is a flow chart of a method for semantic labeling of indoor scenes at a superpixel level according to the present invention.
FIG. 2 is a SuperPixelNet superpixel depth network structure of the superpixel-level indoor scene semantic annotation method according to the present invention.
Detailed Description
The invention provides a superpixel deep network for superpixel-level semantic annotation of RGB-D indoor scenes. First, the SLIC algorithm is used to perform superpixel segmentation of the RGB-D indoor scene image. The neighborhood superpixels of each superpixel are then found according to the rule described below, and the superpixel to be annotated is called the core superpixel. The kernel descriptor features and geometric features (primary features) of the core superpixel and its neighborhood superpixels are used as the input of the superpixel deep network to learn their multi-modal fusion features; the neighborhood context feature of the core superpixel is learned from the multi-modal fusion features of the core superpixel and its neighborhood superpixels and concatenated with the multi-modal fusion feature of the core superpixel as the feature representation for superpixel classification, thereby realizing superpixel-level semantic annotation of indoor scene RGB-D images. As shown in fig. 1, the superpixel-level indoor scene semantic annotation method includes the following steps:
(1) performing super-pixel segmentation on the color image of the indoor scene by using a simple linear iterative clustering segmentation algorithm;
(2) extracting superpixel kernel descriptor features (primary features) by combining the superpixels obtained in the step (1) with the indoor scene depth images;
(3) constructing a neighborhood of the superpixel;
(4) constructing the superpixel deep network SuperPixelNet and learning multi-modal superpixel features; for each superpixel to be annotated, combining its multi-modal features with those of its neighborhood superpixels to produce the superpixel-level semantic annotation of the indoor scene RGB-D image.
The invention constructs a deep network structure, SuperPixelNet, for processing unordered superpixel sets. The network takes an unordered superpixel and its neighborhood superpixels as input and produces superpixel-level semantic annotation of the indoor scene RGB-D image, so it avoids the huge computation cost of applying a deep network to pixel-level indoor scene annotation and enables the deep network to accept a superpixel set as input.
The invention is tested on the NYU v1 RGB-D dataset, which contains 2284 frames with 13 categories in total. The dataset is partitioned into two disjoint subsets for training and testing: the training set contains 1370 frames and the test set contains 914 frames.
The method provided by the invention comprises the following specific steps:
Simple Linear Iterative Clustering (SLIC) extends the K-Means clustering algorithm and is a simple and efficient method for constructing superpixels. Preferably, step (1) comprises the following substeps:
(1.1) converting the image from an RGB color space to an LAB color space;
(1.2) firstly, determining a parameter K, namely the number of the super pixels obtained by segmentation;
(1.3) compute the step length S = sqrt(N/K), where N is the number of pixels in the image; using S as the step length, uniformly initialize K cluster centers c_j, 1 ≤ j ≤ K, and set each cluster-center label L(c_j) = j;
(1.4) for each cluster center c_j and every pixel q ∈ Nb_3(c_j) = {(x_q, y_q) | x_j − 2 ≤ x_q ≤ x_j + 2, y_j − 2 ≤ y_q ≤ y_j + 2} in its 3 × 3 neighborhood, compute the LAB color gradient CD(q);
if a pixel q in the neighborhood attains the minimum color gradient value and CD(q) ≤ CD(c_j), move the cluster center to it: x_j = x_q, y_j = y_q;
(1.5) for every pixel i in the image other than the cluster centers, with coordinates (x_i, y_i), set the label L(i) = −1 and the distance d(i) = ∞;
(1.6) for each cluster center c_j and every pixel i ∈ Nb_2S(c_j) = {(x_i, y_i) | x_j − 2S − 1 ≤ x_i ≤ x_j + 2S + 1, y_j − 2S − 1 ≤ y_i ≤ y_j + 2S + 1} in its 2S × 2S neighborhood, compute its distance to c_j as D(i, c_j) = sqrt(d_c² + (d_s/S)²·m²), where the color distance d_c = sqrt((l_i − l_j)² + (a_i − a_j)² + (b_i − b_j)²) and the spatial distance d_s = sqrt((x_i − x_j)² + (y_i − y_j)²); here (x_i, y_i) and (l_i, a_i, b_i) are the coordinates of pixel i and its color value in LAB color space, and (x_j, y_j) and (l_j, a_j, b_j) are those of the cluster center c_j. The variable m balances the influence of the color distance and the spatial distance on pixel similarity: the larger m is, the greater the influence of the spatial distance and the more compact the superpixels; the smaller m is, the greater the influence of the color distance and the better the superpixels adhere to image edges;
(1.7) if D(i, c_j) < d(i), set L(i) = L(c_j) = j and d(i) = D(i, c_j);
(1.8) repeat steps (1.6)–(1.7) until all cluster centers c_j have been traversed;
(1.9) all pixels with label value j constitute the j-th superpixel SP_j = {(x_i, y_i) | L(i) = j}, 1 ≤ j ≤ K; compute the centroid c'_j(x'_j, y'_j) of superpixel SP_j, where x'_j and y'_j are the means of the x and y coordinates of the pixels in SP_j, and take c'_j as the new cluster center of SP_j; the color value (l'_j, a'_j, b'_j) of the new cluster center c'_j in LAB color space is the mean color of the pixels in the superpixel;
(1.10) accumulate the Euclidean distances between all new cluster centers and the old cluster centers: E = Σ_{j=1..K} ‖c'_j − c_j‖;
(1.11) if E is greater than a given threshold, repeat steps (1.6)–(1.10); otherwise the algorithm terminates and yields K superpixels (an illustrative code sketch of this segmentation step is given below).
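As an illustration only (not part of the patented method), step (1) can be approximated with the off-the-shelf SLIC implementation in scikit-image; the function and argument names below (slic, n_segments, compactness, convert2lab) are the library's, K and m play the roles of the parameters described above, and the wrapper name segment_superpixels is ours.

# Illustration only: approximate step (1) with scikit-image's SLIC implementation.
import numpy as np
from skimage import io
from skimage.segmentation import slic

def segment_superpixels(rgb_path, K=1000, m=10.0):
    """Return an H x W label map with one integer label per superpixel."""
    image = io.imread(rgb_path)              # H x W x 3, RGB
    labels = slic(image,
                  n_segments=K,              # desired number of superpixels
                  compactness=m,             # balances color vs. spatial distance
                  convert2lab=True,          # work in LAB color space, as in (1.1)
                  start_label=0)
    return labels

# Example: labels = segment_superpixels("scene.png", K=1000, m=10.0)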
The step (2) comprises the following sub-steps:
(2.1) Patch feature calculation:
A Patch is defined as a 16 × 16 grid (the grid size can be adapted to the actual data) that slides to the right and downward from the upper-left corner of the color image (RGB) and the depth image (Depth) with a step length of n pixels (n = 2 in the experiments of the invention), finally forming a dense grid over both images; for an image of size N × M, the number of Patches obtained is (⌊(N − 16)/n⌋ + 1) × (⌊(M − 16)/n⌋ + 1) (see the sketch after step (2.2)). Four types of features are computed for each Patch: depth gradient features, color gradient features, color features and texture features;
(2.2) obtaining superpixel features:
The superpixel feature F_seg is given by formula (5) and consists of the superpixel depth gradient feature, the superpixel color gradient feature, the superpixel color feature and the superpixel texture feature, each obtained by formula (6) as the average of the corresponding Patch features,
where F_g_d(i), F_g_c(i), F_col(i), F_tex(i) denote the features of the i-th Patch whose center position falls within the superpixel seg, and n denotes the number of Patches whose center positions fall within the superpixel seg.
The superpixel geometric features are defined by formula (7),
where the superpixel area A_seg = Σ_{s∈seg} 1, s being the pixels within the superpixel seg, and the superpixel perimeter P_seg is defined by formula (8),
in which M and N denote the horizontal and vertical resolution of the RGB scene image, respectively; seg and seg' denote different superpixels; N_4(s) is the 4-neighborhood of pixel s; and B_seg is the set of boundary pixels of the superpixel seg.
The area-to-perimeter ratio of the superpixel is R_seg = A_seg / P_seg, formula (9).
Second-order Hu moments (orders 2+0 = 2, 0+2 = 2, and the cross term) are computed from the x coordinate s_x of pixel s, the y coordinate s_y, and the product of the x and y coordinates, and are defined by formulas (10), (11) and (12),
where the quantities in formula (13) are the mean of the x coordinates, the mean of the y coordinates, the mean of the squared x coordinates and the mean of the squared y coordinates of the pixels contained in the superpixel, defined by formula (13);
Width and Height denote the image width and height, respectively, and the computation is based on pixel coordinate values normalized by them.
D_var and the related depth statistics are, respectively, the mean of the depth values s_d of the pixels s within the superpixel seg, the mean of the squared depth values, and the variance of the depth values, defined by formula (14).
D_miss, the proportion of pixels in the superpixel with missing depth information, is defined by formula (15).
N_seg is the modulus of the principal normal vector of the point cloud corresponding to the superpixel, where this principal normal vector is estimated by principal component analysis (PCA). (A sketch of the Patch-grid enumeration and the per-superpixel feature aggregation is given below.)
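The following sketch (an illustration under our own naming, not the patent's code) enumerates the dense 16 × 16 Patch grid of step (2.1) and averages per-Patch features over each superpixel in the spirit of formula (6); the random placeholder features stand in for the kernel descriptor features described below.

# Illustration only: Patch-grid enumeration and per-superpixel averaging of Patch features.
import numpy as np

def patch_centers(height, width, patch=16, stride=2):
    """Centers (y, x) of all 16x16 patches fully inside a height x width image."""
    ys = np.arange(0, height - patch + 1, stride) + patch // 2
    xs = np.arange(0, width - patch + 1, stride) + patch // 2
    return [(y, x) for y in ys for x in xs]

def superpixel_features(labels, centers, patch_feats):
    """Mean of the features of the patches whose centers fall inside each superpixel."""
    n_sp = labels.max() + 1
    dim = patch_feats.shape[1]
    sums = np.zeros((n_sp, dim))
    counts = np.zeros(n_sp)
    for (y, x), f in zip(centers, patch_feats):
        k = labels[y, x]                 # superpixel containing this patch center
        sums[k] += f
        counts[k] += 1
    counts[counts == 0] = 1              # avoid division by zero for tiny superpixels
    return sums / counts[:, None]

# Toy usage with random 200-d patch features on a 480x640 label map:
labels = np.random.randint(0, 1000, size=(480, 640))
centers = patch_centers(480, 640)
patch_feats = np.random.randn(len(centers), 200)
F_seg = superpixel_features(labels, centers, patch_feats)   # shape (1000, 200)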
The depth gradient feature of step (2.1) is computed as follows:
a Patch in the depth image is denoted Z_d; for each Z_d the depth gradient feature F_g_d is computed, the value of its t-th component being defined by formula (1),
where z ∈ Z_d denotes the relative two-dimensional coordinate position of pixel z in the depth Patch; the formula involves the depth gradient direction and gradient magnitude of pixel z; the depth gradient basis vectors and the position basis vectors, two sets of predefined values whose numbers are d_g and d_s respectively; the mapping coefficient of the t-th principal component obtained by applying kernel principal component analysis (KPCA); the Kronecker product; and a depth gradient Gaussian kernel function and a position Gaussian kernel function together with their corresponding parameters. The EMK (efficient match kernel) algorithm is then used to transform the depth gradient feature, and the transformed feature vector is still denoted F_g_d.
The color gradient feature of step (2.1) is computed as follows:
a Patch in the color image is denoted Z_c; for each Z_c the color gradient feature F_g_c is computed, the value of its t-th component being defined by formula (2),
where z ∈ Z_c denotes the relative two-dimensional coordinate position of pixel z in the color-image Patch; the formula involves the gradient direction and gradient magnitude of pixel z; the color gradient basis vectors and the position basis vectors, two sets of predefined values whose numbers are c_g and c_s respectively; the mapping coefficient of the t-th principal component obtained by applying kernel principal component analysis (KPCA); the Kronecker product; and a color gradient Gaussian kernel function and a position Gaussian kernel function together with their corresponding parameters. The EMK (efficient match kernel) algorithm is then used to transform the color gradient feature, and the transformed feature vector is still denoted F_g_c.
The color feature of step (2.1) is computed as follows:
a Patch in the color image is denoted Z_c; for each Z_c the color feature F_col is computed, the value of its t-th component being defined by formula (3),
where z ∈ Z_c denotes the relative two-dimensional coordinate position of pixel z in the color-image Patch; r(z) is a three-dimensional vector, the RGB value of pixel z; the formula further involves the color basis vectors and the position basis vectors, two sets of predefined values whose numbers are c_c and c_s respectively; the mapping coefficient of the t-th principal component obtained by applying kernel principal component analysis (KPCA); the Kronecker product; and a color Gaussian kernel function and a position Gaussian kernel function together with their corresponding parameters. The EMK (efficient match kernel) algorithm is then used to transform the color feature, and the transformed feature vector is still denoted F_col.
The texture feature of step (2.1) is computed as follows:
the RGB scene image is first converted into a gray-scale image, and a Patch in the gray-scale image is denoted Z_g; for each Z_g the texture feature F_tex is computed, the value of its t-th component being defined by formula (4),
where z ∈ Z_g denotes the relative two-dimensional coordinate position of pixel z in the Patch; s(z) denotes the standard deviation of the pixel gray values in the 3 × 3 region centered on pixel z; LBP(z) is the local binary pattern (LBP) feature of pixel z; the formula further involves the local binary pattern basis vectors and the position basis vectors, two sets of predefined values whose numbers are g_b and g_s respectively; the mapping coefficient of the t-th principal component obtained by applying kernel principal component analysis (KPCA); the Kronecker product; and a local binary pattern Gaussian kernel function and a position Gaussian kernel function together with their corresponding parameters. The EMK (efficient match kernel) algorithm is then used to transform the texture feature, and the transformed feature vector is still denoted F_tex. (An illustrative sketch of the per-pixel gradient and LBP attributes is given below.)
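For illustration, the sketch below computes only the raw per-pixel attributes that the gradient and texture kernel descriptors above are built from (gradient magnitude and orientation, and an 8-neighbour LBP code); the Gaussian match kernels, KPCA projection and EMK transform of formulas (1)–(4) are not reproduced, and all helper names are ours.

# Illustration only: per-pixel gradient and LBP attributes of a single Patch.
import numpy as np

def gradient_attributes(patch):
    """Gradient magnitude and orientation of a 2-D patch (e.g. a 16x16 depth or gray patch)."""
    gy, gx = np.gradient(patch.astype(np.float64))
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    orientation = np.arctan2(gy, gx)          # radians in (-pi, pi]
    return magnitude, orientation

def lbp_attributes(gray_patch):
    """8-neighbour local binary pattern code for each interior pixel of a gray patch."""
    p = gray_patch.astype(np.float64)
    center = p[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(center, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = p[1 + dy:p.shape[0] - 1 + dy, 1 + dx:p.shape[1] - 1 + dx]
        code |= ((neighbour >= center).astype(np.uint8) << bit)
    return code

# Toy usage on a random 16x16 patch:
patch = np.random.rand(16, 16)
mag, ori = gradient_attributes(patch)
lbp = lbp_attributes((patch * 255).astype(np.uint8))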
In step (3), the superpixel set obtained from SLIC segmentation of the indoor scene image is denoted Im, where seg_k denotes the k-th superpixel, regarded as the set of pixels it contains. For any superpixel seg_t ∈ Im, if seg_t shares a common boundary with seg_k, then seg_t is called an adjacent superpixel of seg_k. All adjacent superpixels of seg_k, together with the adjacent superpixels of those adjacent superpixels, constitute the neighborhood superpixel set NS(seg_k) of superpixel seg_k.
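A minimal sketch of the neighborhood construction of step (3), assuming a SLIC label map as input: superpixels that share a boundary (adjacent pixel pairs with different labels) are adjacent, and the neighborhood set NS(seg_k) is taken as the adjacent superpixels of seg_k together with their adjacent superpixels. Function names are ours.

# Illustration only: build NS(seg_k) from a superpixel label map.
import numpy as np

def adjacency(labels):
    """Map each superpixel id to the set of superpixel ids it shares a boundary with."""
    adj = {k: set() for k in np.unique(labels)}
    # horizontally and vertically adjacent pixel pairs with different labels
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        diff = a != b
        for u, v in zip(a[diff].ravel(), b[diff].ravel()):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def neighborhood(adj, k):
    """Adjacent superpixels of k plus the adjacent superpixels of those superpixels."""
    first = set(adj[k])
    second = set()
    for t in first:
        second |= adj[t]
    return (first | second) - {k}

# Toy usage:
labels = np.random.randint(0, 50, size=(60, 80))
adj = adjacency(labels)
print(sorted(neighborhood(adj, 0))[:10])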
As shown in fig. 2, in step (4) the input of the superpixel deep network SuperPixelNet is each superpixel seg_k obtained by segmenting the indoor scene image together with its neighborhood superpixels NS(seg_k), and the output is the score of superpixel seg_k for each semantic category; the network comprises three subnetworks: a multi-modal fusion learning subnetwork, a superpixel neighborhood information fusion subnetwork, and a superpixel classification subnetwork.
In step (4):
the multi-modal converged learning subnetwork comprises 7 branches Bi{ i ═ 1, … …, 7}, each characterized by a superpixel depth gradientSuperpixel color gradient featureSuper pixel color featureSuperpixel textureAnd three classes of superpixel geometryIs input; each branch input is a superpixel segkFeature representation of N superpixels in total with its N-1 neighborhood superpixelsAre all in 200-dimensional state,is in a 3-dimensional mode and has the characteristics of high sensitivity,is in the range of 7-dimension,is 5-dimensional; the first four network branches BiThe structures of { i ═ 1, … … and 4} are the same, and are all one-layer convolution (conv-64), the convolution kernel size is 1 × 1, the convolution step size is 1, and the output channel size is 64 dimensions; the last three network legs BiThe structures of { i ═ 5, 6 and 7} are the same, and are all one-layer convolution (conv-32), the convolution kernel size is 1 × 1, the convolution step size is 1, and the output channel size is 32 dimensions; then connecting the outputs of the three branches, and performing characteristic fusion by a convolution layer (conv-64) with the convolution kernel size of 1 multiplied by 3, the convolution step length of 1 and the output channel size of 64 dimensions; finally, the output of the front four branches is connected with the characteristics obtained by fusing the rear three branches, and the characteristics are fused through a convolutional layer (conv-1024) with a convolutional kernel of 1 multiplied by 5, a convolutional step of 1 and an output channel of 1024 dimensions, so that the multi-mode fusion characteristics of the superpixel are obtained;
the multi-modal fusion features of the N superpixels are used as input, enter a superpixel neighborhood information fusion sub-network, and are subjected to a layer of average pooling operation to obtain fusion features of the N superpixels; averaging the output of the pooling operation, passing through two layersOutputting full connection layers (FC-256 and FC-128) with the dimensionalities of 256 and 128 respectively to obtain final neighborhood characteristics; associating neighborhood features with superpixels segkThe 1024-dimensional multi-modal fusion features are connected, so that the super-pixel features with neighborhood information are obtained;
the super-pixel classification sub-network consists of three convolution layers, the sizes of the convolution kernels are all 1 multiplied by 1, the convolution step sizes are all 1, the output dimensions are respectively 512, 256 and 13(conv-512, conv-256, con-13), a dropout layer is arranged between conv-256 and conv-13, and the dropout probability is 0.5. Finally outputting the super pixel segkScores of the categories to which they belong; the invention adopts NYU V1 data sets collected and sorted by Silberman, Fergus and the like to carry out experiments, wherein the data sets have 13 semantic categories (Bel, Blind, Bookshelf, barrel, inspecting, Floor, Picture, Sofa, Table, TV, Wall, Window, Background) and 7 scenes in total; the data set comprises 2284 frames of color images (RGB) and 2284 frames of Depth images (Depth), wherein the color images correspond to the Depth images one by one, and the resolution of each image is 480 multiplied by 640; according to the traditional division method, 60% of the data set is selected for training and 40% of the data set is selected for testing; based on an NYU V1 data set, a comparison experiment between the method provided by the invention and the method provided by 5 people, such as Silberman, Ren, Salman H.Khan, Anran, Heng and the like, is carried out, and the experimental result is shown in table 1 (class average accuracy rate), so that the method provided by the invention can be seen to obtain an excellent labeling effect in indoor scene semantic labeling; in the invention, the value of N is 10, the network hyper-parameter batch size is set to be 16, the learning rate is set to be 5e-6, the initialization of all parameters in the network uses an Xavier initialization method, the rest convolutional layers and the full-link layer use Relu as an activation function except the last layer which does not use an activation function, the full-link layers FC-256 and FC-128 use a parameter of 0.01 as an L2 regularization parameter, and batch normalization is added to all the convolutional layers.
Table 1 shows the comparison of the present invention with other methods on the NYU v1 data set, from which it can be seen that the present invention is greatly superior to other methods.
TABLE 1
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications, equivalent variations and modifications made to the above embodiment according to the technical spirit of the present invention still belong to the protection scope of the technical solution of the present invention.

Claims (10)

1. A super-pixel level indoor scene semantic annotation method is characterized in that: the method comprises the following steps:
(1) performing super-pixel segmentation on the color image of the indoor scene by using a simple linear iterative clustering segmentation algorithm;
(2) extracting superpixel kernel descriptor features by combining the superpixels obtained in the step (1) with the indoor scene depth image;
(3) constructing a neighborhood of the superpixel;
(4) constructing the superpixel deep network SuperPixelNet and learning multi-modal superpixel features; for each superpixel to be annotated, combining its multi-modal features with those of its neighborhood superpixels to produce the superpixel-level semantic annotation of the indoor scene RGB-D image.
2. The method for semantic labeling of indoor scenes at the superpixel level according to claim 1, characterized in that: the step (1) comprises the following sub-steps:
(1.1) converting the image from an RGB color space to an LAB color space;
(1.2) firstly, determining a parameter K, namely the number of the super pixels obtained by segmentation;
(1.3) compute the step length S = sqrt(N/K), where N is the number of pixels in the image; using S as the step length, uniformly initialize K cluster centers c_j, 1 ≤ j ≤ K, and set each cluster-center label L(c_j) = j;
(1.4) for each cluster center c_j and every pixel q ∈ Nb_3(c_j) = {(x_q, y_q) | x_j − 2 ≤ x_q ≤ x_j + 2, y_j − 2 ≤ y_q ≤ y_j + 2} in its 3 × 3 neighborhood, compute the LAB color gradient CD(q); if a pixel q in the neighborhood attains the minimum color gradient value and CD(q) ≤ CD(c_j), set x_j = x_q, y_j = y_q;
(1.5) for every pixel i in the image other than the cluster centers, with coordinates (x_i, y_i), set the label L(i) = −1 and the distance d(i) = ∞;
(1.6) for each cluster center c_j and every pixel i ∈ Nb_2S(c_j) = {(x_i, y_i) | x_j − 2S − 1 ≤ x_i ≤ x_j + 2S + 1, y_j − 2S − 1 ≤ y_i ≤ y_j + 2S + 1} in its 2S × 2S neighborhood, compute its distance D(i, c_j) to c_j from the color distance and the spatial distance, where (x_i, y_i) and (l_i, a_i, b_i) are the coordinates of pixel i and its color value in LAB color space, and (x_j, y_j) and (l_j, a_j, b_j) are those of the cluster center c_j; the variable m balances the influence of the color distance and the spatial distance on pixel similarity: the larger m is, the greater the influence of the spatial distance and the more compact the superpixels; the smaller m is, the greater the influence of the color distance and the better the superpixels adhere to the image edges;
(1.7) if D(i, c_j) < d(i), set L(i) = L(c_j) = j and d(i) = D(i, c_j);
(1.8) repeat steps (1.6)–(1.7) until all cluster centers c_j have been traversed;
(1.9) all pixels with label value j constitute the j-th superpixel SP_j = {(x_i, y_i) | L(i) = j}, 1 ≤ j ≤ K; compute the centroid c'_j(x'_j, y'_j) of superpixel SP_j, where x'_j and y'_j are the means of the x and y coordinates of the pixels in SP_j, and take c'_j as the new cluster center of SP_j; the color value (l'_j, a'_j, b'_j) of the new cluster center c'_j in LAB color space is the mean color of the pixels in the superpixel;
(1.10) accumulate the Euclidean distances E between all new cluster centers and the old cluster centers;
(1.11) if E is greater than a given threshold, repeating the above steps (1.6) - (1.10); otherwise, the algorithm is finished to obtain K superpixels.
3. The method for semantic labeling of indoor scenes at the superpixel level according to claim 2, characterized in that: the step (2) comprises the following sub-steps:
(2.1) Patch feature calculation:
A Patch is defined as a 16 × 16 grid that slides to the right and downward from the upper-left corner of the color image (RGB) and the depth image (Depth) with a step length of n pixels, finally forming a dense grid over the color image and the depth image; four types of features are computed for each Patch: depth gradient features, color gradient features, color features and texture features;
(2.2) obtaining superpixel features:
The superpixel feature F_seg is given by formula (5) and consists of the superpixel depth gradient feature, the superpixel color gradient feature, the superpixel color feature and the superpixel texture feature, each obtained by formula (6) as the average of the corresponding Patch features,
where F_g_d(i), F_g_c(i), F_col(i), F_tex(i) denote the features of the i-th Patch whose center position falls within the superpixel seg, and n denotes the number of Patches whose center positions fall within the superpixel seg.
The superpixel geometric features are defined by formula (7),
where the superpixel area A_seg = Σ_{s∈seg} 1, s being the pixels within the superpixel seg, and the superpixel perimeter P_seg is defined by formula (8),
in which M and N denote the horizontal and vertical resolution of the RGB scene image, respectively; seg and seg' denote different superpixels; N_4(s) is the 4-neighborhood of pixel s; and B_seg is the set of boundary pixels of the superpixel seg;
the area-to-perimeter ratio of the superpixel is R_seg = A_seg / P_seg, formula (9);
the second-order Hu moments computed from the x coordinate s_x of pixel s, the y coordinate s_y, and the product of the x and y coordinates are defined by formulas (10), (11) and (12),
where the quantities in formula (13) are the mean of the x coordinates, the mean of the y coordinates, the mean of the squared x coordinates and the mean of the squared y coordinates of the pixels contained in the superpixel, defined by formula (13);
Width and Height denote the image width and height, respectively, and the computation is based on pixel coordinate values normalized by them;
D_var and the related depth statistics are, respectively, the mean of the depth values s_d of the pixels s within the superpixel seg, the mean of the squared depth values, and the variance of the depth values, defined by formula (14);
D_miss, the proportion of pixels in the superpixel with missing depth information, is defined by formula (15);
N_seg is the modulus of the principal normal vector of the point cloud corresponding to the superpixel, where this principal normal vector is estimated by principal component analysis (PCA).
4. The method of claim 3, wherein the indoor scene semantic labeling at a superpixel level comprises:
the depth gradient of step (2.1) is characterized by:
a Patch in the depth image is denoted Z_d; for each Z_d the depth gradient feature F_g_d is computed, the value of its t-th component being defined by formula (1),
where z ∈ Z_d denotes the relative two-dimensional coordinate position of pixel z in the depth Patch; the formula involves the depth gradient direction and gradient magnitude of pixel z; the depth gradient basis vectors and the position basis vectors, two sets of predefined values whose numbers are d_g and d_s respectively; the mapping coefficient of the t-th principal component obtained by applying kernel principal component analysis (KPCA); the Kronecker product; and a depth gradient Gaussian kernel function and a position Gaussian kernel function together with their corresponding parameters; the EMK algorithm is used to transform the depth gradient feature, and the transformed feature vector is still denoted F_g_d.
5. The method of claim 4, wherein the indoor scene semantic labeling at a superpixel level comprises: the color gradient characteristic of the step (2.1) is as follows:
a Patch in the color image is denoted Z_c; for each Z_c the color gradient feature F_g_c is computed, the value of its t-th component being defined by formula (2),
where z ∈ Z_c denotes the relative two-dimensional coordinate position of pixel z in the color-image Patch; the formula involves the gradient direction and gradient magnitude of pixel z; the color gradient basis vectors and the position basis vectors, two sets of predefined values whose numbers are c_g and c_s respectively; the mapping coefficient of the t-th principal component obtained by applying kernel principal component analysis (KPCA); the Kronecker product; and a color gradient Gaussian kernel function and a position Gaussian kernel function together with their corresponding parameters; the EMK algorithm is used to transform the color gradient feature, and the transformed feature vector is still denoted F_g_c.
6. The method of claim 5, wherein the indoor scene semantic labeling at a superpixel level comprises: the color characteristics of the step (2.1) are as follows:
a Patch in the color image is denoted Z_c; for each Z_c the color feature F_col is computed, the value of its t-th component being defined by formula (3),
where z ∈ Z_c denotes the relative two-dimensional coordinate position of pixel z in the color-image Patch; r(z) is a three-dimensional vector, the RGB value of pixel z; the formula further involves the color basis vectors and the position basis vectors, two sets of predefined values whose numbers are c_c and c_s respectively; the mapping coefficient of the t-th principal component obtained by applying kernel principal component analysis (KPCA); the Kronecker product; and a color Gaussian kernel function and a position Gaussian kernel function together with their corresponding parameters; the EMK algorithm is used to transform the color feature, and the transformed feature vector is still denoted F_col.
7. The method of claim 6, wherein the indoor scene semantic labeling at a superpixel level comprises: the texture characteristics of the step (2.1) are as follows:
the RGB scene image is first converted into a gray-scale image, and a Patch in the gray-scale image is denoted Z_g; for each Z_g the texture feature F_tex is computed, the value of its t-th component being defined by formula (4),
where z ∈ Z_g denotes the relative two-dimensional coordinate position of pixel z in the Patch; s(z) denotes the standard deviation of the pixel gray values in the 3 × 3 region centered on pixel z; LBP(z) is the local binary pattern (LBP) feature of pixel z; the formula further involves the local binary pattern basis vectors and the position basis vectors, two sets of predefined values whose numbers are g_b and g_s respectively; the mapping coefficient of the t-th principal component obtained by applying kernel principal component analysis (KPCA); the Kronecker product; and a local binary pattern Gaussian kernel function and a position Gaussian kernel function together with their corresponding parameters; the EMK algorithm is used to transform the texture feature, and the transformed feature vector is still denoted F_tex.
8. The method of claim 7, wherein the indoor scene semantic labeling at a superpixel level comprises: in step (3), the superpixel set obtained from SLIC segmentation of the indoor scene image is denoted Im, where seg_k denotes the k-th superpixel, regarded as the set of pixels it contains; for any superpixel seg_t ∈ Im, if seg_t shares a common boundary with seg_k, then seg_t is called an adjacent superpixel of seg_k; all adjacent superpixels of seg_k, together with the adjacent superpixels of those adjacent superpixels, constitute the neighborhood superpixel set NS(seg_k) of superpixel seg_k.
9. The method of claim 8, wherein: in step (4), the input of the superpixel deep network SuperPixelNet is each superpixel seg_k obtained by segmenting the indoor scene image together with its neighborhood superpixels NS(seg_k), and the output is the score of superpixel seg_k for each semantic category, the scores being the basis for determining the final semantic label of the superpixel; the network comprises three subnetworks: a multi-modal fusion learning subnetwork, a superpixel neighborhood information fusion subnetwork, and a superpixel classification subnetwork.
10. The method of claim 9, wherein: in step (4),
the multi-modal fusion learning subnetwork comprises 7 branches B_i, i = 1, ..., 7, whose inputs are, respectively, the superpixel depth gradient feature, the superpixel color gradient feature, the superpixel color feature, the superpixel texture feature, and the three classes of superpixel geometric features; each branch takes as input the feature representation of N superpixels in total, namely the superpixel seg_k and its N−1 neighborhood superpixels; the four kernel descriptor features are each 200-dimensional, and the three classes of geometric features are 3-, 7- and 5-dimensional, respectively; the first four network branches B_i, i = 1, ..., 4, have the same structure: one convolution layer conv-64 with kernel size 1 × 1, stride 1 and a 64-dimensional output channel; the last three network branches B_i, i = 5, 6, 7, also share one structure: one convolution layer conv-32 with kernel size 1 × 1, stride 1 and a 32-dimensional output channel; the outputs of these three branches are then concatenated and fused by a convolution layer conv-64 with kernel size 1 × 3, stride 1 and a 64-dimensional output channel; finally, the outputs of the first four branches are concatenated with the feature obtained by fusing the last three branches and fused by a convolution layer conv-1024 with kernel size 1 × 5, stride 1 and a 1024-dimensional output channel, yielding the multi-modal fusion feature of the superpixel;
the multi-modal fusion features of the N superpixels are fed as input into the superpixel neighborhood information fusion subnetwork, where one average pooling layer aggregates the fusion features of the N superpixels; the pooled output is passed through two fully connected layers FC-256 and FC-128 with output dimensions 256 and 128, respectively, to obtain the final neighborhood feature; the neighborhood feature is concatenated with the 1024-dimensional multi-modal fusion feature of superpixel seg_k, yielding the superpixel feature with neighborhood information;
the superpixel classification subnetwork consists of three convolution layers, all with kernel size 1 × 1 and stride 1, whose output dimensions are 512, 256 and 13, respectively; a dropout layer with dropout probability 0.5 is placed between conv-256 and conv-13, and the network finally outputs the scores of superpixel seg_k for each semantic category; N is set to 10, the batch size is 16, the learning rate is 5e-6, all network parameters are initialized with the Xavier initialization method, all convolution layers and fully connected layers use ReLU as the activation function except the last layer, which uses none, the fully connected layers FC-256 and FC-128 use 0.01 as the L2 regularization parameter, and batch normalization is applied to all convolution layers.
CN201910269599.1A 2019-04-04 2019-04-04 Indoor scene semantic annotation method at super-pixel level Active CN110096961B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910269599.1A CN110096961B (en) 2019-04-04 2019-04-04 Indoor scene semantic annotation method at super-pixel level

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910269599.1A CN110096961B (en) 2019-04-04 2019-04-04 Indoor scene semantic annotation method at super-pixel level

Publications (2)

Publication Number Publication Date
CN110096961A true CN110096961A (en) 2019-08-06
CN110096961B CN110096961B (en) 2021-03-02

Family

ID=67444356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910269599.1A Active CN110096961B (en) 2019-04-04 2019-04-04 Indoor scene semantic annotation method at super-pixel level

Country Status (1)

Country Link
CN (1) CN110096961B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110634142A (en) * 2019-08-20 2019-12-31 长安大学 Complex vehicle road image boundary optimization method
CN112036466A (en) * 2020-08-26 2020-12-04 长安大学 Mixed terrain classification method
CN112241965A (en) * 2020-09-23 2021-01-19 天津大学 Method for generating superpixels and segmenting images based on deep learning
CN112669355A (en) * 2021-01-05 2021-04-16 北京信息科技大学 Method and system for splicing and fusing focusing stack data based on RGB-D super-pixel segmentation
CN114239756A (en) * 2022-02-25 2022-03-25 科大天工智能装备技术(天津)有限公司 Insect pest detection method and system
CN115273645A (en) * 2022-08-09 2022-11-01 南京大学 Map making method for automatically clustering indoor surface elements
CN117137374A (en) * 2023-10-27 2023-12-01 张家港极客嘉智能科技研发有限公司 Sweeping robot recharging method based on computer vision

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130156314A1 (en) * 2011-12-20 2013-06-20 Canon Kabushiki Kaisha Geodesic superpixel segmentation
CN103903257A (en) * 2014-02-27 2014-07-02 西安电子科技大学 Image segmentation method based on geometric block spacing symbiotic characteristics and semantic information
CN104809187A (en) * 2015-04-20 2015-07-29 南京邮电大学 Indoor scene semantic annotation method based on RGB-D data
EP2980754A1 (en) * 2014-07-28 2016-02-03 Thomson Licensing Method and apparatus for generating temporally consistent superpixels
WO2016016033A1 (en) * 2014-07-31 2016-02-04 Thomson Licensing Method and apparatus for interactive video segmentation
CN105513070A (en) * 2015-12-07 2016-04-20 天津大学 RGB-D salient object detection method based on foreground and background optimization
CN106022353A (en) * 2016-05-05 2016-10-12 浙江大学 Image semantic annotation method based on super pixel segmentation
CN107256399A (en) * 2017-06-14 2017-10-17 大连海事大学 A kind of SAR image coastline Detection Method algorithms based on Gamma distribution super-pixel algorithms and based on super-pixel TMF
CN107274419A (en) * 2017-07-10 2017-10-20 北京工业大学 A kind of deep learning conspicuousness detection method based on global priori and local context
WO2017210690A1 (en) * 2016-06-03 2017-12-07 Lu Le Spatial aggregation of holistically-nested convolutional neural networks for automated organ localization and segmentation in 3d medical scans
CN107944428A (en) * 2017-12-15 2018-04-20 北京工业大学 A kind of indoor scene semanteme marking method based on super-pixel collection
CN109345549A (en) * 2018-10-26 2019-02-15 南京览众智能科技有限公司 A kind of natural scene image dividing method based on adaptive compound neighbour's figure
CN109345536A (en) * 2018-08-16 2019-02-15 广州视源电子科技股份有限公司 Image super-pixel segmentation method and device

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130156314A1 (en) * 2011-12-20 2013-06-20 Canon Kabushiki Kaisha Geodesic superpixel segmentation
CN103903257A (en) * 2014-02-27 2014-07-02 西安电子科技大学 Image segmentation method based on geometric block spacing symbiotic characteristics and semantic information
EP2980754A1 (en) * 2014-07-28 2016-02-03 Thomson Licensing Method and apparatus for generating temporally consistent superpixels
WO2016016033A1 (en) * 2014-07-31 2016-02-04 Thomson Licensing Method and apparatus for interactive video segmentation
CN104809187A (en) * 2015-04-20 2015-07-29 南京邮电大学 Indoor scene semantic annotation method based on RGB-D data
CN105513070A (en) * 2015-12-07 2016-04-20 天津大学 RGB-D salient object detection method based on foreground and background optimization
CN106022353A (en) * 2016-05-05 2016-10-12 浙江大学 Image semantic annotation method based on super pixel segmentation
WO2017210690A1 (en) * 2016-06-03 2017-12-07 Lu Le Spatial aggregation of holistically-nested convolutional neural networks for automated organ localization and segmentation in 3d medical scans
CN107256399A (en) * 2017-06-14 2017-10-17 大连海事大学 A kind of SAR image coastline Detection Method algorithms based on Gamma distribution super-pixel algorithms and based on super-pixel TMF
CN107274419A (en) * 2017-07-10 2017-10-20 北京工业大学 A kind of deep learning conspicuousness detection method based on global priori and local context
CN107944428A (en) * 2017-12-15 2018-04-20 北京工业大学 A kind of indoor scene semanteme marking method based on super-pixel collection
CN109345536A (en) * 2018-08-16 2019-02-15 广州视源电子科技股份有限公司 Image super-pixel segmentation method and device
CN109345549A (en) * 2018-10-26 2019-02-15 南京览众智能科技有限公司 A kind of natural scene image dividing method based on adaptive compound neighbour's figure

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
CHARLES R. QI et al.: "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation", 《2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
RADHAKRISHNA ACHANTA et al.: "SLIC Superpixels Compared to State-of-the-Art Superpixel Methods", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》 *
XIAOFENG REN et al.: "RGB-(D) Scene Labeling: Features and Algorithms", 《CVPR12》 *
冯希龙: "Indoor scene semantic segmentation method based on RGB-D images", 《China Master's Theses Full-text Database, Information Science and Technology》 *
熊汉江 et al.: "Semantic segmentation of indoor 3D point cloud models based on 2D-3D semantic transfer", 《Geomatics and Information Science of Wuhan University》 *
费婷婷: "Non-parametric RGB-D scene understanding", 《China Master's Theses Full-text Database, Information Science and Technology》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110634142A (en) * 2019-08-20 2019-12-31 长安大学 Complex vehicle road image boundary optimization method
CN110634142B (en) * 2019-08-20 2024-02-02 长安大学 Complex vehicle road image boundary optimization method
CN112036466A (en) * 2020-08-26 2020-12-04 长安大学 Mixed terrain classification method
CN112241965A (en) * 2020-09-23 2021-01-19 天津大学 Method for generating superpixels and segmenting images based on deep learning
CN112669355A (en) * 2021-01-05 2021-04-16 北京信息科技大学 Method and system for splicing and fusing focusing stack data based on RGB-D super-pixel segmentation
CN112669355B (en) * 2021-01-05 2023-07-25 北京信息科技大学 Method and system for splicing and fusing focusing stack data based on RGB-D super pixel segmentation
CN114239756A (en) * 2022-02-25 2022-03-25 科大天工智能装备技术(天津)有限公司 Insect pest detection method and system
CN114239756B (en) * 2022-02-25 2022-05-17 科大天工智能装备技术(天津)有限公司 Insect pest detection method and system
CN115273645A (en) * 2022-08-09 2022-11-01 南京大学 Map making method for automatically clustering indoor surface elements
CN115273645B (en) * 2022-08-09 2024-04-09 南京大学 Map making method for automatically clustering indoor surface elements
CN117137374A (en) * 2023-10-27 2023-12-01 张家港极客嘉智能科技研发有限公司 Sweeping robot recharging method based on computer vision
CN117137374B (en) * 2023-10-27 2024-01-26 张家港极客嘉智能科技研发有限公司 Sweeping robot recharging method based on computer vision

Also Published As

Publication number Publication date
CN110096961B (en) 2021-03-02

Similar Documents

Publication Publication Date Title
CN110096961B (en) Indoor scene semantic annotation method at super-pixel level
Lei et al. A universal framework for salient object detection
Kohli et al. Simultaneous segmentation and pose estimation of humans using dynamic graph cuts
CN109829449B (en) RGB-D indoor scene labeling method based on super-pixel space-time context
Wang et al. Joint learning of visual attributes, object classes and visual saliency
CN107944428B (en) Indoor scene semantic annotation method based on super-pixel set
Malik et al. The three R’s of computer vision: Recognition, reconstruction and reorganization
Xie et al. Object detection and tracking under occlusion for object-level RGB-D video segmentation
Wang et al. Object instance detection with pruned Alexnet and extended training data
Nedović et al. Stages as models of scene geometry
CN106022351B (en) It is a kind of based on non-negative dictionary to the robust multi-angle of view clustering method of study
CN105160312A (en) Recommendation method for star face make up based on facial similarity match
CN103729885A (en) Hand-drawn scene three-dimensional modeling method combining multi-perspective projection with three-dimensional registration
Couprie et al. Convolutional nets and watershed cuts for real-time semantic labeling of rgbd videos
CN110517270B (en) Indoor scene semantic segmentation method based on super-pixel depth network
CN105740915A (en) Cooperation segmentation method fusing perception information
CN108090485A (en) Display foreground extraction method based on various visual angles fusion
CN105574545B (en) The semantic cutting method of street environment image various visual angles and device
Zhang et al. Deep salient object detection by integrating multi-level cues
CN109191485B (en) Multi-video target collaborative segmentation method based on multilayer hypergraph model
Jin et al. The Open Brands Dataset: Unified brand detection and recognition at scale
Cai et al. Rgb-d scene classification via multi-modal feature learning
Zhang et al. Planeseg: Building a plug-in for boosting planar region segmentation
Couprie et al. Toward real-time indoor semantic segmentation using depth information
Li et al. Spatiotemporal road scene reconstruction using superpixel-based Markov random field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant