CN109829449B - RGB-D indoor scene labeling method based on super-pixel space-time context - Google Patents


Publication number: CN109829449B
Application number: CN201910174110.2A
Authority: CN (China)
Original language: Chinese (zh)
Other versions: CN109829449A (application publication)
Inventors: 王立春, 王梦涵, 王少帆, 孔德慧
Assignee (original and current): Beijing University of Technology
Legal status: Active (granted)
Events: application filed by Beijing University of Technology; publication of CN109829449A; application granted; publication of CN109829449B


Abstract

The invention discloses an RGB-D indoor scene labeling method based on super-pixel space-time context. In the field of computer vision, the process of subdividing a digital image into multiple image sub-regions is called superpixel segmentation. A superpixel is usually a small region made up of spatially adjacent pixels with similar characteristics such as color, brightness and texture; such regions retain locally useful information and generally do not destroy the boundary information of objects in the image. In the method, the semantic annotation of the superpixels determined by the 0.08 segmentation threshold is the optimization target, and the superpixels determined by the 0.06 segmentation threshold serve as spatial context for optimizing the semantic annotation result. Each superpixel corresponding to a leaf node or an intermediate node is semantically classified, giving the semantic labeling probability of every superpixel in the segmentation maps at thresholds 0.06 and 0.08. The method is clearly superior to conventional indoor scene labeling methods.

Description

RGB-D indoor scene labeling method based on super-pixel space-time context
Technical Field
The invention relates to RGB-D indoor scene image annotation, and belongs to the field of computer vision and pattern recognition.
Background
Semantic annotation of indoor scene images is a challenging task in current vision-based scene understanding, with the basic goal of densely providing a predefined semantic class label for each pixel in a given indoor scene image (or frame in a captured indoor scene video).
Indoor scenes involve a large number of semantic categories, mutual occlusion between scene objects, weakly discriminative low-level visual features, and uneven illumination, which makes indoor scene image annotation very difficult. With the popularity of depth sensors, RGB-D data containing color, texture and depth can now be acquired easily and reliably. RGB-D indoor scene labeling methods generally fall into two categories: methods based on defined (hand-crafted) features and methods based on learned features. The invention provides an RGB-D indoor scene labeling method based on super-pixel space-time context, which belongs to the category of methods based on defined features.
The main RGB-D indoor scene labeling methods based on defined features are analyzed below. As pioneers in using depth information for indoor scene semantic annotation, Silberman et al. extract SIFT feature descriptors from the color image (RGB), the depth image (Depth) and the rotated RGB image, and classify these descriptors with a feed-forward neural network to obtain an image semantic annotation result, which is then further optimized with a simple CRF (conditional random field probability graph model). Ren et al. perform superpixel segmentation on the image with the gPb/UCM algorithm and organize the superpixel sets into a hierarchical tree structure according to segmentation thresholds. Feature descriptions of Patches (image blocks) are densely computed on the RGB-D image, and feature descriptions of superpixel regions are computed from the Patch features. For semantic classification, the superpixel features are used as the input of an SVM, which gives the classification result of each superpixel. New superpixel class features are then constructed from the label vectors produced by the SVM classifier, and an MRF (Markov random field) model built with these new features further optimizes the recognition result.
In semantic recognition it is generally agreed that using more context information yields more accurate recognition results. Pixel-level spatial context is usually modeled by an MRF or CRF built on the adjacency relations between pixels, constraining adjacent pixels to take consistent semantic labels. Superpixel-level spatial context is used either by concatenating the features of superpixels related by inclusion as the classification feature, or through a CRF model incorporating superpixel information, in which the estimated probability of a pixel serves as the unary energy, the feature difference of a pixel pair as the pairwise energy, and the superpixel information as the higher-order energy; the optimal labels are determined by solving the defined energy function.
Regarding the use of temporal context, Kundu observed that the pixel information of adjacent frames of a video sequence of the same scene overlaps, and therefore proposed a new dense CRF model.
Object of the Invention
The invention aims to make full use of temporal and spatial context: in the annotation process, the temporal context of superpixels is computed from consecutive frame images, and together with the spatial context provided by hierarchical superpixel segmentation it is used to complete the indoor scene annotation task.
To achieve this purpose, the technical scheme adopted by the invention is an RGB-D indoor scene labeling method based on super-pixel space-time context: the input is the image to be labeled Fr_tar and its temporally adjacent frames Fr_tar-1 and Fr_tar+1, and the output is the pixel-level labeling of Fr_tar.
Based on an optical flow algorithm, for each superpixel of the image to be annotated Fr_tar, the corresponding superpixels in the temporally adjacent frames Fr_tar-1 and Fr_tar+1 are computed; these corresponding superpixels form its temporal context. The image is segmented into superpixels with the gPb/UCM algorithm and the segmentation results are organized into a segmentation tree according to thresholds; the child nodes of a superpixel of Fr_tar in the segmentation tree form its spatial context.
A temporal-context-based feature representation is constructed for each superpixel of Fr_tar, and a gradient boosted decision tree (GBDT) classifies the superpixels using these temporal-context features; using the superpixel spatial context, the semantic classification results of each superpixel and of its spatial context are weighted and combined to obtain the semantic annotation of the superpixels of Fr_tar.
S1 super pixel
In the field of computer vision, the process of subdividing a digital image into multiple image sub-regions is called superpixel segmentation. A superpixel is usually a small region made up of spatially adjacent pixels with similar characteristics such as color, brightness and texture; such regions retain locally useful information and generally do not destroy the boundary information of objects in the image.
S1.1 superpixel segmentation of images
For superpixel segmentation the gPb/UCM algorithm is used: it computes, from local and global features of the image, the probability P_b that a pixel belongs to a boundary. The gPb/UCM algorithm is applied to the color image and to the depth image separately, yielding the probability P_b^rgb that a pixel belongs to a boundary computed from the color image and the probability P_b^depth computed from the depth image; these are combined into P_b according to formula (1). Applying different probability thresholds tr to the resulting probability values P_b produces a multi-level segmentation result.
The probability thresholds tr set in the method are 0.06 and 0.08; pixels whose probability values are smaller than the set threshold are connected into regions according to the eight-connectivity principle, and each such region is a superpixel.
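For illustration, the following minimal sketch (an assumption, not the patented implementation; the boundary-probability map pb would come from an external gPb/UCM implementation) thresholds a boundary map and groups non-boundary pixels into 8-connected regions:

import numpy as np
from scipy import ndimage

def superpixels_from_boundary_map(pb: np.ndarray, tr: float) -> np.ndarray:
    """Label 8-connected regions of pixels whose boundary probability is below tr.

    pb : HxW array of boundary probabilities in [0, 1] (e.g. produced by gPb/UCM).
    Returns an HxW integer label map; each label is one superpixel.
    """
    non_boundary = pb < tr
    eight_conn = np.ones((3, 3), dtype=bool)          # 8-connectivity structure
    labels, num = ndimage.label(non_boundary, structure=eight_conn)
    return labels

# Two segmentation levels as in the method (thresholds 0.06 and 0.08).
pb = np.random.rand(120, 160)                          # placeholder boundary map
seg_006 = superpixels_from_boundary_map(pb, 0.06)
seg_008 = superpixels_from_boundary_map(pb, 0.08)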
S1.2 Patch features
A Patch is defined as an m × m grid that slides from the upper-left corner of the color image and of the depth image rightwards and downwards in steps of n pixels, finally forming a dense grid over both images. In the method, the Patch size is set to 16 × 16 in the experiments and the sliding step is n = 2; for an image of size N × M, the number of Patches finally obtained is ((N − 16)/2 + 1) × ((M − 16)/2 + 1).
Four types of features are calculated for each Patch: depth gradient features, color gradient features, color features and texture features.
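A small sketch (assumed illustration) of the dense patch grid enumeration with a 16 × 16 window and stride 2, which is all that the later superpixel pooling needs:

import numpy as np

def dense_patch_centers(height: int, width: int, patch: int = 16, stride: int = 2):
    """Return the (row, col) centers of all patch x patch windows placed with the given stride."""
    ys = np.arange(0, height - patch + 1, stride)
    xs = np.arange(0, width - patch + 1, stride)
    # len == ((height - 16) // 2 + 1) * ((width - 16) // 2 + 1), matching the count above
    return [(y + patch // 2, x + patch // 2) for y in ys for x in xs]

print(len(dense_patch_centers(480, 640)))  # 233 * 313 = 72929 patches for a 480x640 image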
S1.2.1 depth gradient feature
A Patch in the depth image is denoted Z_d. For each Z_d a depth gradient feature F_g_d is computed, whose t-th component is defined by formula (2):

    F_g_d^t(Z_d) = Σ_{z∈Z_d} m̃_d(z) · α_t^{g_d} · ( [k_g_d(θ̃_d(z), b_1), ..., k_g_d(θ̃_d(z), b_{d_g})] ⊗ [k_s(z, p_1), ..., k_s(z, p_{d_s})] )        (2)

In formula (2), z ∈ Z_d denotes the relative two-dimensional coordinate position of pixel z within the depth Patch; θ̃_d(z) and m̃_d(z) denote the depth gradient direction and gradient magnitude of pixel z; {b_i} and {p_j} are the depth gradient basis vectors and the position basis vectors, both sets being predefined values; d_g and d_s denote the numbers of depth gradient basis vectors and position basis vectors; α_t^{g_d} is the mapping coefficient of the t-th principal component obtained by kernel principal component analysis (KPCA); ⊗ denotes the Kronecker product; k_g_d and k_s are the depth gradient Gaussian kernel and the position Gaussian kernel, with their corresponding Gaussian kernel parameters. Finally, the depth gradient feature is transformed with the EMK (Efficient Match Kernel) algorithm; the transformed feature vector is still denoted F_g_d.
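A compact sketch of the kernel-descriptor computation implied by formula (2). The basis vectors, kernel widths and KPCA projection matrix are placeholders (in the method they are predefined or learned offline), so this only illustrates the structure: per-pixel Gaussian kernel responses to the gradient-orientation and position bases, their Kronecker product, magnitude weighting, and projection.

import numpy as np

def gaussian_kernel(x, basis, gamma):
    """k(x, b_i) = exp(-gamma * ||x - b_i||^2) for every basis vector b_i."""
    d = x[None, :] - basis                      # (n_basis, dim)
    return np.exp(-gamma * np.sum(d * d, axis=1))

def patch_gradient_descriptor(grad_mag, grad_dir, grad_basis, pos_basis, alpha,
                              gamma_g=5.0, gamma_s=3.0):
    """Depth-gradient kernel descriptor of one 16x16 patch (structure of formula (2)).

    grad_mag, grad_dir : (16, 16) gradient magnitude / orientation of the depth patch
    grad_basis         : (d_g, 2) orientation basis vectors (unit vectors, assumed predefined)
    pos_basis          : (d_s, 2) position basis vectors (normalized coords, assumed predefined)
    alpha              : (T, d_g * d_s) KPCA projection coefficients (assumed given)
    """
    feat = np.zeros(grad_basis.shape[0] * pos_basis.shape[0])
    for y in range(grad_mag.shape[0]):
        for x in range(grad_mag.shape[1]):
            ori = np.array([np.cos(grad_dir[y, x]), np.sin(grad_dir[y, x])])
            pos = np.array([y / grad_mag.shape[0], x / grad_mag.shape[1]])
            k_g = gaussian_kernel(ori, grad_basis, gamma_g)   # (d_g,) orientation kernel responses
            k_s = gaussian_kernel(pos, pos_basis, gamma_s)    # (d_s,) position kernel responses
            feat += grad_mag[y, x] * np.kron(k_g, k_s)        # Kronecker product, magnitude-weighted
    return alpha @ feat                                       # t-th component = alpha_t . feat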
S1.2.2 color gradient feature
A Patch in the color image is denoted Z_c. For each Z_c a color gradient feature F_g_c is computed, whose t-th component is defined by formula (3):

    F_g_c^t(Z_c) = Σ_{z∈Z_c} m̃_c(z) · α_t^{g_c} · ( [k_g_c(θ̃_c(z), b_1), ..., k_g_c(θ̃_c(z), b_{c_g})] ⊗ [k_s(z, p_1), ..., k_s(z, p_{c_s})] )        (3)

In formula (3), z ∈ Z_c denotes the relative two-dimensional coordinate position of pixel z within the color-image Patch; θ̃_c(z) and m̃_c(z) denote the gradient direction and gradient magnitude of pixel z; {b_i} and {p_j} are the color gradient basis vectors and the position basis vectors, both sets being predefined values; c_g and c_s denote the numbers of color gradient basis vectors and position basis vectors; α_t^{g_c} is the mapping coefficient of the t-th principal component obtained by kernel principal component analysis (KPCA); ⊗ denotes the Kronecker product; k_g_c and k_s are the color gradient Gaussian kernel and the position Gaussian kernel, with their corresponding Gaussian kernel parameters. Finally, the color gradient feature is transformed with the EMK algorithm; the transformed feature vector is still denoted F_g_c.
S1.2.3 color characteristics
A Patch in the color image is denoted Z_c. For each Z_c a color feature F_col is computed, whose t-th component is defined by formula (4):

    F_col^t(Z_c) = Σ_{z∈Z_c} α_t^{col} · ( [k_col(r(z), b_1), ..., k_col(r(z), b_{c_c})] ⊗ [k_s(z, p_1), ..., k_s(z, p_{c_s})] )        (4)

In formula (4), z ∈ Z_c denotes the relative two-dimensional coordinate position of pixel z within the color-image Patch; r(z) is a three-dimensional vector, the RGB value of pixel z; {b_i} and {p_j} are the color basis vectors and the position basis vectors, both sets being predefined values; c_c and c_s denote the numbers of color basis vectors and position basis vectors; α_t^{col} is the mapping coefficient of the t-th principal component obtained by kernel principal component analysis (KPCA); ⊗ denotes the Kronecker product; k_col and k_s are the color Gaussian kernel and the position Gaussian kernel, with their corresponding Gaussian kernel parameters. Finally, the color feature is transformed with the EMK algorithm; the transformed feature vector is still denoted F_col.
S1.2.4 Texture feature (Texture)
First, the RGB scene image is converted to a gray-scale image; a Patch in the gray-scale image is denoted Z_g. For each Z_g a texture feature F_tex is computed, whose t-th component is defined by formula (5):

    F_tex^t(Z_g) = Σ_{z∈Z_g} s(z) · α_t^{tex} · ( [k_lbp(lbp(z), b_1), ..., k_lbp(lbp(z), b_{g_b})] ⊗ [k_s(z, p_1), ..., k_s(z, p_{g_s})] )        (5)

In formula (5), z ∈ Z_g denotes the relative two-dimensional coordinate position of pixel z within the Patch; s(z) denotes the standard deviation of the pixel gray values in the 3 × 3 region centred on pixel z; lbp(z) is the local binary pattern (LBP) feature of pixel z; {b_i} and {p_j} are the local binary pattern basis vectors and the position basis vectors, both sets being predefined values; g_b and g_s denote the numbers of local binary pattern basis vectors and position basis vectors; α_t^{tex} is the mapping coefficient of the t-th principal component obtained by kernel principal component analysis (KPCA); ⊗ denotes the Kronecker product; k_lbp and k_s are the local binary pattern Gaussian kernel and the position Gaussian kernel, with their corresponding Gaussian kernel parameters. Finally, the texture feature is transformed with the EMK algorithm; the transformed feature vector is still denoted F_tex.
S1.3 superpixel features
The superpixel feature F_seg is defined as in formula (6):

    F_seg = [ F_g_d(seg), F_g_c(seg), F_col(seg), F_tex(seg), G_seg ]        (6)

F_g_d(seg), F_g_c(seg), F_col(seg) and F_tex(seg) denote the superpixel depth gradient, color gradient, color and texture features, defined as in formula (7):

    F_x(seg) = (1/n) Σ_{p=1}^{n} F_x(p),   x ∈ { g_d, g_c, col, tex }        (7)

In formula (7), F_g_d(p), F_g_c(p), F_col(p), F_tex(p) denote the features of the p-th Patch whose center position falls within the superpixel seg, and n denotes the number of Patches whose center positions fall within the superpixel seg.
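A sketch of the pooling in formula (7), assuming each patch feature has already been computed and attached to its center pixel; patch features whose centers fall inside a superpixel are simply averaged.

import numpy as np

def pool_patch_features(labels, centers, patch_feats):
    """Average patch features per superpixel (structure of formula (7)).

    labels      : HxW superpixel label map
    centers     : list of (row, col) patch centers
    patch_feats : (num_patches, D) array of patch features (e.g. F_g_d, F_col, ...)
    Returns a dict  superpixel label -> mean feature vector.
    """
    sums, counts = {}, {}
    for (r, c), f in zip(centers, patch_feats):
        seg = int(labels[r, c])
        sums[seg] = sums.get(seg, 0.0) + f
        counts[seg] = counts.get(seg, 0) + 1
    return {seg: sums[seg] / counts[seg] for seg in sums}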
The superpixel geometry feature G_seg is defined as in formula (8):

    G_seg = [ A_seg, P_seg, R_seg, H_x, H_y, H_xy, D_mean, D_sqmean, D_var, D_miss, N_seg ]        (8)

The components of formula (8) are defined as follows.

Superpixel area: A_seg = Σ_{s∈seg} 1, where s ranges over the pixels within the superpixel seg. The superpixel perimeter P_seg, defined as in formula (9), is the number of pixels in the boundary set B_seg of the superpixel seg, i.e. the pixels of seg that either lie on the image border (the RGB scene image having horizontal and vertical resolutions M and N) or have among their four neighbors N_4(s) a pixel belonging to a different superpixel seg′:

    P_seg = |B_seg|,   B_seg = { s ∈ seg : s lies on the image border, or ∃ s′ ∈ N_4(s) with s′ ∈ seg′, seg′ ≠ seg }        (9)

The area-to-perimeter ratio R_seg of the superpixel is defined as in formula (10):

    R_seg = A_seg / P_seg        (10)

H_x, H_y and H_xy are second-order Hu moments (orders 2+0 = 2, 0+2 = 2 and 1+1 = 2) computed from the x coordinate s_x of pixel s, from its y coordinate s_y, and from the product of the x and y coordinates, respectively, as defined in formulas (11), (12) and (13). Formula (14) defines the quantities used in their computation: the mean of the x coordinates, the mean of the y coordinates, and the corresponding squared terms of the pixels contained in the superpixel, where Width and Height denote the width and height of the image, i.e. the computation is based on pixel coordinate values normalized by Width and Height.

D_mean, D_sqmean and D_var denote, respectively, the mean of the depth values s_d of the pixels s within the superpixel seg, the mean of the squared depth values, and the variance of the depth values, defined as in formula (15):

    D_mean = (1/A_seg) Σ_{s∈seg} s_d,   D_sqmean = (1/A_seg) Σ_{s∈seg} s_d^2,   D_var = D_sqmean − D_mean^2        (15)

D_miss is the proportion of pixels within the superpixel whose depth information is missing, defined as in formula (16):

    D_miss = Num({ s ∈ seg : depth of s is missing }) / A_seg        (16)

N_seg is the modulus of the principal normal vector of the point cloud corresponding to the superpixel, where the principal normal vector of the point cloud corresponding to the superpixel is estimated by principal component analysis (PCA).
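The geometric part of the superpixel feature can be sketched as below. This is an illustrative subset under assumptions (missing depth encoded as 0, coordinate normalization and Hu moments omitted, and the PCA normal returned as a direction rather than the modulus used in the method):

import numpy as np

def geometry_features(labels, depth, seg_id):
    mask = labels == seg_id
    area = int(mask.sum())

    # Boundary pixels: inside the superpixel but with a 4-neighbor outside it (or on the image border).
    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = int((mask & ~interior).sum())

    d = depth[mask]
    valid = d > 0                                    # assume depth value 0 encodes missing depth
    d_mean = float(d[valid].mean()) if valid.any() else 0.0
    d_var = float(d[valid].var()) if valid.any() else 0.0
    d_miss = float((~valid).sum()) / area

    # Principal normal of the superpixel point cloud via PCA on (x, y, depth).
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys, depth[ys, xs]], axis=1).astype(float)
    pts -= pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts, full_matrices=False)
    normal = vt[-1]                                   # direction of least variance

    return [area, perimeter, area / max(perimeter, 1), d_mean, d_var, d_miss] + normal.tolist()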
S2 superpixel context
The method constructs a temporal context based on the temporal ordering of the RGB-D image sequence and a spatial context based on the tree structure of the superpixel segmentation.
S2.1 superpixel temporal context
S2.1.1 interframe optical flow calculation
In the method, the optical flow obtained by calculating from a target frame to a reference frame is defined as a forward optical flow, and the optical flow obtained by calculating from the reference frame to the target frame is defined as a backward optical flow.
(1) Initial optical flow estimation
The SimpleFlow method is adopted for the initial inter-frame optical flow estimation. For two frames Fr_tar and Fr_tar+1, (x, y) denotes a pixel of Fr_tar and (u(x, y), v(x, y)) denotes the optical flow vector at (x, y). Taking Fr_tar as the target frame and Fr_tar+1 as the reference frame, the forward optical flow from Fr_tar to Fr_tar+1 is the set of optical flow vectors of all pixels of Fr_tar, i.e. { (u(x, y), v(x, y)) | (x, y) ∈ Fr_tar }. In the following, u(x, y) and v(x, y) are abbreviated to u and v; according to the flow, the pixel of Fr_tar+1 corresponding to pixel (x, y) of Fr_tar is (x + u, y + v).

First the forward optical flow from Fr_tar to Fr_tar+1 is computed. For a pixel (x_0, y_0) of frame Fr_tar, a window W_1 of size a × a centred on it is taken; in this process a = 10. An arbitrary point (p, q) of W_1 corresponds, under the flow (u, v), to the pixel (p + u, q + v) of frame Fr_tar+1. The energy term e is computed for all points of the window W_1 as in formula (17):

    e(p, q, u, v) = || Int_tar(p, q) − Int_tar+1(p + u, q + v) ||^2        (17)

where (p, q) ∈ W_1, Int_tar(p, q) denotes the color information of pixel (p, q) in Fr_tar and Int_tar+1(p + u, q + v) denotes the color information of pixel (p + u, q + v) in Fr_tar+1; computing e for each point of the window in turn yields a vector e of a^2 components.

Then, based on a locally smooth likelihood model, the optical flow vector is optimized by combining the color feature and the local distance feature, as in formula (18):

    E(x_0, y_0, u, v) = Σ_{(p,q)∈W_1} w_d(p, q) · w_c(p, q) · e(p, q, u, v)        (18)

In formula (18), E(x_0, y_0, u, v) is the local-region energy, representing the energy of the forward optical flow vector (u, v) at pixel (x_0, y_0) of frame Fr_tar; it is the weighted accumulation of the energy terms e of all pixels inside the window W_1 centred on (x_0, y_0). The parameter O, set to 20 in the method, bounds the range over which the optical flow vector (u, v) varies. The distance weight w_d and the color weight w_c are determined by the distance difference and the color difference between pixel (x_0, y_0) and the corresponding point (x_0 + u, y_0 + v) given by the flow (u, v), with color parameter σ_c = 0.08 (empirical value) and distance parameter σ_d = 5.5 (empirical value). The (u, v) minimizing the energy E is the optical flow vector estimate of pixel (x_0, y_0); computing the optical flow vectors of all pixels of frame Fr_tar yields the forward optical flow from Fr_tar to Fr_tar+1.
Likewise, the backward optical flow from frame Fr_tar+1 to frame Fr_tar is computed with the same procedure.
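A sketch of the local-window energy search of formulas (17)–(18) for one pixel, under assumptions (Gaussian forms for the weights, a small window radius and search range chosen here for speed; the patent uses a = 10 and O = 20):

import numpy as np

def flow_at_pixel(img_tar, img_ref, x0, y0, radius=5, search=4, sigma_c=0.08, sigma_d=5.5):
    """Estimate the forward flow (u, v) at (x0, y0) by minimizing the weighted window energy.

    img_tar, img_ref : (H, W, 3) float arrays (color values in [0, 1]).
    """
    H, W, _ = img_tar.shape
    best, best_uv = np.inf, (0, 0)
    ys = np.arange(max(0, y0 - radius), min(H, y0 + radius + 1))
    xs = np.arange(max(0, x0 - radius), min(W, x0 + radius + 1))
    for u in range(-search, search + 1):
        for v in range(-search, search + 1):
            E = 0.0
            for q in ys:
                for p in xs:
                    qq, pp = q + v, p + u
                    if not (0 <= qq < H and 0 <= pp < W):
                        continue  # ignore window points whose match falls outside the reference frame
                    e = np.sum((img_tar[q, p] - img_ref[qq, pp]) ** 2)            # formula (17)
                    w_d = np.exp(-((p - x0) ** 2 + (q - y0) ** 2) / sigma_d)      # distance weight (assumed form)
                    w_c = np.exp(-np.sum((img_tar[q, p] - img_tar[y0, x0]) ** 2) / sigma_c)  # color weight (assumed form)
                    E += w_d * w_c * e                                            # formula (18)
            if E < best:
                best, best_uv = E, (u, v)
    return best_uv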
(2) Occlusion point detection
Denote the forward optical flow from frame Fr_tar to frame Fr_tar+1 by { (u_f(x), v_f(y)) | (x, y) ∈ Fr_tar } and the backward optical flow from frame Fr_tar+1 to frame Fr_tar by { (u_b(x′), v_b(y′)) | (x′, y′) ∈ Fr_tar+1 }. For pixel (x, y), the value || (u_f(x), v_f(y)) − (−u_b(x + u_f(x)), −v_b(y + v_f(y))) || is computed; if this value is not 0, pixel (x, y) is regarded as an occlusion point.
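The occlusion test can be written compactly, assuming dense forward and backward flow fields are available as arrays (a small tolerance is used here instead of the strict "not 0" test, which is an assumption for robustness):

import numpy as np

def occlusion_mask(flow_fwd, flow_bwd, eps=0.5):
    """Mark pixels whose forward flow is not cancelled by the backward flow at the target location.

    flow_fwd, flow_bwd : (H, W, 2) arrays holding (u, v) per pixel.
    """
    H, W, _ = flow_fwd.shape
    occluded = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            u, v = flow_fwd[y, x]
            xt, yt = int(round(x + u)), int(round(y + v))
            if not (0 <= xt < W and 0 <= yt < H):
                occluded[y, x] = True
                continue
            ub, vb = flow_bwd[yt, xt]
            occluded[y, x] = np.hypot(u + ub, v + vb) > eps
    return occluded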
(3) Re-estimation of the optical flow of occlusion points
For a pixel (x_0, y_0) marked as an occlusion point, the optical flow energy is re-estimated using formula (19) and denoted E_b(x_0, y_0, u, v). In formula (19), ē(x_0, y_0) denotes the mean of the energy terms e of pixel (x_0, y_0) of frame Fr_tar over the different candidate optical flow estimates, e_min(x_0, y_0) denotes the minimum of these energy terms, and the weight w_r(x_0, y_0) is the difference between the mean energy term e and the minimum energy term e. For a pixel (x_0, y_0) marked as occluded, the (u, v) minimizing E_b is taken as its optical flow vector.
The final optical flow vector of a pixel marked as an occlusion point is the re-estimated vector obtained in this step (3).
S2.1.2 superpixel temporal context and its feature representation
Using the superpixel segmentation method of S1.1, the Fr_tar frame image, the Fr_tar-1 frame image and the Fr_tar+1 frame image are each segmented into superpixels and their superpixel segmentation maps are computed.
(1) Superpixel temporal context
First, from the forward optical flow from Fr_tar to Fr_tar+1, the mean of the forward optical flow { (u_f(x), v_f(y)) | (x, y) ∈ Seg_tar } of all pixels { (x, y) | (x, y) ∈ Seg_tar } contained in the superpixel Seg_tar of frame Fr_tar is computed, as in formula (20):

    (ū_f(Seg_tar), v̄_f(Seg_tar)) = (1 / Num(Seg_tar)) Σ_{(x,y)∈Seg_tar} (u_f(x), v_f(y))        (20)

In formula (20), Num(Seg_tar) denotes the number of pixels contained in the superpixel Seg_tar. Shifting the pixels of Seg_tar into Fr_tar+1 by this mean forward flow gives the region Seg′_tar = { (x′, y′) | x′ = x + ū_f(Seg_tar), y′ = y + v̄_f(Seg_tar), (x, y) ∈ Seg_tar, (x′, y′) ∈ Fr_tar+1 }, called the corresponding region of superpixel Seg_tar in Fr_tar+1. The intersection-over-union (IOU) between Seg′_tar and the i-th superpixel Seg^i_tar+1 of frame Fr_tar+1 is computed as in formula (21):

    IOU(Seg′_tar, Seg^i_tar+1) = Num(Seg′_tar ∩ Seg^i_tar+1) / Num(Seg′_tar ∪ Seg^i_tar+1)        (21)

In formula (21), Num(·) denotes the number of pixels contained in a region. If IOU(Seg′_tar, Seg^i_tar+1) > τ, the corresponding region Seg″_tar of the superpixel Seg^i_tar+1 in frame Fr_tar is computed from the backward optical flow from Fr_tar+1 to Fr_tar, and the IOU between the region Seg″_tar and the superpixel Seg_tar, IOU(Seg″_tar, Seg_tar), is computed with formula (21). If IOU(Seg″_tar, Seg_tar) > τ, then Seg^i_tar+1 is called a corresponding superpixel of Seg_tar in Fr_tar+1; the number of corresponding superpixels of Seg_tar in Fr_tar+1 may be 0, 1 or more. In the method, the intersection-over-union decision threshold τ is set to 0.3. In the same way, the corresponding superpixels of Seg_tar in the Fr_tar-1 frame are found; the number of corresponding superpixels of Seg_tar in Fr_tar-1 is likewise 0, 1 or more.

The temporal context of the superpixel Seg_tar is written Segs_tar = { Segs_tar-1, Seg_tar, Segs_tar+1 }, where Segs_tar-1 and Segs_tar+1 are the sets of corresponding superpixels of the Fr_tar-frame superpixel Seg_tar in the Fr_tar-1 frame and the Fr_tar+1 frame, respectively.
(2) Superpixel temporal context semantic feature representation
The semantic feature of the superpixel temporal context Segs_tar is F_Segs_tar, defined as in formula (22):

    F_Segs_tar = [ F̄_Segs_tar-1, F_Seg_tar, F̄_Segs_tar+1 ]        (22)

F_Seg_tar is the feature of the superpixel Seg_tar in frame Fr_tar; F̄_Segs_tar-1 is the mean of the features of all corresponding superpixels Segs_tar-1 in the Fr_tar-1 frame; F̄_Segs_tar+1 is the mean of the features of all corresponding superpixels Segs_tar+1 in the Fr_tar+1 frame. The feature of each superpixel is computed according to the method of section S1.3.
When the number of corresponding superpixels of the Fr_tar-frame superpixel Seg_tar in the Fr_tar+1 frame or the Fr_tar-1 frame is 0, its own feature F_Seg_tar is used in place of F̄_Segs_tar+1 or F̄_Segs_tar-1.
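The temporal-context feature of formula (22) is then just a concatenation, with the superpixel's own feature substituted when it has no correspondence in a neighboring frame; a minimal sketch:

import numpy as np

def temporal_context_feature(f_tar, feats_prev, feats_next):
    """Concatenate [mean(prev correspondences), own feature, mean(next correspondences)] (formula (22)).

    f_tar      : feature vector of the superpixel in Fr_tar
    feats_prev : list of feature vectors of its corresponding superpixels in Fr_tar-1 (may be empty)
    feats_next : list of feature vectors of its corresponding superpixels in Fr_tar+1 (may be empty)
    """
    f_prev = np.mean(feats_prev, axis=0) if len(feats_prev) else f_tar
    f_next = np.mean(feats_next, axis=0) if len(feats_next) else f_tar
    return np.concatenate([f_prev, f_tar, f_next])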
S2.2 superpixel spatial context
The image is segmented into superpixels with the method of S1.1. When the threshold of the superpixel hierarchical segmentation tree is set to 1, the highest-level superpixel segmentation map is obtained, i.e. the root node of the hierarchical segmentation tree, which represents the whole image as one superpixel. Setting the threshold to 0.06 gives a lower-level superpixel segmentation result. When the threshold is 0.08 the boundary decision criterion is raised, so pixels whose original boundary probability value lies in [0.06, 0.08] are judged as non-boundary points, whereas at threshold 0.06 these points are judged as boundary points; a higher-level superpixel therefore contains lower-level superpixels. In the method, the spatial context of a parent-node superpixel is defined as its child-node superpixels in the hierarchical segmentation tree.
S3 semantic Classification
S3.1 temporal context based superpixel semantic classification
The method takes the temporal-context features of the superpixels as input, performs semantic classification of the superpixels with a GBDT (gradient boosted decision tree), and outputs the predicted label of each superpixel.
In the GBDT training process, MR training rounds are set, mr ∈ {1, 2, 3, ..., MR}; in round mr a regression tree, i.e. a weak classifier, is trained for each class, so with L classes L regression trees are trained per round, l ∈ {1, 2, 3, ..., L}. In total L × MR weak classifiers are obtained. Each classifier in each round is trained in the same way.
(1) GBDT multi-classifier training
The training set Fea_tr contains NSeg_tr samples: Fea_tr = { (Fea_i, lab_i) | i = 1, ..., NSeg_tr }, where the training sample Fea_i is the temporal-context feature of the i-th superpixel and lab_i ∈ {1, 2, 3, ..., L} is its true label.

First, round 0 is initialized: the prediction function value h_{l,0}(x) of the class-l classifier is set to 0. The true label lab_i is converted into an L-dimensional label vector lab_i = [lab_i[1], ..., lab_i[L]], lab_i[k] ∈ {0, 1}: if the true label of the i-th training sample is l, then the l-th component lab_i[l] = 1 and the other components are 0. The probability that the i-th sample belongs to class l is computed as in formula (24):

    prob_{l,mr}(Fea_i) = exp(h_{l,mr}(Fea_i)) / Σ_{j=1}^{L} exp(h_{j,mr}(Fea_i))        (24)

I(lab_i = l) is an indicator function whose value is 1 when the label of sample i is l and 0 otherwise.

The prediction of the class-l classifier of round mr−1 for the i-th sample is written h_{l,(mr−1)}(Fea_i); the classification error of the class-l classifier of round mr−1 on the i-th sample is err_{l,mr−1}(Fea_i), defined as in formula (23):

    err_{l,mr−1}(Fea_i) = I(lab_i = l) − prob_{l,mr−1}(Fea_i)        (23)

This yields the classification error set of round mr−1, { err_{l,mr−1}(Fea_i) | i = 1, ..., NSeg_tr }.
When the class-l classifier of round mr is constructed, the training sample data set Fea_tr is traversed: for each feature dimension par of each sample, the value of the par-th feature dimension of the i-th sample is taken as the classification reference value and all samples of the data set Fea_tr are split by it; samples whose feature value is larger than the reference value belong to the set Region_1, the others to the set Region_2. After all samples have been split, the error of the regression tree is computed according to formula (25):

    Err(par, i) = Σ_{m=1,2} Σ_{Fea_j ∈ Region_m} ( err_{l,mr−1}(Fea_j) − ēr_m )^2,   ēr_m = (1 / NRegion_m) Σ_{Fea_j ∈ Region_m} err_{l,mr−1}(Fea_j)        (25)

where NRegion_m denotes the total number of samples falling into Region_m. The feature value minimizing the regression-tree error is finally selected as the new split value of the tree. The regression tree is repeatedly constructed in this way until the set height of the tree is reached; the height of the regression tree is set to 5 in the method. The regression trees of the other classes in the current round are constructed in the same way.
The number of leaf nodes of the class-l regression tree of round mr is written Reg_{mr,l}; each leaf node is a subset of the training sample set, and the intersection of any two leaf nodes is empty. For the class-l regression tree constructed in round mr, the gain value γ_{mr,l,reg} of each leaf node reg is computed from the classification errors of the samples falling into that leaf, as in formula (26). The prediction value h_{l,mr}(Fea_i) of the class-l regression tree of round mr for the i-th sample is then computed with formula (27):

    h_{l,mr}(Fea_i) = h_{l,mr−1}(Fea_i) + Σ_{reg=1}^{Reg_{mr,l}} γ_{mr,l,reg} · I(Fea_i ∈ reg)        (27)

where reg ∈ {1, 2, ..., Reg_{mr,l}}.

Training continues until round MR is finished. The prediction value h_{l,MR}(Fea_i) of the class-l regression tree of round MR for the i-th sample is expressed as in formula (28):

    h_{l,MR}(Fea_i) = h_{l,MR−1}(Fea_i) + Σ_{reg=1}^{Reg_{MR,l}} γ_{MR,l,reg} · I(Fea_i ∈ reg)        (28)

where reg ∈ {1, 2, ..., Reg_{MR,l}}. Expanding h_{l,MR−1}(Fea_i) in formula (28) with the analogous expression for the class-l regression tree of round MR−1 gives the prediction for the i-th sample as formula (29); continuing the expansion down through the regression trees of rounds MR−1 to 0 gives formula (30):

    h_{l,MR}(Fea_i) = h_{l,0}(Fea_i) + Σ_{mr=1}^{MR} Σ_{reg=1}^{Reg_{mr,l}} γ_{mr,l,reg} · I(Fea_i ∈ reg) = Σ_{mr=1}^{MR} Σ_{reg=1}^{Reg_{mr,l}} γ_{mr,l,reg} · I(Fea_i ∈ reg)        (30)
(2) GBDT prediction
The temporal-context feature Fea_Seg of the superpixel Seg is computed; the prediction values h_{l,MR}(Fea_Seg) of the superpixel Seg for the different classes are calculated with formula (30), and then the probability values prob_{l,MR}(Fea_Seg) of the superpixel Seg belonging to the different classes are calculated with formula (24). The class l with the highest probability value is the predicted class of the superpixel Seg.
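The patent trains its own softmax-style GBDT (L regression trees per round, tree height 5, MR rounds). A minimal stand-in using scikit-learn's GradientBoostingClassifier, an assumption rather than the patent's exact procedure, shows how the temporal-context features would be classified in practice:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Fea_tr: (N, D) temporal-context superpixel features; lab_tr: labels in {1, ..., L} (random placeholders here).
Fea_tr = np.random.rand(200, 32)
lab_tr = np.random.randint(1, 5, size=200)

gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=5, learning_rate=0.1)
gbdt.fit(Fea_tr, lab_tr)

Fea_seg = np.random.rand(10, 32)                 # temporal-context features of test superpixels
prob = gbdt.predict_proba(Fea_seg)               # per-class probabilities (analogue of formula (24))
pred = gbdt.predict(Fea_seg)                     # class with the highest probability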
S3.2 optimizing semantic classification based on spatial context
When the method is used for carrying out superpixel segmentation on an image, two boundary judgment thresholds of 0.06 and 0.08 are set, so that a hierarchical segmentation tree with the height of 2 is obtained.
In the method, the semantic annotation of the superpixel determined by the 0.08 threshold is taken as an optimization target, and the superpixel determined by the 0.06 segmentation threshold is taken as a spatial context and is used for optimizing a semantic annotation result.
Firstly, according to the method of S3.1, semantic classification is carried out on each block of superpixels corresponding to the leaf nodes and the intermediate nodes, the semantic labeling probability of each superpixel in the superpixel segmentation graph under the threshold values of 0.06 and 0.08 is obtained, and the final semantic label of the superpixel block is calculated through the formula (31).
    l = argmax_l [ w_target · prob_target(l) + w_aux · (1/Naux) Σ_{a=1}^{Naux} prob_a(l) ]        (31)

In formula (31), l denotes the final semantic label of the superpixel block, i.e. the class with the maximum value computed by formula (31); prob_a(l) denotes the probability that the a-th superpixel in the set of 0.06-threshold superpixels contained in the 0.08-threshold superpixel has semantic label l; prob_target(l) is the probability that the 0.08-threshold superpixel has semantic label l; Naux denotes the number of 0.06-threshold superpixels contained in the 0.08-threshold superpixel; w_aux is the confidence of the 0.06-threshold superpixel semantic annotation, set to 0.4 in the method; and w_target is the confidence of the 0.08-threshold superpixel semantic annotation, set to 0.6 in the method.
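A sketch of the spatial-context fusion of formula (31): for each 0.08-threshold superpixel, its own class probabilities (weight 0.6) are combined with the average probabilities of the 0.06-threshold superpixels it contains (weight 0.4), and the arg-max gives the final label.

import numpy as np

def fuse_with_spatial_context(p_target, p_children, w_target=0.6, w_aux=0.4):
    """Final label of a 0.08-threshold superpixel from its own and its children's class probabilities.

    p_target   : (L,) class probabilities of the 0.08-threshold superpixel
    p_children : (Naux, L) class probabilities of the 0.06-threshold superpixels it contains
    """
    p_children = np.asarray(p_children)
    fused = w_target * np.asarray(p_target) + w_aux * p_children.mean(axis=0)
    return int(np.argmax(fused)) + 1          # classes are indexed from 1 in the patent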
Drawings
FIG. 1 is a flow chart of an RGBD indoor scene recognition method based on space-time context.
FIG. 2 is a diagram of a superpixel partition hierarchical tree.
FIG. 3 is a schematic diagram of spatial context based optimization.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
As shown in FIGS. 1-3, the RGB-D indoor scene labeling method based on super-pixel space-time context takes as input the image to be labeled Fr_tar and its temporally adjacent frames Fr_tar-1 and Fr_tar+1, and outputs the pixel-level labeling of Fr_tar.
Based on an optical flow algorithm, for each superpixel of Fr_tar the corresponding superpixels in the temporally adjacent frames Fr_tar-1 and Fr_tar+1 are computed; these corresponding superpixels form its temporal context. The image is segmented into superpixels with the gPb/UCM algorithm and the segmentation results are organized into a segmentation tree according to the thresholds; the child nodes of a superpixel of Fr_tar in the segmentation tree form its spatial context.
A temporal-context-based feature representation is constructed for each superpixel of Fr_tar, and a gradient boosted decision tree (GBDT) classifies the superpixels using these temporal-context features; using the superpixel spatial context, the semantic classification results of each superpixel and of its spatial context are weighted and combined to obtain the semantic annotation of the superpixels of Fr_tar. The detailed steps are as described in sections S1-S3 above.
S1 super pixel
In the field of computer vision, the process of subdividing a digital image into a plurality of image sub-regions is known as superpixel segmentation. Superpixels are usually small regions composed of a series of pixel points with adjacent positions and similar characteristics such as color, brightness, texture and the like, and the small regions retain local effective information and generally do not destroy boundary information of objects in an image.
S1.1 superpixel segmentation of images
Super-pixel segmentation uses gPb/UCM algorithm to calculate probability value of pixel belonging to boundary through local and global features of image
Figure BDA0001988977880000141
The gPb/UCM algorithm is applied to the color image and the depth image respectively, and the calculation is carried out according to the formula (1)
Figure BDA0001988977880000142
In the formula (1), the reaction mixture is,
Figure BDA0001988977880000143
is a probability value calculated based on the color image that a pixel belongs to the boundary,
Figure BDA0001988977880000144
is a probability value of a pixel belonging to a boundary calculated based on the depth image.
Figure BDA0001988977880000145
Probability value obtained according to formula (1)
Figure BDA0001988977880000146
And setting different probability threshold values tr to obtain a multi-level segmentation result.
The probability threshold tr set in the method is 0.06 and 0.08, and the pixels with the probability values smaller than the set threshold are connected into a region according to the eight-connection principle, wherein each region is a super pixel.
S1.2Patch feature
Patch is defined as an m × m-sized grid, and slides from the upper left corner of the color image and the depth image downward to the right in steps of n pixels, eventually forming a dense grid on the color image and the depth image. In the method, the size of the Patch is set to be 16 multiplied by 16 in an experiment, the value of the sliding step length N when the Patch is selected to be 2, an image with the size of N multiplied by M is taken as an example, and the number of the Patch finally obtained is
Figure BDA0001988977880000147
Four types of features are calculated for each Patch: depth gradient features, color features, texture features.
S1.2.1 depth gradient feature
Patch in depth image is noted as ZdFor each ZdComputing depth gradient feature Fg_dWherein the value of the t-th component is defined by equation (2):
Figure BDA0001988977880000148
in the formula (2), Z ∈ ZdRepresents the relative two-dimensional coordinate position of pixel z in depth Patch;
Figure BDA0001988977880000151
and
Figure BDA0001988977880000152
respectively representing the depth gradient direction and the gradient magnitude of the pixel z;
Figure BDA0001988977880000153
and
Figure BDA0001988977880000154
the depth gradient base vectors and the position base vectors are respectively, and the two groups of base vectors are predefined values; dgAnd dsRespectively representing the number of depth gradient base vectors and the number of position base vectors;
Figure BDA0001988977880000155
is that
Figure BDA0001988977880000156
Applying mapping coefficient of t-th principal component obtained by Kernel Principal Component Analysis (KPCA),
Figure BDA0001988977880000157
representing the kronecker product.
Figure BDA0001988977880000158
And
Figure BDA0001988977880000159
respectively a depth gradient gaussian kernel function and a position gaussian kernel function,
Figure BDA00019889778800001510
and
Figure BDA00019889778800001511
are parameters corresponding to a gaussian kernel function. Finally, the depth gradient feature is transformed by using an EMK (efficient Match Kernel) algorithm, and the transformed feature vector is still marked as Fg_d
S1.2.2 color gradient feature
Patch in color image is noted as ZcFor each ZcCalculating color gradient feature Fg_cWherein the value of the t-th component is defined by equation (3):
Figure BDA00019889778800001512
in the formula (3), Z ∈ ZcRepresents the relative two-dimensional coordinate position of a pixel z in the color image Patch;
Figure BDA00019889778800001513
and
Figure BDA00019889778800001514
respectively representing the gradient direction and the gradient magnitude of the pixel z;
Figure BDA00019889778800001515
and
Figure BDA00019889778800001516
color gradient base vectors and position base vectors are respectively, and the two groups of base vectors are predefined values; c. CgAnd csRespectively representing the number of color gradient base vectors and the number of position base vectors;
Figure BDA00019889778800001517
is that
Figure BDA00019889778800001518
Applying mapping coefficient of t-th principal component obtained by Kernel Principal Component Analysis (KPCA),
Figure BDA00019889778800001519
representing the kronecker product.
Figure BDA00019889778800001520
And
Figure BDA00019889778800001521
respectively a color gradient gaussian kernel function and a position gaussian kernel function,
Figure BDA00019889778800001522
and
Figure BDA00019889778800001523
are parameters corresponding to a gaussian kernel function. Finally, the color gradient features are transformed by using an EMK algorithm, and the transformed feature vector is still marked as Fg_c
S1.2.3 color characteristics
Patch in color image is noted as ZcFor each ZcCalculating color characteristics FcolWherein the value of the t-th component is defined by equation (4):
Figure BDA0001988977880000161
in the formula (4), Z ∈ ZcRepresents the relative two-dimensional coordinate position of pixel z in the color image Patch; r (z) is a three-dimensional vector, which is the RGB value of pixel z;
Figure BDA0001988977880000162
and
Figure BDA0001988977880000163
color basis vectors and position basis vectors are respectively adopted, and the two groups of basis vectors are predefined values; c. CcAnd csRespectively representing the number of the color basis vectors and the number of the position basis vectors;
Figure BDA0001988977880000164
is that
Figure BDA0001988977880000165
Applying mapping coefficient of t-th principal component obtained by Kernel Principal Component Analysis (KPCA),
Figure BDA0001988977880000166
representing the kronecker product.
Figure BDA0001988977880000167
And
Figure BDA0001988977880000168
respectively a color gaussian kernel function and a position gaussian kernel function,
Figure BDA0001988977880000169
and
Figure BDA00019889778800001610
are parameters corresponding to a gaussian kernel function. Finally, the color features are transformed by using an EMK algorithm, and the transformed feature vector is still marked as Fcol
S1.2.4 Texture feature (Texture)
Firstly, an RGB scene image is converted into a gray scale image, and Patch in the gray scale image is recorded as ZgFor each ZgCalculating texture feature FtexWherein the value of the t-th component is defined by equation (5):
Figure BDA00019889778800001611
in the formula (5), Z ∈ ZgRepresents the relative two-dimensional coordinate position of pixel z in the color image Patch; s (z) represents the standard deviation of the pixel gray values in a 3 × 3 region centered on pixel z; LBP (z) is the Local Binary Pattern feature (LBP) of pixel z;
Figure BDA00019889778800001612
and
Figure BDA00019889778800001613
respectively are a local binary pattern base vector and a position base vector, and the two groups of base vectors are predefined values; gbAnd gsRespectively representing the number of the base vectors of the local binary pattern and the number of the position base vectors;
Figure BDA00019889778800001614
is that
Figure BDA00019889778800001615
Applying mapping coefficient of t-th principal component obtained by Kernel Principal Component Analysis (KPCA),
Figure BDA00019889778800001616
representing the kronecker product.
Figure BDA00019889778800001617
And
Figure BDA00019889778800001618
respectively a local binary pattern gaussian kernel function and a position gaussian kernel function,
Figure BDA00019889778800001619
and
Figure BDA00019889778800001620
are parameters corresponding to a gaussian kernel function. Finally, the texture features are transformed by using an EMK algorithm, and the transformed feature vector is still marked as Ftex
S1.3 superpixel features
Super pixel feature FsegIs defined as formula (6):
Figure BDA00019889778800001621
Figure BDA0001988977880000171
respectively representing a super-pixel depth gradient feature, a color feature and a texture feature, and defined as formula (7):
Figure BDA0001988977880000172
in the formula (7), Fg_d(p),Fg_c(p),Fcol(p),Ftex(p) represents the feature of Patch whose p-th center position falls within the super pixel seg, and n representsThe number of Patch whose core position falls within the super pixel seg.
Superpixel geometry
Figure BDA0001988977880000173
Is defined by the formula (8):
Figure BDA0001988977880000174
the components in equation (8) are defined as follows:
super pixel area Aseg=∑s∈seg1, s are pixels within the super-pixel seg; super pixel perimeter PsegAs defined in formula (9):
Figure BDA0001988977880000175
in formula (9), M, N represents the horizontal and vertical resolutions of the RGB scene image, respectively; seg, seg represent different superpixels; n is a radical of4(s) is a set of four-neighbor domains of pixel s; b issegIs the set of boundary pixels of the super-pixel seg.
Area to perimeter ratio R of super pixelsegAs defined in formula (10):
Figure BDA0001988977880000176
Figure BDA0001988977880000177
is based on the x-coordinate s of the pixel sxY coordinate syAnd a second-order (2+0 ═ 2 or 0+2 ═ 2) Hu moment calculated by multiplying the x coordinate by the y coordinate, respectively, as defined in equations (11), (12) and (13)
Figure BDA0001988977880000178
Figure BDA0001988977880000179
Figure BDA00019889778800001710
In formula (14)
Figure BDA00019889778800001711
Respectively representing the mean value of x coordinates, the mean value of y coordinates, the square of the mean value of x coordinates and the square of the mean value of y coordinates of the pixels contained in the super pixels, and defining the following formula (14):
Figure BDA0001988977880000181
width and Height respectively represent the Width and Height of the image, i.e.
Figure BDA0001988977880000182
The calculation is based on the normalized pixel coordinate values.
Figure BDA0001988977880000183
DvarRespectively representing the depth values s of the pixels s within the superpixel segdAverage value of (1), depth value sdMean of squares, variance of depth values, defined as (15):
Figure BDA0001988977880000184
Dmissthe proportion of pixels in a super-pixel that lose depth information is defined as (16):
Figure BDA0001988977880000185
Nsegis the principal normal vector modulo length of the point cloud corresponding to the superpixel to which the principal normal of the point cloud correspondsThe vectors are estimated by Principal Component Analysis (PCA).
S2 superpixel context
The method respectively constructs a time context and a space context based on an RGB-D image sequence time sequence relation and a tree structure of super-pixel segmentation.
S2.1 superpixel temporal context
S2.1.1 interframe optical flow calculation
In the method, the optical flow obtained by calculating from a target frame to a reference frame is defined as a forward optical flow, and the optical flow obtained by calculating from the reference frame to the target frame is defined as a backward optical flow.
(2) Initial optical flow estimation
The SimpleFlow method is adopted for the interframe initial optical flow estimation. For two frame images FrtarAnd Frtar+1(x, y) represents FrtarThe middle pixel point, (u (x, y), v (x, y)) represents the optical flow vector at (x, y). Defining an image FrtarAs target frame, image Frtar+1Is a reference frame, then image FrtarTo the image Frtar+1The forward optical flow of (A) is FrtarThe set of optical flow vectors of all the pixel points in (i.e., { (u (x, y), v (x, y)) | (x, y) ∈ Fr)tar}. In the following process, u (x, y) and v (x, y) are abbreviated as u and v, respectively, and FrtarMiddle pixel (x, y) is calculated from the optical flow at Frtar+1The corresponding pixel point in (x + u, y + v).
First, the image Fr is calculatedtarTo the image Frtar+1Forward optical flow of (f), for FrtarFrame pixel (x)0,y0) Taking a window of a size a x a centered on it
Figure BDA0001988977880000191
In this process, where a is 10, W1At an arbitrary point (p, q) in Frtar+1The corresponding pixel points in the frame are (p + u, q + v), and the window W is aligned1Calculating the energy term e at all points in the equation, as in (17)
e(p,q,u,v)=||Inttar(p,q)-Inttar+1(p+u,q+v)||2 (17)
Wherein (p, q) ∈ W1,Inttar(p, q) represents FrtarColor information of the middle pixel (p, q), Inttar+1(p + u, q + v) represents Frtar+1The color information of the pixel points of the middle pixel point (p + u, q + v) is calculated for each pair of points in the window in sequence to obtain a2Vector e of dimensions.
Then, based on the local smooth likelihood model, the optical flow vector is optimized by combining the color feature and the local distance feature as shown in formula (18):
Figure BDA0001988977880000192
Figure BDA0001988977880000193
Figure BDA0001988977880000194
Figure BDA0001988977880000195
e (x) in the formula (18)0,y0U, v) is the local region energy, representing the image FrtarPixel point in frame (x)0,y0) The energy of the forward optical flow vector (u, v) is FrtarIn the frame (x)0,y0) Window W as center1Weighted accumulation of energy items e of all internal pixel points;
Figure BDA0001988977880000196
in the method, O is set to be 20, and the change range of the optical flow vector (u, v) is represented; distance weight WdAnd a color weight wcBy pixel point (x)0,y0) Corresponding point (x) calculated from the optical flow (u, v)0+u,y0+ v) distance differences and color differences,setting a color parameter σc0.08 (empirical value), distance parameter σd5.5 (empirical value). The (u, v) that minimizes the E energy is the pixel (x)0,y0) For Fr, the optical flow vector estimation result oftarCalculating optical flow vectors of all pixel points on the frame image to obtain an image FrtarTo the image Frtar+1Forward optical flow of (2).
Also, Fr is calculated according to the method described abovetar+1Frame to FrtarThe backward optical flow of the frame.
(2) Occlusion point detection
Recording image FrtarFrame to image Frtar+1The frame forward optical flow is { (u)f(x),vf(y))|(x,y)∈Frtar}, and an image Frtar+1Frame to image FrtarThe inverse optical flow of (a) results in { (u)b(x′),vb(y′))|(x′,y′)∈Frtar+1}. Calculating | l (u) for pixel (x, y)f(x),vf(v))-(-ub(x+uf(x)),-vb(y+vf(y))) | |, if the value is not 0, the pixel point (x, y) is considered as a shielding point.
(3) Reestimation of occlusion point light flow
For pixels marked as occlusion points (x)0,y0) The optical flow energy is re-estimated using equation (19), denoted as Eb(x0,y0,u,v):
Figure BDA0001988977880000201
Figure BDA0001988977880000202
Figure BDA0001988977880000203
In the formula (19), the compound represented by the formula (I),
Figure BDA0001988977880000204
denotes FrtarFrame pixel (x)0,y0) The average value of energy items e corresponding to different optical flow estimated values;
Figure BDA0001988977880000205
denotes FrtarFrame pixel (x)0,y0) The minimum value of the corresponding energy term e is measured by the different optical flow estimation values; w is ar(x0,y0) The difference between the energy term e mean value and the minimum energy term e value is used for marking the pixel point (x) marked as shielding0,y0) Let EbMinimum (u, v) even pixel (x)0,y0) The optical flow vector of (a).
And (4) adopting the optical flow vector re-estimated in the step (3) for the final optical flow vector of the pixel marked as the occlusion point.
S2.1.2 superpixel temporal context and its feature representation
Method for calculating super pixel segmentation map by using S1.1 to FrtarFrame image Frar-1Frame image and Frtar+1The frame image is subjected to superpixel segmentation.
(1) Superpixel temporal context
First according to FrtarTo Frtar+1Forward optical flow calculation FrtarFrame superpixel SegtarAll contained pixel points { (x, y) | (x, y) ∈ SegtarForward optical flow of { (u)f(x),vf(y))|(x,y)∈SegtarMean value of }
Figure BDA0001988977880000206
As shown in equation (20):
Figure BDA0001988977880000207
in formula (20), Num (Seg)tar) Representing a super-pixel SegtarCalculating the number of contained pixel points, and calculating the superpixel Seg according to the forward optical flow mean valuetarContaining pixel points in Frtar+1To obtain the corresponding pixel ofTo region Segtar={(x′,y′)|x′=x+uf(x),y′=y+uf(y),(x,y)∈Segtar,(x′,y′)∈Frtar+1Is called super pixel SegtarIn Frtar+1The corresponding area of (a). Calculating Seg'tarAnd Frtar+1Ith super pixel in frame
Figure BDA0001988977880000211
The cross-over ratio IOU is as shown in equation (21):
Figure BDA0001988977880000212
in the formula (21), Num (·) indicates that the region includes the number of pixels. If it is
Figure BDA0001988977880000213
τ is according to Frtar+1To FrtarInverse optical flow computing superpixels
Figure BDA0001988977880000214
In FrtarCorresponding region Seg 'of frame'tarThe region Seg ″, is calculated according to equation (21)tarAnd super pixel SegtarCrow to IOU (Seg'tar,Segtar). If IOU (Segtar,Segtar) τ is then
Figure BDA0001988977880000215
Called super-pixel SegtarIn Frtar+1Corresponding super pixel (super pixel Seg)tarIn Frtar+1May be 0, 1 or more). In the present method, the intersection ratio determination threshold τ is set to 0.3 (empirical value). In the same way, find the super pixel SegtarIn Frtar-1Corresponding superpixel (superpixel Seg) of frametarIn Frtar-1May be 0, 1 or more).
Super pixel SegtarTime context memory of
Figure BDA0001988977880000216
Wherein
Figure BDA0001988977880000217
And
Figure BDA0001988977880000218
are each FrtarFrame superpixel SegtarIn Frtar-1Frame and Frtar+1A corresponding set of superpixels for the frame.
(2) Superpixel temporal context semantic feature representation
Superpixel temporal context SegstarIs characterized by a semantic feature of
Figure BDA0001988977880000219
As shown in formula (22):
Figure BDA00019889778800002110
Figure BDA00019889778800002111
is FrtarSuper pixel Seg in frametarIs characterized in that it is a mixture of two or more of the above-mentioned components,
Figure BDA00019889778800002112
is Frtar-1All corresponding superpixels in a frame
Figure BDA00019889778800002113
The mean value of the features is determined by the average,
Figure BDA00019889778800002114
is Frtar+1All corresponding superpixels in a frame
Figure BDA00019889778800002115
The mean of the features, the features of each superpixel, is calculated according to the method of section 1.3 of equation.
FrtarSuperpixel Seg in frametarIn Frtar+1Frame or Frtar-1Using its own characteristics when the number of corresponding superpixels of a frame is 0
Figure BDA00019889778800002116
Substitution
Figure BDA00019889778800002117
Or
Figure BDA00019889778800002118
S2.2 superpixel spatial context
The image is super-pixel segmented by the method of section S1.1, and fig. 2 shows a super-pixel hierarchical segmentation tree obtained according to a plurality of boundary judgment thresholds. When the threshold value of the super pixel hierarchical segmentation tree is set to be 1, a super pixel segmentation graph of the highest level, namely a root node of the hierarchical segmentation tree, can be obtained, and the node represents the whole image as a super pixel; setting the threshold value to be 0.06 to obtain a lower-level super pixel segmentation result; when the threshold is 0.08, the boundary determination criterion ratio is increased, so that pixel points with original boundary probability values of [0.06,0.08] are determined as non-boundary points, and the points are determined as boundary points when the threshold is 0.06. It can be seen that the super-pixels of the high level include the super-pixels of the low level. In the method, a spatial context of a parent node superpixel is defined as a child node superpixel in a hierarchical partition tree.
S3 semantic Classification
S3.1 temporal context based superpixel semantic classification
The method takes the temporal context features of the superpixels as input, performs superpixel semantic classification with a GBDT (gradient boosting decision tree), and outputs the predicted label of each superpixel.
In the GBDT training process, the training is set to run MR rounds, mr ∈ {1, 2, 3, ..., MR}. In round mr a regression tree (weak classifier) is trained for each class, i.e. L regression trees are trained when there are L classes, l ∈ {1, 2, 3, ..., L}. Finally L × MR weak classifiers are obtained. The training method is the same for each classifier in each round.
(1) GBDT multi-classifier training
The training set Fea_tr contains NSeg_tr samples: Fea_tr = {(Fea_i, lab_i) | i = 1, ..., NSeg_tr}, where the training sample Fea_i is the temporal context feature of the i-th superpixel and its true label is lab_i, lab_i ∈ {1, 2, 3, ..., L}.
First, round-0 initialization is performed: the prediction function value h_{l,0}(x) of the class-l classifier is set to 0. The true label lab_i is converted into an L-dimensional label vector lab_i = [lab_i[1], ..., lab_i[L]], lab_i[k] ∈ {0, 1}; if the true label of the i-th training sample is l, the l-th component lab_i[l] of the label vector is 1 and the other components are 0. The probability that the i-th sample belongs to class l is computed as

prob_{l,mr}(Fea_i) = exp(h_{l,mr}(Fea_i)) / Σ_{k=1..L} exp(h_{k,mr}(Fea_i))   (24)

I(lab_i = l) is an indicator function whose value is 1 when the label of sample i is l and 0 otherwise.
The prediction result of the class-l classifier of round mr-1 for the i-th sample is recorded as h_{l,(mr-1)}(Fea_i), and the classification error err_{l,mr-1}(Fea_i) of the i-th sample under the round-(mr-1) class-l classifier is defined as in formula (23):

err_{l,mr-1}(Fea_i) = I(lab_i = l) − prob_{l,mr-1}(Fea_i)   (23)

This yields the classification error set of round mr-1, {err_{l,mr-1}(Fea_i) | i = 1, ..., NSeg_tr}.
When the class-l classifier of round mr is constructed, the training sample data set Fea_tr is traversed: for each feature dimension par of each sample, the value of the par-th dimension of the i-th sample is taken as the classification reference value, and all samples in the data set Fea_tr are split; samples whose par-th feature value is larger than the reference value fall into the set {Region_1}, and the others fall into the set {Region_2}. After all samples are split, the regression tree error is computed according to formula (25):

err_tree = Σ_{m=1,2} Σ_{Fea_i ∈ Region_m} (err_{l,mr-1}(Fea_i) − ē_{Region_m})²   (25)

where ē_{Region_m} = (1/NRegion_m) Σ_{Fea_i ∈ Region_m} err_{l,mr-1}(Fea_i) and NRegion_m denotes the total number of samples falling into Region_m. The feature value that minimizes the regression tree error is finally selected as the new split value of the tree. The above process is repeated to grow the regression tree until the set height of the tree is reached; in this method the height of the regression tree is set to 5. The regression trees of the other classes in the current round are constructed in the same way.
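The exhaustive split search described above (try every feature dimension and every feature value, keep the split minimizing the error of formula (25)) can be sketched as follows. This is an illustrative single-split step under the stated assumptions, not the patented implementation.

import numpy as np

def best_split(features, residuals):
    """Choose the (dimension, value) split minimizing the regression-tree
    error of equation (25): the sum, over the two regions, of squared
    deviations of the residuals from the region mean."""
    best = (None, None, np.inf)
    n, d = features.shape
    for par in range(d):                              # every feature dimension
        for ref in np.unique(features[:, par]):       # every candidate value
            in1 = features[:, par] > ref              # Region1
            in2 = ~in1                                # Region2
            if not in1.any() or not in2.any():
                continue
            err = 0.0
            for region in (in1, in2):
                r = residuals[region]
                err += np.sum((r - r.mean()) ** 2)    # equation (25)
            if err < best[2]:
                best = (par, ref, err)
    return best   # (dimension, reference value, error)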
The number of leaf nodes of the class-l regression tree of round mr is recorded as Reg_mr,l; each leaf node is a subset of the training sample set, and the intersection of any two leaf nodes is the empty set. The gain value γ_{l,mr,reg} of each leaf node of the class-l regression tree constructed in round mr is computed as shown in formula (26):

γ_{l,mr,reg} = ((L−1)/L) · Σ_{Fea_i ∈ R_{l,mr,reg}} err_{l,mr-1}(Fea_i) / Σ_{Fea_i ∈ R_{l,mr,reg}} |err_{l,mr-1}(Fea_i)|·(1 − |err_{l,mr-1}(Fea_i)|)   (26)

where R_{l,mr,reg} denotes the sample set of the reg-th leaf node. The predicted value h_{l,mr}(Fea_i) of the class-l regression tree of round mr for the i-th sample is computed with formula (27):

h_{l,mr}(Fea_i) = h_{l,mr-1}(Fea_i) + Σ_{reg=1..Reg_mr,l} γ_{l,mr,reg}·I(Fea_i ∈ R_{l,mr,reg})   (27)

where reg ∈ {1, 2, ..., Reg_mr,l}.
The above procedure is iterated until the MR training rounds are finished. The predicted value h_{l,MR}(Fea_i) of the class-l regression tree of round MR for the i-th sample is expressed as (28):

h_{l,MR}(Fea_i) = h_{l,MR-1}(Fea_i) + Σ_{reg=1..Reg_MR,l} γ_{l,MR,reg}·I(Fea_i ∈ R_{l,MR,reg})   (28)

where reg ∈ {1, 2, ..., Reg_MR,l}.
Expanding h_{l,MR-1}(Fea_i) in formula (28) with the class-l regression tree of round MR-1 expresses the prediction of the i-th sample in terms of round MR-2, giving formula (29):

h_{l,MR}(Fea_i) = h_{l,MR-2}(Fea_i) + Σ_{reg=1..Reg_MR-1,l} γ_{l,MR-1,reg}·I(Fea_i ∈ R_{l,MR-1,reg}) + Σ_{reg=1..Reg_MR,l} γ_{l,MR,reg}·I(Fea_i ∈ R_{l,MR,reg})   (29)

By analogy, expanding the prediction of the i-th sample through the class-l regression trees of rounds MR-1 down to round 0 (with h_{l,0} = 0) gives formula (30):

h_{l,MR}(Fea_i) = Σ_{mr=1..MR} Σ_{reg=1..Reg_mr,l} γ_{l,mr,reg}·I(Fea_i ∈ R_{l,mr,reg})   (30)
(2) GBDT prediction
The temporal context feature Fea_Seg of a superpixel Seg is computed; the predicted values h_{l,MR}(Fea_Seg) of the superpixel Seg for the different classes are computed with formula (30), and then the probability values prob_{l,MR}(Fea_Seg) of the superpixel Seg belonging to the different classes are computed with formula (24). The class l with the highest probability value is the predicted class of the superpixel Seg.
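A minimal Python sketch of this prediction step is given below. It assumes each trained regression tree is available as a callable returning its leaf gain for a sample, which is a representational assumption, not part of the patent.

import numpy as np

def gbdt_predict(fea, trees):
    """trees[mr][l] is assumed to return the leaf gain of the class-l
    regression tree of round mr for feature fea; equation (30) accumulates
    the gains, and equation (24) turns the scores into class probabilities
    via a softmax."""
    L = len(trees[0])
    h = np.zeros(L)
    for round_trees in trees:                 # rounds 1..MR
        for l in range(L):
            h[l] += round_trees[l](fea)       # add the leaf gain gamma
    prob = np.exp(h - h.max())
    prob /= prob.sum()                        # equation (24)
    return int(np.argmax(prob)), prob         # predicted class, probabilities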
S3.2 optimizing semantic classification based on spatial context
When the method is used for carrying out superpixel segmentation on an image, two boundary judgment thresholds of 0.06 and 0.08 are set, so that a hierarchical segmentation tree with the height of 2 is obtained, as shown in fig. 3.
In the method, the semantic annotation of the superpixel determined by the 0.08 threshold is taken as an optimization target, and the superpixel determined by the 0.06 segmentation threshold is taken as a spatial context and is used for optimizing a semantic annotation result.
Firstly, according to the method of S3.1, each superpixel corresponding to the leaf nodes and the intermediate nodes in fig. 3 is semantically classified, giving the semantic labeling probability of every superpixel in the superpixel segmentation maps under the thresholds 0.06 and 0.08; the final semantic label of the superpixel block is then computed with formula (31).
l* = argmax_{l ∈ {1,...,L}} ( w_target·prob^tar_l + w_aux·(1/Naux)·Σ_{a=1..Naux} prob^{aux_a}_l )   (31)

where l* denotes the final semantic label of the superpixel block, i.e. the class with the maximum value computed by formula (31); prob^{aux_a}_l denotes the probability that the a-th superpixel of the 0.06-threshold superpixel set contained in the 0.08-threshold superpixel has semantic label l; prob^tar_l is the probability that the 0.08-threshold superpixel has semantic label l. Naux denotes the number of 0.06-threshold superpixels contained in the 0.08-threshold superpixel; w_aux is the confidence of the 0.06-threshold superpixel semantic labeling, set to 0.4 in this method; w_target is the confidence of the 0.08-threshold superpixel semantic labeling, set to 0.6 in this method.
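The weighted combination of formula (31) reduces to a few lines; the sketch below is illustrative, with the probability vectors assumed to come from the S3.1 classifier.

import numpy as np

def fuse_labels(prob_target, probs_aux, w_target=0.6, w_aux=0.4):
    """Equation (31): fuse the class probabilities of a 0.08-threshold target
    superpixel with the averaged probabilities of the 0.06-threshold child
    superpixels it contains, and return the label with the largest score."""
    prob_aux = np.mean(probs_aux, axis=0)     # average over the Naux children
    score = w_target * np.asarray(prob_target) + w_aux * prob_aux
    return int(np.argmax(score)), score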
Table 1 compares the class-average accuracy of this method with that of other RGB-D indoor scene labeling methods based on defined features, in the 13-class semantic labeling experiment on the NYU V2 dataset (the table itself is reproduced as an image in the original publication).
[1] C. Couprie, C. Farabet, L. Najman and Y. LeCun. Indoor semantic segmentation using depth information. In ICLR, 2013.
[2] A. Hermans, G. Floros, and B. Leibe. Dense 3D semantic mapping of indoor scenes from RGB-D images. In ICRA, 2014.
[3] A. Wang, J. Lu, J. Cai, G. Wang, and T.-J. Cham. Unsupervised joint feature learning and encoding for RGB-D scene labeling. IEEE Transactions on Image Processing (TIP), 2015.
[4] J. Wang, Z. Wang, D. Tao, S. See and G. Wang. Learning common and specific features for RGB-D semantic segmentation with deconvolutional networks. In ECCV, 2016.

Claims (2)

1. An RGB-D indoor scene labeling method based on superpixel spatio-temporal context, characterized in that: the input is the image Fr_tar to be annotated and its temporally adjacent preceding and following frames Fr_tar-1 and Fr_tar+1, and the output is the pixel-level labeling of Fr_tar;
for each superpixel of the image Fr_tar to be annotated, its corresponding superpixels in the temporally adjacent frames Fr_tar-1 and Fr_tar+1, i.e. its temporal context, are computed based on an optical flow algorithm; the images are superpixel-segmented with the gPb/UCM algorithm, and the segmentation results are organized into a segmentation tree according to thresholds; the child nodes of each superpixel of Fr_tar in the segmentation tree are its spatial context;
a temporal-context-based feature representation is constructed for each superpixel of Fr_tar, and the superpixels are classified with a gradient boosting tree based on the temporal context features; the semantic annotation of the superpixels in Fr_tar is obtained by weighted combination of the superpixel's classification result with the semantic classification results of its spatial context;
s1 super pixel
In the field of computer vision, the process of subdividing a digital image into a plurality of image sub-regions is called superpixel segmentation; the super-pixel is a region formed by a series of pixel points with adjacent positions and similar color, brightness and texture characteristics, the region retains local effective information and cannot damage the boundary information of an object in an image;
s1.1 superpixel segmentation of images
Superpixel segmentation uses the gPb/UCM algorithm, which computes for every pixel a probability value Pb(x, y) of belonging to a boundary from local and global image features; the gPb/UCM algorithm is applied to the color image and to the depth image respectively, and the combined boundary probability is computed according to formula (1), in which Pb_rgb(x, y) is the probability that a pixel belongs to a boundary computed from the color image and Pb_d(x, y) is the probability that a pixel belongs to a boundary computed from the depth image;
according to the probability value Pb(x, y) obtained from formula (1), different probability thresholds tr are set to obtain a multi-level segmentation result;
the set different probability threshold values tr are respectively 0.06 and 0.08, and the pixels with the probability values smaller than the set probability threshold values are connected into a region according to an eight-connection principle, wherein each region is a super pixel;
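A minimal sketch of this thresholding step, using 8-connected component labeling from SciPy, is shown below; how the boundary pixels themselves are assigned is an assumption of the sketch, not specified here.

import numpy as np
from scipy import ndimage

def superpixels_from_boundary(pb, tr):
    """Threshold the gPb/UCM boundary probability map pb at tr and group the
    non-boundary pixels into 8-connected regions; every region is one
    superpixel. Boundary pixels (pb >= tr) keep label 0 in this sketch."""
    non_boundary = pb < tr
    labels, num = ndimage.label(non_boundary, structure=np.ones((3, 3), int))
    return labels, num

# e.g. labels_006, _ = superpixels_from_boundary(pb, 0.06)
#      labels_008, _ = superpixels_from_boundary(pb, 0.08)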
s1.2Patch feature
A Patch is defined as a grid of size h × h that slides from the upper-left corner of the color image and the depth image to the right and downward with a step length of hs pixels, finally forming a dense grid over the color image and the depth image; the Patch size is 16 × 16, the sliding step hs is 2, the image size is N × M, and the final number of Patches is (⌊(N − 16)/2⌋ + 1) × (⌊(M − 16)/2⌋ + 1);
Four types of features are calculated for each Patch: depth gradient features, color gradient features, color features and texture features;
s1.3 superpixel features
The superpixel feature F_seg is defined as formula (6):
F_seg = [F_g_d^seg, F_g_c^seg, F_col^seg, F_tex^seg, F_geo^seg]   (6)
F_g_d^seg, F_g_c^seg, F_col^seg and F_tex^seg respectively denote the superpixel depth gradient feature, color gradient feature, color feature and texture feature, defined as formula (7):
F_•^seg = (1/n) Σ_{q1=1..n} F_•(q1),  • ∈ {g_d, g_c, col, tex}   (7)
In formula (7), F_g_d(q1), F_g_c(q1), F_col(q1), F_tex(q1) denote the features of the q1-th Patch whose center position falls within the superpixel seg, and n denotes the number of Patches whose center positions fall within the superpixel seg;
The superpixel geometric feature F_geo^seg is defined by formula (8) as the vector of the following components:
F_geo^seg = [A_seg, P_seg, R_seg, H_xx, H_yy, H_xy, D_mean, D_sq, D_var, D_miss, N_seg]   (8)
the components in equation (8) are defined as follows:
superpixel area A_seg = Σ_{s∈seg} 1, where s are the pixels within the superpixel seg; the superpixel perimeter P_seg is obtained from the boundary pixel set B_seg, defined as formula (9):
P_seg = Σ_{s∈B_seg} 1,  B_seg = {s ∈ seg | ∃ s′ ∈ N_4(s): s′ ∈ seg′, seg′ ≠ seg} ∪ {s ∈ seg | s_x ∈ {1, M} or s_y ∈ {1, N}}   (9)
in formula (9), M, N represents the horizontal and vertical resolutions of the RGB scene image, respectively; seg, seg' represent different superpixels; n is a radical of4(s) is a set of four-neighbor domains of pixel s; b issegIs the set of boundary pixels of the super-pixel seg;
the area-to-perimeter ratio R_seg of the superpixel is defined in formula (10):
R_seg = A_seg / P_seg   (10)
H_xx, H_yy and H_xy are the second-order Hu moments computed from the x coordinate s_x of pixel s, the y coordinate s_y, and the product of the x and y coordinates, defined as formulas (11), (12) and (13):
H_xx = (1/A_seg) Σ_{s∈seg} s_x² − (s̄_x)²   (11)
H_yy = (1/A_seg) Σ_{s∈seg} s_y² − (s̄_y)²   (12)
H_xy = (1/A_seg) Σ_{s∈seg} s_x·s_y − s̄_x·s̄_y   (13)
In the above, s̄_x, s̄_y, (s̄_x)² and (s̄_y)² respectively denote the mean of the x coordinates, the mean of the y coordinates, the square of the mean of the x coordinates and the square of the mean of the y coordinates of the pixels contained in the superpixel, defined as formula (14):
s̄_x = (1/A_seg) Σ_{s∈seg} s_x,  s̄_y = (1/A_seg) Σ_{s∈seg} s_y   (14)
Width and Height respectively denote the image width and height; the coordinates are first normalized as s_x ← s_x / Width, s_y ← s_y / Height, and the calculation is performed on the normalized pixel coordinate values;
D_mean, D_sq and D_var respectively denote the mean of the depth values s_d of the pixels s within the superpixel seg, the mean of the squared depth values, and the variance of the depth values, defined as (15):
D_mean = (1/A_seg) Σ_{s∈seg} s_d,  D_sq = (1/A_seg) Σ_{s∈seg} s_d²,  D_var = D_sq − (D_mean)²   (15)
D_miss is the proportion of pixels within the superpixel that lack depth information, defined as (16):
D_miss = Num({s ∈ seg | s_d is missing}) / A_seg   (16)
Nsegis the principal normal vector mode length of the point cloud corresponding to the superpixel, wherein the principal normal vector of the point cloud corresponding to the superpixel is estimated by Principal Component Analysis (PCA);
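A short sketch of the PCA estimate of the principal normal is given below for illustration; the back-projection of the superpixel's pixels to 3-D points and the function name are assumptions of the sketch.

import numpy as np

def principal_normal(points):
    """Estimate the principal normal of a superpixel's 3-D point cloud by PCA:
    the eigenvector of the covariance matrix associated with the smallest
    eigenvalue. points is an (n, 3) array of back-projected pixels."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / max(len(points) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    return eigvecs[:, 0]                      # normal direction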
s2 superpixel context
Respectively constructing a time context and a space context based on the RGB-D image sequence time sequence relation and a tree structure of super-pixel segmentation;
s2.1 superpixel temporal context
S2.1.1 interframe optical flow calculation
Defining the optical flow obtained by calculating from the target frame to the reference frame as a forward optical flow, and defining the optical flow obtained by calculating from the reference frame to the target frame as a reverse optical flow;
(1) initial optical flow estimation
The inter-frame initial optical flow estimation adopts the SimpleFlow method; for two frame images Fr_tar and Fr_tar+1, (x, y) denotes a pixel point in Fr_tar and (u(x, y), v(x, y)) denotes the optical flow vector at (x, y); defining the image Fr_tar as the target frame and the image Fr_tar+1 as the reference frame, the forward optical flow from image Fr_tar to image Fr_tar+1 is the set of optical flow vectors of all pixel points in Fr_tar, i.e. {(u(x, y), v(x, y)) | (x, y) ∈ Fr_tar}; when u(x, y) and v(x, y) are abbreviated as u and v, the pixel in Fr_tar+1 corresponding to the pixel (x, y) of Fr_tar computed from the optical flow is (x + u, y + v);
first, the forward optical flow from image Fr_tar to image Fr_tar+1 is computed; for a pixel (x0, y0) of Fr_tar, a window W1 of size b × b centered on it is taken, where b = 10; for an arbitrary point (p, q) in W1, the corresponding pixel point in Fr_tar+1 is (p + u, q + v), and the energy term e is computed at all points in the window W1 as in (17):
e(p, q, u, v) = ||Int_tar(p, q) − Int_tar+1(p + u, q + v)||²   (17)
where (p, q) ∈ W1, Int_tar(p, q) denotes the color information of pixel (p, q) in Fr_tar, and Int_tar+1(p + u, q + v) denotes the color information of pixel (p + u, q + v) in Fr_tar+1; computing e for each pair of points in the window in turn yields a b²-dimensional vector e;
then, based on a locally smooth likelihood model, the optical flow vector is optimized by combining the color feature and the local distance feature, as shown in formula (18):
E(x0, y0, u, v) = Σ_{(p,q)∈W1} w_d·w_c·e(p, q, u, v)   (18)
in formula (18), E(x0, y0, u, v) is the local-region energy, representing the energy of the forward optical flow vector (u, v) at pixel (x0, y0) of image Fr_tar; it is the weighted accumulation of the energy terms e of all pixels within the window W1 centered on (x0, y0) in Fr_tar; the components of the optical flow vector (u, v) vary within the range determined by O = 20; the distance weight w_d and the color weight w_c are determined by the distance difference and the color difference between the pixel (x0, y0) and the corresponding point (x0 + u, y0 + v) computed from the optical flow (u, v), with the color parameter set to σ_c = 0.08 and the distance parameter to σ_d = 5.5; the (u, v) minimizing the energy E is taken as the optical flow vector estimate of pixel (x0, y0), and computing the optical flow vectors of all pixels of image Fr_tar yields the forward optical flow from image Fr_tar to image Fr_tar+1;
likewise, Fr is calculatedtar+1To FrtarThe backward light flow of (2);
(2) occlusion point detection
Denote the forward optical flow from image Fr_tar to image Fr_tar+1 as {(u_f(x), v_f(y)) | (x, y) ∈ Fr_tar} and the backward optical flow from image Fr_tar+1 to image Fr_tar as {(u_b(x′), v_b(y′)) | (x′, y′) ∈ Fr_tar+1}; for a pixel (x, y), compute ||(u_f(x), v_f(y)) − (−u_b(x + u_f(x)), −v_b(y + v_f(y)))||; if this value is not 0, the pixel (x, y) is regarded as an occlusion point;
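The forward-backward consistency test can be sketched as follows; the tolerance eps is an assumption of the sketch (the claim compares against exactly 0).

import numpy as np

def occlusion_mask(flow_fwd, flow_bwd, eps=0.5):
    """Warp the backward flow to the positions reached by the forward flow and
    compare: for visible pixels the two flows should cancel, so a large
    mismatch marks the pixel as an occlusion point."""
    H, W = flow_fwd.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    x2 = np.clip(np.round(xs + flow_fwd[..., 0]).astype(int), 0, W - 1)
    y2 = np.clip(np.round(ys + flow_fwd[..., 1]).astype(int), 0, H - 1)
    diff = flow_fwd + flow_bwd[y2, x2]        # should cancel for visible pixels
    return np.linalg.norm(diff, axis=2) > eps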
(3) reestimation of occlusion point light flow
For pixels marked as occlusion points (x)0,y0) The optical flow energy is re-estimated using equation (19), denoted as Eb(x0,y0,u,v):
Figure FDA0003169217590000051
Figure FDA0003169217590000052
Figure FDA0003169217590000053
In the formula (19), the compound represented by the formula (I),
Figure FDA0003169217590000054
denotes FrtarPixel point (x)0,y0) The average value of energy items e corresponding to different optical flow estimated values;
Figure FDA0003169217590000055
denotes FrtarPixel point (x)0,y0) The minimum value of the corresponding energy term e is measured by the different optical flow estimation values; w is ar(x0,y0) The difference between the energy term e mean value and the minimum energy term e value is used for marking the pixel point (x) marked as shielding0,y0) Let EbThe smallest (u, v) is the pixel (x)0,y0) An optical flow vector of (d);
adopting the optical flow vector re-estimated in the step (3) for the final optical flow vector of the pixel marked as the occlusion point;
s2.1.2 superpixel temporal context and its feature representation
Image Fr by using the method of super-pixel segmentation map calculated by S1.1tarImage Frtar-1And an image Frtar+1Performing super-pixel segmentation;
(1) superpixel temporal context
First, according to the forward optical flow from Fr_tar to Fr_tar+1, the mean (ū_f, v̄_f) of the forward optical flow {(u_f(x), v_f(y)) | (x, y) ∈ Seg_tar} of all pixels {(x, y) | (x, y) ∈ Seg_tar} contained in the superpixel Seg_tar of Fr_tar is computed, as shown in equation (20):
(ū_f, v̄_f) = (1/Num(Seg_tar)) Σ_{(x,y)∈Seg_tar} (u_f(x), v_f(y))   (20)
in formula (20), Num(Seg_tar) denotes the number of pixels contained in the superpixel Seg_tar; mapping the pixels contained in the superpixel Seg_tar to their corresponding pixels in Fr_tar+1 according to the forward optical flow mean value gives the region Seg′_tar = {(x′, y′) | x′ = x + u_f(x), y′ = y + v_f(y), (x, y) ∈ Seg_tar, (x′, y′) ∈ Fr_tar+1}, called the corresponding region of the superpixel Seg_tar in Fr_tar+1; the intersection-over-union IOU between Seg′_tar and the i-th superpixel Seg^i_tar+1 of Fr_tar+1 is computed as shown in equation (21):
IOU(Seg′_tar, Seg^i_tar+1) = Num(Seg′_tar ∩ Seg^i_tar+1) / Num(Seg′_tar ∪ Seg^i_tar+1)   (21)
in formula (21), Num(·) denotes the number of pixels contained in a region; if IOU(Seg′_tar, Seg^i_tar+1) ≥ τ, the corresponding region Seg″_tar of the superpixel Seg^i_tar+1 in frame Fr_tar is computed according to the backward optical flow from Fr_tar+1 to Fr_tar, and the intersection-over-union IOU(Seg″_tar, Seg_tar) between the region Seg″_tar and the superpixel Seg_tar is computed according to equation (21); if IOU(Seg″_tar, Seg_tar) ≥ τ, then Seg^i_tar+1 is called a corresponding superpixel of the superpixel Seg_tar in Fr_tar+1, and the number of corresponding superpixels of the superpixel Seg_tar in Fr_tar+1 is 0, 1 or more; the intersection-over-union decision threshold τ is set to 0.3; the corresponding superpixels of the superpixel Seg_tar in Fr_tar-1 are found in the same way, their number likewise being 0, 1 or more;
the temporal context of the superpixel Seg_tar is recorded as Segs_tar = {Seg_tar, S_tar-1, S_tar+1}, wherein S_tar-1 and S_tar+1 are respectively the sets of corresponding superpixels of the Fr_tar-frame superpixel Seg_tar in Fr_tar-1 and Fr_tar+1;
(2) superpixel temporal context semantic feature representation
the semantic feature Fea_Segs_tar of the superpixel temporal context Segs_tar is shown in formula (22):
Fea_Segs_tar = [F_Seg_tar, F̄_tar-1, F̄_tar+1]   (22)
F_Seg_tar is the feature of the superpixel Seg_tar in Fr_tar, F̄_tar-1 is the mean of the features of all corresponding superpixels in Fr_tar-1, and F̄_tar+1 is the mean of the features of all corresponding superpixels in Fr_tar+1; the feature of each superpixel is computed according to the method of S1.3;
when the number of corresponding superpixels of the superpixel Seg_tar of Fr_tar in Fr_tar+1 or Fr_tar-1 is 0, its own feature F_Seg_tar is used in place of F̄_tar+1 or F̄_tar-1;
S2.2 superpixel spatial context
The image is superpixel-segmented with the method of S1.1; when the threshold of the superpixel hierarchical segmentation tree is set to 1, the superpixel segmentation map of the highest level is obtained, i.e. the root node of the hierarchical segmentation tree, which represents the whole image as one superpixel; setting the threshold to 0.06 gives a lower-level superpixel segmentation result; when the threshold is 0.08, the boundary decision criterion is raised, so that pixels whose boundary probability values lie in [0.06, 0.08) are judged as non-boundary points, whereas they are judged as boundary points when the threshold is 0.06; each high-level superpixel therefore contains one or more low-level superpixels; in the hierarchical segmentation tree, the spatial context of a parent-node superpixel is defined as its child-node superpixels;
s3 semantic Classification
S3.1 temporal context based superpixel semantic classification
Taking the temporal context characteristics of the superpixels as input, performing superpixel semantic classification by using GBDT, and outputting a prediction label of the superpixels;
in the GBDT training process, the training is set to run MR rounds, mr ∈ {1, 2, 3, ..., MR}; in round mr a regression tree, i.e. a weak classifier, is trained for each class, so that L regression trees are trained when there are L classes, l ∈ {1, 2, 3, ..., L}; finally L × MR weak classifiers are obtained; the training method for each classifier is the same in each round;
(1) GBDT multi-classifier training
the training set Fea_tr contains NSeg_tr samples: Fea_tr = {(Fea_i, lab_i) | i = 1, ..., NSeg_tr}, wherein the training sample Fea_i is the temporal context feature of the i-th superpixel and its true label is lab_i, lab_i ∈ {1, 2, 3, ..., L};
first, round-0 initialization is performed: the prediction function value h_{l,0}(x) of the class-l classifier is set to 0; the true label lab_i is converted into an L-dimensional label vector lab_i = [lab_i[1], ..., lab_i[L]], lab_i[k] ∈ {0, 1}; if the true label of the i-th training sample is l, the l-th component lab_i[l] of the label vector is 1 and the other components are 0; the probability that the i-th sample belongs to class l is computed as
prob_{l,mr}(Fea_i) = exp(h_{l,mr}(Fea_i)) / Σ_{k=1..L} exp(h_{k,mr}(Fea_i))   (24)
I(lab_i = l) is an indicator function whose value is 1 when the label of sample i is l and 0 otherwise;
the prediction result of the class-l classifier of round mr-1 for the i-th sample is recorded as h_{l,(mr-1)}(Fea_i), and the classification error err_{l,mr-1}(Fea_i) of the i-th sample under the round-(mr-1) class-l classifier is defined as in formula (23):
err_{l,mr-1}(Fea_i) = I(lab_i = l) − prob_{l,mr-1}(Fea_i)   (23)
this yields the classification error set of round mr-1, {err_{l,mr-1}(Fea_i) | i = 1, ..., NSeg_tr};
when the class-l classifier of round mr is constructed, the training set Fea_tr is traversed: for each feature dimension par of each sample, the value of the par-th dimension of the i-th sample is taken as the classification reference value and all samples in the training set Fea_tr are split; samples whose par-th feature value is larger than the reference value fall into the set {Region_1}, and the others fall into the set {Region_2}; after all samples are split, the regression tree error is computed according to formula (25):
err_tree = Σ_{m=1,2} Σ_{Fea_i ∈ Region_m} (err_{l,mr-1}(Fea_i) − ē_{Region_m})²   (25)
wherein ē_{Region_m} = (1/NRegion_m) Σ_{Fea_i ∈ Region_m} err_{l,mr-1}(Fea_i), m = 1, 2, and NRegion_m denotes the total number of samples falling into Region_m; the feature value that minimizes the regression tree error is finally selected as the new split value of the tree; the regression tree is grown by repeating the above process until the set height of the tree is reached, the height of the regression tree being 5; the regression trees of the other classes in the current round are constructed in the same way;
the number of leaf nodes of the class-l regression tree of round mr is recorded as Reg_mr,l; each leaf node is a subset of the training sample set, and the intersection of any two leaf nodes is the empty set; the gain value γ_{l,mr,reg} of each leaf node of the class-l regression tree constructed in round mr is computed as shown in formula (26):
γ_{l,mr,reg} = ((L−1)/L) · Σ_{Fea_i ∈ R_{l,mr,reg}} err_{l,mr-1}(Fea_i) / Σ_{Fea_i ∈ R_{l,mr,reg}} |err_{l,mr-1}(Fea_i)|·(1 − |err_{l,mr-1}(Fea_i)|)   (26)
wherein R_{l,mr,reg} denotes the sample set of the reg-th leaf node; the predicted value h_{l,mr}(Fea_i) of the class-l regression tree of round mr for the i-th sample is computed with formula (27):
h_{l,mr}(Fea_i) = h_{l,mr-1}(Fea_i) + Σ_{reg=1..Reg_mr,l} γ_{l,mr,reg}·I(Fea_i ∈ R_{l,mr,reg})   (27)
wherein reg ∈ {1, 2, ..., Reg_mr,l};
the above procedure is iterated until the MR training rounds are finished; the predicted value h_{l,MR}(Fea_i) of the class-l regression tree of round MR for the i-th sample is expressed as (28):
h_{l,MR}(Fea_i) = h_{l,MR-1}(Fea_i) + Σ_{reg=1..Reg_MR,l} γ_{l,MR,reg}·I(Fea_i ∈ R_{l,MR,reg})   (28)
wherein reg ∈ {1, 2, ..., Reg_MR,l};
expanding h_{l,MR-1}(Fea_i) in formula (28) with the class-l regression tree of round MR-1 expresses the prediction of the i-th sample in terms of round MR-2, giving formula (29):
h_{l,MR}(Fea_i) = h_{l,MR-2}(Fea_i) + Σ_{reg=1..Reg_MR-1,l} γ_{l,MR-1,reg}·I(Fea_i ∈ R_{l,MR-1,reg}) + Σ_{reg=1..Reg_MR,l} γ_{l,MR,reg}·I(Fea_i ∈ R_{l,MR,reg})   (29)
by analogy, expanding the prediction of the i-th sample through the class-l regression trees of rounds MR-1 down to round 0, with h_{l,0} = 0, gives formula (30):
h_{l,MR}(Fea_i) = Σ_{mr=1..MR} Σ_{reg=1..Reg_mr,l} γ_{l,mr,reg}·I(Fea_i ∈ R_{l,mr,reg})   (30)
(2) GBDT prediction
the temporal context feature Fea_Seg of a superpixel Seg is computed; the predicted values h_{l,MR}(Fea_Seg) of the superpixel Seg for the different classes are computed with formula (30), and then the probability values prob_{l,MR}(Fea_Seg) of the superpixel Seg belonging to the different classes are computed with formula (24); the class l with the highest probability value is the predicted class of the superpixel Seg;
s3.2 optimizing semantic classification based on spatial context
When the image is subjected to superpixel segmentation, two boundary judgment thresholds of 0.06 and 0.08 are set, so that a hierarchical segmentation tree with the height of 2 is obtained;
the semantic annotation of the superpixel determined by the threshold of 0.08 is taken as an optimization target, and the superpixel determined by the threshold of 0.06 segmentation is taken as a spatial context and is used for optimizing a semantic annotation result;
firstly, according to the method of S3.1, each superpixel corresponding to the leaf nodes and the intermediate nodes is semantically classified, giving the semantic labeling probability of every superpixel in the superpixel segmentation maps under the thresholds 0.06 and 0.08, and the final semantic label of the superpixel block is computed with formula (31);
l* = argmax_{l ∈ {1,...,L}} ( w_target·prob^tar_l + w_aux·(1/Naux)·Σ_{a=1..Naux} prob^{aux_a}_l )   (31)
wherein l* denotes the final semantic label of the superpixel block, i.e. the class with the maximum value computed by formula (31); prob^{aux_a}_l denotes the probability that the a-th superpixel of the 0.06-threshold superpixel set contained in the 0.08-threshold superpixel has semantic label l; prob^tar_l is the probability that the 0.08-threshold superpixel has semantic label l; Naux denotes the number of 0.06-threshold superpixels contained in the 0.08-threshold superpixel; w_aux is the confidence of the 0.06-threshold superpixel semantic labeling, taken as 0.4; w_target is the confidence of the 0.08-threshold superpixel semantic labeling, taken as 0.6.
2. The RGB-D indoor scene labeling method based on superpixel spatiotemporal context as claimed in claim 1, characterized in that: the implementation of the S1.2Patch feature is as follows,
s1.2.1 depth gradient feature
The Patch in the depth image is recorded as Z_d; for each Z_d the depth gradient feature F_g_d is computed, where the value of its t-th component is defined by formula (2):
F_g_d^t(Z_d) = Σ_{z∈Z_d} m̃_d(z) · α_g_d^t [k_g_d(θ̃_d(z)) ⊗ k_s(z)]   (2)
In formula (2), z ∈ Z_d denotes the relative two-dimensional coordinate position of pixel z in the depth Patch; θ̃_d(z) and m̃_d(z) respectively denote the depth gradient direction and the gradient magnitude of pixel z; {g_1, ..., g_d_g} and {s_1, ..., s_d_s} are respectively the depth gradient basis vectors and the position basis vectors, both sets being predefined values; d_g and d_s respectively denote the number of depth gradient basis vectors and the number of position basis vectors; α_g_d^t is the mapping coefficient of the t-th principal component obtained by applying kernel principal component analysis (KPCA), and ⊗ denotes the Kronecker product of the kernel vectors k_g_d(θ̃_d(z)) = [k_g_d(θ̃_d(z), g_1), ..., k_g_d(θ̃_d(z), g_d_g)] and k_s(z) = [k_s(z, s_1), ..., k_s(z, s_d_s)]; k_g_d(·,·) and k_s(·,·) are respectively the depth gradient Gaussian kernel function and the position Gaussian kernel function, with γ_g_d and γ_s the parameters of the corresponding Gaussian kernel functions; finally, the depth gradient feature is transformed with the EMK algorithm, and the transformed feature vector is still recorded as F_g_d;
S1.2.2 color gradient feature
The Patch in the color image is recorded as Z_c; for each Z_c the color gradient feature F_g_c is computed, where the value of its t-th component is defined by formula (3):
F_g_c^t(Z_c) = Σ_{z∈Z_c} m̃_c(z) · α_g_c^t [k_g_c(θ̃_c(z)) ⊗ k_s(z)]   (3)
In formula (3), z ∈ Z_c denotes the relative two-dimensional coordinate position of pixel z in the color image Patch; θ̃_c(z) and m̃_c(z) respectively denote the gradient direction and the gradient magnitude of pixel z; {g_1, ..., g_c_g} and {s_1, ..., s_c_s} are respectively the color gradient basis vectors and the position basis vectors, both sets being predefined values; c_g and c_s respectively denote the number of color gradient basis vectors and the number of position basis vectors; α_g_c^t is the mapping coefficient of the t-th principal component obtained by applying kernel principal component analysis (KPCA), and ⊗ denotes the Kronecker product; k_g_c(·,·) and k_s(·,·) are respectively the color gradient Gaussian kernel function and the position Gaussian kernel function, with γ_g_c and γ_s the parameters of the corresponding Gaussian kernel functions; finally, the color gradient feature is transformed with the EMK algorithm, and the transformed feature vector is still recorded as F_g_c;
S1.2.3 color characteristics
The Patch in the color image is recorded as Z_c; for each Z_c the color feature F_col is computed, where the value of its t-th component is defined by formula (4):
F_col^t(Z_c) = Σ_{z∈Z_c} α_col^t [k_col(r(z)) ⊗ k_s(z)]   (4)
In formula (4), z ∈ Z_c denotes the relative two-dimensional coordinate position of pixel z in the color image Patch; r(z) is a three-dimensional vector, the RGB value of pixel z; {r_1, ..., r_c_c} and {s_1, ..., s_c_s} are respectively the color basis vectors and the position basis vectors, both sets being predefined values; c_c and c_s respectively denote the number of color basis vectors and the number of position basis vectors; α_col^t is the mapping coefficient of the t-th principal component obtained by applying kernel principal component analysis (KPCA), and ⊗ denotes the Kronecker product; k_col(·,·) and k_s(·,·) are respectively the color Gaussian kernel function and the position Gaussian kernel function, with γ_col and γ_s the parameters of the corresponding Gaussian kernel functions; finally, the color feature is transformed with the EMK algorithm, and the transformed feature vector is still recorded as F_col;
S1.2.4 textural features
First, the RGB scene image is converted into a gray-scale image, and the Patch in the gray-scale image is recorded as Z_g; for each Z_g the texture feature F_tex is computed, where the value of its t-th component is defined by formula (5):
F_tex^t(Z_g) = Σ_{z∈Z_g} s(z) · α_tex^t [k_b(lbp(z)) ⊗ k_s(z)]   (5)
In formula (5), z ∈ Z_g denotes the relative two-dimensional coordinate position of pixel z in the gray-scale Patch; s(z) denotes the standard deviation of the pixel gray values in the 3 × 3 region centered on pixel z; lbp(z) is the local binary pattern feature of pixel z; {b_1, ..., b_g_b} and {s_1, ..., s_g_s} are respectively the local binary pattern basis vectors and the position basis vectors, both sets being predefined values; g_b and g_s respectively denote the number of local binary pattern basis vectors and the number of position basis vectors; α_tex^t is the mapping coefficient of the t-th principal component obtained by applying kernel principal component analysis (KPCA), and ⊗ denotes the Kronecker product; k_b(·,·) and k_s(·,·) are respectively the local binary pattern Gaussian kernel function and the position Gaussian kernel function, with γ_b and γ_s the parameters of the corresponding Gaussian kernel functions; finally, the texture feature is transformed with the EMK algorithm, and the transformed feature vector is still recorded as F_tex.
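As an illustration of the per-pixel quantities entering formula (5), the following sketch computes lbp(z) and s(z) for one Patch with scikit-image and SciPy; the EMK/KPCA aggregation itself is omitted, and the LBP parameter choices are assumptions of the sketch, not the patent's values.

import numpy as np
from scipy import ndimage
from skimage.feature import local_binary_pattern

def texture_patch_inputs(gray, y, x, h=16):
    """For one h x h Patch of the gray-scale image: the local binary pattern
    lbp(z) of every pixel and the standard deviation s(z) of the gray values
    in each pixel's 3x3 neighbourhood."""
    patch = gray[y:y + h, x:x + h].astype(float)
    lbp = local_binary_pattern(patch, P=8, R=1, method="uniform")
    mean = ndimage.uniform_filter(patch, size=3)
    sq_mean = ndimage.uniform_filter(patch ** 2, size=3)
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))   # s(z)
    return lbp, std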
CN201910174110.2A 2019-03-08 2019-03-08 RGB-D indoor scene labeling method based on super-pixel space-time context Active CN109829449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910174110.2A CN109829449B (en) 2019-03-08 2019-03-08 RGB-D indoor scene labeling method based on super-pixel space-time context

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910174110.2A CN109829449B (en) 2019-03-08 2019-03-08 RGB-D indoor scene labeling method based on super-pixel space-time context

Publications (2)

Publication Number Publication Date
CN109829449A CN109829449A (en) 2019-05-31
CN109829449B true CN109829449B (en) 2021-09-14

Family

ID=66865700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910174110.2A Active CN109829449B (en) 2019-03-08 2019-03-08 RGB-D indoor scene labeling method based on super-pixel space-time context

Country Status (1)

Country Link
CN (1) CN109829449B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110428504B (en) * 2019-07-12 2023-06-27 北京旷视科技有限公司 Text image synthesis method, apparatus, computer device and storage medium
CN110517270B (en) * 2019-07-16 2022-04-12 北京工业大学 Indoor scene semantic segmentation method based on super-pixel depth network
CN110599517A (en) * 2019-08-30 2019-12-20 广东工业大学 Target feature description method based on local feature and global HSV feature combination
CN110751153B (en) * 2019-09-19 2023-08-01 北京工业大学 Semantic annotation method for indoor scene RGB-D image
CN111104984B (en) * 2019-12-23 2023-07-25 东软集团股份有限公司 Method, device and equipment for classifying CT (computed tomography) images
CN111292341B (en) * 2020-02-03 2023-01-03 北京海天瑞声科技股份有限公司 Image annotation method, image annotation device and computer storage medium
CN111611919B (en) * 2020-05-20 2022-08-16 西安交通大学苏州研究院 Road scene layout analysis method based on structured learning
CN113034378B (en) * 2020-12-30 2022-12-27 香港理工大学深圳研究院 Method for distinguishing electric automobile from fuel automobile
CN113570530B (en) * 2021-06-10 2024-04-16 北京旷视科技有限公司 Image fusion method, device, computer readable storage medium and electronic equipment
CN115118948B (en) * 2022-06-20 2024-04-05 北京华录新媒信息技术有限公司 Repairing method and device for irregular shielding in panoramic video
CN115952312B (en) * 2022-12-02 2024-07-19 北京工业大学 Automatic labeling and sorting method for image labels

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644429B (en) * 2017-09-30 2020-05-19 华中科技大学 Video segmentation method based on strong target constraint video saliency

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809187A (en) * 2015-04-20 2015-07-29 南京邮电大学 Indoor scene semantic annotation method based on RGB-D data
CN107292253A (en) * 2017-06-09 2017-10-24 西安交通大学 A kind of visible detection method in road driving region
CN107944428A (en) * 2017-12-15 2018-04-20 北京工业大学 A kind of indoor scene semanteme marking method based on super-pixel collection
CN109389605A (en) * 2018-09-30 2019-02-26 宁波工程学院 Dividing method is cooperateed with based on prospect background estimation and the associated image of stepped zone

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GREEDY FUNCTION APPROXIMATION: A GRADIENT BOOSTING MACHINE; Jerome H. Friedman; The Annals of Statistics; 2001-12-31; Vol. 29, No. 5; pp. 1189-1232 *
STD2P: RGBD Semantic Segmentation using Spatio-Temporal Data-Driven Pooling; Yang He et al; 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017-12-31; pp. 7158-7167 *
Supervoxel-based segmentation of 3D imagery with optical flow integration for spatiotemporal processing; Xiaohui Huang et al; IPSJ Transactions on Computer Vision and Applications; 2018-06-19; pp. 1-16 *
Unsupervised video segmentation algorithm fusing spatio-temporal multi-feature representations; Li Xuejun et al; Journal of Computer Applications; 2017-11-10; Vol. 31, No. 11; pp. 3134-3138, 3151 *

Also Published As

Publication number Publication date
CN109829449A (en) 2019-05-31

Similar Documents

Publication Publication Date Title
CN109829449B (en) RGB-D indoor scene labeling method based on super-pixel space-time context
CN104182772B (en) A kind of gesture identification method based on deep learning
Cao et al. Exploiting depth from single monocular images for object detection and semantic segmentation
Zhang et al. Long-range terrain perception using convolutional neural networks
CN107273905B (en) Target active contour tracking method combined with motion information
CN109859238B (en) Online multi-target tracking method based on multi-feature optimal association
CN110096961B (en) Indoor scene semantic annotation method at super-pixel level
CN105740915B (en) A kind of collaboration dividing method merging perception information
CN113592894B (en) Image segmentation method based on boundary box and co-occurrence feature prediction
CN107977660A (en) Region of interest area detecting method based on background priori and foreground node
CN106157330B (en) Visual tracking method based on target joint appearance model
CN107194929B (en) Method for tracking region of interest of lung CT image
CN108038515A (en) Unsupervised multi-target detection tracking and its storage device and camera device
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
Basavaiah et al. Robust Feature Extraction and Classification Based Automated Human Action Recognition System for Multiple Datasets.
Lin et al. An interactive approach to pose-assisted and appearance-based segmentation of humans
Schulz et al. Object-class segmentation using deep convolutional neural networks
CN108765384B (en) Significance detection method for joint manifold sequencing and improved convex hull
CN106296740B (en) A kind of target fine definition tracking based on low-rank sparse expression
Liu et al. [Retracted] Mean Shift Fusion Color Histogram Algorithm for Nonrigid Complex Target Tracking in Sports Video
Hassan et al. Salient object detection based on CNN fusion of two types of saliency models
Dadgostar et al. Gesture-based human–machine interfaces: a novel approach for robust hand and face tracking
CN109389127A (en) Structuring multiple view Hessian regularization sparse features selection method
Nourmohammadi-Khiarak et al. Object detection utilizing modified auto encoder and convolutional neural networks
Maia et al. Visual object tracking by an evolutionary self-organizing neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant