CN111242954B - Panorama segmentation method with bidirectional connection and occlusion processing - Google Patents

Panorama segmentation method with bidirectional connection and occlusion processing

Info

Publication number
CN111242954B
CN111242954B (application CN202010067124.7A)
Authority
CN
China
Prior art keywords
segmentation
semantic
features
feature
instance
Prior art date
Legal status
Active
Application number
CN202010067124.7A
Other languages
Chinese (zh)
Other versions
CN111242954A (en)
Inventor
李玺 (Li Xi)
陈怡峰 (Chen Yifeng)
蔺广琛 (Lin Guangchen)
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010067124.7A
Publication of CN111242954A
Application granted
Publication of CN111242954B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T 7/10 - Image analysis: segmentation; edge detection
    • G06F 18/254 - Pattern recognition: fusion techniques of classification results, e.g. of results related to the same input data
    • G06N 3/045 - Neural networks: combinations of networks
    • G06N 3/08 - Neural networks: learning methods
    • G06T 3/4038 - Geometric image transformation: scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G06T 7/194 - Segmentation; edge detection involving foreground-background segmentation
    • G06T 2200/32 - Indexing scheme for image data processing or generation involving image mosaicing
    • G06T 2207/10004 - Image acquisition modality: still image; photographic image
    • G06T 2207/20081 - Special algorithmic details: training; learning
    • G06T 2207/20084 - Special algorithmic details: artificial neural networks [ANN]

Abstract

The invention discloses a panorama segmentation method with bidirectional connection and occlusion processing. The method comprises the following steps: 1) acquiring a data set for training panoramic segmentation, and defining the algorithm target; 2) performing feature learning on the images in the data set with a full convolution network; 3) extracting semantic features from the feature map through a semantic feature extraction branch; 4) extracting instance features from the feature map through an instance feature extraction branch; 5) establishing a connection from instance segmentation to semantic segmentation, and aggregating semantic features and instance features to perform semantic segmentation; 6) establishing a connection from semantic segmentation to instance segmentation, and aggregating instance features and semantic features to perform instance segmentation; 7) fusing the results of semantic segmentation and instance segmentation with an occlusion processing algorithm, and outputting the panoramic segmentation result. The method fully exploits the complementarity between semantic segmentation and instance segmentation, and applies an occlusion processing algorithm built on low-level appearance information to complete panoramic segmentation of the image efficiently.

Description

Panorama segmentation method with bidirectional connection and occlusion processing
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a panorama segmentation method with bidirectional connection and occlusion processing.
Background
The panorama segmentation task is the union of the semantic segmentation and instance segmentation tasks: it requires not only predicting semantic classes at the pixel level, but also distinguishing individual instances within the foreground classes. The task is an important basic task for scene understanding and has wide application value in fields such as autonomous driving. Current mainstream technical routes divide into top-down and bottom-up approaches. The top-down approach first finds the bounding box of each instance and then decides, pixel by pixel, whether the pixels inside the box belong to that instance. The bottom-up approach first predicts per-pixel instance attribution and generates bounding boxes from it. Empirically, top-down routes tend to perform better than bottom-up ones.
However, the top-down approach suffers from two major problems. First, in this scheme the two tasks of semantic segmentation and instance segmentation are handled by two separate sub-networks, with no pathway for information to propagate between them; the complementarity between the two tasks is therefore not well exploited. Second, detected instances may occlude one another. Past approaches rely on the class score of each target to resolve the occlusion relationship, but this is clearly suboptimal, since class scores are also tied to other factors such as the data distribution. How to solve these two problems is the key to top-down panorama segmentation.
Disclosure of Invention
In order to solve these two problems, the invention provides a panorama segmentation method with bidirectional connection and occlusion processing. The method is based on a deep learning network; by establishing a bidirectional connection between semantic segmentation and instance segmentation, the features of the two tasks reinforce each other. In addition, we propose an occlusion handling algorithm to deal specifically with occlusion between instances. Through these two points, the method of the invention achieves excellent panorama segmentation performance.
The technical scheme of the invention comprises the following steps:
A panorama segmentation method with bidirectional connection and occlusion processing comprises the following steps:
S1, acquiring a data set for training panoramic segmentation, and defining the algorithm target;
S2, extracting features of the images in the data set by using a full convolution network to obtain the feature map of each image;
S3, extracting semantic features from the feature map by using a semantic feature extraction network;
S4, extracting instance features from the feature map by using an instance feature extraction network;
S5, establishing a connection from instance segmentation to semantic segmentation, and aggregating semantic features and instance features to perform semantic segmentation;
S6, establishing a connection from semantic segmentation to instance segmentation, and aggregating instance features and semantic features to perform instance segmentation;
S7, fusing the results of semantic segmentation and instance segmentation by using an occlusion processing algorithm, and outputting the panoramic segmentation result.
On the basis of the above scheme, each step can preferably be further implemented as follows.
Preferably, the algorithm target of step S1 is: for each picture I in the panoramic segmentation data set, identify, for the background pixels appearing in I, the semantic category to which they belong; for the foreground pixels appearing in I, identify both the semantic class and the instance to which they belong.
Preferably, in step S2, a full convolutional neural network is used to extract features for each pixel in the image, obtaining the feature map F = φ(I) of the image.
Preferably, in step S3, a full convolutional neural network ψ is used to extract semantic features from the feature map: the feature map F extracted in S2 is input, and the semantic features S = ψ(F) = ψ(φ(I)) are extracted.
Preferably, the extraction of instance features described in step S4 specifically includes the following sub-steps:
S41, detecting the instance set O in the image by using a region proposal network, obtaining O = {O1, ..., Ok}, where Oi denotes the i-th detected instance, i ∈ [1, k], and k is the total number of detected instances;
S42, for each detected instance Oi, computing its bounding box Bi;
S43, extracting instance features with the instance feature extraction network ζ: the feature map F extracted in S2 and the bounding box Bi of instance Oi are input, and the instance features Ii = ζ(F, Bi) = ζ(φ(I), Bi) are extracted.
Preferably, the specific steps of establishing the connection from instance segmentation to semantic segmentation described in step S5 are as follows:
S51, restoring the spatial information FI of the instance features extracted in S4 by using the differentiable operation RoIInlay:
FI = RoIInlay(I1, ..., Ik, B1, ..., Bk);
The differentiable operation RoIInlay is as follows: for an instance whose top-left corner is at coordinates (a, b) and whose size is h × w, assuming it was cropped and warped into an m × m feature map, each point (u, v) on that feature map was sampled from a position (x, y) of the original feature map:
x = a + (u + 0.5)·(w/m) − 0.5,  y = b + (v + 0.5)·(h/m) − 0.5;
that is, the value at point (u, v) on the m × m feature map corresponds to the value v(x, y) at point (x, y) on the original feature map. Therefore, for any point (xp, yp) located within the target region, the four sampling points surrounding it are found, denoted as the set C, and the value at (xp, yp) is obtained by bilinear interpolation:
v(xp, yp) = Σ_{(x,y)∈C} Gw(x, xp) · Gh(y, yp) · v(x, y),
where Gw and Gh are interpolation functions in the relative coordinate system of the sample points:
Gw(x, xp) = max(0, 1 − |x − xp| / δw),
Gh(y, yp) = max(0, 1 − |y − yp| / δh),
in which the parameter δw = w/m and the parameter δh = h/m are the horizontal and vertical spacings between adjacent sample points. For points within the target region but beyond the outermost sample points, the sample points are pulled to the target boundary;
S52, the feature obtained by aggregating FI with the semantic features S extracted in step S3 is used to predict the semantic segmentation result. The feature aggregation proceeds as follows: first, FI and S are concatenated along the channel dimension to form a new feature, which is processed by one layer of 3 × 3 convolution to remove the deformation introduced by RoIInlay; then, multi-scale pooling is applied to this feature to obtain scene descriptions of sizes 8 × 8, 4 × 4, 2 × 2 and 1 × 1 respectively; finally, these descriptions are flattened and concatenated, the concatenated feature is appended to the original feature at every pixel, and a 1 × 1 convolution yields the aggregated feature.
Preferably, the specific operation of establishing the connection from semantic segmentation to instance segmentation in step S6 is as follows: first, the semantic feature SOi = RoIAlign(S, Bi) corresponding to instance Oi is obtained from the semantic features S using the RoIAlign operation; then SOi is processed by a 3 × 3 convolution and added to the instance feature Ii, and the aggregated feature is used to predict the segmentation result of the instance, including its position, category and segmentation map.
Preferably, step S7 specifically includes the following steps: first, the covering relationship between mutually occluding instances is determined by the occlusion processing algorithm, and then the instance segmentation result and the semantic segmentation result are fused to obtain the panoramic segmentation result. For two target instances Oi and Oj whose overlapping region is P, the appearance of a region is defined as the average RGB value of the pixels within it; the attribution of region P, i.e. the occlusion relationship between Oi and Oj, is determined by comparing the similarity between the appearance of P and the appearances of the two targets.
The invention fully exploits the complementarity between semantic segmentation and instance segmentation by establishing connections between the two tasks, so that they benefit each other and the panoramic segmentation performance is ultimately improved. Meanwhile, the invention uses an occlusion processing algorithm built on low-level appearance information to effectively handle the occlusion problem, further improving model performance.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
On the contrary, the invention is intended to cover alternatives, modifications and equivalents which may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details.
As shown in fig. 1, a panorama segmentation method with bidirectional connection and occlusion processing includes the following steps:
S1, acquiring a data set for training panoramic segmentation, and defining the algorithm target. In this step, the algorithm target is: for each picture I in the panoramic segmentation data set, identify, for the background pixels appearing in I, the semantic category to which they belong; for the foreground pixels appearing in I, identify both the semantic class and the instance to which they belong.
S2, performing feature extraction on the images in the data set by using a full convolution network to obtain the feature map of each image. In this step, a full convolutional neural network is used to extract features for each pixel in the image, obtaining the feature map F = φ(I) of the image.
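By way of illustration, the following sketch (in PyTorch, with a toy two-layer network standing in for the actual backbone φ, which the patent does not specify at this point) shows the shape contract of this step: a fully convolutional network maps an image I to a per-pixel feature map F = φ(I).

```python
# A minimal sketch of step S2, assuming PyTorch; "phi" is an illustrative
# stand-in for the full convolutional backbone, not the patented network.
import torch
import torch.nn as nn

phi = nn.Sequential(                      # fully convolutional: no flatten/linear layers
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)

I = torch.randn(1, 3, 512, 512)           # input image I (batch, RGB, H, W)
F = phi(I)                                # per-pixel feature map F = phi(I)
print(F.shape)                            # torch.Size([1, 256, 512, 512])
```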
S3, extracting semantic features from the feature map by using the semantic feature extraction network branch. In this step, a full convolutional neural network ψ is used to extract semantic features from the feature map: the feature map F extracted in S2 is input, and its semantic features S = ψ(F) = ψ(φ(I)) are extracted.
S4, extracting instance features from the feature map by using the instance feature extraction network branch. In this step, extracting instance features specifically includes the following sub-steps:
S41, detecting the instance set O in the image by using a region proposal network (RPN), obtaining O = {O1, ..., Ok}, where Oi denotes the i-th detected instance, i ∈ [1, k], and k is the total number of detected instances;
S42, for each detected instance Oi, computing its bounding box Bi;
S43, extracting instance features with the instance feature extraction network ζ: the feature map F extracted in S2 and the bounding box Bi of instance Oi are input, and the instance features Ii = ζ(F, Bi) = ζ(φ(I), Bi) are extracted.
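A minimal sketch of S43 follows, under the assumption (common in top-down pipelines, though not stated verbatim here) that the per-box crop is realized with RoIAlign; the box coordinates, channel widths and the one-layer head standing in for ζ are illustrative only.

```python
# A hedged sketch of step S4: crop each detected instance's bounding box B_i
# out of the shared map F with RoIAlign, then run a small per-instance head.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

F = torch.randn(1, 256, 128, 128)                  # backbone feature map (stride 4, say)
boxes = torch.tensor([[0, 40., 60., 200., 300.]])  # [batch_idx, x1, y1, x2, y2] in image coords
zeta = nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True))

roi = roi_align(F, boxes, output_size=(14, 14), spatial_scale=0.25, aligned=True)
I_i = zeta(roi)                                    # instance features I_i = zeta(F, B_i)
print(I_i.shape)                                   # torch.Size([1, 256, 14, 14])
```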
S5, establishing the connection from instance segmentation to semantic segmentation through the first feature aggregation module, and aggregating semantic features and instance features to perform semantic segmentation. In this step, the specific steps of establishing the connection from instance segmentation to semantic segmentation are as follows:
S51, restoring the spatial information FI of the instance features extracted in S4 by using the differentiable operation RoIInlay:
FI = RoIInlay(I1, ..., Ik, B1, ..., Bk);
The differentiable operation RoIInlay is as follows: for an instance whose top-left corner is at coordinates (a, b) and whose size is h × w (the top-left coordinate of an instance refers to the top-left coordinate of the bounding box of the target object corresponding to that instance), assuming it was cropped and warped into an m × m feature map, each point (u, v) on that feature map was sampled from a position (x, y) of the original feature map:
x = a + (u + 0.5)·(w/m) − 0.5,  y = b + (v + 0.5)·(h/m) − 0.5;
that is, the value at point (u, v) on the m × m feature map corresponds to the value v(x, y) at point (x, y) on the original feature map. Therefore, for any point (xp, yp) located within the target region, the four sampling points surrounding it are found, denoted as the set C, and the value at (xp, yp) is obtained by bilinear interpolation:
v(xp, yp) = Σ_{(x,y)∈C} Gw(x, xp) · Gh(y, yp) · v(x, y),
where Gw and Gh are interpolation functions in the relative coordinate system of the sample points:
Gw(x, xp) = max(0, 1 − |x − xp| / δw),
Gh(y, yp) = max(0, 1 − |y − yp| / δh),
in which the parameter δw = w/m and the parameter δh = h/m are the horizontal and vertical spacings between adjacent sample points. For points within the target region but beyond the outermost sample points, the sample points are pulled to the target boundary;
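A simplified, hedged sketch of the inlay step follows: it approximates the sampling and bilinear-interpolation rules above with F.interpolate, resizing each m × m instance feature back to its box size and pasting it onto an empty canvas. The max rule for overlapping boxes and the integer box coordinates are assumptions, not taken from the patent.

```python
# A simplified sketch of RoIInlay: the bilinear resampling above is
# approximated by bilinear interpolation over the regular sample grid.
import torch
import torch.nn.functional as Fn

def roi_inlay(instance_feats, boxes, out_hw):
    """instance_feats: list of (C, m, m) tensors; boxes: list of (x1, y1, x2, y2) ints."""
    C = instance_feats[0].shape[0]
    H, W = out_hw
    canvas = torch.zeros(C, H, W)
    for feat, (x1, y1, x2, y2) in zip(instance_feats, boxes):
        h, w = y2 - y1, x2 - x1
        patch = Fn.interpolate(feat[None], size=(h, w), mode="bilinear",
                               align_corners=False)[0]          # resize m x m back to h x w
        canvas[:, y1:y2, x1:x2] = torch.maximum(canvas[:, y1:y2, x1:x2], patch)  # overlaps: keep max
    return canvas

F_I = roi_inlay([torch.randn(256, 14, 14)], [(40, 60, 200, 300)], out_hw=(512, 512))
print(F_I.shape)  # torch.Size([256, 512, 512])
```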
S52, the feature obtained by aggregating FI with the semantic features S extracted in step S3 is used to predict the semantic segmentation result. The feature aggregation is performed in the first feature aggregation module, as follows: first, FI and S are concatenated (concatenate) along the channel dimension to form a new feature, which is processed by one layer of 3 × 3 convolution to remove the deformation introduced by RoIInlay; then, multi-scale average pooling (Avg Pooling) is applied to this feature to obtain scene descriptions of sizes 8 × 8, 4 × 4, 2 × 2 and 1 × 1 respectively; finally, these descriptions are flattened (flatten) and concatenated, the concatenated feature is appended to the original feature at every pixel, and a 1 × 1 convolution yields the aggregated feature. A 1 × 1 convolution is then applied to the aggregated feature to obtain the predicted segmentation result.
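A sketch of this first feature aggregation module appears below; the channel widths and the extra channel-reduction convolution before pooling (added here only to keep the flattened descriptor small) are assumptions rather than the embodiment's exact configuration.

```python
# A sketch of the instance -> semantic aggregation module following S52.
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class SemanticAggregation(nn.Module):
    def __init__(self, c_sem=128, c_inst=128, c_mid=128, c_pool=4):
        super().__init__()
        self.fuse = nn.Conv2d(c_sem + c_inst, c_mid, 3, padding=1)  # removes RoIInlay warping
        self.reduce = nn.Conv2d(c_mid, c_pool, 1)                   # assumption: shrink before pooling
        self.out = nn.Conv2d(c_mid + c_pool * (64 + 16 + 4 + 1), c_mid, 1)

    def forward(self, S, F_I):
        x = self.fuse(torch.cat([S, F_I], dim=1))                   # channel concat + 3x3 conv
        descs = [Fn.adaptive_avg_pool2d(self.reduce(x), k).flatten(1)
                 for k in (8, 4, 2, 1)]                             # multi-scale scene descriptions
        g = torch.cat(descs, dim=1)                                 # flatten and concatenate
        g = g[:, :, None, None].expand(-1, -1, x.shape[2], x.shape[3])
        return self.out(torch.cat([x, g], dim=1))                   # per-pixel concat + 1x1 conv

agg = SemanticAggregation()
out = agg(torch.randn(1, 128, 64, 64), torch.randn(1, 128, 64, 64))
print(out.shape)  # torch.Size([1, 128, 64, 64])
```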
S6, establishing the connection from semantic segmentation to instance segmentation through the second feature aggregation module, and aggregating instance features and semantic features to perform instance segmentation. The specific operation of establishing the connection from semantic segmentation to instance segmentation is as follows: first, the semantic feature SOi = RoIAlign(S, Bi) corresponding to instance Oi is obtained from the semantic features S using the RoIAlign operation; then, in the second feature aggregation module, SOi is processed by a 3 × 3 convolution and added to the instance feature Ii, and the aggregated feature is used to predict the segmentation result of the instance, including its position, category and segmentation map. The prediction is obtained by applying a 1 × 1 convolution to the aggregated feature.
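A minimal sketch of this second feature aggregation module, again assuming torchvision's RoIAlign and illustrative shapes (the downstream box/class/mask heads are omitted):

```python
# A sketch of the semantic -> instance direction from step S6: crop the
# semantic map with RoIAlign, apply a 3x3 convolution, add to the instance feature.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

S = torch.randn(1, 256, 128, 128)                   # semantic features (stride 4)
I_i = torch.randn(1, 256, 14, 14)                   # instance features from zeta
B_i = torch.tensor([[0, 40., 60., 200., 300.]])     # [batch_idx, x1, y1, x2, y2]
conv3 = nn.Conv2d(256, 256, 3, padding=1)

S_Oi = roi_align(S, B_i, output_size=(14, 14), spatial_scale=0.25, aligned=True)
fused = I_i + conv3(S_Oi)                           # aggregated feature for the instance heads
print(fused.shape)                                  # torch.Size([1, 256, 14, 14])
```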
S7, determining the covering relationship between mutually occluding instances by using the occlusion processing algorithm, fusing the results of semantic segmentation and instance segmentation, and outputting the panoramic segmentation result. In this step, the panoramic segmentation result is produced as follows: first, the covering relationship between mutually occluding instances is determined by the occlusion processing algorithm, and then the instance segmentation result and the semantic segmentation result are fused to obtain the panoramic segmentation result. For two target instances Oi and Oj whose overlapping region is P, the appearance of a region is defined as the average RGB value of the pixels within it; the attribution of region P, i.e. the occlusion relationship between Oi and Oj, is determined by comparing the similarity between the appearance of P and the appearances of the two targets.
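A toy sketch of the occlusion rule follows: the overlap P is assigned to whichever instance has the closer mean-RGB appearance. The Euclidean distance and the use of the full masks (rather than the non-overlapping parts) as each target's appearance region are assumptions about details the text leaves open.

```python
# A sketch of the S7 occlusion rule: compare the mean RGB of the overlap
# region P against the mean RGB of each instance.
import torch

def resolve_occlusion(image, mask_i, mask_j):
    """image: (3, H, W) float; masks: (H, W) bool. Returns True if P goes to instance i."""
    P = mask_i & mask_j                               # overlapping region
    if not P.any():
        return True                                    # nothing to resolve
    mean_rgb = lambda m: image[:, m].mean(dim=1)       # average RGB over a region
    app_P, app_i, app_j = mean_rgb(P), mean_rgb(mask_i), mean_rgb(mask_j)
    return torch.norm(app_P - app_i) <= torch.norm(app_P - app_j)

img = torch.rand(3, 64, 64)
mi = torch.zeros(64, 64, dtype=torch.bool); mi[10:40, 10:40] = True
mj = torch.zeros(64, 64, dtype=torch.bool); mj[30:60, 30:60] = True
print(resolve_occlusion(img, mi, mj))                  # True: overlap assigned to instance i
```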
The panorama segmentation algorithm first uses a shared full convolutional neural network to extract features for the semantic segmentation branch and the instance segmentation branch. Then, by means of the novel RoIInlay operation, a bidirectional feature-level connection is established between the semantic segmentation task and the instance segmentation task, making full use of the complementary relationship between the two tasks. For the occlusion problem that may arise between instances, the invention designs a simple and effective occlusion processing algorithm. The algorithm exploits low-level appearance information and can infer the occlusion relationship between instances without training. Through these two points, the method achieves excellent panorama segmentation performance.
Examples
The following simulation experiment is performed based on the above method. The implementation follows the steps described above and is not repeated here; only the experimental results are shown below.
This embodiment uses ResNet-50 with FPN (Feature Pyramid Network) as the backbone to extract features. The semantic feature extraction network is a stack of three layers of deformable convolution (Deformable Convolution). The instance feature extraction network is a stack of three layers of conventional convolutions. The model of the invention is trained on the training set of the COCO data set and evaluated on the corresponding validation set. Table 1 compares its performance against models without the bidirectional connection and without occlusion reasoning.
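A sketch of how such branch heads could be assembled with torchvision's DeformConv2d is shown below; the channel widths, the zero-initialized offset predictor, and the block structure are assumptions rather than the embodiment's exact configuration.

```python
# A hedged sketch of the two branch heads in this embodiment: three deformable
# convolutions for the semantic branch, three regular convolutions for the
# instance branch.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.offset = nn.Conv2d(c, 2 * 3 * 3, 3, padding=1)   # 2 offsets per 3x3 kernel tap
        nn.init.zeros_(self.offset.weight); nn.init.zeros_(self.offset.bias)
        self.conv = DeformConv2d(c, c, 3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x, self.offset(x)))

semantic_branch = nn.Sequential(*[DeformBlock(256) for _ in range(3)])
instance_branch = nn.Sequential(
    *[nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.ReLU()) for _ in range(3)])

x = torch.randn(1, 256, 64, 64)                                # one FPN level, say
print(semantic_branch(x).shape, instance_branch(x).shape)
```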
TABLE 1 Comparison of the performance of different models

Bidirectional connection    Occlusion handling    PQ (%)
×                           ×                     41.3
√                           ×                     41.8
√                           √                     43.0

Note: in the table, × indicates the component is not used and √ indicates it is used.
Therefore, through the above technical scheme, the panoramic segmentation method is developed based on deep learning. The invention fully exploits the complementarity between semantic segmentation and instance segmentation by establishing connections between the two tasks, so that they benefit each other and the panoramic segmentation performance is ultimately improved. Meanwhile, the invention uses an occlusion processing algorithm built on low-level appearance information to effectively handle the occlusion problem, further improving model performance.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (5)

1. A panorama segmentation method with bidirectional connection and occlusion processing, characterized by comprising the following steps:
S1, acquiring a data set for training panoramic segmentation, and defining the algorithm target;
S2, extracting features of the images in the data set by using a full convolution network to obtain the feature map of each image;
S3, extracting semantic features from the feature map by using a semantic feature extraction network;
S4, extracting instance features from the feature map by using an instance feature extraction network;
S5, establishing a connection from instance segmentation to semantic segmentation, and aggregating semantic features and instance features to perform semantic segmentation;
S6, establishing a connection from semantic segmentation to instance segmentation, and aggregating instance features and semantic features to perform instance segmentation;
S7, fusing the results of semantic segmentation and instance segmentation by using an occlusion processing algorithm, and outputting the panoramic segmentation result;
the extraction of instance features described in step S4 specifically includes the following sub-steps:
S41, detecting the instance set O in the image by using a region proposal network, obtaining O = {O1, ..., Ok}, where Oi denotes the i-th detected instance, i ∈ [1, k], and k is the total number of detected instances;
S42, for each detected instance Oi, computing its bounding box Bi;
S43, extracting instance features with the instance feature extraction network ζ: the feature map F extracted in S2 and the bounding box Bi of instance Oi are input, and the instance features Ii = ζ(F, Bi) = ζ(φ(I), Bi) are extracted;
The specific steps of establishing the connection from instance segmentation to semantic segmentation described in step S5 are as follows:
s51, restoring spatial information F of the example features extracted in S4 by using RoIInlay with differentiable operationI
FI=RoIInlay(I1,...,Ik,B1,...,Bk);
The specific operation of the differentiable operation roinlay is as follows: for an example with coordinates (a, b) at the upper left corner and a size h × w, assuming that the example is subjected to clipping and deformation to obtain an m × m feature map, each point (u, v) on the feature map is sampled from a position (x, y) of the original feature map:
Figure FDA0003559345730000021
i.e., the value at point (u, v) on the m × m feature map corresponds to the value v (x, y) at point (x, y) on the original feature map, and thus for any one located within the target regionMean one point (x)p,yp) Finding four sampling points surrounding it, marked as set C, and obtaining (x) by bilinear interpolationp,yp) Value at a point
Figure FDA0003559345730000022
Figure FDA0003559345730000023
Wherein G iswAnd GhIs an interpolation function in the relative coordinate system of the sample points:
Figure FDA0003559345730000024
Figure FDA0003559345730000025
in the formula: parameter(s)
Figure FDA0003559345730000026
Parameter(s)
Figure FDA0003559345730000027
For values within the target region but beyond the boundaries of the sample points, the sample points are pulled to the target boundaries;
S52, the feature obtained by aggregating FI with the semantic features S extracted in step S3 is used to predict the semantic segmentation result; the feature aggregation proceeds as follows: first, FI and S are concatenated along the channel dimension to form a new feature, which is processed by one layer of 3 × 3 convolution to remove the deformation introduced by RoIInlay; then, multi-scale pooling is applied to this feature to obtain scene descriptions of sizes 8 × 8, 4 × 4, 2 × 2 and 1 × 1 respectively; finally, these descriptions are flattened and concatenated, the concatenated feature is appended to the original feature at every pixel, and a 1 × 1 convolution yields the aggregated feature;
the specific operation of establishing the connection from semantic segmentation to instance segmentation described in step S6 is as follows: first, the semantic feature SOi = RoIAlign(S, Bi) corresponding to instance Oi is obtained from the semantic features S using the RoIAlign operation; then SOi is processed by a 3 × 3 convolution and added to the instance feature Ii, and the aggregated feature is used to predict the segmentation result of the instance, including its position, category and segmentation map.
2. The panorama segmentation method with bidirectional connection and occlusion processing according to claim 1, characterized in that the algorithm target of step S1 is: for each picture I in the panoramic segmentation data set, identify, for the background pixels appearing in I, the semantic category to which they belong; for the foreground pixels appearing in I, identify both the semantic class and the instance to which they belong.
3. The panorama segmentation method with bidirectional connection and occlusion processing according to claim 2, characterized in that in step S2, a full convolutional neural network is used to extract features for each pixel in the image, obtaining the feature map F = φ(I) of the image.
4. The panorama segmentation method with bidirectional connection and occlusion processing according to claim 3, characterized in that in step S3, a full convolutional neural network ψ is used to extract semantic features from the feature map: the feature map F extracted in S2 is input, and its semantic features S = ψ(F) = ψ(φ(I)) are extracted.
5. The panorama segmentation method with bidirectional connection and occlusion processing according to claim 1, characterized in that step S7 specifically includes the following steps: first, the covering relationship between mutually occluding instances is determined by the occlusion processing algorithm, and then the instance segmentation result and the semantic segmentation result are fused to obtain the panoramic segmentation result; wherein, for target instances Oi and Oj whose overlapping region is P, the appearance of a region is defined as the average RGB value of the pixels within it, and the attribution of region P, i.e. the occlusion relationship between Oi and Oj, is determined by comparing the similarity between the appearance of P and the appearances of the two targets.
CN202010067124.7A 2020-01-20 2020-01-20 Panorama segmentation method with bidirectional connection and occlusion processing Active CN111242954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010067124.7A CN111242954B (en) 2020-01-20 2020-01-20 Panorama segmentation method with bidirectional connection and occlusion processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010067124.7A CN111242954B (en) 2020-01-20 2020-01-20 Panorama segmentation method with bidirectional connection and occlusion processing

Publications (2)

Publication Number Publication Date
CN111242954A CN111242954A (en) 2020-06-05
CN111242954B (en) 2022-05-13

Family

ID=70879738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010067124.7A Active CN111242954B (en) 2020-01-20 2020-01-20 Panorama segmentation method with bidirectional connection and occlusion processing

Country Status (1)

Country Link
CN (1) CN111242954B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111524150B (en) * 2020-07-03 2021-06-11 支付宝(杭州)信息技术有限公司 Image processing method and device
CN112489064B (en) * 2020-12-14 2022-03-25 桂林电子科技大学 Panorama segmentation method based on edge scaling correction
CN112699888A (en) * 2020-12-31 2021-04-23 上海肇观电子科技有限公司 Image recognition method, target object extraction method, device, medium and equipment
CN112699894A (en) * 2021-01-13 2021-04-23 上海微亿智造科技有限公司 Method, system and medium for improving segmentation precision and speed of industrial quality inspection example
CN113052858B (en) * 2021-03-23 2023-02-14 电子科技大学 Panorama segmentation method based on semantic stream
CN113096136A (en) * 2021-03-30 2021-07-09 电子科技大学 Panoramic segmentation method based on deep learning
CN116468889B (en) * 2023-04-04 2023-11-07 中国航天员科研训练中心 Panorama segmentation method and system based on multi-branch feature extraction


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019185170A1 (en) * 2018-03-30 2019-10-03 Toyota Motor Europe Electronic device, robotic system and method for localizing a robotic system
CN110533048A (en) * 2018-05-23 2019-12-03 上海交通大学 The realization method and system of combination semantic hierarchies link model based on panoramic field scene perception
CN110008808A (en) * 2018-12-29 2019-07-12 北京迈格威科技有限公司 Panorama dividing method, device and system and storage medium
CN109801297A (en) * 2019-01-14 2019-05-24 浙江大学 A kind of image panorama segmentation prediction optimization method realized based on convolution
CN109886272A (en) * 2019-02-25 2019-06-14 腾讯科技(深圳)有限公司 Point cloud segmentation method, apparatus, computer readable storage medium and computer equipment
CN110533720A (en) * 2019-08-20 2019-12-03 西安电子科技大学 Semantic SLAM system and method based on joint constraint

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Survey of Point Cloud Segmentation Methods Based on Deep Learning; Yu Bin et al.; Computer Engineering and Applications; 2019-12-11 (No. 01); full text *

Also Published As

Publication number Publication date
CN111242954A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111242954B (en) Panorama segmentation method with bidirectional connection and occlusion processing
CN107274419B (en) Deep learning significance detection method based on global prior and local context
CN110222787B (en) Multi-scale target detection method and device, computer equipment and storage medium
Xu et al. Inter/intra-category discriminative features for aerial image classification: A quality-aware selection model
CN104751142B (en) A kind of natural scene Method for text detection based on stroke feature
WO2018103608A1 (en) Text detection method, device and storage medium
Wang et al. Background-driven salient object detection
CN109086777B (en) Saliency map refining method based on global pixel characteristics
CN109214403B (en) Image recognition method, device and equipment and readable medium
CN110852349A (en) Image processing method, detection method, related equipment and storage medium
WO2020233397A1 (en) Method and apparatus for detecting target in video, and computing device and storage medium
US11887346B2 (en) Systems and methods for image feature extraction
CN111768415A (en) Image instance segmentation method without quantization pooling
CN111640089A (en) Defect detection method and device based on feature map center point
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN108345835B (en) Target identification method based on compound eye imitation perception
Dong et al. Learning regional purity for instance segmentation on 3d point clouds
CN115375917A (en) Target edge feature extraction method, device, terminal and storage medium
CN113628181A (en) Image processing method, image processing device, electronic equipment and storage medium
CN111047614A (en) Feature extraction-based method for extracting target corner of complex scene image
Dadgostar et al. Gesture-based human–machine interfaces: a novel approach for robust hand and face tracking
CN112446230B (en) Lane line image recognition method and device
CN114724175A (en) Pedestrian image detection network, detection method, training method, electronic device, and medium
US20220237932A1 (en) Computer implemented method for segmenting a binarized document
CN114155273A (en) Video image single-target tracking method combined with historical track information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant