CN111461127A - Instance segmentation method based on a one-stage target detection framework - Google Patents

Instance segmentation method based on a one-stage target detection framework

Info

Publication number
CN111461127A
Authority
CN
China
Prior art keywords
detection
segmentation
result
results
target detection
Prior art date
Legal status
Granted
Application number
CN202010239127.4A
Other languages
Chinese (zh)
Other versions
CN111461127B (en)
Inventor
罗荣华
李嘉明
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010239127.4A priority Critical patent/CN111461127B/en
Publication of CN111461127A publication Critical patent/CN111461127A/en
Application granted granted Critical
Publication of CN111461127B publication Critical patent/CN111461127B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 - Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

An instance segmentation method based on a one-stage target detection framework comprises the following steps: 1) encoding the image dataset annotations, with each target in an image defined as a dense point object; 2) constructing an instance segmentation network model comprising a backbone network, a main-body feature extraction module, a detection module that produces target detection results, and a segmentation module that produces semantic segmentation results; 3) performing deep-learning training, chiefly embodied in the multi-task loss function designed by the invention for the instance segmentation task; 4) in the inference stage, combining the target detection results with the semantic segmentation result and applying a non-maximum segmentation screening method to obtain the instance segmentation result. The method is simple and reasonable in design, preserves the detection quality and detection speed of the original target detection framework while simultaneously producing high-precision segmentation masks, and has good robustness.

Description

Instance segmentation method based on a one-stage target detection framework
Technical Field
The invention belongs to the technical field of deep learning and computer vision, and particularly relates to an instance segmentation method based on a one-stage target detection framework.
Background
In recent years, object detection in computer vision has improved greatly, and the instance segmentation task, which is closely related to it, has likewise advanced by drawing on the powerful capabilities of target detection frameworks.
The instance segmentation task locates potential targets in an image, representing each location with a detection bounding box, and labels the pixels within the different target regions in the manner of semantic segmentation; the task can therefore be completed by extending a target detection framework accordingly. Much existing work extends mature target detection frameworks to instance segmentation, but most of it builds on two-stage frameworks, and such extensions find it difficult to maintain detection speed while guaranteeing segmentation quality.
In 2018, Mask R-CNN, a simple and flexible instance segmentation method that extends a two-stage target detection framework, proposed by Kaiming He's team at Facebook AI Research, drew the field's attention to the commonalities between the target detection and instance segmentation tasks and to how methods for the two can extend one another. However, the method sacrifices time efficiency to obtain its high-quality segmentation results, and this is where it remains worth improving.
One reason Mask R-CNN is inefficient in time is that the two-stage target detection framework it uses is inherently slow. The present invention provides a faster instance segmentation pipeline by effectively extending a one-stage target detection framework, which addresses this problem. However, an instance segmentation method built on a one-stage detection framework cannot simply attach a mask branch, as Mask R-CNN does, to complete the corresponding instance segmentation task: extending instance segmentation on a one-stage target detection framework requires resolving the ambiguity between the detection boxes of different objects and the ambiguity of the segmentation semantics.
Therefore, there is a need for an instance segmentation method that resolves the above ambiguities while maintaining speed.
Disclosure of Invention
In view of the problems described in the background art, the present invention provides an instance segmentation method based on a one-stage target detection framework, so as to provide an instance segmentation method that takes both segmentation quality and model speed into account.
To solve the above technical problem, the present invention provides an instance segmentation method based on a one-stage target detection framework, comprising:
S1, labeling the target objects in the training images as dense point objects at the network output layer;
S2, constructing a complete instance segmentation network model based on a one-stage target detection framework;
S3, designing a multi-task loss function adapted to the instance segmentation task for deep-learning training;
S4, in the inference stage, combining the target detection results and the semantic segmentation result to obtain the final instance segmentation result.
Further, the multi-task loss function in step S3 consists of four parts, namely the target detection box loss, the object confidence loss, the object classification loss and the semantic segmentation loss, specifically as follows:
S31, the target detection box part adopts an intersection-over-union (IoU) loss function;
S32, the object confidence part adopts a focal loss function adapted to the object centerness (its expression is reproduced as image BDA0002431967590000021 in the original publication);
S33, the object classification part adopts a focal loss function adapted to the multi-class task (its expression is reproduced as image BDA0002431967590000022 in the original publication);
S34, the semantic segmentation part adopts a cross-entropy loss function,
where y_t is the labeled ground-truth value; an indicator function equals 1 where y_t is greater than 0 and 0 elsewhere; α_t is a fraction in [0, 0.5]; γ is a number greater than 0; p_t is the predicted classification probability of the corresponding class; and N_t is the number of objects of that class in the dataset.
Further, in step S1, the target objects contained in each training image are expressed as dense point objects: a tensor of size [B, H, W] is output at each scale, where B is the batch size and H, W are the height and width of that scale; if and only if the target detection box of an object contains certain points of the tensor, those points are encoded as point objects of that object, in the following manner:
S11, the object class is encoded as a one-hot vector of length C;
S12, the object confidence is encoded as the object centerness;
S13, the target detection box is encoded as a vector (L, R, T, B) of length 4;
where C is the number of classification categories, and L, R, T, B denote the distances from the point object to the left, right, top and bottom edges of the corresponding target detection box.
Further, in step S2, the instance segmentation network model based on the one-stage target detection framework comprises four parts, namely a backbone network, a main-body feature extraction module, a target detection module and a semantic segmentation module, specifically as follows:
S21, the backbone network assists in completing basic feature extraction and comprises a network input layer that receives the image tensor and five stages of down-sampling feature extraction layers;
S22, the main-body feature extraction module further mines the basic features; it comprises three parts, namely lateral connections from the basic feature layers to the feature extraction layers, bottom-up down-sampling from lower to higher feature layers, and top-down up-sampling from higher to lower feature layers, and finally outputs one feature tensor per scale;
S23, the target detection module produces the target detection output; five detection modules at different scales are each responsible for detecting objects of a different size range; each detection module contains two branches, both of which receive the feature tensor extracted by the main-body feature extraction module and pass it through four convolutional layers with kernel size 3 and down-sampling scale 1; one branch is the regression branch, in which the feature tensor passes through a convolutional branch with 4 output channels to give the target detection regression result, and the other is the classification branch, in which the feature tensor passes through convolutional branches with C output channels and 1 output channel respectively to give the classification probabilities and the object confidence;
S24, the semantic segmentation module produces the semantic segmentation output; it receives only the bottom-level feature tensor of the main-body feature extraction module as input and obtains the per-class semantic segmentation result through four convolutional layers with kernel size 3 and down-sampling scale 1 followed by a convolutional layer with C output channels.
Further, the instance segmentation network model outputs target detection results at five different scales and a semantic segmentation result at one scale; at the inference stage the target detection results are merged, the Top-K results are screened out, and finally a non-maximum segmentation screening method combining the target detection results with the semantic segmentation result is used to obtain the final instance segmentation result.
Further, the algorithm steps of the non-maximum segmentation screening method are as follows:
1) sort the detection results in descending order of their object scores;
2) select the detection m with the highest score, add it to the final detection result list and delete it from the original list; according to its detected class and detection box, obtain the corresponding per-pixel segmentation result from the semantic segmentation output and use it as the candidate instance segmentation set of m, storing the set as key-value pairs of object score and per-pixel segmentation result;
3) compute the intersection-over-union between m and the remaining detections, find all results whose IoU exceeds 0.5, delete them from the detection result list, and add their object scores and segmentation results to the candidate instance segmentation set of m;
4) linearly combine all candidate instance segmentation results of m to obtain the final instance segmentation result of m;
5) repeat 2)-4) until the detection result list is empty, then terminate the algorithm and return the final detection result list.
Compared with the prior art, the invention has the following beneficial effects: with only minor modifications to an existing, mature one-stage target detection framework, the model can also complete the instance segmentation task, obtaining instance segmentation results of good quality while retaining the detection speed of the original framework.
Drawings
FIG. 1 is a schematic diagram of the instance segmentation method based on a one-stage target detection framework;
FIG. 2 is a schematic diagram of target object labeling in an image;
FIG. 3 is a schematic diagram of the main-body feature extraction module, the second part of the network architecture;
FIG. 4 is a block diagram of the detection branch and segmentation branch modules of the network;
FIG. 5 is a schematic diagram of the dataset processing flow.
Detailed Description
The present invention is described in further detail below; however, its embodiments are not limited to the following, and the scope of protection is not limited to these examples.
An instance segmentation method based on a one-stage target detection framework is implemented through the following steps:
S1, acquire the images to be detected and their annotations for the instance segmentation task, and label the target objects in the training images as dense point objects at the network output layer, as shown in FIG. 2; the data acquisition and preprocessing steps are shown in FIG. 5. In this embodiment, the target objects contained in each image are expressed as dense point objects: a tensor of size [B, H, W] is output at each scale, where B is the batch size and H, W are the height and width of that scale; if and only if the detection box of an object contains certain points of the tensor, those points are encoded as point objects of that object, as shown in FIG. 2. Specifically, each point object is encoded as follows:
S11, the object class is encoded as a one-hot vector of length C;
S12, the object confidence is encoded as the object centerness;
S13, the target detection box is encoded as a vector (L, R, T, B) of length 4, as shown in FIG. 2(b),
where C is the number of classification categories, and L, R, T, B denote the distances from the point object to the left, right, top and bottom edges of the corresponding target detection box.
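The encoding of one scale can be illustrated with a short sketch. The snippet below is only an illustration under assumptions (NumPy, per-image processing, and the helper name encode_point_objects are not part of the patent); it encodes the ground-truth boxes as per-location one-hot class vectors, a centerness value, and (L, R, T, B) distances, with the centerness assumed to take the FCOS-style form noted later in the text.

import numpy as np

def encode_point_objects(boxes, labels, feat_h, feat_w, stride, num_classes):
    # Hedged sketch: encode the ground-truth boxes of one image as dense point objects
    # on a single output scale.  boxes is (N, 4) in (x1, y1, x2, y2) image coordinates,
    # labels is (N,) with class indices in [0, num_classes).
    cls_map = np.zeros((feat_h, feat_w, num_classes), dtype=np.float32)  # one-hot classes
    ctr_map = np.zeros((feat_h, feat_w), dtype=np.float32)               # object centerness
    reg_map = np.zeros((feat_h, feat_w, 4), dtype=np.float32)            # (L, R, T, B)

    # centers of the feature-map cells mapped back to image coordinates
    ys, xs = np.meshgrid(np.arange(feat_h), np.arange(feat_w), indexing="ij")
    cx = (xs + 0.5) * stride
    cy = (ys + 0.5) * stride

    for (x1, y1, x2, y2), c in zip(boxes, labels):
        inside = (cx >= x1) & (cx <= x2) & (cy >= y1) & (cy <= y2)
        L, R = cx - x1, x2 - cx            # distances to the left / right box edges
        T, B = cy - y1, y2 - cy            # distances to the top / bottom box edges
        # centerness (assumed FCOS-style): 1 at the box center, smaller towards the edges
        eps = 1e-9
        ctr = np.sqrt((np.minimum(L, R) / (np.maximum(L, R) + eps))
                      * (np.minimum(T, B) / (np.maximum(T, B) + eps)))
        cls_map[inside, c] = 1.0
        ctr_map[inside] = ctr[inside]
        reg_map[inside] = np.stack([L, R, T, B], axis=-1)[inside]
    return cls_map, ctr_map, reg_map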
S2, construct the complete instance segmentation network model based on a one-stage target detection framework. As shown in the training stage of FIG. 1, basic feature extraction of the original image is completed by the backbone network, the basic features are further mined by the main-body feature extraction module, and the final outputs of the network model are obtained through the detection layers and the segmentation layer. The complete network structure is shown in the training stage of FIG. 1. The deep network model designed by the invention consists of a backbone network, a main-body feature extraction module, several target detection modules and a semantic segmentation module; the four parts are as follows:
s21, a backbone network for assisting in completing basic feature extraction, which includes a network input layer for receiving image tensor input and five stages of down-sampling feature extraction layers, in this embodiment, ResNet50-C4 is selected as the backbone network for assisting in completing basic feature extraction, the feature extraction capability of this part has been verified in an image recognition model, and its structure is roughly as shown in fig. 1, which includes a network input layer and 5 stages of feature layers, and fig. 1 omits the feature C1 layer without any modification. As shown in the bottom layer of the backbone network portion of fig. 1, the network model input is an image pixel matrix with an image size of 512 x 512.
S22, the main-body feature extraction module further mines the basic features and consists mainly of three parts: lateral connections from the basic feature layers to the feature extraction layers, bottom-up down-sampling from lower to higher feature layers, and top-down up-sampling from higher to lower feature layers. In this embodiment, the three parts of the module are as follows:
1) The first part is the lateral connection from a basic feature layer to a feature extraction layer, which consists of a batch normalization layer and a 1 x 1 convolution preceded by an activation layer, without any down-sampling or up-sampling, as shown in the lateral-connection portion of FIG. 3. In this embodiment this connection is used in four places, namely C2 to P2, C3 to P3, C4 to P4 and C5 to P5 in FIG. 1.
2) The second part is the bottom-up connection from a lower feature layer to a higher feature layer, as shown in the down-sampling portion of FIG. 3; it consists of a batch normalization layer and a 3 x 3 convolution preceded by an activation layer, where the convolution performs the down-sampling. In this embodiment this connection is used in two places, namely P5 to P6 and P6 to P7 in FIG. 1.
3) The third part is the top-down connection from a higher feature layer to a lower feature layer, as shown in the up-sampling portion of FIG. 3; it is a concatenation layer that concatenates the up-sampled layer with the lateral connection. In this embodiment this connection is used in three places, namely P5 to P4, P4 to P3 and P3 to P2 in FIG. 1.
Together, the three parts of the module finally output one feature tensor for each scale.
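A compact sketch of this module is given below as an illustration only: PyTorch, the ReLU activation, the channel counts, the nearest-neighbour up-sampling, and the 1 x 1 convolution used after each concatenation to keep the channel count fixed are all assumptions not fixed by the text, and the ordering of activation, batch normalization and convolution in each connection follows one possible reading of the description.

import torch
import torch.nn as nn
import torch.nn.functional as F

def act_bn_conv(in_ch, out_ch, kernel, stride):
    # activation, then batch norm, then convolution; the exact ordering is an assumption
    return nn.Sequential(
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(in_ch),
        nn.Conv2d(in_ch, out_ch, kernel_size=kernel, stride=stride, padding=kernel // 2),
    )

class BodyFeatureModule(nn.Module):
    # Hedged sketch of the main-body feature extraction module: lateral 1x1 connections from
    # C2-C5, bottom-up 3x3 stride-2 connections producing P6/P7, and top-down
    # upsample-and-concatenate connections refining P4-P2.  Channel counts are assumptions.
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(act_bn_conv(c, out_ch, 1, 1) for c in in_channels)
        self.down = nn.ModuleList(act_bn_conv(out_ch, out_ch, 3, 2) for _ in range(2))
        self.merge = nn.ModuleList(nn.Conv2d(2 * out_ch, out_ch, 1) for _ in range(3))

    def forward(self, c2, c3, c4, c5):
        p2, p3, p4, p5 = (lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5)))
        p6 = self.down[0](p5)                      # bottom-up: P5 -> P6
        p7 = self.down[1](p6)                      # bottom-up: P6 -> P7
        # top-down: upsample and concatenate with the lateral connection
        p4 = self.merge[0](torch.cat([F.interpolate(p5, scale_factor=2), p4], dim=1))
        p3 = self.merge[1](torch.cat([F.interpolate(p4, scale_factor=2), p3], dim=1))
        p2 = self.merge[2](torch.cat([F.interpolate(p3, scale_factor=2), p2], dim=1))
        return p2, p3, p4, p5, p6, p7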
S23, the target detection module produces the target detection output. The third part of the model, the detection layers, performs detection-specific feature selection on the features generated by the second part, the main-body feature layers, and finally outputs two result branches. As shown in FIG. 1, the features of the five levels P3, P4, P5, P6 and P7 go through the detection layers to generate the corresponding target detection outputs. In this embodiment the five detection modules are each responsible for detecting objects of a different size, with the detection modules generated from P3, P4, P5, P6 and P7 responsible for the size ranges [0,16], [16,32], [32,64], [64,128] and [128,512] respectively, which helps the model detect objects of different sizes. Each detection module contains two branches, both of which receive the feature tensor extracted by the main-body feature extraction module and pass it through four convolutional layers with kernel size 3 and down-sampling scale 1; the two branches are a classification branch and a regression branch. In the classification branch, the feature tensor passes through convolutional branches with C output channels and 1 output channel respectively to give two outputs, the classification probabilities and the object confidence, as shown in the classification branch of FIG. 4. The regression branch performs regression prediction of the box: the feature tensor passes through a convolutional branch with 4 output channels to give the target detection regression result, as shown in the regression branch of FIG. 4; the four box position parameters output by the regression branch are (L, R, T, B), and the final convolution result is activated with the ReLU function. The two outputs of the classification branch have a value range of [0,1], and their final convolution results are activated with the sigmoid function. The object confidence is expressed as the target centerness, which represents the degree to which a point object lies at the center of its object.
The centerness expression is reproduced as image BDA0002431967590000061 in the original publication. This representation effectively filters out low-quality detections, and its value range is [0, 1]. In particular, the bias variables of the final convolution kernels of this module are all initialized to the constant 0. This module is shown in FIG. 4.
The fourth part of the model, the semantic segmentation module, produces the semantic segmentation output. It receives only the lowest-level feature tensor of the main-body feature extraction module as input and obtains a per-class semantic segmentation mask through four convolutional layers with kernel size 3 and down-sampling scale 1 followed by a convolutional layer with C output channels; the result matrix has size [B, 128, 128, C], and the final convolution result is activated with the sigmoid function, where B is the batch size and C is the number of categories. In this embodiment B equals 12 and C equals 80 (using the MS COCO dataset).
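The two output heads described in S23 and S24 can be sketched as follows. This is an illustration under assumptions (PyTorch, ReLU between the four 3 x 3 convolutions, 3 x 3 kernels for the final output convolutions, and a separate convolution tower per branch), not the patented implementation itself.

import torch
import torch.nn as nn

def conv_tower(channels, n=4):
    # four 3 x 3 convolutions at stride 1, as described; the ReLU in between is an assumption
    layers = []
    for _ in range(n):
        layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class DetectionHead(nn.Module):
    # One per-scale detection module: a regression branch (4 channels, ReLU-activated) and a
    # classification branch (C class channels plus 1 confidence channel, sigmoid-activated).
    def __init__(self, channels, num_classes):
        super().__init__()
        self.reg_tower = conv_tower(channels)
        self.cls_tower = conv_tower(channels)
        self.reg = nn.Conv2d(channels, 4, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(channels, num_classes, kernel_size=3, padding=1)
        self.ctr = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        for conv in (self.reg, self.cls, self.ctr):   # final-layer biases initialized to 0, as stated
            nn.init.constant_(conv.bias, 0.0)

    def forward(self, x):
        ltrb = torch.relu(self.reg(self.reg_tower(x)))   # (L, R, T, B) regression, non-negative
        f = self.cls_tower(x)
        return ltrb, torch.sigmoid(self.cls(f)), torch.sigmoid(self.ctr(f))

class SegmentationHead(nn.Module):
    # Semantic segmentation module: a four-conv tower followed by a C-channel output, sigmoid-activated.
    def __init__(self, channels, num_classes):
        super().__init__()
        self.tower = conv_tower(channels)
        self.out = nn.Conv2d(channels, num_classes, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.sigmoid(self.out(self.tower(x)))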
S3, design a multi-task loss function adapted to the instance segmentation task for deep-learning training. The multi-task loss function designed by the invention consists of the loss functions for the target detection box, the object confidence, the object classification and the object segmentation mask.
In this embodiment, both the labels and the outputs of the target detection box use the (L, R, T, B) representation, which is more intuitive and more efficient for computing the intersection-over-union than the traditional (X, Y, W, H) representation; the target detection box part of the multi-task loss function therefore adopts an intersection-over-union (IoU) loss function.
S32, the object confidence part adopts a focal loss function adapted to the object centerness, which improves the accuracy of the target detection task. In this embodiment the improved focal loss is defined by an expression (reproduced as image BDA0002431967590000071 in the original publication) built from the following terms: y_t is the labeled ground-truth value; an indicator function equals 1 where y_t is greater than 0 and 0 elsewhere; α_t is a fraction in [0, 0.5] whose weighting makes the loss of the majority samples (those with y_t equal to 0, i.e. locations with no object label) lower than that of the minority samples (those with y_t greater than 0, i.e. locations carrying an object label); γ is a number greater than 0, and the modulating factor |y_t - p_t|^γ makes the loss of easily predicted samples (small |y_t - p_t|) lower than that of hard samples (large |y_t - p_t|); p_t is the predicted classification probability of the corresponding output.
S33, the object classification part adopts a focal loss function adapted to the multi-class task; the improved focal loss improves the classification accuracy for classes with few samples in the dataset. In this embodiment it is defined by an expression (reproduced as image BDA0002431967590000074 in the original publication) built from the following terms: N_t is the number of objects of the class in the dataset, and the corresponding weighting makes the loss of majority-class samples lower than that of minority-class samples; y_t is the labeled ground-truth class; γ is a number greater than 0 with the same meaning as in the improved binary focal loss; p_t is the predicted classification probability of the corresponding class.
S34, in this embodiment, the semantic segmentation part employs a cross-entropy loss function.
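Because the improved focal loss expressions appear only as images in the published text, the sketch below gives plausible forms assembled from the properties described above (the α_t weighting, the |y_t - p_t|^γ modulation, and the per-class N_t weighting). PyTorch, the function names, and the exact arrangement of the terms are assumptions, not the patented formulas.

import torch
import torch.nn.functional as F

def centerness_focal_loss(pred, target, alpha=0.25, gamma=2.0):
    # pred and target are per-location centerness values in [0, 1]; target == 0 means "no object".
    pos = (target > 0).float()                              # indicator 1[y_t > 0]
    alpha_t = alpha * (1.0 - pos) + (1.0 - alpha) * pos     # no-object locations weighted by alpha <= 0.5
    modulator = (target - pred).abs().pow(gamma)            # down-weight easy samples (small |y_t - p_t|)
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    return (alpha_t * modulator * bce).sum() / pos.sum().clamp(min=1.0)

def class_balanced_focal_loss(pred, target, class_counts, gamma=2.0):
    # pred and target have shape (..., C); class_counts[c] stands for N_t, the number of
    # objects of class c in the dataset.
    weight = 1.0 / class_counts.float().clamp(min=1.0)      # classes with few samples get larger weight
    weight = weight * (len(class_counts) / weight.sum())    # normalise so the weights average to 1
    modulator = (target - pred).abs().pow(gamma)
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    return (weight * modulator * bce).sum() / target.sum().clamp(min=1.0)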
During neural network training, the network parameters are optimized with stochastic gradient descent. In this embodiment the initial learning rate is 0.01, the momentum is 0.9, the weight decay is 0.0005, the batch size is 12, and training runs for 300 epochs of 500 batches each; the learning rate is decayed at epochs 180, 240 and 280, each time by a factor of 0.1.
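As an illustration of the schedule just described, a training loop might look like the sketch below; PyTorch and the assumption that the model returns the multi-task loss when called with images and targets are used only for illustration.

import torch

def train(model, loader, epochs=300, iters_per_epoch=500):
    # Hedged sketch of the optimisation schedule described above.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=0.0005)
    # learning rate multiplied by 0.1 at epochs 180, 240 and 280 of 300
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[180, 240, 280], gamma=0.1)
    for epoch in range(epochs):
        it = iter(loader)
        for _ in range(iters_per_epoch):          # 500 batches per epoch, batch size 12
            images, targets = next(it)
            loss = model(images, targets)         # assumed to return the combined multi-task loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()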
S4, inference stage: the model outputs are processed to obtain the final result by combining the target detection results with the semantic segmentation result. The deep network model outputs target detection results at five different scales and one semantic segmentation result; the target detection results are first merged, the Top-K results are then screened out, and finally a non-maximum segmentation screening method combining the target detection results with the semantic segmentation result is used to obtain the final instance segmentation result, as shown in the inference stage of FIG. 1. The specific steps are as follows.
the output of the semantic segmentation module can be directly used as a segmentation mask result, and the output of the target detection module is processed by the following steps of firstly obtaining the output of 5 detection modules, obtaining all target detection results, wherein the position parameters (L, R, T, B) of a target detection frame need to be decoded into a traditional representation mode of (X, Y, X, Y), multiplying the downsampling scale corresponding to each output to restore the size corresponding to the original image, finally combining all detection outputs to obtain an initial detection result, simultaneously directly using the output of the semantic segmentation module as the segmentation mask result, then sequencing the obtained initial detection results according to object scores, selecting Top K results, and screening out the result of which the object score is more than 0.5.
After obtaining the post-processed detection results and segmentation result, they are screened with a modified non-maximum suppression (NMS) method, the non-maximum segmentation screening method; its algorithm steps are as follows (a code sketch is given after the steps):
1) sort the detection results in descending order of their object scores;
2) select the detection m with the highest score, add it to the final detection result list and delete it from the original list; according to its detected class and detection box, obtain the corresponding per-pixel segmentation result from the semantic segmentation output and use it as the candidate instance segmentation set of m, storing the set as key-value pairs of object score and per-pixel segmentation result;
3) compute the intersection-over-union between m and the remaining detections, find all results whose IoU exceeds 0.5, delete them from the detection result list, and add their object scores and segmentation results to the candidate instance segmentation set of m;
4) linearly combine all candidate instance segmentation results of m to obtain the final instance segmentation result of m;
5) repeat 2)-4) until the detection result list is empty, then terminate the algorithm and return the final detection result list.
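A sketch of these steps in code is given below. NumPy, the helper names, and the score-weighted average used for the linear combination in step 4) are assumptions; the patent does not fix the combination weights.

import numpy as np

def box_iou(a, b):
    # IoU between one box a = (x1, y1, x2, y2) and an array of boxes b with shape (N, 4)
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def crop_mask(sem_seg, box, cls):
    # per-pixel segmentation of class `cls` restricted to the detection box
    x1, y1, x2, y2 = box.astype(int)
    m = np.zeros(sem_seg.shape[1:], dtype=np.float32)
    m[y1:y2, x1:x2] = sem_seg[cls, y1:y2, x1:x2]
    return m

def segmentation_nms(boxes, scores, classes, sem_seg, iou_thresh=0.5):
    # boxes: (N, 4), scores: (N,), classes: (N,), sem_seg: (C, H, W) semantic segmentation output
    order = list(np.argsort(-scores))                 # 1) sort by object score, descending
    results = []
    while order:
        i = order.pop(0)                              # 2) highest-scoring remaining detection m
        candidates = [(scores[i], crop_mask(sem_seg, boxes[i], classes[i]))]
        if order:
            ious = box_iou(boxes[i], boxes[order])
            overlapped = [j for j, iou in zip(order, ious) if iou > iou_thresh]
            order = [j for j in order if j not in overlapped]
            for j in overlapped:                      # 3) absorb heavily overlapping detections
                candidates.append((scores[j], crop_mask(sem_seg, boxes[j], classes[j])))
        weights = np.array([s for s, _ in candidates])
        weights = weights / weights.sum()             # 4) linear combination, score-weighted average assumed
        mask = sum(w * m for w, (_, m) in zip(weights, candidates))
        results.append({"score": scores[i], "class": classes[i],
                        "box": boxes[i], "mask": mask})
    return results                                    # 5) final detection result list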
The final detection result list contains, for each detected object, its object score, class, target detection box and segmentation mask, and can be used as the final output of the instance segmentation task.
In summary, the instance segmentation method balances detection quality and detection speed to a certain extent while making only minor modifications to an existing, mature one-stage target detection framework; the model is simple yet efficient and practical.
The above embodiment is one with good experimental results, but the present invention is not limited to this implementation or form; any modification of the technical scheme of the invention, such as simple alteration, simplification, combination or replacement, falls within the protection scope of the invention.

Claims (6)

1. An instance segmentation method based on a one-stage target detection framework, characterized by comprising the following steps:
S1, labeling the target objects in the training images as dense point objects at the network output layer;
S2, constructing a complete instance segmentation network model based on a one-stage target detection framework;
S3, designing a multi-task loss function adapted to the instance segmentation task for deep-learning training;
S4, in the inference stage, combining the target detection results and the semantic segmentation result to obtain the final instance segmentation result.
2. The instance segmentation method based on a one-stage target detection framework of claim 1, wherein the multi-task loss function in step S3 consists of four parts, namely the target detection box loss, the object confidence loss, the object classification loss and the semantic segmentation loss, specifically as follows:
S31, the target detection box part adopts an intersection-over-union (IoU) loss function;
S32, the object confidence part adopts a focal loss function adapted to the object centerness (its expression is reproduced as image FDA0002431967580000011 in the original publication);
S33, the object classification part adopts a focal loss function adapted to the multi-class task (its expression is reproduced as image FDA0002431967580000012 in the original publication);
S34, the semantic segmentation part adopts a cross-entropy loss function,
where y_t is the labeled ground-truth value; an indicator function equals 1 where y_t is greater than 0 and 0 elsewhere; α_t is a fraction in [0, 0.5]; γ is a number greater than 0; p_t is the predicted classification probability of the corresponding class; and N_t is the number of objects of that class in the dataset.
3. The instance segmentation method based on a one-stage target detection framework of claim 2, wherein in step S1 the target objects contained in each training image are expressed as dense point objects: a tensor of size [B, H, W] is output at each scale, where B is the batch size and H, W are the height and width of that scale; if and only if the detection box of an object contains certain points of the tensor, those points are encoded as point objects of that object, in the following manner:
S11, the object class is encoded as a one-hot vector of length C;
S12, the object confidence is encoded as the object centerness;
S13, the target detection box is encoded as a vector (L, R, T, B) of length 4;
where C is the number of classification categories, and L, R, T, B denote the distances from the point object to the left, right, top and bottom edges of the corresponding target detection box.
4. The instance segmentation method based on a one-stage target detection framework of claim 1, wherein in step S2 the instance segmentation network model based on the one-stage target detection framework comprises a backbone network, a main-body feature extraction module, a target detection module and a semantic segmentation module, specifically as follows:
S21, the backbone network assists in completing basic feature extraction and comprises a network input layer that receives the image tensor and five stages of down-sampling feature extraction layers;
S22, the main-body feature extraction module further mines the basic features; it comprises three parts, namely lateral connections from the basic feature layers to the feature extraction layers, bottom-up down-sampling from lower to higher feature layers, and top-down up-sampling from higher to lower feature layers, and finally outputs one feature tensor per scale;
S23, the target detection module produces the target detection output; five detection modules at different scales are each responsible for detecting objects of a different size range; each detection module contains two branches, both of which receive the feature tensor extracted by the main-body feature extraction module and pass it through four convolutional layers with kernel size 3 and down-sampling scale 1; one branch is the regression branch, in which the feature tensor passes through a convolutional branch with 4 output channels to give the target detection regression result, and the other is the classification branch, in which the feature tensor passes through convolutional branches with C output channels and 1 output channel respectively to give the classification probabilities and the object confidence;
S24, the semantic segmentation module produces the semantic segmentation output; it receives only the bottom-level feature tensor of the main-body feature extraction module as input and obtains the per-class semantic segmentation result through four convolutional layers with kernel size 3 and down-sampling scale 1 followed by a convolutional layer with C output channels.
5. The instance segmentation method based on a one-stage target detection framework of claim 4, wherein the instance segmentation network model outputs target detection results at five different scales and a semantic segmentation result at one scale; at the inference stage the target detection results are merged, the Top-K results are screened out, and finally a non-maximum segmentation screening method combining the target detection results with the semantic segmentation result is used to obtain the final instance segmentation result.
6. The instance segmentation method based on a one-stage target detection framework of claim 5, wherein the algorithm steps of the non-maximum segmentation screening method are as follows:
1) sort the detection results in descending order of their object scores;
2) select the detection m with the highest score, add it to the final detection result list and delete it from the original list; according to its detected class and detection box, obtain the corresponding per-pixel segmentation result from the semantic segmentation output and use it as the candidate instance segmentation set of m, storing the set as key-value pairs of object score and per-pixel segmentation result;
3) compute the intersection-over-union between m and the remaining detections, find all results whose IoU exceeds 0.5, delete them from the detection result list, and add their object scores and segmentation results to the candidate instance segmentation set of m;
4) linearly combine all candidate instance segmentation results of m to obtain the final instance segmentation result of m;
5) repeat 2)-4) until the detection result list is empty, then terminate the algorithm and return the final detection result list.
CN202010239127.4A 2020-03-30 2020-03-30 Instance segmentation method based on one-stage target detection framework Active CN111461127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010239127.4A CN111461127B (en) 2020-03-30 2020-03-30 Instance segmentation method based on one-stage target detection framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010239127.4A CN111461127B (en) 2020-03-30 2020-03-30 Instance segmentation method based on one-stage target detection framework

Publications (2)

Publication Number Publication Date
CN111461127A true CN111461127A (en) 2020-07-28
CN111461127B CN111461127B (en) 2023-06-06

Family

ID=71679336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010239127.4A Active CN111461127B (en) 2020-03-30 2020-03-30 Instance segmentation method based on one-stage target detection framework

Country Status (1)

Country Link
CN (1) CN111461127B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016559A (en) * 2020-08-26 2020-12-01 北京推想科技有限公司 Example segmentation model training method and device and image processing method and device
CN112036555A (en) * 2020-11-05 2020-12-04 北京亮亮视野科技有限公司 Method and device for optimizing target detection framework, storage medium and electronic equipment
CN112102250A (en) * 2020-08-20 2020-12-18 西北大学 Method for establishing and detecting pathological image detection model with training data as missing label
CN112507777A (en) * 2020-10-10 2021-03-16 厦门大学 Optical remote sensing image ship detection and segmentation method based on deep learning
CN112508029A (en) * 2020-12-03 2021-03-16 苏州科本信息技术有限公司 Instance segmentation method based on target box labeling
CN112580646A (en) * 2020-12-08 2021-03-30 北京农业智能装备技术研究中心 Tomato fruit maturity dividing method and picking robot
CN112766046A (en) * 2020-12-28 2021-05-07 深圳市捷顺科技实业股份有限公司 Target detection method and related device
CN112836615A (en) * 2021-01-26 2021-05-25 西南交通大学 Remote sensing image multi-scale solid waste detection method based on deep learning and global reasoning
CN113673505A (en) * 2021-06-29 2021-11-19 北京旷视科技有限公司 Example segmentation model training method, device and system and storage medium
CN113762190A (en) * 2021-09-15 2021-12-07 中科微至智能制造科技江苏股份有限公司 Neural network-based parcel stacking detection method and device
CN114663724A (en) * 2022-03-21 2022-06-24 国网江苏省电力有限公司南通供电分公司 Intelligent identification method and system for kite string image
CN117152422A (en) * 2023-10-31 2023-12-01 国网湖北省电力有限公司超高压公司 Ultraviolet image anchor-free frame target detection method, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985269A (en) * 2018-08-16 2018-12-11 东南大学 Converged network driving environment sensor model based on convolution sum cavity convolutional coding structure
CN110084234A (en) * 2019-03-27 2019-08-02 东南大学 A kind of sonar image target identification method of Case-based Reasoning segmentation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985269A (en) * 2018-08-16 2018-12-11 东南大学 Converged network driving environment sensor model based on convolution sum cavity convolutional coding structure
CN110084234A (en) * 2019-03-27 2019-08-02 东南大学 A kind of sonar image target identification method of Case-based Reasoning segmentation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
罗晖; 芦春雨; 郑翔文: "A semantic segmentation network based on multi-scale corner detection" *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102250B (en) * 2020-08-20 2022-11-04 西北大学 Method for establishing and detecting pathological image detection model with training data as missing label
CN112102250A (en) * 2020-08-20 2020-12-18 西北大学 Method for establishing and detecting pathological image detection model with training data as missing label
CN112016559A (en) * 2020-08-26 2020-12-01 北京推想科技有限公司 Example segmentation model training method and device and image processing method and device
CN112507777A (en) * 2020-10-10 2021-03-16 厦门大学 Optical remote sensing image ship detection and segmentation method based on deep learning
CN112036555B (en) * 2020-11-05 2021-02-05 北京亮亮视野科技有限公司 Method and device for optimizing target detection framework, storage medium and electronic equipment
CN112036555A (en) * 2020-11-05 2020-12-04 北京亮亮视野科技有限公司 Method and device for optimizing target detection framework, storage medium and electronic equipment
CN112508029A (en) * 2020-12-03 2021-03-16 苏州科本信息技术有限公司 Instance segmentation method based on target box labeling
CN112580646A (en) * 2020-12-08 2021-03-30 北京农业智能装备技术研究中心 Tomato fruit maturity dividing method and picking robot
CN112766046A (en) * 2020-12-28 2021-05-07 深圳市捷顺科技实业股份有限公司 Target detection method and related device
CN112766046B (en) * 2020-12-28 2024-05-10 深圳市捷顺科技实业股份有限公司 Target detection method and related device
CN112836615A (en) * 2021-01-26 2021-05-25 西南交通大学 Remote sensing image multi-scale solid waste detection method based on deep learning and global reasoning
CN113673505A (en) * 2021-06-29 2021-11-19 北京旷视科技有限公司 Example segmentation model training method, device and system and storage medium
CN113762190A (en) * 2021-09-15 2021-12-07 中科微至智能制造科技江苏股份有限公司 Neural network-based parcel stacking detection method and device
CN114663724A (en) * 2022-03-21 2022-06-24 国网江苏省电力有限公司南通供电分公司 Intelligent identification method and system for kite string image
CN117152422A (en) * 2023-10-31 2023-12-01 国网湖北省电力有限公司超高压公司 Ultraviolet image anchor-free frame target detection method, storage medium and electronic equipment
CN117152422B (en) * 2023-10-31 2024-02-13 国网湖北省电力有限公司超高压公司 Ultraviolet image anchor-free frame target detection method, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN111461127B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
CN111461127B (en) Instance segmentation method based on one-stage target detection framework
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN111428718B (en) Natural scene text recognition method based on image enhancement
CN112966684B (en) Cooperative learning character recognition method under attention mechanism
CN111210443A (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN110197182A (en) Remote sensing image semantic segmentation method based on contextual information and attention mechanism
CN108062756A (en) Image, semantic dividing method based on the full convolutional network of depth and condition random field
CN110322495A (en) A kind of scene text dividing method based on Weakly supervised deep learning
CN110070091B (en) Semantic segmentation method and system based on dynamic interpolation reconstruction and used for street view understanding
CN110674305A (en) Deep feature fusion model-based commodity information classification method
CN110826609B (en) Double-current feature fusion image identification method based on reinforcement learning
CN111310766A (en) License plate identification method based on coding and decoding and two-dimensional attention mechanism
CN113569865A (en) Single sample image segmentation method based on class prototype learning
CN112733590A (en) Pedestrian re-identification method based on second-order mixed attention
CN116645592B (en) Crack detection method based on image processing and storage medium
CN114092815B (en) Remote sensing intelligent extraction method for large-range photovoltaic power generation facility
CN112733942A (en) Variable-scale target detection method based on multi-stage feature adaptive fusion
CN114821022A (en) Credible target detection method integrating subjective logic and uncertainty distribution modeling
Sethy et al. Off-line Odia handwritten numeral recognition using neural network: a comparative analysis
CN114373092A (en) Progressive training fine-grained vision classification method based on jigsaw arrangement learning
CN116844143B (en) Embryo development stage prediction and quality assessment system based on edge enhancement
Li A deep learning-based text detection and recognition approach for natural scenes
CN116797821A (en) Generalized zero sample image classification method based on fusion visual information
CN110222222A (en) Based on deep layer theme from the multi-modal retrieval method of encoding model
CN110070018A (en) A kind of earthquake disaster scene recognition method of combination deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant