CN112396620A - Image semantic segmentation method and system based on multiple thresholds - Google Patents


Info

Publication number
CN112396620A
CN112396620A
Authority
CN
China
Prior art keywords
semantic segmentation
region
image
segmentation
interest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011284251.9A
Other languages
Chinese (zh)
Inventor
耿玉水
刘建鑫
赵晶
张康
李文骁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202011284251.9A priority Critical patent/CN112396620A/en
Publication of CN112396620A publication Critical patent/CN112396620A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/136Segmentation; Edge detection involving thresholding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Abstract

The invention discloses an image semantic segmentation method and system based on multiple thresholds, comprising the following steps: extracting region-of-interest features from the multi-scale feature map of an image according to the target object; segmenting the restored region-of-interest features sequentially through multi-level thresholds and training a preset image semantic segmentation model; and processing the image to be segmented with the trained image semantic segmentation model to obtain a semantic segmentation result. The region-of-interest features are extracted with a non-maximum suppression method, which avoids the problem of duplicated proposal regions; a multi-level threshold is set for the segmentation branches, and DenseCRF is used to optimize the segmentation result, thereby improving segmentation accuracy.

Description

Image semantic segmentation method and system based on multiple thresholds
Technical Field
The invention relates to the technical field of image processing, in particular to an image semantic segmentation method and system based on multiple thresholds.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In the field of machine vision, image segmentation refers to dividing an image into a number of non-overlapping sub-regions such that features within the same sub-region show a certain similarity while features of different sub-regions show obvious differences. In practice, many application scenarios must process large amounts of image data of complex and varied types at the same time, and traditional image segmentation algorithms, such as threshold-based segmentation and the watershed algorithm, can no longer meet current requirements. With the rapid progress of deep learning, more and more deep learning solutions are being applied in machine vision, and progress in image segmentation now depends on the development of deep learning.
At present there are many image segmentation algorithms based on deep learning; networks such as VGGNet and ResNet still have advantages for feature extraction. Long et al. proposed the fully convolutional network (FCN) at CVPR 2015, and most image segmentation methods use the FCN or parts of it to some degree. Pinheiro et al. proposed a deep mask segmentation model that segments each instance object by predicting a candidate mask for every instance appearing in the input image, but with low accuracy at segmentation boundaries. He et al. proposed the Mask R-CNN framework, one of the existing segmentation algorithms with the best instance segmentation results. Addressing the problem that Mask R-CNN's classification confidence and predicted mask share one evaluation function, which interferes with the segmentation result, Huang et al. proposed Mask Scoring R-CNN, which optimizes the information flow of Mask R-CNN and improves the quality of the predicted masks, and also performs well on the segmentation task without training on massive data.
However, the inventors have found that existing algorithms still have defects: models are too complex, accuracy is not high, large amounts of labeled data are required for training, and so on; in most cases these problems cannot all be handled at once and trade-offs must be made. Moreover, the Mask Scoring R-CNN algorithm was proposed for image instance segmentation, where the threshold setting is an important factor in obtaining a good mask prediction result. In general, the higher the threshold, the more accurate the prediction, but too high a threshold sharply reduces the number of positive samples, causing the model to overfit; if the threshold is too low, the samples contain more redundant results, making it difficult for the detector to distinguish positive from negative samples and harming training. Consequently, in some scenes the network produces segmentation edges that are rough and insufficiently fine, overshooting or falling short of the expected position; with a fixed threshold setting, tuning parameters does little to improve segmentation at image edges.
Disclosure of Invention
In order to solve the above problems, the invention provides an image semantic segmentation method and system based on multiple thresholds, which extracts region-of-interest features using a non-maximum suppression method, avoiding the problem of duplicated proposal regions; a multi-level threshold is set for the segmentation branches, and DenseCRF is used to optimize the segmentation result, thereby improving segmentation accuracy.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a method for segmenting image semantics based on multiple thresholds, including:
extracting the characteristics of the region of interest in the image multi-scale characteristic map according to the target object;
segmenting the restored region of interest characteristics sequentially through a multi-level threshold value, and training a preset image semantic segmentation model;
and processing the image to be segmented by using the trained image semantic segmentation model to obtain a semantic segmentation result.
In a second aspect, the present invention provides a multi-threshold-based image semantic segmentation system, including:
the feature extraction module is used for extracting region-of-interest features from the multi-scale feature map of the image according to the target object;
the multi-stage segmentation module is used for sequentially segmenting the restored region-of-interest features through multi-stage thresholds so as to train a preset image semantic segmentation model;
and the processing module is used for processing the image to be segmented by using the trained image semantic segmentation model to obtain a semantic segmentation result.
In a third aspect, the present invention provides an electronic device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the method of the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
aiming at the problem of Mask screening R-CNN, namely the problem that the segmentation edge of an image exceeds or develops to an expected position, the invention provides a probability image segmentation algorithm based on multiple thresholds, and a network model can better screen the prediction result of the network by setting a multi-level threshold method, so that the prediction precision is higher; for the problem of image segmentation edge processing, the invention adds the segmentation effect of a DenseCRF optimization network in a segmentation branch, realizes efficient feature extraction, and improves the segmentation efficiency and precision.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention, not to limit it.
Fig. 1 is a flowchart of a multi-threshold-based image semantic segmentation method provided in embodiment 1 of the present invention;
FIG. 2 is a block diagram of a Multi-threshold robust mask branch structure provided in embodiment 1 of the present invention;
fig. 3 is a diagram of a MTPMS R-CNN network model provided in embodiment 1 of the present invention.
Detailed description of embodiments:
the invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example 1
As shown in fig. 1, the present embodiment provides a method for segmenting image semantics based on multiple thresholds, including:
s1: extracting the characteristics of the region of interest in the image multi-scale characteristic map according to the target object;
s2: segmenting the restored region of interest characteristics sequentially through a multi-level threshold value, and training a preset image semantic segmentation model;
s3: and processing the image to be segmented by using the trained image semantic segmentation model to obtain a semantic segmentation result.
In this embodiment, the image semantic segmentation model is pre-constructed on the basis of an MTPMS R-CNN network, with ResNet-101 as the network backbone, i.e. a network of 101 layers. Because images are varied and complex, image features cannot be effectively extracted by a single convolutional neural network alone; a feature pyramid network (FPN), which aids feature extraction, is therefore integrated into the network. The FPN adopts a top-down hierarchical structure with lateral connections and builds a feature pyramid from a single-scale input, solving the multi-scale problem of extracting target objects from images; it has strong robustness and adaptability and requires few parameters.
Therefore, in step S1, this embodiment uses the feature pyramid network structure to extract a multi-scale feature map of the image: the image is processed at different sizes and features for each size are generated; shallow features can distinguish simple, large targets, while deep features can distinguish small targets.
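The patent does not spell out how an RoI is assigned to a pyramid level; a common heuristic from the FPN literature, shown here as a minimal sketch (the canonical size 224, base level, and level bounds are assumptions, not values from the patent), maps larger RoIs to coarser levels:

```python
import numpy as np

def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    # Map an RoI of size w x h to a pyramid level: larger RoIs go to
    # deeper (coarser) levels, smaller RoIs to shallower (finer) ones.
    k = k0 + np.floor(np.log2(np.sqrt(w * h) / canonical))
    return int(np.clip(k, k_min, k_max))
```

With these assumed constants, a 224x224 RoI lands on the base level and the result is clipped to the available pyramid levels.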
In addition, the DCN model has strong learning ability and can automatically acquire high-order nonlinear feature combinations; however, such features are usually implicit and their meaning is hard to interpret. The Cross Network proposed in DCN can acquire cross features explicitly and automatically, and is lighter than a DNN, though correspondingly weaker in expressive capacity; adding the DCN to the network therefore effectively improves its performance.
In step S1, the feature maps of different scales generated by the preceding network layers are input to the target detector, and region-of-interest features are extracted from the appropriate level of the feature pyramid according to the size of the target object. This simple change to the network structure greatly improves detection of small targets, improving accuracy and speed without substantially increasing computation.
In this embodiment, the target detector used to extract region-of-interest features is an RPN, which is equivalent to a sliding-window-based, class-agnostic target detector built on a convolutional neural network. Specifically: anchor boxes are generated by sliding-window scanning according to the size of the target object; several anchors of different sizes and aspect ratios can be generated for one proposal region, and the anchors overlap so as to cover the image as much as possible. The overlap (IoU) between a proposed candidate box region and the expected region directly influences the classification effect, and region-of-interest features are obtained according to this overlap ratio.
Because anchors frequently overlap, proposal regions ultimately overlap on the same target. To solve this duplicate-proposal problem, this embodiment uses the non-maximum suppression (NMS) algorithm to score the overlap between the proposed candidate box regions and the expected region: NMS builds a proposal list sorted by score, iterates over the sorted list, discards proposals whose IoU with a higher-scoring proposal exceeds a predefined threshold, and keeps the higher-scoring proposal. Specifically, the method comprises the following steps:
the scores of all proposed candidate boxes are sorted and the highest score and its corresponding anchor box are selected; the remaining boxes are traversed, and any box whose overlap with the current highest-scoring box exceeds a certain threshold is deleted; then the highest-scoring box among the unprocessed boxes is selected and the process repeats; region-of-interest features are obtained from the overlap between the remaining proposed candidate box regions and the expected region.
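The NMS procedure described above can be sketched as follows (the IoU threshold is illustrative; the patent does not fix its value):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; returns kept indices,
    highest score first, suppressing boxes that overlap a kept box."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]                        # current highest-scoring box
        keep.append(i)
        # intersection of box i with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes overlapping box i
    return keep
```

A usage example: two heavily overlapping proposals on the same target collapse to the higher-scoring one, while a distant proposal survives.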
In this embodiment, 9 anchor boxes of different sizes and aspect ratios are used; note that no edge of an anchor box may extend beyond the edge of the image.
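A minimal sketch of generating the 9 anchors as 3 scales times 3 aspect ratios, centered at the origin (the base stride, scales, and ratios are illustrative values, not taken from the patent):

```python
import numpy as np

def make_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    # 9 anchors: for each scale, three boxes of (roughly) equal area
    # whose height/width ratio equals the given aspect ratio.
    anchors = []
    for s in scales:
        area = (base * s) ** 2
        for r in ratios:
            w = np.sqrt(area / r)
            h = w * r
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)
```

In a full detector these anchors would be translated to every sliding-window position and clipped to the image boundary, per the constraint above.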
Since image segmentation is a pixel-level operation, segmentation must determine whether each given pixel is part of the target, so accuracy is required at the pixel level. After the original image undergoes a series of convolution and pooling operations, and after the box fine-tuning step in the RPN, RoI boxes may have different sizes; if pixel-level segmentation were performed directly, the image target object could not be accurately located, and the RoI must therefore be restored.
Therefore, in step S2, this embodiment uses the region-of-interest alignment layer (RoIAlign) from Mask R-CNN, which applies bilinear interpolation to retain the spatial information on the feature map, eliminating the error caused by the two quantizations in the RoI Pooling layer, solving the region-mismatch problem for image objects, and enabling pixel-level detection and segmentation.
RoIAlign differs from RoI Pooling in that it eliminates the quantization operations: it does not quantize the RoI boundary cells, but instead computes the exact positions of the sampling points in each cell using bilinear interpolation, then produces the final fixed-size RoI output with a max-pooling or average-pooling operation.
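The bilinear interpolation that RoIAlign applies at each sampling point can be sketched as follows (a single-channel feature map is assumed; a real layer would also pool several such samples per bin):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    # Sample feature map `feat` (H, W) at a fractional location (y, x)
    # without quantizing: blend the four surrounding cell values by
    # their distance to the sampling point, as RoIAlign does.
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)
```

Sampling exactly between four cells returns their average, which is what avoids the misalignment that hard quantization would introduce.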
In Mask Scoring R-CNN, the segmentation branch and the detection branch are inserted in parallel, while this embodiment sets multi-level thresholds and therefore needs multi-level branches. Considering that image segmentation is a pixel-level operation, instance-object detection does not help segmentation much, and too many stacked networks slow down operation; therefore, in step S2, this embodiment sets multi-level thresholds only for the segmentation branches.
The classification branch and the bounding-box regression branch use the configuration from Mask Scoring R-CNN, so this embodiment only needs to consider the number of levels of mask-branch thresholds, adding one segmentation branch at each cascade stage to maximize sample diversity for learning the mask prediction task. The result produced by each mask branch is therefore passed, after pooling, to the next-level mask branch as its input.
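How successive mask branches see progressively cleaner samples under rising thresholds can be illustrated with a toy sketch (the threshold schedule 0.5/0.6/0.7 is an assumption, and the box refinement a real cascade performs between stages is omitted):

```python
def cascade_filter(samples, thresholds=(0.5, 0.6, 0.7)):
    """samples: list of (proposal_id, iou) pairs. Each stage keeps only
    the proposals whose IoU with ground truth meets that stage's threshold
    and passes the survivors on, so later stages train on purer samples."""
    per_stage = []
    current = samples
    for t in thresholds:
        current = [(pid, iou) for pid, iou in current if iou >= t]
        per_stage.append(current)
    return per_stage
```

With three proposals at IoU 0.55, 0.65, and 0.75, the three stages see 3, 2, and 1 positive samples respectively, which is the intended trade-off: each later stage is stricter but sees fewer positives.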
For the mask branch, this embodiment uses a fully convolutional network (FCN) as the main body and then uses DenseCRF to optimize the segmentation effect. Because DenseCRF handles segmentation edges more finely during image segmentation, this embodiment lets DenseCRF optimize the segmentation result of the preceding network on top of the original mask branch, to improve the final segmentation accuracy; owing to the properties of DenseCRF, this embodiment attaches DenseCRF only to the mask branch of the last stage.
To overcome the limitations of short-range conditional random fields, the present embodiment adopts a fully-connected conditional random field model with the following energy function:

$$E(x) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j)$$

where x is the label assignment over pixels. The unary potential is $\theta_i(x_i) = -\log P(x_i)$, where $P(x_i)$ is the label probability at pixel i computed by the deep convolutional neural network. The pairwise potential has the same form for every pair and is defined over a fully connected graph, i.e. connecting all pairs of image pixels i and j, using the following expression:

$$\theta_{ij}(x_i, x_j) = \mu(x_i, x_j)\left[ w_1 \exp\!\left(-\frac{\lVert p_i - p_j\rVert^2}{2\sigma_\alpha^2} - \frac{\lVert c_i - c_j\rVert^2}{2\sigma_\beta^2}\right) + w_2 \exp\!\left(-\frac{\lVert p_i - p_j\rVert^2}{2\sigma_\gamma^2}\right) \right]$$

where $\mu(x_i, x_j) = 1$ if $x_i \neq x_j$ and 0 otherwise. The remainder of the expression uses two Gaussian kernels over different feature spaces: the first is a bilateral kernel over pixel position (denoted p) and RGB color (denoted c), and the second depends on pixel position only. The hyperparameters $\sigma_\alpha$, $\sigma_\beta$, $\sigma_\gamma$ control the scale of the Gaussian kernels: the first kernel pushes pixels of similar color and position toward the same label, while the second kernel considers only spatial proximity when enforcing smoothness.
In this embodiment the model admits efficient approximate probabilistic inference. Under the fully factorized mean-field approximation $Q(x) = \prod_i Q_i(x_i)$, the message-passing update can be expressed as a Gaussian convolution in a bilateral space, and high-dimensional filtering algorithms significantly accelerate this computation, making the algorithm very fast in practice.
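The two Gaussian kernels of the pairwise potential can be evaluated for a single pixel pair as follows (the weights w1, w2 and the kernel widths are illustrative values, not taken from the patent):

```python
import numpy as np

def pairwise_kernel(p_i, p_j, c_i, c_j, w1=1.0, w2=1.0,
                    s_alpha=60.0, s_beta=10.0, s_gamma=3.0):
    # Appearance (bilateral) kernel over position p and RGB color c,
    # plus a smoothness kernel over position only, as in the pairwise
    # potential of the fully-connected CRF.
    dp2 = np.sum((np.asarray(p_i, float) - np.asarray(p_j, float)) ** 2)
    dc2 = np.sum((np.asarray(c_i, float) - np.asarray(c_j, float)) ** 2)
    appearance = w1 * np.exp(-dp2 / (2 * s_alpha**2) - dc2 / (2 * s_beta**2))
    smoothness = w2 * np.exp(-dp2 / (2 * s_gamma**2))
    return appearance + smoothness
```

Identical pixels attain the maximum affinity w1 + w2, while pixels that are far apart in both position and color receive near-zero affinity, so differing labels there are barely penalized.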
In this embodiment, an end-to-end training mode is adopted and the network parameters are updated and optimized jointly. The mask branch outputs K binary masks at an encoding resolution of m × m, i.e. a K × m × m output with one mask per class, and a per-pixel sigmoid is applied to each mask; $L_{mask}$ is the average binary cross-entropy loss.
For an RoI whose ground-truth category is k, $L_{mask}$ is defined only on the k-th class: although each point has K binary masks, only the mask of the k-th class contributes to $L_{mask}$; thus, the mask branch has no inter-class competition.
The mask branches make predictions for every category. At inference, the three-level branches are still used when the classification layer selects which output mask to use as the segmentation mask of an instance object produced in the final target-detection stage; because the network progressively optimizes the mask, the final mask prediction comes from the last-level mask branch. A schematic diagram of the mask branches is shown in fig. 2, and the network model structure is shown in fig. 3.
In this embodiment, the loss function of the image semantic segmentation model evaluates the difference between the model's predicted output and the ground truth and directly reflects the training effect of the model; in general, the smaller the loss, the closer the prediction is to the ground truth and the better the model's performance. The loss function of this embodiment's model has two parts: the first is the loss function in the RPN, and the second is the loss function of the three branches. The RPN is used to generate candidate regions and fine-tune bounding boxes, so the RPN loss consists of a target-recognition loss and a bounding-box regression loss:
$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \tau\,\frac{1}{N_{reg}}\sum_i p_i^*\,L_{reg}(t_i, t_i^*)$$

where i is the index of an anchor box in the mini-batch; $N_{cls}$ and $N_{reg}$ are the normalization terms of the classification and regression layers, respectively; $p_i$ is the predicted probability that anchor i is an object, and $p_i^*$ is the ground-truth label, equal to 1 if the anchor is positive and 0 if it is negative; $t_i$ denotes the 4 parameterized coordinates of the predicted candidate box, and $t_i^*$ the 4 parameterized coordinates of the ground-truth region; $L_{cls}$ and $L_{reg}$ are the classification loss and the regression loss, respectively, and a parameter $\tau$ with value 10 is added to balance the influence of the two loss terms.
The branch loss consists of four parts: the classification loss $L_{cls}$, the bounding-box regression loss $L_{box}$, the segmentation loss $L_{mask}$, and the MaskIoU loss $L_{maskiou}$. The branch loss function is as follows:

$$L_{head} = L_{cls} + L_{box} + L_{mask} + L_{maskiou}$$

where the classification loss uses cross-entropy, the bounding-box regression loss is the smooth_l1_loss function, the segmentation mask loss uses the binary_cross_entropy_with_logits loss function, and the MaskIoU loss uses the mean squared error between the predicted mask IoU and the ground truth. The final loss function is as follows:

$$L_{final} = L(\{p_i\},\{t_i\}) + L_{head}$$
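The smooth-L1 bounding-box regression term named above can be sketched as follows (beta = 1 is the common default, assumed here):

```python
import numpy as np

def smooth_l1(residuals, beta=1.0):
    # Quadratic for small residuals, linear for large ones, so box
    # regression outliers do not dominate the gradient.
    x = np.abs(np.asarray(residuals, dtype=float))
    return float(np.sum(np.where(x < beta, 0.5 * x**2 / beta, x - 0.5 * beta)))
```

The two regimes meet smoothly at |x| = beta, which is what distinguishes this loss from plain L1 or L2 regression.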
according to the method, the prediction result of the network can be better screened by setting the multi-level threshold, and the prediction precision of the result is higher; for the problem of image segmentation edge processing, the segmentation effect of a DenseCRF optimization network is added into a multi-stage segmentation branch to obtain an optimal image segmentation result.
Example 2
The embodiment provides an image semantic segmentation system based on multiple thresholds, which includes:
the feature extraction module is used for extracting region-of-interest features from the multi-scale feature map of the image according to the target object;
the multi-stage segmentation module is used for sequentially segmenting the restored region-of-interest features through multi-stage thresholds so as to train a preset image semantic segmentation model;
and the processing module is used for processing the image to be segmented by using the trained image semantic segmentation model to obtain a semantic segmentation result.
It should be noted that the above modules correspond to steps S1 to S3 in embodiment 1, and that the examples and application scenarios realized by the modules are the same as those of the corresponding steps, but are not limited to the disclosure in embodiment 1. The modules described above may, as part of a system, be implemented in a computer system such as a set of computer-executable instructions.
In further embodiments, there is also provided:
an electronic device comprising a memory and a processor and computer instructions stored on the memory and executed on the processor, the computer instructions when executed by the processor performing the method of embodiment 1. For brevity, no further description is provided herein.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method described in embodiment 1.
The method in embodiment 1 may be implemented directly by a hardware processor or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, details are not described here.
Those of ordinary skill in the art will appreciate that the various illustrative elements, i.e., algorithm steps, described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they do not limit the scope of the present invention; those skilled in the art should understand that various modifications and variations can be made, without inventive effort, on the basis of the technical solution of the present invention.

Claims (10)

1. A multi-threshold-based image semantic segmentation method is characterized by comprising the following steps:
extracting the characteristics of the region of interest in the image multi-scale characteristic map according to the target object;
segmenting the restored region of interest characteristics sequentially through a multi-level threshold value, and training a preset image semantic segmentation model;
and processing the image to be segmented by using the trained image semantic segmentation model to obtain a semantic segmentation result.
2. The multi-threshold-based image semantic segmentation method as claimed in claim 1, wherein a feature pyramid network is used to extract a multi-scale feature map of the image, and the multi-scale feature map is input into a pre-trained target detector to extract region-of-interest features of a target object.
3. The method for image semantic segmentation based on multiple threshold values as claimed in claim 1, wherein the extracting the region-of-interest features comprises: and generating an anchor frame by adopting sliding frame scanning according to the size of the target object, and obtaining the characteristics of the region of interest according to the overlapping rate of the suggested candidate frame region and the expected region.
4. The image semantic segmentation method based on the multiple thresholds as claimed in claim 3, characterized in that, the overlapping rate of the suggested candidate frame area and the expected area is scored, the suggested candidate frame area with the highest score and the corresponding anchor frame are selected, the anchor frames of the other suggested candidate frame areas are traversed, and if the overlapping rate of the other anchor frames and the anchor frame with the current highest score is greater than a preset threshold, the anchor frames are deleted; and obtaining the region-of-interest characteristics according to the overlapping rate of the residual suggested candidate box region and the expected region.
5. The multi-threshold-based image semantic segmentation method as claimed in claim 1, wherein the bilinear interpolation of RoIAlign is adopted to restore the region-of-interest features.
6. The multi-threshold-based image semantic segmentation method as claimed in claim 1, wherein a DenseCRF is added to the segmentation branch of the last-level threshold to optimize the segmentation result.
7. The multi-threshold-based image semantic segmentation method as claimed in claim 1, wherein the loss functions of the image semantic segmentation model comprise a target recognition loss, a bounding box regression loss, and a segmentation branch loss, and the segmentation branch loss comprises a classification loss, a bounding box regression loss, a segmentation loss, and a mask IoU loss.
8. A multi-threshold-based image semantic segmentation system, characterized by comprising:
a feature extraction module, configured to extract region-of-interest features from a multi-scale feature map of the image according to the target object;
a multi-level segmentation module, configured to segment the restored region-of-interest features sequentially through multi-level thresholds, so as to train a preset image semantic segmentation model;
and a processing module, configured to process the image to be segmented with the trained image semantic segmentation model to obtain a semantic segmentation result.
9. An electronic device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 7.
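The "multi-level thresholds" of claims 1 and 8 can be read as a cascade of progressively stricter quality gates applied to region proposals, in the spirit of cascaded detection heads. A minimal Python sketch under illustrative assumptions (the thresholds 0.5/0.6/0.7 and the `gt_iou` quality lookup are hypothetical, not specified by the claims):

```python
def cascade_stages(proposals, gt_iou, thresholds=(0.5, 0.6, 0.7)):
    # At each stage, only proposals whose quality score (here, IoU
    # with the ground truth) reaches that stage's threshold survive
    # and are passed on to the next, stricter stage.
    surviving = list(proposals)
    for t in thresholds:
        surviving = [p for p in surviving if gt_iou[p] >= t]
    return surviving
```

Each stage thus refines the previous stage's output, which is the sequential multi-level segmentation the claims describe at a high level.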
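The score-then-traverse-and-delete procedure of claim 4 is the standard non-maximum suppression (NMS) algorithm. A minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) form and an illustrative overlap threshold of 0.5 (the claim leaves the preset threshold open):

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    # Keep the highest-scoring box, delete boxes whose overlap with
    # it exceeds `thresh`, then repeat on the remainder (claim 4).
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```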
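The bilinear interpolation of RoIAlign named in claim 5 samples the feature map at continuous coordinates instead of rounding RoI boundaries to the nearest cell, which is what "restores" the region-of-interest features without quantization error. A minimal single-point sketch (a full RoIAlign averages several such samples per output bin):

```python
import math

def bilinear(feat, y, x):
    # Sample a 2-D feature map `feat` (list of rows) at continuous
    # coordinates (y, x) by blending the four surrounding cells.
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(feat) - 1)
    x1 = min(x0 + 1, len(feat[0]) - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bot = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bot * dy
```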
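The mask IoU term in claim 7 (familiar from Mask Scoring R-CNN, one of the cited works) scores how well a predicted binary mask overlaps the ground truth. A sketch over flat 0/1 masks, with a hypothetical weighted sum combining the four segmentation-branch terms; the unit weights are an illustrative assumption, not taken from the patent:

```python
def mask_iou(pred, gt):
    # IoU between two binary masks given as flat lists of 0/1.
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter / union if union else 0.0

def branch_loss(l_cls, l_box, l_seg, l_maskiou, weights=(1.0, 1.0, 1.0, 1.0)):
    # Weighted sum of the four segmentation-branch terms of claim 7:
    # classification, bounding box regression, segmentation, mask IoU.
    return sum(w * l for w, l in zip(weights, (l_cls, l_box, l_seg, l_maskiou)))
```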
CN202011284251.9A 2020-11-17 2020-11-17 Image semantic segmentation method and system based on multiple thresholds Pending CN112396620A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011284251.9A CN112396620A (en) 2020-11-17 2020-11-17 Image semantic segmentation method and system based on multiple thresholds


Publications (1)

Publication Number Publication Date
CN112396620A true CN112396620A (en) 2021-02-23

Family

ID=74600583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011284251.9A Pending CN112396620A (en) 2020-11-17 2020-11-17 Image semantic segmentation method and system based on multiple thresholds

Country Status (1)

Country Link
CN (1) CN112396620A (en)


Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596184A (en) * 2018-04-25 2018-09-28 清华大学深圳研究生院 Training method of an image semantic segmentation model, readable storage medium, and electronic device
CN109583517A (en) * 2018-12-26 2019-04-05 华东交通大学 An enhanced fully convolutional instance semantic segmentation algorithm suitable for small object detection
CN109598728A (en) * 2018-11-30 2019-04-09 腾讯科技(深圳)有限公司 Image segmentation method, apparatus, diagnostic system and storage medium
CN109816669A (en) * 2019-01-30 2019-05-28 云南电网有限责任公司电力科学研究院 An improved Mask R-CNN image instance segmentation method for identifying power equipment defects
CN110232380A (en) * 2019-06-13 2019-09-13 应急管理部天津消防研究所 Fire night scene restoration method based on the Mask R-CNN neural network
CN110599448A (en) * 2019-07-31 2019-12-20 浙江工业大学 Lung lesion tissue detection system based on transfer learning and the Mask Scoring R-CNN network
CN111339882A (en) * 2020-02-19 2020-06-26 山东大学 Power transmission line hidden danger detection method based on instance segmentation
CN111401293A (en) * 2020-03-25 2020-07-10 东华大学 Gesture recognition method based on a head-lightweight Mask Scoring R-CNN
CN111489327A (en) * 2020-03-06 2020-08-04 浙江工业大学 Cancer cell image detection and segmentation method based on the Mask R-CNN algorithm
CN111862119A (en) * 2020-07-21 2020-10-30 武汉科技大学 Semantic information extraction method based on Mask-RCNN
US20200349763A1 (en) * 2019-05-03 2020-11-05 Facebook Technologies, Llc Semantic Fusion


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIFENG DAI ET AL: "Instance-aware Semantic Segmentation via Multi-task Network Cascades", arXiv:1512.04412v1 *
XIAOXIAO LI ET AL: "Not All Pixels Are Equal: Difficulty-Aware Semantic Segmentation via Deep Layer Cascade", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
ZHAOJIN HUANG ET AL: "Mask Scoring R-CNN", arXiv preprint arXiv:1903.00241v1 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076816A (en) * 2021-03-17 2021-07-06 上海电力大学 Solar photovoltaic module hot spot identification method based on infrared and visible light images
CN113076816B (en) * 2021-03-17 2023-06-02 上海电力大学 Solar photovoltaic module hot spot identification method based on infrared and visible light images

Similar Documents

Publication Publication Date Title
Theis et al. Faster gaze prediction with dense networks and fisher pruning
CN108304798B (en) Street level order event video detection method based on deep learning and motion consistency
CN109376572B (en) Real-time vehicle detection and trajectory tracking method in traffic video based on deep learning
CN109978807B (en) Shadow removing method based on generating type countermeasure network
CN106446896B (en) Character segmentation method and device and electronic equipment
CN112150821B (en) Lightweight vehicle detection model construction method, system and device
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN109993101B (en) Vehicle detection method based on multi-branch circulation self-attention network and circulation frame regression
CN110991311A (en) Target detection method based on dense connection deep network
CN111027493A (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN111539343B (en) Black smoke vehicle detection method based on convolution attention network
CN110826558B (en) Image classification method, computer device, and storage medium
CN111027475A (en) Real-time traffic signal lamp identification method based on vision
CN109242826B (en) 2021-06-01 Method and system for counting rod-shaped object roots on a mobile device based on object detection
CN111986125A (en) Method for multi-target task instance segmentation
CN111738055A (en) Multi-class text detection system and bill form detection method based on same
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
CN111898432A (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN114359245A (en) Method for detecting surface defects of products in industrial scene
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN110889360A (en) Crowd counting method and system based on switching convolutional network
CN111986126A (en) Multi-target detection method based on improved VGG16 network
CN115565043A (en) Method for detecting target by combining multiple characteristic features and target prediction method
CN114882423A (en) Truck warehousing goods identification method based on improved Yolov5m model and Deepsort
CN111931572B (en) Target detection method for remote sensing image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210223