CN111915613A - Image instance segmentation method, device, equipment and storage medium - Google Patents


Publication number: CN111915613A (application CN202010798878.XA; granted publication CN111915613B)
Authority: CN (China)
Prior art keywords: feature map, feature, segmentation, image, level
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202010798878.XA
Other languages: Chinese (zh)
Other versions: CN111915613B (en)
Inventors: 杜吉祥, 程海峰, 张洪博, 翟传敏
Current and original assignee: Huaqiao University
Application filed by Huaqiao University; priority to CN202010798878.XA; published as CN111915613A, granted as CN111915613B.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06T 5/70
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Abstract

The invention provides an image instance segmentation method, device, equipment and storage medium. The method comprises the following steps: extracting features from the image to obtain a feature map; extracting multi-scale information from the feature map through a self-attention module to obtain feature maps of different scales, and performing dot multiplication with a feature map whose dimensionality has been adjusted by a 1 × 1 convolution, thereby introducing pixel-level attention to the feature map; compressing the spatial dimension of the feature map into real numbers and using them to weight the low-level feature map; fusing the weighted feature map with the high-level feature map to obtain a new feature map; classifying the fused feature maps through target detection and semantic segmentation to obtain target detection and semantic segmentation feature maps; and performing image segmentation based on the obtained target detection and semantic segmentation feature maps to obtain a segmentation result. Through the self-attention module and the path enhancement module, the invention provides better semantic guidance for subsequent feature fusion and reduces noise in the training and learning of the subsequent network.

Description

Image instance segmentation method, device, equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to an image instance segmentation method, device, equipment and storage medium.
Background
The image instance segmentation problem is an extremely important sub-problem in the field of computer vision. Its main purpose is to accurately extract objects of interest from an image, avoiding background interference and facilitating subsequent processing. Instance segmentation itself comprises two branch problems: target detection and semantic segmentation. Improvement in either branch is extremely important for the final result, so the improvements described herein start from these two aspects.
Early image segmentation did not distinguish between instance segmentation and semantic segmentation; its aim was to separate foreground from background. Early image segmentation techniques developed mainly in the following two directions:
Image segmentation algorithms based on thresholds
Threshold-based segmentation is one of the most commonly used image segmentation approaches. Its main idea is to compare the gray value of each pixel with a threshold and classify accordingly, yielding a final binary image that separates the target from the background. The emphasis of thresholding methods is therefore on finding the optimal gray-level threshold. The classical algorithm is Otsu's method (OTSU), which finds the optimal threshold by maximizing the variance between classes. Besides this, there are various algorithms such as the averaging method, the bimodal histogram method, and the iterative threshold method.
Thresholding methods are fast to compute and simple to implement, and they work well when the difference between target and background is large or the contrast is obvious. However, when the image content is complex or noise is severe, thresholding cannot achieve a satisfactory segmentation result.
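Otsu's method mentioned above can be sketched in a few lines. The toy image, variable names and threshold search below are illustrative only and are not part of the patent:

```python
import numpy as np

def otsu_threshold(gray):
    """Find the gray level that maximizes between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()     # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # background mean
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1   # target mean
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# A toy image: dark background (values near 30) and bright target (near 200).
img = np.array([[30, 32, 200, 202],
                [31, 29, 198, 201],
                [28, 33, 199, 203],
                [30, 31, 200, 198]], dtype=np.uint8)
t = otsu_threshold(img)
binary = (img >= t).astype(np.uint8)   # 1 = target, 0 = background
```

With well-separated intensity clusters the returned threshold falls between them, which is exactly the regime where, as the text notes, thresholding works well.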
Image segmentation algorithms based on edge information
Edges carry more information than raw pixel values: along an edge, features such as intensity and gradient change most severely, so segmentation based on edge information is one of the most natural and longest-studied approaches.
In general, pixels within a target are uniform or change gently, while the change between different objects is extremely sharp, so edge information is typically high-frequency information. One can use this property to compute all candidate boundaries in an image and segment according to them. Commonly used edge detection algorithms include Sobel and Canny. However, when the target's own pixel values vary greatly or contain many fine textures, a single edge detection algorithm can hardly recover the complete boundary of the target. Similarly, when the image suffers from severe noise, edge detection fails, because noise is unrelated to the image content yet is also high-frequency information, and it strongly affects detection.
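As an illustration of the edge-based approach (not part of the patent), a minimal Sobel gradient-magnitude sketch on a synthetic step edge; the unpadded convolution and test image are assumptions for the example:

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude via Sobel kernels (no padding: output is (H-2)x(W-2))."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal gradient
    ky = kx.T                                                          # vertical gradient
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)

# A vertical step edge: strong response only near the boundary column.
img = np.zeros((5, 6))
img[:, 3:] = 100.0
mag = sobel_magnitude(img)
```

The response is zero inside the uniform regions and large only at the intensity jump, matching the observation above that edges are high-frequency information; adding noise to `img` would raise responses everywhere, which is why noisy images defeat simple edge detectors.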
Early image segmentation algorithms focused mainly on surface information of images, such as pixels, edges and gradients. With the development of deep learning, neural networks can extract higher-level semantic information from images, which greatly expands the amount of information available to algorithms and advances image segmentation; segmentation then progressed in two directions, semantic segmentation and instance segmentation.
Semantic segmentation aims to classify every pixel in an image so as to distinguish objects of different classes in it. Current semantic segmentation algorithms have developed very rapidly, achieving end-to-end training and prediction with continually improving accuracy. The development of semantic segmentation with neural networks began with the FCN paper, which uses a conventional neural network as the encoding part to obtain a low-resolution but information-rich feature map, then decodes this feature map with transposed convolutions, and finally outputs a segmentation map with the same resolution as the input image. This was the first realization of end-to-end training, and it set the direction for semantic segmentation. Subsequent semantic segmentation papers largely kept this overall idea: the first half is an encoder, in which a convolutional neural network encodes the image information, and the second half is a decoder, which decodes the encoded feature map into the final result map. Later improvements also focused mainly on these two parts. On the encoder side, stronger networks such as GoogLeNet, ResNet and SENet are used; on the decoder side, better methods such as U-Net add skip connections and reuse intermediate features. Other methods improve processing details, such as the dilated (atrous) convolution proposed by DeepLab, which enlarges the receptive field and enriches the information in the feature map.
In the field of instance segmentation, the basic idea is target detection plus semantic segmentation. The early SDS method generated proposal regions using MCG and then segmented them in a bottom-up manner, but this is computationally expensive and generally inefficient. HyperColumns then introduced the hypercolumn concept into the classifier on the basis of SDS to refine the ROI. The subsequent CFM began to use masks to assist segmentation. Through continuous iterative development, Mask-RCNN by Kaiming He et al. advanced instance segmentation by a large step: it uses Faster-RCNN, the most accurate detector at the time, as the target detection part, and adds a semantic segmentation branch on this basis to segment each detected target region. Accuracy is thus ensured in both the target detection stage and the semantic segmentation stage. Later developments mainly address the remaining problems of Mask-RCNN, such as the fully connected information added by PANet to assist the segmentation branch, or the mechanism added by MS-RCNN to score the segmentation results.
Instance segmentation is used in a wide range of real-world scenarios, and because it interacts with users intuitively, higher-precision instance segmentation is demanded. Among current methods, Mask-RCNN is the closest to the present invention: in its semantic segmentation stage, the model uses four fully convolutional layers followed by a final deconvolution to produce the mask map. The present method differs in that a custom feature fusion module is used in the semantic segmentation stage, which exploits feature map information more efficiently, so the final mask map is richer in information and more accurate.
Disclosure of Invention
The invention aims to provide an image instance segmentation method, device, equipment and storage medium for realizing higher-precision instance segmentation.
In order to achieve the above object, an embodiment of the present invention provides an image instance segmentation method, including the following steps:
extracting features from the image input into the convolutional neural network to obtain five feature maps of different resolutions;
extracting multi-scale information of the feature map through a self-attention module to obtain feature maps of different scales;
based on the obtained feature maps of different scales, performing dot multiplication with the feature map whose dimensionality has been adjusted by a 1 × 1 convolution, thereby introducing pixel-level attention to the feature map;
compressing the spatial dimension of the feature map into real numbers, and weighting the low-level feature map with these real numbers;
fusing the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level;
classifying the fused feature maps through target detection and semantic segmentation to obtain target detection and semantic segmentation feature maps;
and carrying out image segmentation based on the obtained target detection and the semantic segmentation characteristic graph to obtain a segmentation result.
Furthermore, the multi-scale feature map is obtained through a U-shaped structure; the U-shaped structure fuses features of different pyramid scales, and the pyramid uses different convolution kernels to extract features and obtain the feature maps.
Further, the obtained feature maps of different scales are dot-multiplied with the feature map whose dimensionality has been adjusted by a 1 × 1 convolution, and the result is processed by global pooling to obtain the feature map with the strongest global information.
Further, the weighting process is as follows:
z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)

\tilde{u}_c(i, j) = z_c \cdot u_c(i, j)

wherein z_c is the real-number output for channel c, H and W are the height and width of the feature map, i and j index spatial positions, and u_c is channel c of the input feature map. The operation converts an H × W × C input feature map into a 1 × 1 × C output, each element of which represents the global information of the corresponding channel over the spatial dimensions.
The present invention also provides an image instance segmentation apparatus, comprising:
the first extraction module is used for extracting features from the image input into the convolutional neural network and acquiring five feature maps of different resolutions;
the second extraction module is used for extracting multi-scale information of the feature map through the self-attention module to obtain feature maps with different scales;
the dot multiplication module is used for performing dot multiplication between the obtained feature maps of different scales and the feature map whose dimensionality has been adjusted by a 1 × 1 convolution, so as to introduce pixel-level attention to the feature map;
the weighting module is used for compressing the spatial dimension of the feature map into real numbers and weighting the low-level feature map with these real numbers;
the fusion module is used for fusing the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level;
the classification module is used for classifying the fused feature map through target detection and semantic segmentation to obtain a target detection and semantic segmentation feature map;
and the segmentation module is used for carrying out image segmentation on the acquired target detection and semantic segmentation characteristic graph to obtain a segmentation result.
Further, the multi-scale feature map is obtained through a U-shaped structure; the U-shaped structure fuses features of different pyramid scales, and the pyramid uses different convolution kernels to extract features and obtain the feature maps.
Further, the obtained feature maps of different scales are dot-multiplied with the feature map whose dimensionality has been adjusted by a 1 × 1 convolution, and the result is processed by global pooling to obtain the feature map with the strongest global information.
Further, the weighting process is as follows:
z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)

\tilde{u}_c(i, j) = z_c \cdot u_c(i, j)

wherein z_c is the real-number output for channel c, H and W are the height and width of the feature map, i and j index spatial positions, and u_c is channel c of the input feature map. The operation converts an H × W × C input feature map into a 1 × 1 × C output, each element of which represents the global information of the corresponding channel over the spatial dimensions.
The invention also provides an image example segmentation device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor is used for operating the computer program to realize the image example segmentation method.
The invention also provides a storage medium for storing at least one program and at least one instruction, wherein the at least one program and the instruction are executed to realize the image instance segmentation method.
The invention has the beneficial technical effects that:
the invention provides a self-attention module and a path enhancement structure in the characteristic fusion stage, plays a great role in improving the final result of the network, and can realize higher-precision example segmentation. The information level of the feature map of the highest layer in the backbone is improved through the self-attention module, multi-scale information and pixel level attention are introduced, and more excellent semantic guidance is provided for subsequent feature fusion. The path enhancement structure introduces attention of space dimensionality, global information obtained by the high-level feature map is used as weight to weight the low-level feature map, so that the network is emphasized in use of information, and certain noise influence is reduced for training and learning of a subsequent network.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a flowchart illustrating an image example segmentation method according to a first embodiment of the present invention.
Fig. 2 is a flowchart illustrating an image example segmentation method according to a first embodiment of the present invention.
Fig. 3 is a schematic diagram of a self-attention module according to a first embodiment of the invention.
Fig. 4 is a schematic diagram of a path enhancing structure according to a first embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For the understanding of the present invention, the original Mask-RCNN is described first, in which a classic Feature Pyramid Network (FPN) merges the five feature maps of different resolutions extracted from the backbone. This fusion scheme has been widely applied since it was proposed and greatly improves network performance. It fuses high-level semantic information with low-level spatial information, balancing feature maps whose information is originally unbalanced. In ResNet, the fifth-layer feature map has the richest semantics and the strongest representation, but its spatial information is almost lost because its resolution is too small. By fusing it with lower-level feature maps, the feature pyramid compensates for this lack of spatial information to a certain extent, while its rich semantics provide semantic guidance for the low-level feature maps and reduce the noise in them.
Therefore, the function of the Feature Pyramid Network (FPN) is to guide low-level spatial information with high-level semantic information, reducing noise, improving the information quality of the feature maps and balancing the information differences between them. The key role is played by the feature map at the highest level of the network, whose semantic information is strongest but which, due to the design of the backbone, lacks multi-scale information. In the current structure, the pyramid can extract feature information at different scales, but the structure is flat and lacks global context information.
The early Feature Pyramid (FPN) fuses the feature maps so that their information is more balanced and noise is suppressed. However, the feature pyramid only fuses feature map information; it neither selects information nor applies any emphasis. As a result, noise and unimportant information are retained in the feature maps, which amounts to introducing noise interference into the network's later processing. If, while fusing the feature maps, their information could also be screened and weighted, suppressing unimportant and less important information and enhancing the weight of useful information, the effect of feature fusion would be more prominent, and subsequent processing in the network would also avoid noise interference.
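The pyramid-style merging discussed above can be illustrated with a minimal sketch. Nearest-neighbour upsampling and element-wise addition stand in for FPN's learned lateral 1 × 1 convolutions, and the shapes and channel count are arbitrary; this is an assumption-laden illustration, not the patented structure:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a C x H x W feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_topdown(features):
    """Merge a pyramid of feature maps: each lower level receives the
    upsampled merged map of the level above it.
    `features` is ordered low level (high resolution) to high level (low resolution)."""
    merged = [features[-1]]                       # start from the semantically richest map
    for f in reversed(features[:-1]):
        merged.append(f + upsample2x(merged[-1])) # inject high-level information downward
    return merged[::-1]                           # back to low-to-high order

c = 4
feats = [np.ones((c, 16, 16)), np.ones((c, 8, 8)), np.ones((c, 4, 4))]
out = fpn_topdown(feats)
```

Note that the addition treats every pixel and channel equally, which is exactly the "no selection, no emphasis" limitation the text identifies and the invention's weighting addresses.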
The following detailed description describes specific embodiments of the present invention.
As shown in fig. 1-2, a first embodiment of the present invention provides an image instance segmentation method, including the steps of:
s11, extracting features from the image through a convolutional neural network to obtain five feature maps of different resolutions;
s12, extracting multi-scale information of the feature map through a self-attention module to obtain feature maps with different scales;
in the present embodiment, as shown in fig. 3, the self-attention module can fuse three features with different pyramid scales through a U-shaped structure (e.g., a feature pyramid). To better extract the multi-scale context information of an image through a pyramid structure, the pyramid structure can use three different convolution kernels for feature extraction, 3 × 3, 5 × 5, and 7 × 7 respectively. The scale information of three different targets in large, medium and small sizes can be extracted from the original feature map through the three feature maps with different scales, the information level of the feature map is enriched, and meanwhile, the structure can bring pixel-level attention to the feature map. The self-attention module can introduce finer pixel-level attention information for the network while fusing multi-scale information, which can provide more accurate semantic guidance for the low-level feature map in subsequent feature fusion. It should be understood that the U-shaped structure may be of other types, and these embodiments are all within the scope of the present invention.
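The multi-scale branch plus pixel-level attention of this embodiment might be sketched as follows. Box filters stand in for the learned 3 × 3, 5 × 5 and 7 × 7 convolutions, addition stands in for the U-shaped fusion, and a sigmoid squashes the fused map into per-pixel weights; all of these substitutions are assumptions for illustration, not the patented module:

```python
import numpy as np

def box_conv(x, k):
    """Same-size k x k box filter per channel (stand-in for a learned k x k conv)."""
    pad = k // 2
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].mean(axis=(1, 2))
    return out

def pyramid_attention(feat):
    """Fuse three scale branches, squash to (0, 1), and apply the result
    as pixel-level attention on the input feature map."""
    multi = box_conv(feat, 3) + box_conv(feat, 5) + box_conv(feat, 7)
    attn = 1.0 / (1.0 + np.exp(-multi))   # sigmoid -> per-pixel attention weights
    return feat * attn                     # element-wise (dot) multiplication

feat = np.random.default_rng(0).standard_normal((2, 8, 8))
out = pyramid_attention(feat)
```

Because every attention weight lies in (0, 1), the output never exceeds the input in magnitude: the map re-weights pixels rather than adding new information, which is the sense in which the structure "brings pixel-level attention to the feature map".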
S13, based on the obtained feature maps of different scales, multiplying the feature maps by the feature map point obtained by adjusting the dimensionality through 1 × 1 convolution, thereby introducing pixel-level attention to the feature maps;
In this embodiment, the feature map from the last layer of ResNet-101 is fed into the module and split into two paths. One path extracts multi-scale information through a pyramid structure; the labels 4 × 4, 8 × 8 and 16 × 16 after each pyramid level denote the resolution of the feature map after processing at that level. The other path adjusts the dimensionality with a 1 × 1 convolution. Because the resolution of the input feature map is small, using large convolution kernels here does not introduce an excessive computational load to the network. The pyramid structure extracts information at different scales step by step and combines context information of different scales in a cascaded manner. After the two paths are processed, the two feature maps are dot-multiplied, and this step introduces pixel-level attention into the original feature map.
Preferably, the two feature maps are subjected to point multiplication and then subjected to global pooling, so that the feature maps contain the strongest global information, and the branch can provide the global information for the module, thereby further improving the performance of the module.
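The dimensionality-adjusting 1 × 1 convolution on the second path is simply a per-pixel linear map across channels, as the following sketch shows; the channel counts (256 in, 64 out) and random weights are illustrative assumptions, not values from the patent:

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution = the same linear map applied at every spatial position.
    x: (C_in, H, W); weight: (C_out, C_in)."""
    c_in, h, w = x.shape
    return (weight @ x.reshape(c_in, -1)).reshape(weight.shape[0], h, w)

rng = np.random.default_rng(1)
x = rng.standard_normal((256, 4, 4))        # hypothetical top-level feature map
w = rng.standard_normal((64, 256)) * 0.01   # adjust 256 channels down to 64
y = conv1x1(x, w)
```

Because the kernel covers a single pixel, the cost is one small matrix multiply per position regardless of image content, which is why adjusting dimensionality this way adds little computation even on larger maps.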
S14, compressing the spatial dimension of the feature map into real numbers and weighting the low-level feature map with these real numbers;
In this embodiment, as shown in fig. 4, in the path enhancement structure the real numbers obtained by compressing the feature map have a global receptive field, and the output dimension matches the channel dimension of the input feature map. These values characterize the global distribution along the feature dimension, allowing the global receptive field to be passed to the next-level feature map; such global information is important for many tasks.
z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)

wherein z_c is the real-number output for channel c, H and W are the height and width of the feature map, i and j index spatial positions, and u_c is channel c of the input feature map. The operation converts an H × W × C input feature map into a 1 × 1 × C output, each element of which represents the global information of the corresponding channel over the spatial dimensions.
The high-level feature map is converted into corresponding global information through global pooling; these real numbers are equivalent to weight information extracted by the network. The low-level feature map is then dot-multiplied with them, weighting the low-level feature map. The calculation process is as follows:
\tilde{u}_c(i, j) = z_c \cdot u_c(i, j)
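The global-pooling squeeze and the channel-wise weighting described above can be sketched directly; the channel count and the constant values in the toy maps are illustrative assumptions:

```python
import numpy as np

def squeeze(u):
    """Global average pooling: for u of shape (C, H, W), produce one real
    number per channel, z_c = (1/(H*W)) * sum_{i,j} u_c(i, j)."""
    return u.mean(axis=(1, 2))                    # -> shape (C,)

def weight_low_level(low, high):
    """Weight the low-level map with global information squeezed from the high level."""
    z = squeeze(high)                             # (C,) global weights
    return low * z[:, None, None]                 # broadcast channel-wise multiplication

high = np.full((3, 2, 2), 2.0)
high[1] = 0.5                                     # channel 1 carries weaker global response
low = np.ones((3, 4, 4))
out = weight_low_level(low, high)
```

Channels whose high-level global response is strong are amplified in the low-level map and weak ones are suppressed, which is the "attention along the spatial dimension" that the path enhancement structure contributes.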
s15, fusing the weighted feature graph with the high-level feature graph to obtain a new feature graph corresponding to the low level;
In this embodiment, the above algorithm adds an adjustment along the spatial dimension: the high-level feature map with rich semantic information is used to weight the low-level feature map, providing guidance information for it with fairly simple processing. This fills the gap left by the original feature pyramid in the spatial dimension and improves the overall quality of the feature fusion process.
S16, classifying the fused feature maps through target detection and semantic segmentation to obtain target detection and semantic segmentation feature maps;
and S17, carrying out image segmentation based on the acquired target detection and the semantic segmentation characteristic graph to obtain a segmentation result.
A second embodiment of the present invention provides an image instance segmentation apparatus, as shown in the figure, comprising:
a first extraction module 110, for extracting features from the image input into the convolutional neural network and acquiring five feature maps of different resolutions;
a second extraction module 120, which is used for extracting the multi-scale information of the feature map through the self-attention module to obtain feature maps with different scales;
in this embodiment, as shown in fig. 3, the self-attention module can fuse three features with different pyramid scales through a U-shaped structure (e.g., a feature pyramid). To better extract the multi-scale context information of an image through a pyramid structure, the pyramid structure can use three different convolution kernels for feature extraction, 3 × 3, 5 × 5, and 7 × 7 respectively. The scale information of three different targets in large, medium and small sizes can be extracted from the original feature map through the three feature maps with different scales, the information level of the feature map is enriched, and meanwhile, the structure can bring pixel-level attention to the feature map. The self-attention module can introduce finer pixel-level attention information for the network while fusing multi-scale information, which can provide more accurate semantic guidance for the low-level feature map in subsequent feature fusion. It should be understood that the U-shaped structure may be of other types, and these embodiments are all within the scope of the present invention.
a dot multiplication module 130, for performing dot multiplication between the obtained feature maps of different scales and the feature map whose dimensionality has been adjusted by a 1 × 1 convolution, thereby introducing pixel-level attention to the feature map;
In this embodiment, the feature map from the last layer of ResNet-101 is fed into the module and split into two paths. One path extracts multi-scale information through a pyramid structure; the labels 4 × 4, 8 × 8 and 16 × 16 after each pyramid level denote the resolution of the feature map after processing at that level. The other path adjusts the dimensionality with a 1 × 1 convolution. Because the resolution of the input feature map is small, using large convolution kernels here does not introduce an excessive computational load to the network. The pyramid structure extracts information at different scales step by step and combines context information of different scales in a cascaded manner. After the two paths are processed, the two feature maps are dot-multiplied, and this step introduces pixel-level attention into the original feature map.
Preferably, the two feature maps are subjected to point multiplication and then subjected to global pooling, so that the feature maps contain the strongest global information, and the branch can provide the global information for the module, thereby further improving the performance of the module.
a weighting module 140, for compressing the spatial dimension of the feature map into real numbers and weighting the low-level feature map with these real numbers;
In this embodiment, as shown in fig. 4, the real numbers obtained by compressing the feature map have a global receptive field, and the output dimension matches the channel dimension of the input feature map. These values characterize the global distribution along the feature dimension, allowing the global receptive field to be passed to the next-level feature map; such global information is important for many tasks.
z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)

wherein z_c is the real-number output for channel c, H and W are the height and width of the feature map, i and j index spatial positions, and u_c is channel c of the input feature map. The operation converts an H × W × C input feature map into a 1 × 1 × C output, each element of which represents the global information of the corresponding channel over the spatial dimensions.
The high-level feature map is thus converted into corresponding global information through global pooling; these real numbers are equivalent to weight information extracted by the network. The low-level feature map is then point-multiplied with this result, thereby weighting the low-level feature map. The calculation process is as follows:
\tilde{x}_c = F_{scale}(x_c, z_c) = z_c \cdot x_c

wherein x_c denotes the c-th channel of the low-level feature map.
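The squeeze and weighting steps can be sketched directly from the two formulas above. This is an illustrative NumPy sketch under the assumption that the high-level and low-level maps share the same channel count; the function names are hypothetical, not from the patent.

```python
import numpy as np

def squeeze(u):
    """Global average pooling: (H, W, C) -> (C,) real-number array z_c,
    i.e. z_c = (1 / (H * W)) * sum_{i,j} u_c(i, j)."""
    H, W, C = u.shape
    return u.sum(axis=(0, 1)) / (H * W)

def weight_low_level(low, high):
    """Weight the low-level map with global info from the high-level map.

    low  : (H, W, C) low-level feature map (x_c per channel).
    high : (h, w, C) high-level feature map (same channel count assumed).
    """
    z = squeeze(high)   # one real number per channel, global receptive field
    return low * z      # channel-wise point multiplication: z_c * x_c

low = np.random.rand(8, 8, 4)
high = np.random.rand(4, 4, 4)
weighted = weight_low_level(low, high)
print(weighted.shape)  # (8, 8, 4)
```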
150: the fusion module is used for fusing the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level;
In this embodiment, the algorithm adds an adjustment in the spatial dimension: the low-level feature map is weighted by the high-level feature map, which is rich in semantic information, so that comparatively simple processing provides guidance information for the low-level feature map. This fills the gap left by the original feature pyramid in the spatial dimension and improves the overall quality of the feature-fusion processing.
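The fusion step can be sketched as follows. This is an illustrative NumPy sketch: nearest-neighbour upsampling and element-wise addition are assumptions standing in for whatever interpolation and fusion operators the pipeline actually uses.

```python
import numpy as np

def fuse(weighted_low, high):
    """Fuse the weighted low-level map with the upsampled high-level map.

    weighted_low : (H, W, C) low-level map after channel-wise weighting.
    high         : (h, w, C) high-level map, with H and W integer
                   multiples of h and w (assumed here).
    """
    scale = weighted_low.shape[0] // high.shape[0]
    # Nearest-neighbour upsampling of the high-level map to (H, W, C).
    up = high.repeat(scale, axis=0).repeat(scale, axis=1)
    # Element-wise addition produces the new low-level feature map.
    return weighted_low + up

low = np.random.rand(8, 8, 4)
high = np.random.rand(4, 4, 4)
fused = fuse(low, high)
print(fused.shape)  # (8, 8, 4)
```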
160: the classification module is used for classifying the fused feature map through target detection and semantic segmentation to obtain target detection and semantic segmentation feature maps;
170: the segmentation module is used for performing image segmentation on the obtained target detection and semantic segmentation feature maps to obtain a segmentation result.
A third embodiment of the present invention provides an image instance segmentation apparatus, which includes a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to implement the image instance segmentation method.
The fourth embodiment of the present invention further provides a storage medium, where the storage medium stores a computer program, and the computer program can be executed by a processor of a device in which the storage medium is located, so as to implement the image instance segmentation method.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus and method embodiments described above are illustrative only. The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks therein, can be implemented by special-purpose hardware-based systems which perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. An image instance segmentation method is characterized by comprising the following steps:
extracting features from an image input into the convolutional neural network to obtain five feature maps with different resolutions;
extracting multi-scale information of the feature map through a self-attention module to obtain feature maps of different scales;
point-multiplying the obtained feature maps of different scales with the feature map whose dimensionality is adjusted by a 1 × 1 convolution, thereby introducing pixel-level attention to the feature map;
compressing the spatial dimension of the feature map, converting the feature map into real numbers, and weighting the low-level feature map based on the dimension and the real numbers;
fusing the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level;
classifying the fused feature maps through target detection and semantic segmentation to obtain target detection and semantic segmentation feature maps;
and performing image segmentation based on the obtained target detection and semantic segmentation feature maps to obtain a segmentation result.
2. The image instance segmentation method according to claim 1, wherein the multi-scale feature map is obtained by a U-shaped structure, the U-shaped structure fuses features of a plurality of different pyramid scales, and the pyramid extracts features using a plurality of different convolution kernels to obtain a feature map.
3. The image instance segmentation method according to claim 1, wherein the obtained feature maps of different scales are point-multiplied with the feature map whose dimensionality is adjusted by a 1 × 1 convolution, and the point-multiplied feature map is processed through global pooling to obtain the feature map with the strongest global information.
4. An image instance segmentation method as claimed in claim 1, wherein the weighting procedure is:
z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)

\tilde{x}_c = F_{scale}(x_c, z_c) = z_c \cdot x_c
wherein z_c represents the output real-number array, H and W are the height and width of the feature map, i and j denote coordinates on the feature map, u_c is the c-th channel of the feature map, and x_c denotes the c-th channel of the low-level feature map; the H × W × C input feature map is converted into a 1 × 1 × C output, and the result represents the global information of each layer of the feature map in the spatial dimension.
5. An image instance segmentation apparatus, comprising:
the first extraction module is used for extracting features from an image input into the convolutional neural network to obtain five feature maps with different resolutions;
the second extraction module is used for extracting multi-scale information of the feature map through the self-attention module to obtain feature maps with different scales;
the dot multiplication module is used for point-multiplying the obtained feature maps of different scales with the feature map whose dimensionality is adjusted by a 1 × 1 convolution, so as to introduce pixel-level attention to the feature map;
the weighting module is used for compressing the spatial dimension of the feature map, converting the feature map into real numbers, and weighting the low-level feature map based on the dimension and the real numbers;
the fusion module is used for fusing the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level;
the classification module is used for classifying the fused feature map through target detection and semantic segmentation to obtain a target detection and semantic segmentation feature map;
and the segmentation module is used for performing image segmentation on the obtained target detection and semantic segmentation feature maps to obtain a segmentation result.
6. The image instance segmentation device according to claim 5, wherein the multi-scale feature map is obtained by a U-shaped structure, the U-shaped structure fuses features of different pyramid scales, and the pyramid uses different convolution kernels to extract features to obtain a feature map.
7. The image instance segmentation device according to claim 5, wherein the obtained feature maps of different scales are point-multiplied with the feature map whose dimensionality is adjusted by a 1 × 1 convolution, and the point-multiplied feature map is processed through global pooling to obtain the feature map with the strongest global information.
8. An image instance segmentation apparatus as claimed in claim 5, wherein the weighting procedure is:
z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)

\tilde{x}_c = F_{scale}(x_c, z_c) = z_c \cdot x_c
wherein z_c represents the output real-number array, H and W are the height and width of the feature map, i and j denote coordinates on the feature map, u_c is the c-th channel of the feature map, and x_c denotes the c-th channel of the low-level feature map; the H × W × C input feature map is converted into a 1 × 1 × C output, and the result represents the global information of each layer of the feature map in the spatial dimension.
9. An image instance segmentation apparatus comprising a memory in which a computer program is stored and a processor for running the computer program to implement an image instance segmentation method according to any one of claims 1 to 4.
10. A storage medium, characterized in that the storage medium stores a computer program executable by a processor of a device on which the storage medium is located to implement an image instance segmentation method according to any one of claims 1 to 4.
CN202010798878.XA 2020-08-11 2020-08-11 Image instance segmentation method, device, equipment and storage medium Active CN111915613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010798878.XA CN111915613B (en) 2020-08-11 2020-08-11 Image instance segmentation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010798878.XA CN111915613B (en) 2020-08-11 2020-08-11 Image instance segmentation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111915613A true CN111915613A (en) 2020-11-10
CN111915613B CN111915613B (en) 2023-06-13

Family

ID=73284821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010798878.XA Active CN111915613B (en) 2020-08-11 2020-08-11 Image instance segmentation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111915613B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446308A (en) * 2020-11-16 2021-03-05 北京科技大学 Semantic enhancement-based pedestrian detection method based on multi-scale feature pyramid fusion
CN112700462A (en) * 2020-12-31 2021-04-23 北京迈格威科技有限公司 Image segmentation method and device, electronic equipment and storage medium
CN112819032A (en) * 2021-01-11 2021-05-18 平安科技(深圳)有限公司 Multi-model-based slice feature classification method, device, equipment and medium
CN113159061A (en) * 2021-03-10 2021-07-23 哈尔滨工业大学 Actual tunnel surrounding rock fragment identification method based on example segmentation
CN113344935A (en) * 2021-06-30 2021-09-03 山东建筑大学 Image segmentation method and system based on multi-scale difficulty perception
CN113421267A (en) * 2021-05-07 2021-09-21 江苏大学 Point cloud semantic and instance joint segmentation method and system based on improved PointConv
CN113469942A (en) * 2021-06-01 2021-10-01 天津大学 CT image lesion detection method
WO2021208726A1 (en) * 2020-11-23 2021-10-21 平安科技(深圳)有限公司 Target detection method and apparatus based on attention mechanism, and computer device
CN113591859A (en) * 2021-06-23 2021-11-02 北京旷视科技有限公司 Image segmentation method, apparatus, device and medium
CN113610085A (en) * 2021-10-10 2021-11-05 成都千嘉科技有限公司 Character wheel image identification method based on attention mechanism
CN114511046A (en) * 2022-04-19 2022-05-17 阿里巴巴(中国)有限公司 Object recognition method and device
CN114913325A (en) * 2022-03-24 2022-08-16 北京百度网讯科技有限公司 Semantic segmentation method, device and computer program product
CN114972376A (en) * 2022-05-16 2022-08-30 北京医准智能科技有限公司 Coronary calcified plaque segmentation method, segmentation model training method and related device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866855A (en) * 2015-05-07 2015-08-26 华为技术有限公司 Image feature extraction method and apparatus
CN106157319A (en) * 2016-07-28 2016-11-23 哈尔滨工业大学 The significance detection method that region based on convolutional neural networks and Pixel-level merge
CN107330873A (en) * 2017-05-05 2017-11-07 浙江大学 Objective evaluation method for quality of stereo images based on multiple dimensioned binocular fusion and local shape factor
CN109525290A (en) * 2019-01-11 2019-03-26 电子科技大学 Real number feedback iteration channel estimation methods based on MIMO-FBMC system
CN110084210A (en) * 2019-04-30 2019-08-02 电子科技大学 The multiple dimensioned Ship Detection of SAR image based on attention pyramid network
CN110096994A (en) * 2019-04-28 2019-08-06 西安电子科技大学 A kind of small sample PolSAR image classification method based on fuzzy label semanteme priori
CN111242071A (en) * 2020-01-17 2020-06-05 陕西师范大学 Attention remote sensing image target detection method based on anchor frame
CN111292330A (en) * 2020-02-07 2020-06-16 北京工业大学 Image semantic segmentation method and device based on coder and decoder

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866855A (en) * 2015-05-07 2015-08-26 华为技术有限公司 Image feature extraction method and apparatus
CN106157319A (en) * 2016-07-28 2016-11-23 哈尔滨工业大学 The significance detection method that region based on convolutional neural networks and Pixel-level merge
CN107330873A (en) * 2017-05-05 2017-11-07 浙江大学 Objective evaluation method for quality of stereo images based on multiple dimensioned binocular fusion and local shape factor
CN109525290A (en) * 2019-01-11 2019-03-26 电子科技大学 Real number feedback iteration channel estimation methods based on MIMO-FBMC system
CN110096994A (en) * 2019-04-28 2019-08-06 西安电子科技大学 A kind of small sample PolSAR image classification method based on fuzzy label semanteme priori
CN110084210A (en) * 2019-04-30 2019-08-02 电子科技大学 The multiple dimensioned Ship Detection of SAR image based on attention pyramid network
CN111242071A (en) * 2020-01-17 2020-06-05 陕西师范大学 Attention remote sensing image target detection method based on anchor frame
CN111292330A (en) * 2020-02-07 2020-06-16 北京工业大学 Image semantic segmentation method and device based on coder and decoder

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TING ZHAO, XIANGQIAN WU: "Pyramid Feature Attention Network for Saliency detection", CVPR, pages 1 - 10 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446308A (en) * 2020-11-16 2021-03-05 北京科技大学 Semantic enhancement-based pedestrian detection method based on multi-scale feature pyramid fusion
WO2021208726A1 (en) * 2020-11-23 2021-10-21 平安科技(深圳)有限公司 Target detection method and apparatus based on attention mechanism, and computer device
CN112700462A (en) * 2020-12-31 2021-04-23 北京迈格威科技有限公司 Image segmentation method and device, electronic equipment and storage medium
CN112819032B (en) * 2021-01-11 2023-10-27 平安科技(深圳)有限公司 Multi-model-based slice feature classification method, device, equipment and medium
CN112819032A (en) * 2021-01-11 2021-05-18 平安科技(深圳)有限公司 Multi-model-based slice feature classification method, device, equipment and medium
CN113159061A (en) * 2021-03-10 2021-07-23 哈尔滨工业大学 Actual tunnel surrounding rock fragment identification method based on example segmentation
CN113421267A (en) * 2021-05-07 2021-09-21 江苏大学 Point cloud semantic and instance joint segmentation method and system based on improved PointConv
CN113421267B (en) * 2021-05-07 2024-04-12 江苏大学 Point cloud semantic and instance joint segmentation method and system based on improved PointConv
CN113469942A (en) * 2021-06-01 2021-10-01 天津大学 CT image lesion detection method
CN113591859A (en) * 2021-06-23 2021-11-02 北京旷视科技有限公司 Image segmentation method, apparatus, device and medium
CN113344935A (en) * 2021-06-30 2021-09-03 山东建筑大学 Image segmentation method and system based on multi-scale difficulty perception
CN113610085B (en) * 2021-10-10 2021-12-07 成都千嘉科技有限公司 Character wheel image identification method based on attention mechanism
CN113610085A (en) * 2021-10-10 2021-11-05 成都千嘉科技有限公司 Character wheel image identification method based on attention mechanism
CN114913325A (en) * 2022-03-24 2022-08-16 北京百度网讯科技有限公司 Semantic segmentation method, device and computer program product
CN114511046A (en) * 2022-04-19 2022-05-17 阿里巴巴(中国)有限公司 Object recognition method and device
CN114972376A (en) * 2022-05-16 2022-08-30 北京医准智能科技有限公司 Coronary calcified plaque segmentation method, segmentation model training method and related device
CN114972376B (en) * 2022-05-16 2023-08-25 北京医准智能科技有限公司 Coronary calcified plaque segmentation method, segmentation model training method and related device

Also Published As

Publication number Publication date
CN111915613B (en) 2023-06-13

Similar Documents

Publication Publication Date Title
CN111915613A (en) Image instance segmentation method, device, equipment and storage medium
CN111462126B (en) Semantic image segmentation method and system based on edge enhancement
KR20220066945A (en) Image processing method, apparatus, electronic device and computer readable storage medium
CN105184763B (en) Image processing method and device
CN112966691B (en) Multi-scale text detection method and device based on semantic segmentation and electronic equipment
CN106156777B (en) Text picture detection method and device
CN113096140B (en) Instance partitioning method and device, electronic device and storage medium
CN111444919A (en) Method for detecting text with any shape in natural scene
CN116258719B (en) Flotation foam image segmentation method and device based on multi-mode data fusion
CN112365514A (en) Semantic segmentation method based on improved PSPNet
Tang et al. R²MRF: Defocus Blur Detection via Recurrently Refining Multi-Scale Residual Features
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN111860683A (en) Target detection method based on feature fusion
CN114693924A (en) Road scene semantic segmentation method based on multi-model fusion
CN111553351A (en) Semantic segmentation based text detection method for arbitrary scene shape
CN112927209A (en) CNN-based significance detection system and method
CN111914947B (en) Image instance segmentation method, device, equipment and storage medium based on feature fusion
CN113591831A (en) Font identification method and system based on deep learning and storage medium
CN113591719A (en) Method and device for detecting text with any shape in natural scene and training method
CN112802039B (en) Panorama segmentation method based on global edge attention
CN113392854A (en) Image texture feature extraction and classification method
CN112365451A (en) Method, device and equipment for determining image quality grade and computer readable medium
CN111738069A (en) Face detection method and device, electronic equipment and storage medium
JP2011170890A (en) Face detecting method, face detection device, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant