CN111915613B - Image instance segmentation method, device, equipment and storage medium - Google Patents
Image instance segmentation method, device, equipment and storage medium Download PDFInfo
- Publication number
- CN111915613B (granted publication of application CN202010798878.XA / CN202010798878A)
- Authority
- CN
- China
- Prior art keywords
- feature
- feature map
- feature images
- images
- segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 230000011218 segmentation Effects 0.000 title claims abstract description 86
- 238000000034 method Methods 0.000 title claims abstract description 49
- 238000003860 storage Methods 0.000 title claims abstract description 13
- 238000001514 detection method Methods 0.000 claims abstract description 27
- 238000003709 image segmentation Methods 0.000 claims abstract description 15
- 230000004927 fusion Effects 0.000 claims abstract description 12
- 238000012545 processing Methods 0.000 claims description 22
- 238000004422 calculation algorithm Methods 0.000 claims description 19
- 239000000284 extract Substances 0.000 claims description 14
- 238000000605 extraction Methods 0.000 claims description 11
- 238000004590 computer program Methods 0.000 claims description 10
- 238000010586 diagram Methods 0.000 claims description 10
- 238000011176 pooling Methods 0.000 claims description 8
- 230000008569 process Effects 0.000 claims description 8
- 238000013527 convolutional neural network Methods 0.000 claims description 7
- 238000012549 training Methods 0.000 abstract description 3
- 238000011161 development Methods 0.000 description 8
- 230000018109 developmental process Effects 0.000 description 8
- 230000000694 effects Effects 0.000 description 5
- 230000006870 function Effects 0.000 description 5
- 238000013528 artificial neural network Methods 0.000 description 4
- 238000007499 fusion processing Methods 0.000 description 4
- 230000006872 improvement Effects 0.000 description 4
- 238000003708 edge detection Methods 0.000 description 3
- 230000008859 change Effects 0.000 description 2
- 238000009826 distribution Methods 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000012937 correction Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/70—Denoising; Smoothing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20016—Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
Abstract
The invention provides an image instance segmentation method, device, equipment, and storage medium, wherein the method comprises the following steps: extracting features from the image to obtain a feature map; extracting multi-scale information from the feature map through a self-attention module to obtain feature maps at different scales, then point-multiplying them with a feature map whose dimensionality has been adjusted by a 1×1 convolution, thereby introducing pixel-level attention into the feature map; compressing the spatial dimensions of the feature map into real numbers and weighting the low-level feature map with them; fusing the weighted feature map with the high-level feature map to obtain a new feature map; classifying the fused feature maps through target detection and semantic segmentation to obtain target detection and semantic segmentation feature maps; and performing image segmentation based on the obtained target detection and semantic segmentation feature maps to obtain a segmentation result. Through the self-attention module and the path enhancement module, the invention provides better semantic guidance for subsequent feature fusion and reduces noise interference in the training and learning of subsequent networks.
Description
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method, apparatus, device, and storage medium for image instance segmentation.
Background
The problem of image instance segmentation is an extremely important sub-problem in the field of computer vision. Its main purpose is to accurately extract objects of interest from an image, avoiding background interference and facilitating subsequent processing. Instance segmentation itself involves two sub-problems: target detection and semantic segmentation. Improvement at either point is of paramount importance to the final result, and thus the directions of improvement herein also come from these two places.
Early image segmentation did not distinguish between instance segmentation and semantic segmentation; its aim was simply to separate the foreground image from the background image. Early image segmentation techniques developed mainly in the following two directions:
Threshold-based image segmentation algorithms
Segmenting an image with a threshold is one of the most common image segmentation approaches. The main idea is to compare the gray value of each pixel with a threshold and then classify each pixel, obtaining a final binary image that separates the object from the background. The emphasis of thresholding methods is therefore on how to find the optimal gray threshold. The classical algorithm is Otsu's method, which finds the best threshold by maximizing the variance between the different classes. Other algorithms include the mean method, the histogram bimodal method, and the iterative threshold method.
Thresholding methods are fast to compute and simple to implement, and work very well when the difference between target and background is large or the contrast is obvious. However, when the information in the image itself is complex, or problems such as occlusion and noise are serious, thresholding cannot achieve a satisfactory segmentation result.
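To make the thresholding idea concrete, here is a minimal numpy sketch of Otsu's method — not the patent's code, just an illustration of selecting the threshold that maximizes between-class variance; the toy image is an assumed example:

```python
import numpy as np

def otsu_threshold(gray):
    """Return the gray level maximizing the between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # background mean
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1   # foreground mean
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Toy image: a dark background with a bright object in one corner
img = np.array([[10, 12, 11, 200],
                [13, 10, 210, 205],
                [11, 198, 202, 207],
                [12, 11, 206, 201]], dtype=np.uint8)
t = otsu_threshold(img)
mask = img >= t  # binary segmentation: foreground vs. background
```

With a strongly bimodal image like this, any threshold between the two modes separates the object cleanly, which is exactly the "large difference between target and background" regime described above.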
Image segmentation algorithm based on edge information
The edge itself carries more information than raw pixel values, and edge regions of an image exhibit the most drastic changes in pixel value, gradient, and other characteristics, so segmentation according to edge information is one of the most intuitive and longest-studied approaches.
In general, pixels inside an object are uniform or change moderately, while the change between different objects is extremely sharp, so edge information is in most cases high-frequency information. From this property, all possible boundary information in the image can be computed and the image segmented accordingly. Common edge detection operators include Sobel and Canny. However, when the pixel values of an object are highly variable or contain many fine textures, a single edge detection algorithm can hardly recover the complete boundary of the object. Similarly, when the image is severely affected by noise, edge detection fails: the noise is irrelevant to the image content, yet it too is high-frequency information, which heavily disturbs detection.
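A small numpy sketch of Sobel gradient-magnitude edge detection (an illustrative stand-in, not the patent's implementation); the step-edge test image is an assumed example showing how the operator responds at an object boundary:

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude via the two Sobel kernels (valid region only)."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)   # horizontal gradient
    ky = kx.T                                   # vertical gradient
    H, W = img.shape
    gx = np.zeros((H - 2, W - 2))
    gy = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)

# Vertical step edge: left half dark (0), right half bright (100)
img = np.zeros((5, 6))
img[:, 3:] = 100.0
mag = sobel_magnitude(img)  # large response only at the step columns
```

The response is zero inside the uniform regions and large at the step, matching the high-frequency interpretation of edges above; equally, any high-frequency noise would produce spurious responses of the same kind.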
Early image segmentation algorithms focused mainly on surface information such as pixels, edges, and gradients. With the development of deep learning, however, neural networks can extract higher-level semantic information from images, which undoubtedly expands the amount of information available to the algorithm and has greatly advanced image segmentation. Segmentation then began to develop in two directions: semantic segmentation and instance segmentation.
The purpose of semantic segmentation is simpler: classify all pixels in an image, thereby distinguishing objects of different classes. Semantic segmentation algorithms are now developing very rapidly, can be trained and predicted end-to-end, and have continuously refreshed accuracy records. The development of semantic segmentation on neural networks started from the FCN paper, which uses a traditional neural network as the encoding part to obtain a low-resolution but information-rich feature map, then decodes that feature map with transposed convolutions, and finally outputs a segmentation result map of the same size as the input image. This achieved end-to-end training for the first time and set the direction of development for semantic segmentation. Subsequent semantic segmentation work has not changed the main idea greatly: the first half is an encoder, mainly a convolutional neural network that encodes the image information, and the second half is a decoder that decodes the encoded feature map into the final result map. Later improvements have likewise focused on these two parts. The encoder mainly adopts better neural networks, such as GoogLeNet, ResNet, and SENet, while the decoder has better methods such as U-Net, which adds skip connections so that intermediate encoder features are reused. Other improvements concern processing details, such as the dilated (atrous) convolution proposed by DeepLab, which enlarges the receptive field and enriches the information of the feature map.
In the field of instance segmentation, the basic idea is object detection plus semantic segmentation. The early SDS generated proposal regions using MCG and then segmented the result in a Bottom-Up fashion, but this was computationally intensive and the segmentation effect was mediocre. Later work introduced the hypercolumn concept into the classifier on the basis of SDS, thereby refining the ROI. CFM then started using masks to assist segmentation. Through continuous iterative development, Kaiming He's Mask R-CNN finally advanced instance segmentation by a large step: it uses Faster R-CNN, the most accurate detector at the time, as the target detection part, and adds a semantic segmentation branch on this basis to segment each detected target region. Precision is thus ensured in both the target detection stage and the semantic segmentation stage. Subsequent development has mainly aimed at improving upon Mask R-CNN, such as the fully connected information-assisted segmentation that PANet adds to the semantic segmentation branch, or the mechanism for scoring segmentation results added by MS-RCNN.
The range of scenes in actual production and life is also very wide; instance segmentation interacts with the user more visually, and higher-precision instance segmentation is in demand. At present, the Mask R-CNN method is the one most similar to the present invention. In the semantic segmentation stage, that model mainly processes the features with a four-layer fully convolutional network and a final deconvolution to obtain the mask map. The present invention differs in that it uses a custom feature fusion module in the semantic segmentation stage, so that the feature map information is utilized more efficiently and the resulting mask map is richer in information and higher in accuracy.
Disclosure of Invention
The invention aims to provide an image instance segmentation method, device, equipment, and storage medium for realizing higher-precision instance segmentation.
To achieve the above object, an embodiment of the present invention provides an image instance segmentation method, including the steps of:
extracting features from an image input to a convolutional neural network, obtaining five feature maps with different resolutions;
extracting multi-scale information of the feature images through a self-attention module to obtain feature images with different scales;
based on the obtained feature maps at different scales, point-multiplying them with a feature map whose dimensionality has been adjusted by a 1×1 convolution, thereby introducing pixel-level attention to the feature map;
compressing the spatial dimensions of the feature map into real numbers, and weighting the low-level feature map based on these per-channel real numbers;
fusing the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level;
classifying the fused feature images through target detection and semantic segmentation to obtain target detection and semantic segmentation feature images;
and performing image segmentation based on the obtained target detection and the semantic segmentation feature map to obtain a segmentation result.
Furthermore, the multi-scale feature maps are obtained through a U-shaped structure, which fuses features at several different pyramid scales, the pyramid using several different convolution kernels for feature extraction to obtain the feature maps.
Further, the obtained feature maps at different scales are point-multiplied with the feature map whose dimensionality was adjusted by 1×1 convolution, and the point-multiplied feature map is globally pooled to obtain the feature map with the strongest global information.
Further, the weighting process is as follows:
z_c = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
wherein z_c represents the output real-number array, H and W are the height and width of the feature map, i, j are the coordinates of the feature map, and u_c is channel c of the feature map. This converts the H×W×C input feature map into a 1×1×C output, and the result represents the global information, in the spatial dimension, of each layer of this feature map.
The invention also provides an image instance segmentation device, comprising:
the first extraction module is used for extracting features from an image input to a convolutional neural network and obtaining five feature maps with different resolutions;
the second extraction module is used for extracting multi-scale information of the feature images through the self-attention module to obtain feature images with different scales;
the point multiplication module is used for point-multiplying the obtained feature maps at different scales with a feature map whose dimensionality has been adjusted by a 1×1 convolution, so as to introduce pixel-level attention to the feature map;
the weighting module is used for compressing the spatial dimensions of the feature map into real numbers, and weighting the low-level feature map based on these per-channel real numbers;
the fusion module is used for fusing the weighted feature images with the high-level feature images to obtain new feature images corresponding to the low level;
the classification module is used for classifying the fused feature images through target detection and semantic segmentation to obtain target detection and semantic segmentation feature images;
the segmentation module is used for performing image segmentation based on the obtained target detection and semantic segmentation feature maps, obtaining the segmentation result.
Further, the multi-scale feature maps are obtained through a U-shaped structure, which fuses features at several different pyramid scales, the pyramid using several different convolution kernels for feature extraction to obtain the feature maps.
Further, the obtained feature maps at different scales are point-multiplied with the feature map whose dimensionality was adjusted by 1×1 convolution, and the point-multiplied feature map is globally pooled to obtain the feature map with the strongest global information.
Further, the weighting process is as follows:
z_c = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
wherein z_c represents the output real-number array, H and W are the height and width of the feature map, i, j are the coordinates of the feature map, and u_c is channel c of the feature map. This converts the H×W×C input feature map into a 1×1×C output, and the result represents the global information, in the spatial dimension, of each layer of this feature map.
The invention also provides an image instance segmentation device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor is used for running the computer program to realize the image instance segmentation method.
The present invention also provides a storage medium storing at least one program or instruction, the at least one program or instruction being executable to implement the image instance segmentation method.
The beneficial technical effects of the invention are as follows:
the invention provides a self-attention module and a path enhancement structure in the feature fusion stage, plays a great role in the final result of the network, and can realize higher-precision instance segmentation. The self-attention module improves the information level of the highest-layer feature map in the backstone, introduces multi-scale information and pixel-level attention, and provides more excellent semantic guidance for subsequent feature fusion. The path enhancement structure introduces the attention of space dimension, and uses the high-level feature map to acquire global information as weight to weight the low-level feature map, so that the network has a certain emphasis on the use of information, and a certain noise influence is reduced for training and learning of the subsequent network.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some examples of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart illustrating an image instance segmentation method according to a first embodiment of the present invention.
Fig. 2 is a flowchart illustrating an image instance segmentation method according to a first embodiment of the present invention.
Fig. 3 is a schematic diagram of a self-attention module according to a first embodiment of the present invention.
Fig. 4 is a schematic diagram of a path enhancement structure according to a first embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
To facilitate an understanding of the present invention, the original Mask R-CNN is described below, in which a classical Feature Pyramid Network (FPN) is used to fuse the five feature maps of different resolutions extracted by the backbone. This fusion approach has been widely used since it was proposed and has indeed improved network performance. It fuses high-level semantic information with low-level spatial information, making the originally unbalanced information of the feature maps more balanced. In ResNet, the fifth-layer feature map has the richest semantics and the strongest representation, but because its resolution is so small, its spatial information is almost lost. The feature pyramid, however, makes up for this lack of spatial information to a certain extent by fusing it with lower-level feature maps; in turn, its rich semantic information can provide semantic guidance for the low-level feature maps and reduce the noise information they contain.
The Feature Pyramid Network (FPN) therefore uses high-level semantic information to guide low-level spatial information, reducing noise, improving the information quality of the feature maps, and balancing the information differences between them. The key is the highest-level feature map of the network, whose semantic information is the strongest but which, owing to the design of the backbone, contains no multi-scale information. In the current structure, a pyramid structure can extract feature information at different scales, but the structure is flattened and lacks global context information.
Moreover, the early Feature Pyramid Network (FPN) fuses feature maps through a Bottom-Up structure, making the feature map information more balanced and suppressing noise. However, the feature pyramid only fuses the feature map information; it neither selects information nor applies any emphasis. As a result, noise or unimportant information within the feature maps is also preserved, which is equivalent to introducing noise interference into later processing in the network. If the information of the feature maps could be screened and weighted during fusion, suppressing unimportant and less important information while enhancing the weight of useful information, the effect of feature fusion would be more prominent and noise interference in subsequent processing could be avoided.
Details of embodiments of the present invention are described below.
As shown in fig. 1-2, a first embodiment of the present invention provides an image instance segmentation method, including the steps of:
s11, extracting features input into a convolutional neural network image, and obtaining five feature images with different resolutions;
s12, extracting multi-scale information of the feature map through a self-attention module to obtain feature maps with different scales;
in this embodiment, as shown in fig. 3, the self-attention module merges three features at different pyramid scales through a U-shaped structure (e.g., a feature pyramid). To better extract the multi-scale context information of the image through the pyramid structure, the pyramid uses three different convolution kernels for feature extraction: 3×3, 5×5, and 7×7. The three feature maps at different scales extract scale information for large, medium, and small targets from the original feature map, enriching its information hierarchy, while at the same time the structure brings pixel-level attention to the feature map. The self-attention module can thus introduce finer pixel-level attention information into the network while fusing multi-scale information, which provides more accurate semantic guidance for the low-level feature maps in subsequent feature fusion. Of course, it should be noted that other types of U-shaped structures may also be used, and such solutions all fall within the scope of the present invention.
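As a rough numpy illustration of the multi-scale idea (not the patented module): the learned 3×3, 5×5, and 7×7 convolution branches are stood in for by simple mean filters of the same sizes, and a random array stands in for the backbone feature map, just to show how growing kernel sizes capture context at coarser scales:

```python
import numpy as np

def mean_filter(fmap, k):
    """Crude k×k mean filter (zero padding), standing in for a learned k×k conv branch."""
    p = k // 2
    padded = np.pad(fmap, p)           # zero-pad so output keeps the input size
    H, W = fmap.shape
    out = np.empty_like(fmap, dtype=float)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

rng = np.random.default_rng(0)
fmap = rng.random((16, 16))            # stand-in for the top-level feature map
# Small / medium / large receptive fields, mirroring the 3×3, 5×5, 7×7 branches
scales = [mean_filter(fmap, k) for k in (3, 5, 7)]
multi_scale = np.stack(scales)         # stacked multi-scale context, shape (3, 16, 16)
```

Each slice of `multi_scale` summarizes the same spatial location at a different context size, which is the information-hierarchy enrichment the pyramid branches provide.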
S13, based on the feature maps at different scales, point-multiplying them with the feature map whose dimensionality has been adjusted by a 1×1 convolution, thereby introducing pixel-level attention to the feature map;
in this embodiment, the feature map of the last layer of ResNet-101 is input to this module and split into two paths. One path extracts multi-scale information through the pyramid structure; the 4×4, 8×8, and 16×16 labels after each pyramid level denote the resolution of the feature map after that level of processing. The other path has its dimensionality adjusted by a 1×1 convolution. Because the feature map has little resolution at input, using large convolution kernels here does not introduce excessive computation into the network. The pyramid structure extracts information at different scales step by step and combines context of different scales precisely in a cascading fashion. After the two paths of processing, the two feature maps are point-multiplied, which introduces pixel-level attention to the original feature map.
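A minimal numpy sketch of the two-path point multiplication (an assumed illustration, not the patented network): random weights stand in for the learned 1×1 convolution, and a sigmoid of the channel mean stands in for the pyramid branch's per-pixel attention output:

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, H, W = 8, 4, 4, 4
fmap = rng.random((C_in, H, W))        # stand-in for the top-level backbone feature map

# Path 1: 1×1 convolution adjusting the channel dimension
# (a 1×1 conv is just a per-pixel channel mixing; random weights are a stand-in)
w = rng.random((C_out, C_in))
path1 = np.einsum('oc,chw->ohw', w, fmap)

# Path 2: pyramid branch producing a per-pixel attention map in (0, 1)
# (sigmoid of the channel mean is an assumed stand-in for the real branch)
attn = 1.0 / (1.0 + np.exp(-fmap.mean(axis=0)))   # shape (H, W)

# Point multiplication introduces pixel-level attention (broadcasts over channels)
out = path1 * attn
```

Each spatial position of `path1` is scaled by its own attention value, which is exactly the pixel-level (rather than channel-level) weighting described above.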
Preferably, global pooling can be performed after the point multiplication of the two feature maps, so that the feature map contains the strongest global information; this branch provides global information for the module, further improving its performance.
S14, compressing the spatial dimensions of the feature map into real numbers, and weighting the lower-layer feature map based on these per-channel real numbers;
in this embodiment, as shown in fig. 4, in the path enhancement structure the real numbers obtained by compressing the feature map have a global receptive field, and since their number matches the channel dimension of the input feature map, they can characterize the global distribution over the feature dimension; the global receptive field can thus be transferred to the next layer of feature maps, and such global information is important for many tasks. The compression is computed as:
z_c = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
wherein z_c represents the output real-number array, H and W are the height and width of the feature map, i, j are the coordinates of the feature map, and u_c is channel c of the feature map. This converts the H×W×C input feature map into a 1×1×C output, and the result represents the global information, in the spatial dimension, of each layer of this feature map.
The high-level feature map is converted into the corresponding global information through global pooling, and these real numbers are equivalent to weight information extracted by the network. They are then point-multiplied with the bottom-level feature map, weighting the lower-level features: each channel of the low-level feature map is multiplied by the corresponding real number z_c.
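The squeeze-and-weight step can be sketched in a few lines of numpy (random arrays stand in for the real high- and low-level feature maps; the channel-wise multiplication is as described above):

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 4, 8, 8
high = rng.random((C, H, W))   # stand-in for the high-level feature map u
low  = rng.random((C, H, W))   # stand-in for the low-level feature map to be weighted

# Squeeze: z_c = (1/(H*W)) * sum_{i,j} u_c(i, j)  → one real number per channel
z = high.mean(axis=(1, 2))     # shape (C,), i.e. the 1×1×C output

# Weight the low-level map channel-wise with the pooled global descriptors
weighted = low * z[:, None, None]
```

Because `z` summarizes each channel of the high-level map over all spatial positions, multiplying it onto the low-level map emphasizes channels that the semantically rich map deems important — the spatial-dimension attention of the path enhancement structure.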
s15, fusing the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level;
in this embodiment, the above algorithm adds an adjustment in the spatial dimension, weighting the low-level feature map with the semantically rich high-level feature map and thereby providing guidance information for the low-level feature map through a fairly simple operation. This fills the gap left by the original feature pyramid in the spatial dimension and improves the overall quality of the feature fusion process.
S16, classifying the fused feature images through target detection and semantic segmentation to obtain target detection and semantic segmentation feature images;
and S17, performing image segmentation based on the obtained target detection and semantic segmentation feature maps, obtaining the segmentation result.
A second embodiment of the present invention provides an image instance segmentation device, as shown in the figures, comprising:
110, a first extraction module, configured to extract features from an image input to a convolutional neural network and obtain five feature maps with different resolutions;
the second extraction module is used for extracting multi-scale information of the feature map through the self-attention module to obtain feature maps with different scales;
in this embodiment, as shown in fig. 3, the self-attention module merges three features at different pyramid scales through a U-shaped structure (e.g., a feature pyramid). To better extract the multi-scale context information of the image through the pyramid structure, the pyramid uses three different convolution kernels for feature extraction: 3×3, 5×5, and 7×7. The three feature maps at different scales extract scale information for large, medium, and small targets from the original feature map, enriching its information hierarchy, while at the same time the structure brings pixel-level attention to the feature map. The self-attention module can thus introduce finer pixel-level attention information into the network while fusing multi-scale information, which provides more accurate semantic guidance for the low-level feature maps in subsequent feature fusion. Of course, it should be noted that other types of U-shaped structures may also be used, and such solutions all fall within the scope of the present invention.
A point multiplication module, configured to point-multiply the obtained feature maps at different scales with a feature map whose dimensionality has been adjusted by a 1×1 convolution, so as to introduce pixel-level attention to the feature map;
In this embodiment, the feature map of the last layer of ResNet-101 is input to the algorithm module and split into two paths: one path extracts multi-scale information through the pyramid structure, where the 4×4, 8×8, and 16×16 after each pyramid level denote the resolution of the feature map after that level of processing; the other path has its dimension adjusted by a 1×1 convolution. Because the feature map already has a low resolution at input time, using large convolution kernels here does not introduce excessive computation into the network. The pyramid structure extracts information of different scales step by step and combines the context information of the different scales in a cascading manner. After the two paths of processing, the two feature maps are dot-multiplied, which introduces pixel-level attention to the original feature map.
Preferably, global pooling may be performed after the dot multiplication of the two feature maps so that the feature map contains the strongest global information; this branch provides global information to the module and further improves its performance.
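The two-path structure and the optional global-pooling branch can be sketched as follows. This is a NumPy stand-in: the pyramid branch is stubbed out as the identity, the 1×1 convolution weights are random placeholders, and all array shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
feat = rng.standard_normal((C, H, W))  # last-layer feature map (e.g. from ResNet-101)

# Path 1: multi-scale pyramid features (stubbed out as the identity here).
pyramid_feat = feat

# Path 2: a 1x1 convolution adjusts the channel dimension; on a (C, H, W) tensor
# this is a matrix multiply over the channel axis.
w1x1 = rng.standard_normal((C, C)) / np.sqrt(C)
adjusted = np.einsum('oc,chw->ohw', w1x1, feat)

# Dot (element-wise) multiplication of the two paths introduces pixel-level attention.
attended = pyramid_feat * adjusted

# Optional global-pooling branch contributes the strongest global information.
global_info = feat.mean(axis=(1, 2), keepdims=True)  # shape (C, 1, 1)
out = attended + global_info                          # broadcast over H x W

print(out.shape)  # (8, 4, 4)
```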
140: a weighting module, configured to compress the spatial dimensions of the feature map into real numbers, and to weight the lower-level feature map based on these real numbers;
In this embodiment, as shown in fig. 4, the real number array obtained from the feature map has a global receptive field, and its output dimension matches the number of input channels, so it can represent the global distribution over the feature dimensions; this global receptive field can thus be transferred to the next-layer feature map, and such global information is important for many tasks.
z_c = (1/(H×W)) · Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
wherein z_c represents the output real number array, H and W are the height and width of the feature map, i, j represent the coordinates of the feature map, and u_c is the c-th channel of the feature map. This operation converts the H×W×C input feature map into a 1×1×C output, and the result represents the global information of each channel of the feature map over the spatial dimensions.
The high-level feature map is converted into the corresponding global information through global pooling, and these real numbers are equivalent to weight information extracted by the network. They are dot-multiplied with the bottom-level feature map to weight the lower-level feature map, computed by the following formula:
F'_{low,c}(i, j) = z_c · F_{low,c}(i, j)
where F_{low,c} is the c-th channel of the low-level feature map and z_c is the real number obtained above.
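The squeeze-and-weight step above reduces to a per-channel spatial mean followed by a channel-wise multiplication. A minimal NumPy sketch, where the array shapes and the names `high`/`low` are illustrative assumptions:

```python
import numpy as np

def squeeze(feat):
    """Compress spatial dims: z_c = (1/(H*W)) * sum_{i,j} u_c(i, j),
    turning an H x W x C map into a 1 x 1 x C array of real numbers."""
    return feat.mean(axis=(0, 1), keepdims=True)

rng = np.random.default_rng(0)
H, W, C = 8, 8, 4
high = rng.standard_normal((H, W, C))  # high-level feature map (rich semantics)
low = rng.standard_normal((H, W, C))   # low-level feature map to be weighted

z = squeeze(high)       # shape (1, 1, C): one global real number per channel
weighted_low = low * z  # channel-wise weighting of the low-level map (broadcast)

print(z.shape, weighted_low.shape)
```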
150: a fusion module, configured to fuse the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level;
In this embodiment, the above algorithm adds an adjustment in the spatial dimension: the high-level feature map with rich semantic information is used to weight the low-level feature map, providing guidance information for the low-level feature map with relatively simple processing. This fills the gap of the original feature pyramid in the spatial dimension and improves the overall quality of the feature fusion.
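A minimal sketch of the fusion step. The patent specifies only "fusing", so the element-wise addition and the nearest-neighbour upsampling of the high-level map are assumptions for illustration:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour upsampling (a stand-in for learned upsampling)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

rng = np.random.default_rng(0)
C = 4
high = rng.standard_normal((4, 4, C))  # high-level map, lower resolution
low = rng.standard_normal((8, 8, C))   # low-level map, higher resolution

z = high.mean(axis=(0, 1))             # global pooling -> per-channel weights
weighted_low = low * z                 # weight the low-level map channel-wise
new_low = weighted_low + upsample2x(high)  # fuse with the upsampled high-level map

print(new_low.shape)  # (8, 8, 4): the new feature map corresponding to the low level
```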
160: a classification module, configured to classify the fused feature maps through target detection and semantic segmentation to obtain target detection and semantic segmentation feature maps;
170: a segmentation module, configured to perform image segmentation based on the obtained target detection and semantic segmentation feature maps to obtain a segmentation result.
A third embodiment of the present invention provides an image instance segmentation apparatus, including a memory and a processor, where the memory stores a computer program, and the processor is configured to run the computer program to implement the image instance segmentation method.
The fourth embodiment of the present invention further provides a storage medium, where a computer program is stored, where the computer program can be executed by a processor of a device where the storage medium is located, so as to implement the image instance segmentation method.
In the several embodiments provided by the present invention, it should be understood that the provided apparatus and method may be implemented in other manners. The apparatus and method embodiments described above are merely illustrative; for example, the flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present invention may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, an electronic device, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (4)
1. An image instance segmentation method, comprising the steps of:
extracting features of an image input to a convolutional neural network, and obtaining five feature maps with different resolutions;
extracting multi-scale information of the feature maps through a self-attention module to obtain feature maps with different scales; the feature maps with different scales are obtained through a U-shaped structure, the U-shaped structure fuses features of different pyramid scales, and the pyramid extracts features using different convolution kernels to obtain the feature maps;
based on the obtained feature maps with different scales, dot-multiplying them with the feature map whose dimension has been adjusted by a 1×1 convolution, thereby introducing pixel-level attention to the feature maps; obtaining the feature map containing the strongest global information through global pooling processing on the basis of the feature maps with different scales and the feature map whose dimension has been adjusted by the 1×1 convolution; specifically, the feature map of the last layer of ResNet-101 is input to the algorithm module and split into two paths, wherein one path extracts multi-scale information through a pyramid structure, the 4×4, 8×8 and 16×16 after each pyramid level denoting the resolution of the feature map after that level of processing, and the other path adjusts the dimension through a 1×1 convolution; the pyramid structure gradually extracts information of different scales and combines the context information of the different scales in a cascading manner; after the two paths of processing, the two feature maps are dot-multiplied, and pixel-level attention is introduced to the feature maps through the dot multiplication; compressing the spatial dimensions of the feature map into real numbers, and weighting the low-level feature map based on these real numbers;
fusing the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level; the weighting process comprises:
z_c = (1/(H×W)) · Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
wherein z_c represents the output real number array, H and W are the height and width of the feature map, i, j represent the coordinates of the feature map, and u_c is the c-th channel of the feature map; the H×W×C input feature map is converted into a 1×1×C output, and the result represents the global information corresponding to each channel of the feature map over the spatial dimensions;
classifying the fused feature maps through target detection and semantic segmentation to obtain target detection and semantic segmentation feature maps;
and performing image segmentation based on the obtained target detection and semantic segmentation feature maps to obtain a segmentation result.
2. An image instance segmentation apparatus, comprising:
the first extraction module is configured to extract features of an image input to a convolutional neural network and obtain five feature maps with different resolutions;
the second extraction module is configured to extract multi-scale information of the feature maps through the self-attention module to obtain feature maps with different scales; the feature maps with different scales are obtained through a U-shaped structure, the U-shaped structure fuses features of different pyramid scales, and the pyramid extracts features using different convolution kernels to obtain the feature maps;
the dot-multiplication module is configured to dot-multiply the obtained feature maps with different scales with the feature map whose dimension has been adjusted by a 1×1 convolution, so as to introduce pixel-level attention to the feature maps; the dot-multiplication module is specifically configured to dot-multiply the obtained feature maps with different scales with the feature map whose dimension has been adjusted by the 1×1 convolution, and to obtain the feature map containing the strongest global information through global pooling processing of the dot-multiplied feature maps; specifically, the feature map of the last layer of ResNet-101 is input to the algorithm module and split into two paths, wherein one path extracts multi-scale information through a pyramid structure, the 4×4, 8×8 and 16×16 after each pyramid level denoting the resolution of the feature map after that level of processing, and the other path adjusts the dimension through a 1×1 convolution; the pyramid structure gradually extracts information of different scales and combines the context information of the different scales in a cascading manner; after the two paths of processing, the two feature maps are dot-multiplied, and pixel-level attention is introduced to the feature maps through the dot multiplication;
the weighting module is configured to compress the spatial dimensions of the feature map into real numbers and to weight the low-level feature map based on these real numbers; the weighting process comprises:
z_c = (1/(H×W)) · Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
wherein z_c represents the output real number array, H and W are the height and width of the feature map, i, j represent the coordinates of the feature map, and u_c is the c-th channel of the feature map; the H×W×C input feature map is converted into a 1×1×C output, and the result represents the global information corresponding to each channel of the feature map over the spatial dimensions;
the fusion module is configured to fuse the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level;
the classification module is configured to classify the fused feature maps through target detection and semantic segmentation to obtain target detection and semantic segmentation feature maps;
the segmentation module is configured to perform image segmentation based on the obtained target detection and semantic segmentation feature maps to obtain a segmentation result.
3. An image instance segmentation apparatus comprising a memory and a processor, the memory having a computer program stored therein, the processor being operable to execute the computer program to implement an image instance segmentation method as claimed in claim 1.
4. A storage medium storing a computer program executable by a processor of a device in which the storage medium is located to implement an image instance segmentation method according to claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010798878.XA CN111915613B (en) | 2020-08-11 | 2020-08-11 | Image instance segmentation method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010798878.XA CN111915613B (en) | 2020-08-11 | 2020-08-11 | Image instance segmentation method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111915613A CN111915613A (en) | 2020-11-10 |
CN111915613B true CN111915613B (en) | 2023-06-13 |
Family
ID=73284821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010798878.XA Active CN111915613B (en) | 2020-08-11 | 2020-08-11 | Image instance segmentation method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111915613B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112446308B (en) * | 2020-11-16 | 2024-09-13 | 北京科技大学 | Pedestrian detection method based on semantic enhancement multi-scale feature pyramid fusion |
CN112396115B (en) * | 2020-11-23 | 2023-12-22 | 平安科技(深圳)有限公司 | Attention mechanism-based target detection method and device and computer equipment |
CN112700462B (en) * | 2020-12-31 | 2024-09-17 | 北京迈格威科技有限公司 | Image segmentation method and device, electronic equipment and storage medium |
CN112819032B (en) * | 2021-01-11 | 2023-10-27 | 平安科技(深圳)有限公司 | Multi-model-based slice feature classification method, device, equipment and medium |
CN112906525B (en) * | 2021-02-05 | 2024-10-18 | 广州市百果园信息技术有限公司 | Age identification method and device and electronic equipment |
CN113159061B (en) * | 2021-03-10 | 2022-02-08 | 哈尔滨工业大学 | Actual tunnel surrounding rock fragment identification method based on example segmentation |
CN113421267B (en) * | 2021-05-07 | 2024-04-12 | 江苏大学 | Point cloud semantic and instance joint segmentation method and system based on improved PointConv |
CN113469942B (en) * | 2021-06-01 | 2022-02-22 | 天津大学 | CT image lesion detection method |
CN113591859A (en) * | 2021-06-23 | 2021-11-02 | 北京旷视科技有限公司 | Image segmentation method, apparatus, device and medium |
CN113344935B (en) * | 2021-06-30 | 2023-02-03 | 山东建筑大学 | Image segmentation method and system based on multi-scale difficulty perception |
CN113610085B (en) * | 2021-10-10 | 2021-12-07 | 成都千嘉科技有限公司 | Character wheel image identification method based on attention mechanism |
CN114913325B (en) * | 2022-03-24 | 2024-05-10 | 北京百度网讯科技有限公司 | Semantic segmentation method, semantic segmentation device and computer program product |
CN114511046A (en) * | 2022-04-19 | 2022-05-17 | 阿里巴巴(中国)有限公司 | Object recognition method and device |
CN114972376B (en) * | 2022-05-16 | 2023-08-25 | 北京医准智能科技有限公司 | Coronary calcified plaque segmentation method, segmentation model training method and related device |
CN118334322B (en) * | 2024-06-12 | 2024-08-27 | 浙江大华技术股份有限公司 | Camouflage target detection method, camouflage target detection device, computer equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104866855A (en) * | 2015-05-07 | 2015-08-26 | 华为技术有限公司 | Image feature extraction method and apparatus |
CN106157319A (en) * | 2016-07-28 | 2016-11-23 | 哈尔滨工业大学 | The significance detection method that region based on convolutional neural networks and Pixel-level merge |
CN107330873A (en) * | 2017-05-05 | 2017-11-07 | 浙江大学 | Objective evaluation method for quality of stereo images based on multiple dimensioned binocular fusion and local shape factor |
CN109525290A (en) * | 2019-01-11 | 2019-03-26 | 电子科技大学 | Real number feedback iteration channel estimation methods based on MIMO-FBMC system |
CN110084210A (en) * | 2019-04-30 | 2019-08-02 | 电子科技大学 | The multiple dimensioned Ship Detection of SAR image based on attention pyramid network |
CN110096994A (en) * | 2019-04-28 | 2019-08-06 | 西安电子科技大学 | A kind of small sample PolSAR image classification method based on fuzzy label semanteme priori |
CN111242071A (en) * | 2020-01-17 | 2020-06-05 | 陕西师范大学 | Attention remote sensing image target detection method based on anchor frame |
CN111292330A (en) * | 2020-02-07 | 2020-06-16 | 北京工业大学 | Image semantic segmentation method and device based on coder and decoder |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104866855A (en) * | 2015-05-07 | 2015-08-26 | 华为技术有限公司 | Image feature extraction method and apparatus |
CN106157319A (en) * | 2016-07-28 | 2016-11-23 | 哈尔滨工业大学 | The significance detection method that region based on convolutional neural networks and Pixel-level merge |
CN107330873A (en) * | 2017-05-05 | 2017-11-07 | 浙江大学 | Objective evaluation method for quality of stereo images based on multiple dimensioned binocular fusion and local shape factor |
CN109525290A (en) * | 2019-01-11 | 2019-03-26 | 电子科技大学 | Real number feedback iteration channel estimation methods based on MIMO-FBMC system |
CN110096994A (en) * | 2019-04-28 | 2019-08-06 | 西安电子科技大学 | A kind of small sample PolSAR image classification method based on fuzzy label semanteme priori |
CN110084210A (en) * | 2019-04-30 | 2019-08-02 | 电子科技大学 | The multiple dimensioned Ship Detection of SAR image based on attention pyramid network |
CN111242071A (en) * | 2020-01-17 | 2020-06-05 | 陕西师范大学 | Attention remote sensing image target detection method based on anchor frame |
CN111292330A (en) * | 2020-02-07 | 2020-06-16 | 北京工业大学 | Image semantic segmentation method and device based on coder and decoder |
Non-Patent Citations (1)
Title |
---|
Pyramid Feature Attention Network for Saliency Detection; Ting Zhao, Xiangqian Wu; CVPR; 1-10 *
Also Published As
Publication number | Publication date |
---|---|
CN111915613A (en) | 2020-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111915613B (en) | Image instance segmentation method, device, equipment and storage medium | |
Tong et al. | Saliency detection with multi-scale superpixels | |
CN112966691B (en) | Multi-scale text detection method and device based on semantic segmentation and electronic equipment | |
CN111553837B (en) | Artistic text image generation method based on neural style migration | |
CN105184763B (en) | Image processing method and device | |
CN112150493B (en) | Semantic guidance-based screen area detection method in natural scene | |
CN110334762B (en) | Feature matching method based on quad tree combined with ORB and SIFT | |
CN112800964B (en) | Remote sensing image target detection method and system based on multi-module fusion | |
CN111462126A (en) | Semantic image segmentation method and system based on edge enhancement | |
CN108898145A (en) | A kind of image well-marked target detection method of combination deep learning | |
CN109978848B (en) | Method for detecting hard exudation in fundus image based on multi-light-source color constancy model | |
CN111738055B (en) | Multi-category text detection system and bill form detection method based on same | |
CN112365514A (en) | Semantic segmentation method based on improved PSPNet | |
CN105160355B (en) | A kind of method for detecting change of remote sensing image based on region correlation and vision word | |
CN111914947B (en) | Image instance segmentation method, device, equipment and storage medium based on feature fusion | |
CN106778768A (en) | Image scene classification method based on multi-feature fusion | |
CN105046278B (en) | The optimization method of Adaboost detection algorithm based on Haar feature | |
CN110706235A (en) | Far infrared pedestrian detection method based on two-stage cascade segmentation | |
CN112802039B (en) | Panorama segmentation method based on global edge attention | |
Song et al. | Carvingnet: content-guided seam carving using deep convolution neural network | |
Al-Amaren et al. | RHN: A residual holistic neural network for edge detection | |
CN113392854A (en) | Image texture feature extraction and classification method | |
CN116071339A (en) | Product defect identification method based on improved whale algorithm optimization SVM | |
CN114708615A (en) | Human body detection method based on image enhancement in low-illumination environment, electronic equipment and storage medium | |
CN111242120B (en) | Character detection method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||