CN111915613B - Image instance segmentation method, device, equipment and storage medium - Google Patents
Image instance segmentation method, device, equipment and storage medium Download PDFInfo
- Publication number
- CN111915613B (granted publication of application CN202010798878.XA / CN202010798878A)
- Authority
- CN
- China
- Prior art keywords
- feature
- feature map
- feature images
- images
- segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 230000011218 segmentation Effects 0.000 title claims abstract description 86
- 238000000034 method Methods 0.000 title claims abstract description 49
- 238000003860 storage Methods 0.000 title claims abstract description 13
- 238000001514 detection method Methods 0.000 claims abstract description 27
- 238000003709 image segmentation Methods 0.000 claims abstract description 15
- 230000004927 fusion Effects 0.000 claims abstract description 12
- 238000012545 processing Methods 0.000 claims description 22
- 238000004422 calculation algorithm Methods 0.000 claims description 19
- 239000000284 extract Substances 0.000 claims description 14
- 238000000605 extraction Methods 0.000 claims description 11
- 238000004590 computer program Methods 0.000 claims description 10
- 238000010586 diagram Methods 0.000 claims description 10
- 238000011176 pooling Methods 0.000 claims description 8
- 230000008569 process Effects 0.000 claims description 8
- 238000013527 convolutional neural network Methods 0.000 claims description 7
- 238000012549 training Methods 0.000 abstract description 3
- 238000011161 development Methods 0.000 description 8
- 230000018109 developmental process Effects 0.000 description 8
- 230000000694 effects Effects 0.000 description 5
- 230000006870 function Effects 0.000 description 5
- 238000013528 artificial neural network Methods 0.000 description 4
- 238000007499 fusion processing Methods 0.000 description 4
- 230000006872 improvement Effects 0.000 description 4
- 238000003708 edge detection Methods 0.000 description 3
- 230000008859 change Effects 0.000 description 2
- 238000009826 distribution Methods 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000012937 correction Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/70—Denoising; Smoothing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20016—Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
Abstract
The invention provides an image instance segmentation method, device, equipment, and storage medium, wherein the method comprises the following steps: extracting features from the image to obtain a feature map; extracting multi-scale information from the feature map through a self-attention module to obtain feature maps at different scales, then point-multiplying them with a feature map whose dimensionality has been adjusted by a 1×1 convolution, thereby introducing pixel-level attention into the feature map; compressing the spatial dimensions of the feature map into real numbers and weighting the low-level feature map with them; fusing the weighted feature map with the high-level feature map to obtain a new feature map; classifying the fused feature maps through target detection and semantic segmentation to obtain target detection and semantic segmentation feature maps; and performing image segmentation based on the obtained target detection and semantic segmentation feature maps to obtain a segmentation result. Through the self-attention module and the path enhancement module, the invention provides better semantic guidance for subsequent feature fusion and reduces noise interference in the training and learning of subsequent networks.
Description
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method, apparatus, device, and storage medium for image instance segmentation.
Background
The problem of image instance segmentation is an extremely important sub-problem in the field of computer vision. Its main purpose is to accurately extract objects of interest from an image, avoiding background interference and facilitating subsequent processing. Instance segmentation itself involves two sub-problems: target detection and semantic segmentation. Improvement at either point is of paramount importance to the final result, and thus the directions of improvement herein also come from these two places.
Early image segmentation did not distinguish between instance segmentation and semantic segmentation; its aim was simply to separate the foreground image from the background image. Early image segmentation techniques developed mainly in the following two directions:
Threshold-based image segmentation algorithms
Segmenting an image with a threshold is one of the most common image segmentation approaches. The main idea is to compare the gray value of each pixel with a threshold and then classify each pixel, obtaining a final binary image that separates the object from the background. The emphasis of thresholding methods is therefore on how to find the optimal gray threshold. The classical algorithm is Otsu's method, which finds the best threshold by maximizing the variance between the different classes. Other algorithms include the mean method, the histogram bimodal method, and the iterative threshold method.
Thresholding methods are fast to compute and simple to implement, and work very well when the difference between target and background is large or the contrast is obvious. However, when the information in the image itself is complex, or problems such as occlusion and noise are serious, thresholding cannot achieve a satisfactory segmentation result.
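To make the thresholding idea concrete, here is a minimal numpy sketch of Otsu's method — not the patent's code, just an illustration of selecting the threshold that maximizes between-class variance; the toy image is an assumed example:

```python
import numpy as np

def otsu_threshold(gray):
    """Return the gray level maximizing the between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # background mean
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1   # foreground mean
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Toy image: a dark background with a bright object in one corner
img = np.array([[10, 12, 11, 200],
                [13, 10, 210, 205],
                [11, 198, 202, 207],
                [12, 11, 206, 201]], dtype=np.uint8)
t = otsu_threshold(img)
mask = img >= t  # binary segmentation: foreground vs. background
```

With a strongly bimodal image like this, any threshold between the two modes separates the object cleanly, which is exactly the "large difference between target and background" regime described above.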
Image segmentation algorithm based on edge information
The edge itself carries more information than raw pixel values, and edge regions of an image exhibit the most drastic changes in pixel value, gradient, and other characteristics, so segmentation according to edge information is one of the most intuitive and longest-studied approaches.
In general, pixels inside an object are uniform or change moderately, while the change between different objects is extremely sharp, so edge information is in most cases high-frequency information. From this property, all possible boundary information in the image can be computed and the image segmented accordingly. Common edge detection operators include Sobel and Canny. However, when the pixel values of an object are highly variable or contain many fine textures, a single edge detection algorithm can hardly recover the complete boundary of the object. Similarly, when the image is severely affected by noise, edge detection fails: the noise is irrelevant to the image content, yet it too is high-frequency information, which heavily disturbs detection.
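A small numpy sketch of Sobel gradient-magnitude edge detection (an illustrative stand-in, not the patent's implementation); the step-edge test image is an assumed example showing how the operator responds at an object boundary:

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude via the two Sobel kernels (valid region only)."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)   # horizontal gradient
    ky = kx.T                                   # vertical gradient
    H, W = img.shape
    gx = np.zeros((H - 2, W - 2))
    gy = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)

# Vertical step edge: left half dark (0), right half bright (100)
img = np.zeros((5, 6))
img[:, 3:] = 100.0
mag = sobel_magnitude(img)  # large response only at the step columns
```

The response is zero inside the uniform regions and large at the step, matching the high-frequency interpretation of edges above; equally, any high-frequency noise would produce spurious responses of the same kind.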
Early image segmentation algorithms focused mainly on surface information such as pixels, edges, and gradients. With the development of deep learning, however, neural networks can extract higher-level semantic information from images, which undoubtedly expands the amount of information available to the algorithm and has greatly advanced image segmentation. Segmentation then began to develop in two directions: semantic segmentation and instance segmentation.
The purpose of semantic segmentation is simpler: classify all pixels in an image, thereby distinguishing objects of different classes. Semantic segmentation algorithms are now developing very rapidly, can be trained and predicted end-to-end, and have continuously refreshed accuracy records. The development of semantic segmentation on neural networks started from the FCN paper, which uses a traditional neural network as the encoding part to obtain a low-resolution but information-rich feature map, then decodes that feature map with transposed convolutions, and finally outputs a segmentation result map of the same size as the input image. This achieved end-to-end training for the first time and set the direction of development for semantic segmentation. Subsequent semantic segmentation work has not changed the main idea greatly: the first half is an encoder, mainly a convolutional neural network that encodes the image information, and the second half is a decoder that decodes the encoded feature map into the final result map. Later improvements have likewise focused on these two parts. The encoder mainly adopts better neural networks, such as GoogLeNet, ResNet, and SENet, while the decoder has better methods such as U-Net, which adds skip connections so that intermediate encoder features are reused. Other improvements concern processing details, such as the dilated (atrous) convolution proposed by DeepLab, which enlarges the receptive field and enriches the information of the feature map.
In the field of instance segmentation, the basic idea is object detection plus semantic segmentation. The early SDS generated proposal regions using MCG and then segmented the result in a Bottom-Up fashion, but this was computationally intensive and the segmentation effect was mediocre. Later work introduced the hypercolumn concept into the classifier on the basis of SDS, thereby refining the ROI. CFM then started using masks to assist segmentation. Through continuous iterative development, Kaiming He's Mask R-CNN finally advanced instance segmentation by a large step: it uses Faster R-CNN, the most accurate detector at the time, as the target detection part, and adds a semantic segmentation branch on this basis to segment each detected target region. Precision is thus ensured in both the target detection stage and the semantic segmentation stage. Subsequent development has mainly aimed at improving upon Mask R-CNN, such as the fully connected information-assisted segmentation that PANet adds to the semantic segmentation branch, or the mechanism for scoring segmentation results added by MS-RCNN.
The range of scenes in actual production and life is also very wide; instance segmentation interacts with the user more visually, and higher-precision instance segmentation is in demand. At present, the Mask R-CNN method is the one most similar to the present invention. In the semantic segmentation stage, that model mainly processes the features with a four-layer fully convolutional network and a final deconvolution to obtain the mask map. The present invention differs in that it uses a custom feature fusion module in the semantic segmentation stage, so that the feature map information is utilized more efficiently and the resulting mask map is richer in information and higher in accuracy.
Disclosure of Invention
The invention aims to provide an image instance segmentation method, device, equipment, and storage medium for realizing higher-precision instance segmentation.
To achieve the above object, an embodiment of the present invention provides an image instance segmentation method, including the steps of:
extracting features from an image input to a convolutional neural network, obtaining five feature maps with different resolutions;
extracting multi-scale information of the feature images through a self-attention module to obtain feature images with different scales;
based on the obtained feature maps at different scales, point-multiplying them with a feature map whose dimensionality has been adjusted by a 1×1 convolution, thereby introducing pixel-level attention to the feature map;
compressing the spatial dimensions of the feature map into real numbers, and weighting the low-level feature map based on these per-channel real numbers;
fusing the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level;
classifying the fused feature images through target detection and semantic segmentation to obtain target detection and semantic segmentation feature images;
and performing image segmentation based on the obtained target detection and the semantic segmentation feature map to obtain a segmentation result.
Furthermore, the multi-scale feature maps are obtained through a U-shaped structure, which fuses features at several different pyramid scales, the pyramid using several different convolution kernels for feature extraction to obtain the feature maps.
Further, the obtained feature maps at different scales are point-multiplied with the feature map whose dimensionality was adjusted by 1×1 convolution, and the point-multiplied feature map is globally pooled to obtain the feature map with the strongest global information.
Further, the weighting process is as follows:
z_c = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
wherein z_c represents the output real-number array, H and W are the height and width of the feature map, i, j are the coordinates of the feature map, and u_c is channel c of the feature map. This converts the H×W×C input feature map into a 1×1×C output, and the result represents the global information, in the spatial dimension, of each layer of this feature map.
The invention also provides an image instance segmentation device, comprising:
the first extraction module is used for extracting features from an image input to a convolutional neural network and obtaining five feature maps with different resolutions;
the second extraction module is used for extracting multi-scale information of the feature images through the self-attention module to obtain feature images with different scales;
the point multiplication module is used for point-multiplying the obtained feature maps at different scales with a feature map whose dimensionality has been adjusted by a 1×1 convolution, so as to introduce pixel-level attention to the feature map;
the weighting module is used for compressing the spatial dimensions of the feature map into real numbers, and weighting the low-level feature map based on these per-channel real numbers;
the fusion module is used for fusing the weighted feature images with the high-level feature images to obtain new feature images corresponding to the low level;
the classification module is used for classifying the fused feature images through target detection and semantic segmentation to obtain target detection and semantic segmentation feature images;
the segmentation module is used for performing image segmentation based on the obtained target detection and semantic segmentation feature maps, obtaining the segmentation result.
Further, the multi-scale feature maps are obtained through a U-shaped structure, which fuses features at several different pyramid scales, the pyramid using several different convolution kernels for feature extraction to obtain the feature maps.
Further, the obtained feature maps at different scales are point-multiplied with the feature map whose dimensionality was adjusted by 1×1 convolution, and the point-multiplied feature map is globally pooled to obtain the feature map with the strongest global information.
Further, the weighting process is as follows:
z_c = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
wherein z_c represents the output real-number array, H and W are the height and width of the feature map, i, j are the coordinates of the feature map, and u_c is channel c of the feature map. This converts the H×W×C input feature map into a 1×1×C output, and the result represents the global information, in the spatial dimension, of each layer of this feature map.
The invention also provides an image instance segmentation device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor is used for running the computer program to realize the image instance segmentation method.
The present invention also provides a storage medium storing at least one program or instruction, the at least one program or instruction being executable to implement the image instance segmentation method.
The beneficial technical effects of the invention are as follows:
the invention provides a self-attention module and a path enhancement structure in the feature fusion stage, plays a great role in the final result of the network, and can realize higher-precision instance segmentation. The self-attention module improves the information level of the highest-layer feature map in the backstone, introduces multi-scale information and pixel-level attention, and provides more excellent semantic guidance for subsequent feature fusion. The path enhancement structure introduces the attention of space dimension, and uses the high-level feature map to acquire global information as weight to weight the low-level feature map, so that the network has a certain emphasis on the use of information, and a certain noise influence is reduced for training and learning of the subsequent network.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some examples of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart illustrating an image instance segmentation method according to a first embodiment of the present invention.
Fig. 2 is a flowchart illustrating an image instance segmentation method according to a first embodiment of the present invention.
Fig. 3 is a schematic diagram of a self-attention module according to a first embodiment of the present invention.
Fig. 4 is a schematic diagram of a path enhancement structure according to a first embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
To facilitate an understanding of the present invention, the original Mask R-CNN is described below, in which a classical Feature Pyramid Network (FPN) is used to fuse the five feature maps of different resolutions extracted by the backbone. This fusion approach has been widely used since it was proposed and has indeed improved network performance. It fuses high-level semantic information with low-level spatial information, making the originally unbalanced information of the feature maps more balanced. In ResNet, the fifth-layer feature map has the richest semantics and the strongest representation, but because its resolution is so small, its spatial information is almost lost. The feature pyramid, however, makes up for this lack of spatial information to a certain extent by fusing it with lower-level feature maps; in turn, its rich semantic information can provide semantic guidance for the low-level feature maps and reduce the noise information they contain.
The Feature Pyramid Network (FPN) therefore uses high-level semantic information to guide low-level spatial information, reducing noise, improving the information quality of the feature maps, and balancing the information differences between them. The key is the highest-level feature map of the network, whose semantic information is the strongest but which, owing to the design of the backbone, contains no multi-scale information. In the current structure, a pyramid structure can extract feature information at different scales, but the structure is flattened and lacks global context information.
Moreover, the early Feature Pyramid Network (FPN) fuses feature maps through a Bottom-Up structure, making the feature map information more balanced and suppressing noise. However, the feature pyramid only fuses the feature map information; it neither selects information nor applies any emphasis. As a result, noise or unimportant information within the feature maps is also preserved, which is equivalent to introducing noise interference into later processing in the network. If the information of the feature maps could be screened and weighted during fusion, suppressing unimportant and less important information while enhancing the weight of useful information, the effect of feature fusion would be more prominent and noise interference in subsequent processing could be avoided.
Details of embodiments of the present invention are described below.
As shown in fig. 1-2, a first embodiment of the present invention provides an image instance segmentation method, including the steps of:
s11, extracting features input into a convolutional neural network image, and obtaining five feature images with different resolutions;
s12, extracting multi-scale information of the feature map through a self-attention module to obtain feature maps with different scales;
in this embodiment, as shown in fig. 3, the self-attention module merges three features at different pyramid scales through a U-shaped structure (e.g., a feature pyramid). To better extract the multi-scale context information of the image through the pyramid structure, the pyramid uses three different convolution kernels for feature extraction: 3×3, 5×5, and 7×7. The three feature maps at different scales extract scale information for large, medium, and small targets from the original feature map, enriching its information hierarchy, while at the same time the structure brings pixel-level attention to the feature map. The self-attention module can thus introduce finer pixel-level attention information into the network while fusing multi-scale information, which provides more accurate semantic guidance for the low-level feature maps in subsequent feature fusion. Of course, it should be noted that other types of U-shaped structures may also be used, and such solutions all fall within the scope of the present invention.
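As a rough numpy illustration of the multi-scale idea (not the patented module): the learned 3×3, 5×5, and 7×7 convolution branches are stood in for by simple mean filters of the same sizes, and a random array stands in for the backbone feature map, just to show how growing kernel sizes capture context at coarser scales:

```python
import numpy as np

def mean_filter(fmap, k):
    """Crude k×k mean filter (zero padding), standing in for a learned k×k conv branch."""
    p = k // 2
    padded = np.pad(fmap, p)           # zero-pad so output keeps the input size
    H, W = fmap.shape
    out = np.empty_like(fmap, dtype=float)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

rng = np.random.default_rng(0)
fmap = rng.random((16, 16))            # stand-in for the top-level feature map
# Small / medium / large receptive fields, mirroring the 3×3, 5×5, 7×7 branches
scales = [mean_filter(fmap, k) for k in (3, 5, 7)]
multi_scale = np.stack(scales)         # stacked multi-scale context, shape (3, 16, 16)
```

Each slice of `multi_scale` summarizes the same spatial location at a different context size, which is the information-hierarchy enrichment the pyramid branches provide.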
S13, based on the feature maps at different scales, point-multiplying them with the feature map whose dimensionality has been adjusted by a 1×1 convolution, thereby introducing pixel-level attention to the feature map;
in this embodiment, the feature map of the last layer of ResNet-101 is input to this module and split into two paths. One path extracts multi-scale information through the pyramid structure; the 4×4, 8×8, and 16×16 labels after each pyramid level denote the resolution of the feature map after that level of processing. The other path has its dimensionality adjusted by a 1×1 convolution. Because the feature map has little resolution at input, using large convolution kernels here does not introduce excessive computation into the network. The pyramid structure extracts information at different scales step by step and combines context of different scales precisely in a cascading fashion. After the two paths of processing, the two feature maps are point-multiplied, which introduces pixel-level attention to the original feature map.
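A minimal numpy sketch of the two-path point multiplication (an assumed illustration, not the patented network): random weights stand in for the learned 1×1 convolution, and a sigmoid of the channel mean stands in for the pyramid branch's per-pixel attention output:

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, H, W = 8, 4, 4, 4
fmap = rng.random((C_in, H, W))        # stand-in for the top-level backbone feature map

# Path 1: 1×1 convolution adjusting the channel dimension
# (a 1×1 conv is just a per-pixel channel mixing; random weights are a stand-in)
w = rng.random((C_out, C_in))
path1 = np.einsum('oc,chw->ohw', w, fmap)

# Path 2: pyramid branch producing a per-pixel attention map in (0, 1)
# (sigmoid of the channel mean is an assumed stand-in for the real branch)
attn = 1.0 / (1.0 + np.exp(-fmap.mean(axis=0)))   # shape (H, W)

# Point multiplication introduces pixel-level attention (broadcasts over channels)
out = path1 * attn
```

Each spatial position of `path1` is scaled by its own attention value, which is exactly the pixel-level (rather than channel-level) weighting described above.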
Preferably, global pooling can be performed after the point multiplication of the two feature maps, so that the feature map contains the strongest global information; this branch provides global information for the module, further improving its performance.
S14, compressing the spatial dimensions of the feature map into real numbers, and weighting the lower-layer feature map based on these per-channel real numbers;
in this embodiment, as shown in fig. 4, in the path enhancement structure the real numbers obtained by compressing the feature map have a global receptive field, and since their number matches the channel dimension of the input feature map, they can characterize the global distribution over the feature dimension; the global receptive field can thus be transferred to the next layer of feature maps, and such global information is important for many tasks. The compression is computed as:
z_c = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
wherein z_c represents the output real-number array, H and W are the height and width of the feature map, i, j are the coordinates of the feature map, and u_c is channel c of the feature map. This converts the H×W×C input feature map into a 1×1×C output, and the result represents the global information, in the spatial dimension, of each layer of this feature map.
The high-level feature map is converted into the corresponding global information through global pooling, and these real numbers are equivalent to weight information extracted by the network. They are then point-multiplied with the bottom-level feature map, weighting the lower-level features: each channel of the low-level feature map is multiplied by the corresponding real number z_c.
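The squeeze-and-weight step can be sketched in a few lines of numpy (random arrays stand in for the real high- and low-level feature maps; the channel-wise multiplication is as described above):

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 4, 8, 8
high = rng.random((C, H, W))   # stand-in for the high-level feature map u
low  = rng.random((C, H, W))   # stand-in for the low-level feature map to be weighted

# Squeeze: z_c = (1/(H*W)) * sum_{i,j} u_c(i, j)  → one real number per channel
z = high.mean(axis=(1, 2))     # shape (C,), i.e. the 1×1×C output

# Weight the low-level map channel-wise with the pooled global descriptors
weighted = low * z[:, None, None]
```

Because `z` summarizes each channel of the high-level map over all spatial positions, multiplying it onto the low-level map emphasizes channels that the semantically rich map deems important — the spatial-dimension attention of the path enhancement structure.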
s15, fusing the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level;
in this embodiment, the above algorithm adds an adjustment in the spatial dimension, weighting the low-level feature map with the semantically rich high-level feature map and thereby providing guidance information for the low-level feature map through a fairly simple operation. This fills the gap left by the original feature pyramid in the spatial dimension and improves the overall quality of the feature fusion process.
S16, classifying the fused feature images through target detection and semantic segmentation to obtain target detection and semantic segmentation feature images;
and S17, performing image segmentation based on the obtained target detection and semantic segmentation feature maps, obtaining the segmentation result.
A second embodiment of the present invention provides an image instance segmentation device, as shown in the figures, comprising:
110, a first extraction module, configured to extract features from an image input to a convolutional neural network and obtain five feature maps with different resolutions;
the second extraction module is used for extracting multi-scale information of the feature map through the self-attention module to obtain feature maps with different scales;
in this embodiment, as shown in fig. 3, the self-attention module merges three features at different pyramid scales through a U-shaped structure (e.g., a feature pyramid). To better extract the multi-scale context information of the image through the pyramid structure, the pyramid uses three different convolution kernels for feature extraction: 3×3, 5×5, and 7×7. The three feature maps at different scales extract scale information for large, medium, and small targets from the original feature map, enriching its information hierarchy, while at the same time the structure brings pixel-level attention to the feature map. The self-attention module can thus introduce finer pixel-level attention information into the network while fusing multi-scale information, which provides more accurate semantic guidance for the low-level feature maps in subsequent feature fusion. Of course, it should be noted that other types of U-shaped structures may also be used, and such solutions all fall within the scope of the present invention.
A point multiplication module, configured to point-multiply the obtained feature maps at different scales with a feature map whose dimensionality has been adjusted by a 1×1 convolution, so as to introduce pixel-level attention to the feature map;
In this embodiment, the feature map of the last layer of ResNet-101 is input to the algorithm module and split into two paths: one path extracts multi-scale information through the pyramid structure, where the 4×4, 8×8, and 16×16 after each pyramid level denote the resolution of the feature map after that level of processing; the other path has its dimension adjusted by a 1×1 convolution. Because the feature map already has a low resolution at input time, using large convolution kernels here does not introduce excessive computation into the network. The pyramid structure extracts information of different scales step by step and combines the context information of the different scales in a cascading manner. After the two paths of processing, the two feature maps are dot-multiplied, which introduces pixel-level attention to the original feature map.
Preferably, global pooling may be performed after the dot multiplication of the two feature maps so that the feature map contains the strongest global information; this branch provides global information to the module and further improves its performance.
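The two-path structure and the optional global-pooling branch can be sketched as follows. This is a NumPy stand-in: the pyramid branch is stubbed out as the identity, the 1×1 convolution weights are random placeholders, and all array shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
feat = rng.standard_normal((C, H, W))  # last-layer feature map (e.g. from ResNet-101)

# Path 1: multi-scale pyramid features (stubbed out as the identity here).
pyramid_feat = feat

# Path 2: a 1x1 convolution adjusts the channel dimension; on a (C, H, W) tensor
# this is a matrix multiply over the channel axis.
w1x1 = rng.standard_normal((C, C)) / np.sqrt(C)
adjusted = np.einsum('oc,chw->ohw', w1x1, feat)

# Dot (element-wise) multiplication of the two paths introduces pixel-level attention.
attended = pyramid_feat * adjusted

# Optional global-pooling branch contributes the strongest global information.
global_info = feat.mean(axis=(1, 2), keepdims=True)  # shape (C, 1, 1)
out = attended + global_info                          # broadcast over H x W

print(out.shape)  # (8, 4, 4)
```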
140: a weighting module, configured to compress the spatial dimensions of the feature map into real numbers, and to weight the lower-level feature map based on these real numbers;
In this embodiment, as shown in fig. 4, the real number array obtained from the feature map has a global receptive field, and its output dimension matches the number of input channels, so it can represent the global distribution over the feature dimensions; this global receptive field can thus be transferred to the next-layer feature map, and such global information is important for many tasks.
z_c = (1/(H×W)) · Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
wherein z_c represents the output real number array, H and W are the height and width of the feature map, i, j represent the coordinates of the feature map, and u_c is the c-th channel of the feature map. This operation converts the H×W×C input feature map into a 1×1×C output, and the result represents the global information of each channel of the feature map over the spatial dimensions.
The high-level feature map is converted into the corresponding global information through global pooling, and these real numbers are equivalent to weight information extracted by the network. They are dot-multiplied with the bottom-level feature map to weight the lower-level feature map, computed by the following formula:
F'_{low,c}(i, j) = z_c · F_{low,c}(i, j)
where F_{low,c} is the c-th channel of the low-level feature map and z_c is the real number obtained above.
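The squeeze-and-weight step above reduces to a per-channel spatial mean followed by a channel-wise multiplication. A minimal NumPy sketch, where the array shapes and the names `high`/`low` are illustrative assumptions:

```python
import numpy as np

def squeeze(feat):
    """Compress spatial dims: z_c = (1/(H*W)) * sum_{i,j} u_c(i, j),
    turning an H x W x C map into a 1 x 1 x C array of real numbers."""
    return feat.mean(axis=(0, 1), keepdims=True)

rng = np.random.default_rng(0)
H, W, C = 8, 8, 4
high = rng.standard_normal((H, W, C))  # high-level feature map (rich semantics)
low = rng.standard_normal((H, W, C))   # low-level feature map to be weighted

z = squeeze(high)       # shape (1, 1, C): one global real number per channel
weighted_low = low * z  # channel-wise weighting of the low-level map (broadcast)

print(z.shape, weighted_low.shape)
```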
150: a fusion module, configured to fuse the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level;
In this embodiment, the above algorithm adds an adjustment in the spatial dimension: the high-level feature map with rich semantic information is used to weight the low-level feature map, providing guidance information for the low-level feature map with relatively simple processing. This fills the gap of the original feature pyramid in the spatial dimension and improves the overall quality of the feature fusion.
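A minimal sketch of the fusion step. The patent specifies only "fusing", so the element-wise addition and the nearest-neighbour upsampling of the high-level map are assumptions for illustration:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour upsampling (a stand-in for learned upsampling)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

rng = np.random.default_rng(0)
C = 4
high = rng.standard_normal((4, 4, C))  # high-level map, lower resolution
low = rng.standard_normal((8, 8, C))   # low-level map, higher resolution

z = high.mean(axis=(0, 1))             # global pooling -> per-channel weights
weighted_low = low * z                 # weight the low-level map channel-wise
new_low = weighted_low + upsample2x(high)  # fuse with the upsampled high-level map

print(new_low.shape)  # (8, 8, 4): the new feature map corresponding to the low level
```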
160: a classification module, configured to classify the fused feature maps through target detection and semantic segmentation to obtain target detection and semantic segmentation feature maps;
170: a segmentation module, configured to perform image segmentation based on the obtained target detection and semantic segmentation feature maps to obtain a segmentation result.
A third embodiment of the present invention provides an image instance segmentation apparatus, including a memory and a processor, where the memory stores a computer program, and the processor is configured to run the computer program to implement the image instance segmentation method.
The fourth embodiment of the present invention further provides a storage medium, where a computer program is stored, where the computer program can be executed by a processor of a device where the storage medium is located, so as to implement the image instance segmentation method.
In the several embodiments provided by the present invention, it should be understood that the provided apparatus and method may be implemented in other manners. The apparatus and method embodiments described above are merely illustrative; for example, the flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present invention may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, an electronic device, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (4)
1. An image instance segmentation method, comprising the steps of:
extracting features of an image input to a convolutional neural network, and obtaining five feature maps with different resolutions;
extracting multi-scale information of the feature maps through a self-attention module to obtain feature maps with different scales; the feature maps with different scales are obtained through a U-shaped structure, the U-shaped structure fuses features of different pyramid scales, and the pyramid extracts features using different convolution kernels to obtain the feature maps;
based on the obtained feature maps with different scales, dot-multiplying them with the feature map whose dimension has been adjusted by a 1×1 convolution, thereby introducing pixel-level attention to the feature maps; obtaining the feature map containing the strongest global information through global pooling processing on the basis of the feature maps with different scales and the feature map whose dimension has been adjusted by the 1×1 convolution; specifically, the feature map of the last layer of ResNet-101 is input to the algorithm module and split into two paths, wherein one path extracts multi-scale information through a pyramid structure, the 4×4, 8×8 and 16×16 after each pyramid level denoting the resolution of the feature map after that level of processing, and the other path adjusts the dimension through a 1×1 convolution; the pyramid structure gradually extracts information of different scales and combines the context information of the different scales in a cascading manner; after the two paths of processing, the two feature maps are dot-multiplied, and pixel-level attention is introduced to the feature maps through the dot multiplication; compressing the spatial dimensions of the feature map into real numbers, and weighting the low-level feature map based on these real numbers;
fusing the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level; the weighting process comprises:
z_c = (1/(H×W)) · Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
wherein z_c represents the output real number array, H and W are the height and width of the feature map, i, j represent the coordinates of the feature map, and u_c is the c-th channel of the feature map; the H×W×C input feature map is converted into a 1×1×C output, and the result represents the global information corresponding to each channel of the feature map over the spatial dimensions;
classifying the fused feature maps through target detection and semantic segmentation to obtain target detection and semantic segmentation feature maps;
and performing image segmentation based on the obtained target detection and semantic segmentation feature maps to obtain a segmentation result.
2. An image instance segmentation apparatus, comprising:
the first extraction module is configured to extract features of an image input to a convolutional neural network and obtain five feature maps with different resolutions;
the second extraction module is configured to extract multi-scale information of the feature maps through the self-attention module to obtain feature maps with different scales; the feature maps with different scales are obtained through a U-shaped structure, the U-shaped structure fuses features of different pyramid scales, and the pyramid extracts features using different convolution kernels to obtain the feature maps;
the dot-multiplication module is configured to dot-multiply the obtained feature maps with different scales with the feature map whose dimension has been adjusted by a 1×1 convolution, so as to introduce pixel-level attention to the feature maps; the dot-multiplication module is specifically configured to dot-multiply the obtained feature maps with different scales with the feature map whose dimension has been adjusted by the 1×1 convolution, and to obtain the feature map containing the strongest global information through global pooling processing of the dot-multiplied feature maps; specifically, the feature map of the last layer of ResNet-101 is input to the algorithm module and split into two paths, wherein one path extracts multi-scale information through a pyramid structure, the 4×4, 8×8 and 16×16 after each pyramid level denoting the resolution of the feature map after that level of processing, and the other path adjusts the dimension through a 1×1 convolution; the pyramid structure gradually extracts information of different scales and combines the context information of the different scales in a cascading manner; after the two paths of processing, the two feature maps are dot-multiplied, and pixel-level attention is introduced to the feature maps through the dot multiplication;
the weighting module is configured to compress the spatial dimensions of the feature map into real numbers and to weight the low-level feature map based on these real numbers; the weighting process comprises:
z_c = (1/(H×W)) · Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
wherein z_c represents the output real number array, H and W are the height and width of the feature map, i, j represent the coordinates of the feature map, and u_c is the c-th channel of the feature map; the H×W×C input feature map is converted into a 1×1×C output, and the result represents the global information corresponding to each channel of the feature map over the spatial dimensions;
the fusion module is configured to fuse the weighted feature map with the high-level feature map to obtain a new feature map corresponding to the low level;
the classification module is configured to classify the fused feature maps through target detection and semantic segmentation to obtain target detection and semantic segmentation feature maps;
the segmentation module is configured to perform image segmentation based on the obtained target detection and semantic segmentation feature maps to obtain a segmentation result.
3. An image instance segmentation apparatus comprising a memory and a processor, the memory having a computer program stored therein, the processor being operable to execute the computer program to implement an image instance segmentation method as claimed in claim 1.
4. A storage medium storing a computer program executable by a processor of a device in which the storage medium is located to implement an image instance segmentation method according to claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010798878.XA CN111915613B (en) | 2020-08-11 | 2020-08-11 | Image instance segmentation method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010798878.XA CN111915613B (en) | 2020-08-11 | 2020-08-11 | Image instance segmentation method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111915613A CN111915613A (en) | 2020-11-10 |
CN111915613B true CN111915613B (en) | 2023-06-13 |
Family
ID=73284821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010798878.XA Active CN111915613B (en) | 2020-08-11 | 2020-08-11 | Image instance segmentation method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111915613B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112446308B (en) * | 2020-11-16 | 2024-09-13 | 北京科技大学 | Pedestrian detection method based on semantic enhancement multi-scale feature pyramid fusion |
CN112396115B (en) * | 2020-11-23 | 2023-12-22 | 平安科技(深圳)有限公司 | Attention mechanism-based target detection method and device and computer equipment |
CN112700462B (en) * | 2020-12-31 | 2024-09-17 | 北京迈格威科技有限公司 | Image segmentation method and device, electronic equipment and storage medium |
CN112819032B (en) * | 2021-01-11 | 2023-10-27 | 平安科技(深圳)有限公司 | Multi-model-based slice feature classification method, device, equipment and medium |
CN112906525B (en) * | 2021-02-05 | 2024-10-18 | 广州市百果园信息技术有限公司 | Age identification method and device and electronic equipment |
CN113159061B (en) * | 2021-03-10 | 2022-02-08 | 哈尔滨工业大学 | Actual tunnel surrounding rock fragment identification method based on example segmentation |
CN113421267B (en) * | 2021-05-07 | 2024-04-12 | 江苏大学 | Point cloud semantic and instance joint segmentation method and system based on improved PointConv |
CN113469942B (en) * | 2021-06-01 | 2022-02-22 | 天津大学 | CT image lesion detection method |
CN113591859A (en) * | 2021-06-23 | 2021-11-02 | 北京旷视科技有限公司 | Image segmentation method, apparatus, device and medium |
CN113344935B (en) * | 2021-06-30 | 2023-02-03 | 山东建筑大学 | Image segmentation method and system based on multi-scale difficulty perception |
CN113610085B (en) * | 2021-10-10 | 2021-12-07 | 成都千嘉科技有限公司 | Character wheel image identification method based on attention mechanism |
CN114913325B (en) * | 2022-03-24 | 2024-05-10 | 北京百度网讯科技有限公司 | Semantic segmentation method, semantic segmentation device and computer program product |
CN114511046A (en) * | 2022-04-19 | 2022-05-17 | 阿里巴巴(中国)有限公司 | Object recognition method and device |
CN114972376B (en) * | 2022-05-16 | 2023-08-25 | 北京医准智能科技有限公司 | Coronary calcified plaque segmentation method, segmentation model training method and related device |
CN118334322B (en) * | 2024-06-12 | 2024-08-27 | 浙江大华技术股份有限公司 | Camouflage target detection method, camouflage target detection device, computer equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104866855A (en) * | 2015-05-07 | 2015-08-26 | 华为技术有限公司 | Image feature extraction method and apparatus |
CN106157319A (en) * | 2016-07-28 | 2016-11-23 | 哈尔滨工业大学 | The significance detection method that region based on convolutional neural networks and Pixel-level merge |
CN107330873A (en) * | 2017-05-05 | 2017-11-07 | 浙江大学 | Objective evaluation method for quality of stereo images based on multiple dimensioned binocular fusion and local shape factor |
CN109525290A (en) * | 2019-01-11 | 2019-03-26 | 电子科技大学 | Real number feedback iteration channel estimation methods based on MIMO-FBMC system |
CN110084210A (en) * | 2019-04-30 | 2019-08-02 | 电子科技大学 | The multiple dimensioned Ship Detection of SAR image based on attention pyramid network |
CN110096994A (en) * | 2019-04-28 | 2019-08-06 | 西安电子科技大学 | A kind of small sample PolSAR image classification method based on fuzzy label semanteme priori |
CN111242071A (en) * | 2020-01-17 | 2020-06-05 | 陕西师范大学 | Attention remote sensing image target detection method based on anchor frame |
CN111292330A (en) * | 2020-02-07 | 2020-06-16 | 北京工业大学 | Image semantic segmentation method and device based on coder and decoder |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104866855A (en) * | 2015-05-07 | 2015-08-26 | 华为技术有限公司 | Image feature extraction method and apparatus |
CN106157319A (en) * | 2016-07-28 | 2016-11-23 | 哈尔滨工业大学 | The significance detection method that region based on convolutional neural networks and Pixel-level merge |
CN107330873A (en) * | 2017-05-05 | 2017-11-07 | 浙江大学 | Objective evaluation method for quality of stereo images based on multiple dimensioned binocular fusion and local shape factor |
CN109525290A (en) * | 2019-01-11 | 2019-03-26 | 电子科技大学 | Real number feedback iteration channel estimation methods based on MIMO-FBMC system |
CN110096994A (en) * | 2019-04-28 | 2019-08-06 | 西安电子科技大学 | A kind of small sample PolSAR image classification method based on fuzzy label semanteme priori |
CN110084210A (en) * | 2019-04-30 | 2019-08-02 | 电子科技大学 | The multiple dimensioned Ship Detection of SAR image based on attention pyramid network |
CN111242071A (en) * | 2020-01-17 | 2020-06-05 | 陕西师范大学 | Attention remote sensing image target detection method based on anchor frame |
CN111292330A (en) * | 2020-02-07 | 2020-06-16 | 北京工业大学 | Image semantic segmentation method and device based on coder and decoder |
Non-Patent Citations (1)
Title |
---|
Pyramid Feature Attention Network for Saliency Detection; Ting Zhao, Xiangqian Wu; CVPR; 1-10 *
Also Published As
Publication number | Publication date |
---|---|
CN111915613A (en) | 2020-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111915613B (en) | Image instance segmentation method, device, equipment and storage medium | |
Tong et al. | Saliency detection with multi-scale superpixels | |
CN112966691B (en) | Multi-scale text detection method and device based on semantic segmentation and electronic equipment | |
CN111553837B (en) | Artistic text image generation method based on neural style migration | |
CN105184763B (en) | Image processing method and device | |
CN112150493B (en) | Semantic guidance-based screen area detection method in natural scene | |
CN110334762B (en) | Feature matching method based on quad tree combined with ORB and SIFT | |
CN112800964B (en) | Remote sensing image target detection method and system based on multi-module fusion | |
CN111462126A (en) | Semantic image segmentation method and system based on edge enhancement | |
CN108898145A (en) | A kind of image well-marked target detection method of combination deep learning | |
CN109978848B (en) | Method for detecting hard exudation in fundus image based on multi-light-source color constancy model | |
CN111738055B (en) | Multi-category text detection system and bill form detection method based on same | |
CN112365514A (en) | Semantic segmentation method based on improved PSPNet | |
CN105160355B (en) | A kind of method for detecting change of remote sensing image based on region correlation and vision word | |
CN111914947B (en) | Image instance segmentation method, device, equipment and storage medium based on feature fusion | |
CN106778768A (en) | Image scene classification method based on multi-feature fusion | |
CN105046278B (en) | The optimization method of Adaboost detection algorithm based on Haar feature | |
CN110706235A (en) | Far infrared pedestrian detection method based on two-stage cascade segmentation | |
CN112802039B (en) | Panorama segmentation method based on global edge attention | |
Song et al. | Carvingnet: content-guided seam carving using deep convolution neural network | |
Al-Amaren et al. | RHN: A residual holistic neural network for edge detection | |
CN113392854A (en) | Image texture feature extraction and classification method | |
CN116071339A (en) | Product defect identification method based on improved whale algorithm optimization SVM | |
CN114708615A (en) | Human body detection method based on image enhancement in low-illumination environment, electronic equipment and storage medium | |
CN111242120B (en) | Character detection method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||