CN112950653A - Attention image segmentation method, device and medium


Info

Publication number
CN112950653A
Authority
CN
China
Prior art keywords
matrix
image
segmentation
network
attention
Prior art date
Legal status
Granted
Application number
CN202110217268.0A
Other languages
Chinese (zh)
Other versions
CN112950653B (en)
Inventor
Wang Li (王立)
Guo Zhenhua (郭振华)
Zhao Yaqian (赵雅倩)
Li Rengang (李仁刚)
Current Assignee
Shandong Yingxin Computer Technology Co Ltd
Original Assignee
Shandong Yingxin Computer Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shandong Yingxin Computer Technology Co Ltd
Priority to CN202110217268.0A
Publication of CN112950653A
Application granted
Publication of CN112950653B
Legal status: Active

Classifications

    • GPHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis > G06T7/10 Segmentation; Edge detection > G06T7/11 Region-based segmentation
    • G06T2207/00 Indexing scheme for image analysis or image enhancement > G06T2207/20 Special algorithmic details
    • G06T2207/20076 Probabilistic image processing
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20212 Image combination > G06T2207/20221 Image fusion; Image merging
    • Y02T Climate change mitigation technologies related to transportation > Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an attention image segmentation method, which comprises the following steps: convolving an image and extracting a plurality of feature maps of the image; selecting and fusing the feature maps to obtain a fused feature map; obtaining a first segmentation result of the image through an attention network and the fused feature map; selecting a segmentation network; performing a size transformation on the first segmentation result of the image to obtain region information; performing weighted fusion on the image through the segmentation network and the region information to obtain a fourth matrix; and inputting the fourth matrix into the segmentation network to obtain a second segmentation result of the image. In this way, the feature maps can be fused and weighted fusion performed within the segmentation network, improving the segmentation precision.

Description

Attention image segmentation method, device and medium
Technical Field
The present invention relates to the field of image processing, and in particular to an attention image segmentation method, apparatus and medium.
Background
Image segmentation is an important research direction in the field of computer vision and an important part of image semantic understanding. Image segmentation refers to the process of dividing an image into several regions with similar properties. In recent years image segmentation technology has developed rapidly, and related techniques such as scene object segmentation, human foreground/background segmentation, face and body parsing, and three-dimensional reconstruction are widely applied in industries such as autonomous driving, augmented reality and security monitoring.
Image segmentation divides an image into a number of mutually disjoint regions according to features such as gray level, color, spatial texture and geometric shape, so that these features show consistency or similarity within the same region and differ markedly between different regions. In brief, the object is separated from the background in an image. For grayscale images, pixels inside a region generally share similar gray levels, while pixels at region boundaries generally show gray-level discontinuities.
Generally, image segmentation needs to predict, for each pixel in an image, whether it belongs to a certain target class or scene class. Image scenes vary in complex ways (illumination, viewing angle, scale, occlusion and the like), which makes scene understanding and pixel classification very difficult.
Disclosure of Invention
The invention mainly solves the problem of segmenting images more accurately by classifying and judging image pixels accurately.
To solve this technical problem, the invention adopts the following technical scheme: an attention image segmentation method is provided, comprising the following steps:
convolving an image and extracting a plurality of feature maps of the image;
selecting and fusing the plurality of feature maps to obtain a fused feature map;
obtaining a first segmentation result of the image through an attention network and the fused feature map;
selecting a segmentation network;
performing a size transformation on the first segmentation result of the image to obtain region information;
performing weighted fusion on the image through the segmentation network and the region information to obtain a fourth matrix;
and inputting the fourth matrix into the segmentation network to obtain a second segmentation result of the image.
Preferably, the step of performing weighted fusion on the image through the segmentation network and the region information to obtain a fourth matrix further comprises:
inputting the image into the segmentation network for calculation to obtain a feature matrix;
the feature matrix comprising a first matrix, a second matrix and a third matrix;
performing weight calculation on the first matrix, the second matrix and the region information to obtain a weighting strategy;
and obtaining the fourth matrix based on the weighting strategy and the third matrix.
Preferably, the step of performing weight calculation on the first matrix, the second matrix and the region information further comprises:
obtaining the vector dimension of a first element in the second matrix;
querying the elements in the region information of the same class as the first element, and recording them as second elements;
querying the elements in the first matrix of the same class as the second elements, and recording them as third elements;
acquiring the vector dimension of the third elements;
calculating the vector inner product of the vector dimension of the first element and the vector dimension of the third elements to obtain first data;
normalizing the first data to obtain a first vector;
and returning to the step of obtaining the vector dimension of a first element in the second matrix until the second matrix has been traversed.
Preferably, the step of obtaining the fourth matrix based on the weighting strategy and the third matrix further comprises:
performing weighted fusion of all the first vectors obtained after traversing the second matrix with the third matrix, respectively, to obtain a plurality of second vectors;
and arranging the plurality of second vectors according to the positions of the first elements in the second matrix to obtain the fourth matrix.
Preferably, the step of obtaining the fourth matrix based on the weighting strategy and the third matrix further comprises:
arranging all the first vectors obtained after traversing the second matrix according to the positions of the first elements in the second matrix to obtain a weighting matrix;
and performing weighted fusion of the weighting matrix with the third matrix to obtain the fourth matrix.
Preferably, the step of selecting and fusing a plurality of feature maps further comprises: making the sizes of the feature maps the same through bilinear interpolation or a deconvolution network;
and adding the feature maps of the same size to obtain the fused feature map.
Preferably, the size transformation is performed by downsampling, so that the size of the first segmentation result is the same as the size of the image input to the segmentation network.
Preferably, the step of obtaining the first segmentation result of the image through the attention network and the fused feature map further comprises: inputting the fused feature map into the attention network;
changing the size of the fused feature map to the size of the image through bilinear interpolation;
normalizing the value range of the fused feature map through a normalization function;
and obtaining the first segmentation result of the image through an argmax function.
The present invention also provides an attention image segmentation system, comprising: an extraction module, a fusion module, a first segmentation module, a transformation module and a second segmentation module;
the extraction module is used for convolving the image through a convolution kernel and extracting a plurality of feature maps of the image;
the fusion module is used for selecting and fusing the feature maps to obtain a fused feature map;
the first segmentation module is used for obtaining a first segmentation result of the image through an attention network and the fused feature map;
the transformation module is used for selecting a segmentation network and performing a size transformation on the first segmentation result of the image to obtain region information;
the second segmentation module is used for performing weighted fusion on the image through the segmentation network and the region information to obtain a fourth matrix, and inputting the fourth matrix into the segmentation network to obtain a second segmentation result of the image.
The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the attention image segmentation method.
The invention has the following beneficial effects:
1. The attention image segmentation method can fuse feature maps and perform weighted fusion within the segmentation network, thereby improving the segmentation precision.
2. The attention image segmentation system uses an attention weighting method so that the attention result produced by the auxiliary attention network weights the features in the main segmentation network, improving the accuracy of image segmentation.
3. The computer-readable storage medium implements the element and vector inner-product calculations on the matrices, computes a coarse segmentation result through the attention network and a fine segmentation result from the coarse one; this improves calculation efficiency and avoids errors when the calculation process is implemented in software.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic diagram of an attention image segmentation method according to embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of the attention network structure in the attention image segmentation method according to embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of the segmentation network structure in the attention image segmentation method according to embodiment 1 of the present invention;
FIG. 4 is a flowchart of the weighted fusion in the attention image segmentation method according to embodiment 1 of the present invention;
FIG. 5 is a flowchart of the weight calculation method in the attention image segmentation method according to embodiment 1 of the present invention;
FIG. 6 is a schematic structural diagram of an attention image segmentation system according to embodiment 2 of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It should be noted that, in the description of the present invention, the following terms are used:
The attention mechanism has made significant breakthroughs in the image domain in recent years and has proven beneficial for improving model performance. The attention mechanism itself is modeled on the perception mechanisms of the human brain and eye.
The essence of the attention mechanism is to locate the information of interest, suppress unneeded information, and let the network focus on the more relevant regions. More specifically, in image processing the attention mechanism learns the relationship between a given pixel and the pixels at all other positions in an image (including distant ones), and the learned relationship features are used to assist in segmenting the details of the image, making the segmentation result more accurate and finer.
ResNet is a residual network; its residual blocks can be understood as sub-networks that are stacked to form a very deep network.
When a convolution kernel is used for image processing, each pixel of the output image is formed as a weighted average of the pixels in a small region of the input image, where the weights are defined by a function; this function is called the convolution kernel.
At each convolution layer of a CNN, the data is three-dimensional and can be seen as a stack of two-dimensional maps, each of which is called a feature map.
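As a small, hypothetical illustration in PyTorch (not part of the original text): a convolution layer with 16 kernels turns a 3-channel input into a stack of 16 such feature maps:

```python
import torch
import torch.nn as nn

# 16 learned kernels -> a stack of 16 output feature maps
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
fmaps = conv(torch.randn(1, 3, 32, 32))  # shape 1 x 16 x 32 x 32
```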
In the present embodiment, featureMap denotes such a feature map;
Bilinear interpolation is the extension of linear interpolation to interpolation functions of two variables; its core idea is to perform linear interpolation in each of the two directions in turn.
As an interpolation algorithm in numerical analysis, bilinear interpolation is widely applied in signal processing and in digital image and video processing.
Deconvolution is a special forward convolution: the input image is first enlarged by zero-padding according to a certain ratio, the convolution kernel is rotated, and a forward convolution is then performed.
Normalization here is a simplifying transformation that converts a dimensional expression into a dimensionless one.
argmax is the operator that returns the argument(s) at which a function attains its maximum value.
The softmax function, also called the normalized exponential function, maps a vector of real values to a probability distribution.
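For reference, the standard definitions of these two functions (written out here for completeness; the original text does not state them explicitly) are:

```latex
\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}, \qquad
\operatorname*{argmax}_{x} f(x) = \text{the argument } x \text{ at which } f(x) \text{ attains its maximum.}
```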
It should also be noted that, in the description of the present invention:
the first segmentation result is a coarse segmentation result and the second segmentation result is a fine segmentation result;
u_new is a second vector; f denotes the vector of the first element and e denotes the vectors of the third elements;
the first data is denoted h.
Example 1
An embodiment of the present invention provides an attention image segmentation method; referring to FIG. 1, the method comprises the following steps:
s100, training an attention network, wherein the attention network is a trainable rough segmentation network; the attention network in this embodiment is based on the ResNet network, but is not limited to this type of network; the method comprises the steps of obtaining an original image to be segmented and a backbone network structure;
s110, the main network convolutes the image through a convolution kernel and extracts a characteristic graph of the image; setting convolution step length, and controlling the size of the feature image after convolution through the convolution step length; in the backbone network, the size of the feature map of the image decreases by one time after each convolution, for example, the image with the previous image of 200 × 200 becomes 100 × 100 after one convolution;
s120, performing multiple convolution on the original image to be segmented to obtain a heatMap;
the method comprises the following specific steps:
s121, performing convolution conv1 on the original image to be segmented to obtain a first image; 1/2, the first image becomes the original image to be segmented;
s122, performing secondary convolution conv2 on the first image to obtain a second image; the second image becomes 1/2 of the first image, becomes 1/4 of the original image to be segmented;
s123, carrying out convolution conv3 on the second image for three times to obtain a third image; the third image becomes 1/2 of the second image, becomes 1/8 of the original image to be segmented; outputting a first featureMap of the current image;
s124, carrying out convolution conv4 on the third image for four times to obtain a fourth image; the fourth image becomes 1/2 of the third image, becomes 1/16 of the original image to be segmented; outputting a second featureMap of the current image;
s125, performing five times of convolution conv5 on the fourth image to obtain a fifth image; the fifth image becomes 1/2 of the fourth image, becomes 1/32 of the original image to be segmented; outputting a third featureMap of the current image;
s126, conv6 is performed for six times on the fifth image, the number of channels of the third featureMap is changed,
after the sixth convolution, the number of featureMap channels of the image changes, which is a common practice in ResNet networks. In the convolutional neural network, the size and the number of channels of the feature map can be changed by manual setting, respectively. In the present invention, the most common setting means (resnet50) is used to control the number of channels of the feature map and the output of the feature map size for each layer.
Obtaining a sixth image, wherein the size of the sixth image is still 1/32 of the original image to be segmented, and outputting a fourth featureMap of the current image; and the fourth featureMap of the sixth image at this time is the heatMap heat map;
typically, the last layer of feature maps in an image segmentation convolutional neural network is referred to as a heat map. The thermal map is a characteristic map in the present embodiment; because the last layer of the graph is taken out for calculation and the researcher draws the image representation of the last layer of the feature graph.
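As a concrete illustration of steps S121-S126, the following is a minimal PyTorch-style sketch of such a backbone. It is a simplified assumption (plain stride-2 convolutions and arbitrary channel counts stand in for the ResNet50 stages; all names are illustrative), intended only to show how each stage halves the spatial size and where the three featureMaps and the heatMap are produced:

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Illustrative stand-in for the ResNet-style backbone of S121-S126."""
    def __init__(self, in_ch=3, ch=64, num_classes=21):
        super().__init__()
        # each stride-2 convolution halves H and W, as conv1..conv5 do
        self.conv1 = nn.Conv2d(in_ch, ch, 3, stride=2, padding=1)  # 1/2
        self.conv2 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)     # 1/4
        self.conv3 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)     # 1/8  -> first featureMap
        self.conv4 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)     # 1/16 -> second featureMap
        self.conv5 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)     # 1/32 -> third featureMap
        self.conv6 = nn.Conv2d(ch, num_classes, 1)                 # changes channels only

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        f1 = torch.relu(self.conv3(x))    # 1/8 of the input size
        f2 = torch.relu(self.conv4(f1))   # 1/16
        f3 = torch.relu(self.conv5(f2))   # 1/32
        heat_map = self.conv6(f3)         # same spatial size as f3: the heatMap
        return f1, f2, f3, heat_map

f1, f2, f3, hm = ToyBackbone()(torch.randn(1, 3, 224, 224))
print(f1.shape, f2.shape, f3.shape, hm.shape)  # spatial sizes 28, 14, 7, 7
```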
S130, selecting a plurality of feature maps and fusing the feature maps;
The fusion method is as follows: feature maps of different sizes are fused;
Assume that the size of the first of the feature maps is (C × H × W) = 1 × 28 × 28, where C is the number of channels, H the height and W the width of the first feature map; the size of the second feature map (C × H/2 × W/2) is 1 × 14 × 14, and the size of the third (C × H/4 × W/4) is 1 × 7 × 7. During feature fusion, simple addition cannot be performed directly because the feature sizes differ; to add them, the small feature maps are upsampled, usually using a deconvolution network or bilinear interpolation, to the same size as the large feature map, after which the addition fusion is performed;
for example: the third feature map is brought to the same size as the second through bilinear interpolation (F.interpolate) or deconvolution and then added to the second feature map to realize feature fusion, and a coarse segmentation result of the image is calculated from the fused features through the attention network;
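A minimal sketch of this fusion step, assuming PyTorch (F.interpolate performs the bilinear upsampling; nn.ConvTranspose2d would be the deconvolution alternative):

```python
import torch
import torch.nn.functional as F

def fuse(small, large):
    """Upsample `small` to the spatial size of `large`, then add (the fusion of S130)."""
    up = F.interpolate(small, size=large.shape[-2:], mode="bilinear", align_corners=False)
    return up + large

f2 = torch.randn(1, 64, 14, 14)  # second featureMap
f3 = torch.randn(1, 64, 7, 7)    # third featureMap
fused = fuse(f3, f2)             # 1 x 64 x 14 x 14
```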
In this embodiment the feature fusion method is not limited to the above one and can be realized in various ways, for example:
fusing the second feature map with the first feature map, fusing the third feature map with the second feature map, fusing the third feature map with the first feature map, and the like;
or directly and independently using the first, second or third feature map, without feature fusion, to compute the coarse segmentation result of the original image to be segmented through the attention network.
The fused feature map yields a coarse segmentation result of the original image to be segmented through the attention network;
the method comprises the following specific steps:
s131, referring to FIG. 2, the attention network includes a convolution layer, a reduction layer, a softmax normalization function layer and an argmax parameter-solving layer
Inputting the fused feature map into an attention network, and restoring a second feature map to an original size through bilinear interpolation (F.interplate); normalizing the value range of the feature map to be in a range of [0, 1] through a softmax layer in the attention network, and obtaining a rough segmentation result of a second feature map through an argmax function; calculating a loss value of the attention network through a loss function; wherein the loss function adopts a cross entropy loss function which is common to image segmentation.
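A sketch of S131 under the same assumptions (hypothetical names; the fused map is assumed to carry one channel per class): the fused feature map is upsampled to the image size, normalized with softmax over the class channel, reduced with argmax, and trained with the cross-entropy loss:

```python
import torch
import torch.nn.functional as F

def attention_head(fused, image_size, labels=None):
    """Coarse segmentation of S131 (illustrative sketch)."""
    logits = F.interpolate(fused, size=image_size, mode="bilinear", align_corners=False)
    probs = F.softmax(logits, dim=1)      # value range normalized to [0, 1] per pixel
    coarse = probs.argmax(dim=1)          # coarse segmentation result (per-pixel class map)
    loss = F.cross_entropy(logits, labels) if labels is not None else None
    return coarse, loss

fused = torch.randn(1, 21, 14, 14)              # fused feature map; 21 classes assumed
labels = torch.randint(0, 21, (1, 224, 224))    # toy ground-truth labels
coarse, loss = attention_head(fused, (224, 224), labels)
```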
S200, selecting a segmentation network; there are many segmentation networks, such as FCN, SegNet and ENet, and this embodiment does not limit which type of segmentation network is adopted;
Referring to FIG. 3, this embodiment uses a network similar in structure to the attention network as the segmentation network; the specific structure of the segmentation network comprises: a Conv2d convolution layer, a bilinear interpolation (F.interpolate) layer, an output layer, a softmax layer and an argmax layer;
where Conv2d denotes a 2d convolution layer that convolves the input feature map to extract its features.
S300, referring to FIG. 4, weighted fusion is performed using the coarse-segmentation attention network to obtain a finely segmented image segmentation result;
A feature map of size C × H × W input through the segmentation network is passed through three 1 × 1 convolution layers (conv2d) respectively, outputting three feature matrices, recorded as the first matrix, the second matrix and the third matrix;
A weighting strategy is calculated from the first matrix and the second matrix through the weight calculation;
The weighting strategy specifies how each feature element in the feature map should be weighted,
and how the weighting of each feature element is applied to the third matrix.
Referring to FIG. 5, the weight calculation method comprises the following steps:
s310, after the attention network passes through the argmax layer, a rough segmentation result of the input image is obtained, the size of the rough segmentation result is equal to that of the original image, the rough segmentation result is subjected to size conversion to obtain region information, the size of the rough segmentation result is converted into C H W, the C H W is consistent with the size of a feature map input by the segmentation network, and a down-sampling method is adopted in a size conversion method;
s320, traversing the element u at each position in the second matrix, for example, the size of the second matrix is C × H × W, where C represents the number of channels, and only traversing the positions of the elements represented by H × W; assuming that element u at the first position in the second matrix is traversed, the vector dimension of the position of element u is 1 × C, that is, vector dimension C;
this is partly because the matrix C × H × W is a three-dimensional matrix, but the positions of the elements represented by H × W are traversed, so when traversing the H × W elements, the vector dimension in which the position of the element u is located is 1 × C;
s330, searching the element position associated with the element u position in the region removing information according to the element u position, finding the region associated with the element u position according to the position, and outputting;
inquiring elements with the same category as the element u in the region information, outputting position information of the elements with the same category, and if N elements with the same category as the element u exist in the region information, outputting the position information of the N elements; the position information is coordinates;
s340, obtaining a vector dimension C at the position of the element u in the second matrix, obtaining a vector dimension C x 1, representing by f, obtaining N vector dimensions C associated with the element u in the first matrix, obtaining the position information of the N elements through the step S330 in the same method as the step S330, and then obtaining the vector dimension C corresponding to the N position information; marking the vector dimension C corresponding to the N pieces of information, N & ltC & gt as e;
s350, obtaining an inner vector product of e and f, [ C1 ] · [ C × N ] ═ 1 × N ], and denoted by h, where h includes N elements, each of the N elements includes respective position information, and the position information is the same as the position information of the N elements acquired in the first matrix;
s360, solving softmax of h, and carrying out normalization to obtain a vector of [1 x N ];
Using the weighted fusion formula

u_new = Σ_{i=1}^{N} h_i · G_i

the [1 × N] vector h is weight-fused with the third matrix;
where G_i denotes the vector taken from the third matrix at the position corresponding to h_i; G_i has dimension 1 × C, and the weight-fused u_new also has dimension 1 × C;
s370, traversing the elements at each position in the third matrix, wherein the number of the elements is H, W, and substituting the elements at each position and the vector of each element in the corresponding H into the weighted fusion formula to obtain a plurality of u after weighted fusionnewFor a number u according to the position of the element u in the second matrixnewArranging to obtain a weighted and fused matrix, and marking the weighted and fused matrix as a fourth matrix, wherein the dimensionality of the fourth matrix is C x H x W;
or the vectors of each element in h are arranged according to the position of the element u in the second matrix to obtain a weighting matrix, and each vector in the weighting matrix and the corresponding element in each position in the third matrix are subjected to weighted fusion to obtain a weighted-fused matrix, wherein the matrix is a fourth matrix.
S380, the fourth matrix is taken as the input of the segmentation network, and a finer segmentation result is obtained through the segmentation network's computation. The vector u_new is obtained by scale-transforming the coarse segmentation result of the attention network and weight-fusing it with the matrices derived from the segmentation network's input. Moreover, before the coarse segmentation by the attention network, the image is convolved several times to extract its feature maps, and the feature maps obtained from different convolutions are fused; the coarse segmentation result obtained through the attention network from these fused features therefore already improves on the prior art in precision. The weighted fusion that yields u_new then weights the coarse result once more, further improving the precision, so the final segmentation using the segmentation network improves on the previous segmentation precision and a finer segmentation result is obtained.
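Pulling S310-S380 together, the following is a consolidated, self-contained sketch of this region-guided weighting, written in PyTorch under stated assumptions (toy sizes, random tensors in place of real network outputs, and a loop-based formulation chosen for clarity over speed); it is an illustration of the described computation, not the patented implementation itself:

```python
import torch
import torch.nn.functional as F

C, H, W, NUM_CLASSES = 8, 4, 4, 3
feat = torch.randn(1, C, H, W)          # feature map entering the segmentation network

# S300: three 1x1 projections produce the first, second and third matrices
proj1, proj2, proj3 = (torch.nn.Conv2d(C, C, 1) for _ in range(3))
m1 = proj1(feat)[0]                     # first matrix,  C x H x W
m2 = proj2(feat)[0]                     # second matrix, C x H x W
m3 = proj3(feat)[0]                     # third matrix,  C x H x W

# S310: coarse result downsampled to H x W ("region information": per-pixel classes)
coarse = torch.randint(0, NUM_CLASSES, (1, 1, 32, 32)).float()
region = F.interpolate(coarse, size=(H, W), mode="nearest")[0, 0].long()

fourth = torch.zeros_like(m3)           # the fourth matrix, C x H x W
for y in range(H):                      # S320: traverse the H*W positions of the second matrix
    for x in range(W):
        f_vec = m2[:, y, x]             # f: the C-dimensional vector at element u
        same = (region == region[y, x]) # S330: positions with the same class as u
        e = m1[:, same]                 # e: C x N vectors from the first matrix
        h = F.softmax(f_vec @ e, dim=0) # S340-S360: inner products, then softmax -> 1 x N
        G = m3[:, same]                 # corresponding C x N vectors of the third matrix
        fourth[:, y, x] = (G * h).sum(dim=1)  # S370: u_new = sum_i h_i * G_i

# S380: `fourth` is then fed through the segmentation network to obtain the fine result
```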
Example 2
An embodiment of the present invention further provides an attention image segmentation system; referring to FIG. 6, the system comprises: an extraction module, a fusion module, a first segmentation module, a transformation module and a second segmentation module;
the extraction module is used for convolving the image through a convolution kernel and extracting a plurality of feature maps of the image;
the fusion module is used for selecting and fusing the feature maps to obtain a fused feature map;
the first segmentation module is used for obtaining a first segmentation result of the image through an attention network and the fused feature map;
the transformation module is used for selecting a segmentation network and performing a size transformation on the first segmentation result of the image to obtain region information;
and the second segmentation module is used for performing weighted fusion on the image through the segmentation network and the region information to obtain a fourth matrix, and inputting the fourth matrix into the segmentation network to obtain a second segmentation result of the image.
Based on the same inventive concept as the method in the foregoing embodiments, this specification further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the attention image segmentation method disclosed above.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; such a program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An attention image segmentation method, characterized by comprising the following steps:
convolving an image and extracting a plurality of feature maps of the image;
selecting and fusing the plurality of feature maps to obtain a fused feature map;
obtaining a first segmentation result of the image through an attention network and the fused feature map;
selecting a segmentation network;
performing a size transformation on the first segmentation result of the image to obtain region information;
performing weighted fusion on the image through the segmentation network and the region information to obtain a fourth matrix;
and inputting the fourth matrix into the segmentation network to obtain a second segmentation result of the image.
2. An attention image segmentation method as claimed in claim 1, characterized in that the step of performing weighted fusion on the image through the segmentation network and the region information to obtain a fourth matrix further comprises:
inputting the image into the segmentation network for calculation to obtain a feature matrix;
the feature matrix comprising a first matrix, a second matrix and a third matrix;
performing weight calculation on the first matrix, the second matrix and the region information to obtain a weighting strategy;
and obtaining the fourth matrix based on the weighting strategy and the third matrix.
3. An attention image segmentation method as claimed in claim 2, characterized in that the step of performing weight calculation on the first matrix, the second matrix and the region information further comprises:
obtaining the vector dimension of a first element in the second matrix;
querying the elements in the region information of the same class as the first element, and recording them as second elements;
querying the elements in the first matrix of the same class as the second elements, and recording them as third elements;
acquiring the vector dimension of the third elements;
calculating the vector inner product of the vector dimension of the first element and the vector dimension of the third elements to obtain first data;
normalizing the first data to obtain a first vector;
and returning to the step of obtaining the vector dimension of a first element in the second matrix until the second matrix has been traversed.
4. An attention image segmentation method as claimed in claim 3, characterized in that the step of obtaining the fourth matrix based on the weighting strategy and the third matrix further comprises:
performing weighted fusion of all the first vectors obtained after traversing the second matrix with the third matrix, respectively, to obtain a plurality of second vectors;
and arranging the second vectors according to the positions of the first elements in the second matrix to obtain the fourth matrix.
5. An attention image segmentation method as claimed in claim 3, characterized in that the step of obtaining the fourth matrix based on the weighting strategy and the third matrix further comprises:
arranging all the first vectors obtained after traversing the second matrix according to the positions of the first elements in the second matrix to obtain a weighting matrix;
and performing weighted fusion of the weighting matrix with the third matrix to obtain the fourth matrix.
6. An attention image segmentation method as claimed in claim 1, characterized in that the step of selecting and fusing a plurality of feature maps further comprises: making the sizes of the feature maps the same through bilinear interpolation or a deconvolution network;
and adding the feature maps of the same size to obtain the fused feature map.
7. An attention image segmentation method as claimed in claim 1, characterized in that the size transformation makes the size of the first segmentation result the same as the size of the image input to the segmentation network by means of downsampling.
8. An attention image segmentation method as claimed in claim 1, characterized in that the step of obtaining the first segmentation result of the image through the attention network and the fused feature map further comprises: inputting the fused feature map into the attention network;
changing the size of the fused feature map to the size of the image through bilinear interpolation;
normalizing the value range of the fused feature map through a normalization function;
and obtaining the first segmentation result of the image through an argmax function.
9. An attention image segmentation system, comprising: an extraction module, a fusion module, a first segmentation module, a transformation module and a second segmentation module;
the extraction module is used for convolving the image through a convolution kernel and extracting a plurality of feature maps of the image;
the fusion module is used for selecting and fusing the feature maps to obtain a fused feature map;
the first segmentation module is used for obtaining a first segmentation result of the image through an attention network and the fused feature map;
the transformation module is used for selecting a segmentation network and performing a size transformation on the first segmentation result of the image to obtain region information;
and the second segmentation module is used for performing weighted fusion on the image through the segmentation network and the region information to obtain a fourth matrix, and inputting the fourth matrix into the segmentation network to obtain a second segmentation result of the image.
10. A computer-readable storage medium in which a computer program is stored, characterized in that the computer program, when executed by a processor, carries out the steps of the attention image segmentation method according to any one of claims 1 to 7.
CN202110217268.0A 2021-02-26 2021-02-26 Attention image segmentation method, device and medium Active CN112950653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110217268.0A CN112950653B (en) 2021-02-26 2021-02-26 Attention image segmentation method, device and medium


Publications (2)

Publication Number Publication Date
CN112950653A (en) 2021-06-11
CN112950653B (en) 2023-05-23

Family

ID=76246443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110217268.0A Active CN112950653B (en) 2021-02-26 2021-02-26 Attention image segmentation method, device and medium

Country Status (1)

Country Link
CN (1) CN112950653B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018111940A1 (en) * 2016-12-12 2018-06-21 Danny Ziyi Chen Segmenting ultrasound images
US20200372648A1 (en) * 2018-05-17 2020-11-26 Tencent Technology (Shenzhen) Company Limited Image processing method and device, computer apparatus, and storage medium
CN109992784A (en) * 2019-04-08 2019-07-09 北京航空航天大学 A kind of heterogeneous network building and distance metric method for merging multi-modal information
CN110084299A (en) * 2019-04-24 2019-08-02 中国人民解放军国防科技大学 target detection method and device based on multi-head fusion attention
CN112116605A (en) * 2020-09-29 2020-12-22 西北工业大学深圳研究院 Pancreas CT image segmentation method based on integrated depth convolution neural network
CN112258526A (en) * 2020-10-30 2021-01-22 南京信息工程大学 CT (computed tomography) kidney region cascade segmentation method based on dual attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Qingxia Meng et al.: "Factorization-Based Active Contour for Water-Land SAR Image Segmentation via the Fusion of Features", IEEE Access *
Mao Wei: "Remote sensing image segmentation fusing spectral clustering and multiple features", Software Guide (软件导刊) *
Zhou Xiaoling et al.: "Infrared and visible image fusion combining a pulse-coupled neural network with guided filtering", Acta Optica Sinica (光学学报) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115731243A (en) * 2022-11-29 2023-03-03 北京长木谷医疗科技有限公司 Spine image segmentation method and device based on artificial intelligence and attention mechanism
CN115731243B (en) * 2022-11-29 2024-02-09 北京长木谷医疗科技股份有限公司 Spine image segmentation method and device based on artificial intelligence and attention mechanism

Also Published As

Publication number Publication date
CN112950653B (en) 2023-05-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant