CN114758135A

CN114758135A - Unsupervised image semantic segmentation method based on attention mechanism

Info

Publication number: CN114758135A
Application number: CN202210504797.3A
Authority: CN
Inventors: 钱丽萍; 王寅生; 钱江; 王晨熙; 王倩
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2022-05-10
Filing date: 2022-05-10
Publication date: 2022-07-15

Abstract

An unsupervised image semantic segmentation method based on an attention mechanism comprises: removing part of the redundant background information of an RGB image through an attention module; and extracting image semantic information with an unsupervised image semantic segmentation network, which assigns the same label to pixels belonging to the same category in the image. The method addresses the waste of computing power and the loss of segmentation precision caused by a large amount of redundant background information in unsupervised image semantic segmentation.

Description

Unsupervised image semantic segmentation method based on attention mechanism

Technical Field

The invention belongs to the field of computer vision, and particularly relates to an unsupervised image semantic segmentation method based on an attention mechanism.

Background

As the amount of image data grows explosively, the workload required for image processing has grown with it. To reduce this workload as much as possible, researchers keep developing more automatic and more accurate image processing algorithms. The goal is for a machine's sensors to approach the recognition capability of the human eye and to analyze and judge autonomously from what they 'see', thereby reducing the burden on people.

As one of the important characteristics of the human visual system, the visual attention mechanism lets a person quickly select a few salient objects from a complex scene, and is an important means of processing a large amount of external information with limited resources. Because of this efficient data-screening capability, introducing the visual attention mechanism into computer information processing, particularly into image processing, which must handle massive amounts of data, has significant theoretical value and practical significance. The attention mechanism originates from research on human attention. With the rapid development of deep learning, it has been widely applied and has become a core technique in fields such as natural language processing, statistical learning, and image detection. A structural model based on the attention mechanism can not only record the positional relationships among pieces of information, but also measure the importance of different information features according to their weights. By judging which information features are relevant and which are irrelevant, it establishes dynamic weight parameters that strengthen key information and weaken useless information, which improves the efficiency of deep learning algorithms and remedies some shortcomings of traditional deep learning.

Image semantic segmentation is an image segmentation technique that classifies each pixel according to the semantic content it expresses in the image. Because it retains only the semantic content of the image, it can greatly reduce the memory the image occupies; it is an important application of semantic communication and one of the most important basic technologies of visual intelligence. The quality of semantic segmentation determines how well an intelligent system understands its application scene, so it has great application value in fields such as autonomous driving, robot cognition and navigation, security monitoring, and unmanned aerial vehicle landing systems. However, these tasks typically require a large amount of labeled data matching the scene under consideration to achieve reliable performance. Collecting and labeling a large data set for each new task and domain is expensive, time-consuming, and error-prone. Furthermore, in many cases sufficient training data may not be available for various reasons, while the large amount of data available for other domains and tasks is only somewhat relevant to the task in question. These considerations apply particularly to semantic segmentation, where the learning framework requires a large amount of manually labeled data that is very expensive to acquire. To address the difficulty of labeling training data, more flexible and more extensible unsupervised semantic segmentation methods are being designed and are receiving increasing attention; unsupervised semantic segmentation is a trend of future development.

Although current semantic segmentation technology has achieved remarkable results, it has been under development for only a short time and is highly complex. Many problems, such as insufficient segmentation precision and low algorithm efficiency, must be solved before it can be fully applied in everyday life. Research on and improvement of semantic segmentation algorithms is therefore of great significance for the development of computer vision technology.

Disclosure of Invention

Conventional unsupervised image semantic segmentation requires a large amount of computing power and, for images with a large amount of redundant background information, wastes part of it. To overcome this defect, the invention provides an unsupervised image semantic segmentation method based on an attention mechanism: a spatial attention mechanism reduces the redundant background information of the image while retaining its key region, and an unsupervised image semantic segmentation model then segments the image, which effectively improves the efficiency and precision of existing unsupervised image semantic segmentation.

To solve this technical problem, the invention adopts the following technical solution:

An unsupervised image semantic segmentation method based on an attention mechanism comprises the following steps:

S1: obtaining an affine transformation matrix θ, wherein θ is first initialized to the identity transformation matrix and its parameters are then continuously corrected through a loss function until the desired affine transformation matrix is obtained;

S2: after the RGB image U is input, calculating, for each coordinate point of the feature map V, the position of the corresponding coordinate point on the input image U according to the affine transformation matrix obtained in the previous step, as follows:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \tag{1}$$

wherein $(x_i, y_i)$ represents the position of pixel i, the superscript s denotes a coordinate point of the input feature image, the superscript t denotes a coordinate point of the output feature map, and $A_\theta$ is the affine transformation matrix obtained in step S1;

S3: calculating the gray value of a given pixel in the output feature map by interpolation, as follows:

$$V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \max(0, 1 - |x_i^s - m|) \max(0, 1 - |y_i^s - n|) \tag{2}$$

where W and H represent the width and height of the input image, $V_i^c$ is the gray value of pixel i at position $(x_i^t, y_i^t)$ in channel c, and $U_{nm}^c$ is the gray value at point (n, m) of channel c on the input feature map;

S4: extracting deep features $\{x_n\}$ from the input image using a feature extraction module;

S5: computing, with a one-dimensional (1D) convolutional layer, the feature response vectors $\{r_n\}$ in a q-dimensional class space;

S6: normalizing the feature response vectors $\{r_n\}$ along each axis of the pixel class space with a batch normalization function to obtain $\{r'_n\}$, so that $\{r'_n\}$ has zero mean and unit variance;

S7: using the argmax function to select, for each pixel, the dimension in which $r'_n$ has its maximum value, thereby obtaining the class label $\{c_n\}$ of each pixel;

S8: calculating a loss function and performing back propagation to update the parameters, wherein the loss function consists of a feature similarity loss and a spatial continuity loss, and μ is a weight that balances the two losses, defined as follows:

$$L = L_{sim}(\{r'_n, c_n\}) + \mu L_{con}(\{r'_n\}) \tag{3}$$

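wherein the feature similarity loss function is as follows:

$$L_{sim}(\{r'_n, c_n\}) = \sum_{n=1}^{N} \sum_{i=1}^{q} -\delta(i - c_n) \ln r'_{n,i}$$

where $\delta(t)$ equals 1 when $t = 0$ and 0 otherwise, N is the total number of pixels in the input image, and the response map $\{r_n = W_c x_n\}$ is obtained by applying a linear classifier with $W_c \in \mathbb{R}^{q \times p}$ and is then normalized to $\{r'_n\}$;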

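the spatial continuity loss function is defined as follows:

$$L_{con}(\{r'_n\}) = \sum_{\xi=1}^{W-1} \sum_{\eta=1}^{H-1} \left( \| r'_{\xi+1,\eta} - r'_{\xi,\eta} \|_1 + \| r'_{\xi,\eta+1} - r'_{\xi,\eta} \|_1 \right)$$

where $r'_{\xi,\eta}$ represents the pixel value at $(\xi, \eta)$ in the response map $\{r'_n\}$;

by applying the spatial continuity loss, excess pixel labels caused by complex patterns or textures are removed.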
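The spatial continuity loss of step S8 can be illustrated with a minimal NumPy sketch. It assumes the loss is the sum of L1 differences between each response vector and its right-hand and lower neighbours; the function name `spatial_continuity_loss` and the (H, W, q) array layout are illustrative choices, not taken from the patent.

```python
import numpy as np

def spatial_continuity_loss(r_map):
    """Sum of L1 differences between neighbouring response vectors.

    r_map has shape (H, W, q): one normalized q-dimensional response
    vector per pixel (layout is an assumption of this sketch).
    """
    base = r_map[:-1, :-1, :]            # positions that have both neighbours
    right = r_map[:-1, 1:, :]            # horizontal neighbour of each base pixel
    below = r_map[1:, :-1, :]            # vertical neighbour of each base pixel
    horizontal = np.abs(right - base).sum()
    vertical = np.abs(below - base).sum()
    return float(horizontal + vertical)

# A constant map is perfectly smooth, so the loss vanishes.
flat = np.ones((3, 4, 2))
print(spatial_continuity_loss(flat))     # 0.0

# A map with one vertical edge is penalised for the horizontal jump.
edge = np.zeros((2, 2, 1))
edge[:, 1, 0] = 1.0
print(spatial_continuity_loss(edge))     # 1.0
```

Minimizing this quantity pushes neighbouring pixels toward similar responses, which is how isolated labels produced by fine texture tend to be suppressed.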

Further, in step S1, the affine transformation matrix θ is a 2 × 3 matrix for a two-dimensional image.
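As a small illustration of step S1, the sketch below builds the identity-initialized 2 × 3 matrix and applies it to a homogeneous coordinate. The helper names `identity_theta` and `apply_affine`, and the example "zoom" matrix, are hypothetical; the loss-driven correction of θ described in S1 is not shown.

```python
import numpy as np

def identity_theta():
    """2 x 3 affine matrix that leaves every coordinate unchanged."""
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])

def apply_affine(theta, x_t, y_t):
    """Map a target coordinate (x_t, y_t) to a source coordinate (x_s, y_s)."""
    x_s, y_s = theta @ np.array([x_t, y_t, 1.0])
    return float(x_s), float(y_s)

theta = identity_theta()
print(apply_affine(theta, 0.25, -0.5))   # (0.25, -0.5)

# Hypothetical learned matrix: scaling by 0.5 makes the output sample
# only the central half of the input, i.e. it crops away the border.
zoom = np.array([[0.5, 0.0, 0.0],
                 [0.0, 0.5, 0.0]])
print(apply_affine(zoom, 1.0, 1.0))      # (0.5, 0.5)
```

Starting from the identity means the attention module initially passes the whole image through unchanged, and only departs from that as training adjusts the six parameters.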

Still further, in step S2, the coordinate mapping runs from the target picture to the input picture. Because sampling gathers pixels from varying coordinates of the original picture into the target picture, the coordinates of the target picture are traversed on every sampling pass, whereas the coordinates read from the original picture are not fixed. In this way, the coordinate point on the input feature map that corresponds to each position of the transformed output feature map is obtained.
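The target-to-source traversal of step S2 can be sketched as follows. The sketch assumes coordinates normalized to the range [-1, 1], a convention borrowed from common spatial transformer implementations rather than stated in the patent; `affine_grid` is an illustrative name.

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """For every target pixel, compute the source coordinate it samples from.

    Coordinates are normalized to [-1, 1] (an assumption of this sketch).
    Returns two (out_h, out_w) arrays: source x and source y.
    """
    ys = np.linspace(-1.0, 1.0, out_h)
    xs = np.linspace(-1.0, 1.0, out_w)
    x_t, y_t = np.meshgrid(xs, ys)                    # regular target grid
    target = np.stack([x_t, y_t, np.ones_like(x_t)])  # (3, out_h, out_w)
    source = theta @ target.reshape(3, -1)            # (2, out_h * out_w)
    return (source[0].reshape(out_h, out_w),
            source[1].reshape(out_h, out_w))

identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
x_s, y_s = affine_grid(identity, 3, 3)
print(x_s[0])      # first row of source x-coordinates
print(y_s[:, 0])   # first column of source y-coordinates
```

The target grid is always the same regular lattice, while the source coordinates move with θ, which matches the statement that the target coordinates are traversed and the source coordinates are not fixed.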

Further, in step S3, when $|x_i^s - m|$ or $|y_i^s - n|$ is greater than 1, the corresponding max() term is 0, so only the gray values of the 4 points surrounding $(x_i^s, y_i^s)$ determine the gray value of the target pixel; and the smaller $|x_i^s - m|$ and $|y_i^s - n|$ are (i.e., the closer to point (n, m)), the greater the influence and the larger the weight.
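The interpolation of step S3 can be checked numerically with a short sketch. It assumes a single channel, 0-based pixel indices, and a sampling position given in pixel units; `bilinear_sample` is an illustrative name, not one used in the patent.

```python
import numpy as np

def bilinear_sample(U, x_s, y_s):
    """Interpolate the gray value of U at the source position (x_s, y_s).

    U is one channel of shape (H, W); x_s indexes columns and y_s rows,
    both in pixel units with 0-based indices (assumptions of this sketch).
    """
    H, W = U.shape
    n = np.arange(H)[:, None]                        # row indices
    m = np.arange(W)[None, :]                        # column indices
    w_y = np.maximum(0.0, 1.0 - np.abs(y_s - n))     # zero beyond 1 pixel away
    w_x = np.maximum(0.0, 1.0 - np.abs(x_s - m))
    return float(np.sum(U * w_y * w_x))

U = np.array([[0.0, 10.0],
              [20.0, 30.0]])
print(bilinear_sample(U, 0.5, 0.5))   # 15.0, the mean of the 4 neighbours
print(bilinear_sample(U, 1.0, 0.0))   # 10.0, exactly on pixel (row 0, col 1)
```

Every pixel more than one unit away in either direction receives weight 0, so although the sum formally runs over the whole image, at most four neighbours contribute.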

Further, in step S8, the purpose of the feature similarity loss function is to enhance the similarity of similar features. Once image pixels are clustered according to their features, feature vectors within the same class should be similar to each other, while feature vectors of different classes should differ from each other. By minimizing this loss function, the network weights are updated so that more effective features are extracted for classification.
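A minimal sketch of the feature similarity loss follows, under two stated assumptions: the normalized responses are passed through a softmax before the logarithm (as common cross-entropy implementations do), and the loss is averaged over pixels rather than summed. The name `feature_similarity_loss` is illustrative.

```python
import numpy as np

def feature_similarity_loss(r_norm, labels):
    """Cross-entropy between normalized responses and their pseudo-labels.

    r_norm: (N, q) normalized response vectors, one per pixel.
    labels: (N,) integer class labels, e.g. the argmax of r_norm.
    Softmax and averaging over pixels are assumptions of this sketch.
    """
    shifted = r_norm - r_norm.max(axis=1, keepdims=True)   # numerical stability
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_prob[np.arange(len(labels)), labels]      # log-prob of own label
    return float(-picked.mean())

r = np.array([[2.0, 0.0],
              [0.0, 2.0]])
own_labels = r.argmax(axis=1)            # labels chosen by argmax
print(feature_similarity_loss(r, own_labels))
print(feature_similarity_loss(r, 1 - own_labels))   # mismatched labels cost more
```

Because the labels come from the argmax of the same responses, minimizing this loss sharpens each pixel's response toward the class it already favours, which is the self-training effect the paragraph above describes.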

The beneficial effects of the invention are as follows: a spatial attention mechanism reduces the redundant background information of the image while retaining its key region, an unsupervised image semantic segmentation model then segments the image, and the efficiency and precision of existing unsupervised image semantic segmentation are effectively improved.

Drawings

FIG. 1 is a flow chart of an unsupervised image semantic segmentation method based on an attention mechanism according to the present invention.

FIG. 2 is a schematic flow chart of an unsupervised image semantic segmentation method based on the attention mechanism according to another embodiment of the present invention.

Detailed Description

The invention will be further illustrated with reference to specific embodiments. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.

Referring to FIG. 1 and FIG. 2, an unsupervised image semantic segmentation method based on an attention mechanism includes the following steps:

S1: obtaining an affine transformation matrix θ, wherein θ is first initialized to the identity transformation matrix and its parameters are then continuously corrected through a loss function until the desired affine transformation matrix is obtained, the affine transformation matrix θ being a 2 × 3 matrix for a two-dimensional image;

S2: after the RGB image U is input, calculating, for each coordinate point of the feature map V, the position of the corresponding coordinate point on the input image U according to the affine transformation matrix obtained in the previous step, as follows:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \tag{1}$$

wherein $(x_i, y_i)$ represents the position of pixel i, the superscript s denotes a coordinate point of the input feature image, the superscript t denotes a coordinate point of the output feature map, and $A_\theta$ is the affine transformation matrix obtained in the first step; through this step, the coordinate point on the input feature map corresponding to each position of the transformed output feature map is obtained;

S3: calculating the gray value of a given pixel in the output feature map by interpolation, as follows:

$$V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \max(0, 1 - |x_i^s - m|) \max(0, 1 - |y_i^s - n|) \tag{2}$$

where W and H represent the width and height of the input image, $V_i^c$ is the gray value of pixel i at position $(x_i^t, y_i^t)$ in channel c, and $U_{nm}^c$ is the gray value at point (n, m) of channel c on the input feature map. When $|x_i^s - m|$ or $|y_i^s - n|$ is greater than 1, the corresponding max() term is 0; that is, only the gray values of the 4 points surrounding $(x_i^s, y_i^s)$ determine the gray value of the target pixel, and the smaller $|x_i^s - m|$ and $|y_i^s - n|$ are (i.e., the closer to point (n, m)), the greater the influence and the larger the weight;

S4: extracting deep features $\{x_n\}$ from the input image using a feature extraction module;

S5: computing, with a one-dimensional (1D) convolutional layer, the response vectors $\{r_n\}$ of the features in a q-dimensional class space;

S6: normalizing the response vectors along each axis of the class space with a batch normalization function to obtain $\{r'_n\}$, so that $\{r'_n\}$ has zero mean and unit variance;

S7: using the argmax function to select, for each pixel, the dimension in which $r'_n$ has its maximum value, thereby obtaining the class label $\{c_n\}$ of each pixel;

S8: calculating a loss function and performing back propagation to update the parameters, wherein the loss function consists of a feature similarity loss and a spatial continuity loss, and μ is a weight that balances the two losses, specifically defined as follows:

$$L = L_{sim}(\{r'_n, c_n\}) + \mu L_{con}(\{r'_n\}) \tag{3}$$

wherein the feature similarity loss function is as follows:

$$L_{sim}(\{r'_n, c_n\}) = \sum_{n=1}^{N} \sum_{i=1}^{q} -\delta(i - c_n) \ln r'_{n,i}$$

wherein

$$\delta(t) = \begin{cases} 1, & t = 0 \\ 0, & \text{otherwise} \end{cases}$$

In the formula, $c_n$ is a class label determined by assigning a class ID to a response vector using the argmax function, and N is the total number of pixels in the input image. The response map $\{r_n = W_c x_n\}$ is obtained by applying a linear classifier with $W_c \in \mathbb{R}^{q \times p}$; the response map is then normalized to $\{r'_n\}$, so that $\{r'_n\}$ has zero mean and unit variance. The purpose of this loss function is to enhance the similarity of similar features: once image pixels are clustered according to their features, feature vectors within the same class should be similar to each other, while feature vectors of different classes should differ from each other. By minimizing this loss function, the network weights are updated so that more effective features are extracted for classification;

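the spatial continuity loss function is defined as follows:

$$L_{con}(\{r'_n\}) = \sum_{\xi=1}^{W-1} \sum_{\eta=1}^{H-1} \left( \| r'_{\xi+1,\eta} - r'_{\xi,\eta} \|_1 + \| r'_{\xi,\eta+1} - r'_{\xi,\eta} \|_1 \right)$$

where $r'_{\xi,\eta}$ represents the pixel value at $(\xi, \eta)$ in the response map $\{r'_n\}$. By applying the spatial continuity loss, excess pixel labels caused by complex patterns, textures, or similar factors can be removed to some extent.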

To have the loss function propagate backwards, the associated gradient is defined as follows:

Accordingly,

similar to the above formula.
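Steps S5 to S7 can be sketched in a few lines of NumPy. The sketch treats the convolutional classifier as a per-pixel matrix product with a weight matrix of shape (q, p), and uses batch normalization without learned scale and shift; both simplifications, along with the name `classify_pixels` and the random inputs, are assumptions made for illustration.

```python
import numpy as np

def classify_pixels(x, W_c, eps=1e-5):
    """Compute normalized responses and pseudo-labels for N pixels.

    x:   (N, p) deep features, one row per pixel (step S4 output).
    W_c: (q, p) linear classifier weights.
    Returns r_norm of shape (N, q) and labels of shape (N,).
    """
    r = x @ W_c.T                                    # S5: response vectors
    mean = r.mean(axis=0)
    var = r.var(axis=0)
    r_norm = (r - mean) / np.sqrt(var + eps)         # S6: zero mean, unit variance per axis
    labels = r_norm.argmax(axis=1)                   # S7: class label per pixel
    return r_norm, labels

rng = np.random.default_rng(0)                       # illustrative random features
x = rng.normal(size=(50, 8))
W_c = rng.normal(size=(4, 8))
r_norm, labels = classify_pixels(x, W_c)
print(r_norm.shape, labels.shape)                    # (50, 4) (50,)
```

Normalizing each class axis before the argmax keeps any single axis from dominating merely because its raw responses have a larger scale.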

The method provided by this embodiment mitigates the waste of computing power that conventional unsupervised image semantic segmentation methods incur when the redundant background area is too large. It uses a spatial attention mechanism algorithm as a tool for removing image redundancy and an unsupervised image semantic segmentation algorithm to generate the semantic image, and can therefore be used to address the waste of computing power and the loss of precision that excessive redundant background information causes in unsupervised image semantic segmentation.

The embodiments described in this specification are merely exemplary of implementations of the inventive concepts and are provided for illustrative purposes only. The scope of the present invention should not be construed as being limited to the particular forms set forth in the embodiments, but is to be accorded the widest scope consistent with the principles and equivalents thereof as contemplated by those skilled in the art.

Claims

1. An unsupervised image semantic segmentation method based on an attention mechanism, characterized by comprising the following steps:

S1: obtaining an affine transformation matrix θ, wherein θ is first initialized to the identity transformation matrix and its parameters are then continuously corrected through a loss function until the desired affine transformation matrix is obtained;

S2: after the RGB image U is input, calculating, for each coordinate point of the feature map V, the position of the corresponding coordinate point on the input image U according to the affine transformation matrix obtained in the previous step, as follows:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \tag{1}$$

wherein $(x_i, y_i)$ represents the position of pixel i, the superscript s denotes a coordinate point of the input feature image, the superscript t denotes a coordinate point of the output feature map, and $A_\theta$ is the affine transformation matrix obtained in step S1;

S3: calculating the gray value of a given pixel in the output feature map by interpolation, as follows:

$$V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \max(0, 1 - |x_i^s - m|) \max(0, 1 - |y_i^s - n|) \tag{2}$$

where W and H represent the width and height of the input image, and $V_i^c$ is the gray value of pixel i at position $(x_i^t, y_i^t)$ in channel c;

S4: extracting deep features $\{x_n\}$ from the input image using a feature extraction module;

S5: computing, with a one-dimensional (1D) convolutional layer, the feature response vectors $\{r_n\}$ in a q-dimensional class space;

S6: normalizing the feature response vectors $\{r_n\}$ along each axis of the pixel class space with a batch normalization function to obtain $\{r'_n\}$, so that $\{r'_n\}$ has zero mean and unit variance;

S7: selecting, for each pixel, the dimension in which $r'_n$ has its maximum value, thereby obtaining the class label $\{c_n\}$ of each pixel;

S8: calculating a loss function and performing back propagation to update the parameters, wherein the loss function consists of a feature similarity loss and a spatial continuity loss, and μ is a weight that balances the two losses, defined as follows:

$$L = L_{sim}(\{r'_n, c_n\}) + \mu L_{con}(\{r'_n\}) \tag{3}$$

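wherein the feature similarity loss function is as follows:

$$L_{sim}(\{r'_n, c_n\}) = \sum_{n=1}^{N} \sum_{i=1}^{q} -\delta(i - c_n) \ln r'_{n,i}$$

wherein

$$\delta(t) = \begin{cases} 1, & t = 0 \\ 0, & \text{otherwise} \end{cases}$$

the spatial continuity loss function is defined as follows:

$$L_{con}(\{r'_n\}) = \sum_{\xi=1}^{W-1} \sum_{\eta=1}^{H-1} \left( \| r'_{\xi+1,\eta} - r'_{\xi,\eta} \|_1 + \| r'_{\xi,\eta+1} - r'_{\xi,\eta} \|_1 \right)$$

where $r'_{\xi,\eta}$ represents the pixel value at $(\xi, \eta)$ in the response map $\{r'_n\}$.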

2. The unsupervised image semantic segmentation method based on an attention mechanism as claimed in claim 1, wherein in step S1, the affine transformation matrix θ is a 2 × 3 matrix for a two-dimensional image.

3. The unsupervised image semantic segmentation method based on an attention mechanism as claimed in claim 1 or 2, wherein in step S2, the coordinate mapping runs from the target picture to the input picture; because sampling gathers pixels from varying coordinates of the original picture into the target picture, the coordinates of the target picture are traversed on every sampling pass, whereas the coordinates read from the original picture are not fixed, so that the coordinate point on the input feature map corresponding to each position of the transformed output feature map is obtained.

4. The unsupervised image semantic segmentation method based on an attention mechanism as claimed in claim 1 or 2, wherein in step S3, when $|x_i^s - m|$ or $|y_i^s - n|$ is greater than 1, the corresponding max() term is 0, so that only the gray values of the 4 points surrounding $(x_i^s, y_i^s)$ determine the gray value of the target pixel; and the smaller $|x_i^s - m|$ and $|y_i^s - n|$ are (i.e., the closer to point (n, m)), the greater the influence and the larger the weight.

5. The unsupervised image semantic segmentation method based on an attention mechanism as claimed in claim 1 or 2, wherein in step S8, the purpose of the feature similarity loss function is to enhance the similarity of similar features; once image pixels are clustered according to their features, feature vectors within the same class should be similar to each other and feature vectors of different classes should differ from each other, and by minimizing this loss function the network weights are updated so that more effective features are extracted for classification.