CN114758135A - Unsupervised image semantic segmentation method based on attention mechanism - Google Patents


Info

Publication number
CN114758135A
Authority
CN
China
Prior art keywords
image
pixel
semantic segmentation
feature
loss function
Legal status
Pending
Application number
CN202210504797.3A
Other languages
Chinese (zh)
Inventor
钱丽萍
王寅生
钱江
王晨熙
王倩
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
2022-05-10
Filing date
2022-05-10
Publication date
2022-07-15
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202210504797.3A
Publication of CN114758135A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

An attention mechanism-based unsupervised image semantic segmentation method comprises: removing part of the redundant background information of an RGB image through an attention module; and extracting image semantic information with an unsupervised image semantic segmentation network, which assigns the same label to pixels belonging to the same category in the image. The method addresses the wasted computing power and reduced segmentation precision caused by large amounts of redundant background information in unsupervised image semantic segmentation.

Description

Unsupervised image semantic segmentation method based on attention mechanism
Technical Field
The invention belongs to the field of computer vision, and particularly relates to an unsupervised image semantic segmentation method based on an attention mechanism.
Background
As the volume of image data grows explosively, the workload of image processing has grown correspondingly. To reduce this workload as much as possible, researchers keep developing more automatic and more accurate image processing algorithms, with the aim of bringing machine sensors close to the recognition ability of the human eye, so that machines can analyze and judge autonomously from what they "see" and reduce the burden on people.
The visual attention mechanism, one of the important characteristics of the human visual system, lets a person quickly select a few salient objects in a complex scene to attend to, and is an important means by which humans process large amounts of external information with limited resources. Because of this efficient data-screening capability, introducing the visual attention mechanism into computer information processing, and especially into image processing, which must handle massive amounts of data, has important theoretical value and practical significance. The attention mechanism originated in research on human attention; with the rapid development of deep learning, it has become a core technology in fields where deep learning is widely applied, such as natural language processing, statistical learning and image detection. An attention-based structural model can not only record the positional relations among pieces of information but also measure the importance of different information features through their weights. By judging which features are relevant and which are irrelevant, it establishes dynamic weight parameters that strengthen key information and weaken useless information, which improves the efficiency of deep learning algorithms and remedies some shortcomings of traditional deep learning.
Image semantic segmentation is an image segmentation technique that classifies each pixel of an image according to the semantic content it expresses. Because it retains only the semantic content of an image, it can greatly compress the image's memory footprint; it is an important application of semantic communication and one of the most important basic technologies for visual intelligence. The quality of semantic segmentation reflects how well an intelligent system understands its application scene, so it has great application value in important fields such as autonomous driving, robot cognition and navigation, security monitoring, and unmanned aerial vehicle landing systems. However, these tasks typically require a large amount of labeled data matching the scene under consideration to achieve reliable performance. Collecting and labeling a large data set for each new task and domain is very expensive, time-consuming and error-prone. Moreover, in many cases sufficient training data may not be available for various reasons, while the abundant data from other domains and tasks is only partly relevant to the task in question. These considerations apply especially to semantic segmentation, whose learning frameworks require large amounts of manually labeled data that are very expensive to acquire. To address the difficulty of labeling training data, more flexible and more extensible unsupervised semantic segmentation methods are attracting growing attention, and unsupervised semantic segmentation is a future development trend.
Although current semantic segmentation technology has achieved remarkable results, its development time has been short and its complexity is high, so many problems, such as insufficient segmentation precision and low algorithm efficiency, must be solved urgently before it can be fully applied in real life. Research on and improvement of semantic segmentation algorithms is therefore of great significance for the development of computer vision technology.
Disclosure of Invention
To overcome the defect that traditional unsupervised image semantic segmentation requires a large amount of computing power, and wastes part of it, on images containing a large amount of redundant background information, the invention provides an unsupervised image semantic segmentation method based on an attention mechanism. A spatial attention mechanism reduces the redundant background information of the image while retaining its key region, and an unsupervised image semantic segmentation model then segments the image, effectively improving the efficiency and precision of existing unsupervised image semantic segmentation.
In order to solve the technical problem, the invention adopts the following technical scheme:
An unsupervised image semantic segmentation method based on an attention mechanism comprises the following steps:
S1: obtain the affine transformation matrix θ: first initialize θ as the identity transformation matrix, then keep correcting the parameters of θ through the loss function until the desired affine transformation matrix is obtained;
S2: after the RGB image U is input, use the affine transformation matrix obtained in the previous stage to calculate, for each coordinate point of the output feature map V, the corresponding coordinate point of the input image U:
$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \tag{1}$$
where $(x_i, y_i)$ represents the position of pixel $i$, the superscript $s$ denotes a coordinate point of the input feature map, the superscript $t$ denotes a coordinate point of the output feature map, and $A_\theta$ is the affine transformation matrix obtained in stage S1;
S3: calculate the gray value of each specific pixel in the output feature map by interpolation:
$$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \max\left(0, 1 - \left|x_i^s - m\right|\right) \max\left(0, 1 - \left|y_i^s - n\right|\right) \tag{2}$$
where W and H are the width and height of the input image, $V_i^c$ is the gray value of pixel $i$ at position $(x_i^s, y_i^s)$ in channel $c$, and $U_{nm}^c$ is the gray value of point $(n, m)$ in channel $c$ of the input feature map;
S4: extract deep features $\{x_n\}$ from the input image with a feature extraction module;
S5: a one-dimensional (1D) convolutional layer computes the feature response vectors $\{r_n\}$ in the q-dimensional class space;
S6: apply batch normalization to the feature response vectors $\{r_n\}$ along each axis of the pixel class space to obtain $\{r'_n\}$, so that $\{r'_n\}$ has zero mean and unit variance;
S7: using the argmax function, take the dimension with the maximum value in $r'_n$ as the class label $c_n$ of each pixel;
S8: calculate the loss function and back-propagate to update the parameters. The loss function consists of a feature similarity loss and a spatial continuity loss, and μ is a weight that balances the two terms:
$$L = L_{sim}(\{r'_n, c_n\}) + \mu L_{con}(\{r'_n\}) \tag{3}$$
where the feature similarity loss is
$$L_{sim}(\{r'_n, c_n\}) = \sum_{n=1}^{N} \sum_{i=1}^{q} -\delta(i - c_n) \ln r'_{n,i}$$
with
$$\delta(t) = \begin{cases} 1, & t = 0 \\ 0, & \text{otherwise} \end{cases}$$
where N is the total number of pixels in the input image, and the response map $\{r_n = W_c x_n\}$ is obtained by applying a linear classifier with $W_c \in \mathbb{R}^{q \times p}$ and is then normalized to $\{r'_n\}$;
The spatial continuity loss is defined as
$$L_{con}(\{r'_n\}) = \sum_{\xi=1}^{W-1} \sum_{\eta=1}^{H-1} \left\| r'_{\xi+1,\eta} - r'_{\xi,\eta} \right\|_1 + \left\| r'_{\xi,\eta+1} - r'_{\xi,\eta} \right\|_1$$
where $r'_{\xi,\eta}$ denotes the value of the response map $\{r'_n\}$ at pixel $(\xi, \eta)$.
Applying the spatial continuity loss removes the excess pixel labels caused by complex patterns or textures.
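Steps S4 to S7 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the deep features $\{x_n\}$ are taken as given (the feature extraction module is not specified), the "one-dimensional convolutional layer" is modeled as the per-pixel linear map $r_n = W_c x_n$ (equivalent to a 1×1 convolution), and batch normalization uses no learnable scale or shift and an assumed `eps = 1e-5`. The helper names `linear_response`, `batch_norm` and `argmax_labels` are hypothetical.

```python
import math


def linear_response(features, W_c):
    """S5 (assumed form): r_n = W_c x_n for every pixel n, i.e. a 1x1 convolution.

    features: N vectors of length p; W_c: q rows of length p.
    """
    return [[sum(w * x for w, x in zip(row, x_n)) for row in W_c] for x_n in features]


def batch_norm(responses, eps=1e-5):
    """S6: normalize each of the q axes to zero mean and unit variance over all N pixels.

    No learnable scale or shift is applied (a simplifying assumption).
    """
    n, q = len(responses), len(responses[0])
    out = [list(r) for r in responses]
    for i in range(q):
        column = [r[i] for r in responses]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for k in range(n):
            out[k][i] = (column[k] - mean) / math.sqrt(var + eps)
    return out


def argmax_labels(normalized):
    """S7: c_n is the index of the largest entry of r'_n (first index on ties)."""
    return [max(range(len(r)), key=r.__getitem__) for r in normalized]


# Toy example: N = 4 pixels, p = 2 feature channels, q = 3 classes.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
W_c = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
r_norm = batch_norm(linear_response(features, W_c))
labels = argmax_labels(r_norm)
print(labels)  # [0, 1, 2, 0]
```

The labels $c_n$ produced this way are what the feature similarity loss in step S8 uses as targets, which is how the method trains without manual annotation.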
Further, in step S1, the affine transformation matrix θ for a two-dimensional image is a 2 × 3 matrix.
Still further, in step S2, the coordinate mapping runs from the target picture to the input picture. Because sampling collects pixels from different coordinates of the original picture into the target picture, each sampling pass traverses the coordinates of the target picture, while the sampled coordinates in the original picture are not fixed and generally fall between pixels. In this way, the coordinate point on the input feature map corresponding to each position of the transformed output feature map is obtained.
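The target-to-source mapping of step S2 can be illustrated with a small sketch. It assumes, as in spatial transformer networks, that coordinates are normalized to [-1, 1] (the patent does not state its coordinate convention), keeps θ fixed instead of learning it, and uses hypothetical helper names (`identity_theta`, `normalized`, `affine_grid`). The loop runs over the output grid, which is why every output position receives exactly one source location.

```python
def identity_theta():
    """S1: theta starts as the 2x3 identity affine matrix."""
    return [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0]]


def normalized(index, size):
    """Map a pixel index in [0, size - 1] to the normalized range [-1, 1]."""
    return -1.0 + 2.0 * index / (size - 1) if size > 1 else 0.0


def affine_grid(theta, out_h, out_w):
    """S2: for each target point (x_t, y_t) return (x_s, y_s) = A_theta (x_t, y_t, 1)^T.

    The loop traverses the output (target) grid, so every output pixel gets
    exactly one sampling location in the input, usually between input pixels.
    """
    grid = []
    for row in range(out_h):
        y_t = normalized(row, out_h)
        line = []
        for col in range(out_w):
            x_t = normalized(col, out_w)
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            line.append((x_s, y_s))
        grid.append(line)
    return grid


identity = affine_grid(identity_theta(), 3, 3)
print(identity[0][0])  # (-1.0, -1.0): the identity maps each point to itself

# Halving the scale terms makes the output sample only the central half of the
# input, so the border (background) region is never read.
zoom = affine_grid([[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]], 3, 3)
print(zoom[0][0])  # (-0.5, -0.5)
```

Starting θ at the identity, as step S1 prescribes, means the attention module initially passes the image through unchanged; training then moves θ toward a transform that concentrates on the key region.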
Further, in step S3, when $|x_i^s - m|$ or $|y_i^s - n|$ is greater than 1, the corresponding max() term is 0, so only the 4 points surrounding $(x_i^s, y_i^s)$ determine the gray value of the target pixel; and the smaller $|x_i^s - m|$ and $|y_i^s - n|$ are (i.e. the closer the sampling point is to point $(n, m)$), the larger the weight of that point.
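The statement that only the 4 neighbouring points matter can be checked numerically. The sketch below, for a single channel in pixel coordinates with 0-based indices (m = column, n = row), evaluates the interpolation formula of step S3 both literally, summing over every input pixel, and using only the integer neighbours of $(x_i^s, y_i^s)$; the two agree. Function names are hypothetical, and the conversion from normalized to pixel coordinates (for example $x = (x_{norm} + 1)(W - 1)/2$) is an assumption left out of the code.

```python
import math


def sample_full(U, x_s, y_s):
    """Interpolation formula taken literally: sum over every input pixel (row n, column m)."""
    total = 0.0
    for n, row in enumerate(U):
        for m, value in enumerate(row):
            total += value * max(0.0, 1 - abs(x_s - m)) * max(0.0, 1 - abs(y_s - n))
    return total


def sample_4(U, x_s, y_s):
    """Same value, visiting only the (at most) 4 integer neighbours of (x_s, y_s)."""
    h, w = len(U), len(U[0])
    m0, n0 = math.floor(x_s), math.floor(y_s)
    total = 0.0
    for n in (n0, n0 + 1):
        for m in (m0, m0 + 1):
            if 0 <= n < h and 0 <= m < w:  # neighbours outside the image contribute 0
                total += U[n][m] * max(0.0, 1 - abs(x_s - m)) * max(0.0, 1 - abs(y_s - n))
    return total


U = [[0.0, 10.0],
     [20.0, 30.0]]  # one channel, H = W = 2, pixel coordinates
print(sample_full(U, 0.5, 0.5))  # 15.0: the midpoint averages the 4 neighbours
print(sample_4(U, 0.5, 0.5))     # 15.0
print(sample_full(U, 1.0, 0.0))  # 10.0: an integer location returns that pixel
```

The 4-neighbour form is what keeps the sampler cheap: its cost per output pixel is constant instead of proportional to W × H.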
Further, in step S8, the aim of the feature similarity loss is to strengthen the similarity of similar features: once image pixels are clustered by their features, feature vectors within the same class should be similar to one another, while feature vectors of different classes should differ. Minimizing this loss updates the network weights so that features more effective for classification are extracted.
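The loss of step S8 can be sketched as below. The patent's loss equations survive only as images, so this is an illustration under assumptions: $L_{sim}$ is implemented as softmax cross-entropy between each normalized response vector $r'_n$ and its argmax label $c_n$, $L_{con}$ as the L1 distance between vertically and horizontally adjacent response vectors, and both are averaged rather than summed. Names such as `similarity_loss`, `continuity_loss` and `total_loss` are hypothetical.

```python
import math


def softmax(v):
    """Numerically stable softmax of one response vector."""
    top = max(v)
    exps = [math.exp(a - top) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_loss(r_flat, labels):
    """L_sim (assumed form): mean cross-entropy between softmax(r'_n) and the label c_n."""
    return -sum(math.log(softmax(r)[c]) for r, c in zip(r_flat, labels)) / len(labels)


def continuity_loss(r_map):
    """L_con (assumed form): mean L1 distance between adjacent response vectors.

    r_map is an H x W grid whose entries are q-dimensional vectors r'_{xi,eta}.
    """
    h, w = len(r_map), len(r_map[0])

    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    terms = [l1(r_map[i + 1][j], r_map[i][j]) for i in range(h - 1) for j in range(w)]
    terms += [l1(r_map[i][j + 1], r_map[i][j]) for i in range(h) for j in range(w - 1)]
    return sum(terms) / len(terms) if terms else 0.0


def total_loss(r_map, labels, mu=1.0):
    """L = L_sim + mu * L_con, with labels given in row-major pixel order."""
    r_flat = [r for row in r_map for r in row]
    return similarity_loss(r_flat, labels) + mu * continuity_loss(r_map)


flat_map = [[[1.0, 0.0], [1.0, 0.0]],
            [[1.0, 0.0], [1.0, 0.0]]]  # 2 x 2 pixels, q = 2, spatially constant
print(continuity_loss(flat_map))  # 0.0: a constant map pays no continuity penalty
```

A larger μ favours spatially smoother label maps, while a smaller μ lets the feature similarity term dominate; the patent does not give a value for μ.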
The beneficial effects of the invention are: a spatial attention mechanism reduces the redundant background information of the image while retaining its key region, and an unsupervised image semantic segmentation model then segments the image, effectively improving the efficiency and precision of existing unsupervised image semantic segmentation.
Drawings
FIG. 1 is a flow chart of an unsupervised image semantic segmentation method based on an attention mechanism according to the present invention.
FIG. 2 is a schematic flow chart of an unsupervised image semantic segmentation method based on the attention mechanism according to another embodiment of the present invention.
Detailed Description
The invention will be further illustrated with reference to specific embodiments. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
Referring to fig. 1 and 2, an unsupervised image semantic segmentation method based on an attention mechanism includes the following steps:
S1: obtain the affine transformation matrix θ: first initialize θ as the identity transformation matrix, then keep correcting the parameters of θ through the loss function until the desired affine transformation matrix is obtained; for a two-dimensional image, θ is a 2 × 3 matrix;
S2: after the RGB image U is input, use the affine transformation matrix obtained in the previous stage to calculate, for each coordinate point of the output feature map V, the corresponding coordinate point of the input image U:
$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \tag{1}$$
where $(x_i, y_i)$ represents the position of pixel $i$, the superscript $s$ denotes a coordinate point of the input feature map, the superscript $t$ denotes a coordinate point of the output feature map, and $A_\theta$ is the affine transformation matrix obtained in the first stage. This step gives, for every position of the transformed output feature map, the corresponding coordinate point on the input feature map;
S3: calculate the gray value of each specific pixel in the output feature map by interpolation:
$$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \max\left(0, 1 - \left|x_i^s - m\right|\right) \max\left(0, 1 - \left|y_i^s - n\right|\right) \tag{2}$$
where W and H are the width and height of the input image, $V_i^c$ is the gray value of pixel $i$ at position $(x_i^s, y_i^s)$ in channel $c$, and $U_{nm}^c$ is the gray value of point $(n, m)$ in channel $c$ of the input feature map. When $|x_i^s - m|$ or $|y_i^s - n|$ is greater than 1, the corresponding max() term is 0, so only the 4 points surrounding $(x_i^s, y_i^s)$ determine the gray value of the target pixel; and the smaller $|x_i^s - m|$ and $|y_i^s - n|$ are (i.e. the closer to point $(n, m)$), the larger the weight;
S4: extract deep features $\{x_n\}$ from the input image with a feature extraction module;
S5: a one-dimensional (1D) convolutional layer computes the response vectors $\{r_n\}$ of the features in the q-dimensional class space;
S6: apply batch normalization to the response vectors along each axis of the class space to obtain $\{r'_n\}$, so that $\{r'_n\}$ has zero mean and unit variance;
S7: using the argmax function, take the dimension with the maximum value in $r'_n$ as the class label $c_n$ of each pixel;
S8: calculate the loss function and back-propagate to update the parameters. The loss function consists of a feature similarity loss and a spatial continuity loss, and μ is a weight that balances the two terms:
$$L = L_{sim}(\{r'_n, c_n\}) + \mu L_{con}(\{r'_n\}) \tag{3}$$
where the feature similarity loss is
$$L_{sim}(\{r'_n, c_n\}) = \sum_{n=1}^{N} \sum_{i=1}^{q} -\delta(i - c_n) \ln r'_{n,i}$$
with
$$\delta(t) = \begin{cases} 1, & t = 0 \\ 0, & \text{otherwise} \end{cases}$$
Here $c_n$ is the class label, obtained by assigning a class ID to each response vector with the argmax function; N is the total number of pixels in the input image; and the response map $\{r_n = W_c x_n\}$ is obtained by applying a linear classifier with $W_c \in \mathbb{R}^{q \times p}$ and is then normalized to $\{r'_n\}$ with zero mean and unit variance. The goal of this loss is to enhance the similarity of similar features: once image pixels are clustered by their features, feature vectors within the same class should be similar to one another, while feature vectors of different classes should differ. Minimizing this loss updates the network weights so that features more effective for classification are extracted;
the spatial continuity loss is defined as
$$L_{con}(\{r'_n\}) = \sum_{\xi=1}^{W-1} \sum_{\eta=1}^{H-1} \left\| r'_{\xi+1,\eta} - r'_{\xi,\eta} \right\|_1 + \left\| r'_{\xi,\eta+1} - r'_{\xi,\eta} \right\|_1$$
where $r'_{\xi,\eta}$ denotes the value of the response map $\{r'_n\}$ at pixel $(\xi, \eta)$.
By applying the spatial continuity loss, excess pixel labels caused by complex patterns, textures or other factors can be removed to some extent.
To back-propagate the loss, the associated gradients are defined as follows:
[Gradient equations: reproduced only as images in the source document]
Correspondingly, the remaining gradient takes a form similar to the above.
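Because the gradient equations are not recoverable from the source, the following illustrates one standard result only and is not the patent's formulas: if the feature similarity loss is a softmax cross-entropy (an assumption), the gradient of one pixel's loss with respect to $r'_{n,i}$ is $\mathrm{softmax}(r'_n)_i$ minus 1 when $i = c_n$ and minus 0 otherwise. The sketch checks this against central finite differences; the function names are hypothetical.

```python
import math


def softmax(v):
    """Numerically stable softmax of one response vector."""
    top = max(v)
    exps = [math.exp(a - top) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_ce(r, c):
    """Cross-entropy of one pixel: -ln softmax(r')[c]."""
    return -math.log(softmax(r)[c])


def pixel_ce_grad(r, c):
    """Analytic gradient: softmax(r')_i minus 1 for the labelled class, minus 0 otherwise."""
    return [p - (1.0 if i == c else 0.0) for i, p in enumerate(softmax(r))]


def finite_difference_grad(r, c, h=1e-6):
    """Central finite differences, used only to check pixel_ce_grad."""
    grad = []
    for i in range(len(r)):
        up = list(r)
        down = list(r)
        up[i] += h
        down[i] -= h
        grad.append((pixel_ce(up, c) - pixel_ce(down, c)) / (2 * h))
    return grad


r, c = [0.3, -1.2, 2.0], 1
print(pixel_ce_grad(r, c))           # analytic gradient
print(finite_difference_grad(r, c))  # numerical estimate, should agree closely
```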
The method of this embodiment mitigates the wasted computing power caused by the overly large redundant background area that traditional unsupervised image semantic segmentation methods must process. It uses the spatial attention mechanism as a tool for removing image redundancy and the unsupervised image semantic segmentation algorithm to generate the semantic image, and can thus solve the wasted computing power and reduced precision that excessive redundant background information causes in unsupervised image semantic segmentation.
The embodiments described in this specification are merely exemplary of implementations of the inventive concepts and are provided for illustrative purposes only. The scope of the present invention should not be construed as being limited to the particular forms set forth in the embodiments, but is to be accorded the widest scope consistent with the principles and equivalents thereof as contemplated by those skilled in the art.

Claims (5)

1. An attention-based unsupervised image semantic segmentation method, characterized by comprising the following steps:
S1: obtaining an affine transformation matrix θ by first initializing θ as an identity transformation matrix and then continuously correcting the parameters of θ through a loss function until the desired affine transformation matrix is obtained;
S2: after the RGB image U is input, calculating, according to the affine transformation matrix obtained in the previous stage, the coordinate point of the input image U corresponding to each coordinate point of the feature map V, as follows:
$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \tag{1}$$
where $(x_i, y_i)$ represents the position of pixel $i$, the superscript $s$ denotes a coordinate point of the input feature map, the superscript $t$ denotes a coordinate point of the output feature map, and $A_\theta$ is the affine transformation obtained in step S1;
S3: calculating the gray value of a specific pixel in the output feature map by interpolation, as follows:
$$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \max\left(0, 1 - \left|x_i^s - m\right|\right) \max\left(0, 1 - \left|y_i^s - n\right|\right) \tag{2}$$
where W and H are the width and height of the input image, $V_i^c$ is the gray value of pixel $i$ at position $(x_i^s, y_i^s)$ in channel $c$, and $U_{nm}^c$ is the gray value of point $(n, m)$ in channel $c$ of the input feature map;
S4: extracting deep features $\{x_n\}$ from the input image by using the feature extraction module;
S5: computing, with a one-dimensional (1D) convolutional layer, the feature response vectors $\{r_n\}$ in a q-dimensional class space;
S6: applying batch normalization to the feature response vectors $\{r_n\}$ along each axis of the pixel class space to obtain $\{r'_n\}$, so that $\{r'_n\}$ has zero mean and unit variance;
S7: selecting the dimension with the maximum value in $r'_n$ as the class label $c_n$ of each pixel;
S8: calculating a loss function and performing back propagation to update the parameters, wherein the loss function consists of a feature similarity loss and a spatial continuity loss, μ being a weight that balances the two terms, defined as follows:
$$L = L_{sim}(\{r'_n, c_n\}) + \mu L_{con}(\{r'_n\}) \tag{3}$$
wherein the feature similarity loss function is as follows:
$$L_{sim}(\{r'_n, c_n\}) = \sum_{n=1}^{N} \sum_{i=1}^{q} -\delta(i - c_n) \ln r'_{n,i}$$
with
$$\delta(t) = \begin{cases} 1, & t = 0 \\ 0, & \text{otherwise} \end{cases}$$
where N is the total number of pixels in the input image, and the response map $\{r_n = W_c x_n\}$ is obtained by applying a linear classifier with $W_c \in \mathbb{R}^{q \times p}$ and is then normalized to $\{r'_n\}$;
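The spatial continuity loss function is defined as follows:
$$L_{con}(\{r'_n\}) = \sum_{\xi=1}^{W-1} \sum_{\eta=1}^{H-1} \left\| r'_{\xi+1,\eta} - r'_{\xi,\eta} \right\|_1 + \left\| r'_{\xi,\eta+1} - r'_{\xi,\eta} \right\|_1$$
where $r'_{\xi,\eta}$ denotes the value of the response map $\{r'_n\}$ at pixel $(\xi, \eta)$;
applying the spatial continuity loss removes the excess pixel labels caused by complex patterns or textures.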
2. The unsupervised image semantic segmentation method based on the attention mechanism as claimed in claim 1, wherein in step S1 the affine transformation matrix θ for a two-dimensional image is a 2 × 3 matrix.
3. The unsupervised image semantic segmentation method based on the attention mechanism as claimed in claim 1 or 2, wherein in step S2 the coordinate mapping maps the target picture to the input picture: because the mapping collects pixels from different coordinates of the original image into the target picture, each sampling pass traverses the coordinates of the target picture while the sampled coordinates of the original image are not fixed, so that the coordinate point on the input feature map corresponding to each position of the transformed output feature map can be obtained.
4. The unsupervised image semantic segmentation method based on the attention mechanism as claimed in claim 1 or 2, wherein in step S3, when $|x_i^s - m|$ or $|y_i^s - n|$ is greater than 1, the corresponding max() term is 0, so only the 4 points surrounding $(x_i^s, y_i^s)$ determine the gray value of the target pixel; and the smaller $|x_i^s - m|$ and $|y_i^s - n|$ are (i.e. the closer to point $(n, m)$), the larger the weight.
5. The unsupervised image semantic segmentation method based on the attention mechanism as claimed in claim 1 or 2, wherein in step S8 the goal of the feature similarity loss function is to enhance the similarity of similar features: once image pixels are clustered by their features, feature vectors in the same class should be similar to one another and feature vectors in different classes should differ, and minimizing this loss function updates the network weights so that features more effective for classification are extracted.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210504797.3A CN114758135A (en) 2022-05-10 2022-05-10 Unsupervised image semantic segmentation method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210504797.3A CN114758135A (en) 2022-05-10 2022-05-10 Unsupervised image semantic segmentation method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN114758135A 2022-07-15

Family

ID=82334627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210504797.3A Pending CN114758135A (en) 2022-05-10 2022-05-10 Unsupervised image semantic segmentation method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN114758135A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117079061A (en) * 2023-10-17 2023-11-17 四川迪晟新达类脑智能技术有限公司 Target detection method and device based on attention mechanism and Yolov5


Similar Documents

Publication Publication Date Title
CN114241282A (en) Knowledge distillation-based edge equipment scene identification method and device
CN111899172A (en) Vehicle target detection method oriented to remote sensing application scene
CN109086777B (en) Saliency map refining method based on global pixel characteristics
CN111008978B (en) Video scene segmentation method based on deep learning
CN109034035A (en) Pedestrian's recognition methods again based on conspicuousness detection and Fusion Features
CN111027377B (en) Double-flow neural network time sequence action positioning method
CN110728694B (en) Long-time visual target tracking method based on continuous learning
CN108427919B (en) Unsupervised oil tank target detection method based on shape-guided saliency model
CN110443279B (en) Unmanned aerial vehicle image vehicle detection method based on lightweight neural network
WO2022218396A1 (en) Image processing method and apparatus, and computer readable storage medium
CN113592894B (en) Image segmentation method based on boundary box and co-occurrence feature prediction
CN115205633A (en) Automatic driving multi-mode self-supervision pre-training method based on aerial view comparison learning
CN112785636A (en) Multi-scale enhanced monocular depth estimation method
Alsanad et al. Real-time fuel truck detection algorithm based on deep convolutional neural network
CN114758135A (en) Unsupervised image semantic segmentation method based on attention mechanism
Yin Object Detection Based on Deep Learning: A Brief Review
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN116579616B (en) Risk identification method based on deep learning
CN105844299B (en) A kind of image classification method based on bag of words
CN110363240B (en) Medical image classification method and system
CN111950476A (en) Deep learning-based automatic river channel ship identification method in complex environment
CN108765384B (en) Significance detection method for joint manifold sequencing and improved convex hull
CN116386042A (en) Point cloud semantic segmentation model based on three-dimensional pooling spatial attention mechanism
CN113223037B (en) Unsupervised semantic segmentation method and unsupervised semantic segmentation system for large-scale data
CN112784800B (en) Face key point detection method based on neural network and shape constraint

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination