CN113111970B - Method for classifying images by constructing global embedded attention residual network - Google Patents


Info

Publication number
CN113111970B
Authority
CN
China
Prior art keywords
feature matrix
global
attention
transformation
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110487497.4A
Other languages
Chinese (zh)
Other versions
CN113111970A (en)
Inventor
裴炤
万志杨
张艳宁
马苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN202110487497.4A priority Critical patent/CN113111970B/en
Publication of CN113111970A publication Critical patent/CN113111970A/en
Application granted granted Critical
Publication of CN113111970B publication Critical patent/CN113111970B/en
Status: Active
Anticipated expiration

Links

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/24 — Classification techniques
    • G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods
    • G06N3/084 — Backpropagation, e.g. using gradient descent


Abstract

The present disclosure provides a method of classifying images by building a global embedded attention residual network, comprising: preprocessing the image data to be classified; constructing a global embedded attention residual network containing a global embedded attention module, wherein the global embedded attention residual network comprises 1 input layer, 1 convolution layer with a convolution kernel size of 7×7, 1 max pooling layer, a global embedded attention module, 2 fully connected layers and 1 output layer, and the global embedded attention module comprises a spatial attention sub-module based on global context and a channel attention sub-module based on coordinates; and inputting the preprocessed image data to be classified into the global embedded attention residual network for classification.

Description

Method for classifying images by constructing global embedded attention residual network
Technical Field
The present disclosure relates to an image classification method, and more particularly, to a method of classifying images by constructing a global embedded attention residual network.
Background
Image classification is an important task in the field of computer vision. At present, many scholars improve network structures by adding attention mechanisms so that images can be classified better. The most classical example, the squeeze-and-excitation network, has come to be regarded as a milestone of the attention mechanism. It operates in two steps, squeeze and excitation: it first squeezes global spatial features into channel descriptors using global average pooling, then applies a simple gating mechanism with a sigmoid excitation function, and finally multiplies each channel by its corresponding weight. By modeling the inter-dependencies between channels with the aid of 2D global pooling, the method adaptively recalibrates the feature response of each channel, providing significant performance improvements at relatively low computational cost. However, it only considers the encoding of inter-channel information and ignores the importance of position information, which is critical for capturing object structure in computer vision tasks. Later scholars tried to combine spatial attention information with channel attention information, but using only the position information of the local space brings limited benefit; it is therefore desirable to add global position information to the neural network while effectively exploiting the local position information of each channel.
Disclosure of Invention
Aiming at the defects in the prior art, the present disclosure provides a method for classifying images by constructing a global embedded attention residual network, which improves image classification by embedding global position information into channel information so that image features are extracted effectively.
In order to achieve the above object, the present disclosure provides the following technical solutions:
a method of classifying images by constructing a global embedded attention residual network, comprising the steps of:
s100: preprocessing the image data to be classified;
s200: constructing a global embedded attention residual network containing a global embedded attention module, wherein the global embedded attention residual network comprises 1 input layer, 1 convolution layer with a convolution kernel size of 7×7, 1 max pooling layer, a global embedded attention module, 2 fully connected layers and 1 output layer, and the global embedded attention module comprises a spatial attention sub-module based on global context and a channel attention sub-module based on coordinates;
s300: and inputting the preprocessed image data to be classified into a global embedded attention residual error network for classification.
Preferably, after the global embedded attention residual network is built, training samples are selected and preprocessed to train the network, verification samples are selected and preprocessed to adjust the parameters of the trained network, and test samples are selected to test the performance of the trained network.
Preferably, in step S200, the spatial attention submodule based on the global context includes:
the first subunit is used for inputting the preprocessed training samples, verification samples and test samples into the convolution layer and the pooling layer for processing and then performing global average pooling operation so as to obtain a feature matrix containing global information;
the second subunit is used for performing linear transformation on the feature matrix containing global information by adopting convolution and reshape functions with convolution kernel size of 1×1 so as to obtain a feature matrix subjected to dimension transformation processing;
the third subunit is configured to perform adaptive selection on the feature matrix subjected to the dimensional transformation processing by using a softmax function, obtain a corresponding weight of each different element on the feature matrix, and multiply the corresponding weight of each different element with the feature matrix containing global information to obtain a feature matrix containing global context feature information;
and a fourth subunit, configured to perform nonlinear transformation on the feature matrix containing the global context feature information by using batch normalization and a ReLU activation function and perform dimensional transformation by using 1×1 convolution.
Preferably, the global context-based spatial attention submodule is expressed as:
$$x = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X(i,j)$$

$$y = K\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\sum_{j=1}^{N}\frac{e^{tx_j}}{\sum_{m=1}^{N}e^{tx_m}}\,x_j\right)\right)\right)$$

where x represents the output of global average pooling, y represents the output of the global context features, H and W represent the height and width of the input image, X represents the input image, K represents a 1×1 convolution, ReLU represents the ReLU activation function, BN represents the batch normalization function, N represents the number of elements in the feature matrix, e represents the base of the natural logarithm, i, j and m index the possible positions of elements in the feature matrix, x_j and x_m represent the values of element information in the feature matrix, t represents the weight of the x matrix, and tx_j and tx_m represent the output values computed from the feature matrix after the global average pooling operation.
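For illustration, a minimal PyTorch sketch of this submodule is given below. The patent discloses only the operations (softmax-weighted global context pooling via a 1×1 convolution, batch normalization, ReLU and 1×1 dimensional transforms), so the class name, layer names, exact operator ordering and the residual-style fusion at the end are assumptions, not the disclosed implementation.

```python
import torch
import torch.nn as nn

class GlobalContextSpatialAttention(nn.Module):
    """Sketch of the global-context spatial attention submodule (assumed layout)."""

    def __init__(self, channels: int):
        super().__init__()
        # t in tx_j: a 1x1 convolution that scores every spatial element
        self.score = nn.Conv2d(channels, 1, kernel_size=1)
        # fourth subunit: BN + ReLU nonlinearity plus 1x1 dimensional transforms
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w                                   # N elements in the feature matrix
        feats = x.view(b, c, n)                     # the x_j values, reshaped
        # softmax over the N positions gives the weight of each element
        weights = torch.softmax(self.score(x).view(b, 1, n), dim=-1)
        # weighted sum over all positions -> feature matrix with global context info
        context = torch.bmm(feats, weights.transpose(1, 2)).view(b, c, 1, 1)
        # broadcast the context back onto the input (assumed residual-style fusion)
        return x + self.transform(context)
```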
Preferably, in step S200, the coordinate-based channel attention submodule includes:
a fifth subunit, configured to decompose the feature matrix containing the global context feature information after the nonlinear transformation and the dimensional transformation by adopting average pooling along the W 'direction and the H' direction, to obtain a one-dimensional feature matrix along the W 'direction and a one-dimensional feature matrix along the H' direction, where the one-dimensional feature matrix along the W 'direction includes local position information of the channel, and the one-dimensional feature matrix along the H' direction includes long-term dependency information;
a sixth subunit, configured to concatenate the one-dimensional feature matrix along the W 'direction and the one-dimensional feature matrix along the H' direction, and perform feature transformation with a convolution of 1×1 to obtain a feature matrix subjected to dimension transformation;
and a seventh subunit, configured to perform weight distribution on the feature matrix subjected to the dimension transformation by using a softmax function to obtain feature matrices with different weights, and perform feature transformation on the feature matrices with different weights by using 1×1 convolution, so as to obtain an output of the global embedded attention module.
Preferably, the feature matrix containing the global context feature information after the nonlinear transformation and the dimensional transformation is decomposed by adopting average pooling along the W 'direction and the H' direction, and the obtained one-dimensional feature matrix along the W 'direction and the obtained one-dimensional feature matrix along the H' direction are respectively expressed as:
$$z^{H}(i) = \frac{1}{W'}\sum_{j=1}^{W'} y(i,j), \qquad z^{W}(j) = \frac{1}{H'}\sum_{i=1}^{H'} y(i,j)$$

where H′ and W′ represent the height and width of the image output by the global-context spatial attention submodule, z^H and z^W represent the one-dimensional feature matrices along the H′ direction and the W′ direction respectively, and i and j index the ith row along H′ and the jth column along W′.
Preferably, the cascading of the one-dimensional feature matrix along the W 'direction and the one-dimensional feature matrix along the H' direction is performed by:
g=K(z w +z h )
where g represents the output of the cascade operation and K represents a 1 x 1 convolution.
Preferably, the output of the global embedded attention module is expressed as:
z = X(i,j) × a_c + X(i,j) × b_c

with

$$a_c = \frac{e^{A\,g^{H}}}{e^{A\,g^{H}} + e^{B\,g^{W}}}, \qquad b_c = \frac{e^{B\,g^{W}}}{e^{A\,g^{H}} + e^{B\,g^{W}}}$$

where A and B represent random numbers whose initial values are A_c and B_c, a_c and b_c represent the weights corresponding to the feature matrices with different weights, e represents the base of the natural logarithm, and g^H and g^W are the two matrices obtained by passing the dimension-transformed feature matrix of the sixth subunit through the ReLU activation function and splitting it along the spatial dimension; the dimension-transformed feature matrix has dimension R^{C×(W+H)}, and after splitting the g^H and g^W feature matrices have dimensions R^{C×H} and R^{C×W} respectively.
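A companion PyTorch sketch of the coordinate-based channel attention submodule follows. The directional pooling, the cascade g = K(z_w + z_h), the split into g^H and g^W, and the weighted output z = X(i,j)×a_c + X(i,j)×b_c are taken from the formulas above; the softmax axis and the per-branch 1×1 transforms are assumptions.

```python
import torch
import torch.nn as nn

class CoordinateChannelAttention(nn.Module):
    """Sketch of the coordinate-based channel attention submodule (assumed layout)."""

    def __init__(self, channels: int):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over W' -> z^H
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over H' -> z^W
        self.concat = nn.Conv2d(channels, channels, kernel_size=1)  # K in g = K(z_w + z_h)
        self.relu = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=1)  # branch transforms
        self.conv_w = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z_h = self.pool_h(x)                         # (B, C, H, 1): long-term dependencies along H'
        z_w = self.pool_w(x).permute(0, 1, 3, 2)     # (B, C, W, 1): local channel position info
        # cascade the two 1-D feature matrices into R^{C x (H+W)}, transform with 1x1 conv
        g = self.relu(self.concat(torch.cat([z_h, z_w], dim=2)))
        g_h, g_w = torch.split(g, [h, w], dim=2)     # split back into R^{C x H} and R^{C x W}
        g_w = g_w.permute(0, 1, 3, 2)                # restore shape (B, C, 1, W)
        # softmax weight distribution (channel axis is an assumption),
        # followed by a 1x1 feature transformation per branch
        a_c = self.conv_h(torch.softmax(g_h, dim=1))  # weights from the H branch
        b_c = self.conv_w(torch.softmax(g_w, dim=1))  # weights from the W branch
        return x * a_c + x * b_c                      # z = X(i,j)*a_c + X(i,j)*b_c
```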
Preferably, the image data to be classified and the training sample, the verification sample and the test sample are preprocessed according to the following steps:
s201: performing horizontal and vertical overturning on the image data to be classified and the image data in the training sample, the verification sample and the test sample;
s202: rotating the flipped image data clockwise or anticlockwise;
s203: scaling the rotated image data;
s204: and carrying out average reduction processing on the zoomed image data.
Preferably, in step S204, the mean-subtraction processing is performed on the scaled image data by the following formula:

$$Z = v - \frac{1}{n}\sum_{i=1}^{n} v_i$$

where Z is the image after the mean is subtracted, v is the pixel matrix of the current image, v_i is the pixel matrix of the ith of the n images, and n is the number of images, taken as the integer 100000.
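As an illustration of steps S201-S204, a torchvision-based preprocessing sketch is shown below; the 20-degree rotation and the 224×224 target size are taken from specific example 1 later in the text, while the random flip probabilities are assumptions.

```python
import torch
from torchvision import transforms

# S201-S203: flips, rotation, scaling (parameter values partly assumed)
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),          # S201: horizontal flip
    transforms.RandomVerticalFlip(),            #       vertical flip
    transforms.RandomRotation(degrees=20),      # S202: rotate up to 20 deg either way
    transforms.Resize((224, 224)),              # S203: scale to 224 x 224
    transforms.ToTensor(),
])

def subtract_mean(batch: torch.Tensor) -> torch.Tensor:
    """S204: Z = v - (1/n) * sum_i v_i, subtracting the per-pixel mean of the n images."""
    mean = batch.mean(dim=0, keepdim=True)      # average pixel matrix over the sample set
    return batch - mean
```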
Compared with the prior art, the beneficial effects that this disclosure brought are:
the method for embedding the global position information into the channel information is provided for constructing the global embedded attention residual error network, and the memory burden is greatly reduced through effective data preprocessing.
Drawings
FIG. 1 is a flow chart of a method of classifying images by building a global embedded attention residual network, provided by one embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a global embedded attention residual network provided by another embodiment of the present disclosure;
fig. 3 (a), fig. 3 (b), fig. 3 (c), fig. 3 (d) are schematic diagrams illustrating comparison of a global embedded attention residual network and an existing classification method according to another embodiment of the present disclosure.
Detailed Description
Specific embodiments of the present disclosure will be described in detail below with reference to fig. 1 to 3 (d). While specific embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It should be noted that certain terms are used throughout the description and claims to refer to particular components. Those skilled in the art will understand that the same component may be referred to by different names; the description and claims distinguish components not by name but by function. As used throughout the specification and claims, the terms "include" and "comprise" are open-ended and should therefore be interpreted as "including, but not limited to". The description hereinafter sets forth preferred embodiments for practicing the invention in accordance with its general principles, but is not intended to limit the scope of the invention. The scope of the present disclosure is defined by the appended claims.
For the purposes of promoting an understanding of the embodiments of the disclosure, reference will now be made to the embodiments illustrated in the drawings and to specific examples, without any intention of limiting the embodiments of the disclosure.
In one embodiment, as shown in fig. 1, an image classification method based on a global embedded attention residual network includes the following steps:
s100: preprocessing the image data to be classified;
s200: constructing a global embedded attention residual network containing a global embedded attention module, wherein the global embedded attention residual network comprises 1 input layer, 1 convolution layer with a convolution kernel size of 7×7, 1 max pooling layer, a global embedded attention module, 2 fully connected layers and 1 output layer, and the global embedded attention module comprises a spatial attention sub-module based on global context and a channel attention sub-module based on coordinates;
s300: and inputting the preprocessed image data to be classified into a global embedded attention residual error network for classification.
Compared with existing methods, the present disclosure reduces the model parameters of the deep neural network, makes full use of context information by embedding it into channel information, realizes global feature modeling, improves the classification effect, and solves the problems that existing networks suffer from low classification accuracy and difficulty in combining position information with channel information.
In another embodiment, after the global embedded attention residual network is built, a training sample is required to be selected and preprocessed to train the network, a verification sample is required to be selected and preprocessed to adjust parameters of the trained network, and a test sample is required to test performance of the trained network.
In this embodiment, a plurality of images are first selected from any image dataset, including COCO, ImageNet and ADNI, and sorted into different subsets used respectively as training samples, verification samples and test samples. The selected training, verification and test samples are then preprocessed. Finally, the preprocessed training samples are input into the global embedded attention residual network for training; after training is completed, the preprocessed verification samples are input into the trained network to adjust the network parameters, and the preprocessed test samples are input into the trained network to test its performance, thereby realizing image classification.
In another embodiment, in step S200, the global context-based spatial attention submodule includes:
the first subunit is used for inputting the preprocessed training samples, verification samples and test samples into the convolution layer and the pooling layer for processing and then performing global average pooling operation so as to obtain a feature matrix containing global information;
the second subunit performs linear transformation on the feature matrix containing global information by adopting convolution and reshape functions with convolution kernel size of 1 multiplied by 1 so as to obtain a feature matrix subjected to dimension transformation processing;
the third subunit performs self-adaptive selection on the feature matrix subjected to dimension transformation processing by using a softmax function to obtain the corresponding weight of each different element on the feature matrix, and multiplies the corresponding weight of each different element by the feature matrix containing global information to obtain the feature matrix containing global context feature information;
and a fourth subunit, configured to perform nonlinear transformation on the feature matrix containing the global context feature information by using batch normalization and a ReLU activation function and perform dimensional transformation by using 1×1 convolution.
In another embodiment, the global context based spatial attention submodule is expressed as:
$$x = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X(i,j)$$

$$y = K\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\sum_{j=1}^{N}\frac{e^{tx_j}}{\sum_{m=1}^{N}e^{tx_m}}\,x_j\right)\right)\right)$$

where x represents the output of global average pooling, y represents the output of the global context features, H and W represent the height and width of the input image, X represents the input image, K represents a 1×1 convolution, ReLU represents the ReLU activation function, BN represents the batch normalization function, N represents the number of elements in the feature matrix, e represents the base of the natural logarithm, i, j and m index the possible positions of elements in the feature matrix, x_j and x_m represent the values of element information in the feature matrix, t represents the weight of the x matrix, and tx_j and tx_m represent the output values computed from the feature matrix after the global average pooling operation.
In another embodiment, in step S200, the coordinate-based channel attention submodule includes:
a fifth subunit, configured to decompose the feature matrix containing the global context feature information after the nonlinear transformation and the dimensional transformation by adopting average pooling along the W 'direction and the H' direction, to obtain a one-dimensional feature matrix along the W 'direction and a one-dimensional feature matrix along the H' direction, where the one-dimensional feature matrix along the W 'direction includes local position information of the channel, and the one-dimensional feature matrix along the H' direction includes long-term dependency information;
a sixth subunit, configured to concatenate the one-dimensional feature matrix along the W 'direction and the one-dimensional feature matrix along the H' direction, and perform feature transformation with a convolution of 1×1 to obtain a feature matrix subjected to dimension transformation;
and a seventh subunit, wherein the feature matrixes with different weights are obtained by carrying out weight distribution on the feature matrixes subjected to dimension transformation by utilizing a softmax function, and the feature matrixes with different weights are respectively subjected to feature transformation by utilizing 1×1 convolution, so that the output of the global embedded attention module is obtained.
In another embodiment, the feature matrix containing the global context feature information after the nonlinear transformation and the dimensional transformation is decomposed by adopting average pooling along the W 'direction and the H' direction, and the obtained one-dimensional feature matrix along the W 'direction and the obtained one-dimensional feature matrix along the H' direction are respectively expressed as:
$$z^{H}(i) = \frac{1}{W'}\sum_{j=1}^{W'} y(i,j), \qquad z^{W}(j) = \frac{1}{H'}\sum_{i=1}^{H'} y(i,j)$$

where H′ and W′ represent the height and width of the image output by the global-context spatial attention submodule, z^H and z^W represent the one-dimensional feature matrices along the H′ direction and the W′ direction respectively, and i and j index the ith row along H′ and the jth column along W′.
In another embodiment, the cascading of the one-dimensional feature matrix along the W 'direction and the one-dimensional feature matrix along the H' direction is performed by:
g=K(z w +z h )
where g represents the output of the cascade operation and K represents a 1 x 1 convolution.
In another embodiment, the output of the global embedded attention module is expressed as:
z = X(i,j) × a_c + X(i,j) × b_c

with

$$a_c = \frac{e^{A\,g^{H}}}{e^{A\,g^{H}} + e^{B\,g^{W}}}, \qquad b_c = \frac{e^{B\,g^{W}}}{e^{A\,g^{H}} + e^{B\,g^{W}}}$$

where A and B represent random numbers whose initial values are A_c and B_c, a_c and b_c represent the weights corresponding to the feature matrices with different weights, e represents the base of the natural logarithm, and g^H and g^W are the two matrices obtained by passing the dimension-transformed feature matrix of the sixth subunit through the ReLU activation function and splitting it along the spatial dimension; the dimension-transformed feature matrix has dimension R^{C×(W+H)}, and after splitting the g^H and g^W feature matrices have dimensions R^{C×H} and R^{C×W} respectively.
In another embodiment, the image data to be classified and the training sample, the verification sample and the test sample are preprocessed according to the following steps:
s201: performing horizontal and vertical overturning on the image data to be classified and the image data in the training sample, the verification sample and the test sample;
s202: rotating the flipped image data clockwise or anticlockwise;
in this step of the process, the process is carried out,
s203: scaling the rotated image data;
s204: and carrying out average reduction processing on the zoomed image data.
In this step, the mean value reduction processing is performed on the scaled image data by the following formula:
$$Z = v - \frac{1}{n}\sum_{i=1}^{n} v_i$$

where Z is the image after the mean is subtracted, v is the pixel matrix of the current image, v_i is the pixel matrix of the ith of the n images, and n is the number of images, taken as the integer 100000.
In another embodiment, the training of the preprocessed training samples by inputting them into the global embedded attention residual network is performed by:
s501: performing linear and nonlinear operation on the preprocessed training samples in a forward propagation mode;
s502: and carrying out chain derivation on the training samples subjected to linear and nonlinear operation in a counter-propagation mode, and updating the weight information of the network according to a preset learning rate until the maximum iteration number is reached.
In this embodiment, the training samples are processed sequentially through the input layer, the convolution layer, the max pooling layer, the global embedded attention module, the fully connected layers and the output layer in a forward propagation manner; chain derivation is then performed in a back-propagation manner, and the weight information of the network is updated from the output layer back through the fully connected layers, the global embedded attention module, the max pooling layer, the convolution layer and the input layer according to a preset learning rate a (a = 0.01, decreasing stepwise as training proceeds: a = a/5 after every additional 20 iterations). This cycle repeats until the maximum number of iterations is reached.
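A minimal PyTorch sketch of this training loop (steps S501-S502) follows; torchvision's resnet50 stands in for the global embedded attention residual network, and the placeholder data loader is an assumption for illustration only.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(num_classes=1000)     # stand-in for the GEA residual network
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # preset learning rate a = 0.01
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=20, gamma=0.2)                   # a = a/5 every 20 iterations

# Placeholder loader: one random batch standing in for preprocessed training samples
train_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,)))]

max_iterations = 120                   # the examples train for 70-120 iterations
for iteration in range(max_iterations):
    for images, labels in train_loader:
        logits = model(images)         # S501: linear and nonlinear ops via forward propagation
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()                # S502: chain derivation via back-propagation
        optimizer.step()               # update the weight information of the network
    scheduler.step()                   # decay the learning rate stepwise
```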
The method of the present disclosure is further described below in connection with specific examples.
Specific example 1:
1. 100000 sample images are selected from the ImageNet image classification dataset as training samples, 10000 sample images as verification samples and 30000 sample images as test samples; the images in the training samples and the test samples do not overlap.
2. 100000 images in the training sample are preprocessed, which comprises the following steps:
a. horizontally and vertically flipping the image;
b. rotating the flipped image by 20 degrees clockwise or counterclockwise;
c. scaling the rotated image to obtain a training sample image of size 224×224;
d. performing mean subtraction on the training sample image through formula (1), where formula (1) is expressed as:

$$Z = v - \frac{1}{n}\sum_{i=1}^{n} v_i \tag{1}$$

where Z is the image after the mean is subtracted, v is the pixel matrix of the current image, v_i is the pixel matrix of the ith of the n images, and n is the number of images, taken as the integer 100000.
The preprocessing steps of the images in the verification sample and the test sample are the same as the above steps, and will not be repeated here.
3. Constructing a global embedded attention residual network containing a global embedded attention module, as shown in fig. 2, wherein the global embedded attention residual network comprises 1 input layer, 1 convolution layer with a convolution kernel size of 7×7, 1 max pooling layer, a global embedded attention module, 2 fully connected layers and 1 output layer, and the global embedded attention module comprises a spatial attention sub-module based on global context and a channel attention sub-module based on coordinates;
wherein the global context based spatial attention submodule comprises:
the first subunit is used for inputting the preprocessed training samples, verification samples and test samples into the convolution layer and the pooling layer for processing and then performing global average pooling operation so as to obtain a feature matrix containing global information;
the second subunit performs linear transformation on the feature matrix containing global information by adopting convolution and reshape functions with convolution kernel size of 1 multiplied by 1 so as to obtain a feature matrix subjected to dimension transformation processing;
the third subunit performs self-adaptive selection on the feature matrix subjected to dimension transformation processing by using a softmax function to obtain the corresponding weight of each different element on the feature matrix, and multiplies the corresponding weight of each different element by the feature matrix containing global information to obtain the feature matrix containing global context feature information;
and a fourth subunit, configured to perform nonlinear transformation on the feature matrix containing the global context feature information by using batch normalization and a ReLU activation function and perform dimensional transformation by using 1×1 convolution.
Wherein the first subunit is implemented by formula (2), expressed as:

$$x = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X(i,j) \tag{2}$$

The second to fourth subunits are implemented by formula (3), expressed as:

$$y = K\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\sum_{j=1}^{N}\frac{e^{tx_j}}{\sum_{m=1}^{N}e^{tx_m}}\,x_j\right)\right)\right) \tag{3}$$

where x represents the output of global average pooling, y represents the output of the global context features, H and W represent the height and width of the input image, X represents the input image, K represents a 1×1 convolution, ReLU represents the ReLU activation function, BN represents the batch normalization function, N represents the number of elements in the feature matrix, e represents the base of the natural logarithm, i, j and m index the possible positions of elements in the feature matrix, x_j and x_m represent the values of element information in the feature matrix, t represents the weight of the x matrix, and tx_j and tx_m represent the output values computed from the feature matrix after the global average pooling operation.
The coordinate-based channel attention submodule includes:
a fifth subunit, configured to decompose the feature matrix containing the global context feature information after the nonlinear transformation and the dimensional transformation by adopting average pooling along the W 'direction and the H' direction, to obtain a one-dimensional feature matrix along the W 'direction and a one-dimensional feature matrix along the H' direction, where the one-dimensional feature matrix along the W 'direction includes local position information of the channel, and the one-dimensional feature matrix along the H' direction includes long-term dependency information;
a sixth subunit, configured to concatenate the one-dimensional feature matrix along the W 'direction and the one-dimensional feature matrix along the H' direction, and perform feature transformation with a convolution of 1×1 to obtain a feature matrix subjected to dimension transformation;
and a seventh subunit, wherein the feature matrixes with different weights are obtained by carrying out weight distribution on the feature matrixes subjected to dimension transformation by utilizing a softmax function, and the feature matrixes with different weights are respectively subjected to feature transformation by utilizing 1×1 convolution, so that the output of the global embedded attention module is obtained.
Wherein the fifth subunit is implemented by formulas (4) and (5), expressed as:

$$z^{H}(i) = \frac{1}{W'}\sum_{j=1}^{W'} y(i,j) \tag{4}$$

$$z^{W}(j) = \frac{1}{H'}\sum_{i=1}^{H'} y(i,j) \tag{5}$$

The sixth subunit is implemented by formula (6), expressed as:

$$g = K(z^{w} + z^{h}) \tag{6}$$

The seventh subunit is implemented by formulas (7)-(9), expressed as:

$$a_c = \frac{e^{A\,g^{H}}}{e^{A\,g^{H}} + e^{B\,g^{W}}} \tag{7}$$

$$b_c = \frac{e^{B\,g^{W}}}{e^{A\,g^{H}} + e^{B\,g^{W}}} \tag{8}$$

$$z = X(i,j) \times a_c + X(i,j) \times b_c \tag{9}$$

where H′ and W′ represent the height and width of the image output by the global-context spatial attention submodule, z^H and z^W represent the one-dimensional feature matrices along the H′ direction and the W′ direction respectively, i and j index the ith row along H′ and the jth column along W′, g represents the output of the cascade operation, K represents a 1×1 convolution, A and B represent random numbers whose initial values are A_c and B_c, a_c and b_c represent the weights corresponding to the feature matrices with different weights, e represents the base of the natural logarithm, and g^H and g^W are the two matrices obtained by passing the dimension-transformed feature matrix of the sixth subunit through the ReLU activation function and splitting it along the spatial dimension; the dimension-transformed feature matrix has dimension R^{C×(W+H)}, and after splitting the g^H and g^W feature matrices have dimensions R^{C×H} and R^{C×W} respectively.
It should be noted that the number of global embedded attention modules is 16, each embedded in a corresponding residual structure; the convolution layers outside the global embedded attention mechanism may optionally apply batch normalization and an activation function for nonlinear transformation, and the output layer uses the fully connected layer and a softmax function to output the probability that each input image belongs to each category, with the most probable category taken as the predicted category.
It should be further noted that 16 global embedded attention modules are selected because the attention module can be embedded into various networks, such as ResNet-50; the present disclosure is described based on the ResNet-50 structure, whose (3, 4, 6, 3) layout contains 16 residual blocks in total, hence the choice of 16 global embedded attention modules.
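The following sketch illustrates one way such an embedding could look on the torchvision ResNet-50, reusing the two submodule classes sketched earlier; the exact insertion point of the GEA module relative to each residual block is an assumption.

```python
import torch.nn as nn
from torchvision.models import resnet50

class GEAModule(nn.Module):
    """Assumed composition: global-context spatial attention, then coordinate channel attention."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial = GlobalContextSpatialAttention(channels)  # sketched above
        self.channel = CoordinateChannelAttention(channels)     # sketched above

    def forward(self, x):
        return self.channel(self.spatial(x))

def build_gea_resnet50(num_classes: int = 1000) -> nn.Module:
    net = resnet50(num_classes=num_classes)
    # ResNet-50 layout (3, 4, 6, 3): 16 bottleneck blocks in total
    for layer in (net.layer1, net.layer2, net.layer3, net.layer4):
        for i, block in enumerate(layer):
            out_channels = block.conv3.out_channels
            # wrap each residual block so a GEA module follows it
            layer[i] = nn.Sequential(block, GEAModule(out_channels))
    return net
```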
4. The preprocessed training sample containing 100000 images is input into the global embedded attention residual network for training, and the network weights are updated repeatedly through the two steps of forward propagation and back propagation until the maximum number of iterations (70-120) is reached, ending the training process and yielding a trained residual network model. The preprocessed verification sample containing 10000 images and the preprocessed test sample containing 30000 images are then input into the trained network for verification and testing, realizing image classification.
The test samples were used to compare the Top-1 and Top-5 accuracy of the proposed network on the COCO and ImageNet datasets against existing classification methods, including the CA, SE, BAM and CBAM attention mechanisms; the results are shown in Tables 1 and 2:
TABLE 1
TABLE 2
As can be seen from Table 1, the global embedded attention residual network with the added global embedded attention module performs well on both the COCO and ImageNet datasets. With ResNet-50 as the baseline, the highest Top-1 accuracy reaches 75.9 and the highest Top-5 accuracy reaches 86.6 on COCO, while on ImageNet the highest Top-1 and Top-5 accuracies reach 75.8 and 83.1 respectively, an improvement over the other models.
As can be seen from Table 2, using ResNet-101 brings a further improvement in Top-1 and Top-5 accuracy, indicating that the disclosed method generalizes well and can classify images better.
Specific example 2:
327 structural magnetic resonance images were selected from the ADNI dataset, including 119 brain MRIs of mild-cognitive-impairment patients, 101 brain MRIs of Alzheimer's patients and 107 brain MRIs of normal subjects. The data were split 7:3 into training sample data and test sample data, and the training sample data were trained and validated using 10-fold cross-validation. Specifically, the training sample data are divided into 10 parts numbered 0-9; first, parts 0-8 are taken as the training set and part 9 as the validation set; after that round of training and validation, part 8 is taken as the validation set and the remaining parts as the training set; and so on, for 10 rounds in total. Finally, the test samples are used for testing, and the accuracy, recall and precision on the ADNI dataset are compared with existing classification methods, including the CA, SE, BAM and CBAM attention mechanisms; the comparison results are shown in Tables 3 and 4:
TABLE 3
TABLE 4
As can be seen from Tables 3 and 4, the accuracy of the global embedded attention residual network with the added global embedded attention module reaches up to 88.5 with ResNet-50 as the base model, and up to 90.5 with ResNet-101 as the base model.
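A sketch of the data partitioning used in this example, written with scikit-learn, is shown below; the stratification and shuffling choices are assumptions beyond the stated 7:3 ratio and 10 folds, and the arrays are placeholders for the ADNI volumes.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

# Placeholder data standing in for the 327 ADNI MRI volumes and their labels
images = np.random.rand(327, 1, 96, 96)        # assumed array shape, illustration only
labels = np.random.randint(0, 3, size=327)     # 0 = MCI, 1 = AD, 2 = normal (assumed coding)

# 7:3 split into training sample data and test sample data
train_x, test_x, train_y, test_y = train_test_split(
    images, labels, test_size=0.3, stratify=labels)

# 10-fold cross-validation on the training portion: parts numbered 0-9,
# nine parts train the network and the held-out part validates, for 10 rounds
kfold = KFold(n_splits=10, shuffle=True)
for fold, (fit_idx, val_idx) in enumerate(kfold.split(train_x)):
    fit_x, fit_y = train_x[fit_idx], train_y[fit_idx]
    val_x, val_y = train_x[val_idx], train_y[val_idx]
    # ... train and validate the network here, as in specific example 1 ...
```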
To further verify the technical effect of the disclosed method, the method was applied to the three datasets and a portion of the test results was selected for visual display, as shown in fig. 3 (a) to 3 (d), where fig. 3 (a) shows no attention mechanism, fig. 3 (b) the SE attention mechanism, fig. 3 (c) the CA attention mechanism, and fig. 3 (d) the GEA attention mechanism (i.e., the global embedded attention residual network). The results show that the proposed GEA mechanism constrains the network more effectively, so that it concentrates on the most salient features of an image rather than attending to the whole image; the network thus focuses more prominently on the region of interest, finds the most discriminative features of the image, and greatly improves the classification accuracy of the network.
The basic principles of the present disclosure have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present disclosure are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not intended to be limited to the details disclosed herein as such.

Claims (7)

1. A method of classifying images by constructing a global embedded attention residual network, comprising the steps of:
s100: preprocessing the image data to be classified;
s200: constructing a global embedded attention residual network containing a global embedded attention module, wherein the global embedded attention residual network comprises 1 input layer, 1 convolution layer with a convolution kernel size of 7×7, 1 max pooling layer, a global embedded attention module, 2 fully connected layers and 1 output layer, and the global embedded attention module comprises a spatial attention sub-module based on global context and a channel attention sub-module based on coordinates;
the global context-based spatial attention submodule includes:
the first subunit is used for inputting the preprocessed training samples, verification samples and test samples into the convolution layer and the pooling layer for processing and then performing global average pooling operation so as to obtain a feature matrix containing global information;
the second subunit performs linear transformation on the feature matrix containing global information by adopting convolution and reshape functions with convolution kernel size of 1 multiplied by 1 so as to obtain a feature matrix subjected to dimension transformation processing;
the third subunit performs self-adaptive selection on the feature matrix subjected to dimension transformation processing by using a softmax function to obtain the corresponding weight of each different element on the feature matrix, and multiplies the corresponding weight of each different element by the feature matrix containing global information to obtain the feature matrix containing global context feature information;
a fourth subunit, configured to perform nonlinear transformation on the feature matrix including the global context feature information by using batch normalization and a ReLU activation function, and perform dimensional transformation by using 1×1 convolution; the global context based spatial attention submodule is expressed as:
wherein X is represented as the output of global average pooling, y is represented as the output of global context features, H and W are represented as the height and width of the input image, respectively, X is represented as the input image, K is represented as a 1X 1 convolution, reLU is represented as a ReLU activation function, BN is represented as a batch normalization function, N is represented as the number of elements in the feature matrix, e is represented as the base of a natural logarithmic function, i, j, m are represented as the possible positions of all elements in the feature matrix, respectively, X j 、x m Respectively representing the values of element information in the feature matrix, t represents the weight of the x matrix, tx j And tx m Representing the output value obtained by calculating the feature matrix after global average pooling operation;
The coordinate-based channel attention submodule includes:
a fifth subunit, configured to decompose the feature matrix containing the global context feature information after the nonlinear transformation and the dimensional transformation by adopting average pooling along the W 'direction and the H' direction, to obtain a one-dimensional feature matrix along the W 'direction and a one-dimensional feature matrix along the H' direction, where the one-dimensional feature matrix along the W 'direction includes local position information of the channel, and the one-dimensional feature matrix along the H' direction includes long-term dependency information;
a sixth subunit, configured to concatenate the one-dimensional feature matrix along the W 'direction and the one-dimensional feature matrix along the H' direction, and perform feature transformation with a convolution of 1×1 to obtain a feature matrix subjected to dimension transformation;
a seventh subunit, performing weight distribution on the feature matrix subjected to the dimension transformation by using a softmax function to obtain feature matrices with different weights, and performing feature transformation on the feature matrices with different weights by using 1×1 convolution respectively to obtain the output of the global embedded attention module;
s300: and inputting the preprocessed image data to be classified into a global embedded attention residual error network for classification.
2. The method of claim 1, wherein after the global embedded attention residual network construction is completed, a training sample is selected and preprocessed to train the network, a verification sample is selected and preprocessed to adjust parameters of the trained network, and a test sample is selected to perform performance test on the trained network.
3. The method of claim 1, wherein the feature matrix containing the global context feature information after the nonlinear transformation and the dimensional transformation is decomposed by adopting average pooling along the W 'and the H' directions, and the obtained one-dimensional feature matrix along the W 'direction and the obtained one-dimensional feature matrix along the H' direction are respectively expressed as:
$$z^{H}(i) = \frac{1}{W'}\sum_{j=1}^{W'} y(i,j), \qquad z^{W}(j) = \frac{1}{H'}\sum_{i=1}^{H'} y(i,j)$$

where H′ and W′ represent the height and width of the image output by the global-context spatial attention submodule, z^H and z^W represent the one-dimensional feature matrices along the H′ direction and the W′ direction respectively, and i and j index the ith row along H′ and the jth column along W′.
4. The method of claim 1, wherein the cascading of the one-dimensional feature matrix along the W 'direction and the one-dimensional feature matrix along the H' direction is performed by:
g=K(z w +z h )
where g represents the output of the cascade operation and K represents a 1 x 1 convolution.
5. The method of claim 1, wherein the output of the global embedded attention module is represented as:
z = X(i,j) × a_c + X(i,j) × b_c

with

$$a_c = \frac{e^{A\,g^{H}}}{e^{A\,g^{H}} + e^{B\,g^{W}}}, \qquad b_c = \frac{e^{B\,g^{W}}}{e^{A\,g^{H}} + e^{B\,g^{W}}}$$

where A and B represent random numbers whose initial values are A_c and B_c, a_c and b_c represent the weights corresponding to the feature matrices with different weights, e represents the base of the natural logarithm, and g^H and g^W are the two matrices obtained by passing the dimension-transformed feature matrix of the sixth subunit through the ReLU activation function and splitting it along the spatial dimension; the dimension-transformed feature matrix has dimension R^{C×(W+H)}, and after splitting the g^H and g^W feature matrices have dimensions R^{C×H} and R^{C×W} respectively.
6. The method according to claim 2, wherein the image data to be classified and the training, validation and test samples are preprocessed according to the following steps:
s201: performing horizontal and vertical overturning on the image data to be classified and the image data in the training sample, the verification sample and the test sample;
s202: rotating the flipped image data clockwise or anticlockwise;
s203: scaling the rotated image data;
s204: and carrying out average reduction processing on the zoomed image data.
7. The method according to claim 6, wherein in step S204, the process of reducing the mean value of the scaled image data is performed by:
$$Z = v - \frac{1}{n}\sum_{i=1}^{n} v_i$$

where Z is the image after the mean is subtracted, v is the pixel matrix of the current image, v_i is the pixel matrix of the ith of the n images, and n is the number of images, taken as the integer 100000.
CN202110487497.4A 2021-04-30 2021-04-30 Method for classifying images by constructing global embedded attention residual network Active CN113111970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110487497.4A CN113111970B (en) 2021-04-30 2021-04-30 Method for classifying images by constructing global embedded attention residual network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110487497.4A CN113111970B (en) 2021-04-30 2021-04-30 Method for classifying images by constructing global embedded attention residual network

Publications (2)

Publication Number Publication Date
CN113111970A CN113111970A (en) 2021-07-13
CN113111970B (en) 2023-12-26

Family

ID=76720844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110487497.4A Active CN113111970B (en) 2021-04-30 2021-04-30 Method for classifying images by constructing global embedded attention residual network

Country Status (1)

Country Link
CN (1) CN113111970B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034375B (en) * 2022-08-09 2023-06-27 北京灵汐科技有限公司 Data processing method and device, neural network model, equipment and medium
CN115203380B (en) * 2022-09-19 2022-12-20 山东鼹鼠人才知果数据科技有限公司 Text processing system and method based on multi-mode data fusion
CN116958711B (en) * 2023-09-19 2023-12-15 华东交通大学 Lead-zinc ore image classification model construction method, system, storage medium and equipment


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020140633A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Text topic extraction method, apparatus, electronic device, and storage medium
CN111199214A (en) * 2020-01-04 2020-05-26 西安电子科技大学 Residual error network multispectral image ground feature classification method
CN111259982A (en) * 2020-02-13 2020-06-09 苏州大学 Premature infant retina image classification method and device based on attention mechanism
CN112163601A (en) * 2020-09-14 2021-01-01 华南理工大学 Image classification method, system, computer device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hierarchical feature fusion attention network for image super-resolution reconstruction; Lei Pengcheng; Liu Cong; Tang Jiangang; Peng Dunlu; Journal of Image and Graphics (中国图象图形学报), No. 09; full text *

Also Published As

Publication number Publication date
CN113111970A (en) 2021-07-13

Similar Documents

Publication Publication Date Title
CN113111970B (en) Method for classifying images by constructing global embedded attention residual network
US20230021497A1 (en) Generating images using neural networks
CN111626300B (en) Image segmentation method and modeling method of image semantic segmentation model based on context perception
CN109190695B (en) Fish image classification method based on deep convolutional neural network
CN108734661B (en) High-resolution image prediction method for constructing loss function based on image texture information
CN110706214B (en) Three-dimensional U-Net brain tumor segmentation method fusing condition randomness and residual error
CN111127472B (en) Multi-scale image segmentation method based on weight learning
CN112418261B (en) Human body image multi-attribute classification method based on prior prototype attention mechanism
CN113706544B (en) Medical image segmentation method based on complete attention convolutional neural network
CN115311502A (en) Remote sensing image small sample scene classification method based on multi-scale double-flow architecture
CN115965789A (en) Scene perception attention-based remote sensing image semantic segmentation method
CN116168197A (en) Image segmentation method based on Transformer segmentation network and regularization training
CN116503399A (en) Insulator pollution flashover detection method based on YOLO-AFPS
CN116091823A (en) Single-feature anchor-frame-free target detection method based on fast grouping residual error module
CN115171052A (en) Crowded crowd attitude estimation method based on high-resolution context network
CN113298931B (en) Reconstruction method and device of object model, terminal equipment and storage medium
CN114882278A (en) Tire pattern classification method and device based on attention mechanism and transfer learning
CN111724306B (en) Image reduction method and system based on convolutional neural network
CN117373064A (en) Human body posture estimation method based on self-adaptive cross-dimension weighting, computer equipment and storage medium
CN113436224A (en) Intelligent image clipping method and device based on explicit composition rule modeling
CN109583584B (en) Method and system for enabling CNN with full connection layer to accept indefinite shape input
CN116385454A (en) Medical image segmentation method based on multi-stage aggregation
CN110930314A (en) Image banding noise suppression method and device, electronic device and storage medium
CN114897884A (en) No-reference screen content image quality evaluation method based on multi-scale edge feature fusion
CN114782779B (en) Small sample image feature learning method and device based on feature distribution migration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant