CN115965864A - Lightweight attention mechanism network for crop disease identification - Google Patents

Lightweight attention mechanism network for crop disease identification

Info

Publication number
CN115965864A
CN115965864A (application CN202211622568.8A)
Authority
CN
China
Prior art keywords
model
attention mechanism
channel
mobilevit
convolution
Prior art date
Legal status
Pending
Application number
CN202211622568.8A
Other languages
Chinese (zh)
Inventor
Zhang Defu (张德富)
Zhong Renhao (仲仁豪)
Current Assignee
Xiamen University
Original Assignee
Xiamen University
Priority date
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202211622568.8A priority Critical patent/CN115965864A/en
Publication of CN115965864A publication Critical patent/CN115965864A/en
Pending legal-status Critical Current

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A — TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 40/00 — Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A 40/10 — Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in agriculture

Landscapes

  • Image Analysis (AREA)

Abstract

A lightweight attention mechanism network for crop disease identification, relating to the field of deep learning. Based on the MobileViT model, a channel attention mechanism is added to some of the MobileViT blocks, and channel and spatial attention mechanisms are added at the end of the model. The network is built on the lightweight Transformer model MobileViT, which can effectively learn both local and global representations. To better capture crop disease information, an improved attention mechanism is added to the model. All data of the PlantVillage public data set, comprising 38 classes in total, are used for training and testing; evaluation on this data set yields a recognition accuracy of 99.60%, so crop diseases are identified effectively.

Description

Lightweight attention mechanism network for crop disease identification
Technical Field
The invention relates to the field of deep learning, in particular to a lightweight attention mechanism network for crop disease identification, and belongs to the application of a deep learning model in the field of crop disease identification.
Background
Crop diseases impair the growth of crops, reduce yield, and degrade quality. Given the importance of agriculture and the serious harm diseases cause to crops, rapid and accurate identification of crop diseases is essential.
A key step in addressing crop diseases is to identify them quickly and accurately and then treat them according to the symptoms. Crop diseases are hard to judge by eye, and accurate, efficient diagnosis remains a major challenge. In recent years deep learning has developed rapidly and is applied in many fields. In image recognition, convolutional neural networks work well: they effectively extract image features and classify images. Researchers have proposed a variety of convolutional neural networks, such as VGGNet, GoogLeNet, and ResNet. VGGNet is a relatively deep model with a simple network structure that achieves good results. GoogLeNet is built from Inception modules; each module uses a multi-branch structure in which several convolutional layers extract different information, improving the representational capacity of the network. ResNet introduced residual connections, which allow much deeper convolutional neural networks to be built with good results. Convolutional neural networks have long dominated image recognition, but in recent years the Transformer model has also been applied to computer vision. The Transformer achieved superior results in natural language processing (NLP); its core component is the self-attention mechanism, which differs from both convolutional and recurrent neural networks. The Vision Transformer (ViT) applies the Transformer structure to visual tasks and obtains good results.
The ViT model divides a picture into non-overlapping patches, applies a linear mapping to each patch, and feeds the result into a Transformer for computation. An image patch is analogous to a token in an NLP task.
With the rapid development of deep learning, its applications have broadened, and it is increasingly used to identify crop diseases. Many deep learning models identify crop diseases accurately but have large parameter counts and high demands on compute and storage, which makes them hard to deploy on mobile and embedded devices. The Transformer model likewise performs well on vision tasks and can learn global features, but its large parameter count also hinders use on mobile and embedded devices. It is therefore significant to design a lightweight model that identifies crop diseases effectively.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a lightweight attention mechanism network for crop disease identification. It is built on a lightweight network model with an added attention mechanism, so that crop diseases are identified better while the parameter count and complexity stay low enough for deployment on mobile and embedded devices.
In the lightweight attention mechanism network for crop disease identification, a channel attention mechanism is added to some of the MobileViT blocks of a MobileViT model, and channel and spatial attention mechanisms are added at the end of the model;
the channel attention mechanism is based on the CBAM attention mechanism and additionally uses a one-dimensional convolution; it analyses the relationships among image channels and assigns each channel a weight to capture key information, improving network performance;
the spatial attention mechanism is based on the CBAM attention mechanism and additionally comprises a multi-branch network structure and dilated convolutions; the multi-branch structure is built from convolution kernels of different sizes, and the dilated convolution layers enlarge the receptive field; the spatial attention mechanism analyses spatial relationships in the image and assigns each pixel a weight, capturing important information along the spatial dimensions.
The MobileViT model can effectively learn local and global representations. To better capture crop disease information, an improved attention mechanism is added to the model, and all data of the PlantVillage public data set are used for training and testing. The training method comprises the following steps:
the model is trained on the public PlantVillage data set, a public crop disease data set comprising 38 classes; the PlantVillage data set is randomly divided into a training set, a validation set, and a test set in a 6:2:2 ratio; the training set trains the model, the validation set monitors the state of the model during training, and the test set evaluates the final model;
in training, each sample in the training set consists of an input image and the image's true class label; feeding a training sample into the model yields a prediction, a vector with C elements (one per class), where each position gives the probability of that class; the true label of the input image is also a C-element vector in which exactly one element is 1 and the rest are 0, the position of the 1 indicating the true class; the model output (i.e. the predicted label) is compared with the true label, and the loss is computed with the cross-entropy loss function

L = -∑_{c=1}^{C} p_c · log(q_c)

where C is the number of classes, p_c is a variable taking the value 0 or 1 (p_c = 1 if c is the true class, otherwise p_c = 0), and q_c is the probability the model predicts for class c; after the loss is computed, the gradients of the model parameters are obtained by backpropagation, and the parameters are updated with the AdamW optimizer; training uses mini-batches, i.e. a batch of samples is fed to the model each time, and the specific training process is as follows:
(1) Input a batch of image samples into the model.
(2) The model computes the predicted class for each sample.
(3) Compare the predicted classes with the true classes and compute the loss with the cross-entropy loss function.
(4) Backpropagate, compute the gradients of the model parameters, and update the parameters with the AdamW optimizer.
(5) Repeat the steps above until the set number of training iterations is reached.
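The five steps above can be sketched as follows. This is a minimal illustration assuming NumPy, a toy linear softmax classifier in place of the MobileViT network, plain gradient descent in place of AdamW, and random data in place of PlantVillage images — every name and size here is an illustrative assumption, not part of the invention.

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, B = 38, 16, 8                       # 38 classes, as in PlantVillage
W = np.zeros((D, C))                      # toy model: one linear layer

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(q, y):
    # L = -sum_c p_c log q_c with one-hot p; averaged over the batch
    return -np.log(q[np.arange(len(y)), y] + 1e-12).mean()

X = rng.normal(size=(B, D))               # (1) a batch of samples
y = rng.integers(0, C, size=B)            # true class labels

losses = []
for step in range(200):
    q = softmax(X @ W)                    # (2) predicted class probabilities
    losses.append(cross_entropy(q, y))    # (3) cross-entropy loss
    p = np.eye(C)[y]                      # one-hot true labels
    grad = X.T @ (q - p) / B              # (4) gradient of the loss
    W -= 0.5 * grad                       #     parameter update (toy optimizer)
# (5) the loop ends after the set number of iterations
```

With zero-initialised weights the first loss equals log 38 (a uniform prediction over the 38 classes), and it falls steadily as the loop repeats steps (1)–(4).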
The details and key points of the network model of the invention are described below. Its core parts are the attention mechanism and the MobileViT block, where the attention mechanism comprises a channel attention mechanism and a spatial attention mechanism.
1) A channel attention mechanism is constructed that analyses the relationships among image channels and assigns each channel a weight to capture key information, improving network performance;
2) A spatial attention mechanism is constructed that assigns each pixel a weight. For the input image, channel information is compressed by global average pooling and max pooling; a multi-branch network built from convolution kernels of different sizes fuses information better, improves the representational capacity of the network, and captures crop disease information; dilated convolutions enlarge the receptive field; the compressed maps are processed by several convolution kernels, the results are summed and fused, and a sigmoid yields the attention score;
3) A lightweight attention mechanism network for crop disease identification is constructed: the MobileViT model is the base model, a channel attention mechanism is added to some of its MobileViT blocks, and channel and spatial attention mechanisms are added at the end of the model, so that channel and spatial information are captured better.
In step 1), the channel attention mechanism is built on the CBAM attention mechanism. In CBAM, channel attention uses a multilayer perceptron (MLP) with two fully connected layers; here a one-dimensional convolution is introduced to alleviate the large parameter count.

Suppose the input image is X ∈ R^{H×W×C}, where H, W, and C are the height, width, and number of channels of the image. Global average pooling and max pooling compress the spatial information of the image into

F_avg^ch ∈ R^C,  F_max^ch ∈ R^C

where F_avg^ch and F_max^ch are the results of applying global average pooling and max pooling to the image in the channel attention mechanism. In channel attention, global average pooling sums and averages, for each channel, all H × W elements of that channel; max pooling is computed similarly by taking the maximum of all elements in each channel. F_avg^ch and F_max^ch are concatenated, fused by a one-dimensional convolution, and passed through a sigmoid function to give the attention score:

s_ch = σ(f_k^{1D}([F_avg^ch; F_max^ch]))

where f_k^{1D} is a one-dimensional convolution with kernel size k, σ is the sigmoid function, and s_ch ∈ R^C. Multiplying the attention score by the input image gives the output X' = s_ch ⊗ X.
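A small sketch of this channel attention computation, assuming NumPy and a fixed toy kernel for the one-dimensional convolution — the kernel size k = 3 and the constant weights are illustrative assumptions, not values from the invention.

```python
import numpy as np

def channel_attention(x, kernel=None):
    """x: feature map of shape (C, H, W); returns (weighted x, channel scores)."""
    C = x.shape[0]
    f_avg = x.mean(axis=(1, 2))           # global average pooling, shape (C,)
    f_max = x.max(axis=(1, 2))            # global max pooling, shape (C,)
    z = np.stack([f_avg, f_max])          # concatenated descriptor, (2, C)
    if kernel is None:                    # toy fixed weights standing in for a
        kernel = np.full((2, 3), 0.5)     # learned 1-D conv (2 maps, k = 3)
    k = kernel.shape[1]
    pad = k // 2
    zp = np.pad(z, ((0, 0), (pad, pad)))  # "same" padding along the channel axis
    fused = np.array([(zp[:, c:c + k] * kernel).sum() for c in range(C)])
    s = 1.0 / (1.0 + np.exp(-fused))      # sigmoid attention score per channel
    return x * s[:, None, None], s

x = np.random.default_rng(1).normal(size=(8, 4, 4))
y, s = channel_attention(x)               # each of the 8 channels gets a weight
```

The 1-D convolution slides a small window over the channel axis, so the parameter count depends only on k, not on C — the motivation for replacing CBAM's two fully connected layers.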
In step 2), the spatial attention mechanism is constructed. Unlike the channel attention mechanism, spatial attention models the relationships between spatial locations and assigns each pixel a weight, capturing important information along the spatial dimensions. In the spatial attention mechanism, the channel information of the input image is compressed by global average pooling and max pooling, giving

F_avg^sp ∈ R^{H×W},  F_max^sp ∈ R^{H×W}

where F_avg^sp and F_max^sp are the results of applying global average pooling and max pooling along the channel dimension in the spatial attention mechanism. Here global average pooling takes, for each pixel position in the image, the mean of its C channel values; max pooling is computed similarly, with the mean replaced by the maximum. The pooled maps are then processed by convolution layers. To fuse information better, improve the representational capacity of the network, and capture crop disease information, a multi-branch structure is adopted; classical models such as GoogLeNet and ResNet use multi-branch structures and improve performance. Hence, for better information extraction, the spatial attention mechanism builds a multi-branch network from convolution kernels of different sizes: the input is fed into each branch separately, the branch outputs are summed, and dilated convolutions enlarge the receptive field. The pooled maps F_avg^sp and F_max^sp are concatenated, processed by the several convolution kernels, summed and fused, and passed through a sigmoid to give the attention score:

s_sp = σ( ∑_i f_{k_i,d_i}([F_avg^sp; F_max^sp]) )

where s_sp is the attention score, f_{k_i,d_i} is a convolution with kernel size k_i × k_i and dilation rate d_i, σ is the sigmoid function, and s_sp ∈ R^{H×W}. Multiplying the attention score by the input gives the output X' = s_sp ⊗ X.
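A small sketch of this spatial attention computation, assuming NumPy, additive fusion of the two pooled maps in place of concatenation, and two mean-filter branches — the kernel sizes (3 and 5) and the dilation rate 2 on the first branch are illustrative choices, not values from the invention.

```python
import numpy as np

def conv2d_same(x, k, dilation=1):
    """Naive 'same' 2-D mean filter of one map (a toy stand-in for a learned kernel)."""
    eff = (k - 1) * dilation + 1          # effective extent of the dilated kernel
    pad = eff // 2
    xp = np.pad(x, pad)
    H, W = x.shape
    out = np.zeros_like(x)
    for i in range(k):                    # accumulate the k*k (dilated) taps
        for j in range(k):
            di, dj = i * dilation, j * dilation
            out += xp[di:di + H, dj:dj + W]
    return out / (k * k)

def spatial_attention(x):
    """x: feature map (C, H, W); returns (weighted x, per-pixel score map)."""
    f_avg = x.mean(axis=0)                # average over channels, (H, W)
    f_max = x.max(axis=0)                 # max over channels, (H, W)
    desc = f_avg + f_max                  # toy additive fusion of the two maps
    branches = conv2d_same(desc, 3, dilation=2) + conv2d_same(desc, 5)
    s = 1.0 / (1.0 + np.exp(-branches))   # sigmoid score per pixel
    return x * s[None, :, :], s

x = np.random.default_rng(2).normal(size=(8, 6, 6))
y, s = spatial_attention(x)               # every pixel position gets a weight
```

Dilating the 3 × 3 branch stretches its taps over a 5 × 5 window without adding parameters, which is exactly how the dilated convolution enlarges the receptive field.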
In step 3), the MobileViT model contains MV2 blocks and MobileViT blocks. MV2 is the inverted residual structure from MobileNetv2. The MobileViT block is the core module of the MobileViT model and learns local and global representations. The crop disease identification model involves local modeling and global modeling:

(1) Local modeling: for an input image X ∈ R^{H×W×C}, where H, W, and C are the height, width, and number of channels, the local modeling operation F_local is performed first. It learns local representations through an n × n convolution module f_{n×n}, then raises the dimension through a 1 × 1 convolution module f_{1×1}, giving the locally modeled output X_local:

X_local = f_{1×1}(f_{n×n}(X)),  X_local ∈ R^{H×W×d}

where d is the dimension of X_local;

(2) Global modeling: this comprises an Unfold step, computation by a Transformer module, and a Fold step. The output X_local obtained from local modeling is unfolded, turning the image into sequence data a Transformer can process, and self-attention is computed. X_local is divided into non-overlapping patches; the Unfold operation yields

X_p ∈ R^{P×N×d}

where P = p_h × p_w, N = HW / P, p_h and p_w are the height and width of each patch, P is the number of groups, and N is the number of patches per group. The data X_p in each group are fed into a Transformer module for computation, learning the global representation; a Fold operation then restores X_fold ∈ R^{H×W×d}. Finally a 1 × 1 convolution reduces the dimension, the result is concatenated with the original input X, and an n × n convolution fuses them:

X_o = f_{n×n}([X, f_{1×1}(X_fold)])

where f_{n×n} is an n × n convolution and f_{1×1} is a 1 × 1 convolution.
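The Unfold and Fold rearrangements above can be sketched as follows, assuming NumPy and an identity stand-in for the Transformer module; the patch sizes p_h = p_w = 2 and the tensor sizes are illustrative choices.

```python
import numpy as np

def unfold(x, ph, pw):
    """(H, W, d) -> (P, N, d) with P = ph*pw groups of N = (H//ph)*(W//pw) each."""
    H, W, d = x.shape
    x = x.reshape(H // ph, ph, W // pw, pw, d)        # split into non-overlapping patches
    x = x.transpose(1, 3, 0, 2, 4)                    # group pixels by in-patch position
    return x.reshape(ph * pw, (H // ph) * (W // pw), d)

def fold(xp, H, W, ph, pw):
    """Inverse of unfold: (P, N, d) -> (H, W, d)."""
    d = xp.shape[-1]
    x = xp.reshape(ph, pw, H // ph, W // pw, d)
    x = x.transpose(2, 0, 3, 1, 4)                    # undo the grouping
    return x.reshape(H, W, d)

x = np.random.default_rng(3).normal(size=(4, 6, 5))   # H=4, W=6, d=5
xp = unfold(x, 2, 2)                                  # P=4 groups, N=6 patches each
# ... each group xp[i] would pass through the Transformer module here ...
y = fold(xp, 4, 6, 2, 2)                              # restore the (H, W, d) layout
```

Because Fold exactly inverts Unfold, the round trip with an identity Transformer reproduces the input, which gives a convenient correctness check for the rearrangement.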
Compared with the prior art, the invention has the following outstanding technical effects and advantages:
the method is built based on a lightweight Transformer model, namely a MobileViT network model, the model can effectively learn local representation and global representation, in order to better capture disease information of crops, an improved attention mechanism is added into the model, then all data of a PlantVillage public data set are used for training and testing, the total number of the classes is 38, and the model achieves 99.60% identification accuracy and shows the effectiveness of the network model in the invention through evaluation and verification on the PlantVillage public data set.
The present invention compares favourably with existing work. In "Using Deep Learning for Image-Based Plant Disease Detection", the authors use the GoogLeNet model to reach 99.35% accuracy on the PlantVillage public data set. In "Tomato crop disease classification using pre-trained deep learning algorithm", Rangarajan et al. select the tomato images in the PlantVillage data set and obtain 96.19% accuracy with a VGG16 model. In "Grape disease image classification based on lightweight convolution neural networks and channel attention", a channel attention mechanism is added to ShuffleNet and the grape images in the PlantVillage data set are recognized with 99.14% accuracy. Compared with this prior art, the present method, on the one hand, uses the MobileViT model, so the network can effectively learn local and global representations. On the other hand, since some crop diseases are small and hard to identify, an improved attention mechanism comprising channel and spatial attention is added; it considers the channel and spatial dimensions simultaneously, so the model can focus on the diseased regions in a crop disease picture and identify crop diseases effectively.
The invention studies convolutional neural networks and the Transformer model, both of which perform well in image recognition and can be used to identify crop diseases. Since some deep learning models have too many parameters to run on embedded and mobile devices, a lightweight model is chosen for identifying crop diseases. In addition, an attention mechanism is studied and added to the model to learn important information and improve its representational capacity, so that crop diseases are identified effectively.
Drawings
Fig. 1 is an input picture of crop diseases.
FIG. 2 is a block diagram of a MobileViT block with added channel attention mechanism.
Fig. 3 shows how groups are divided when computing with the Transformer module in the (modified) MobileViT block, where each color is one group.
FIG. 4 is a diagram of an attention mechanism including channel attention and spatial attention.
FIG. 5 is a graph of the results of the network model prediction output.
Detailed Description
In order to explain the present invention in more detail, the following detailed description is made with reference to the accompanying drawings and examples.
First, a method of training a network model in the present invention will be described.
In the invention, the model is trained on the public PlantVillage data set, a public crop disease data set comprising 38 classes. The PlantVillage data set is randomly divided into a training set, a validation set, and a test set in a 6:2:2 ratio. The training set trains the model, the validation set monitors the state of the model during training, and the test set evaluates the final model.
In training, each sample in the training set consists of an input image and the image's true class label. Feeding a training sample into the model yields a prediction: a vector with C elements, one per class, each position giving the probability of that class. The true label of the input image is also a C-element vector in which exactly one element is 1 and the rest are 0, the position of the 1 indicating the true class. The model output (i.e. the predicted label) is compared with the true label, and the loss is computed with the cross-entropy loss function

L = -∑_{c=1}^{C} p_c · log(q_c)

where C is the number of classes, p_c is a variable taking the value 0 or 1 (p_c = 1 if c is the true class, otherwise p_c = 0), and q_c is the probability the model predicts for class c. After the loss is computed, the gradients of the model parameters are obtained by backpropagation, and the AdamW optimizer updates the network parameters. Training uses mini-batches, i.e. a batch of samples is fed to the model each time; the specific training process is as follows.
(1) Input a batch of image samples into the model.
(2) The model computes the predicted class for each sample.
(3) Compare the predicted classes with the true classes and compute the loss with the cross-entropy loss function.
(4) Backpropagate, compute the gradients of the model parameters, and update the parameters with the AdamW optimizer.
(5) Repeat the steps above until the set number of training iterations is reached.
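As a small numeric check of the cross-entropy formula above, here is a sketch with made-up probabilities and C = 3 classes; the one-hot label p picks out the predicted probability of the true class, so the loss reduces to -log(q_true).

```python
import numpy as np

# L = -sum_c p_c * log(q_c) with C = 3 illustrative classes
p = np.array([0.0, 1.0, 0.0])          # true class is c = 1 (one-hot label)
q = np.array([0.1, 0.7, 0.2])          # model's predicted probabilities
loss = -np.sum(p * np.log(q))          # only the true class contributes: -log(0.7)
```

The loss shrinks toward 0 as the model assigns the true class a probability closer to 1, which is what the optimizer drives it to do.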
The details and key points of the network model of the invention are explained below.
The invention constructs an effective attention mechanism and adds it to the network model, improving network performance and capturing crop disease information effectively. Attention mechanisms are widely used in natural language processing, computer vision, and other fields. An attention mechanism can highlight important information and suppress unimportant information, letting the model make more accurate judgments. Introducing an attention mechanism into the crop disease identification model lets the model capture disease information effectively and improves recognition accuracy.
The invention mainly improves the CBAM attention mechanism, preserving performance without excessive complexity. CBAM comprises a channel attention mechanism and a spatial attention mechanism. The channel attention mechanism analyses the relationships among image channels and assigns each channel a weight to capture key information, improving network performance. The attention mechanism constructed by the invention is shown in Fig. 4: the upper part shows the computation of the channel attention mechanism and the lower part that of the spatial attention mechanism; the outputs of the two mechanisms applied to the input image are added to the input image to give the final output. The specific computations of the two mechanisms are described below.
The channel attention constructed by the invention is mainly an improvement of the CBAM attention mechanism. In CBAM, channel attention uses a multilayer perceptron (MLP) with two fully connected layers, which brings a large number of parameters. Following the idea of ECA-Net, the invention introduces a one-dimensional convolution to alleviate this. Suppose the input image is X ∈ R^{H×W×C}, where H, W, and C are the height, width, and number of channels of the image. Global average pooling and max pooling compress the spatial information of the image: average pooling summarizes the overall information, and adding max pooling captures the salient information; combining the two better improves the model. The two poolings give F_avg^ch, F_max^ch ∈ R^C, which are concatenated, fused by a one-dimensional convolution, and finally passed through a sigmoid function to give the attention score:

s_ch = σ(f_k^{1D}([F_avg^ch; F_max^ch]))

where f_k^{1D} is a one-dimensional convolution with kernel size k, σ is the sigmoid function, and s_ch ∈ R^C. Multiplying the attention score by the input image gives the output X' = s_ch ⊗ X.
After attention along the channel dimension, a spatial attention mechanism is constructed. Unlike channel attention, spatial attention focuses on the relationships between spatial locations and assigns each pixel a weight, capturing important information along the spatial dimensions. In the spatial attention mechanism, the channel information of the input image is first compressed by global average pooling and max pooling, giving F_avg^sp, F_max^sp ∈ R^{H×W}, which are then processed by convolution layers. To fuse information better, improve the representational capacity of the network, and capture crop disease information, a multi-branch structure is adopted; classical models such as GoogLeNet and ResNet use multi-branch structures, which improve performance. Hence, for better information extraction, the spatial attention mechanism builds a multi-branch network from convolution kernels of different sizes and uses dilated convolutions to enlarge the receptive field. The pooled maps F_avg^sp and F_max^sp are concatenated, processed by the several convolution kernels, summed and fused, and passed through a sigmoid to give the attention score:

s_sp = σ( ∑_i f_{k_i,d_i}([F_avg^sp; F_max^sp]) )

where f_{k_i,d_i} is a convolution with kernel size k_i × k_i and dilation rate d_i, σ is the sigmoid function, and s_sp ∈ R^{H×W}. Multiplying the attention score by the input gives the output X' = s_sp ⊗ X.
The MobileViT model is used as a basic model. It is an advantage for the Transformer model to be able to learn global tokens based on a self-attention mechanism. And for the convolutional neural network, the convolutional neural network has spatial induction bias and can learn local characteristics through fewer parameters. Considering that the convolutional neural network and the Transformer model have respective advantages, the MobileViT model is selected to construct the crop disease identification model, and the MobileViT model combines the advantages of the convolutional neural network and the Transformer model. The main blocks contained in MobileViT are MV2 block and MobileViT block. The MV2 block is a MobileNetv2 block, which is a reciprocal residual structure in MobileNetv 2. In MobileViT, the core module is the MobileViT block, which combines the advantages of CNNs and transformers for learning local and global characterizations. For an input image
Figure SMS_43
Wherein H, W and C are respectively height, width and channel number. First, a local modeling operation F is performed local The step is first passed through an n × n convolution module f n×n Learning local tokens and then passing through a 1 x 1 convolution module f 1, performing dimension raising to obtain output X after local modeling local The specific calculation process is as follows:
Figure SMS_44
wherein d is X local Of (c) is measured. Then, global modeling is carried out through a Transformer model. The method mainly comprises the steps of Unfold, transform module calculation and Fold operation. For the output X obtained after local modeling local First, an operation of Unfo1d is performed to convert an image into sequence data that can be processed by a transform, and a self-attention calculation is performed. Mixing X local Divided into non-overlapping patches obtained after the Unfold operation
Figure SMS_45
Wherein P = P h ×p w
Figure SMS_46
p h ,p w The height and width of each patch are respectively, P is the number of groups, and N is the number of each group of patches. Then the data X in each group p Inputting the data into a Transformer module for calculation, and learning the global characterization. Then performing Fold operation for reduction to obtain->
Figure SMS_47
Finally, dimension reduction is carried out through convolution operation of 1 × 1, splicing is carried out through splicing operation and original input X, and then fusion is carried out through convolution operation of n × n, wherein the specific operation is as follows:
X_o = f_{n×n}([X, f_{1×1}(X_fold)])
where f_{n×n} is an n×n convolution operation and f_{1×1} is a 1×1 convolution operation.
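The Unfold and Fold operations above can be sketched in NumPy (a minimal illustration with hypothetical helper names; the patent gives no code). Pixels occupying the same position within each patch are grouped together, so each of the P groups holds N patch entries for the Transformer:

```python
import numpy as np

def unfold(x, ph, pw):
    # x: (H, W, d) feature map; ph, pw: patch height and width.
    # Returns (P, N, d): P = ph*pw in-patch pixel positions (groups),
    # N = (H/ph)*(W/pw) patches per group.
    H, W, d = x.shape
    nh, nw = H // ph, W // pw
    x = x.reshape(nh, ph, nw, pw, d)        # split into patches
    x = x.transpose(1, 3, 0, 2, 4)          # (ph, pw, nh, nw, d)
    return x.reshape(ph * pw, nh * nw, d)

def fold(xp, H, W, ph, pw):
    # Inverse of unfold: (P, N, d) -> (H, W, d).
    nh, nw = H // ph, W // pw
    d = xp.shape[-1]
    x = xp.reshape(ph, pw, nh, nw, d).transpose(2, 0, 3, 1, 4)
    return x.reshape(H, W, d)
```

Fold exactly inverts Unfold, so the spatial layout is restored before the final 1×1 convolution and splicing with X.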
On the basis of MobileViT, in order to better capture information, the method adds a channel attention mechanism to some of the MobileViT blocks of the MobileViT model.
The following is a process of inputting an image and obtaining a recognition output through network model prediction.
(1) An image is input, as shown in fig. 1, with size 224 × 224 × 3, indicating that the image has a height and width of 224, 3 channels, and the category "Apple_scab". The computation first passes through a 3 × 3 convolutional layer, then through a BatchNorm layer and the SiLU activation function.
(2) In the first stage, computation is performed by a MobileNetv2 module: dimension raising is first performed by a 1 × 1 convolutional layer; then a 3 × 3 depthwise convolution is computed, in which the numbers of input and output channels are equal and each channel uses only one convolution kernel, reducing the parameter count; finally, dimension reduction is performed through a 1 × 1 convolution.
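The depthwise convolution in this module can be sketched in NumPy (an illustrative sketch with 'same' padding and stride 1; kernel values and helper names are assumptions, not from the patent):

```python
import numpy as np

def depthwise_conv(x, w):
    # x: (H, W, C) input; w: (k, k, C) — one k×k kernel per channel,
    # so input and output channel counts are equal and no channel mixing occurs.
    H, W, C = x.shape
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # 'same' padding
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):
            out += xp[i:i + H, j:j + W, :] * w[i, j]  # per-channel multiply-accumulate
    return out
```

A k×k depthwise convolution uses k·k·C weights instead of the k·k·C·C of a standard convolution, which is where the parameter saving comes from.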
(3) In the second stage, three MobileNetv2 modules are used; the calculation of each is similar to step (2), except that the output image of the current stage has 48 channels and size 56 × 56, whereas in step (2) the output image has 32 channels and size 112 × 112.
(4) In the third stage, a MobileNetv2 module is used first; the specific process is similar to step (2), except that the output image of the current stage has 64 channels and size 28 × 28. Computation then proceeds through a modified MobileViT block, whose structure is shown in fig. 2. In the modified MobileViT block, the channel attention mechanism is computed first. Assume the input is
X ∈ R^{H×W×C}.
The input X is processed by global average pooling and max pooling respectively, compressing the spatial information into the channel dimension to obtain z_avg ∈ R^C and z_max ∈ R^C. The two vectors are then spliced, fused by a one-dimensional convolution, and the attention score s_c is computed through a sigmoid function:

s_c = σ(f1d_k([z_avg, z_max])), s_c ∈ R^C

Finally, the attention score is multiplied by the input to obtain the output X' = s_c · X, X' ∈ R^{H×W×C}.
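This channel attention can be sketched in NumPy (a minimal illustration; the kernel shape and 'same'-padded one-dimensional convolution are assumptions about how the spliced pooled vectors are fused, since the patent gives no code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, kernel):
    # x: (H, W, C) input; kernel: (2, k) 1-D conv weights with odd k —
    # one row for the average-pooled vector, one for the max-pooled vector.
    z = np.stack([x.mean(axis=(0, 1)),      # global average pooling -> (C,)
                  x.max(axis=(0, 1))])      # global max pooling     -> (C,)
    k = kernel.shape[1]
    pad = k // 2
    zp = np.pad(z, ((0, 0), (pad, pad)))    # 'same' padding along the channel axis
    C = z.shape[1]
    fused = np.array([(zp[:, i:i + k] * kernel).sum() for i in range(C)])
    s = sigmoid(fused)                      # channel attention scores (C,)
    return x * s                            # reweight each channel of x
```

The score vector broadcasts over the spatial dimensions, so each channel is scaled by its own weight.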
After the channel attention calculation, the local characterization is computed, first through a 3 × 3 convolution module and then through a 1 × 1 convolution module. The global characterization is computed next: the previous output is divided into several non-overlapping patches by the Unfold operation, yielding X_p ∈ R^{P×N×d}. The data X_p in each group is then input into a Transformer module for calculation to learn the global characterization; the grouping is shown in fig. 3, where patches of the same color belong to the same group. The output is restored through the Fold operation, computed through a 1 × 1 convolution, spliced with the input, and the information is fused through a 3 × 3 convolution to obtain the output.
(5) In the fourth stage, a MobileNetv2 module is used first, followed by an improved MobileViT block; the specific process is similar to step (4), except that the output image of the current stage has 80 channels and size 14 × 14, whereas in step (4) the output image has 64 channels and size 28 × 28.
(6) In the fifth stage, a MobileNetv2 module is used first, followed by a MobileViT block; the specific process is similar to step (4), except that no channel attention mechanism is added, and the output image of the current stage has 96 channels and size 7 × 7.
(7) The output from the previous step is first computed through a 1 × 1 convolution and then input into the attention mechanism. Assume the previous output is
X_l ∈ R^{H×W×C},
where H, W and C are respectively the height, width and number of channels. Channel attention is computed first: global average pooling and max pooling yield z_avg ∈ R^C and z_max ∈ R^C respectively, which are then spliced, fused through a one-dimensional convolution, and the attention score is finally computed through a sigmoid function. The specific calculation flow is as follows:

s_c = σ(f1d_k([z_avg, z_max]))

The attention score is then multiplied by the input to obtain the output X_c ∈ R^{H×W×C}.
After the channel attention is finished, the spatial attention is computed: pooling operations over the channel attention output yield z'_avg ∈ R^{H×W} and z'_max ∈ R^{H×W}, which are spliced, processed through several convolution kernels, the results added and fused, and the attention score computed through a sigmoid function. The specific calculation process is as follows:

s_sp = σ(Σ_i f^{k_i×k_i}_{d_i}([z'_avg, z'_max]))

The attention score is then multiplied by the channel attention output to obtain the output X_s ∈ R^{H×W×C}.
Finally, X_s is added to the input X_l to obtain the attention mechanism output.
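The multi-branch dilated spatial attention of this step can be sketched in NumPy (branch kernel sizes, dilation values, and weights are illustrative assumptions; the patent specifies the structure but not concrete values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dilated_conv2d(z, w, d):
    # z: (H, W, 2) pooled maps; w: (k, k, 2) kernel; d: dilation value;
    # 'same' padding, single output channel.
    H, W, _ = z.shape
    k = w.shape[0]
    pad = d * (k // 2)
    zp = np.pad(z, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((H, W))
    for i in range(k):
        for j in range(k):
            out += (zp[i * d:i * d + H, j * d:j * d + W, :] * w[i, j]).sum(axis=-1)
    return out

def spatial_attention(x, branches):
    # x: (H, W, C); branches: list of (kernel, dilation) pairs, one per branch.
    z = np.stack([x.mean(axis=2), x.max(axis=2)], axis=-1)   # compress channels -> (H, W, 2)
    summed = sum(dilated_conv2d(z, w, d) for w, d in branches)  # add branch outputs
    s = sigmoid(summed)                  # per-pixel attention scores (H, W)
    return x * s[..., None]              # reweight each spatial position
```

Larger dilation values enlarge the receptive field of each branch without adding parameters, which is the stated purpose of the dilated convolution here.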
(8) The output obtained in the previous step is input into the final classifier: in the classifier, global average pooling is performed over the spatial dimensions, followed by a fully connected layer. The number of output elements equals the number of categories, each element being the probability that the input image belongs to that category; the element with the maximum value gives the category predicted by the network model, which is finally output, as shown in fig. 5.
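The classifier head can be sketched as global average pooling followed by a fully connected layer (weight shapes and names are illustrative; the patent does not give code):

```python
import numpy as np

def classify(features, W, b):
    # features: (H, W, C) final feature map; W: (C, num_classes); b: (num_classes,)
    pooled = features.mean(axis=(0, 1))   # global average pooling -> (C,)
    logits = pooled @ W + b               # fully connected layer -> (num_classes,)
    return int(np.argmax(logits))         # index of the predicted class
```

At inference time only the argmax is needed; a softmax over the logits would give the per-class probabilities described above.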
The invention combines the MobileViT model with attention mechanisms to construct a lightweight attention mechanism network for crop disease identification. On the basis of the MobileViT model, the method adds a channel attention mechanism to some of the MobileViT blocks, and adds channel and spatial attention mechanisms at the end of the MobileViT model, so that channel information and spatial information can be better captured.

Claims (6)

1. A lightweight attention mechanism network for crop disease identification, characterized in that an improved attention mechanism is added on the basis of a MobileViT model: a channel attention mechanism is added to some of the MobileViT blocks of the MobileViT model, and a channel attention mechanism and a spatial attention mechanism are added at the end of the MobileViT model;
the channel attention mechanism is based on the CBAM attention mechanism and further comprises a one-dimensional convolution; the channel attention mechanism is used to analyze the relationships among image channels, giving each channel a weight to acquire key information and thereby improve network performance;
the spatial attention mechanism is based on the CBAM attention mechanism and further comprises a multi-branch network structure and dilated convolution; the multi-branch network structure is constructed from convolution kernels of different sizes, and the dilated convolution layer is used to increase the receptive field; the spatial attention mechanism is used to analyze spatial relationships within the image, giving each pixel a weight so that important information is obtained in the spatial dimension.
2. The network of claim 1, wherein the improved attention mechanism based on the MobileViT model is trained and tested with data from the PlantVillage public data set; the training method is as follows:
the public data set PlantVillage is adopted to train the model, where PlantVillage is a public crop disease data set comprising 38 categories; the PlantVillage data set is randomly divided into a training set, a validation set and a test set in a 6:2:2 ratio; the training set is used to train the model, the validation set is used to monitor the state of the model during training, and the test set is used for the final evaluation of the model;
during model training, each sample in the training set consists of an input image and the real class label corresponding to that image; sample data from the training set is input into the model to obtain the model's prediction output, which is a vector: if there are C categories, a vector with C elements is output, each position representing the probability of the corresponding category; the real label of the input image is also a vector of C elements, exactly one of which is 1 and the rest 0, the position of the 1 indicating the real category label of the image; the model's output (i.e. the predicted label) is compared with the real label of the input image, and the loss is computed through a cross entropy loss function, whose formula is
L = −Σ_{c=1}^{C} p_c log(q_c)
where C is the number of classes, p_c is a variable taking the value 0 or 1 (p_c = 1 if c is the true class, otherwise p_c = 0), and q_c is the probability the model predicts for class c; after the loss is computed, the gradients of the model parameters are computed through back propagation, and the network model parameters are updated with an AdamW optimizer; in actual training, a mini-batch mode is adopted, i.e. a batch of sample data is input into the model each time; the specific training process is as follows:
(1) Inputting a batch of image sample data into the model;
(2) Calculating the prediction category of the data through the model;
(3) Comparing the prediction category obtained by the model output with the real category, and calculating the loss through a cross entropy loss function;
(4) Performing back propagation operation, calculating the gradient of the model parameters, and updating the network model parameters by adopting an AdamW optimizer;
(5) The above steps are repeated, finishing when the set number of training iterations is reached.
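The loss in step (3) is the standard cross entropy over predicted class probabilities; a minimal NumPy version of the formula above (the epsilon guard is an implementation detail, not from the patent):

```python
import numpy as np

def cross_entropy(q, p):
    # q: (C,) predicted class probabilities; p: (C,) one-hot true label.
    # L = -sum_c p_c * log(q_c); epsilon guards against log(0).
    return float(-np.sum(p * np.log(q + 1e-12)))
```

Because p is one-hot, the sum reduces to the negative log-probability the model assigns to the true class, which is what back propagation then minimizes.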
3. The lightweight attention mechanism network for crop disease identification according to claim 1, wherein said improved attention mechanism specifically comprises the following steps:
1) A channel attention mechanism is constructed and used for analyzing the relationship among the image channels, a weight is given to each channel, and key information is obtained, so that the performance of the network is improved;
2) A spatial attention mechanism is constructed, giving each pixel a weight; for the input image, the channel information is compressed through global average pooling and max pooling, and a multi-branch network is constructed with convolution kernels of different sizes, so that information is better fused, the characterization capability of the network is improved, and crop disease information is captured; dilated convolution is used to increase the receptive field; the compressed image is processed through several convolution kernels, the results are added and fused, and the attention score is computed through a sigmoid;
3) A lightweight attention mechanism network for crop disease identification is constructed: the MobileViT model is used as the basic model, a channel attention mechanism is added to some of the MobileViT blocks of the MobileViT model, and channel and spatial attention mechanisms are added at the end of the MobileViT model, so that channel and spatial information can be better captured.
4. The network of claim 3, wherein in step 1), the specific steps of constructing the channel attention mechanism are as follows:
suppose that the input image is
Figure QLYQS_2
Wherein H, W and C are respectively the height, width and channel number of the image;
the spatial information of the image is compressed through global average pooling and max pooling operations to obtain z_avg ∈ R^C and z_max ∈ R^C, which respectively denote the results of applying global average pooling and max pooling to the image in the channel attention mechanism; in channel attention, the computation process of global average pooling is, for each channel of the image, to sum and average all H × W elements of that channel; the max pooling calculation is similar, taking the maximum of all elements in each channel; the computed z_avg and z_max are then
spliced and fused by a one-dimensional convolution, and the attention score is computed through a sigmoid function; the specific calculation process is as follows:
s_c = σ(f1d_k([z_avg, z_max]))
where f1d_k is a one-dimensional convolution with kernel size k, σ is the sigmoid function, and s_c ∈ R^C; multiplying the attention score by the input image gives the output X' = s_c · X, X' ∈ R^{H×W×C}.
5. The lightweight attention mechanism network for crop disease identification according to claim 3, wherein in step 2), said spatial attention mechanism is constructed to give each pixel a weight, so that important information is obtained in the spatial dimension; in the spatial attention mechanism, for an input image, the channel information is compressed through global average pooling and max pooling to obtain z'_avg ∈ R^{H×W} and z'_max ∈ R^{H×W} respectively, which denote the results of applying global average pooling and max pooling over the spatial dimension in the spatial attention mechanism; here the computation process of global average pooling is, at each pixel position of the image, to take its C channel values, i.e. C values in total, and compute their average to obtain the average-pooled output; the max pooling calculation is similar, with the averaging replaced by taking the maximum; convolution layers are then applied to the result; to better fuse information, improve the characterization capability of the network and capture crop disease information, a multi-branch structure is adopted; classical models such as GoogLeNet and ResNet adopt multi-branch structures, improving model performance; therefore, in the spatial attention mechanism, for better information extraction, a multi-branch network is constructed with convolution kernels of different sizes: the input is fed into each branch for calculation, the branch outputs are added, and dilated convolution is used to increase the receptive field; after the pooling operations, z'_avg and z'_max are obtained and
spliced, processed through several convolution kernels, the results added and fused, and the attention score computed through a sigmoid; the spatial attention score s_sp is calculated as follows:
s_sp = σ(Σ_i f^{k_i×k_i}_{d_i}([z'_avg, z'_max]))
where s_sp is the attention score, f^{k_i×k_i}_{d_i} denotes a convolution with kernel size k_i × k_i and dilation value d_i, σ is the sigmoid function, and s_sp ∈ R^{H×W}; multiplying the attention score by the input gives the output X'' ∈ R^{H×W×C}.
6. The lightweight attention mechanism network for crop disease identification according to claim 3, wherein in step 3) said MobileViT model contains MV2 blocks and MobileViT blocks; MV2 is the inverted residual structure from MobileNetv2; the MobileViT block is the core module of the MobileViT model and is used for learning the local and global representations; the MobileViT block for constructing the crop disease identification model comprises local modeling and global modeling:
(1) Local modeling: for an input image X ∈ R^{H×W×C},
where H, W and C are respectively the height, width and number of channels; first, a local modeling operation F_local is performed: the input first passes through an n×n convolution module f_{n×n} to learn local representations (n is typically 3), and then through a 1×1 convolution module f_{1×1} for dimension raising, giving the locally modeled output X_local; the specific calculation process is as follows:
X_local = f_{1×1}(f_{n×n}(X)), X_local ∈ R^{H×W×d}
where d is the dimension of X_local, i.e. the number of output channels obtained after the local modeling calculation;
(2) Global modeling: this comprises the Unfold operation, the Transformer module calculation, and the Fold operation; the output X_local obtained after local modeling undergoes the Unfold operation, converting the image into sequence data that the Transformer can process, after which the self-attention operation is performed; X_local is divided into non-overlapping patches, yielding after the Unfold operation X_p ∈ R^{P×N×d},
where P = p_h × p_w, N = HW / P, p_h and p_w are respectively the height and width of each patch, P is the number of groups, and N is the number of patches in each group; the data X_p in each group is input into a Transformer module for calculation to learn the global representation; the Fold operation is then performed for restoration, yielding X_fold ∈ R^{H×W×d};
dimension reduction is performed through a 1 × 1 convolution, the result is spliced with the original input X through a splicing operation, and fusion is carried out through an n × n convolution; the specific operation is as follows:
X_o = f_{n×n}([X, f_{1×1}(X_fold)])
where f_{n×n} is an n × n convolution operation and f_{1×1} is a 1 × 1 convolution operation.
CN202211622568.8A 2022-12-16 2022-12-16 Lightweight attention mechanism network for crop disease identification Pending CN115965864A (en)


Publications (1)

Publication Number Publication Date
CN115965864A true CN115965864A (en) 2023-04-14


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117081806A (en) * 2023-08-18 2023-11-17 四川农业大学 Channel authentication method based on feature extraction
CN117274184A (en) * 2023-09-19 2023-12-22 河北大学 Kidney cancer PET-CT image-specific prediction ki-67 expression method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination