CN115965864A - Lightweight attention mechanism network for crop disease identification - Google Patents

Lightweight attention mechanism network for crop disease identification

Info

Publication number
CN115965864A
CN115965864A (application CN202211622568.8A)
Authority
CN
China
Prior art keywords
model
attention mechanism
channel
mobilevit
convolution
Prior art date
Legal status
Pending
Application number
CN202211622568.8A
Other languages
Chinese (zh)
Inventor
Zhang Defu (张德富)
Zhong Renhao (仲仁豪)
Current Assignee
Xiamen University
Original Assignee
Xiamen University
Priority date
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202211622568.8A priority Critical patent/CN115965864A/en
Publication of CN115965864A publication Critical patent/CN115965864A/en
Pending legal-status Critical Current

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A — TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 40/00 — Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A 40/10 — Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in agriculture

Landscapes

  • Image Analysis (AREA)

Abstract

A lightweight attention mechanism network for crop disease identification, relating to the field of deep learning. Based on the MobileViT model, a channel attention mechanism is added to some of the MobileViT blocks, and channel and spatial attention mechanisms are added at the end of the model. The network is built on the lightweight Transformer model MobileViT, which can effectively learn both local and global representations. To better capture crop disease information, an improved attention mechanism is added to the model. All data of the PlantVillage public data set, comprising 38 classes in total, are used for training and testing; evaluation on this data set yields a recognition accuracy of 99.60%, so crop diseases are identified effectively.

Description

Lightweight attention mechanism network for crop disease identification
Technical Field
The invention relates to the field of deep learning, in particular to a lightweight attention mechanism network for crop disease identification, and belongs to the application of a deep learning model in the field of crop disease identification.
Background
Crop diseases impair the growth of crops, reduce yield, and degrade quality. Given the importance of agriculture and the serious harm diseases cause to crops, rapid and accurate identification of crop diseases is essential.
A key step in addressing crop diseases is to identify them quickly and accurately and then treat them according to the symptoms. Crop diseases are hard to judge by eye, and accurate, efficient diagnosis remains a major challenge. In recent years deep learning has developed rapidly and is applied in many fields. In image recognition, convolutional neural networks work well: they effectively extract image features and classify images. Researchers have proposed a variety of convolutional neural networks, such as VGGNet, GoogLeNet, and ResNet. VGGNet is a relatively deep model with a simple network structure that achieves good results. GoogLeNet is built from Inception modules; each module uses a multi-branch structure in which several convolutional layers extract different information, improving the representational capacity of the network. ResNet introduced residual connections, which allow much deeper convolutional neural networks to be built with good results. Convolutional neural networks have long dominated image recognition, but in recent years the Transformer model has also been applied to computer vision. The Transformer achieved superior results in natural language processing (NLP); its core component is the self-attention mechanism, which differs from both convolutional and recurrent neural networks. The Vision Transformer (ViT) applies the Transformer structure to visual tasks and obtains good results.
The ViT model divides a picture into non-overlapping patches, applies a linear mapping to each patch, and feeds the result into a Transformer for computation. An image patch is analogous to a token in an NLP task.
With the rapid development of deep learning, its applications have broadened, and it is increasingly used to identify crop diseases. Many deep learning models identify crop diseases accurately but have large parameter counts and high demands on compute and storage, which makes them hard to deploy on mobile and embedded devices. The Transformer model likewise performs well on vision tasks and can learn global features, but its large parameter count also hinders use on mobile and embedded devices. It is therefore significant to design a lightweight model that identifies crop diseases effectively.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a lightweight attention mechanism network for crop disease identification. It is built on a lightweight network model with an added attention mechanism, so that crop diseases are identified better while the parameter count and complexity stay low enough for deployment on mobile and embedded devices.
In the lightweight attention mechanism network for crop disease identification, a channel attention mechanism is added to some of the MobileViT blocks of a MobileViT model, and channel and spatial attention mechanisms are added at the end of the model;
the channel attention mechanism is based on the CBAM attention mechanism and additionally uses a one-dimensional convolution; it analyses the relationships among image channels and assigns each channel a weight to capture key information, improving network performance;
the spatial attention mechanism is based on the CBAM attention mechanism and additionally comprises a multi-branch network structure and dilated convolutions; the multi-branch structure is built from convolution kernels of different sizes, and the dilated convolution layers enlarge the receptive field; the spatial attention mechanism analyses spatial relationships in the image and assigns each pixel a weight, capturing important information along the spatial dimensions.
The MobileViT model can effectively learn local and global representations. To better capture crop disease information, an improved attention mechanism is added to the model, and all data of the PlantVillage public data set are used for training and testing. The training method comprises the following steps:
the model is trained on the public PlantVillage data set, a public crop disease data set comprising 38 classes; the PlantVillage data set is randomly divided into a training set, a validation set, and a test set in a 6:2:2 ratio; the training set trains the model, the validation set monitors the state of the model during training, and the test set evaluates the final model;
in training, each sample in the training set consists of an input image and the image's true class label; feeding a training sample into the model yields a prediction, a vector with C elements (one per class), where each position gives the probability of that class; the true label of the input image is also a C-element vector in which exactly one element is 1 and the rest are 0, the position of the 1 indicating the true class; the model output (i.e. the predicted label) is compared with the true label, and the loss is computed with the cross-entropy loss function

L = -∑_{c=1}^{C} p_c · log(q_c)

where C is the number of classes, p_c is a variable taking the value 0 or 1 (p_c = 1 if c is the true class, otherwise p_c = 0), and q_c is the probability the model predicts for class c; after the loss is computed, the gradients of the model parameters are obtained by backpropagation, and the parameters are updated with the AdamW optimizer; training uses mini-batches, i.e. a batch of samples is fed to the model each time, and the specific training process is as follows:
(1) Input a batch of image samples into the model.
(2) The model computes the predicted class for each sample.
(3) Compare the predicted classes with the true classes and compute the loss with the cross-entropy loss function.
(4) Backpropagate, compute the gradients of the model parameters, and update the parameters with the AdamW optimizer.
(5) Repeat the steps above until the set number of training iterations is reached.
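The five steps above can be sketched as follows. This is a minimal illustration assuming NumPy, a toy linear softmax classifier in place of the MobileViT network, plain gradient descent in place of AdamW, and random data in place of PlantVillage images — every name and size here is an illustrative assumption, not part of the invention.

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, B = 38, 16, 8                       # 38 classes, as in PlantVillage
W = np.zeros((D, C))                      # toy model: one linear layer

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(q, y):
    # L = -sum_c p_c log q_c with one-hot p; averaged over the batch
    return -np.log(q[np.arange(len(y)), y] + 1e-12).mean()

X = rng.normal(size=(B, D))               # (1) a batch of samples
y = rng.integers(0, C, size=B)            # true class labels

losses = []
for step in range(200):
    q = softmax(X @ W)                    # (2) predicted class probabilities
    losses.append(cross_entropy(q, y))    # (3) cross-entropy loss
    p = np.eye(C)[y]                      # one-hot true labels
    grad = X.T @ (q - p) / B              # (4) gradient of the loss
    W -= 0.5 * grad                       #     parameter update (toy optimizer)
# (5) the loop ends after the set number of iterations
```

With zero-initialised weights the first loss equals log 38 (a uniform prediction over the 38 classes), and it falls steadily as the loop repeats steps (1)–(4).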
The details and key points of the network model of the invention are described below. Its core parts are the attention mechanism and the MobileViT block, where the attention mechanism comprises a channel attention mechanism and a spatial attention mechanism.
1) A channel attention mechanism is constructed that analyses the relationships among image channels and assigns each channel a weight to capture key information, improving network performance;
2) A spatial attention mechanism is constructed that assigns each pixel a weight. For the input image, channel information is compressed by global average pooling and max pooling; a multi-branch network built from convolution kernels of different sizes fuses information better, improves the representational capacity of the network, and captures crop disease information; dilated convolutions enlarge the receptive field; the compressed maps are processed by several convolution kernels, the results are summed and fused, and a sigmoid yields the attention score;
3) A lightweight attention mechanism network for crop disease identification is constructed: the MobileViT model is the base model, a channel attention mechanism is added to some of its MobileViT blocks, and channel and spatial attention mechanisms are added at the end of the model, so that channel and spatial information are captured better.
In step 1), the channel attention mechanism is built on the CBAM attention mechanism. In CBAM, channel attention uses a multilayer perceptron (MLP) with two fully connected layers; here a one-dimensional convolution is introduced to alleviate the large parameter count.

Suppose the input image is X ∈ R^{H×W×C}, where H, W, and C are the height, width, and number of channels of the image. Global average pooling and max pooling compress the spatial information of the image into

F_avg^ch ∈ R^C,  F_max^ch ∈ R^C

where F_avg^ch and F_max^ch are the results of applying global average pooling and max pooling to the image in the channel attention mechanism. In channel attention, global average pooling sums and averages, for each channel, all H × W elements of that channel; max pooling is computed similarly by taking the maximum of all elements in each channel. F_avg^ch and F_max^ch are concatenated, fused by a one-dimensional convolution, and passed through a sigmoid function to give the attention score:

s_ch = σ(f_k^{1D}([F_avg^ch; F_max^ch]))

where f_k^{1D} is a one-dimensional convolution with kernel size k, σ is the sigmoid function, and s_ch ∈ R^C. Multiplying the attention score by the input image gives the output X' = s_ch ⊗ X.
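A small sketch of this channel attention computation, assuming NumPy and a fixed toy kernel for the one-dimensional convolution — the kernel size k = 3 and the constant weights are illustrative assumptions, not values from the invention.

```python
import numpy as np

def channel_attention(x, kernel=None):
    """x: feature map of shape (C, H, W); returns (weighted x, channel scores)."""
    C = x.shape[0]
    f_avg = x.mean(axis=(1, 2))           # global average pooling, shape (C,)
    f_max = x.max(axis=(1, 2))            # global max pooling, shape (C,)
    z = np.stack([f_avg, f_max])          # concatenated descriptor, (2, C)
    if kernel is None:                    # toy fixed weights standing in for a
        kernel = np.full((2, 3), 0.5)     # learned 1-D conv (2 maps, k = 3)
    k = kernel.shape[1]
    pad = k // 2
    zp = np.pad(z, ((0, 0), (pad, pad)))  # "same" padding along the channel axis
    fused = np.array([(zp[:, c:c + k] * kernel).sum() for c in range(C)])
    s = 1.0 / (1.0 + np.exp(-fused))      # sigmoid attention score per channel
    return x * s[:, None, None], s

x = np.random.default_rng(1).normal(size=(8, 4, 4))
y, s = channel_attention(x)               # each of the 8 channels gets a weight
```

The 1-D convolution slides a small window over the channel axis, so the parameter count depends only on k, not on C — the motivation for replacing CBAM's two fully connected layers.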
In step 2), the spatial attention mechanism is constructed. Unlike the channel attention mechanism, spatial attention models the relationships between spatial locations and assigns each pixel a weight, capturing important information along the spatial dimensions. In the spatial attention mechanism, the channel information of the input image is compressed by global average pooling and max pooling, giving

F_avg^sp ∈ R^{H×W},  F_max^sp ∈ R^{H×W}

where F_avg^sp and F_max^sp are the results of applying global average pooling and max pooling along the channel dimension in the spatial attention mechanism. Here global average pooling takes, for each pixel position in the image, the mean of its C channel values; max pooling is computed similarly, with the mean replaced by the maximum. The pooled maps are then processed by convolution layers. To fuse information better, improve the representational capacity of the network, and capture crop disease information, a multi-branch structure is adopted; classical models such as GoogLeNet and ResNet use multi-branch structures and improve performance. Hence, for better information extraction, the spatial attention mechanism builds a multi-branch network from convolution kernels of different sizes: the input is fed into each branch separately, the branch outputs are summed, and dilated convolutions enlarge the receptive field. The pooled maps F_avg^sp and F_max^sp are concatenated, processed by the several convolution kernels, summed and fused, and passed through a sigmoid to give the attention score:

s_sp = σ( ∑_i f_{k_i,d_i}([F_avg^sp; F_max^sp]) )

where s_sp is the attention score, f_{k_i,d_i} is a convolution with kernel size k_i × k_i and dilation rate d_i, σ is the sigmoid function, and s_sp ∈ R^{H×W}. Multiplying the attention score by the input gives the output X' = s_sp ⊗ X.
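A small sketch of this spatial attention computation, assuming NumPy, additive fusion of the two pooled maps in place of concatenation, and two mean-filter branches — the kernel sizes (3 and 5) and the dilation rate 2 on the first branch are illustrative choices, not values from the invention.

```python
import numpy as np

def conv2d_same(x, k, dilation=1):
    """Naive 'same' 2-D mean filter of one map (a toy stand-in for a learned kernel)."""
    eff = (k - 1) * dilation + 1          # effective extent of the dilated kernel
    pad = eff // 2
    xp = np.pad(x, pad)
    H, W = x.shape
    out = np.zeros_like(x)
    for i in range(k):                    # accumulate the k*k (dilated) taps
        for j in range(k):
            di, dj = i * dilation, j * dilation
            out += xp[di:di + H, dj:dj + W]
    return out / (k * k)

def spatial_attention(x):
    """x: feature map (C, H, W); returns (weighted x, per-pixel score map)."""
    f_avg = x.mean(axis=0)                # average over channels, (H, W)
    f_max = x.max(axis=0)                 # max over channels, (H, W)
    desc = f_avg + f_max                  # toy additive fusion of the two maps
    branches = conv2d_same(desc, 3, dilation=2) + conv2d_same(desc, 5)
    s = 1.0 / (1.0 + np.exp(-branches))   # sigmoid score per pixel
    return x * s[None, :, :], s

x = np.random.default_rng(2).normal(size=(8, 6, 6))
y, s = spatial_attention(x)               # every pixel position gets a weight
```

Dilating the 3 × 3 branch stretches its taps over a 5 × 5 window without adding parameters, which is exactly how the dilated convolution enlarges the receptive field.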
In step 3), the MobileViT model contains MV2 blocks and MobileViT blocks. MV2 is the inverted residual structure from MobileNetv2. The MobileViT block is the core module of the MobileViT model and learns local and global representations. The crop disease identification model involves local modeling and global modeling:

(1) Local modeling: for an input image X ∈ R^{H×W×C}, where H, W, and C are the height, width, and number of channels, the local modeling operation F_local is performed first. It learns local representations through an n × n convolution module f_{n×n}, then raises the dimension through a 1 × 1 convolution module f_{1×1}, giving the locally modeled output X_local:

X_local = f_{1×1}(f_{n×n}(X)),  X_local ∈ R^{H×W×d}

where d is the dimension of X_local;

(2) Global modeling: this comprises an Unfold step, computation by a Transformer module, and a Fold step. The output X_local obtained from local modeling is unfolded, turning the image into sequence data a Transformer can process, and self-attention is computed. X_local is divided into non-overlapping patches; the Unfold operation yields

X_p ∈ R^{P×N×d}

where P = p_h × p_w, N = HW / P, p_h and p_w are the height and width of each patch, P is the number of groups, and N is the number of patches per group. The data X_p in each group are fed into a Transformer module for computation, learning the global representation; a Fold operation then restores X_fold ∈ R^{H×W×d}. Finally a 1 × 1 convolution reduces the dimension, the result is concatenated with the original input X, and an n × n convolution fuses them:

X_o = f_{n×n}([X, f_{1×1}(X_fold)])

where f_{n×n} is an n × n convolution and f_{1×1} is a 1 × 1 convolution.
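The Unfold and Fold rearrangements above can be sketched as follows, assuming NumPy and an identity stand-in for the Transformer module; the patch sizes p_h = p_w = 2 and the tensor sizes are illustrative choices.

```python
import numpy as np

def unfold(x, ph, pw):
    """(H, W, d) -> (P, N, d) with P = ph*pw groups of N = (H//ph)*(W//pw) each."""
    H, W, d = x.shape
    x = x.reshape(H // ph, ph, W // pw, pw, d)        # split into non-overlapping patches
    x = x.transpose(1, 3, 0, 2, 4)                    # group pixels by in-patch position
    return x.reshape(ph * pw, (H // ph) * (W // pw), d)

def fold(xp, H, W, ph, pw):
    """Inverse of unfold: (P, N, d) -> (H, W, d)."""
    d = xp.shape[-1]
    x = xp.reshape(ph, pw, H // ph, W // pw, d)
    x = x.transpose(2, 0, 3, 1, 4)                    # undo the grouping
    return x.reshape(H, W, d)

x = np.random.default_rng(3).normal(size=(4, 6, 5))   # H=4, W=6, d=5
xp = unfold(x, 2, 2)                                  # P=4 groups, N=6 patches each
# ... each group xp[i] would pass through the Transformer module here ...
y = fold(xp, 4, 6, 2, 2)                              # restore the (H, W, d) layout
```

Because Fold exactly inverts Unfold, the round trip with an identity Transformer reproduces the input, which gives a convenient correctness check for the rearrangement.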
Compared with the prior art, the invention has the following outstanding technical effects and advantages:
the method is built based on a lightweight Transformer model, namely a MobileViT network model, the model can effectively learn local representation and global representation, in order to better capture disease information of crops, an improved attention mechanism is added into the model, then all data of a PlantVillage public data set are used for training and testing, the total number of the classes is 38, and the model achieves 99.60% identification accuracy and shows the effectiveness of the network model in the invention through evaluation and verification on the PlantVillage public data set.
The present invention compares favourably with existing work. In "Using Deep Learning for Image-Based Plant Disease Detection", the authors use the GoogLeNet model to reach 99.35% accuracy on the PlantVillage public data set. In "Tomato crop disease classification using pre-trained deep learning algorithm", Rangarajan et al. select the tomato images in the PlantVillage data set and obtain 96.19% accuracy with a VGG16 model. In "Grape disease image classification based on lightweight convolution neural networks and channel attention", a channel attention mechanism is added to ShuffleNet and the grape images in the PlantVillage data set are recognized with 99.14% accuracy. Compared with this prior art, the present method, on the one hand, uses the MobileViT model, so the network can effectively learn local and global representations. On the other hand, since some crop diseases are small and hard to identify, an improved attention mechanism comprising channel and spatial attention is added; it considers the channel and spatial dimensions simultaneously, so the model can focus on the diseased regions in a crop disease picture and identify crop diseases effectively.
The invention studies convolutional neural networks and the Transformer model, both of which perform well in image recognition and can be used to identify crop diseases. Since some deep learning models have too many parameters to run on embedded and mobile devices, a lightweight model is chosen for identifying crop diseases. In addition, an attention mechanism is studied and added to the model to learn important information and improve its representational capacity, so that crop diseases are identified effectively.
Drawings
Fig. 1 is an input picture of crop diseases.
FIG. 2 is a block diagram of a MobileViT block with added channel attention mechanism.
Fig. 3 shows how groups are divided when computing with the Transformer module in the (modified) MobileViT block, where each color is one group.
FIG. 4 is a diagram of an attention mechanism including channel attention and spatial attention.
FIG. 5 is a graph of the results of the network model prediction output.
Detailed Description
In order to explain the present invention in more detail, the following detailed description is made with reference to the accompanying drawings and examples.
First, a method of training a network model in the present invention will be described.
In the invention, the model is trained on the public PlantVillage data set, a public crop disease data set comprising 38 classes. The PlantVillage data set is randomly divided into a training set, a validation set, and a test set in a 6:2:2 ratio. The training set trains the model, the validation set monitors the state of the model during training, and the test set evaluates the final model.
In training, each sample in the training set consists of an input image and the image's true class label. Feeding a training sample into the model yields a prediction: a vector with C elements, one per class, each position giving the probability of that class. The true label of the input image is also a C-element vector in which exactly one element is 1 and the rest are 0, the position of the 1 indicating the true class. The model output (i.e. the predicted label) is compared with the true label, and the loss is computed with the cross-entropy loss function

L = -∑_{c=1}^{C} p_c · log(q_c)

where C is the number of classes, p_c is a variable taking the value 0 or 1 (p_c = 1 if c is the true class, otherwise p_c = 0), and q_c is the probability the model predicts for class c. After the loss is computed, the gradients of the model parameters are obtained by backpropagation, and the AdamW optimizer updates the network parameters. Training uses mini-batches, i.e. a batch of samples is fed to the model each time; the specific training process is as follows.
(1) Input a batch of image samples into the model.
(2) The model computes the predicted class for each sample.
(3) Compare the predicted classes with the true classes and compute the loss with the cross-entropy loss function.
(4) Backpropagate, compute the gradients of the model parameters, and update the parameters with the AdamW optimizer.
(5) Repeat the steps above until the set number of training iterations is reached.
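As a small numeric check of the cross-entropy formula above, here is a sketch with made-up probabilities and C = 3 classes; the one-hot label p picks out the predicted probability of the true class, so the loss reduces to -log(q_true).

```python
import numpy as np

# L = -sum_c p_c * log(q_c) with C = 3 illustrative classes
p = np.array([0.0, 1.0, 0.0])          # true class is c = 1 (one-hot label)
q = np.array([0.1, 0.7, 0.2])          # model's predicted probabilities
loss = -np.sum(p * np.log(q))          # only the true class contributes: -log(0.7)
```

The loss shrinks toward 0 as the model assigns the true class a probability closer to 1, which is what the optimizer drives it to do.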
The details and key points of the network model of the invention are explained below.
The invention constructs an effective attention mechanism and adds it to the network model, improving network performance and capturing crop disease information effectively. Attention mechanisms are widely used in natural language processing, computer vision, and other fields. An attention mechanism can highlight important information and suppress unimportant information, letting the model make more accurate judgments. Introducing an attention mechanism into the crop disease identification model lets the model capture disease information effectively and improves recognition accuracy.
The invention mainly improves the CBAM attention mechanism, preserving performance without excessive complexity. CBAM comprises a channel attention mechanism and a spatial attention mechanism. The channel attention mechanism analyses the relationships among image channels and assigns each channel a weight to capture key information, improving network performance. The attention mechanism constructed by the invention is shown in Fig. 4: the upper part shows the computation of the channel attention mechanism and the lower part that of the spatial attention mechanism; the outputs of the two mechanisms applied to the input image are added to the input image to give the final output. The specific computations of the two mechanisms are described below.
The channel attention constructed by the invention is mainly an improvement of the CBAM attention mechanism. In CBAM, channel attention uses a multilayer perceptron (MLP) with two fully connected layers, which brings a large number of parameters. Following the idea of ECA-Net, the invention introduces a one-dimensional convolution to alleviate this. Suppose the input image is X ∈ R^{H×W×C}, where H, W, and C are the height, width, and number of channels of the image. Global average pooling and max pooling compress the spatial information of the image: average pooling summarizes the overall information, and adding max pooling captures the salient information; combining the two better improves the model. The two poolings give F_avg^ch, F_max^ch ∈ R^C, which are concatenated, fused by a one-dimensional convolution, and finally passed through a sigmoid function to give the attention score:

s_ch = σ(f_k^{1D}([F_avg^ch; F_max^ch]))

where f_k^{1D} is a one-dimensional convolution with kernel size k, σ is the sigmoid function, and s_ch ∈ R^C. Multiplying the attention score by the input image gives the output X' = s_ch ⊗ X.
After attention along the channel dimension, a spatial attention mechanism is constructed. Unlike channel attention, spatial attention focuses on the relationships between spatial locations and assigns each pixel a weight, capturing important information along the spatial dimensions. In the spatial attention mechanism, the channel information of the input image is first compressed by global average pooling and max pooling, giving F_avg^sp, F_max^sp ∈ R^{H×W}, which are then processed by convolution layers. To fuse information better, improve the representational capacity of the network, and capture crop disease information, a multi-branch structure is adopted; classical models such as GoogLeNet and ResNet use multi-branch structures, which improve performance. Hence, for better information extraction, the spatial attention mechanism builds a multi-branch network from convolution kernels of different sizes and uses dilated convolutions to enlarge the receptive field. The pooled maps F_avg^sp and F_max^sp are concatenated, processed by the several convolution kernels, summed and fused, and passed through a sigmoid to give the attention score:

s_sp = σ( ∑_i f_{k_i,d_i}([F_avg^sp; F_max^sp]) )

where f_{k_i,d_i} is a convolution with kernel size k_i × k_i and dilation rate d_i, σ is the sigmoid function, and s_sp ∈ R^{H×W}. Multiplying the attention score by the input gives the output X' = s_sp ⊗ X.
The MobileViT model is used as a basic model. It is an advantage for the Transformer model to be able to learn global tokens based on a self-attention mechanism. And for the convolutional neural network, the convolutional neural network has spatial induction bias and can learn local characteristics through fewer parameters. Considering that the convolutional neural network and the Transformer model have respective advantages, the MobileViT model is selected to construct the crop disease identification model, and the MobileViT model combines the advantages of the convolutional neural network and the Transformer model. The main blocks contained in MobileViT are MV2 block and MobileViT block. The MV2 block is a MobileNetv2 block, which is a reciprocal residual structure in MobileNetv 2. In MobileViT, the core module is the MobileViT block, which combines the advantages of CNNs and transformers for learning local and global characterizations. For an input image
Figure SMS_43
Wherein H, W and C are respectively height, width and channel number. First, a local modeling operation F is performed local The step is first passed through an n × n convolution module f n×n Learning local tokens and then passing through a 1 x 1 convolution module f 1, performing dimension raising to obtain output X after local modeling local The specific calculation process is as follows:
Figure SMS_44
wherein d is X local Of (c) is measured. Then, global modeling is carried out through a Transformer model. The method mainly comprises the steps of Unfold, transform module calculation and Fold operation. For the output X obtained after local modeling local First, an operation of Unfo1d is performed to convert an image into sequence data that can be processed by a transform, and a self-attention calculation is performed. Mixing X local Divided into non-overlapping patches obtained after the Unfold operation
Figure SMS_45
Wherein P = P h ×p w
Figure SMS_46
p h ,p w The height and width of each patch are respectively, P is the number of groups, and N is the number of each group of patches. Then the data X in each group p Inputting the data into a Transformer module for calculation, and learning the global characterization. Then performing Fold operation for reduction to obtain->
Figure SMS_47
Finally, dimension reduction is carried out through convolution operation of 1 × 1, splicing is carried out through splicing operation and original input X, and then fusion is carried out through convolution operation of n × n, wherein the specific operation is as follows:
X_o = f_{n×n}([X, f_{1×1}(X_fold)])
where f_{n×n} is an n×n convolution operation and f_{1×1} is a 1×1 convolution operation.
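The Unfold and Fold operations above can be sketched in NumPy (a minimal illustration with hypothetical helper names; the patent gives no code). Pixels occupying the same position within each patch are grouped together, so each of the P groups holds N patch entries for the Transformer:

```python
import numpy as np

def unfold(x, ph, pw):
    # x: (H, W, d) feature map; ph, pw: patch height and width.
    # Returns (P, N, d): P = ph*pw in-patch pixel positions (groups),
    # N = (H/ph)*(W/pw) patches per group.
    H, W, d = x.shape
    nh, nw = H // ph, W // pw
    x = x.reshape(nh, ph, nw, pw, d)        # split into patches
    x = x.transpose(1, 3, 0, 2, 4)          # (ph, pw, nh, nw, d)
    return x.reshape(ph * pw, nh * nw, d)

def fold(xp, H, W, ph, pw):
    # Inverse of unfold: (P, N, d) -> (H, W, d).
    nh, nw = H // ph, W // pw
    d = xp.shape[-1]
    x = xp.reshape(ph, pw, nh, nw, d).transpose(2, 0, 3, 1, 4)
    return x.reshape(H, W, d)
```

Fold exactly inverts Unfold, so the spatial layout is restored before the final 1×1 convolution and splicing with X.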
On the basis of MobileViT, in order to better capture information, the method adds a channel attention mechanism to some of the MobileViT blocks of the MobileViT model.
The following is a process of inputting an image and obtaining a recognition output through network model prediction.
(1) An image is input, as shown in fig. 1, with size 224 × 224 × 3, indicating that the image has a height and width of 224, 3 channels, and the category "Apple_scab". The computation first passes through a 3 × 3 convolutional layer, then through a BatchNorm layer and the SiLU activation function.
(2) In the first stage, computation is performed by a MobileNetv2 module: dimension raising is first performed by a 1 × 1 convolutional layer; then a 3 × 3 depthwise convolution is computed, in which the numbers of input and output channels are equal and each channel uses only one convolution kernel, reducing the parameter count; finally, dimension reduction is performed through a 1 × 1 convolution.
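The depthwise convolution in this module can be sketched in NumPy (an illustrative sketch with 'same' padding and stride 1; kernel values and helper names are assumptions, not from the patent):

```python
import numpy as np

def depthwise_conv(x, w):
    # x: (H, W, C) input; w: (k, k, C) — one k×k kernel per channel,
    # so input and output channel counts are equal and no channel mixing occurs.
    H, W, C = x.shape
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # 'same' padding
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):
            out += xp[i:i + H, j:j + W, :] * w[i, j]  # per-channel multiply-accumulate
    return out
```

A k×k depthwise convolution uses k·k·C weights instead of the k·k·C·C of a standard convolution, which is where the parameter saving comes from.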
(3) In the second stage, three MobileNetv2 modules are used; the calculation of each is similar to step (2), except that the output image of the current stage has 48 channels and size 56 × 56, whereas in step (2) the output image has 32 channels and size 112 × 112.
(4) In the third stage, a MobileNetv2 module is used first; the specific process is similar to step (2), except that the output image of the current stage has 64 channels and size 28 × 28. Computation then proceeds through a modified MobileViT block, whose structure is shown in fig. 2. In the modified MobileViT block, the channel attention mechanism is computed first. Assume the input is
X ∈ R^{H×W×C}.
The input X is processed by global average pooling and max pooling respectively, compressing the spatial information into the channel dimension to obtain z_avg ∈ R^C and z_max ∈ R^C. The two vectors are then spliced, fused by a one-dimensional convolution, and the attention score s_c is computed through a sigmoid function:

s_c = σ(f1d_k([z_avg, z_max])), s_c ∈ R^C

Finally, the attention score is multiplied by the input to obtain the output X' = s_c · X, X' ∈ R^{H×W×C}.
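This channel attention can be sketched in NumPy (a minimal illustration; the kernel shape and 'same'-padded one-dimensional convolution are assumptions about how the spliced pooled vectors are fused, since the patent gives no code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, kernel):
    # x: (H, W, C) input; kernel: (2, k) 1-D conv weights with odd k —
    # one row for the average-pooled vector, one for the max-pooled vector.
    z = np.stack([x.mean(axis=(0, 1)),      # global average pooling -> (C,)
                  x.max(axis=(0, 1))])      # global max pooling     -> (C,)
    k = kernel.shape[1]
    pad = k // 2
    zp = np.pad(z, ((0, 0), (pad, pad)))    # 'same' padding along the channel axis
    C = z.shape[1]
    fused = np.array([(zp[:, i:i + k] * kernel).sum() for i in range(C)])
    s = sigmoid(fused)                      # channel attention scores (C,)
    return x * s                            # reweight each channel of x
```

The score vector broadcasts over the spatial dimensions, so each channel is scaled by its own weight.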
After the channel attention calculation, the local characterization is computed, first through a 3 × 3 convolution module and then through a 1 × 1 convolution module. The global characterization is computed next: the previous output is divided into several non-overlapping patches by the Unfold operation, yielding X_p ∈ R^{P×N×d}. The data X_p in each group is then input into a Transformer module for calculation to learn the global characterization; the grouping is shown in fig. 3, where patches of the same color belong to the same group. The output is restored through the Fold operation, computed through a 1 × 1 convolution, spliced with the input, and the information is fused through a 3 × 3 convolution to obtain the output.
(5) In the fourth stage, a MobileNetv2 module is used first, followed by an improved MobileViT block; the specific process is similar to step (4), except that the output image of the current stage has 80 channels and size 14 × 14, whereas in step (4) the output image has 64 channels and size 28 × 28.
(6) In the fifth stage, a MobileNetv2 module is used first, followed by a MobileViT block; the specific process is similar to step (4), except that no channel attention mechanism is added, and the output image of the current stage has 96 channels and size 7 × 7.
(7) The output from the previous step is first computed through a 1 × 1 convolution and then input into the attention mechanism. Assume the previous output is
X_l ∈ R^{H×W×C},
where H, W and C are respectively the height, width and number of channels. Channel attention is computed first: global average pooling and max pooling yield z_avg ∈ R^C and z_max ∈ R^C respectively, which are then spliced, fused through a one-dimensional convolution, and the attention score is finally computed through a sigmoid function. The specific calculation flow is as follows:

s_c = σ(f1d_k([z_avg, z_max]))

The attention score is then multiplied by the input to obtain the output X_c ∈ R^{H×W×C}.
After the channel attention is finished, the spatial attention is computed: pooling operations over the channel attention output yield z'_avg ∈ R^{H×W} and z'_max ∈ R^{H×W}, which are spliced, processed through several convolution kernels, the results added and fused, and the attention score computed through a sigmoid function. The specific calculation process is as follows:

s_sp = σ(Σ_i f^{k_i×k_i}_{d_i}([z'_avg, z'_max]))

The attention score is then multiplied by the channel attention output to obtain the output X_s ∈ R^{H×W×C}.
Finally, X_s is added to the input X_l to obtain the attention mechanism output.
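The multi-branch dilated spatial attention of this step can be sketched in NumPy (branch kernel sizes, dilation values, and weights are illustrative assumptions; the patent specifies the structure but not concrete values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dilated_conv2d(z, w, d):
    # z: (H, W, 2) pooled maps; w: (k, k, 2) kernel; d: dilation value;
    # 'same' padding, single output channel.
    H, W, _ = z.shape
    k = w.shape[0]
    pad = d * (k // 2)
    zp = np.pad(z, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((H, W))
    for i in range(k):
        for j in range(k):
            out += (zp[i * d:i * d + H, j * d:j * d + W, :] * w[i, j]).sum(axis=-1)
    return out

def spatial_attention(x, branches):
    # x: (H, W, C); branches: list of (kernel, dilation) pairs, one per branch.
    z = np.stack([x.mean(axis=2), x.max(axis=2)], axis=-1)   # compress channels -> (H, W, 2)
    summed = sum(dilated_conv2d(z, w, d) for w, d in branches)  # add branch outputs
    s = sigmoid(summed)                  # per-pixel attention scores (H, W)
    return x * s[..., None]              # reweight each spatial position
```

Larger dilation values enlarge the receptive field of each branch without adding parameters, which is the stated purpose of the dilated convolution here.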
(8) The output obtained in the previous step is input into the final classifier: in the classifier, global average pooling is performed over the spatial dimensions, followed by a fully connected layer. The number of output elements equals the number of categories, each element being the probability that the input image belongs to that category; the element with the maximum value gives the category predicted by the network model, which is finally output, as shown in fig. 5.
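The classifier head can be sketched as global average pooling followed by a fully connected layer (weight shapes and names are illustrative; the patent does not give code):

```python
import numpy as np

def classify(features, W, b):
    # features: (H, W, C) final feature map; W: (C, num_classes); b: (num_classes,)
    pooled = features.mean(axis=(0, 1))   # global average pooling -> (C,)
    logits = pooled @ W + b               # fully connected layer -> (num_classes,)
    return int(np.argmax(logits))         # index of the predicted class
```

At inference time only the argmax is needed; a softmax over the logits would give the per-class probabilities described above.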
The invention combines the MobileViT model with attention mechanisms to construct a lightweight attention mechanism network for crop disease identification. On the basis of the MobileViT model, the method adds a channel attention mechanism to some of the MobileViT blocks, and adds channel and spatial attention mechanisms at the end of the MobileViT model, so that channel information and spatial information can be better captured.

Claims (6)

1. A lightweight attention mechanism network for crop disease identification, characterized in that an improved attention mechanism is added on the basis of a MobileViT model: a channel attention mechanism is added to some of the MobileViT blocks of the MobileViT model, and a channel attention mechanism and a spatial attention mechanism are added at the end of the MobileViT model;
the channel attention mechanism is based on the CBAM attention mechanism and further comprises a one-dimensional convolution; the channel attention mechanism is used to analyze the relationships among image channels, giving each channel a weight to acquire key information and thereby improve network performance;
the spatial attention mechanism is based on the CBAM attention mechanism and further comprises a multi-branch network structure and dilated convolution; the multi-branch network structure is constructed from convolution kernels of different sizes, and the dilated convolution layer is used to increase the receptive field; the spatial attention mechanism is used to analyze spatial relationships within the image, giving each pixel a weight so that important information is obtained in the spatial dimension.
2. The network of claim 1, wherein the improved attention mechanism based on the MobileViT model is trained and tested with data from the PlantVillage public data set; the training method is as follows:
the public data set PlantVillage is adopted to train the model, where PlantVillage is a public crop disease data set comprising 38 categories; the PlantVillage data set is randomly divided into a training set, a validation set and a test set in a 6:2:2 ratio; the training set is used to train the model, the validation set is used to monitor the state of the model during training, and the test set is used for the final evaluation of the model;
during model training, each sample in the training set consists of an input image and the real class label corresponding to that image; sample data from the training set is input into the model to obtain the model's prediction output, which is a vector: if there are C categories, a vector with C elements is output, each position representing the probability of the corresponding category; the real label of the input image is also a vector of C elements, exactly one of which is 1 and the rest 0, the position of the 1 indicating the real category label of the image; the model's output (i.e. the predicted label) is compared with the real label of the input image, and the loss is computed through a cross entropy loss function, whose formula is
L = −Σ_{c=1}^{C} p_c log(q_c)
where C is the number of classes, p_c is a variable taking the value 0 or 1 (p_c = 1 if c is the true class, otherwise p_c = 0), and q_c is the probability the model predicts for class c; after the loss is computed, the gradients of the model parameters are computed through back propagation, and the network model parameters are updated with an AdamW optimizer; in actual training, a mini-batch mode is adopted, i.e. a batch of sample data is input into the model each time; the specific training process is as follows:
(1) Inputting a batch of image sample data into the model;
(2) Calculating the prediction category of the data through the model;
(3) Comparing the prediction category obtained by the model output with the real category, and calculating the loss through a cross entropy loss function;
(4) Performing back propagation operation, calculating the gradient of the model parameters, and updating the network model parameters by adopting an AdamW optimizer;
(5) The above steps are repeated, finishing when the set number of training iterations is reached.
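The loss in step (3) is the standard cross entropy over predicted class probabilities; a minimal NumPy version of the formula above (the epsilon guard is an implementation detail, not from the patent):

```python
import numpy as np

def cross_entropy(q, p):
    # q: (C,) predicted class probabilities; p: (C,) one-hot true label.
    # L = -sum_c p_c * log(q_c); epsilon guards against log(0).
    return float(-np.sum(p * np.log(q + 1e-12)))
```

Because p is one-hot, the sum reduces to the negative log-probability the model assigns to the true class, which is what back propagation then minimizes.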
3. The lightweight attention mechanism network for crop disease identification according to claim 1, wherein said improved attention mechanism specifically comprises the following steps:
1) A channel attention mechanism is constructed and used for analyzing the relationship among the image channels, a weight is given to each channel, and key information is obtained, so that the performance of the network is improved;
2) A spatial attention mechanism is constructed, giving each pixel a weight; for the input image, the channel information is compressed through global average pooling and max pooling, and a multi-branch network is constructed with convolution kernels of different sizes, so that information is better fused, the characterization capability of the network is improved, and crop disease information is captured; dilated convolution is used to increase the receptive field; the compressed image is processed through several convolution kernels, the results are added and fused, and the attention score is computed through a sigmoid;
3) A lightweight attention mechanism network for crop disease identification is constructed: the MobileViT model is used as the basic model, a channel attention mechanism is added to some of the MobileViT blocks of the MobileViT model, and channel and spatial attention mechanisms are added at the end of the MobileViT model, so that channel and spatial information can be better captured.
4. The network of claim 3, wherein in step 1), the specific steps of constructing the channel attention mechanism are as follows:
suppose that the input image is
Figure QLYQS_2
Wherein H, W and C are respectively the height, width and channel number of the image;
the spatial information of the image is compressed through global average pooling and max pooling operations to obtain z_avg ∈ R^C and z_max ∈ R^C, which respectively denote the results of applying global average pooling and max pooling to the image in the channel attention mechanism; in channel attention, the computation process of global average pooling is, for each channel of the image, to sum and average all H × W elements of that channel; the max pooling calculation is similar, taking the maximum of all elements in each channel; the computed z_avg and z_max are then
spliced and fused by a one-dimensional convolution, and the attention score is computed through a sigmoid function; the specific calculation process is as follows:
s_c = σ(f1d_k([z_avg, z_max]))
where f1d_k is a one-dimensional convolution with kernel size k, σ is the sigmoid function, and s_c ∈ R^C; multiplying the attention score by the input image gives the output X' = s_c · X, X' ∈ R^{H×W×C}.
5. The lightweight attention mechanism network for crop disease identification according to claim 3, wherein in step 2), said spatial attention mechanism is constructed to give each pixel a weight, so that important information is obtained in the spatial dimension; in the spatial attention mechanism, for an input image, the channel information is compressed through global average pooling and max pooling to obtain z'_avg ∈ R^{H×W} and z'_max ∈ R^{H×W} respectively, which denote the results of applying global average pooling and max pooling over the spatial dimension in the spatial attention mechanism; here the computation process of global average pooling is, at each pixel position of the image, to take its C channel values, i.e. C values in total, and compute their average to obtain the average-pooled output; the max pooling calculation is similar, with the averaging replaced by taking the maximum; convolution layers are then applied to the result; to better fuse information, improve the characterization capability of the network and capture crop disease information, a multi-branch structure is adopted; classical models such as GoogLeNet and ResNet adopt multi-branch structures, improving model performance; therefore, in the spatial attention mechanism, for better information extraction, a multi-branch network is constructed with convolution kernels of different sizes: the input is fed into each branch for calculation, the branch outputs are added, and dilated convolution is used to increase the receptive field; after the pooling operations, z'_avg and z'_max are obtained and
spliced, processed through several convolution kernels, the results added and fused, and the attention score computed through a sigmoid; the spatial attention score s_sp is calculated as follows:
s_sp = σ(Σ_i f^{k_i×k_i}_{d_i}([z'_avg, z'_max]))
where s_sp is the attention score, f^{k_i×k_i}_{d_i} denotes a convolution with kernel size k_i × k_i and dilation value d_i, σ is the sigmoid function, and s_sp ∈ R^{H×W}; multiplying the attention score by the input gives the output X'' ∈ R^{H×W×C}.
6. The lightweight attention mechanism network for crop disease identification according to claim 3, wherein in step 3) said MobileViT model contains MV2 blocks and MobileViT blocks; MV2 is the inverted residual structure from MobileNetv2; the MobileViT block is the core module of the MobileViT model and is used for learning the local and global representations; the MobileViT block for constructing the crop disease identification model comprises local modeling and global modeling:
(1) Local modeling: for an input image X ∈ R^{H×W×C},
where H, W and C are respectively the height, width and number of channels; first, a local modeling operation F_local is performed: the input first passes through an n×n convolution module f_{n×n} to learn local representations (n is typically 3), and then through a 1×1 convolution module f_{1×1} for dimension raising, giving the locally modeled output X_local; the specific calculation process is as follows:
X_local = f_{1×1}(f_{n×n}(X)), X_local ∈ R^{H×W×d}
where d is the dimension of X_local, i.e. the number of output channels obtained after the local modeling calculation;
(2) Global modeling: this comprises the Unfold operation, the Transformer module calculation, and the Fold operation; the output X_local obtained after local modeling undergoes the Unfold operation, converting the image into sequence data that the Transformer can process, after which the self-attention operation is performed; X_local is divided into non-overlapping patches, yielding after the Unfold operation X_p ∈ R^{P×N×d},
where P = p_h × p_w, N = HW / P, p_h and p_w are respectively the height and width of each patch, P is the number of groups, and N is the number of patches in each group; the data X_p in each group is input into a Transformer module for calculation to learn the global representation; the Fold operation is then performed for restoration, yielding X_fold ∈ R^{H×W×d};
dimension reduction is performed through a 1 × 1 convolution, the result is spliced with the original input X through a splicing operation, and fusion is carried out through an n × n convolution; the specific operation is as follows:
X_o = f_{n×n}([X, f_{1×1}(X_fold)])
where f_{n×n} is an n × n convolution operation and f_{1×1} is a 1 × 1 convolution operation.
CN202211622568.8A 2022-12-16 2022-12-16 Lightweight attention mechanism network for crop disease identification Pending CN115965864A (en)


Publications (1)

Publication Number Publication Date
CN115965864A true CN115965864A (en) 2023-04-14


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117081806A (en) * 2023-08-18 2023-11-17 四川农业大学 Channel authentication method based on feature extraction
CN117274184A (en) * 2023-09-19 2023-12-22 河北大学 Kidney cancer PET-CT image-specific prediction ki-67 expression method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination