CN117994608A

CN117994608A - Wheat disease image recognition method, device, medium and equipment

Info

Publication number: CN117994608A
Application number: CN202410143203.XA
Authority: CN
Inventors: 侯志松; 白玉鹏; 周浩宇; 赵珍威; 李成林; 高国红
Original assignee: Henan Institute of Science and Technology
Current assignee: Henan Institute of Science and Technology
Priority date: 2024-02-01
Filing date: 2024-02-01
Publication date: 2024-05-07

Abstract

The invention discloses a wheat disease image recognition method, device, medium and equipment, and relates to the technical field of image recognition. The method adopts a Vision Transformer model and adds a downsampling layer and an upsampling layer to the image block embedding layer of the model to construct a wheat disease identification model. The wheat disease identification model is then trained on labeled sample data, and wheat disease identification is carried out with the trained model. Because the Vision Transformer model uses a self-attention mechanism, it can understand the wheat image globally, establish dependency relationships between disease type features, and make full use of image context information. In addition, adding the downsampling and upsampling layers to the Vision Transformer model increases the detail information of the wheat image, so that global information in the wheat image is captured better, complex and changeable wheat disease conditions can be handled, and the accuracy of wheat disease identification is improved.

Description

Wheat disease image recognition method, device, medium and equipment

Technical Field

The application relates to the technical field of image recognition, in particular to a method, a device, a medium and equipment for recognizing wheat disease images.

Background

In general, crop diseases are a serious biological disaster in agricultural production. The average annual area of crop disease outbreaks in China is as high as 350 million km², which brings economic losses to China's agricultural production that are difficult to estimate. Diagnosing wheat diseases in time and controlling them accurately can minimize the economic loss they cause. The traditional approach of detecting wheat diseases manually is time-consuming and labor-intensive and requires extensive expertise, so the detection process is highly subjective and cannot meet the current need for rapid and accurate disease diagnosis.

With the continuous development of artificial intelligence technology, rapid identification and diagnosis of crop diseases using computer vision is becoming an important alternative to traditional manual diagnosis. According to the feature extraction method used, image recognition techniques can be divided into two types: traditional machine learning algorithms, which rely on manually extracted features, and image recognition algorithms based on deep learning. Traditional machine learning algorithms include the support vector machine (SVM), K-nearest neighbor (KNN), decision tree (DT), the K-means clustering algorithm, etc.

In the prior art, there is a method that analyzes the degree of mildew in wheat using the KNN algorithm and obtains a good recognition effect. There is also a method that combines a support vector machine (SVM) and a logistic regression (LR) algorithm with spectral imaging technology to detect rice diseases, with a recognition accuracy of 93%. In recent years, deep learning has gradually come to dominate the field of computer vision by virtue of its strong feature expression capability and its freedom from manual feature extraction. Among deep learning methods, the convolutional neural network (CNN) is the most widely used in image recognition and has the best practical effect. Many researchers have used AlexNet, VGGNet, GoogLeNet and other classical CNN models to identify crop diseases and have obtained good recognition results. For example, there are methods that identify 6 tomato diseases with the VGG16 and AlexNet models, whose recognition accuracies reach 97.29% and 97.49% respectively. In recent years, more and more crop disease recognition models based on deep learning have been developed, and their recognition accuracy has continued to improve.

However, a convolutional neural network used for image recognition lacks a global understanding of the image itself, cannot establish dependency relationships between features, and cannot make full use of context information. When facing complex and changeable wheat disease images, for example in the early development stage of a wheat disease when the lesion spots are small, the lesions are not easily captured by the convolutional neural network, so the recognition accuracy is reduced.

Disclosure of Invention

Based on the above, it is necessary to provide a wheat disease image recognition method, device, medium and apparatus for the above technical problems.

The technical scheme adopted in the specification is as follows:

The specification provides a wheat disease image recognition method, which comprises the following steps:

Obtaining sample images of healthy wheat and various types of disease wheat, and labeling the real disease types of the sample images to construct a sample data set;

Adding a downsampling layer and an upsampling layer in an image block embedding layer of the Vision Transformer model to construct a wheat disease identification model;

Inputting each sample image into the constructed wheat disease identification model to identify the type of wheat disease, so as to obtain the predicted disease type of each sample image; wherein local features of the sample image are extracted through the up-sampling layer and the down-sampling layer, the local features are partitioned into blocks, and global features of the disease type are extracted by the multi-head self-attention layer based on the partitioned local features so as to identify the disease;

Training the wheat disease identification model with minimization of the deviation between the predicted disease type and the real disease type of each sample image as the optimization target; and carrying out wheat disease image recognition with the trained wheat disease identification model.

Optionally, the obtaining sample images of healthy wheat and various types of disease wheat specifically includes:

obtaining sample images of healthy wheat, scab wheat, powdery mildew wheat and rust wheat, and adjusting the sizes of the sample images to be the same.

Optionally, the step of inputting each sample image into the constructed wheat disease identification model to identify the type of the wheat disease specifically includes:

Inputting each sample image into a constructed wheat disease identification model, and extracting local features of the sample images through an up-sampling layer and a down-sampling layer to obtain local feature images of each sample image;

For each sample image, cutting the corresponding local feature map into a plurality of image blocks, and labeling category labels and position information of each image block;

Inputting each labeled image block into the Transformer Encoder layers, extracting global disease type features through the multi-head self-attention layers included therein, and identifying the disease type of the wheat in the sample image according to the global disease type features.

Optionally, the constructing the wheat disease identification model specifically includes:

reducing the dimension of the output vector of the first fully connected layer in the MLP Block layer of the Vision Transformer model to construct the wheat disease identification model.

Optionally, the Vision Transformer model is a Vision Transformer model pre-trained based on an ImageNet-21k dataset.

Optionally, the method further comprises:

Expanding the acquired sample images by a data enhancement method; the data enhancement method comprises horizontal flipping, vertical flipping, random-angle rotation and size transformation.
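The four enhancement operations named above can be illustrated with a minimal sketch. It is not the implementation of the specification: the function names are hypothetical, images are plain nested lists, rotation is restricted to quarter turns as a simplification of random-angle rotation, and size transformation is assumed to be nearest-neighbor rescaling.

```python
import random

def hflip(img):
    """Horizontal flip: reverse the pixels in every row."""
    return [row[::-1] for row in img]

def vflip(img):
    """Vertical flip: reverse the order of the rows."""
    return [row[:] for row in img[::-1]]

def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def resize_nearest(img, out_h, out_w):
    """Size transformation by nearest-neighbor sampling (an assumption)."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def augment(img, rng):
    """Return one flipped, one rotated and one rescaled variant of img."""
    h, w = len(img), len(img[0])
    rotated = img
    for _ in range(rng.randrange(1, 4)):  # 1 to 3 quarter turns
        rotated = rot90(rotated)
    scale = rng.choice([0.5, 2.0])        # hypothetical scale factors
    resized = resize_nearest(img, max(1, int(h * scale)), max(1, int(w * scale)))
    return [hflip(img), vflip(img), rotated, resized]

sample = [[1, 2], [3, 4]]
variants = augment(sample, random.Random(0))
print(len(variants))
```

In practice each variant would be resized back to the common input size before training, since the model expects images of identical size.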

The present specification provides a wheat disease image recognition device, comprising:

the acquisition module is used for acquiring sample images of healthy wheat and wheat with various diseases, labeling the real disease types of each sample image and constructing a sample data set;

The construction module is used for adding a downsampling layer and an upsampling layer in an image block embedding layer of the Vision Transformer model to construct a wheat disease identification model;

The identification module is used for inputting each sample image into the constructed wheat disease identification model to identify the type of wheat disease, so as to obtain the predicted disease type of each sample image; wherein local features of the sample image are extracted through the up-sampling layer and the down-sampling layer, the local features are partitioned into blocks, and global features of the disease type are extracted by the multi-head self-attention layer based on the partitioned local features so as to identify the disease;

The training module is used for training the wheat disease identification model with minimization of the deviation between the predicted disease type and the real disease type of each sample image as the optimization target, and for carrying out wheat disease image recognition with the trained wheat disease identification model.

The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the above wheat disease image recognition method.

The present specification provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above wheat disease image recognition method when executing the program.

At least one of the above technical schemes adopted in this specification can achieve the following beneficial effects:

The Vision Transformer model is adopted, and a downsampling layer and an upsampling layer are added to its image block embedding layer to construct the wheat disease identification model. The wheat disease identification model is then trained on labeled sample data, and wheat disease identification is carried out with the trained model.

In this wheat disease identification method, the wheat disease identification model is built on the Vision Transformer model. Because the Vision Transformer model uses a self-attention mechanism, the wheat image can be understood globally after it has been divided into blocks, dependency relationships between disease type features are established, and image context information is fully utilized. Meanwhile, by adding the upsampling and downsampling layers to the Vision Transformer model, local features are extracted, which increases the detail information of the wheat image and improves the expression capability of the wheat disease identification model. As a result, global information in the wheat image is captured better, complex and changeable wheat disease conditions can be handled, and the accuracy of wheat disease identification is improved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 is a schematic flow chart of a wheat disease image recognition method provided in the present specification;

FIGS. 2a-d are schematic views of sample images provided in the present specification;

FIG. 3 is a schematic view of the overall framework of a Vision Transformer model provided in the present specification;

FIG. 4 is a schematic diagram of a ViT-Base model basic structure provided in the present specification;

FIG. 5 is a schematic diagram showing a comparison of Patch Embedding layers before and after modification according to the present disclosure;

FIG. 6 is a schematic diagram of a transfer learning process provided in the present specification;

FIG. 7a is a schematic diagram showing a comparison of model recognition accuracy curves according to different schemes provided in the present specification;

FIG. 7b is a schematic diagram showing a comparison of the loss curves of different models provided in the present specification;

FIG. 8 is a schematic diagram of data enhancement provided herein;

FIG. 9a is a graph showing the comparison of model recognition accuracy curves of training sets before and after data enhancement provided in the present specification;

FIG. 9b is a schematic diagram showing a comparison of model identification accuracy curves of a verification set before and after data enhancement;

FIG. 10 is a graph showing comparison of recognition accuracy curves of different models on a training set provided in the present specification;

FIG. 11a is a confusion matrix of the ViT model on the test set provided in the present specification;

FIG. 11b is a confusion matrix of the AlexNet model on the test set provided in the present specification;

FIG. 11c is a confusion matrix for VGG16 on test set provided herein;

FIG. 12 is a schematic diagram of a wheat disease image recognition device provided in the present specification;

FIG. 13 is a schematic diagram of a computer device for implementing the wheat disease image recognition method provided in the present specification.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art without the exercise of inventive faculty, are intended to be within the scope of the application, based on the embodiments in the specification.

Convolutional neural networks have dominated the field of computer vision for the past 10 years and have defined an era. However, the convolution operation of a convolutional neural network used to identify wheat disease images lacks a global understanding of the image, cannot establish dependency relationships between features, and cannot make full use of context information. Furthermore, the convolution weights are fixed and cannot adapt dynamically to changes in the input.

In the wheat disease recognition task, since the convolution operation cannot make sufficient use of context information, recognition accuracy may be limited when facing complex and variable wheat disease images. For example, in the early stages of disease development the lesions may be small and less likely to be captured by a convolutional neural network, resulting in reduced recognition accuracy. In addition, convolutional neural networks are very sensitive to small changes in the input image, which can cause the recognition performance of the model to fluctuate across different disease stages or different wheat varieties. The fixed convolution weights also make it difficult for the model to adapt to the diversity and variability of wheat diseases, which further affects the recognition effect. Therefore, for the wheat disease identification task, the convolutional neural network model has certain technical problems in terms of recognition accuracy and stability.

Therefore, in the invention, the Transformer model from the field of natural language processing is migrated to the computer vision task to perform wheat disease image recognition. Compared with a convolutional neural network, the self-attention mechanism of the Transformer is not limited to local interactions: it can mine long-distance dependencies, can be computed in parallel, and can learn the most suitable inductive bias for different task objectives.

The invention uses a transfer learning method to identify the four collected categories of wheat images (powdery mildew, scab, rust and healthy) based on a Vision Transformer model pre-trained on ImageNet-21k. The main content comprises: (1) constructing a ViT model to realize wheat disease image identification; (2) exploring the influence of transfer learning and data enhancement on ViT model performance; (3) comparing the ViT model with classical CNN models on the wheat disease identification task.

The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.

Fig. 1 is a schematic flow chart of a wheat disease image recognition method in the specification, which specifically includes the following steps:

S101: obtaining sample images of healthy wheat and various types of diseased wheat, and labeling the real disease type of each sample image to construct a sample data set.

In general, when the server of the service platform performs wheat disease image recognition, it can first construct a sample data set from historically collected wheat sample images. Based on this, in one or more embodiments of the present description, the server may obtain sample images of healthy wheat, scab wheat, powdery mildew wheat and rust wheat, and adjust the sample images to the same size to construct the sample data set.

For example, the server may use the Scrapy crawler framework to collect images of healthy wheat and of wheat infected with scab, powdery mildew and rust from websites such as Baidu, Google and Yahoo. Since most of the crawled wheat disease images are of low quality, contain cluttered information and include many duplicates, the server can display the crawled sample images and carry out image enhancement in response to user operation. The user can manually select 4000 non-duplicate high-definition images with a resolution of 1920 × 1080 pixels or higher and obvious disease features as the original images. The collected original images are randomly cropped into images of 900 × 900, 600 × 600 and 300 × 300 pixels, from which 2000 wheat images are screened for each category, giving 8000 images in total. Finally, the images are uniformly resized to 224 × 224 pixels to construct the Wheat Disease Dataset (WDD), as shown in FIGS. 2a-d, which are schematic diagrams of sample images in the present specification.
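The cropping and resizing step can be sketched as follows. This is only an illustration under stated assumptions: the image is a synthetic nested list standing in for a 1920 × 1080 original, the helper names are hypothetical, and nearest-neighbor sampling is assumed for resizing because the specification does not name an interpolation method.

```python
import random

def random_crop(img, size, rng):
    """Cut a random size x size window out of a 2-D image (list of rows)."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]

def resize_nearest(img, out_h, out_w):
    """Resize by nearest-neighbor sampling (an assumed interpolation)."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

rng = random.Random(0)
# Synthetic stand-in for one 1920 x 1080 original image (values are pixel ids).
original = [[r * 1920 + c for c in range(1920)] for r in range(1080)]

# One crop per crop size named in the text, each resized to 224 x 224.
samples = [resize_nearest(random_crop(original, size, rng), 224, 224)
           for size in (900, 600, 300)]
print([(len(s), len(s[0])) for s in samples])
```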

The server mentioned in the present specification may be a server provided on a service platform, or a device such as a desktop, a notebook, or the like capable of executing the aspects of the present specification. For convenience of explanation, only the server is used as the execution subject.

S102: adding a downsampling layer and an upsampling layer to the image block embedding layer of the Vision Transformer model to construct a wheat disease identification model.

After the construction of the sample data set is completed, the server can construct the wheat disease identification model. Since the convolution operations in convolutional neural networks lack a global understanding of the image itself, cannot establish dependency relationships between features, and cannot make full use of context information, the wheat disease identification model is built on the Vision Transformer model.

The Transformer is a classical natural language processing (NLP) model proposed by Google in 2017. The Transformer model uses the self-attention mechanism instead of the sequential structure of an RNN, so the model can be trained in parallel and has access to global information. Dosovitskiy et al. proposed the ViT (Vision Transformer) model, which first applied the original Transformer model to image classification tasks. To convert an image into sequence data that can be processed by the Transformer structure, ViT introduced the concept of image blocks (patches). First, the two-dimensional image is divided into blocks and each image block is flattened into a one-dimensional vector; then each vector undergoes a linear projection, and position codes are introduced to add the position information of the sequence. In addition, a class token (class) is prepended to the input sequence data to better represent global information. The ViT model is typically pre-trained on a large dataset and then fine-tuned on smaller downstream tasks, as shown in FIG. 3, which is an overall framework schematic of the Vision Transformer model in this specification.

The Vision Transformer model algorithm flow is as follows:

1. Given an image $X \in \mathbb{R}^{3n \times 3n}$, the image is divided into 9 patches $X^i \in \mathbb{R}^{n \times n}$, $i \in \{1, \ldots, 9\}$. The 9 patches are flattened into vectors $x^i \in \mathbb{R}^{n^2}$, $i \in \{1, \ldots, 9\}$.

2. Using a matrix $W \in \mathbb{R}^{l \times n^2}$, each flattened vector $x^i \in \mathbb{R}^{n^2}$, $i \in \{1, \ldots, 9\}$, is linearly transformed to obtain an image coding vector $z^i \in \mathbb{R}^l$, $i \in \{1, \ldots, 9\}$. The specific calculation formula is:

$$z^i = W \cdot x^i, \quad i \in \{1, \ldots, 9\}$$

3. The image coding vectors $z^i$, $i \in \{1, \ldots, 9\}$, and the class coding vector $z^0$ are each added to the corresponding position codes to obtain the input coding vectors:

$$z^i + p^i \in \mathbb{R}^l, \quad i \in \{0, \ldots, 9\}$$

Here $p^i$ is the position code of each image block. It represents the position of the image block in the original image so that context is preserved in the Vision Transformer model. Through the position coding, the model can make better use of the global information of the image, which improves the disease identification accuracy.

4. The input coding vectors are input into the Vision Transformer Encoder to obtain the corresponding outputs $o^i \in \mathbb{R}^l$, $i \in \{0, \ldots, 9\}$;

5. The class coding vector $o^0$ is input into a multi-layer perceptron (MLP), a fully connected neural network, to obtain a class prediction vector $\hat{y} \in \mathbb{R}^c$. The cross entropy loss between $\hat{y}$ and the true class vector $y \in \mathbb{R}^c$ is calculated to obtain a loss value, and the weight parameters of the model are updated using an optimization algorithm. The number of neurons of the MLP layer may be set to 512 and the activation function to ReLU.

Wherein, the image block is to divide the original image into a plurality of image blocks with the same size, and the image blocks are divided into 9 blocks in the example.

Linear projective transformation refers to flattening an image block into a one-dimensional vector, for example, the vector size after flattening can be set to 128. Each image block can be mapped to a 128-dimensional feature vector by linear projective transformation.

The position coding means that in order to introduce position information of a sequence, the position of each image block in an original image is represented using position coding. The position code consists of uniformly distributed floating point numbers ranging from 0 to 1. By adding the position coding to the image coding vector, the position information of the image block in the original image can be preserved.
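Steps 1 to 3 of the flow above (blocking, linear projection and position coding) can be sketched in a few lines. This is a toy illustration, not the patented implementation: it assumes a single-channel image, uses n = 2 and l = 4 instead of realistic sizes, draws the projection matrix and position codes at random, and all function names are hypothetical.

```python
import random

def patchify(image, n):
    """Split a (3n x 3n) image into 9 patches, each flattened to length n*n."""
    patches = []
    for pr in range(3):
        for pc in range(3):
            patches.append([image[pr * n + r][pc * n + c]
                            for r in range(n) for c in range(n)])
    return patches

def matvec(W, x):
    """Linear projection z = W . x for one flattened patch."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def embed(image, n, W, cls_token, pos):
    """Return the 10 input vectors z^i + p^i; index 0 is the class token."""
    tokens = [cls_token] + [matvec(W, x) for x in patchify(image, n)]
    return [[t + p for t, p in zip(tok, pe)] for tok, pe in zip(tokens, pos)]

rng = random.Random(0)
n, l = 2, 4  # toy sizes; the text uses l = 128 as an example
image = [[float(r * 3 * n + c) for c in range(3 * n)] for r in range(3 * n)]
W = [[rng.uniform(-1, 1) for _ in range(n * n)] for _ in range(l)]
cls_token = [0.0] * l
# Position codes drawn uniformly from [0, 1), following the description above.
pos = [[rng.random() for _ in range(l)] for _ in range(10)]

inputs = embed(image, n, W, cls_token, pos)
print(len(inputs), len(inputs[0]))
```

In a trained model the projection matrix, class token and position codes are learned parameters rather than fixed random values.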

Inputting the input coding vectors into the Vision Transformer Encoder means passing them through multiple layers of self-attention calculation to obtain the output coding vectors. For example, the number of Encoder layers may be set to 12.
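The self-attention calculation inside one Encoder layer can be illustrated with a single-head scaled dot-product attention sketch. It is a simplification under stated assumptions: one head instead of multiple heads, no residual connection, layer normalization or MLP, and identity projection matrices chosen only to keep the example small.

```python
import math

def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    """Numerically stable softmax over one list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over token matrix X."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Q[0])
    K_t = [list(col) for col in zip(*K)]  # transpose of K
    scores = matmul(Q, K_t)               # scores[i][j] = q_i . k_j
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V), weights

identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy tokens of dimension 2
output, weights = self_attention(X, identity, identity, identity)
print(len(output), len(output[0]))
```

Every output token is a weighted mix of all input tokens, which is what lets each image block attend to every other block regardless of distance.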

The cross entropy loss is calculated by comparing the category prediction vector with the true category vector. The cross entropy loss is used to measure the gap between model predictions and actual categories, thereby guiding model training.
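As a worked illustration of the loss in step 5, the sketch below computes the cross entropy between a softmax prediction and a one-hot true class vector for c = 4 classes. The logits are made-up numbers used only for demonstration.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_hat, y):
    """loss = -sum_k y_k * log(y_hat_k) for a one-hot vector y."""
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

y_true = [1, 0, 0, 0]                      # one-hot true class vector
uniform = softmax([0.0, 0.0, 0.0, 0.0])    # an uninformed prediction
confident = softmax([4.0, 0.0, 0.0, 0.0])  # hypothetical confident prediction

print(cross_entropy(uniform, y_true))
print(cross_entropy(confident, y_true))
```

The uninformed prediction gives a loss of ln 4, and the loss shrinks as more probability mass is placed on the true class.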

Vision Transformer has three versions: "ViT-Base", "ViT-Large" and "ViT-Huge". These three versions differ in the number of encoder layers, the hidden layer dimension, the number of self-attention heads used by the multi-head attention layer, and the size of the MLP. The differences between the Vision Transformer versions are shown in Table 1.

Table 1 Comparison of different versions of the ViT model

Model	Number of Layers	Hidden Size D	MLP Size	Heads	Number of Parameters
ViT-Base	12	768	3072	12	86M
ViT-Large	24	1024	4096	16	307M
ViT-Huge	32	1280	5120	16	632M

The specific model can be chosen as needed. This specification takes the ViT-Base model, which has a relatively small number of parameters, as an example: the encoder has 12 layers, the hidden layer dimension is 768, and the multi-head self-attention layer uses 12 self-attention heads, as shown in FIG. 4, which is a schematic diagram of the basic structure of the ViT-Base model in this specification. The input to the ViT-Base model is a 224 × 224 pixel RGB image, which is first cut into 196 blocks (patches) of 16 × 16 pixels by a special convolution layer whose convolution kernel size is 16 × 16 and whose stride is 16. A category label (class token) and position information are then added to the image blocks (patches), which are input into the Transformer Encoder layers to learn global features. A residual structure and Dropout layers are adopted to mitigate problems such as vanishing gradients, exploding gradients and network degradation caused by network stacking. Finally, the trained class token is sliced out as the output of the model and input into a Softmax layer to identify the wheat disease image according to the extracted features.
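The figure of 196 patches follows from the standard convolution output-size formula. The short check below reproduces the arithmetic; the helper name is hypothetical and zero padding is assumed.

```python
def conv_output_size(size, kernel, stride):
    """Output length of a convolution without padding."""
    return (size - kernel) // stride + 1

side = conv_output_size(224, 16, 16)  # patches along each image side
num_patches = side * side             # total number of image blocks
patch_dim = 16 * 16 * 3               # values in one flattened RGB patch
seq_len = num_patches + 1             # image blocks plus the class token

print(side, num_patches, patch_dim, seq_len)
```

Each flattened RGB patch holds 768 values, which matches the hidden size of ViT-Base in Table 1.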

Further, in one or more embodiments of the present disclosure, in order to improve the performance of the model without changing the input image size of the ViT model, the Patch Embedding layer may be structurally optimized. That is, a downsampling layer and an upsampling layer are added to the image block embedding layer of the Vision Transformer model, and the sample image is downsampled once and upsampled once by the two sampling layers to extract local features, giving a local feature map of the sample image. The extracted local feature map is then input into the ViT model and cut into a plurality of image blocks, the disease type of each image block and its position information in the sample image are labeled, and the labeled image blocks are input into the Transformer Encoder layers to extract global features of the disease type. The disease type here refers to the type of wheat disease, for example the four categories of wheat images in step S101 (powdery mildew, scab, rust and healthy). The position information refers to the position of each pixel point of the image block in the original image. In practical application, the disease type and the position information are important for model training and prediction: they help the wheat disease identification model better learn the disease type features of the sample images and the relationships between them, which improves the recognition accuracy of the model.

This modification is shown in FIG. 5, which is a schematic diagram comparing the Patch Embedding layer before and after modification in the present specification.
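The idea of one downsampling pass followed by one upsampling pass that leaves the image size unchanged can be sketched as below. The specification does not state which operators the two layers use, so 2 × 2 average pooling and nearest-neighbor upsampling are assumptions made purely for illustration, and the function names are hypothetical.

```python
def downsample2x(img):
    """Halve height and width by averaging each 2 x 2 window (assumed operator)."""
    h, w = len(img), len(img[0])
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
             for c in range(0, w - 1, 2)]
            for r in range(0, h - 1, 2)]

def upsample2x(img):
    """Double height and width by repeating each value (assumed operator)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

image = [[1, 1, 2, 2],
         [1, 1, 2, 2],
         [3, 3, 4, 4],
         [3, 3, 4, 4]]
feature_map = upsample2x(downsample2x(image))
print(len(feature_map), len(feature_map[0]))
```

Because the output has the same height and width as the input, the subsequent 16 × 16 blocking step can be applied unchanged. In a trained model the two layers would typically be learnable (for example a strided convolution and a transposed convolution) rather than fixed averaging and repetition.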

Further, in one or more embodiments of the present disclosure, in order to reduce the number of model parameters while preserving model performance as far as possible, the dimension of the output vector of the first fully connected layer in the MLP Block layer of the Vision Transformer model may be reduced, for example from 3072 to 1536, which reduces the parameters of the modified MLP Block layer by 50%. Reducing the model parameters can further improve the training speed of the model.
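The 50% figure can be checked by counting parameters. The sketch assumes the usual ViT MLP Block layout of two fully connected layers with biases (768 to the hidden size, then back to 768); the function name is hypothetical.

```python
def mlp_block_params(d_model, d_hidden):
    """Weights and biases of two fully connected layers: d_model -> d_hidden -> d_model."""
    first = d_model * d_hidden + d_hidden
    second = d_hidden * d_model + d_model
    return first + second

original = mlp_block_params(768, 3072)
reduced = mlp_block_params(768, 1536)
saving = 1 - reduced / original

print(original, reduced, round(saving * 100, 2))
```

Under these assumptions the count drops from 4,722,432 to 2,361,600 parameters per block, a reduction of roughly 50%; the bias terms are why it is not exactly half.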

S103: inputting each sample image into the constructed wheat disease identification model to identify the wheat disease type, so as to obtain the predicted disease type of each sample image. Local features of the sample image are extracted through the up-sampling layer and the down-sampling layer, the local features are partitioned into blocks, and global disease type features are extracted by the multi-head self-attention layer based on the partitioned local features so as to identify the disease.

S104: training the wheat disease identification model with minimization of the deviation between the predicted disease type and the real disease type of each sample image as the optimization target; and carrying out wheat disease image recognition with the trained wheat disease identification model.

After the wheat disease recognition model is built, training the built wheat disease recognition model based on the sample image so as to perform a wheat disease recognition task through the trained wheat disease recognition model.

Thus, in one or more embodiments of the present disclosure, the server may input the sample images into the wheat disease identification model to obtain the predicted disease type of each sample image, where the predicted disease type may be one of the four categories in step S101 (the three disease types and healthy). The loss is then determined from the deviation between the predicted disease type of each sample image and its pre-labeled real disease type, and the wheat disease identification model is trained with minimization of the loss as the optimization target. Finally, the wheat disease recognition task is performed with the trained wheat disease identification model.

Furthermore, the wheat disease identification model can be trained for multiple rounds. The server can take a preset proportion of the sample images in the sample data set as a test set and test the prediction accuracy of the model after each round of training, so as to determine how the model converges during training.

For example, the server may randomly select 80% of the pictures in the wheat disease sample data set as the training set, 10% as the validation set, and the remaining 10% as the test set for training and testing the model. During model training, the initial learning rate may be set to 0.0001 and the batch size to 16. The number of iterations (epochs) may be set to 200, with the training set randomly shuffled before each iteration. The optimizer may be the stochastic gradient descent (SGD) algorithm.
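The 80/10/10 split and the resulting batch count can be sketched as follows. The file names are placeholders, the seed is arbitrary, and the helper name is hypothetical; only the proportions, the data set size and the batch size come from the text.

```python
import random

def split_dataset(items, train_pct=80, val_pct=10, seed=0):
    """Shuffle once, then cut into training, validation and test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# Placeholder names standing in for the 8000 images of the data set.
dataset = [f"wheat_{i:04d}.jpg" for i in range(8000)]
train_set, val_set, test_set = split_dataset(dataset)

batch_size = 16
batches_per_epoch = len(train_set) // batch_size

print(len(train_set), len(val_set), len(test_set), batches_per_epoch)
```

With these settings each of the 200 epochs would process 400 batches of 16 training images.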

When the prediction accuracy of the wheat disease identification model is evaluated, the model can be evaluated by the average recognition accuracy:

$$\text{Accuracy}_{avg} = \frac{1}{n_s} \sum_{i=1}^{n_s} \frac{n_{ii}}{n_i}$$

where $n_s$ represents the number of sample categories (4 in the example of step S101), $n_i$ represents the number of samples of the $i$-th category, and $n_{ii}$ represents the number of samples of the $i$-th category that are predicted correctly. Among the 200 iterations of training, the server can select the model from the iteration with the highest average recognition accuracy on the validation set and evaluate it by its average recognition accuracy on the test set.
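A minimal sketch of this metric is given below, under the assumption that the average recognition accuracy is the mean over categories of the ratio of correctly predicted samples to total samples of that category. The confusion matrix is made-up data for illustration and is not a result from the specification.

```python
def average_recognition_accuracy(confusion):
    """Mean of n_ii / n_i over all categories.

    confusion[i][j] counts samples of true category i predicted as category j,
    so n_i is the sum of row i and n_ii is its diagonal entry.
    """
    ratios = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return sum(ratios) / len(ratios)

# Hypothetical 4-category confusion matrix (rows: true class, columns: predicted).
confusion = [
    [190, 5, 3, 2],
    [4, 192, 2, 2],
    [6, 4, 188, 2],
    [0, 2, 0, 198],
]
accuracy = average_recognition_accuracy(confusion)
print(round(accuracy, 4))
```

The same function can be applied to validation-set predictions after every iteration to pick the checkpoint that is later scored on the test set.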

Based on the wheat disease image recognition method shown in fig. 1, a Vision Transformer model is adopted, a downsampling layer and an upsampling layer are added in an image block embedding layer of the model to construct and obtain a wheat disease recognition model, finally, the wheat disease recognition model is trained through marked sample data, and wheat disease recognition is carried out through the trained wheat disease recognition model.

According to the invention, the wheat disease recognition model is constructed based on the Vision Transformer model. Because the Vision Transformer model uses a self-attention mechanism, it can understand the wheat image globally, establish the dependency relationships between disease type features, and make full use of context information. Furthermore, the weights of the ViT model are learnable and can adapt dynamically to changes in the input. The ViT model therefore has higher recognition accuracy and stability when processing wheat disease images.

Meanwhile, a downsampling layer and an upsampling layer are added to the Vision Transformer model. The downsampling layer can reduce the dimension and complexity of the image on the basis of maintaining the original resolution of the image, which reduces the amount of computation and accelerates model training. The upsampling layer can increase the detail information of the image while restoring its resolution, which improves the expressive capability of the wheat disease recognition model. With the downsampling and upsampling layers, the Vision Transformer model becomes more sensitive to disease type features at different scales when processing wheat disease images, and its computational complexity is reduced while the main features of the image are retained. The model can thus better adapt to complex and changeable wheat disease conditions, its generalization capability is improved, and the accuracy and stability of wheat disease recognition are improved when the model processes wheat at different disease stages or of different varieties.

In applying the wheat disease image recognition method provided in the present specification, the steps may be performed in a sequence other than that shown in fig. 1, and the specific execution sequence of the steps may be determined according to need, which is not limited in the present specification.

In addition, in one or more embodiments of the present disclosure, in order to further improve the recognition accuracy of the wheat disease recognition model, the server may pre-train the Vision Transformer model before training the wheat disease recognition model on the sample images. In one or more embodiments of the present description, the Vision Transformer model in step S102 may be a Vision Transformer model pre-trained on the ImageNet-21k dataset. The pre-trained model parameters are then transferred to the constructed wheat disease recognition model through transfer learning. Fig. 6 is a schematic diagram of the transfer learning process in the present specification.

Specifically, in one or more embodiments of the present disclosure, the server may perform feature learning on the wheat disease dataset WDD using two transfer learning modes: training all parameters of the Vision Transformer model, and training only the parameters of the model classifier (the MLP Head layer). The influence of transfer learning on the performance of the wheat disease recognition model is discussed below. The recognition accuracy and loss curves of the ViT model on the training set under the different transfer learning modes are shown in table 2, fig. 7a and fig. 7b. Table 2 shows the model training results under the different transfer learning modes. Fig. 7a compares the recognition accuracy curves of the models under the different schemes, and fig. 7b compares their loss curves.

Table 2. Model training results under different transfer learning modes

As can be seen from table 2, fig. 7a and fig. 7b, under the same experimental conditions, the recognition accuracy of the ViT model on the training set, the validation set and the test set is clearly improved whichever transfer learning mode is used to train the model, compared with the case where no transfer learning is used. With the transfer learning mode that trains all parameters of the Vision Transformer model, the highest recognition accuracy of the model on the test set is 93.28%. With the transfer learning mode that trains only the classifier parameters of the Vision Transformer model, the highest recognition accuracy of the model on the test set is 91.26%. When the model is trained without transfer learning, the highest recognition accuracy of the Vision Transformer model on the test set is only 78.63%. From the recognition accuracy and loss curves on the training set under the different transfer learning modes in fig. 7a and fig. 7b, it can be seen that the model of scheme A1 converges fastest.

From the above, in the wheat disease recognition task based on the Vision Transformer model, transfer learning can remarkably improve the recognition accuracy of the model, allow the model to converge quickly, and improve the efficiency of model training. Preferably, the server may adopt the transfer learning mode that trains all parameters of the Vision Transformer model.
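The two transfer learning modes discussed above differ only in which pre-trained parameters are updated during training. A framework-independent sketch of that selection step is given below. The parameter names and the `mlp_head.` prefix are hypothetical placeholders, not names used by any particular library or by the present specification.

```python
def trainable_parameters(param_names, mode, head_prefix="mlp_head."):
    """Return the names of the parameters updated under a transfer learning mode.

    mode "full":      all pre-trained parameters are fine-tuned.
    mode "head_only": only classifier parameters (names starting with
                      head_prefix) are trained; the rest stay frozen.
    """
    if mode == "full":
        return list(param_names)
    if mode == "head_only":
        return [name for name in param_names if name.startswith(head_prefix)]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical parameter names for a ViT-style model.
PARAMS = [
    "patch_embed.weight",
    "encoder.0.attn.weight",
    "encoder.0.mlp.weight",
    "mlp_head.weight",
    "mlp_head.bias",
]
```

Training fewer parameters in the head-only mode is consistent with the shorter training time reported for that mode in the present specification, while the full mode corresponds to the preferred scheme.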

Further, in one or more embodiments of the present disclosure, the server may expand the acquired sample images by a data enhancement method before training the wheat disease recognition model on the sample images. The data enhancement methods include horizontal flipping, vertical flipping, random-angle rotation and size transformation. Fig. 8 is a schematic diagram of data enhancement in the present specification.

The accuracy of wheat disease recognition in the field is affected by various factors such as the background environment and the shooting angle. Therefore, the invention adopts data enhancement to expand the sample images. For example, the present embodiment may expand the original data set using 3 data enhancement modes, namely horizontal flipping, random-angle rotation and contrast enhancement. The expanded wheat disease dataset is designated LWDD (Large Wheat Disease Dataset) and contains 32,000 wheat disease images.
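The three enhancement modes named in this example may be sketched on a grayscale image stored as a nested list of pixel values. This is a simplified illustration under stated assumptions: rotation is restricted to quarter turns as a stand-in for random-angle rotation, the contrast factor is arbitrary, and all function names are hypothetical.

```python
import random


def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def rotate90(image, k=1):
    """Rotate clockwise by k quarter turns (simplified rotation)."""
    result = [list(row) for row in image]
    for _ in range(k % 4):
        result = [list(row) for row in zip(*result[::-1])]
    return result


def enhance_contrast(image, factor=1.5):
    """Stretch pixel values about the image mean and clamp them to 0..255."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [
        [min(255, max(0, round(mean + factor * (p - mean)))) for p in row]
        for row in image
    ]


def augment(image, rng):
    """Return the original image plus one copy per enhancement mode."""
    return [
        [list(row) for row in image],
        horizontal_flip(image),
        rotate90(image, rng.randrange(1, 4)),
        enhance_contrast(image),
    ]
```

Each input image yields four images in this sketch; the actual expansion ratio used to build LWDD is not stated here and is not implied by the code.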

As described above, the transfer learning mode that trains all parameters of the Vision Transformer model gives the better training effect on the wheat disease recognition task. Therefore, this embodiment also uses this transfer learning mode to train the ViT model on the WDD and LWDD datasets respectively, and the experimental results are shown in table 3, fig. 9a and fig. 9b. Table 3 shows the model training results before and after data enhancement, fig. 9a compares the recognition accuracy curves of the model on the training set before and after data enhancement, and fig. 9b compares the recognition accuracy curves of the model on the validation set before and after data enhancement.

Table 3. Model training results before and after data enhancement

As can be seen from table 3, fig. 9a and fig. 9b, under the same experimental conditions, comparing the recognition accuracy of the ViT model on the training set, the validation set and the test set before and after data enhancement shows that, after the data set is expanded, the recognition accuracy of the model on the training set, the validation set and the test set is improved by 1.12%, 2.19% and 3.67%, respectively. Meanwhile, as can be seen from fig. 9a and fig. 9b, when the wheat disease recognition model is trained on the data set with the larger sample size, it converges faster and the fluctuation of its recognition accuracy on the validation set is smaller, which indicates that the model is more stable and generalizes better when trained on a large data set.

The results show that by increasing the sample size of the sample data set, the performance of the ViT model can be effectively improved, the problem that the model is poor in performance on the test set due to factors such as complex field environment and shooting angle is solved, and the generalization capability of the model is improved.

In addition, the present specification also provides the training results, on the data-enhanced wheat disease dataset LWDD, of the model of the invention and of two classical CNN models, AlexNet and VGG16, in order to explore which deep learning algorithm is suitable for wheat disease recognition. In the model training process, all 3 models are trained with the transfer learning mode that trains all parameters of the model, and the recognition accuracy curves of the models on the training set are shown in fig. 10. Fig. 10 compares the recognition accuracy curves of the different models on the training set in the present specification.

This embodiment also provides a comparison of the training results of the different models, shown in table 4, and a comparison of the recognition accuracy for the different wheat diseases, shown in table 5. As can be seen from fig. 10, table 4 and table 5, under the same experimental conditions the ViT model performs best on the training set: its recognition accuracy on the training set is significantly higher than that of AlexNet and VGG16, and it converges fastest. As shown by the data in table 4, the ViT model has an average recognition accuracy of 96.81% on the test set, which is 6.68% and 4.94% higher than that of AlexNet and VGG16, respectively. Meanwhile, as can be seen from table 5, the ViT model achieves relatively high recognition accuracy for the 3 wheat diseases, with the highest recognition accuracy for wheat scab.

Table 4. Comparison of training results for different models

Table 5. Comparison of recognition accuracy for different wheat diseases

The present embodiment also provides confusion matrices of the different models on the test set, as shown in figs. 11a to 11c, where fig. 11a is the confusion matrix of the ViT model on the test set, fig. 11b is the confusion matrix of the AlexNet model on the test set, and fig. 11c is the confusion matrix of the VGG16 model on the test set. The X and Y axes of each confusion matrix correspond to the 4 wheat image class labels: healthy, powdery mildew, rust and scab. The X axis represents the real label of the image and the Y axis represents the label predicted by the network. For the cells in which the predicted label matches the real label, a darker color indicates that the network recognizes the corresponding label better. The confusion matrix makes the error between the model prediction and the actual result evident. The results show that the ViT model has the better recognition effect in the wheat disease recognition task. Analysis of the above performance indexes shows that, in the wheat disease recognition task, the Vision Transformer model has clear advantages over the traditional convolutional neural network models in recognition accuracy and convergence speed.
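A confusion matrix with the axis convention described above (real label on the X axis, predicted label on the Y axis) may be built as sketched below. The function names and the sample labels used in any call are hypothetical; the 4 class names follow the text.

```python
CLASSES = ["healthy", "powdery mildew", "rust", "scab"]


def confusion_matrix(true_labels, predicted_labels, classes=CLASSES):
    """Count predictions; row = predicted label (Y), column = real label (X)."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[p]][index[t]] += 1
    return matrix


def per_class_accuracy(matrix):
    """Diagonal count divided by the column total (real samples of the class)."""
    n = len(matrix)
    result = []
    for j in range(n):
        column_total = sum(matrix[i][j] for i in range(n))
        result.append(matrix[j][j] / column_total if column_total else 0.0)
    return result
```

The diagonal cells hold the correctly recognized samples of each class, so a matrix concentrated on the diagonal corresponds to the darker diagonal described for the figures.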

The invention constructs a wheat disease recognition data set by collecting pictures of 3 wheat diseases, and provides a wheat disease recognition method based on the Vision Transformer model. One or more of the technical schemes in the present specification can achieve the following effects:

1. Transferring the model weights trained on ImageNet-21k to the Vision Transformer model can remarkably improve the recognition accuracy of the constructed wheat disease recognition model. Different transfer learning modes may be used in the training process to improve the performance of the Vision Transformer model; preferably, the transfer learning mode that trains all parameters of the Vision Transformer model is adopted. When the Vision Transformer model is trained with the transfer learning mode that trains only the classifier parameters, the recognition accuracy is slightly lower than with the mode that trains all parameters, but the training time is greatly shortened.

2. In addition, the sample data set can be expanded through data enhancement. The Vision Transformer model trains better on the enhanced data set and generalizes better, and in the wheat disease image recognition task its recognition accuracy can be effectively improved by enlarging the sample size of the data set. Considering that factors such as the complex field environment and the shooting angle may affect the recognition accuracy of the Vision Transformer model, the invention expands the original data set by data enhancement modes such as horizontal flipping and random-angle rotation. The Vision Transformer model achieves a very good training effect on the expanded data set, and its recognition accuracy on the test set is improved from the original 93.28% to 96.81%.

3. In the wheat disease recognition task, the wheat disease image recognition algorithm based on Vision Transformer and transfer learning adopted in the method has higher recognition accuracy than the traditional convolutional neural network models, and can diagnose the wheat disease type accurately and rapidly in practical application.

Corresponding to the wheat disease image recognition method provided above in one or more embodiments of the present disclosure, and based on the same concept, the present disclosure further provides a wheat disease image recognition device, as shown in fig. 12.

Fig. 12 is a schematic diagram of a wheat disease image recognition device provided in the present specification, including:

The acquisition module 201 is used for acquiring sample images of healthy wheat and wheat with various diseases, labeling the real disease types of each sample image and constructing a sample data set;

The construction module 202 is used for adding a downsampling layer and an upsampling layer in an image block embedding layer of the Vision Transformer model to construct a wheat disease identification model;

The recognition module 203 is configured to input each sample image into the constructed wheat disease recognition model to recognize the wheat disease type, so as to obtain a predicted disease type for each sample image; wherein local features of the sample image are extracted through the up-sampling layer and the down-sampling layer, the local features are divided into blocks, and global features of the disease type are extracted from the divided local features through a multi-head self-attention layer so as to identify the disease;

The training module 204 is configured to train the wheat disease recognition model with minimizing, for each sample image, the deviation between the predicted disease type and the real disease type as the optimization target, and to perform wheat disease image recognition with the trained wheat disease recognition model.

Optionally, the acquisition module 201 acquires sample images of healthy wheat, wheat with scab, wheat with powdery mildew and wheat with rust, and adjusts the sample images to the same size.

Optionally, the recognition module 203 inputs each sample image into the constructed wheat disease recognition model and extracts local features of the sample image through the up-sampling layer and the down-sampling layer to obtain a local feature map of each sample image; for each sample image, cuts the corresponding local feature map into a plurality of image blocks and marks the category label and position information of each image block; inputs the marked image blocks into the Transformer Encoder layer and extracts global features of the disease type through the multi-head self-attention layer included therein; and identifies the disease type of the wheat in the sample image according to the global features of the disease type.

Optionally, the construction module 202 reduces the dimension of the output vector of the first fully connected layer in the MLP Block layer of the Vision Transformer model to construct the wheat disease recognition model.

Optionally, the construction module 202 expands the acquired sample images by a data enhancement method; the data enhancement method comprises horizontal flipping, vertical flipping, random-angle rotation and size transformation.
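The step in which the recognition module cuts a local feature map into image blocks and records position information may be sketched as follows. The sketch assumes a single-channel feature map stored as a nested list and non-overlapping square blocks numbered in row-major order; the function name and the dictionary keys are hypothetical, and the category label marking is omitted.

```python
def split_into_patches(feature_map, patch_size):
    """Cut a 2D feature map into non-overlapping square blocks.

    Each block is returned with a position index assigned in row-major
    order (left to right, then top to bottom).
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % patch_size or width % patch_size:
        raise ValueError("feature map size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            block = [
                row[left:left + patch_size]
                for row in feature_map[top:top + patch_size]
            ]
            patches.append({"position": len(patches), "block": block})
    return patches
```

In a full model, each block would then be flattened and embedded before entering the Transformer Encoder layer; that part is outside the scope of this sketch.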

For the specific limitations of the wheat disease image recognition device, reference may be made to the limitations of the wheat disease image recognition method above, which are not repeated here. Each of the modules in the wheat disease image recognition device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

The present specification also provides a computer-readable storage medium storing a computer program operable to perform the wheat disease image recognition method provided in fig. 1 described above.

The present specification also provides a schematic structural diagram of a computer device, shown in fig. 13. As shown in fig. 13, the computer device includes a processor, an internal bus, a network interface, a memory and a nonvolatile memory, and may also include hardware required by other services. The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs it to implement the wheat disease image recognition method provided in fig. 1.

Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), etc.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of the technical features involves no contradiction, it should be considered to fall within the scope of this description.

Claims

1. A wheat disease image recognition method, comprising:

2. The method for identifying wheat disease image according to claim 1, wherein the obtaining of sample images of healthy wheat and various types of disease wheat comprises:

3. The method for identifying wheat disease image according to claim 1, wherein the step of inputting each sample image into the constructed wheat disease identification model to identify the type of wheat disease comprises the steps of:

Cutting the corresponding local feature map into a plurality of image blocks aiming at each sample image, and marking the disease type of each image block and the position information of each image block in the sample image;

4. The wheat disease image recognition method of claim 1, wherein the constructing a wheat disease recognition model specifically comprises:

5. The wheat disease image recognition method of claim 1, wherein the Vision Transformer model is a Vision Transformer model pre-trained based on an ImageNet-21k dataset.

6. The wheat disease image recognition method of claim 2, wherein the method further comprises:

7. A wheat disease image recognition device, characterized by comprising:

8. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-6.

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the preceding claims 1-6 when executing the program.