CN114511733A - Fine-grained image identification method and device based on weak supervised learning and readable medium


Info

Publication number
CN114511733A
Authority
CN
China
Prior art keywords
layer
fine
domain
feature fusion
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210004720.XA
Other languages
Chinese (zh)
Inventor
余洪山
赖明明
赵科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Industrial Design And Machine Intelligence Innovation Quanzhou Hunan University
Original Assignee
Institute Of Industrial Design And Machine Intelligence Innovation Quanzhou Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Industrial Design And Machine Intelligence Innovation Quanzhou Hunan University filed Critical Institute Of Industrial Design And Machine Intelligence Innovation Quanzhou Hunan University
Priority to CN202210004720.XA priority Critical patent/CN114511733A/en
Publication of CN114511733A publication Critical patent/CN114511733A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a fine-grained image recognition method, device and readable medium based on weakly supervised learning. A VGG_Resception model based on an attention mechanism is constructed and trained through two-step transfer learning. The model comprises a pre-trained VGG16 model, a Resception part and an attention mechanism part. The Resception part comprises a fourth batch normalization layer, a plurality of Resception modules, a third convolution layer and a third batch normalization layer, where each Resception module comprises a first convolution layer, a first batch normalization layer, an Inception-A unit, a second convolution layer and a second batch normalization layer connected by a residual connection. The attention mechanism part comprises an attention mechanism module, a global average pooling layer, a fully connected layer and a softmax layer. The two-step transfer learning process comprises transfer learning between a source domain and a transition domain and transfer learning between the transition domain and a target domain. A fine-grained image of plant leaf disease severity is acquired and input into the trained VGG_Resception model based on an attention mechanism, and a classification result is output. Stability and accuracy can thereby be improved.

Description

Fine-grained image identification method and device based on weak supervised learning and readable medium
Technical Field
The invention relates to the field of deep learning and computer vision, in particular to a fine-grained image identification method and device based on weak supervised learning and a readable medium.
Background
Classification of fine-grained images aims at distinguishing subordinate classes with subtle visual differences, and is more challenging than traditional coarse-grained image classification. On one hand, the feature differences between images are smaller, so the discriminative features are more subtle. On the other hand, training data sets are limited and images contain many uncertain factors, such as illumination differences and background interference. The key to fine-grained image classification is to obtain the most salient local difference features. According to how much supervision information the neural network requires from the training data, fine-grained image classification algorithms fall mainly into two categories: weakly supervised and strongly supervised. Strongly supervised fine-grained classification methods require, in addition to the class label of an image, further manual annotation information (such as object bounding boxes and part keypoints) to perform classification. The goal of weakly supervised learning is to accomplish the fine-grained classification task relying solely on class labels. At present, fine-grained image classification mostly adopts the strongly supervised approach: besides class labels, an additional finely annotated data set is required, which consumes a large amount of manpower and material resources and severely restricts the application of such algorithms in real scenarios. Therefore, designing a weakly supervised fine-grained image recognition algorithm that needs no additional manual annotation has considerable research significance.
In the past, accurate fine-grained classification required the expertise of a domain expert, which significantly increased costs. For example, in agricultural pest and disease identification, the characteristics of different disease spots are very similar, and the differences between leaves diseased to different degrees are difficult for non-professionals to distinguish with the naked eye; with a deep learning model and weakly supervised learning, classification accuracy is improved by focusing on the detection of fine features in local key regions. In the industrial field, for example in defect detection of precision components, identification is often performed with a microscope by professionals; by adopting weakly supervised learning, the detection of fine defects can be completed without a complex annotated data set.
Disclosure of Invention
To address the technical problems mentioned in the background above, an embodiment of the present application aims to provide a fine-grained image recognition method, apparatus and readable medium based on weakly supervised learning.
In a first aspect, an embodiment of the present application provides a fine-grained image recognition method based on weak supervised learning, including the following steps:
constructing a VGG_Resception model based on an attention mechanism and carrying out two-step transfer learning training to obtain a trained VGG_Resception model based on an attention mechanism, wherein the VGG_Resception model based on an attention mechanism comprises a pre-trained VGG16 model, a Resception part and an attention mechanism part; the Resception part comprises a fourth batch normalization layer, a plurality of Resception modules, a third convolution layer and a third batch normalization layer; each Resception module comprises a first convolution layer, a first batch normalization layer, an Inception-A unit, a second convolution layer and a second batch normalization layer connected by a residual connection; the attention mechanism part comprises an attention mechanism module, a global average pooling layer, a fully connected layer and a softmax layer; and the two-step transfer learning process comprises transfer learning between a source domain and a transition domain and transfer learning between the transition domain and a target domain, wherein the transition domain is a coarse-grained image data set;
and acquiring a fine-grained image of plant leaf disease severity, inputting it into the trained VGG_Resception model based on an attention mechanism, and outputting a classification result.
In some embodiments, the fine-grained image of plant leaf disease severity undergoes multiple rounds of feature extraction and feature fusion by the pre-trained VGG16 model and the Resception part to obtain feature fusion data; the feature fusion data is then input into the attention mechanism part to extract and classify fine-grained features.
In some embodiments, the attention mechanism module is a SENet network into which a residual-like structure is introduced; it comprises a global average pooling layer, two fully connected layers and a sigmoid layer connected in sequence. The feature fusion data is input into the SENet network to obtain the global features of all channels of the feature map; the global features are then excited, the relationships among channels are learned by obtaining the weights of different channels, and finally the weights are multiplied by the original feature map to obtain the fine-grained features.
In some embodiments, the Resception part includes a first Resception module, a second Resception module and a third Resception module. The output of the first batch normalization layer is input into the first Resception module and fused with the output of the first Resception module to obtain first feature fusion data; the first feature fusion data is input into the second Resception module and fused with the output of the second Resception module to obtain second feature fusion data; the second feature fusion data is input into the third Resception module and fused with the output of the third Resception module to obtain third feature fusion data; the third feature fusion data passes through the second convolution layer and the second batch normalization layer to obtain fourth feature fusion data; and the output of the first batch normalization layer is fused with the fourth feature fusion data to obtain fifth feature fusion data.
In some embodiments, the Inception-A unit comprises a bottleneck-structure network composed of a plurality of convolution layers with kernel size 1 × 1, convolution layers with kernel size 3 × 3, and a mean pooling layer with a 1 × 1 mean pooling kernel.
In some embodiments, the transfer learning between the source domain and the transition domain in the two-step transfer learning training process specifically includes:
pre-training the VGG16 model on a source domain to realize parameter transfer of the convolutional layers, and obtaining a pre-trained VGG16 model;
fixing the weights and parameters of the pre-trained VGG16 model, using the pre-trained VGG16 model as a feature extractor, and initializing the network parameters of the VGG_Resception model based on an attention mechanism through the transition domain, to obtain an initialized VGG_Resception model based on an attention mechanism.
In some embodiments, the transfer learning between the transition domain and the target domain in the two-step transfer learning training process specifically includes:
fine-tuning the initialized VGG_Resception model based on an attention mechanism on the transition domain and the target domain, to realize feature transfer between the transition domain and the target domain and obtain the trained VGG_Resception model based on an attention mechanism.
In a second aspect, an embodiment of the present application provides a fine-grained image recognition apparatus based on weakly supervised learning, including:
a model building and training module configured to build a VGG_Resception model based on an attention mechanism and perform two-step transfer learning training to obtain a trained VGG_Resception model based on an attention mechanism, wherein the VGG_Resception model based on an attention mechanism comprises a pre-trained VGG16 model, a Resception part and an attention mechanism part; the Resception part comprises a fourth batch normalization layer, a plurality of Resception modules, a third convolution layer and a third batch normalization layer; each Resception module comprises a first convolution layer, a first batch normalization layer, an Inception-A unit, a second convolution layer and a second batch normalization layer connected by a residual connection; the attention mechanism part comprises an attention mechanism module, a global average pooling layer, a fully connected layer and a softmax layer; and the two-step transfer learning training process comprises transfer learning between a source domain and a transition domain and transfer learning between the transition domain and a target domain, wherein the transition domain is a coarse-grained image data set;
and a result output module configured to acquire a fine-grained image of plant leaf disease severity, input it into the trained VGG_Resception model based on an attention mechanism, and output a classification result.
In a third aspect, an embodiment of the present application provides an electronic device comprising: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method described in any implementation of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the method as described in any of the implementations of the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention can realize classification of fine-grained images using only coarse-grained labels.
(2) Compared with the traditional transfer learning approach, the method greatly improves both recognition accuracy and the stability of model operation.
(3) The two-step transfer learning training method reduces overfitting and negative transfer of the model during training, and the localization of the visualized regions based on the SENet attention mechanism is more accurate.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is an exemplary device architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flowchart of a fine-grained image recognition method based on weak supervised learning according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the VGG_Resception model based on an attention mechanism of the fine-grained image recognition method based on weak supervised learning according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a pre-training process of a VGG16 model of a fine-grained image recognition method based on weak supervised learning according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the Inception-A module of the fine-grained image identification method based on weak supervised learning according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a bottleneck structure of a fine-grained image recognition method based on weak supervised learning according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the Resception module of the fine-grained image recognition method based on weak supervised learning according to an embodiment of the present invention;
FIG. 8 is a structure diagram of the SENet network with fused features as input in the fine-grained image recognition method based on weak supervised learning according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a two-step transfer learning training of a fine-grained image recognition method based on weakly supervised learning according to an embodiment of the present invention;
fig. 10 is a schematic diagram of a fine-grained image recognition apparatus based on weak supervised learning according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of a computer device suitable for implementing an electronic apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 illustrates an exemplary device architecture 100 to which the fine-grained image recognition method based on weak supervised learning or the fine-grained image recognition device based on weak supervised learning according to the embodiment of the present application may be applied.
As shown in fig. 1, the apparatus architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various applications, such as data processing type applications, file processing type applications, etc., may be installed on the terminal apparatuses 101, 102, 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When they are software, they may be installed in the electronic devices listed above and implemented as multiple pieces of software or software modules (for example, software or software modules used to provide distributed services) or as a single piece of software or software module; this is not specifically limited here.
The server 105 may be a server that provides various services, such as a background data processing server that processes files or data uploaded by the terminal devices 101, 102, 103. The background data processing server can process the acquired files or data to generate a processing result.
It should be noted that the fine-grained image recognition method based on weak supervised learning provided in the embodiment of the present application may be executed by the server 105, or may also be executed by the terminal devices 101, 102, and 103, and accordingly, the fine-grained image recognition apparatus based on weak supervised learning may be disposed in the server 105, or may also be disposed in the terminal devices 101, 102, and 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case where the processed data does not need to be acquired from a remote location, the above device architecture may not include a network, but only a server or a terminal device.
Fig. 2 illustrates a fine-grained image recognition method based on weak supervised learning, provided by an embodiment of the present application, and includes the following steps:
s1, constructing a VGG _ Reserve model based on an attention mechanism and carrying out two-step migration learning training to obtain a trained VGG _ Reserve model based on the attention mechanism, wherein the VGG _ Reserve model based on the attention mechanism comprises a pre-trained VGG16 model, a Reserve part and an attention mechanism part, the Reserve part comprises a fourth batch normalization layer, a plurality of Reserve modules, a third convolution layer and a third batch normalization layer, the Reserve module comprises a first convolution layer and a first batch normalization layer which are connected based on residual errors, the device comprises an expression-A unit, a second convolution layer and a second batch normalization layer, wherein the expression part comprises an expression module, a global average pooling layer, a full connection layer and a softmax layer, and the two-step migration learning training process comprises migration learning between a source domain and a transition domain and migration learning between the transition domain and a target domain, wherein the transition domain is a coarse-grained image data set.
In a specific embodiment, in order for the network model to capture the valuable information in a picture, the invention provides a VGG_Resception model based on an attention mechanism, composed of three parts. The first part is a pre-trained VGG16 model, the low-level feature extractor. The second part is the Resception part, which draws on the structural strengths of the residual network (ResNet) and the Inception module to extract high-dimensional features and obtain multi-scale feature fusion data. The third part is the attention mechanism part, which employs SENet, a typical channel-level attention mechanism. Together these three parts form a network structure with a stronger ability to capture fine-grained features.
Specifically, the construction of the VGG_Resception model based on an attention mechanism is set out below; its structure is shown in Fig. 3. First, a pre-trained VGG16 model is obtained using a dynamic fine-tuning method. As shown in Fig. 4, the dynamic fine-tuning of the pre-trained VGG16 model comprises: freezing the convolutional layers of the VGG16 model, adjusting the fully connected layer structure and fine-tuning its parameters; and, when the loss is stable and no longer falls, releasing the parameters of the high-level convolutional layers and fine-tuning the high-level convolutional layers together with the fully connected layers. During pre-training, the convolutional neural network alternates forward propagation with backward parameter adjustment; when the cost function of the whole training process reaches its minimum, the randomly initialized network parameters have been adjusted and the pre-trained neural network model is obtained. For the pre-training of the VGG16 network model, suppose the source domain contains $m$ training samples and a single input sample is $(x^{(i)}, y^{(i)})$, where $x^{(i)}$ is an $n$-dimensional input vector and $y^{(i)}$ is the label of the sample. Let $l$ denote the $l$-th layer of VGG16; the input feature vector of that layer is $x^{(l-1)}$, its output feature vector is $x^{(l)}$, and its weights and bias are $w^{(l)}$ and $b^{(l)}$. The forward propagation of the VGG16 convolutional network can be expressed as

$$x^{(l)} = f\left(w^{(l)} x^{(l-1)} + b^{(l)}\right)$$

where $f(\cdot)$ is the activation function, the ReLU function in the case of VGG16.
The overall cost function of the network model is

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left\| h_{w,b}\left(x^{(i)}\right) - y^{(i)} \right\|^{2} + \frac{\lambda}{2}\sum_{l=1}^{n_{l}-1}\sum_{i=1}^{s_{l}}\sum_{j=1}^{s_{l+1}}\left(w_{ji}^{(l)}\right)^{2}$$

where $h_{w,b}(\cdot)$ denotes the output of the network in forward propagation, $n_{l}$ is the total number of layers of the network, and $s_{l}$ is the number of nodes in the $l$-th layer.
Batch gradient descent is used to adjust the parameters toward the minimum of the overall cost function, with the parameter updates

$$w^{(l)} := w^{(l)} - \alpha \frac{\partial}{\partial w^{(l)}} J(w, b)$$

$$b^{(l)} := b^{(l)} - \alpha \frac{\partial}{\partial b^{(l)}} J(w, b)$$
and alpha represents the learning rate, and when the value of the cost function is minimum, the pre-training process of the source domain is completed through multiple iterations and continuous updating. In the application, the ImageNet data set is used as a source domain, the data set is rich, and various characteristics of the picture can be better extracted through parameters of the data set through continuous iterative training. Therefore, the method and the device migrate the pre-trained model parameters to the target domain to realize the migration of the model.
A fourth batch normalization layer is added after the pre-trained VGG16 model, followed by 3 Resception modules. Because the Resception modules combine the residual structure of ResNet with the parallel structure of Inception, they can extract the features of plant leaf lesions more finely; after multiple rounds of feature extraction and feature fusion, the extracted features are input into the SENet network structure. In the embodiment of the present application, so that the SENet network is compatible with the overall structure of the proposed VGG_Resception model based on an attention mechanism, the input part of SENet no longer uses a pre-trained Inception structure; instead, the feature fusion data serves as the data input of SENet. SENet can strengthen useful features according to their importance while suppressing less useful ones, thereby realizing a re-weighting of the fused features. Global average pooling is then added after SENet. Like a fully connected layer, global average pooling extracts global information, but it greatly reduces the number of parameters and the amount of computation, has better interpretability, and greatly helps the later addition of class activation maps.
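As a reading aid, the following sketch shows one plausible assembly of the three parts just described. `ResceptionPart` and `SEBlock` stand for the components detailed (and sketched) later in this section; the wiring, the 512-channel width and the 3-class softmax head are assumptions drawn from the description, not a published implementation.

```python
import torch.nn as nn
from torchvision import models

class VGGResception(nn.Module):
    def __init__(self, resception_part: nn.Module, se_block: nn.Module,
                 num_classes: int = 3):
        super().__init__()
        # Part 1: pre-trained VGG16 convolutional backbone.
        self.backbone = models.vgg16(
            weights=models.VGG16_Weights.IMAGENET1K_V1).features
        # Part 2: fourth BN + Resception modules + conv/BN (sketched later).
        self.resception = resception_part
        # Part 3: SENet channel attention (sketched later).
        self.attention = se_block
        self.gap = nn.AdaptiveAvgPool2d(1)           # global average pooling
        self.head = nn.Sequential(nn.Flatten(),
                                  nn.Linear(512, num_classes),
                                  nn.Softmax(dim=1))  # softmax layer

    def forward(self, x):
        x = self.backbone(x)     # low-level feature extraction
        x = self.resception(x)   # multi-scale feature extraction and fusion
        x = self.attention(x)    # channel-attention re-weighting
        return self.head(self.gap(x))
```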
In a specific embodiment, the Resception part includes 3 Resception modules, whose basic components are described below. Inception-v4 is a typical multi-scale convolution-kernel neural network composed of a large number of multi-scale convolution-kernel modules; it is the GoogLeNet convolutional network known to perform better after extensive network parameter tuning. The present application therefore borrows its Inception-A module, built from multi-scale convolution kernels, as part of the Resception module. As shown in Fig. 5, the Inception-A unit comprises a bottleneck-structure network composed of a plurality of 1 × 1 convolution layers, 3 × 3 convolution layers and a mean pooling layer with a 1 × 1 kernel; it mainly contains the two kernel sizes 1 × 1 and 3 × 3, with the 1 × 1 kernels used to construct a bottleneck structure that reduces computation cost.
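A sketch of the Inception-A unit follows, using the standard Inception-v4 layout that the text references. The per-branch channel widths (384 in, 96 per branch) are the Inception-v4 defaults and are assumptions here; the text's "1 × 1 mean pooling" is read as the usual 3 × 3 average pooling with stride 1, since a 1 × 1 mean pool would be an identity operation.

```python
import torch
import torch.nn as nn

def conv_bn(cin, cout, k, pad=0):
    """Convolution followed by batch normalization and ReLU."""
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=pad, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class InceptionA(nn.Module):
    def __init__(self, cin=384):
        super().__init__()
        self.b0 = conv_bn(cin, 96, 1)                 # plain 1x1 branch
        self.b1 = nn.Sequential(conv_bn(cin, 64, 1),  # 1x1 bottleneck, then 3x3
                                conv_bn(64, 96, 3, pad=1))
        self.b2 = nn.Sequential(conv_bn(cin, 64, 1),  # 1x1 bottleneck, two 3x3
                                conv_bn(64, 96, 3, pad=1),
                                conv_bn(96, 96, 3, pad=1))
        self.b3 = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),
                                conv_bn(cin, 96, 1))  # mean-pooling branch

    def forward(self, x):
        # Concatenating the four 96-channel branches restores 384 channels.
        return torch.cat([self.b0(x), self.b1(x), self.b2(x), self.b3(x)], dim=1)
```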
Drawing on the structural characteristics of the residual structure and of Inception-A, a Resception module is designed; several Resception modules are connected in series, and the module groups are connected through ResNet residual skip connections. The Resception module first borrows the bottleneck structure proposed in GoogLeNet. The original purpose of the bottleneck structure is to reduce the computation of a convolutional layer: before a larger convolutional layer is computed, a 1 × 1 convolution compresses the number of channels of its input feature map; after the large convolutional layer finishes, a 1 × 1 convolutional layer is used again to restore the number of channels of the output feature map as required, as shown in Fig. 6, where c > b.
Fig. 7 shows the structure of the Resception module. As shown in the figure, a 1 × 1 convolutional layer comes first: the feature map output by the last convolutional layer of the pre-trained VGG16 model has 512 channels, and so that the number of input feature maps corresponds to the input of Inception-A, the input data first passes through this 1 × 1 first convolutional layer, whose number of convolution kernels is set to 384; this avoids, as far as possible, modifying the internal hyper-parameters of Inception-A. A first batch normalization layer is added after the first convolutional layer because, in a complex machine learning system, internal covariate shift easily arises during training. Batch normalization not only avoids internal covariate shift, but also makes the neural network more stable during training and less sensitive to initial values, and a larger learning rate can be used to accelerate convergence. The Inception-A module follows; because the Inception structure has convolution kernels of different sizes, it is effective at perceiving leaf lesions of different sizes. A second 1 × 1 convolutional layer is then added to restore the number of feature maps to that before the input; its number of convolution kernels is set to 512, realizing the bottleneck structure of the model trunk. Further, inspired by the residual structure, and to avoid model degradation and overfitting when constructing the whole model, a residual connection is added to the overall structure. The above operations complete the construction of one Resception module.
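Under the description above, one Resception module can be sketched as follows: a 1 × 1 convolution reducing 512 channels to 384, batch normalization, the Inception-A unit, a 1 × 1 convolution restoring 512 channels, batch normalization, and a skip connection around the block. Whether the skip connection sits inside each module or only at the part level is an assumption here; the part-level fusion described next is sketched separately.

```python
import torch.nn as nn

class ResceptionModule(nn.Module):
    def __init__(self, channels=512, inner=384, inception_a=None):
        super().__init__()
        # First convolution layer (1x1, 384 kernels) + first batch normalization.
        self.reduce = nn.Sequential(nn.Conv2d(channels, inner, 1, bias=False),
                                    nn.BatchNorm2d(inner), nn.ReLU(inplace=True))
        # The Inception-A unit sketched earlier; Identity is a stand-in.
        self.inception = inception_a if inception_a is not None else nn.Identity()
        # Second convolution layer (1x1, 512 kernels) + second batch normalization,
        # restoring the channel count and closing the bottleneck.
        self.restore = nn.Sequential(nn.Conv2d(inner, channels, 1, bias=False),
                                     nn.BatchNorm2d(channels))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.restore(self.inception(self.reduce(x)))
        return self.act(out + x)   # ResNet-style residual connection
```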
In a specific embodiment, the Resception part comprises a first Resception module, a second Resception module and a third Resception module. The output of the first batch normalization layer is input into the first Resception module and fused with the output of the first Resception module to obtain first feature fusion data; the first feature fusion data is input into the second Resception module and fused with the output of the second Resception module to obtain second feature fusion data; the second feature fusion data is input into the third Resception module and fused with the output of the third Resception module to obtain third feature fusion data; the third feature fusion data passes through the second convolution layer and the second batch normalization layer to obtain fourth feature fusion data; and the output of the first batch normalization layer is fused with the fourth feature fusion data to obtain fifth feature fusion data.
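The fusion cascade just described can be summarized in a short sketch, assuming "feature fusion" means element-wise addition (consistent with the residual connections the text describes); the three Resception modules and the trailing convolution/batch-normalization pair are passed in as black boxes.

```python
import torch.nn as nn

class ResceptionPart(nn.Module):
    def __init__(self, m1, m2, m3, tail, channels=512):
        super().__init__()
        self.bn_in = nn.BatchNorm2d(channels)  # normalization after the VGG16 backbone
        self.m1, self.m2, self.m3 = m1, m2, m3
        self.tail = tail                       # trailing conv layer + batch norm

    def forward(self, x):
        x0 = self.bn_in(x)      # output of the batch normalization layer
        f1 = x0 + self.m1(x0)   # first feature fusion data
        f2 = f1 + self.m2(f1)   # second feature fusion data
        f3 = f2 + self.m3(f2)   # third feature fusion data
        f4 = self.tail(f3)      # fourth feature fusion data
        return x0 + f4          # fifth feature fusion data

# For example, with the module sketched above:
# part = ResceptionPart(ResceptionModule(), ResceptionModule(), ResceptionModule(),
#                       nn.Sequential(nn.Conv2d(512, 512, 1), nn.BatchNorm2d(512)))
```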
The computation of the attention mechanism can be divided into two steps: first, attention weights are computed over all input information; second, the input feature information is weighted by the attention weights to select the key inputs. Let the input feature information be $x$, let $q$ be the query vector related to the current task, let $\alpha_i$ be the probability of selecting the $i$-th piece of input feature information, and let $\mathrm{score}(x, q)$ be the attention scoring function. The computation proceeds as:
$$\alpha_i = \mathrm{softmax}\big(\mathrm{score}(x_i, q)\big) = \frac{\exp\big(\mathrm{score}(x_i, q)\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{score}(x_j, q)\big)}$$
The screening of the feature information, conditioned on factors such as the current task and the network, can then be further expressed as:
$$\mathrm{att}(X, q) = \sum_{i=1}^{N} \alpha_i \, x_i$$
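In code, the two steps read as below; the dot-product score is purely illustrative, since the text leaves the scoring function score(x, q) abstract.

```python
import torch

def soft_attention(X: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """X: (N, d) input feature information; q: (d,) task-related query vector."""
    scores = X @ q                    # score(x_i, q), here a dot product
    alpha = torch.softmax(scores, 0)  # step 1: attention weights alpha_i
    return alpha @ X                  # step 2: weighted sum of the inputs

out = soft_attention(torch.randn(5, 8), torch.randn(8))
```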
in a specific embodiment, the attention mechanism module is a SENET network, a similar residual error structure is introduced into the SENET network and comprises a global average pooling layer, two full connection layers and a sigmoid layer which are sequentially connected, feature fusion data are input into the SENET network, global features of all channel pieces of a feature map are obtained, the global features are excited, relations among all channels are learned by obtaining weights of different channels, and fine-grained features are obtained by multiplying the relations by original feature mapping.
In particular, the SENet network improves network performance at the channel level. SENet is a typical channel-attention model: based on the relationships between feature channels, it improves the feature extraction ability of the neural network according to the importance of different channels. Its main working mechanism is a squeeze operation on each feature map of a convolutional layer, compressing a feature map of height, width and channel count $[H, W, C]$ into features of size $[1, 1, C]$ to obtain the global features of all channels of the feature map; the global features are then excited (Excitation), the relationships among channels are learned by obtaining the weights of different channels, and finally the weights are multiplied by the original feature map to obtain the final features.
Because fine-grained images have very similar features, more detailed image features are sought through an attention mechanism. The SENet network is therefore used as the attention mechanism module of the neural network: the features extracted by the pre-trained VGG16 network are input into SENet, and the global features among the channels of the feature map are obtained, further improving the accuracy of fine-grained recognition. To obtain this structure, the SENet model is modified, and the feature vector extracted by the pre-trained network replaces the original Inception module as the feature input of the neural network. The structure of the newly constructed SENet network is shown in Fig. 8. The SENet network introduces a residual-like structure; the global average pooling extracts global feature information while reducing the parameter count and computation of the neural network, and two fully connected (FC) layers are then added to limit model complexity and help increase the generalization ability of the model.
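The squeeze-and-excitation mechanism described above amounts to the following sketch: global average pooling squeezes each channel to a scalar, two fully connected layers with a sigmoid produce per-channel weights, and the weights rescale the original feature map. The reduction ratio r = 16 is the common SENet default and is an assumption here.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels=512, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # [H, W, C] squeezed to [1, 1, C]
        self.excite = nn.Sequential(            # two FC layers + sigmoid
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c))  # per-channel weights
        return x * w.view(b, c, 1, 1)                # reweight the original features
```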
The experimental case takes fine-grained image classification of citrus greening disease as an example. The overall data set includes three categories: healthy citrus, general citrus greening disease, and severe citrus greening disease. The feature differences across the data set are small, so it belongs to fine-grained image classification, and its feature and spatial-feature distribution differs considerably from that of the ImageNet data set. Although transfer learning can avoid the overfitting caused by training a large number of parameters in deep learning, transfer learning moves parameters trained on a source domain to a target domain, and if the distributions of the source and target data differ too much, negative transfer can arise during training. In addition, the overall network structure of the VGG_Resception model based on an attention mechanism is very complex and contains a large number of untrained parameters, so applying it directly may cause the model to overfit during training.
To avoid the possible overfitting and negative transfer in transfer learning, a two-step transfer learning method is adopted. As the name implies, it involves two transfer learning operations and introduces the concept of a "transition domain".
The transition domain is derived from the PlantVillage plant leaf disease data set, which contains many pictures of diseased plant leaves. These pictures are coarse-grained: only the broad disease category of each crop is given, without classification by the severity of a particular disease. However, the leaf and disease characteristics they express are similar to the fine-grained plant leaf disease pictures studied in this application, so training on them as a transition domain reduces the risk of negative transfer during actual training. Moreover, because the data set is plentiful, it reduces the risk of overfitting for the randomly initialized model parameters during training.
For convenience of training, when constructing the transition domain the related lesions are merged into one broad class: the leaves of each crop are treated as one class, and the lesion pictures of that crop's leaves are merged to form it. The PlantVillage data set contains many crop categories; the embodiment of the present application selects those with more lesion types and larger quantities, finally choosing 8 crops to form the data set: apple, cherry, corn, grape, peach, pepper, potato and tomato. 600 images are randomly selected for the training set, 200 for the validation set and 200 for the test set.
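A sketch of how such a transition domain could be assembled is given below. The directory layout, the file pattern and the 600/200/200 split are assumptions based on the description; the crop list follows the 8 classes named above.

```python
import random
from pathlib import Path

CROPS = ["apple", "cherry", "corn", "grape", "peach", "pepper", "potato", "tomato"]

def split_class(images, n_train=600, n_val=200, n_test=200):
    """Randomly split one coarse class into train/validation/test images."""
    random.shuffle(images)
    return (images[:n_train],
            images[n_train:n_train + n_val],
            images[n_train + n_val:n_train + n_val + n_test])

root = Path("plantvillage")  # hypothetical PlantVillage root directory
for crop in CROPS:
    # Merge every disease sub-class of this crop into one coarse class.
    images = [p for d in root.glob(f"{crop}*") for p in d.glob("*.jpg")]
    train, val, test = split_class(images)
```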
In a specific embodiment, the transfer learning between the source domain and the transition domain and the transfer learning between the transition domain and the target domain in the two-step transfer learning training process specifically include:
pre-training the VGG16 model on the source domain to realize parameter transfer of the convolutional layers, and obtaining the pre-trained VGG16 model;
fixing the weights and parameters of the pre-trained VGG16 model, using the pre-trained VGG16 model as a feature extractor, and initializing the network parameters of the VGG_Resception model based on an attention mechanism through the transition domain, to obtain the initialized VGG_Resception model based on an attention mechanism;
and fine-tuning the initialized VGG_Resception model based on an attention mechanism on the transition domain and the target domain, to realize feature transfer between the transition domain and the target domain and obtain the trained VGG_Resception model based on an attention mechanism.
Specifically, the first step is transfer learning between the source domain and the transition domain. The source domain (the ImageNet data set) is composed of real pictures; the convolutional layer parameters of the VGG16 model pre-trained on it are fixed. Considering that the pre-trained VGG16 model has already learned many low-level image features from ImageNet, only the Resception part and the attention part of the VGG_Resception model based on an attention mechanism are released in the early stage of transfer learning training. When the model loss is stable and no longer falls, all layers of the pre-trained VGG16 model after Block5_1 are released together with the parameters of the Resception part and the attention part, and 600 rounds of training are fixed so that the model forms a multi-level feature extractor. To improve the interpretability of the model, Class Activation Mapping (CAM) is used to locate the salient regions of citrus leaf lesions identified by the model.
The second step is transfer learning between the transition domain and the target domain: the parameters of the Resception part and the attention part of the VGG_Resception model based on an attention mechanism are released, and 200 rounds of training are fixed, so that the feature extraction part of the model learns the features of the target data set. The two-step transfer scheme is shown in Fig. 9.
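The two-step schedule can be condensed into the following sketch, with the network's three parts and the training loop left as stand-ins. The round counts (600 and 200) follow the description; the helper names and the treatment of the Block5_1 boundary are assumptions.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stand-ins for the real parts of the attention-based VGG_Resception model.
vgg16_backbone = nn.Conv2d(3, 512, 3)
resception_part = nn.Conv2d(512, 512, 1)
attention_part = nn.Linear(512, 3)

def train(parts, rounds=None):
    pass  # placeholder for the actual training loop

# Step 1: source domain -> transition domain. Early on, only the
# Resception and attention parts learn; the pre-trained backbone is fixed.
set_trainable(vgg16_backbone, False)
train([resception_part, attention_part], rounds=None)  # until the loss plateaus
# Then the layers after Block5_1 of VGG16 are also released:
set_trainable(vgg16_backbone, True)  # in practice, only Block5_1 onwards
train([vgg16_backbone, resception_part, attention_part], rounds=600)

# Step 2: transition domain -> target domain fine-tuning.
set_trainable(vgg16_backbone, False)
train([resception_part, attention_part], rounds=200)
```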
S2, acquiring a fine-grained image of plant leaf disease severity, inputting it into the trained VGG_Resception model based on an attention mechanism, and outputting a classification result.
In a specific embodiment, the fine-grained image of plant leaf disease severity undergoes multiple rounds of feature extraction and feature fusion by the pre-trained VGG16 model and the Resception part to obtain feature fusion data; the feature fusion data is then input into the attention mechanism part to extract and classify fine-grained features.
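For step S2, a minimal inference sketch might look as follows, assuming `model` is the trained VGG_Resception network, the usual 224 × 224 ImageNet preprocessing, and the three citrus classes of the experimental case.

```python
import torch
from PIL import Image
from torchvision import transforms

CLASSES = ["healthy", "general greening disease", "severe greening disease"]

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def classify_leaf(model: torch.nn.Module, path: str) -> str:
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = model(x)          # output of the softmax layer
    return CLASSES[int(probs.argmax(dim=1))]
```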
With further reference to fig. 10, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of a fine-grained image recognition apparatus based on weak supervised learning, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices in particular.
The embodiment of the present application provides a fine-grained image recognition apparatus based on weakly supervised learning, which includes:
a model construction and training module 1 configured to construct a VGG_Resception model based on an attention mechanism and perform two-step transfer learning training to obtain a trained VGG_Resception model based on an attention mechanism, wherein the model comprises a pre-trained VGG16 model, a Resception part and an attention mechanism part; the Resception part comprises a fourth batch normalization layer, a plurality of Resception modules, a third convolution layer and a third batch normalization layer; each Resception module comprises a first convolution layer, a first batch normalization layer, an Inception-A unit, a second convolution layer and a second batch normalization layer connected by a residual connection; the attention mechanism part comprises an attention mechanism module, a global average pooling layer, a fully connected layer and a softmax layer; and the two-step transfer learning training process comprises transfer learning between a source domain and a transition domain and transfer learning between the transition domain and a target domain, wherein the transition domain is a coarse-grained image data set;
and a result output module 2 configured to acquire a fine-grained image of plant leaf disease severity, input it into the trained VGG_Resception model based on an attention mechanism, and output a classification result.
Referring now to fig. 11, a schematic diagram of a computer apparatus 1100 suitable for use in implementing an electronic device (e.g., the server or terminal device shown in fig. 1) according to an embodiment of the present application is shown. The electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 11, the computer apparatus 1100 includes a Central Processing Unit (CPU) 1101 and a Graphics Processing Unit (GPU) 1102, which can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1103 or a program loaded from a storage section 1109 into a Random Access Memory (RAM) 1104. The RAM 1104 also stores the various programs and data necessary for the operation of the apparatus 1100. The CPU 1101, GPU 1102, ROM 1103 and RAM 1104 are connected to each other by a bus 1105. An input/output (I/O) interface 1106 is also connected to the bus 1105.
The following components are connected to the I/O interface 1106: an input portion 1107 including a keyboard, a mouse, and the like; an output section 1108 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage portion 1109 including a hard disk and the like; and a communication section 1110 including a network interface card such as a LAN card, a modem, or the like. The communication section 1110 performs communication processing via a network such as the internet. The driver 1111 may also be connected to the I/O interface 1106 as needed. A removable medium 1112 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1111 as necessary, so that a computer program read out therefrom is mounted in the storage section 1109 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communications portion 1110 and/or installed from the removable media 1112. The computer program, when executed by the Central Processing Unit (CPU) 1101 and the Graphics Processing Unit (GPU) 1102, performs the above-described functions defined in the methods of the present application.
It should be noted that the computer readable medium described herein may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor device or apparatus, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution device or apparatus. In the present application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including but not limited to electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution device or apparatus. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based devices that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present application may be implemented by software or hardware. The modules described may also be provided in a processor.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: construct a VGG_Resception model based on an attention mechanism and carry out two-step transfer learning training to obtain a trained VGG_Resception model based on an attention mechanism, wherein the model comprises a pre-trained VGG16 model, a Resception part and an attention mechanism part; the Resception part comprises a fourth batch normalization layer, a plurality of Resception modules, a third convolution layer and a third batch normalization layer; each Resception module comprises a first convolution layer, a first batch normalization layer, an Inception-A unit, a second convolution layer and a second batch normalization layer connected by a residual connection; the attention mechanism part comprises an attention mechanism module, a global average pooling layer, a fully connected layer and a softmax layer; the two-step transfer learning process comprises transfer learning between a source domain and a transition domain and transfer learning between the transition domain and a target domain, wherein the transition domain is a coarse-grained image data set; and acquire a fine-grained image of plant leaf disease severity, input it into the trained VGG_Resception model based on an attention mechanism, and output a classification result.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A fine-grained image identification method based on weak supervised learning is characterized by comprising the following steps:
constructing a VGG_Resception model based on an attention mechanism and carrying out two-step transfer learning training to obtain a trained VGG_Resception model based on an attention mechanism, wherein the VGG_Resception model based on an attention mechanism comprises a pre-trained VGG16 model, a Resception part and an attention mechanism part; the Resception part comprises a fourth batch normalization layer, a plurality of Resception modules, a third convolution layer and a third batch normalization layer; the Resception module comprises a first convolution layer, a first batch normalization layer, an Inception-A unit, a second convolution layer and a second batch normalization layer connected based on residuals; the attention mechanism part comprises an attention mechanism module, a global average pooling layer, a fully connected layer and a softmax layer; and the two-step transfer learning training process comprises transfer learning between a source domain and a transition domain and transfer learning between the transition domain and a target domain, wherein the transition domain is a coarse-grained image data set;
and acquiring a fine-grained image of plant leaf disease severity, inputting it into the trained VGG_Resception model based on an attention mechanism, and outputting a classification result.
2. The fine-grained image recognition method based on weak supervised learning of claim 1, wherein the fine-grained image of plant leaf disease severity undergoes multiple rounds of feature extraction and feature fusion by the pre-trained VGG16 model and the Resception part to obtain feature fusion data; and the feature fusion data is input into the attention mechanism part to extract and classify fine-grained features.
3. The fine-grained image recognition method based on weak supervised learning of claim 2, wherein the attention mechanism module is a SENet network into which a residual-like structure is introduced, the SENet network comprising a global average pooling layer, two fully connected layers and a sigmoid layer connected in sequence; the feature fusion data is input into the SENet network to obtain the global features of all channels of a feature map; the global features are excited, the relationships among channels are learned by obtaining the weights of different channels, and finally the weights are multiplied by the original feature map to obtain the fine-grained features.
4. The fine-grained image identification method based on weak supervised learning according to claim 2, wherein the Resception part comprises a first Resception module, a second Resception module and a third Resception module; the output result of the first batch normalization layer is input into the first Resception module and fused with the output of the first Resception module to obtain first feature fusion data; the first feature fusion data are input into the second Resception module, and the first feature fusion data and the output of the second Resception module are fused to obtain second feature fusion data; the second feature fusion data are input into the third Resception module, and the second feature fusion data and the output of the third Resception module are fused to obtain third feature fusion data; the third feature fusion data pass through the second convolution layer and the second batch normalization layer to obtain fourth feature fusion data; and the output result of the first batch normalization layer and the fourth feature fusion data are fused to obtain fifth feature fusion data.
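A hypothetical sketch of this cascaded residual fusion follows. The patent text does not name the fusion operator, so element-wise addition (the usual residual choice) is assumed, and a plain conv-BN-ReLU stack stands in for each Resception module.

```python
# Sketch of the cascaded residual feature fusion in claim 4; fusion is
# assumed to be element-wise addition, which the claim does not specify.
import torch
import torch.nn as nn

class ResceptionPart(nn.Module):
    def __init__(self, channels: int = 512):
        super().__init__()
        self.bn_in = nn.BatchNorm2d(channels)             # "first batch normalization layer"
        self.blocks = nn.ModuleList([
            nn.Sequential(                                # stand-in for one Resception module
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(3)
        ])
        self.conv_out = nn.Conv2d(channels, channels, 1)  # "second convolution layer"
        self.bn_out = nn.BatchNorm2d(channels)            # "second batch normalization layer"

    def forward(self, x):
        x0 = self.bn_in(x)
        f = x0
        for block in self.blocks:                         # first/second/third feature fusion data
            f = f + block(f)                              # fuse module input with module output
        f4 = self.bn_out(self.conv_out(f))                # fourth feature fusion data
        return x0 + f4                                    # fifth feature fusion data

# usage: y = ResceptionPart()(torch.randn(1, 512, 7, 7))
```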
5. The fine-grained image identification method based on weak supervised learning according to claim 1, wherein the Inception-A unit comprises a bottleneck-structure network consisting of a plurality of convolution layers with a convolution kernel size of 1 x 1, convolution layers with a convolution kernel size of 3 x 3, and a mean pooling layer with a mean pooling kernel size of 1 x 1.
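The following sketch follows the well-known Inception-v4 "A" block layout, which uses only the 1 x 1 and 3 x 3 convolutions and mean pooling recited here. The branch widths are assumptions, and a 3 x 3 mean pooling window with stride 1 is the conventional choice even though the claim recites a 1 x 1 pooling kernel.

```python
# Inception-A-style unit; branch widths are illustrative assumptions.
import torch
import torch.nn as nn

def conv_bn(cin, cout, k):
    # 1x1 or 3x3 convolution followed by batch normalization and ReLU
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class InceptionA(nn.Module):
    def __init__(self, cin: int = 384, width: int = 96):
        super().__init__()
        self.b1 = conv_bn(cin, width, 1)                       # 1x1 bottleneck branch
        self.b2 = nn.Sequential(conv_bn(cin, 64, 1), conv_bn(64, width, 3))
        self.b3 = nn.Sequential(conv_bn(cin, 64, 1), conv_bn(64, width, 3), conv_bn(width, width, 3))
        self.b4 = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),  # mean pooling branch
                                conv_bn(cin, width, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# usage: y = InceptionA()(torch.randn(1, 384, 14, 14))
```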
6. The fine-grained image identification method based on weak supervised learning according to claim 1, wherein the transfer learning between the source domain and the transition domain in the two-step transfer learning training process specifically comprises:
pre-training the VGG16 model on the source domain to realize parameter transfer of the convolutional layers, obtaining the pre-trained VGG16 model; and
fixing the weights and parameters of the pre-trained VGG16 model, taking the pre-trained VGG16 model as a feature extractor, and initializing the network parameters of the attention-mechanism-based VGG_Resception model through the transition domain, obtaining the initialized attention-mechanism-based VGG_Resception model.
7. The fine-grained image identification method based on weak supervised learning according to claim 6, wherein the transfer learning between the transition domain and the target domain in the two-step transfer learning training process specifically comprises:
fine-tuning the initialized attention-mechanism-based VGG_Resception model based on the transition domain and the target domain to realize feature transfer between the transition domain and the target domain, obtaining the trained attention-mechanism-based VGG_Resception model.
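An illustrative sketch of the full two-step procedure of claims 6 and 7 follows. The optimizer, learning rates, and data loaders are assumptions; only the freeze-then-fine-tune structure mirrors the claims.

```python
# Hypothetical two-step transfer learning loop: step 1 initializes the new
# layers on the coarse-grained transition set (backbone frozen from ImageNet
# pre-training); step 2 fine-tunes them on the fine-grained target set.
import torch
import torch.nn as nn

def train_epoch(model, loader, params, lr, device="cpu"):
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

def two_step_transfer(model, transition_loader, target_loader):
    # Only parameters of the new (non-frozen) layers are trained.
    new_params = [p for p in model.parameters() if p.requires_grad]
    # Step 1: source -> transition (network parameter initialization).
    train_epoch(model, transition_loader, new_params, lr=1e-2)
    # Step 2: transition -> target (fine-tuning with a smaller learning rate).
    train_epoch(model, target_loader, new_params, lr=1e-3)
    return model
```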
8. A fine-grained image identification device based on weak supervised learning, characterized by comprising:
a model construction and training module configured to construct an attention-mechanism-based VGG_Resception model and perform two-step transfer learning training to obtain a trained attention-mechanism-based VGG_Resception model, wherein the attention-mechanism-based VGG_Resception model comprises a pre-trained VGG16 model, a Resception part and an attention mechanism part; the Resception part comprises a fourth batch normalization layer, a plurality of Resception modules, a third convolution layer and a third batch normalization layer; the Resception module comprises a first convolution layer, a first batch normalization layer, an Inception-A unit, a second convolution layer and a second batch normalization layer which are connected based on residuals; the attention mechanism part comprises an attention mechanism module, a global average pooling layer, a fully connected layer and a softmax layer; and the two-step transfer learning training process comprises transfer learning between a source domain and a transition domain and transfer learning between the transition domain and a target domain, wherein the transition domain is a coarse-grained image data set; and
a result output module configured to acquire a fine-grained image of plant leaf disease degree, input it into the trained attention-mechanism-based VGG_Resception model, and output a classification result.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210004720.XA CN114511733A (en) 2022-01-05 2022-01-05 Fine-grained image identification method and device based on weak supervised learning and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210004720.XA CN114511733A (en) 2022-01-05 2022-01-05 Fine-grained image identification method and device based on weak supervised learning and readable medium

Publications (1)

Publication Number Publication Date
CN114511733A true CN114511733A (en) 2022-05-17

Family

ID=81549669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210004720.XA Pending CN114511733A (en) 2022-01-05 2022-01-05 Fine-grained image identification method and device based on weak supervised learning and readable medium

Country Status (1)

Country Link
CN (1) CN114511733A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115496954A (en) * 2022-11-03 2022-12-20 中国医学科学院阜外医院 Method, device and medium for constructing eye fundus image classification model
CN115601593A (en) * 2022-10-18 2023-01-13 苏州魔视智能科技有限公司 (CN) Image classification method, device, equipment and medium


Similar Documents

Publication Publication Date Title
CN108764292B (en) Deep learning image target mapping and positioning method based on weak supervision information
CN110458107B (en) Method and device for image recognition
CN109086811B (en) Multi-label image classification method and device and electronic equipment
CN111507378A (en) Method and apparatus for training image processing model
CN110647920A (en) Transfer learning method and device in machine learning, equipment and readable medium
CN111738436A (en) Model distillation method and device, electronic equipment and storage medium
CN111291809A (en) Processing device, method and storage medium
CN110189305B (en) Automatic analysis method for multitasking tongue picture
CN114511733A (en) Fine-grained image identification method and device based on weak supervised learning and readable medium
WO2021042857A1 (en) Processing method and processing apparatus for image segmentation model
CN112381763A (en) Surface defect detection method
CN113673482B (en) Cell antinuclear antibody fluorescence recognition method and system based on dynamic label distribution
CN114998220A (en) Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment
CN114743037A (en) Deep medical image clustering method based on multi-scale structure learning
CN115797781A (en) Crop identification method and device, computer equipment and storage medium
CN112786160A (en) Multi-image input multi-label gastroscope image classification method based on graph neural network
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
Zhao et al. Deeply supervised active learning for finger bones segmentation
CN116740362B (en) Attention-based lightweight asymmetric scene semantic segmentation method and system
CN117173697A (en) Cell mass classification and identification method, device, electronic equipment and storage medium
Kondaveeti et al. A Transfer Learning Approach to Bird Species Recognition using MobileNetV2
CN115994239A (en) Prototype comparison learning-based semi-supervised remote sensing image retrieval method and system
CN114882409A (en) Intelligent violent behavior detection method and device based on multi-mode feature fusion
Ghosh et al. PB3C-CNN: An integrated PB3C and CNN based approach for plant leaf classification
Kumari et al. A Deep Learning-Based Segregation of Housing Image Data for Real Estate Application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination