CN117475187A

CN117475187A - Method, device, equipment and storage medium for training image classification model

Info

Publication number: CN117475187A
Application number: CN202210831668.5A
Authority: CN
Inventors: 王前前; 冯伟; 高全学
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-07-14
Filing date: 2022-07-14
Publication date: 2024-01-30

Abstract

The application provides a method, a device, equipment and a storage medium for training an image classification model, which can be applied to the fields of artificial intelligence or intelligent agriculture and the like and is used for solving the problems of low classification accuracy and classification reliability of a target image classification model obtained through training. The method at least comprises the following steps: adopting a plurality of image processing strategies to respectively extract pixel-level features of the sample image set to obtain a plurality of feature image sets; each iteration includes: respectively extracting semantic features of the selected sample image and the corresponding multiple feature images, and carrying out feature fusion on each obtained semantic feature to obtain comprehensive features of the sample image; determining a predictive classification of the sample image based on the composite features; based on the plurality of feature maps, the comprehensive features and the predictive classification, model parameters of the image classification model are adjusted. By enlarging the data volume of the sample images and fully learning each sample image, the classification accuracy and the classification reliability of the target image classification model are improved.

Description

Method, device, equipment and storage medium for training image classification model

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for training an image classification model.

Background

With the development of technology, more and more devices can provide an image classification service through a trained object image classification model, and the image classification service can be used for determining the category of an object in an object image.

In the related art, a method for obtaining a trained target image classification model is generally obtained by performing multiple iterative training on an image classification model based on a sample image set, in each iterative process, firstly extracting image features of a sample image, then determining prediction classification of the sample image based on the image features, and finally adjusting model parameters of the image classification model based on errors between the prediction classification and classification labels of the sample image.

However, model training using the above method has the following drawbacks:

first, when the target size of the target object included in the sample image is smaller than the image size of the sample image, the sample image includes more noise information, so that the extracted image features also include more noise features, thereby generating larger training errors.

Second, among the pixel level features characterized by the sample image, only one type of pixel level feature is dominant, while other pixel level features do not have obvious features, for example, among the pixel level features characterized by the photo image, color related features are dominant, while gray level features do not have obvious features; as another example, among the pixel-level features of texture image characterization, texture-related features are typically dominant, while local binary features do not have a distinct characterization.

Therefore, when training the image classification model based on the sample image set, the image classification model usually learns mainly based on the dominant pixel-level features in the pixel-level features represented by the sample image, and cannot comprehensively learn all the pixel-level features represented by the sample image, so that the classification accuracy of the target image classification model obtained by training always has a bottleneck.

Therefore, the training mode adopted under the related technology cannot ensure the classification accuracy and the classification reliability of the target image classification model obtained by training.

Disclosure of Invention

The embodiment of the application provides a method, a device, computer equipment and a storage medium for training an image classification model, which are used for solving the problems of low classification accuracy and low classification reliability of a target image classification model obtained through training.

In a first aspect, a method of training an image classification model is provided, comprising:

adopting a plurality of image processing strategies to respectively extract pixel-level features of a sample image set to obtain a plurality of feature image sets, wherein each image processing strategy corresponds to one type of pixel-level features and corresponds to one feature image set, and each sample image corresponds to one feature image in each feature image set;

performing multiple rounds of iterative training on the image classification model to be trained based on the sample image set and the obtained multiple feature image sets, and outputting a trained target image classification model, wherein each round of iteration comprises:

respectively extracting semantic features of the selected sample image and the corresponding multiple feature images, and carrying out feature fusion on each obtained semantic feature to obtain comprehensive features of the sample image;

determining a predictive classification of the sample image based on the composite features;

and adjusting model parameters of the image classification model based on the plurality of feature maps, the comprehensive features and the prediction classification.

In a second aspect, there is provided an apparatus for training an image classification model, comprising:

the acquisition module is used for: the method comprises the steps of respectively extracting pixel-level features of a sample image set by adopting a plurality of image processing strategies to obtain a plurality of feature image sets, wherein each image processing strategy corresponds to one type of pixel-level features and corresponds to one feature image set, and each sample image corresponds to one feature image in each feature image set;

The processing module is used for: the method is used for carrying out multiple rounds of iterative training on the image classification model to be trained based on the sample image set and the obtained multiple feature image sets, and outputting a trained target image classification model, wherein each round of iteration comprises:

the processing module is further configured to: respectively extracting semantic features of the selected sample image and the corresponding multiple feature images, and carrying out feature fusion on each obtained semantic feature to obtain comprehensive features of the sample image;

the processing module is further configured to: determining a predictive classification of the sample image based on the composite features;

the processing module is further configured to: and adjusting model parameters of the image classification model based on the plurality of feature maps, the comprehensive features and the prediction classification.

Optionally, the processing module is specifically configured to:

determining a fight loss, a fusion loss, and a classification loss for the image classification model based on the plurality of feature maps, the composite features, and the predictive classification, wherein the fight loss characterizes: extracting the accuracy of semantic features, wherein the fusion loss represents: accuracy of feature fusion, the classification loss: characterizing and determining the accuracy of prediction classification;

Model parameters of the image classification model are adjusted based on the fight loss, the fusion loss, and the classification loss.

Optionally, the processing module is specifically configured to:

for the feature maps, the following operations are respectively executed:

converting the feature map into a pseudo image based on semantic features of the feature map and semantic features of the sample image, wherein the pseudo image and the sample image characterize the same kind of pixel-level features;

based on the sample image and the pseudo image, a countering loss of the image classification model is determined.

Optionally, the processing module is specifically configured to:

extracting semantic features of the pseudo image, and determining a first reconstruction loss of the image classification model based on errors between the semantic features of the sample image and the semantic features of the pseudo image;

determining a second reconstruction loss of the image classification model based on an error between the sample image and the pseudo image;

based on the first reconstruction loss and the second reconstruction loss, a countermeasures loss of the image classification model is determined.

Optionally, the image classification model includes a discrimination network, the discrimination network is used for predicting the discrimination probability that the input data is a sample image or a semantic feature of the sample image, and the input data has an associated reference probability; the processing module is specifically configured to:

Respectively taking the sample image and the pseudo image as input data of the discrimination network, predicting corresponding first discrimination probability, and respectively taking semantic features of the sample image and semantic features of the pseudo image as input data of the discrimination network, predicting corresponding second discrimination probability;

determining a discrimination loss of the image classification model based on the obtained first discrimination probabilities and the second discrimination probabilities and the error between the first discrimination probabilities and the corresponding reference probabilities;

and taking the weighted sum of the first reconstruction loss, the second reconstruction loss and the discrimination loss as the countermeasure loss of the image classification model.

Optionally, the processing module is specifically configured to:

carrying out weighted summation on the semantic features of the sample image and the semantic features of the feature images to obtain reference features;

and determining fusion loss of the image classification model based on the error between the comprehensive feature and the reference feature.

Optionally, the processing module is specifically configured to:

carrying out weighted summation on the countermeasure loss, the fusion loss and the classification loss to obtain training loss of the image classification model;

And when the obtained training loss does not meet the training target, adjusting the model parameters of the image classification model, and entering into the next round of iterative training.

Optionally, each feature atlas is one of a gray feature atlas, a texture feature atlas, a local binary feature atlas, or a directional gradient histogram set.

In a third aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first aspect.

In a fourth aspect, there is provided a computer device comprising:

a memory for storing program instructions;

and a processor for calling program instructions stored in the memory and executing the method according to the first aspect according to the obtained program instructions.

In a fifth aspect, there is provided a computer readable storage medium storing computer executable instructions for causing a computer to perform the method of the first aspect.

In the embodiment of the application, a plurality of image processing strategies are adopted to extract pixel-level features from a sample image set, so that a plurality of feature image sets can be obtained, each sample image corresponds to one feature image in each feature image set, each feature image corresponding to the sample image can respectively represent a plurality of types of pixel-level features contained in each sample image, and therefore, when an image classification model to be trained is subjected to multi-round iterative training based on the plurality of feature image sets of the sample image sets, the image classification model can fully learn the plurality of types of pixel-level features contained in each sample image, and the problem that the classification accuracy and the classification reliability of a target image classification model obtained through training are low due to the fact that the image classification model only learns one type of pixel-level feature which is dominant in the sample image when the image classification model is directly trained based on the sample image is avoided.

Further, the multiple feature maps corresponding to each sample image and characterizing different kinds of pixel-level features have different interpretation modes for the object, the background, the noise and the like contained in the sample image, and may not be obviously characterized in some feature maps, and in some feature maps, the background is not obviously characterized, and since each feature map is obtained based on the object contained in the sample image, the object is obviously characterized for all feature maps. Therefore, after the semantic features of the sample image and the plurality of feature images are extracted and feature fusion is carried out, the obtained comprehensive features mainly show the features of the target objects contained in the sample image, and unnecessary contents such as background, noise and the like of the sample image are weakened. Therefore, based on the target image classification model trained by the comprehensive characteristics, the semantic characteristics of the target objects contained in the image to be classified can be extracted more accurately so as to be classified accurately and reliably.

Further, after the plurality of feature image sets corresponding to the sample image set are obtained, the image classification model to be trained can be trained based on the sample image set and the plurality of feature image sets, which is equivalent to expanding the number of sample images contained in the original sample image set to multiple times of the original sample images, so that the aim of increasing training data is fulfilled, and therefore, the image classification model can be sufficiently trained, and the classification accuracy and the classification reliability of the trained target image classification model are improved.

Drawings

Fig. 1A is a schematic diagram of an application field of an image classification model according to an embodiment of the present application;

fig. 1B is an application scenario of a method for training an image classification model according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating a method for training an image classification model according to an embodiment of the present application;

FIG. 3A is a schematic diagram of a method for training an image classification model according to an embodiment of the present application;

FIG. 3B is a schematic diagram II of a method for training an image classification model according to an embodiment of the present application;

FIG. 4 is a schematic diagram III of a method for training an image classification model according to an embodiment of the present application;

FIG. 5A is a schematic diagram of a method for training an image classification model according to an embodiment of the present disclosure;

FIG. 5B is a schematic diagram of a method for training an image classification model according to an embodiment of the present disclosure;

FIG. 5C is a schematic diagram seven of a method for training an image classification model according to an embodiment of the present application;

FIG. 5D is a schematic diagram eight of a method for training an image classification model according to an embodiment of the present disclosure;

FIG. 5E is a schematic diagram nine of a method for training an image classification model according to an embodiment of the present application;

FIG. 6A is a schematic diagram of a method for training an image classification model according to an embodiment of the present disclosure;

FIG. 6B is a second flowchart of a method for training an image classification model according to an embodiment of the present application;

FIG. 7 is a schematic diagram eleven of a method for training an image classification model according to an embodiment of the present application;

FIG. 8 is a schematic diagram of an apparatus for training an image classification model according to an embodiment of the present application;

fig. 9 is a schematic diagram ii of an apparatus for training an image classification model according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

Some of the terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.

(1) Local binary (Local Binary Pattern, LBP) features:

the LBP feature is an operator for describing the local feature of the image, and has the remarkable advantages of gray invariance, rotation invariance and the like. In the neighborhood of the pixel 3*3, the gray value of the neighboring 8 pixels is compared with the pixel value of the neighborhood center by taking the neighborhood center pixel as a threshold value, if the surrounding pixels are larger than the center pixel value, the position of the pixel point is marked as 1, otherwise, the position is marked as 0. Thus, 8 points in the 3*3 neighborhood can be compared to generate 8-bit binary numbers, the 8-bit binary numbers are sequentially arranged to form a binary number, the binary number is the LBP value of the central pixel, and the LBP value is 28 possibilities, so that 256 LBP values exist, and the LBP value of the central pixel can reflect texture information of the surrounding area of the pixel.

(2) Directional gradient histogram (Histogram of Oriented Gradient, HOG) feature:

the directional gradient histogram feature is a feature descriptor used for object detection in computer vision and image processing. HOG features are characterized by computing and counting the gradient direction histograms of local areas of the image. In an image, the appearance and shape of a local object can be well described by the directional density distribution of gradients or edges.

The embodiment of the application relates to the field of artificial intelligence (Artificial Intelligence, AI), which is designed based on Computer Vision (CV) technology and Machine Learning (ML) technology, and can be applied to the fields of cloud computing, intelligent transportation, intelligent agriculture or maps and the like.

Artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science, which studies the design principles and implementation methods of various machines in an attempt to understand the essence of intelligence, and to produce a new intelligent machine that can react in a similar way to human intelligence, so that the machine has the functions of sensing, reasoning and decision.

Artificial intelligence is a comprehensive discipline, and relates to a wide range of fields, including hardware-level technology and software-level technology. Basic technologies of artificial intelligence generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation interaction systems, electromechanical integration, and the like. The software technology of artificial intelligence mainly comprises computer vision technology, voice processing technology, natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other large directions. With the development and progress of artificial intelligence, the artificial intelligence is developed and applied in various fields, such as common fields of smart home, smart customer service, virtual assistant, smart sound box, smart marketing, smart wearable equipment, unmanned driving, automatic driving, unmanned plane, robot, smart medical treatment, internet of vehicles, automatic driving, smart transportation, etc., and it is believed that with the further development of future technology, the artificial intelligence will be applied in more fields, playing an increasingly important value. The scheme provided by the embodiment of the application relates to the technologies of artificial intelligence such as deep learning and augmented reality, and is specifically further described through the following embodiments.

The computer vision is a science for researching how to make a machine "see", and more specifically, a camera and a computer are used to replace human eyes to identify, track and measure targets, and the like, and further, graphic processing is performed, so that the computer is processed into images which are more suitable for human eyes to observe or transmit to an instrument to detect. As a scientific discipline, computer vision research-related theory and technology has attempted to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning and mapping, autopilot, intelligent transportation, etc., as well as common biometric technologies such as face recognition, fingerprint recognition, etc.

Machine learning is a multi-field interdisciplinary, and relates to multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like, and a specially researched computer acquires new knowledge or skills by simulating learning behaviors of human beings, reorganizes the existing knowledge structure and enables the computer to continuously improve the performance of the computer.

Machine learning is the core of artificial intelligence, which is the fundamental way for computers to have intelligence, applied throughout various areas of artificial intelligence; the core of machine learning is deep learning, which is a technology for realizing machine learning. Machine learning typically includes deep learning, reinforcement learning, transfer learning, induction learning, artificial neural networks, teaching learning, etc., which includes convolutional neural networks (Convolutional Neural Networks, CNN), deep confidence networks, recurrent neural networks, automatic encoders, generation countermeasure networks, etc.

It should be noted that, in the embodiments of the present application, related data such as a sample image set or an image to be classified is referred to, when the embodiments of the present application are applied to specific products or technologies, user permission or consent is required to be obtained, and the collection, use and processing of related data are required to comply with related laws and regulations and standards of related countries and regions.

The application field of the method for training the image classification model provided in the embodiment of the present application is briefly described below.

For example, in the agricultural field, intelligent agriculture has become a mainstream technology, and tasks such as detection, analysis, early warning, prevention and control in agricultural production are realized by relying on artificial intelligence technology. The plant diseases and insect pests image classification task is used as an important component of intelligent agriculture, and can assist vast agricultural workers to cope with plant diseases and insect pests of crops, so that early discovery and early prevention are realized.

The task of classifying the plant diseases and insect pests image can determine different kinds of diseases existing in the crops based on the crop pictures, please refer to fig. 1A, and can determine that the crop has diseases such as mildew or anthracnose based on leaf states or fruit states in the crop pictures. The disease existing in crops is determined, so that the crops can be subjected to symptomatic drug delivery, abuse of chemical drugs is reduced, environmental pollution is avoided, problems caused by drug-resistant pathogen strains are reduced, and meanwhile, the planting cost investment and economic losses caused by error disease diagnosis are reduced.

Compared with the mode of determining the crop pest and disease damage problem by relying on the guidance of agricultural specialists, the image classification service provided by the trained target image classification model is adopted, agricultural specialists with higher professional level and more abundant experience are not needed, subjective influence factors can be avoided, the efficiency of determining the crop pest and disease damage is improved, and the labor cost is reduced.

However, model training using the above method has the following drawbacks:

In order to solve the problem that the classification accuracy and the classification reliability of the target image classification model obtained through training are low, the application provides a method for training the image classification model. In the method, a plurality of image processing strategies are adopted to respectively extract pixel-level features of a sample image set to obtain a plurality of feature image sets, wherein each image processing strategy corresponds to one type of pixel-level features and corresponds to one feature image set, and each sample image corresponds to one feature image in each feature image set. Performing multiple rounds of iterative training on the image classification model to be trained based on the sample image set and the obtained multiple feature atlas, and outputting a trained target image classification model, wherein each round of iteration comprises:

And respectively extracting the semantic features of the selected sample image and the corresponding multiple feature images, and carrying out feature fusion on each obtained semantic feature to obtain the comprehensive features of the sample image. Based on the integrated features, a predictive classification of the sample image is determined. Based on the plurality of feature maps, the comprehensive features and the predictive classification, model parameters of the image classification model are adjusted.

The application scenario of the method for training the image classification model provided in the present application is described below.

Referring to fig. 1B, a schematic view of an application scenario of a method for training an image classification model provided in the present application is shown. The application scene comprises a client 101 and a server 102. Communication may be between client 101 and server 102. The communication mode can be communication by adopting a wired communication technology, for example, communication is carried out through a connecting network wire or a serial port wire; the communication may also be performed by using a wireless communication technology, for example, a bluetooth or wireless fidelity (wireless fidelity, WIFI) technology, which is not particularly limited.

Client 101 broadly refers to a device that may provide a sample image set to server 102 or may use a trained target image classification model or the like, e.g., a terminal device, a third party application accessible to a terminal device, or a web page accessible to a terminal device, or the like. Terminal devices include, but are not limited to, cell phones, computers, smart medical devices, smart home appliances, vehicle terminals or aircraft, etc. The server 102 generally refers to a device, such as a terminal device or a server, that can train an image classification model. Servers include, but are not limited to, cloud servers, local servers, or associated third party servers, and the like. Both the client 101 and the server 102 can adopt cloud computing to reduce occupation of local computing resources; cloud storage may also be employed to reduce the occupation of local storage resources.

As an embodiment, the client 101 and the server 102 may be the same device, which is not limited in particular. In this embodiment, the description is given by taking the example that the client 101 and the server 102 are different devices respectively.

The following specifically describes a method for training an image classification model provided in the embodiment of the present application with the server 102 as a server and the server as a main body based on fig. 1B. Referring to fig. 2, a flowchart of a method for training an image classification model according to an embodiment of the present application is shown.

S201, adopting various image processing strategies to respectively extract pixel-level features of the sample image set to obtain a plurality of feature atlases.

Before the image classification model to be trained is trained for multiple rounds, a sample image set for training can be acquired first, and the image classification model to be trained is trained for multiple rounds based on the sample image set. The server can acquire a sample image set from network resources, and can collect each sample image to form the sample image set according to actual demands. Each sample image in the set of sample images may have an associated classification tag for characterizing the actual classification of the object contained in the corresponding sample image.

After obtaining the sample image set, the server may employ a plurality of image processing strategies to respectively perform pixel-level feature extraction on the sample image set to obtain a plurality of feature atlases. The image processing strategies include a color processing strategy, a gray level processing strategy, a texture processing strategy, a local binary (Local Binary Pattern, LBP) processing strategy, or a histogram of oriented gradients (Histogram of oriented gradients, HOG) processing strategy, etc. After the pixel-level feature extraction is carried out on the sample image by adopting each image processing strategy, a type of pixel-level feature can be correspondingly obtained, so that a feature map is generated, namely, after the pixel-level feature extraction is carried out on the sample image set by adopting each image processing strategy, a feature map set can be correspondingly obtained, and each sample image corresponds to a feature map in each feature map set. A feature map may be used as a view of the corresponding sample image.

For example, the RGB features of each sample image may be extracted by a color processing policy, so as to obtain RGB feature maps corresponding to the sample images, so that the RGB feature maps corresponding to the sample images may form an RGB feature map set.

The gray characteristic of each sample image can be extracted through a gray processing strategy, and a gray characteristic image corresponding to the sample image is obtained, so that the gray characteristic images corresponding to the sample images can form a gray characteristic image set.

The texture feature of each sample image can be extracted through a texture processing strategy, and a texture feature image corresponding to the sample image is obtained, so that the texture feature images corresponding to the sample images can form a texture feature image set. Please refer to fig. 3A (1) for a sample image, and fig. 3A (2) for a feature map corresponding to the sample image, wherein the feature map characterizes texture features of the sample image.

The local binary characteristic of each sample image can be extracted through a local binary processing strategy, and a local binary characteristic image corresponding to the sample image is obtained, so that the local binary characteristic images corresponding to the sample images can form a local binary characteristic image set. Please refer to fig. 3B (1) for a sample image, and fig. 3B (2) for a feature map corresponding to the sample image, wherein the feature map characterizes local binary features of the sample image.

The directional gradient histogram feature of each sample image can be extracted through a directional gradient histogram processing strategy, and a directional gradient histogram feature map corresponding to the sample image is obtained, so that the directional gradient histogram feature maps corresponding to the sample images can form a directional gradient histogram feature map set.

In the embodiment of the application, the sample image set is processed by adopting various image processing strategies, so that the pixel-level characteristics of the representation of each sample image can be comprehensively obtained, and the obtained characteristic images and the sample images are used as training data of the image classification model, thereby achieving the purpose of expanding the quantity of the training data. By adding the feature map for training, the problem that the target image classification model obtained by training is inaccurate due to larger training errors caused by adverse factors of illumination or shooting of the sample image when training is carried out only through the sample image can be solved.

As an embodiment, after the sample image set is obtained, each sample image included in the sample image set may be preprocessed to obtain a preprocessed sample image, and the preprocessed sample image may be added as the sample image into the sample image set, so as to achieve the purpose of expanding the sample image set.

The preprocessing can comprise a plurality of processing modes such as rotation processing, dislocation processing or cutting processing of the sample images, and the preprocessing can be performed on the premise that the sizes of the images are kept uniform, so that the sample images in the sample image set are ensured to have uniform image sizes, and the training of the image classification model is facilitated.

S202, performing multi-round iterative training on the image classification model to be trained based on the sample image set and the obtained multiple feature image sets, and outputting a trained target image classification model.

After obtaining the sample image set and the plurality of feature image sets, the server may perform multiple rounds of iterative training on the image classification model to be trained based on the sample image set and the plurality of feature image sets obtained, outputting a trained target image classification model. Thus, the image classification model can more accurately extract various pixel-level features contained in the sample image from the sample image through learning the sample image and the corresponding multiple feature images, rather than focusing on only the dominant pixel-level features characterized in the sample image.

Taking a round of iterative training process as an example, the process of training the image classification model is described, and the process of each round of iterative training is similar, and is not described herein again, please refer to S203-S205.

S203, respectively extracting semantic features of the selected sample image and the corresponding multiple feature images, and carrying out feature fusion on the obtained semantic features to obtain comprehensive features of the sample image.

For a sample image selected from the sample image set, the server may extract semantic features of the sample image and its corresponding feature maps respectively using an image classification model, where the semantic features characterize deep abstract features of the corresponding sample image or feature map characterization.

After obtaining the semantic features of the sample image and the semantic features of the feature images, the server can perform feature fusion on the semantic features to obtain the comprehensive features of the sample image. Since the feature extraction is performed on the sample image and the target object contained in the feature map when the semantic features are extracted, some semantic features may be included in the extracted semantic features, including noise or background, which does not help to determine the classification of the target object, but each semantic feature includes features related to the target object, so that after feature fusion is performed on each semantic feature, the representation of the features related to the target object is further enhanced, and the representation of the features unrelated to the target object is weakened, and therefore, the features related to the target object in the sample image can be more accurately obtained through feature fusion, thereby performing more accurate image classification task.

As an embodiment, the image classification model may comprise an encoder network for extracting semantic features of the input image. The sample image and each feature map can be respectively corresponding to different encoders in the encoder network, each encoder can focus on extracting semantic features of the input image representing one type of pixel-level features, one encoder is not required to learn and extract the semantic features of the input image representing multiple types of pixel-level features, the training process of each encoder is simplified, and the training efficiency is improved.

Taking a feature map as an example, please refer to fig. 4, the image classification model includes an encoder network and a fusion network. The encoder network comprises two encoders, the sample image corresponds to a first encoder, the first encoder is used for extracting semantic features of the sample image, the feature map corresponds to a second encoder, and the second encoder is used for extracting the semantic features of the feature map. The fusion network is used for carrying out feature fusion on the semantic features of the sample image and the semantic features of the feature map to obtain the comprehensive features of the sample image.

And carrying out feature extraction on the sample image by adopting a first encoder to obtain a first semantic feature, and carrying out feature extraction on the feature map by adopting a second encoder to obtain a second semantic feature. After the first semantic features and the second semantic features are obtained, the first semantic features and the second semantic features are subjected to feature fusion by adopting a fusion network, so that comprehensive features are obtained.

S204, based on the comprehensive characteristics, determining the prediction classification of the sample image.

After obtaining the composite features, the server may determine a predictive classification of the sample image based on the composite features using the image classification model. The prediction classification may represent the category of the target object contained in the sample image, for example, in the field of agricultural diseases and insect pests, taking a scene of image classification for grape leaf diseases and insect pests as an example, the sample image may be an image containing grape leaf, and then the prediction classification and classification label for the sample image may be whether the grape leaf has diseases and insect pests, or may be the type of diseases and insect pests existing in the grape leaf, and the like, which is not particularly limited.

The prediction classification may be expressed in the form of probability distribution, and may be preset with various categories according to application scenarios, and may be various categories included in each classification label corresponding to the sample image set, so that when the prediction classification of the sample image is determined based on the comprehensive characteristics, the probability that the sample image belongs to each category may be predicted respectively, and the probability distribution is formed. The classification tag page of the sample image may be represented in the form of a probability distribution, for example, the probability of the category to which the sample image belongs is 1, the probability of the other categories is 0, and the like, and is not particularly limited.

S205, adjusting model parameters of an image classification model based on the plurality of feature graphs, the comprehensive features and the prediction classification.

The server may adjust model parameters of the image classification model based on the plurality of feature maps, the composite features, and the predictive classifications obtained for the sample image. The model parameters of the image classification model are adjusted through the feature map, the comprehensive features and the prediction classification in various angles, so that the image classification model can be focused on extracting the semantic features of the feature map, carrying out feature fusion on the semantic features, determining the prediction classification of the sample image and other learning aspects, and compared with a method for adjusting the model parameters of the image classification model based on the prediction classification only, the trained target image classification model has higher classification accuracy and reliability.

As one example, the server may determine the fight loss, fusion loss, and classification loss of the image classification model based on the plurality of feature maps, the composite features, and the predictive classification. Challenge loss characterization: extracting the accuracy of semantic features, and fusing loss characterization: accuracy of feature fusion, classification loss: characterization determines the accuracy of the prediction classification. Model parameters of the image classification model are adjusted based on the antagonism loss, fusion loss, and classification loss.

Through multiple losses, the training effect of the image classification model is measured, so that the image classification model can be comprehensively trained, and the trained target image classification model can have more accurate and reliable image classification capability.

Based on the feature maps, the comprehensive features and the prediction classification, there are various methods for determining the countermeasures, the fusion losses and the classification losses of the image classification model, and the process of obtaining the countermeasures, the fusion losses and the classification losses will be described below by taking one of them as an example. Meanwhile, taking a process of determining the antagonism loss, the fusion loss and the classification loss of the image classification model based on one feature map as an example, the determination process based on other feature maps is similar, and will not be described herein.

Countering losses:

converting the feature map into a pseudo image based on the semantic features of the feature map and the sample image, and determining a countering loss of the image classification model based on the sample image and the pseudo image.

The server may employ an image classification model to convert the feature map into a pseudo image according to the sample image based on semantic features of the feature map and the sample image, the pseudo image and the sample image characterizing the same kind of pixel level features. Because the pseudo image is generated according to the sample image, the more similar the pseudo image and the sample image, the more accurate the image classification model learns the mutual conversion process between the first sample image and the feature map, so that the trained target image classification model is not focused on the dominant pixel-level features represented by the sample image when facing the sample image, but semantic features capable of representing various pixel-level features can be comprehensively extracted from the sample image, and the classification accuracy of the target image classification model is improved. Thus, the contrast loss of the image classification model may be determined based on the pseudo-image and the sample image, the image classification model being trained based on the contrast loss.

As an embodiment, after obtaining the integrated feature, the server may also use an image classification model to convert the feature map into a pseudo image or the like according to the sample image based on the integrated feature and the semantic feature of the sample image, without being limited in particular.

As an embodiment, the image classification model may comprise a generation network for generating the corresponding pseudo-image according to a certain specified image based on the input data. Each feature map may correspond to a generator in the generation network, and different generators may generate corresponding pseudo-images from the sample images based on different feature maps. The feature maps may all correspond to the same generator or the like in the generation network, and are not particularly limited.

Referring to fig. 5A, after feature extraction is performed on the sample image through the image classification model, semantic features, i.e., feature subspaces, of the sample image are obtained. And converting the feature map into a pseudo image based on the semantic features and the feature map of the sample image by adopting a generating network.

As an embodiment, after obtaining the pseudo image generated based on the feature map, the image classification model may be further used to extract semantic features of the pseudo image, determine a first reconstruction loss of the image classification model based on an error between the semantic features of the sample image and the semantic features of the pseudo image, and measure a degree of similarity between the pseudo image and the sample image from the perspective of abstraction of the semantic features respectively represented by the pseudo image and the sample image.

Taking a feature map as an example, the image classification model includes an editor network and a generation network. The editor network comprises a first editor for extracting semantic features of the sample image and a second editor for extracting pseudo-images generated based on the feature map, the generating network being used for converting the feature map into pseudo-images. Referring to fig. 5B, the sample image is passed through a first editor, first semantic features and feature maps,and generating a pseudo image through a generation network, extracting second semantic features from the pseudo image through a second editor, and determining a first reconstruction loss based on errors between the first semantic features and the second semantic features. Please refer to formula (1), to determine the first reconstruction loss of the image classification modelIs a method of (a).

Wherein x is ⁽²⁾ Characterizing a sample image, which obeys a data distributionx ⁽¹⁾ Characterizing a feature map, which obeys a data distribution +.>z characterizes the semantic features of the sample image, G (x ⁽¹⁾ Z) characterizing the semantic features z and the feature map x based on the sample image ⁽¹⁾ Generated pseudo image->The semantic features of the pseudo-image are characterized, I.I ₁ The L1 norm is characterized.

After obtaining the pseudo-image generated based on the feature map, a second reconstruction loss of the image classification model may also be determined based on an error between the sample image and the pseudo-image. The degree of similarity between the pseudo-image and the sample image is measured from the viewpoint of the artefact and the sample image itself.

Taking a feature map as an example, the image classification model includes an editor network and a generation network. The editor network is used to extract semantic features of the sample image and the generation network is used to convert the feature map into a pseudo image. Referring to fig. 5C, the sample image passes through the editor network to extract the semantic features of the sample image, the semantic features of the sample image andthe feature map, by generating a network, generates a pseudo-image, and determines a second reconstruction loss based on an error between the sample image and the pseudo-image. Please refer to formula (2) for determining the second reconstruction loss of the image classification modelIs a method of (a).

Wherein x is ⁽²⁾ Characterizing a sample image, which obeys a data distributionx ⁽¹⁾ Characterizing a feature map, which obeys a data distribution +.>z characterizes the semantic features of the sample image, G (x ⁽¹⁾ Z) characterizing the semantic features z and the feature map x based on the sample image ⁽¹⁾ The generated pseudo-image is then processed to generate, I.I ₁ The L1 norm is characterized.

After obtaining the first reconstruction loss, determining a counterloss of the image classification model based only on the first reconstruction loss; after obtaining the second reconstruction loss, an antagonistic loss of the image classification model may be determined based only on the second reconstruction loss; after obtaining the first and second reconstruction losses, the countermeasures losses of the image classification model may be determined together based on the first and second reconstruction losses.

As an embodiment, in S203, when extracting the semantic feature of the feature map, the feature map may be converted into the pseudo image by using the image classification model according to the foregoing, and the semantic feature of the pseudo image may be extracted by using the image classification model, and the semantic feature of the pseudo image may be used as the semantic feature of the feature map.

As an embodiment, the image classification model may comprise a discrimination network for predicting a discrimination probability of the input data being the sample image or a semantic feature of the sample image. The input data has an associated reference probability, which may be expressed in the form of a probability distribution, e.g. the probability distribution comprises two categories, one being the input data being the sample image and the other being the input data not being the sample image, the probability of the input data being the sample image being 1 and the probability of the input data not being the sample image being 0 in the probability distribution of the reference probability of the sample image.

And respectively taking the sample image and the pseudo image as input data of a discrimination network, and predicting corresponding first discrimination probabilities, namely the first discrimination probability of the sample image and the first discrimination probability of the pseudo image. And simultaneously, respectively taking the semantic features of the sample image and the semantic features of the pseudo image as input data of a discrimination network, and predicting corresponding second discrimination probabilities, namely the second discrimination probabilities of the semantic features of the sample image and the second discrimination probabilities of the semantic features of the pseudo image.

Based on the obtained first discrimination probabilities and the second discrimination probabilities, the discrimination loss of the image classification model is determined based on the error between the first discrimination probabilities and the second discrimination probabilities and the reference probabilities corresponding to the first discrimination probabilities. Determining a discrimination loss of the image classification model based on an error between the first discrimination probability of the sample image and the reference probability of the sample image, and an error between the first discrimination probability of the pseudo image and the reference probability of the pseudo image, and an error between the second discrimination probability of the semantic feature of the sample image and the reference probability of the sample image, and an error between the second discrimination probability of the semantic feature of the pseudo image and the reference probability of the pseudo image.

As an embodiment, the discrimination loss may include a first discrimination loss and a second discrimination loss, and the image classification model includes an editor network, a generation network, and a discrimination network, taking one feature map as an example. The editor network is used to extract semantic features of the sample image. The generating network is used for converting the feature map into a pseudo image, and the discriminating network is used for determining the discriminating probability of the input data. The generating network and the discriminating network form an countermeasure network, and the idea of the countermeasure network is to let the generating network and the discriminating network participate in a minimum or maximum game, i.e. the generating network and the discriminating network are to minimize and maximize the probability that the pseudo image is discriminated as false, respectively, as much as possible. The first discrimination loss is determined based on an error between the first discrimination probability of the sample image and the reference probability of the sample image, and an error between the first discrimination probability of the pseudo image and the reference probability of the pseudo image.

Referring to fig. 5D, the sample image is passed through the editor network to extract the semantic features of the sample image, and the semantic features and feature map are passed through the generation network to generate the pseudo image. And respectively taking the sample image and the pseudo image as input data of a discrimination network to obtain a first discrimination probability of the sample image and a first discrimination probability of the pseudo image. The first discrimination loss is determined based on an error between the first discrimination probability of the sample image and the reference probability of the sample image, and an error between the first discrimination probability of the pseudo image and the reference probability of the pseudo image.

Please refer to formula (3) for determining the first discrimination lossIs a method of (a).

Wherein x is ⁽¹⁾ Characterizing a sample image, which obeys a data distributionx ⁽²⁾ Characterizing a feature map, which obeys a data distribution +.>z characterizes the semantic features of the sample image, G (x ⁽²⁾ Z) characterizing the semantic features z and the feature map x based on the sample image ⁽²⁾ The generated pseudo image D ₁ (x ⁽¹⁾ ) Characterizing a discrimination probability for the sample image, D ₁ (G(x ⁽²⁾ Z)) characterizes the discrimination probability for the pseudo-image.

Continuing with a feature map as an example, the image classification model includes an editor network, a generation network, and a discrimination network. The editor network comprises a first editor for extracting semantic features of the sample image and a second editor for extracting semantic features of the pseudo-image into which the feature map is converted. The generating network is used for converting the feature map into a pseudo image, and the discriminating network is used for determining the discriminating probability of the input data. The second discrimination loss is determined based on an error between the second discrimination probability of the semantic feature of the sample image and the reference probability of the sample image and an error between the second discrimination probability of the semantic feature of the pseudo image and the reference probability of the pseudo image.

Referring to fig. 5E, the sample image is passed through a first editor to extract first semantic features, and the first semantic features and the feature map are passed through a generation network to generate a pseudo image. The pseudo image is extracted to obtain second semantic features through a second editor, and the first semantic features and the second semantic features are respectively used as input data of a discrimination network to obtain second discrimination probabilities of the first semantic features and second discrimination probabilities of the second semantic features. Determining a second discrimination loss based on an error between the second discrimination probability of the first semantic feature and the reference probability of the first semantic feature and an error between the second discrimination probability of the second semantic feature and the reference probability of the second semantic feature.

Please refer to formula (4) for determining the second discrimination lossIs a method of (a).

Wherein x is ⁽¹⁾ Characterizing sample images, x ⁽²⁾ Characterizing feature graphs, which obey data distributionz characterizes the semantic features of the sample image, which obeys the data distribution +.>G(x ⁽²⁾ Z) characterizing pseudo-images generated based on semantic features z and feature maps of the sample image, E ₂ (G(x ⁽¹⁾ Z)) characterizes semantic features of the pseudo-image, D ₁ (z) characterizing the probability of discrimination of semantic features for the sample image, D ₂ (E ₂ (G(x ⁽²⁾ Z)) characterizes the discrimination probability for the semantic features of the pseudo-image.

After the discrimination loss is obtained, the counterloss of the image classification model may be determined based on only the discrimination loss, or a weighted sum of the first reconstruction loss, the second reconstruction loss, and the discrimination loss may be obtained after the first reconstruction loss, the second reconstruction loss, and the discrimination loss, as the counterloss of the image classification model, or the like, without being particularly limited.

Taking a feature map as an example, the image classification model includes an editor network, a generation network, and a discrimination network. The editor network comprises a first editor for extracting semantic features of the sample image and a second editor for extracting semantic features of the pseudo-image into which the feature map is converted. The generating network is used for converting the feature map into a pseudo image, and the discriminating network is used for determining the discriminating probability of the input data. When determining the loss resistance, the first reconstruction loss and the second reconstruction loss may be weighted first, and the weighted first reconstruction loss and second reconstruction loss may be summed with the first discrimination loss and the second discrimination loss to obtain the loss resistance. Please refer to formula (5) for determining the countermeasures loss L _AE Is a mode of the above.

Wherein,characterizing a first discriminant loss, >Characterizing a second discriminant loss,>characterization of first reconstruction loss,>characterization of the second reconstruction loss, lambda ₁ Weighting coefficients, lambda, characterizing the first reconstruction loss ₂ The weighting coefficients characterizing the second reconstruction loss. />

In the embodiment of the application, the constraint and reconstruction constraint of the disk aiming at the semantic features are increased, the influence of noise contained in the sample image on the pseudo image is weakened, and the quality of generating the pseudo image and the robustness of a generating network are improved.

Fusion loss:

and carrying out weighted summation on the semantic features of the sample image and the semantic features of the multiple feature images to obtain reference features. And determining fusion loss of the image classification model based on the error between the comprehensive feature and the reference feature.

The server performs weighted summation on the semantic features of the sample image and the semantic features of each feature map by adopting the image classification model to obtain comprehensive features, and then performs weighted summation on the semantic features of the sample image and the semantic features of a plurality of feature maps to obtain reference features. And determining fusion loss of the image classification model based on the error between the comprehensive feature and the reference feature.

When the error between the comprehensive features and the reference features is large, the features, which are obtained by the image classification model and are related to the target object, of each semantic feature are large in and large in the weighted sum of the features, so that the image classification model does not learn an accurate feature fusion mode, and the image classification model can also learn a mode of accurately extracting the semantic features to a certain extent.

When the error between the comprehensive characteristics and the reference characteristics is smaller, the characteristics of the image classification model, which are obtained from the semantic characteristics and are related to the target object, are similar to the weighted sum of the semantic characteristics, and the image classification model can be used for accurately learning the characteristic fusion mode, and the mode that the image classification model can be used for accurately extracting the semantic characteristics related to the target object can be also described to a certain extent.

Classification loss:

based on the error between the obtained prediction classification and the classification label corresponding to the sample image, a classification loss of the image classification model is determined.

After obtaining the predictive classification of the sample image, the server may determine a classification loss of the image classification model based on an error between the obtained predictive classification and a classification label corresponding to the sample image, implementing a supervised training process.

When the error between the prediction classification and the classification label is large, the prediction classification of the image classification model aiming at the sample image does not accord with the actual classification of the sample image, so that the image classification model can not learn an accurate prediction mode, and can also explain that the image classification model cannot learn an accurate feature fusion mode to a certain extent.

When the error between the prediction classification and the classification label is smaller, the prediction classification of the image classification model aiming at the sample image accords with the actual classification of the sample image, so that the image classification model can learn an accurate prediction mode, and the image classification model can also learn an accurate feature fusion mode to a certain extent.

As an example, the server may take a weighted sum of the fight loss, the fusion loss, and the classification loss as the training loss for the image classification model. And when the obtained training loss does not meet the training target, adjusting the model parameters of the image classification model, and entering the next round of iterative training. And outputting the image classification model when the obtained training loss meets the training target, and obtaining the trained target image classification model. The training objective may be to target training loss convergence.

Please refer to formulas (6), (7), (8) and (9), which are one way to determine the training loss L of the image classification model.

L＝L _AE +λ ₃ L _CLS +λ ₄ L _FU (6)

L _CLS ＝||f(x)-y|| ² (9)

Wherein L is _AE Characterization of fight loss, L _FU Characterization of fusion loss, lambda ₃ L _CLS The classification loss is characterized. X is X ⁱ Characterizing the ith set of data set X, which may be represented as { X } ¹ ，X ² ，……，X ^V Then X ¹ Can characterize a sample image set, X ² To X ^V V-1 feature atlases may be characterized; y is Y ⁱ Characterization is based on X ⁱ The generated pseudo-image, the set of pseudo-images Y may be represented as { Y } ¹ ，Y ² ，……，Y ^V }；Z ⁱ Characterization of X ⁱ Semantic features of (i.e. Z) ⁱ ＝E _i (X ⁱ ) The semantic feature set may be expressed as { Z } ¹ ，Z ² ，……，Z ^V }. By means of self-adaptive weighted summation of semantic features, uniform comprehensive features, namely subspace Z, can be obtained, and each semantic feature Z is factored ⁱ Configuring a corresponding weight beta _i ，β _i Is learned by continuous training process of image classification model, beta _i Satisfy the following requirementsf (x) characterizes the predictive classification for the sample image and y characterizes the classification tag of the sample image. I.I _F The F-norm is characterized.

The server can also set a corresponding training target for the countermeasure loss, the fusion loss and the classification loss respectively, and when one loss in the countermeasure loss, the fusion loss and the classification loss does not meet the corresponding training target, the model parameters of the image classification model are adjusted, and the next round of iterative training is carried out. And outputting the image classification model when the obtained antagonism loss, fusion loss and classification loss meet the training targets to obtain a trained target image classification model.

The method for training the image classification model and the method for using the trained target image classification model are introduced by taking the sample image as an agricultural disease and pest image in the agricultural field, such as grape leaf disease and pest as an example, and the trained target image classification model can be used for developing view products of a corresponding desktop end or a mobile end and developing software for the grape leaf disease and pest.

Please refer to fig. 6A, which is a schematic diagram of a server and a client for training, testing and using the image classification model. The server side comprises a data acquisition module, is used for establishing a basic grape leaf data set, finishing preprocessing such as cleaning an original data set, and can also be used for updating or expanding the data set in a crowdsourcing mode so as to obtain a sample image set.

The data enhancement module is used for establishing a multi-view data enhancement network, the multi-view data enhancement network can obtain a plurality of characteristic image sets based on the sample image set, for example, the multi-view agricultural disease and pest data set is established on the basis of the basic grape leaf disease and pest image data set, missing view data can be effectively repaired, and expansion enhancement of the sample image set is completed.

The training and testing module is used for training the image classification model on the basis of the enhanced data set, namely on the basis of the sample image set and the plurality of feature image sets, and testing the effectiveness of the target image classification model obtained through training.

And the model distribution module is used for distributing the target image classification model to the corresponding client after the server obtains the effective target image classification model.

And the model updating module is used for collecting feedback data of the client in real time in an online learning mode and optimizing the target image classification model based on the collected feedback data.

The client comprises a model loading module which is used for receiving the target image classification model sent by the server and loading the target image classification model into the memory.

The model using module is used for calling the target image classification model to carry out image classification on the image data when the image data is input to the client in a photographing (mobile terminal) or picture importing (mobile terminal or webpage terminal) mode, and obtaining the type and severity of diseases and insect pests to which the image data belongs.

And the data feedback module is used for uploading feedback data to the server according to the actual use condition so as to help the server to perform model optimization on the target image classification model.

The model updating module is used for periodically keeping communication with the server, detecting whether the target image classification model is updated, and downloading the updated target image classification model under the condition that the network environment or storage conditions allow.

Please refer to fig. 6B, which is a schematic diagram of interaction between a server and a client for training and using the image classification model.

S601, a server collects a basic grape leaf data set and establishes a sample image set.

S602, the server side adopts various image processing strategies to respectively extract pixel-level features of the sample image set to obtain a plurality of feature image sets, namely, on the basis of the basic grape leaf disease and insect pest image data set, a multi-view agricultural disease and insect pest data set is established so as to achieve expansion and enhancement of the sample image set.

S603, the server generates a corresponding pseudo image set based on the feature image sets, and expansion and enhancement of the sample image set are completed. The server generates a pseudo image set based on semantic features corresponding to each sample image contained in the sample image set and each feature image set, so that the sample image set is expanded to be several times of the original sample image set.

S604, the server builds an image classification model to be trained, carries out multiple rounds of iterative training on the image classification model to be trained based on a sample image set, a plurality of feature image sets and a plurality of pseudo image sets, and outputs a trained target image classification model.

S605, the server sends the target image classification model to the client, and the client loads the received target image classification model into the memory.

S606, the client can call a target image classification model when the image to be classified is obtained, and the target classification of the image to be classified is determined.

Taking a sample image as an example, the sample image is taken as an RGB feature map and is marked as X ¹ ＝{x ₁ ⁽¹⁾ ，x ₂ ⁽¹⁾ ，……，x _n ⁽¹⁾ X, where x _i ⁽¹⁾ Characterizing an ith feature element in the RGB feature map; adopting texture feature processing strategy to extract pixel level features of the sample image to obtain a texture feature map, which is marked as X ^m ＝{x ₁ ^(m) ，x ₂ ^(m) ，……，x _n ^(m) X, where x _i ^(m) Characterizing an ith feature element in the texture feature map; adopting a local binary feature processing strategy to extract pixel-level features of the sample image to obtain a local binary feature map, which is marked as X ^V ＝{x ₁ ^(V) ，x ₂ ^(V) ，……，x _n ^(V) X, where x _i ^(V) The ith feature element in the local binary feature map is characterized, and V represents the total number of the feature map. And obtaining other characteristic images by adopting other image processing strategies, so as to obtain a plurality of characteristic images corresponding to the sample images.

And carrying out one round of iterative training on the image classification model based on the plurality of feature images.

Please refer to fig. 7, which is a schematic diagram of training an image classification model. The image classification model comprises an encoder network E for extracting semantic features; the encoder network comprises a plurality of encoders, each encoder consists of M full-connection layers and N full-connection layers, wherein the M full-connection layers in each encoder have different model parameters, and the N full-connection layers in each encoder have the same model parameters.

The image classification model further comprises a decoder network D for generating a pseudo-image; the decoder network is inversely related to the network structure of the encoder network.

The image classification model also includes an adaptive fusion network for feature fusion.

The image classification model also includes a classifier for determining a prediction classification.

The encoder network extracts the semantic features of each of the feature maps, wherein the semantic feature of the ith feature map is marked as Z ⁱ ＝{z ₁ ⁽ⁱ⁾ ，z ₂ ⁽ⁱ⁾ ，……，z _n ⁽ⁱ⁾ -wherein z _r ⁽ⁱ⁾ And (3) representing the r feature element in the semantic features of the ith feature map.

The self-adaptive fusion network performs feature fusion on the semantic features of each of the feature graphs to obtain a comprehensive feature Z, which can be obtained by carrying out weighted summation on feature elements at the corresponding positions of the semantic features of each feature graph, and the comprehensive feature Z is recorded as Z= { Z ₁ ，z ₂ ，……，z _n Please refer to equation (10).

Wherein beta is _i Semantic feature Z characterizing a corresponding ith feature map ⁱ Is a weight of (2).

The decoder network is based on the integrated features Z and the semantic features of the sample image, i.e. Z ¹ A pseudo image corresponding to each feature map can be generated, and the ith pseudo image is marked as Y ⁱ ＝{y ₁ ⁽ⁱ⁾ ，y ₂ ⁽ⁱ⁾ ，……，y _n ⁽ⁱ⁾ -wherein y _m ⁽ⁱ⁾ And characterizing the mth characteristic element in the ith pseudo image.

The classifier determines a predictive classification of the sample image based on the composite feature Z. Determining a contrast loss based on the error between each feature map and the corresponding pseudo image; determining fusion loss by error between the comprehensive features and the weighted sum of the semantic features of each feature map; and predicting an error between the classification and the classification label of the sample image, and determining the classification loss. Determining the training loss of the image classification model based on the weighted sum of the antagonism loss, the fusion loss and the classification loss, adjusting the model parameters of the image classification model when the training loss is not converged, entering the next training round, and outputting the trained target image classification model when the training loss is converged.

Based on various pixel-level characteristics of the image, namely various views, the deep learning network is utilized to effectively fuse and reclassify the various views, so that the bottleneck limit of the existing plant disease and insect pest image classification method is broken through, and the accuracy and reliability of the plant disease and insect pest image classification technology are improved. The data enhancement purpose is achieved through multiple views, a multi-view agricultural disease and pest data set can be built on the incomplete agricultural disease and pest data set for training of an image classification model, and the problem of data deletion is effectively solved.

Based on the same inventive concept, the embodiment of the application provides a device for training an image classification model, which can realize the functions corresponding to the method for training the image classification model. Referring to fig. 8, the apparatus includes an acquisition module 801 and a processing module 802, where:

acquisition module 801: the method comprises the steps of respectively extracting pixel-level features of a sample image set by adopting a plurality of image processing strategies to obtain a plurality of feature image sets, wherein each image processing strategy corresponds to one type of pixel-level features and corresponds to one feature image set, and each sample image corresponds to one feature image in each feature image set;

processing module 802: the method is used for carrying out multiple rounds of iterative training on the image classification model to be trained based on the sample image set and the obtained multiple feature atlas, and outputting a trained target image classification model, wherein each round of iteration comprises:

The processing module 802 is further configured to: respectively extracting semantic features of the selected sample image and the corresponding multiple feature images, and carrying out feature fusion on each obtained semantic feature to obtain comprehensive features of the sample image;

the processing module 802 is further configured to: determining a predictive classification of the sample image based on the composite features;

the processing module 802 is further configured to: based on the plurality of feature maps, the comprehensive features and the predictive classification, model parameters of the image classification model are adjusted.

In one possible embodiment, the processing module 802 is specifically configured to:

determining a contrast loss, a fusion loss, and a classification loss of the image classification model based on the plurality of feature maps, the composite features, and the predictive classification, wherein the contrast loss characterizes: extracting the accuracy of semantic features, and fusing loss characterization: accuracy of feature fusion, classification loss: characterizing and determining the accuracy of prediction classification;

model parameters of the image classification model are adjusted based on the antagonism loss, fusion loss, and classification loss.

for a plurality of feature maps, the following operations are respectively executed:

converting the feature map into a pseudo image based on the semantic features of the feature map and the semantic features of the sample image, wherein the pseudo image and the sample image represent the same kind of pixel-level features;

In one possible embodiment, the image classification model includes a discrimination network for predicting a discrimination probability that the input data is a sample image or a semantic feature of the sample image, the input data having an associated reference probability; the processing module 802 is specifically configured to:

respectively taking the sample image and the pseudo image as input data of a discrimination network, predicting corresponding first discrimination probability, and respectively taking semantic features of the sample image and semantic features of the pseudo image as input data of the discrimination network, predicting corresponding second discrimination probability;

determining discrimination loss of the image classification model based on the obtained first discrimination probabilities and the second discrimination probabilities and errors between the first discrimination probabilities and the corresponding reference probabilities;

carrying out weighted summation on the semantic features of the sample image and the semantic features of the multiple feature images to obtain reference features;

weighting and summing the antagonism loss, the fusion loss and the classification loss to obtain the training loss of the image classification model;

In one possible embodiment, each feature atlas is one of a gray feature atlas, a texture feature atlas, a local binary feature atlas, or a directional gradient histogram set.

Referring to fig. 9, the apparatus for training the image classification model may be run on a computer device 900, and a current version and a historical version of a data storage program and application software corresponding to the data storage program may be installed on the computer device 900, where the computer device 900 includes a processor 980 and a memory 920. In some embodiments, the computer device 900 may include a display unit 940, the display unit 940 including a display panel 941 for displaying an interface or the like for interactive operation by a user.

In one possible embodiment, the display panel 941 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD) or an Organic Light-Emitting Diode (OLED) or the like.

The processor 980 is configured to read the computer program and then perform a method defined by the computer program, for example, the processor 980 reads a data storage program or a file, etc., so that the data storage program is executed on the computer device 900 and a corresponding interface is displayed on the display unit 940. Processor 980 may include one or more general-purpose processors and may also include one or more DSPs (Digital Signal Processor, digital signal processors) for performing associated operations to implement the techniques provided by the embodiments of the present application.

Memory 920 generally includes memory and external storage, and memory may be Random Access Memory (RAM), read Only Memory (ROM), CACHE memory (CACHE), and the like. The external memory can be a hard disk, an optical disk, a USB disk, a floppy disk, a tape drive, etc. The memory 920 is used to store computer programs including application programs corresponding to respective clients, etc., and other data, which may include data generated after the operating system or application programs are executed, including system data (e.g., configuration parameters of the operating system) and user data. In the embodiment of the present application, the program instructions are stored in the memory 920, and the processor 980 executes the program instructions in the memory 920, implementing any of the methods discussed in the previous figures.

The above-described display unit 940 is used to receive input digital information, character information, or touch operation/noncontact gestures, and to generate signal inputs related to user settings and function controls of the computer device 900, and the like. Specifically, in the embodiment of the present application, the display unit 940 may include a display panel 941. The display panel 941, such as a touch screen, may collect touch operations thereon or thereabout by a user (e.g., operations of the user on the display panel 941 or on the display panel 941 using any suitable object or accessory such as a finger, a stylus, etc.), and drive the corresponding connection device according to a predetermined program.

In one possible embodiment, the display panel 941 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch azimuth of a player, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device and converts it into touch point coordinates, which are then sent to the processor 980, and can receive commands from the processor 980 and execute them.

The display panel 941 may be implemented by various types such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the display unit 940, in some embodiments, the computer device 900 may also include an input unit 930, and the input unit 930 may include an image input device 931 and other input devices 932, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, mouse, joystick, etc.

In addition to the above, computer device 900 may also include a power supply 990 for powering other modules, audio circuitry 960, near field communication module 970, and RF circuitry 910. The computer device 900 may also include one or more sensors 950, such as acceleration sensors, light sensors, pressure sensors, and the like. Audio circuitry 960 may include, among other things, a speaker 961 and a microphone 962, for example, where the computer device 900 may collect a user's voice via the microphone 962, perform a corresponding operation, etc.

The number of processors 980 may be one or more, and the processors 980 and memory 920 may be coupled or may be relatively independent.

As an example, processor 980 in fig. 9 may be used to implement the functionality of acquisition module 801 and processing module 802 as in fig. 8.

As an example, the processor 980 in fig. 9 may be used to implement the functions associated with the servers or terminal devices discussed above.

Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk or an optical disk, or the like, which can store program codes.

Alternatively, the above-described integrated units of the present invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in essence or in a part contributing to the prior art in the form of a software product, for example, by a computer program product stored in a storage medium, comprising several instructions for causing a computer device to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.

It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims

1. A method of training an image classification model, comprising:

2. The method of claim 1, wherein said adjusting model parameters of said image classification model based on said plurality of feature maps, said composite features, and said predictive classifications comprises:

3. The method of claim 2, wherein the determining the contrast loss of the image classification model based on the plurality of feature maps, the composite features, and the predictive classification comprises:

for the feature maps, the following operations are respectively executed:

converting the feature map into a pseudo image based on semantic features of the feature map and the sample image, wherein the pseudo image and the sample image characterize the same kind of pixel-level features;

4. A method according to claim 3, wherein said determining a contrast loss of the image classification model based on the sample image and the pseudo image comprises:

5. The method of claim 4, wherein the image classification model comprises a discrimination network for predicting a discrimination probability that input data is a sample image or a semantic feature of a sample image, the input data having an associated reference probability; then the determining a countermeasures loss of the image classification model based on the first reconstruction loss and the second reconstruction loss comprises:

6. The method of claim 2, wherein the determining a fusion loss of the image classification model based on the plurality of feature maps, the composite feature, and the predictive classification comprises:

7. The method of claim 2, wherein adjusting model parameters of the image classification model based on the fight loss, the fusion loss, and the classification loss comprises:

8. The method of any of claims 1-7, wherein each feature atlas is one of a gray feature atlas, a texture feature atlas, a local binary feature atlas, or a directional gradient histogram set.

9. An apparatus for training an image classification model, comprising:

10. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 8.

11. A computer device, comprising:

a memory for storing program instructions;

a processor for invoking program instructions stored in the memory and for performing the method according to any of claims 1-8 in accordance with the obtained program instructions.

12. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method of any one of claims 1 to 8.