CN114170425A - Model training method, image classification method, server and storage medium - Google Patents

Model training method, image classification method, server and storage medium

Info

Publication number
CN114170425A
Authority
CN
China
Prior art keywords
expert
expert models
model
models
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111297227.3A
Other languages
Chinese (zh)
Inventor
孙鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202111297227.3A priority Critical patent/CN114170425A/en
Publication of CN114170425A publication Critical patent/CN114170425A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application provide a model training method, an image classification method, a server, and a storage medium. In these embodiments, the plurality of expert models in the first model include at least two expert models with different network structures, so that different expert models can be used for knowledge learning, which helps identify samples of different difficulty when the number of samples is limited. The fusion model is trained on the image features that the converged expert models extract from the sample images, so it can dynamically learn how to make reasonable use of the knowledge each expert model has learned; expert models with different performance thus contribute differently to the final classification result, and efficient learning is possible even when the number of sample images is small.

Description

Model training method, image classification method, server and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a model training method, an image classification method, a server, and a storage medium.
Background
Advances in neural networks continue to drive progress in artificial intelligence, and the training of neural networks depends heavily on massive training data and pre-trained models. In some application scenarios, there is often a lack of sufficient training data, so that a sufficiently effective neural network model cannot be trained. Therefore, a solution is still needed.
Disclosure of Invention
Aspects of embodiments of the present application provide a model training method, an image classification method, a server, and a storage medium, so as to improve performance of a neural network model under a condition that training data is limited.
The embodiment of the application provides an image classification method, which comprises the following steps: inputting an image to be processed into a first model; the first model includes: a plurality of expert models and a fusion model connected with the plurality of expert models; the plurality of expert models comprises expert models of at least two different network structures; performing feature extraction on the image to be processed through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively; fusing the image features corresponding to the expert models according to the feature fusion parameters of the expert models through the fusion model; and performing classification operation processing on the fused features through the fusion model to obtain a first classification result.
The embodiment of the present application further provides a scene classification method, including: acquiring video data obtained by shooting a target place; inputting any frame image in the video data into a first model; the first model includes: a plurality of expert models and a fusion model connected with the plurality of expert models; the plurality of expert models comprises expert models of at least two different network structures; performing feature extraction on the image through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively; fusing the image features corresponding to the expert models according to the feature fusion parameters of the expert models through the fusion model; and carrying out classification operation processing on the features obtained by fusion through the fusion model to obtain the scene category corresponding to the target place.
The embodiment of the present application further provides a model training method, where the first model to be trained includes: a plurality of expert models, and a fusion model connected to the plurality of expert models; the plurality of expert models comprises expert models of at least two different network structures; the method comprises the following steps: acquiring image features obtained by performing feature extraction on the sample image after the plurality of expert models converge, wherein the sample image is annotated with a label truth value; fusing the image features corresponding to the expert models according to the feature fusion parameters of the expert models through the fusion model, and performing classification operation processing on the fused features to obtain a first classification result; and adjusting the feature fusion parameters of the expert models with the learning target of making the difference between the first classification result and the label truth value converge to a first specified range.
The embodiment of the present application further provides an image classification method, including: responding to a calling request of a client to a first interface, and acquiring an image to be processed contained in interface parameters; inputting the image to be processed into a first model; the first model includes: a plurality of expert models and a fusion model connected with the plurality of expert models; the plurality of expert models comprises expert models of at least two different network structures; performing feature extraction on the image to be processed through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively; performing fusion processing on the image features corresponding to the expert models through the fusion model according to the feature fusion parameters of the expert models, and performing classification operation processing on the fused features to obtain a first classification result; and returning the first classification result to the client.
An embodiment of the present application further provides a server, including: a memory, a processor, and a communication component; the memory to store one or more computer instructions; the processor is configured to execute one or more computer instructions for executing the steps in the method provided by the embodiment of the present application.
Embodiments of the present application further provide a computer-readable storage medium storing a computer program, where the computer program can implement the steps in the method provided in the embodiments of the present application when executed.
In the embodiments of the present application, the plurality of expert models in the first model include at least two expert models with different network structures, so that different expert models can be used for knowledge learning, which helps identify samples of different difficulty when the number of samples is limited. The fusion model is trained on the image features that the converged expert models extract from the sample images, so it can dynamically learn how to make reasonable use of the knowledge each expert model has learned; expert models with different performance thus contribute differently to the final classification result, which greatly improves the overall performance of the first model. At the same time, this training approach reduces the dependence on the number of sample images, and efficient learning remains possible even with a small number of sample images, yielding better model performance.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart diagram illustrating a model training method according to an exemplary embodiment of the present application;
FIG. 2 is a schematic flow chart diagram of a model training method according to another exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a network structure of a hybrid expert model according to an exemplary embodiment of the present application;
FIG. 4 is a flowchart illustrating an image classification method according to an exemplary embodiment of the present application;
FIG. 5 is a flowchart illustrating an image classification method according to another exemplary embodiment of the present application;
Fig. 6 is a schematic structural diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Advances in neural networks continue to drive progress in artificial intelligence, and the training of neural networks depends heavily on massive training data and pre-trained models. In some application scenarios, there is often a lack of sufficient training data, so that a sufficiently effective neural network model cannot be trained.
In some model training methods, a multi-expert hybrid model may be employed for knowledge learning. When classification prediction is performed, one approach averages the prediction results output by the plurality of experts as the final prediction result. Another approach adopts a routing structure to decide which expert models' outputs the final prediction result should be based on. However, these methods do not make full use of the characteristics of the multi-expert hybrid model and do not consider the performance advantages of the different expert models, so the accuracy of the output image classification result is limited. As a result, the performance of the trained neural network model cannot be improved in the absence of sufficient training data.
In view of the above technical problem, some exemplary embodiments of the present application provide a solution, and the following portions will be described in detail with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a model training method according to an exemplary embodiment of the present application. As shown in fig. 1, the method includes:
Step 101, obtaining image features obtained by performing feature extraction on a sample image after a plurality of expert models in a first model converge, wherein the sample image is annotated with a label truth value; the plurality of expert models includes expert models of at least two different network structures.
Step 102, fusing the image features corresponding to the expert models according to the feature fusion parameters of the expert models through the fusion model in the first model, and performing classification operation processing on the fused features to obtain a first classification result.
Step 103, adjusting the feature fusion parameters of the expert models, with the learning objective of making the difference between the first classification result and the label truth value converge to a first specified range.
In the present embodiment, the first model for image classification may be a hybrid expert model, which may be implemented based on a Mixture-of-Experts (MoE) architecture. A Mixture-of-Experts model is a type of neural network whose training differs from that of a general neural network: in addition to the plurality of expert models, it contains a gating module that controls which expert is selected for use. In this embodiment, the gating module is implemented as a fusion model connected to the multiple expert models. A hybrid expert model integrates a plurality of expert models for use in a single task.
In this embodiment, the plurality of expert models includes at least two expert models having different structures; that is, the network structures of the plurality of expert models cover at least two different network structures. Of course, the network structures of the expert models may all differ from one another, and this embodiment is not limited. In the image classification scenario, the plurality of expert models may include at least two structurally different convolutional neural network models, or different expert models may be implemented based on structurally different convolutional neural network models. The multi-layer convolutions in a convolutional neural network model form the feature extractor of the expert model, so that the expert model can extract features from the input training data. Of course, in different embodiments, other neural network models may also be used as the feature extractor, and the convolutional neural network model is not limiting.
The number of the expert models in the first model may be 2, 3 or N, and this embodiment is not limited. Wherein N is a positive integer greater than 2.
In some alternative embodiments, the network structure of the expert model may include, but is not limited to, network structures such as ResNet, EfficientNet, RegNet, ResNeSt, SKNet, ECA-ResNet, NFNet, EfficientNetV2, ViT, T2T-ViT, DeiT, Swin, and VOLO. In practice, at least two network structures may be selected from the network structures described above as the network structures of the plurality of expert models, or at least two network structures may be selected from other network structures besides those described above (for example, AlexNet and VGGNet), which is not limited in this embodiment.
In this embodiment, the plurality of expert models in the first model refer to expert models that have already been trained to convergence. Under the supervision of supervision signals, the plurality of expert models learn the ability to extract image features from the sample images and obtain their respective model parameters. The training process of the expert models is described in the following embodiments and is not repeated here. After the plurality of expert models converge, a Fusion model (Fusion Module) can be trained based on the outputs of the plurality of expert models, so that the fusion model dynamically learns the knowledge learned by the plurality of expert models and learns how to make full use of the respective performance advantages of the plurality of experts.
When the fusion model is trained, the image features that the converged expert models extract from the sample image are first obtained. The sample image is annotated with a label truth value (Ground Truth); that is, the sample image is class-labeled so that it carries a class label, and supervised training of the fusion model can therefore be performed based on this class label.
After the image features extracted by the expert models are input into the fusion model, the fusion model can fuse the image features corresponding to the expert models according to the feature fusion parameters of the expert models and perform classification operation processing to obtain classification results.
The feature fusion parameters of the expert models are located in the fusion model and can be regarded as part of the model parameters of the fusion model. After the classification result output by the fusion model is obtained, the model parameters of the fusion model can be adjusted under the supervision of the label truth value annotated on the sample image. That is, the feature fusion parameters of the expert models are adjusted with the learning objective of making the difference between the classification result output by the fusion model and the label truth value converge to a first specified range.
Steps 101 to 103 may be performed iteratively in a loop; when the difference between the classification result output by the fusion model and the label truth value converges to the first specified range, the iteration stops and the trained result model is output.
In this embodiment, the plurality of expert models in the first model include at least two expert models with different network structures, so that different expert models can be used for knowledge learning, which helps identify samples of different difficulty when the number of samples is limited. The fusion model is trained on the image features that the converged expert models extract from the sample images, so it can dynamically learn how to make reasonable use of the knowledge each expert model has learned; expert models with different performance thus contribute differently to the final classification result, which greatly improves the overall performance of the first model. At the same time, this training approach reduces the dependence on the number of sample images, and efficient learning remains possible even with a small number of sample images, yielding better model performance.
It should be noted that, in the above and following embodiments of the present application, before the training of the first model is performed based on the sample image, a data enhancement operation may be further performed on the sample image. Wherein the data enhancement operation may include: single sample based data enhancement operations and/or multiple sample based data enhancement operations.
Optionally, the single-sample-based data enhancement operations may include, but are not limited to, at least one of: Random Flip, Random Crop, Color Jitter, Random Erasing, and NAS-search-based combined enhancement strategies (such as AutoAugment and RandAugment). Optionally, the multi-sample-based data enhancement may include, but is not limited to, at least one of the MixUp algorithm and the CutMix algorithm.
Based on these data enhancement operations, a small number of sample images can be transformed: on the one hand, the number of sample images is enriched, providing more data support for model training; on the other hand, the complexity of the sample images is increased, which improves the robustness of the trained model.
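As a purely illustrative sketch (not part of the claimed method), the enhancement operations listed above could be assembled as follows, assuming a PyTorch/torchvision environment; the concrete parameter values are arbitrary examples.

```python
# Illustrative sketch only; assumes torch and torchvision are installed.
import torch
from torchvision import transforms

# Single-sample data enhancement: random crop, random flip, color jitter,
# and random erasing (applied after conversion to a tensor).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),
])

def mixup(x, y, alpha=0.2):
    """Multi-sample enhancement (MixUp): convex combination of two samples.

    x: batch of images (B, C, H, W); y: batch of labels. Returns the mixed
    batch together with both label sets and the mixing coefficient.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    index = torch.randperm(x.size(0))
    mixed_x = lam * x + (1.0 - lam) * x[index]
    return mixed_x, y, y[index], lam
```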
In some alternative embodiments, the training process for the first model may be an "end-to-end" training process. That is, after the training data is input at the input end of the first model (i.e., the inputs of the plurality of expert models), the final prediction result is output at the output end of the first model (i.e., the fusion model). The final prediction result is compared with the annotated label to obtain the prediction error of the model, back propagation is performed based on the prediction error, and the parameters of each model are adjusted until the prediction error converges or the expected effect is achieved, after which the trained result model is output. In this manner, the multiple expert models and the fusion model are trained simultaneously.
In other alternative embodiments, the training process for the first model may be a "two-stage" training process: the first stage trains the plurality of expert models, and the second stage trains the fusion model. In this manner, the first stage of training is run until the prediction errors of the expert models converge or the expected effect is achieved. After that, the second stage trains the fusion model based on the image features output by the trained expert models, and the trained result model is output once the prediction error of the fusion model converges or the expected effect is achieved. Because the fusion model is trained on top of the already-trained expert models, the number of iterations required for training can be effectively reduced and the model training efficiency improved.
The above "two-stage" training process will be described in detail with reference to the accompanying drawings.
Fig. 2 is a schematic flowchart of a model training method according to another exemplary embodiment of the present application.
As shown in fig. 2, the method includes:
Step 201, obtaining a sample image, wherein the sample image is annotated with a label truth value.
Step 202, inputting the sample image into a first model; the first model includes: a plurality of expert models, and a fusion model connected to the plurality of expert models; the plurality of expert models includes expert models of at least two different network configurations.
Step 203, performing feature extraction on the sample image through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively.
Step 204, performing classification operation processing on the image features corresponding to the expert models through the classifiers corresponding to the expert models to obtain classification results of the expert models.
Step 205, adjusting parameters of the expert models with a learning objective of converging the difference between the classification result and the label truth value of each expert model to a second specified range.
Step 206, obtaining image features obtained by performing feature extraction on the sample image after the plurality of expert models have converged.
Step 207, fusing the image features corresponding to the expert models according to the feature fusion parameters of the expert models through the fusion model, and performing classification operation processing on the fused features to obtain a first classification result.
Step 208, adjusting feature fusion parameters of the expert models respectively with a learning objective of converging the difference between the first classification result and the label truth value to a first designated range.
In the present embodiment, steps 202 to 205 constitute the first stage, in which the plurality of expert models are trained. Steps 202 to 205 may be iteratively executed in a loop until the learning objectives of the plurality of expert models converge to the specified range. Steps 206 to 208 constitute the second stage, in which the fusion model is trained; the second stage begins after the learning objectives of the plurality of expert models have converged. The training process of the second stage can refer to the description of the foregoing embodiments, and the following section describes the training process of the first stage by way of example.
In the first stage, after the sample image is input into the first model, the sample image can be subjected to feature extraction through feature extractors in a plurality of expert models to obtain image features corresponding to the expert models.
In the training process of a neural network model, the sample images required for training can be divided into three types according to recognition difficulty: "simple", "medium difficulty", and "difficult". In simple samples, the subject of the target category is shown clearly, the detail features are obvious, and there is no occlusion, noise, or blur, so a simple network structure can already achieve a good recognition effect. In medium-difficulty samples, the subject occupies the foreground of the picture, but it may be far away and its texture details may not be obvious, so an ideal recognition effect requires appropriate data enhancement and regularization. In difficult samples, the subject of the target category is often mixed into the background and occupies only a small portion of the image, making recognition difficult and requiring a model with very strong feature expression and regularization capability.
Neural networks of different sizes and structures differ in their ability to express and recognize features. Therefore, to meet the training requirements of sample images with different difficulties, expert models with different network structures can be arranged in the first model, and the expert models with different structures are respectively used for recognizing samples of different difficulty, thereby adapting to training scenarios in which sample image quality is inconsistent.
When the network structures of the expert models are different, the implementation forms of the feature extractors in the expert models also differ. For example, the first model includes expert model 1, expert model 2, and expert model 3, where expert model 1 is a ResNeSt model, expert model 2 is a ResNet model, and expert model 3 is an EfficientNet model. Then the feature extractor in expert model 1 is the feature extraction network in ResNeSt, the feature extractor in expert model 2 is the feature extraction network in ResNet, and the feature extractor in expert model 3 is the feature extraction network in EfficientNet.
Fig. 3 illustrates the structure of the first model. As shown in fig. 3, the plurality of expert models may share the shallow convolution and may each perform further feature extraction based on the feature maps output by the shallow convolution.
In the present embodiment, each expert model may further include a classifier, as shown in fig. 3: classifier 1 connected to expert model 1, classifier 2 connected to expert model 2, and classifier 3 connected to expert model 3. The classifier performs classification operation processing according to the image features, so as to output the classification result of each expert model on the sample image.
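As a purely illustrative sketch of the expert/classifier layout in fig. 3 (not the patented implementation), the experts below are assembled from two structurally different torchvision backbones; the backbone choices, feature dimensions, and the omission of the shared shallow convolution are all simplifying assumptions.

```python
# Illustrative sketch: each expert = a backbone feature extractor plus its
# own classifier, as in fig. 3. Backbone choices are placeholders.
import torch
import torch.nn as nn
from torchvision import models

class Expert(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                    # feature extractor
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        feat = self.backbone(x)                     # image features E_j
        logits = self.classifier(feat)              # per-expert classification result
        return feat, logits

def build_experts(num_classes: int) -> nn.ModuleList:
    # Two structurally different CNN backbones, as the first model requires.
    r18 = models.resnet18(weights=None)
    r18.fc = nn.Identity()                          # keep 512-d features, drop head
    eff = models.efficientnet_b0(weights=None)
    eff.classifier = nn.Identity()                  # keep 1280-d features, drop head
    return nn.ModuleList([
        Expert(r18, 512, num_classes),
        Expert(eff, 1280, num_classes),
    ])
```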
After the classification results of the expert models are obtained, the expert models can be trained according to the difference between the classification results of the expert models and the label truth value marked on the sample image. In the training process, the learning objectives of the expert models can be determined, and the model parameters of the expert models are optimized according to the learning objectives. Wherein the learning objectives of the plurality of expert models may be: the difference between the classification result of each of the expert models and the label truth value on the sample image is converged to a second specified range.
Wherein, the optimized model parameters mainly comprise: parameters of feature extractors in a plurality of expert models. When the difference between the classification result of each expert model and the label truth value on the sample image is converged to a second specified range, the loop iteration process of the first stage can be stopped, the trained expert models are output, and the output model parameters of the expert models can enable the expert models to extract the image features capable of accurately classifying the image from the input image.
Any one of the plurality of expert models includes a plurality of feature computation modules (schematically illustrated as "modules" in fig. 3). The plurality of feature calculation modules constitute a feature extractor of the expert model. Taking the expert model implemented as a convolutional neural network model as an example, each feature calculation module in the expert model may include: a convolutional layer, a pooling layer, and a fully-connected layer. The number of feature computation modules that an expert model contains may be referred to as the depth of the expert model.
In some exemplary embodiments, to improve the performance of the expert models, the depth of each expert model may be dynamically set during the first training stage, resulting in dynamic neural networks. In the training and inference stages, a dynamic neural network can dynamically adjust its structure or parameters according to different input data, giving it better performance than a static model in terms of computation cost and recognition effect. In this embodiment, when training based on the dynamic model, certain structures in the network (e.g., channels, layers, sub-networks) may be dynamically selected for activation.
In this embodiment, for any expert model, when feature extraction is performed on a sample image, a subset of the feature calculation modules may be randomly selected from the expert model, thereby dynamically setting the depth of the expert model. The feature extraction operation is then carried out on the input sample image through the randomly selected feature calculation modules to obtain the image features output by the expert model.
As shown in fig. 3, expert models 1, 2, ..., N each include a plurality of feature calculation modules. For each expert model, certain feature calculation modules may be randomly skipped during the feature extraction operation based on a Stochastic Depth algorithm. As shown in fig. 3, a "switch" may be provided for each feature calculation module; during training, the switches corresponding to different feature calculation modules are randomly turned on or off, thereby dynamically changing the depth of the expert model. This greatly strengthens the regularization effect of the model and effectively prevents the expert models from overfitting.
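The following sketch shows one common formulation of the "switch" behaviour described above (stochastic depth applied to a residual-style feature calculation module); it is illustrative only and does not claim to reproduce the exact scheme of the application.

```python
# Illustrative stochastic-depth wrapper: during training the wrapped module
# is randomly skipped; at inference its output is scaled by the survival
# probability.
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    def __init__(self, block: nn.Module, survival_prob: float = 0.8):
        super().__init__()
        self.block = block                   # one feature calculation module
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() > self.survival_prob:
                return x                     # module skipped for this forward pass
            return x + self.block(x)         # module executed
        return x + self.survival_prob * self.block(x)
```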
It should be noted that, in the present embodiment, for each feature calculation module in the expert model, the pooling layer in the feature calculation module may be a Blur Pooling layer (BlurPool), so that the translation invariance of the expert model can be improved.
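For illustration only, a blur pooling layer can be sketched as a fixed low-pass (binomial) filter applied depthwise before striding; the 3x3 kernel and reflect padding below are assumptions, not the specific layer used in the application.

```python
# Illustrative blur pooling: blur with a fixed binomial filter, then
# downsample, which improves shift/translation robustness.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    def __init__(self, channels: int, stride: int = 2):
        super().__init__()
        self.stride = stride
        self.channels = channels
        k = torch.tensor([1.0, 2.0, 1.0])
        kernel = (k[:, None] * k[None, :]) / 16.0          # 3x3 binomial filter
        # One copy of the fixed filter per channel (depthwise convolution).
        self.register_buffer("kernel", kernel[None, None].repeat(channels, 1, 1, 1))

    def forward(self, x):
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")
        return F.conv2d(x, self.kernel, stride=self.stride, groups=self.channels)
```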
In this embodiment, dynamically changing the depth of the expert model saves computation during training, making training more efficient. A larger neural network can rely on more parameters to obtain stronger expressive power; if a dynamic structure is added to such a network, comparable expressive power can be obtained at the cost of only a small increase in computation, improving the performance of the expert model. In addition, compared with a static model with a fixed parameter quantity, dynamic computation gives the dynamic network stronger adaptability when deployed on different hardware platforms and in different computing environments.
In some exemplary embodiments, when training a plurality of expert models according to learning objectives of the plurality of expert models, the learning objectives corresponding to the plurality of expert models may be constructed using the following alternative embodiments:
alternatively, the classification loss of each of the plurality of expert models may be calculated based on the classification result of each of the plurality of expert models and the label truth value on the sample image. The classification loss may be calculated based on at least one of a cross entropy loss function, an absolute value loss function, a log logarithmic loss function, a square loss function, an exponential loss function, a Hinge loss function, and a perceptual loss function, which is not limited in this embodiment. Of course, besides the above listed loss functions, the classification loss may also be calculated based on other optional loss functions, which is not limited in this embodiment.
In some optional embodiments, when constructing the classification loss, to further optimize the performance of the expert model, the label truth values on the sample image may be smoothed.
Optionally, for the i-th category, the operation of smoothing the label l using label smoothing can be represented by the following formula:
$l_{ls}^{(i)} = (1 - \varepsilon) \cdot l^{(i)} + \varepsilon / C_1$ (Equation 1)
where ε represents a very small value and C1 represents a constant.
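Read purely as an illustration of Equation 1, and assuming that C1 denotes the number of categories, label smoothing can be sketched as follows:

```python
# Illustrative label smoothing (Equation 1), assuming C1 is the number of
# categories; eps corresponds to the small value ε in the text.
import torch

def smooth_labels(one_hot: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """one_hot: (batch, C1) one-hot label truth values."""
    num_classes = one_hot.size(-1)
    return (1.0 - eps) * one_hot + eps / num_classes
```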
In addition to the smoothing approach described above, in some exemplary embodiments, soft labels (or distillation labels) may be generated based on a method of knowledge distillation to guide the training process. Optionally, a Teacher Model (Teacher Model) may be provided in the first Model during the training phase, the Teacher Model being configured to output a classification result according to the sample image, and use the classification result as a soft label to guide the training process of the multi-expert Model.
In this embodiment, optionally, a preset teacher model $M_t$ may be used to classify the sample image I to obtain a second classification result; the preset teacher model may be a pre-trained expert model or another optional classification model, and this embodiment is not limited. According to the second classification result, the label truth value annotated on the sample image is smoothed to obtain a smoothed label $l_{train}$. This process can be described using the following formula:
$l_{train} = \lambda \cdot l_{ls} + (1 - \lambda) \cdot M_t(I)$ (Equation 2)
Where λ represents a weighting coefficient whose value may be set according to actual requirements, for example 0.5 or 0.6; the present embodiment is not limited.
The classification loss of each expert model is then calculated from the classification result of each expert model and the smoothed label.
Alternatively, the loss of each expert model may include a soft label-based cross-entropy loss, which may be calculated as shown in the following equation:
$l_{soft\text{-}CE} = - l_{train} \cdot \log(M_t(I))$ (Equation 3)
Optionally, the classification loss of each expert model may include a cosine loss in addition to the soft-label-based cross-entropy loss. After the cross-entropy loss and the cosine loss are superposed, the classification loss of each expert model is obtained. The classification loss of the j-th expert model can be expressed by the following formula:
$L_j = l_{soft\text{-}CE} + \big(1 - \mathrm{cosine}(l_{train}, M_t(I))\big)$ (Equation 4)
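The sketch below strings Equations 2-4 together in code form; it is illustrative only and assumes that the cross-entropy and cosine terms are evaluated against each expert's own predicted distribution, with PyTorch used purely for demonstration.

```python
# Illustrative per-expert loss combining a soft-label cross entropy and a
# cosine term (Equations 2-4). The reduction over the batch is an assumption.
import torch
import torch.nn.functional as F

def expert_classification_loss(expert_logits, l_ls, teacher_probs, lam=0.5):
    # Equation 2: mix the smoothed label with the teacher model's output.
    l_train = lam * l_ls + (1.0 - lam) * teacher_probs
    p = F.softmax(expert_logits, dim=-1)             # expert's predicted distribution
    # Equation 3: cross entropy against the soft label.
    soft_ce = -(l_train * torch.log(p + 1e-8)).sum(dim=-1).mean()
    # Equation 4: add (1 - cosine similarity) between soft label and prediction.
    cos = F.cosine_similarity(l_train, p, dim=-1).mean()
    return soft_ce + (1.0 - cos)
```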
When the first model contains N expert models, the total classification loss of the N expert models can be described using the following formula:
$L_{MoE} = \sum_{j=1}^{N} L_j$ (Equation 5)
Optionally, in addition to the classification loss, a difference metric loss corresponding to the plurality of expert models may be calculated according to the image features corresponding to the plurality of expert models. The difference metric loss measures the difference between the expert models, so that the expert models are supervised to learn different knowledge and the performance advantages of each expert are fully utilized.
In some optional embodiments, when the difference metric loss corresponding to the plurality of expert models is calculated, average features extracted by the plurality of expert models may be calculated according to image features output by the plurality of expert models respectively; and then, respectively calculating the similarity between the features output by the expert models and the average features to obtain the difference measurement loss corresponding to the expert models. Alternatively, the similarity may be calculated using KL (Kullback-Leibler) divergence (or relative entropy loss).
The difference metric loss $L_d$ described above can be expressed by the following formulas:

$\bar{p} = \frac{1}{N} \sum_{j=1}^{N} p_j$ (Equation 6)

$L_d = \frac{1}{N} \sum_{j=1}^{N} \mathrm{KL}(p_j \,\|\, \bar{p})$ (Equation 7)
where $p_j$ represents the image features output by the j-th expert model, N represents the number of expert models included in the first model, and $\bar{p}$ represents the average features of the plurality of expert models.
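As an illustration of Equations 6 and 7, the difference metric loss can be sketched as follows (PyTorch, with the probabilities of each expert passed in as a list; the exact reduction is an assumption):

```python
# Illustrative difference metric loss: KL divergence between each expert's
# output distribution p_j and the average distribution over all experts.
from typing import List
import torch
import torch.nn.functional as F

def difference_metric_loss(expert_probs: List[torch.Tensor]) -> torch.Tensor:
    """expert_probs: N tensors of shape (batch, C) holding probabilities p_j."""
    p_bar = torch.stack(expert_probs, dim=0).mean(dim=0)      # Equation 6
    # Equation 7: average KL(p_j || p_bar) over the N experts.
    losses = [F.kl_div(p_bar.log(), p_j, reduction="batchmean")
              for p_j in expert_probs]
    return sum(losses) / len(expert_probs)
```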
Based on the classification loss and the difference metric loss of the plurality of expert models, the learning objectives of the plurality of expert models can be determined. The learning objective can be described by the following formula:

$L = L_{MoE} + L_d$ (Equation 8)
According to the learning targets of the plurality of expert models, the parameters of the plurality of expert models can be adjusted until the learning targets of the plurality of expert models converge to a second specified range.
After the plurality of expert models converge based on the model training operations of the above embodiments, steps 206 to 208 may be performed to train the fusion model. The feature fusion parameters of the expert models to be learned in the fusion model can be realized as feature weighting coefficients of the expert models, which are used to assign the degree to which the image features extracted by each expert model contribute to the final classification result. The operation in which the fusion model fuses the image features corresponding to the expert models according to the feature fusion parameters of the expert models and performs classification operation processing on the fused features can be described by the following formula:
$M_{fusion} = \mathrm{Linear}(\mathrm{concat}(E_1, E_2, \ldots, E_N), C_2)$ (Equation 9)
In the above formula, $M_{fusion}$ represents the classification result output by the fusion model, $E_1, E_2, \ldots, E_N$ represent the image features extracted by the N expert models, concat() represents the feature fusion (concatenation) computation, Linear() represents the linear operation used for classification, and $C_2$ is a constant.
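One way to read Equation 9 in code is the sketch below: the fusion model concatenates the expert features and applies a single linear layer, whose weights play the role of the feature fusion (weighting) parameters; the feature dimensions in the usage lines are arbitrary assumptions.

```python
# Illustrative fusion model (Equation 9): concatenate E_1..E_N and apply a
# linear classification layer over the concatenated features.
import torch
import torch.nn as nn

class FusionModel(nn.Module):
    def __init__(self, feat_dims, num_classes):
        super().__init__()
        # The weights of this linear layer act as the feature fusion parameters.
        self.fc = nn.Linear(sum(feat_dims), num_classes)

    def forward(self, expert_feats):
        fused = torch.cat(expert_feats, dim=-1)      # concat(E_1, ..., E_N)
        return self.fc(fused)                        # Linear(...) -> classification result

# Stage-two training would freeze the experts and optimize only these
# fusion parameters, e.g.:
fusion = FusionModel(feat_dims=[512, 1280], num_classes=10)
optimizer = torch.optim.SGD(fusion.parameters(), lr=0.01)
```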
In addition to the model training method described in the foregoing embodiments, the embodiments of the present application provide an image classification method, which will be exemplarily described below.
Fig. 4 is a schematic flowchart of an image classification method according to an exemplary embodiment of the present application, and as shown in fig. 4, the method includes:
Step 401, inputting an image to be processed into a first model; the first model includes: a plurality of expert models and a fusion model connected with the plurality of expert models; the plurality of expert models includes expert models of at least two different network structures.
Step 402, performing feature extraction on the image to be processed through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively.
Step 403, fusing the image features corresponding to the expert models according to the feature fusion parameters of the expert models through the fusion model.
Step 404, performing classification operation processing on the fused features through the fusion model to obtain a first classification result.
The training process of the first model may refer to the descriptions of the foregoing embodiments and is not repeated here. In different application scenarios, the form of the image to be processed may differ. For example, in some embodiments, the image to be processed may include an image obtained by shooting a target place. The target place may be a place to be subjected to scene classification, a place to be subjected to potential risk prediction, and the like, and may include, but is not limited to, public places such as stations, vehicles, and venues. In other embodiments, the image to be processed may include an image to be subjected to risk content analysis; such images may include, but are not limited to, pornography-related images, violence- or terror-related images, and the like.
The feature fusion parameter of each expert model may be a feature weighting coefficient of that expert model, and these feature weighting coefficients may be dynamically learned during model training. In some exemplary embodiments, before the fusion model fuses the image features corresponding to the plurality of expert models according to their feature fusion parameters, the feature fusion parameters are dynamically learned by the fusion model from the image features extracted from the sample images by the converged expert models and from the label truth values annotated on the sample images.
The operation in which the fusion model fuses the image features corresponding to the expert models and performs classification operation processing on the fused features to obtain the first classification result is shown in Equation 9 above and is not repeated here.
In this embodiment, the plurality of expert models in the first model includes at least two expert models having different network structures, so that different classification performances of different expert models can be fully utilized, and the dependence on the number of sample images is reduced. Meanwhile, the feature fusion parameters of the expert models in the fusion model are obtained through dynamic learning, so that the expert models with different performances can make different contributions to the finally output classification result according to the actual training results of different expert models, and the accuracy of the image classification result is greatly improved.
Fig. 5 is a schematic flowchart of an image classification method according to another exemplary embodiment of the present application, and as shown in fig. 5, the method includes:
Step 501, responding to a calling request of a client to a first interface, and acquiring an image to be processed contained in interface parameters.
Step 502, inputting the image to be processed into a first model; the first model includes: a plurality of expert models and a fusion model connected with the plurality of expert models; the plurality of expert models includes expert models of at least two different network configurations.
Step 503, performing feature extraction on the image to be processed through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively.
Step 504, performing fusion processing on the image features corresponding to the plurality of expert models through the fusion model according to the dynamically learned feature fusion parameters of the plurality of expert models, and performing classification operation processing on the fused features to obtain a first classification result.
Step 505, returning the first classification result to the client.
The execution subject of the embodiment may be a server device, such as a conventional server or a cloud server. The client can be realized as a mobile phone, a computer, a tablet computer and other equipment on the user side.
In this embodiment, the image classification method provided in each of the foregoing embodiments may be packaged as a software tool, such as a SaaS (Software-as-a-Service) tool, that can be used by a third party. The SaaS tool may be implemented as a plug-in or an application, which may be deployed on a server-side device and may expose a specified interface to a third-party user such as a client. For convenience of description, the specified interface is referred to in this embodiment as the first interface. A third-party user such as a client can then conveniently access and use the image classification service provided by the server device by calling the first interface.
For example, in some scenarios, the SaaS tool may be deployed on a cloud server, and a third-party user may invoke a first interface provided by the cloud server to use the SaaS tool online. When the third-party user calls the first interface, the image to be classified can be provided for the SaaS tool by configuring the interface parameters of the first interface. Optionally, the image to be classified may include any frame image in video data obtained by continuous shooting, or any frame image in an image sequence obtained by discrete shooting, which is not limited in this embodiment.
After receiving the call request for the first interface, the SaaS tool can obtain the image to be classified provided by the client by analyzing the interface parameters of the first interface. After the class to which the image belongs is identified by the SaaS tool based on the first model deployed on the server, the class to which the image belongs can be returned to the client through the first interface or other communication modes. The training process and the inference process of the first model can refer to the descriptions of the foregoing embodiments, and are not described herein again.
In this embodiment, the server device may provide the image classification service to the client based on the SaaS tool running thereon, and the client user may use the image classification service provided by the server device by calling an interface provided by the SaaS tool. Based on the interaction between the client and the server equipment, the client can completely deliver the image classification operation to the server equipment for execution, and further, the image classification operation with low cost and high accuracy can be realized by means of strong computing capability and reliable image classification algorithm of the server equipment.
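Purely as a hypothetical sketch of such a first interface (the route name, parameter names, and the choice of Flask are assumptions, not part of the application):

```python
# Hypothetical SaaS-style first interface: the client sends the image to be
# processed as an interface parameter and receives the classification result.
import io
from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)

def run_first_model(image):
    # Placeholder: here the expert models would extract features and the
    # fusion model would produce the first classification result.
    return "unknown"

@app.route("/v1/classify", methods=["POST"])      # assumed route for the first interface
def classify():
    file = request.files["image"]                 # image to be processed
    image = Image.open(io.BytesIO(file.read())).convert("RGB")
    category = run_first_model(image)
    return jsonify({"category": category})        # return the first classification result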
It should be noted that in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 401, 402, etc., are merely used for distinguishing different operations, and the sequence numbers do not represent any execution order per se. The execution subjects of the steps of the method provided by the above embodiments may be the same device, or different devices may be used as the execution subjects of the method. For example, the execution subjects of step 201 to step 204 may be device a; for another example, the execution subject of steps 201 and 202 may be device a, and the execution subject of steps 203 and 204 may be device B; and so on. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
The image classification method provided by the above examples of the present application can be applied to various risk-content and risk-location identification scenarios based on image classification, such as violence/terrorism identification and pornographic-content identification for images. In such risk identification scenarios, it is often difficult to collect enough training samples in a short time to support model training. Under the condition of insufficient training data, a model meeting the requirements of risk content identification can be efficiently trained based on the model training method provided by the embodiments of the present application. The following description is given with reference to specific application scenarios.
In some application scenarios, violence/terrorism at a target location (e.g., a public location) needs to be identified and pre-warned. The public place may include: inside of a vehicle, a mall, a station waiting room, etc. In many of the above-described locations, a risk scenario identification system may be deployed based on the image capture device, the server, and the alerting device.
The image capturing Device may be implemented as various electronic devices capable of achieving high-definition shooting, including but not limited to electronic devices that perform imaging based on a CCD (Charge-coupled Device) image sensor or a CMOS (Complementary Metal Oxide Semiconductor) image sensor, such as a high-speed camera, a camcorder, a rotary camera, an infrared night vision camera, and the like, and will not be described in detail. The number of the image capturing devices may be one or more, and may be selected according to actual requirements, which is not limited in this embodiment. For example, in a waiting room of a train station, a plurality of image capturing devices may be provided so that the shooting range of the image capturing devices can cover the waiting room of the train station.
The server may be implemented as a conventional server, a cloud host, a virtual center, or other devices, which is not limited in this embodiment. The server device mainly includes a processor, a hard disk, a memory, a system bus, and the like, and is similar to a general computer architecture, and is not described in detail.
The image capturing device may send the video stream obtained by continuous shooting to the server for processing, or may send the image sequence obtained by interval shooting to the server for processing, which is not limited in this embodiment.
The communication between the image acquisition equipment and the server can be realized in a wired or wireless manner. The wireless communication modes include short-distance modes such as Bluetooth, ZigBee, infrared, and WiFi (Wireless Fidelity), long-distance wireless modes such as LoRa, and wireless communication based on a mobile network. When the connection is made through a mobile network, the network format of the mobile network may be any one of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), 5G, WiMax, and the like, which is not limited in this embodiment.
After receiving the video stream or the discrete image sent by the image capturing device, the server may classify each frame of image in the video stream or the discrete image based on the first model described in the foregoing embodiments. Specifically, after the server acquires video data obtained by shooting a target place, any frame of image in the video data can be input into the first model; the first model includes: a plurality of expert models and a fusion model connected with the plurality of expert models; the plurality of expert models includes expert models of at least two different network configurations. The server can extract the features of the image through the plurality of expert models to obtain the image features corresponding to the plurality of expert models, and then the fusion model is used for carrying out fusion processing on the image features corresponding to the plurality of expert models according to the feature fusion parameters of the plurality of expert models, and the fusion model is used for carrying out classification operation processing on the features obtained through fusion to obtain the scene category corresponding to the target place. When the scene category output by the first model indicates that the target site is a high risk scene, an alert message may be output based on the alert device.
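A hypothetical sketch of this server-side flow is given below; the category names, the frame-by-frame policy, and the OpenCV-based reading are assumptions made only for illustration.

```python
# Hypothetical monitoring loop: classify each frame of the captured video
# with the first model and raise an alert for high-risk scene categories.
import cv2

HIGH_RISK_CATEGORIES = {"violence", "terror"}     # assumed category names

def monitor(video_path: str, classify_frame, alert):
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        category = classify_frame(frame)          # first-model inference
        if category in HIGH_RISK_CATEGORIES:
            alert(category)                        # drive the alerting device
    cap.release()
```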
In addition to the image processing scenarios described in the foregoing embodiments, the model training method provided in the embodiments of the present application may also be applied to a variety of other different artificial intelligence scenarios. Such as natural language processing scenarios, data analysis scenarios, etc. In a natural language processing scenario, the training data may be implemented as corpus data employed for training, such as multiple text corpora or multiple audio corpora. In a behavioral data analysis scenario, the training data may be implemented as a plurality of behavioral data employed for training. In a commodity data analysis scenario, the training data may be implemented as click volume statistics, sales volume statistics, and the like of different commodities, which are not listed one by one.
Fig. 6 is a schematic structural diagram of a server according to an exemplary embodiment of the present application. As shown in fig. 6, the server may include: a memory 601, a processor 602, a communication component 603, and a power component 604. Only some components are schematically shown in fig. 6, which does not mean that the server includes only the components shown in fig. 6.
The memory 601 may be configured to store various other data to support operations on the server. Examples of such data include instructions for any application or method operating on the server, contact data, phonebook data, messages, pictures, videos, and so forth. The memory may be implemented by any type or combination of volatile or non-volatile memory devices, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
In the present embodiment, memory 601 is used to store one or more computer instructions.
A processor 602, coupled to the memory 601, is configured to execute the one or more computer instructions in the memory 601 to: input the image to be processed, received via the communication component 603, into the first model; the first model includes: a plurality of expert models and a fusion model connected with the plurality of expert models; the plurality of expert models comprises expert models of at least two different network structures; perform feature extraction on the image to be processed through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively; and perform, through the fusion model, fusion processing on the image features corresponding to the plurality of expert models according to the feature fusion parameters of the plurality of expert models, and perform classification operation processing on the features obtained through fusion to obtain a first classification result.
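As a concrete, non-limiting illustration of the first model's structure, the sketch below builds two expert backbones with different network structures and a fusion model that weights and classifies their features. The choice of ResNet-18 and MobileNetV2, the projection dimension, and the softmax-normalized fusion weights are assumptions made only for the sake of example.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

class FusionHead(nn.Module):
    """Fusion model: weights each expert's feature with a learnable fusion
    parameter, then runs the classification operation on the fused feature."""
    def __init__(self, feat_dims, num_classes, fused_dim=256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, fused_dim) for d in feat_dims)
        self.fusion_weights = nn.Parameter(torch.ones(len(feat_dims)))  # feature fusion parameters
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, feats):                       # feats: list of (B, d_i) tensors
        w = torch.softmax(self.fusion_weights, 0)   # normalized contribution of each expert
        fused = sum(w[i] * proj(f) for i, (proj, f) in enumerate(zip(self.proj, feats)))
        return self.classifier(fused)

class FirstModel(nn.Module):
    """Two expert models with different network structures plus the fusion model."""
    def __init__(self, num_classes=10):
        super().__init__()
        resnet = tvm.resnet18(weights=None)
        resnet.fc = nn.Identity()                   # expose the 512-d backbone feature
        mobilenet = tvm.mobilenet_v2(weights=None)
        mobilenet.classifier = nn.Identity()        # expose the 1280-d backbone feature
        self.experts = nn.ModuleList([resnet, mobilenet])
        self.fusion = FusionHead([512, 1280], num_classes)

    def forward(self, x):
        feats = [expert(x) for expert in self.experts]   # per-expert image features
        return self.fusion(feats)                        # fused feature -> class logits

logits = FirstModel()(torch.randn(2, 3, 224, 224))       # shape (2, num_classes)
```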
Further optionally, before performing fusion processing on the image features corresponding to the plurality of expert models according to the feature fusion parameters of the plurality of expert models through the fusion model, the processor 602 is further configured to: dynamically learn, through the fusion model, the feature fusion parameters of the plurality of expert models according to the image features obtained by performing feature extraction on the sample images after the plurality of expert models converge and the label truth values marked on the sample images.
In addition to the foregoing embodiments, the server shown in fig. 6 may further perform the following scene classification method, where the processor 602 is mainly configured to: acquiring video data obtained by shooting a target place; inputting any frame image in the video data into a first model; the first model includes: a plurality of expert models and a fusion model connected with the plurality of expert models; the plurality of expert models comprises expert models of at least two different network structures; performing feature extraction on the image through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively; fusing the image features corresponding to the expert models according to the feature fusion parameters of the expert models through the fusion model; and carrying out classification operation processing on the features obtained by fusion through the fusion model to obtain the scene category corresponding to the target place.
In addition to the foregoing embodiments, the server shown in fig. 6 may further perform the following image classification method, where the processor 602 is mainly configured to: respond to a calling request of a client to a first interface, and acquire the image to be processed contained in the interface parameters; input the image to be processed into the first model; the first model includes: a plurality of expert models and a fusion model connected with the plurality of expert models; the plurality of expert models comprises expert models of at least two different network structures; perform feature extraction on the image to be processed through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively; perform, through the fusion model, fusion processing on the image features corresponding to the plurality of expert models according to the dynamically learned feature fusion parameters of the plurality of expert models, and perform classification operation processing on the fused features to obtain a first classification result; and return the first classification result to the client.
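A minimal sketch of such a first-interface call path is shown below, assuming an HTTP endpoint served with Flask; the route name, the request field, and the response format are illustrative only and are not fixed by this description.

```python
import io
import torch
from flask import Flask, request, jsonify
from PIL import Image
from torchvision import transforms

app = Flask(__name__)
model = torch.jit.load("first_model.pt").eval()   # hypothetical export of the trained first model
preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

@app.route("/classify", methods=["POST"])          # stands in for the "first interface"
def classify():
    # the image to be processed arrives as an interface parameter (multipart field "image")
    image = Image.open(io.BytesIO(request.files["image"].read())).convert("RGB")
    x = preprocess(image)[None]
    with torch.no_grad():
        pred = model(x).argmax(dim=1).item()
    return jsonify({"first_classification_result": pred})   # returned to the client

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```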
In addition to the foregoing embodiments, the server shown in fig. 6 may also perform the following model training method. The first model includes: a plurality of expert models, and a fusion model connected to the plurality of expert models; the plurality of expert models includes expert models of at least two different network structures. The processor 602 is configured to: acquire image features obtained by performing feature extraction on the sample image after the plurality of expert models converge, the sample image being labeled with a label truth value; perform, through the fusion model, fusion processing on the image features corresponding to the plurality of expert models according to the feature fusion parameters of the plurality of expert models, and perform classification operation processing on the fused features to obtain a first classification result; and adjust the feature fusion parameters of the plurality of expert models with a learning target of converging the difference between the first classification result and the label truth value to a first specified range.
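A possible training sketch for this stage is given below: the converged expert models are frozen, and only the fusion model's parameters (including its feature fusion parameters) are updated against the label truth values. A fixed number of epochs stands in for the convergence criterion of the first specified range; the optimizer and learning rate are assumptions.

```python
import torch
import torch.nn.functional as F

def train_fusion(fusion, experts, loader, epochs=5, lr=1e-3):
    """Stage 2: experts are converged and frozen; only the fusion model
    (including its feature fusion parameters) is updated."""
    for expert in experts:
        expert.eval()
        for p in expert.parameters():
            p.requires_grad_(False)
    opt = torch.optim.Adam(fusion.parameters(), lr=lr)
    for _ in range(epochs):                      # fixed epochs stand in for the first specified range
        for images, labels in loader:            # labels: label truth values of the sample images
            with torch.no_grad():
                feats = [expert(images) for expert in experts]   # features from converged experts
            loss = F.cross_entropy(fusion(feats), labels)        # first classification result vs. truth
            opt.zero_grad()
            loss.backward()
            opt.step()
    return fusion
```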
Further optionally, the processor 602, before obtaining the image features obtained by feature extraction on the sample image after convergence by the plurality of expert models, is further configured to: performing feature extraction on the sample image through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively; classifying and calculating the image characteristics corresponding to the expert models through classifiers corresponding to the expert models to obtain classification results of the expert models; and adjusting parameters of the expert models by taking the difference between the classification result of each expert model and the label truth value as a learning target and converging the difference to a second specified range.
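The first training stage described above can be sketched as follows, assuming each expert has its own classifier head and is trained with a per-expert cross-entropy loss against the label truth values; the optimizer, learning rate, and number of epochs are placeholders for the convergence condition of the second specified range.

```python
import torch
import torch.nn.functional as F

def pretrain_experts(experts, classifiers, loader, epochs=10, lr=1e-2):
    """Stage 1: every expert extracts features, its own classifier predicts, and the
    per-expert cross-entropy to the label truth value drives the parameter update."""
    params = [p for m in list(experts) + list(classifiers) for p in m.parameters()]
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    for _ in range(epochs):                      # stands in for the second specified range
        for images, labels in loader:
            loss = 0.0
            for expert, head in zip(experts, classifiers):
                feats = expert(images)                             # per-expert image features
                loss = loss + F.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return experts, classifiers
```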
Further optionally, any expert model of the plurality of expert models comprises a plurality of feature computation modules; when the feature extraction is performed on the sample image through the plurality of expert models to obtain image features corresponding to the plurality of expert models, the processor 602 is specifically configured to: randomly selecting a partial feature calculation module from the expert models for any one of the plurality of expert models to dynamically set a depth of the expert model; and performing feature extraction operation on the input sample image through the partial feature calculation module to obtain the image features output by the expert model.
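One way to realize this dynamic depth, sketched below under the assumption of residual feature calculation modules, is to randomly skip a subset of modules during training (in the spirit of stochastic depth) so that the expert's effective depth changes from batch to batch.

```python
import random
import torch
import torch.nn as nn

class DynamicDepthExpert(nn.Module):
    """Expert built from a stack of feature calculation modules; during training a
    random subset of modules is kept, so the expert's effective depth varies."""
    def __init__(self, dim=64, num_blocks=8, keep_prob=0.75):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU())
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.ReLU())
            for _ in range(num_blocks)
        )
        self.keep_prob = keep_prob
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        x = self.stem(x)
        for block in self.blocks:
            if self.training and random.random() > self.keep_prob:
                continue                     # this feature calculation module is skipped
            x = x + block(x)                 # residual form keeps shapes valid when modules are skipped
        return self.pool(x).flatten(1)       # the expert's output image feature
```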
Further optionally, when adjusting the parameters of each of the plurality of expert models with the learning target of converging the difference between the classification result of each expert model and the label truth value to a second specified range, the processor 602 is specifically configured to: calculate respective classification losses of the plurality of expert models according to the respective classification results of the plurality of expert models and the label truth values; calculate difference metric losses corresponding to the plurality of expert models according to the image features corresponding to the plurality of expert models respectively; determine learning targets of the plurality of expert models according to the classification losses and the difference metric losses of the plurality of expert models; and adjust the parameters of the plurality of expert models according to their respective learning targets until the learning targets of the plurality of expert models converge to the second specified range.
Further optionally, when the processor 602 calculates the classification loss of each of the plurality of expert models according to the classification result of each of the plurality of expert models and the label truth value, it is specifically configured to: obtaining a second classification result obtained by classifying the sample image by a preset teacher model; according to the second classification result, smoothing the label truth value marked on the sample image to obtain a smooth label; and calculating the classification loss of each expert model according to the classification result of each expert model and the smooth label.
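A possible form of this label smoothing is sketched below: the one-hot label truth value is blended with the teacher model's softened prediction (the second classification result), and each expert's classification loss is computed against the resulting smooth label. The blending weight and temperature are assumed hyper-parameters.

```python
import torch
import torch.nn.functional as F

def smooth_labels_with_teacher(labels, teacher_logits, num_classes, alpha=0.9, temperature=2.0):
    """Blend the one-hot label truth value with the teacher model's softened
    prediction (the second classification result) to obtain the smooth label."""
    one_hot = F.one_hot(labels, num_classes).float()
    teacher_prob = F.softmax(teacher_logits / temperature, dim=1)
    return alpha * one_hot + (1 - alpha) * teacher_prob

def expert_classification_loss(expert_logits, smooth_label):
    """Cross entropy between one expert's classification result and the smooth label."""
    log_prob = F.log_softmax(expert_logits, dim=1)
    return -(smooth_label * log_prob).sum(dim=1).mean()
```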
Further optionally, when the processor 602 calculates, according to the image features corresponding to the plurality of expert models, the difference metric loss corresponding to the plurality of expert models, specifically configured to: calculating average characteristics extracted by the expert models according to the image characteristics corresponding to the expert models respectively; and respectively calculating the similarity between the features output by the expert models and the average features to obtain the difference metric loss corresponding to the expert models.
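The difference metric loss can be sketched as follows, assuming the per-expert features have been projected to a common dimension and using cosine similarity as the similarity measure (the exact measure is not fixed by this description).

```python
import torch
import torch.nn.functional as F

def difference_metric_losses(per_expert_feats):
    """Per-expert similarity to the average feature of all experts; assumes the
    per-expert features share a common dimension (e.g., after projection)."""
    mean_feat = torch.stack(per_expert_feats).mean(dim=0)            # average extracted feature
    return [F.cosine_similarity(f, mean_feat, dim=1).mean()          # similarity of each expert
            for f in per_expert_feats]                               # to the average feature
```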
Further optionally, the processor 602 is further configured to perform a data enhancement operation on the sample image; the data enhancement operation includes: single sample based data enhancement operations and/or multiple sample based data enhancement operations.
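For example, the single-sample operations may be realized with standard torchvision transforms, and a multi-sample operation may be realized with mixup, as in the hedged sketch below; the specific transforms and the mixing coefficient are assumptions.

```python
import torch
from torchvision import transforms

# single-sample data enhancement (assumed transform choices)
single_sample_aug = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.ToTensor(),
])

def mixup(images, labels, num_classes, alpha=0.2):
    """Multi-sample data enhancement: mix pairs of images and their one-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1 - lam) * one_hot[perm]
    return mixed_images, mixed_labels
```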
Wherein the communication component 603 is configured to facilitate communication between the device in which the communication component is located and other devices in a wired or wireless manner. The device in which the communication component is located may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component may be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power component 604 provides power to the various components of the device in which it is located. The power component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
In this embodiment, the plurality of expert models in the first model provided by the server include at least two expert models having different network structures, so that knowledge can be learned with different expert models, which is advantageous for identifying samples of different difficulties when the number of samples is limited. The fusion model is trained on the image features extracted from the sample images after the expert models converge, so the fusion model can dynamically learn how to reasonably utilize the knowledge learned by each expert model; expert models with different performance thus make different contributions to the final classification result, and the overall performance of the first model is greatly improved. Meanwhile, this training mode reduces the dependence on the number of sample images, and efficient learning can still be performed with a small number of sample images, thereby obtaining better model performance.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program, where the computer program can implement the steps in the method embodiments that can be executed by the server in the above method embodiments when executed.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (13)

1. An image classification method, comprising:
inputting an image to be processed into a first model; the first model includes: a plurality of expert models and a fusion model connected with the plurality of expert models; the plurality of expert models comprises expert models of at least two different network structures;
performing feature extraction on the image through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively;
fusing the image features corresponding to the expert models according to the feature fusion parameters of the expert models through the fusion model;
and carrying out classification operation processing on the features obtained by fusion through the fusion model to obtain a first classification result.
2. The method according to claim 1, wherein before performing fusion processing on the image features corresponding to the plurality of expert models according to the feature fusion parameters of the plurality of expert models by using the fusion model, the method further comprises:
and dynamically learning the feature fusion parameters of the expert models according to the image features obtained by carrying out feature extraction on the sample images after the expert models converge and the label truth values marked on the sample images through the fusion model.
3. A method of scene classification, comprising:
acquiring video data obtained by shooting a target place;
inputting any frame image in the video data into a first model; the first model includes: a plurality of expert models and a fusion model connected with the plurality of expert models; the plurality of expert models comprises expert models of at least two different network structures;
performing feature extraction on the image through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively;
fusing the image features corresponding to the expert models according to the feature fusion parameters of the expert models through the fusion model;
and carrying out classification operation processing on the features obtained by fusion through the fusion model to obtain the scene category corresponding to the target place.
4. A method of model training, wherein a first model to be trained comprises: a plurality of expert models, and a fusion model connected to the plurality of expert models; the plurality of expert models comprises expert models of at least two different network structures; the method comprises the following steps:
acquiring image characteristics obtained by performing characteristic extraction on the sample image after the plurality of expert models converge; labeling a label truth value on the sample image;
fusing image features corresponding to the expert models according to feature fusion parameters of the expert models respectively through the fusion model, and performing classification operation processing on the fused features to obtain a first classification result;
and adjusting the characteristic fusion parameters of the expert models respectively by taking the difference between the first classification result and the label truth value as a learning target to converge to a first specified range.
5. The method of claim 4, wherein before obtaining the image features obtained by feature extraction of the sample image after convergence by the plurality of expert models, further comprising:
performing feature extraction on the sample image through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively;
classifying and calculating the image characteristics corresponding to the expert models through classifiers corresponding to the expert models to obtain classification results of the expert models;
and adjusting parameters of the expert models by taking the difference between the classification result of each expert model and the label truth value as a learning target and converging the difference to a second specified range.
6. The method of claim 5, wherein any one of the plurality of expert models comprises a plurality of feature computation modules;
performing feature extraction on the sample image through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively, including:
randomly selecting a partial feature calculation module from the expert models for any one of the plurality of expert models to dynamically set a depth of the expert model;
and performing feature extraction operation on the input sample image through the partial feature calculation module to obtain the image features output by the expert model.
7. The method of claim 5, wherein adjusting parameters of each of the expert models with a learning goal of converging a difference between the classification result and the label truth value of each of the expert models to a second specified range comprises:
calculating respective classification losses of the plurality of expert models according to respective classification results of the plurality of expert models and the label truth values;
calculating difference metric losses corresponding to the expert models according to the image features corresponding to the expert models respectively;
determining learning objectives of the plurality of expert models according to the classification loss and the difference metric loss of each of the plurality of expert models;
and adjusting the parameters of the expert models according to the learning targets of the expert models until the learning targets of the expert models converge to the second specified range.
8. The method of claim 7, wherein calculating the classification loss of each of the plurality of expert models based on the classification result and the label truth of each of the plurality of expert models comprises:
obtaining a second classification result obtained by classifying the sample image by a preset teacher model;
according to the second classification result, smoothing the label truth value marked on the sample image to obtain a smooth label;
and calculating the classification loss of each expert model according to the classification result of each expert model and the smooth label.
9. The method of claim 7, wherein calculating the loss of the variability metric for each of the plurality of expert models based on the image features for each of the plurality of expert models comprises:
calculating average characteristics extracted by the expert models according to the image characteristics corresponding to the expert models respectively;
and respectively calculating the similarity between the features output by the expert models and the average features to obtain the difference metric loss corresponding to the expert models.
10. The method according to any one of claims 4-9, further comprising: performing a data enhancement operation on the sample image; the data enhancement operation includes: single sample based data enhancement operations and/or multiple sample based data enhancement operations.
11. An image classification method, comprising:
responding to a calling request of a client to a first interface, and acquiring an image to be processed contained in interface parameters;
inputting the image to be processed into a first model; the first model includes: a plurality of expert models and a fusion model connected with the plurality of expert models; the plurality of expert models comprises expert models of at least two different network structures;
performing feature extraction on the image through the plurality of expert models to obtain image features corresponding to the plurality of expert models respectively;
fusing the image features corresponding to the expert models according to the feature fusion parameters of the expert models through the fusion model;
carrying out classification operation processing on the features obtained by fusion through the fusion model to obtain a first classification result;
and returning the first classification result to the client.
12. A server, comprising: a memory, a processor, and a communication component;
the memory to store one or more computer instructions;
the processor configured to execute one or more computer instructions for performing the steps in the method of any one of claims 1-11.
13. A computer-readable storage medium storing a computer program, characterized in that the computer program is capable of carrying out the steps of the method according to any one of claims 1-11 when executed.
CN202111297227.3A 2021-11-02 2021-11-02 Model training method, image classification method, server and storage medium Pending CN114170425A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111297227.3A CN114170425A (en) 2021-11-02 2021-11-02 Model training method, image classification method, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111297227.3A CN114170425A (en) 2021-11-02 2021-11-02 Model training method, image classification method, server and storage medium

Publications (1)

Publication Number Publication Date
CN114170425A true CN114170425A (en) 2022-03-11

Family

ID=80477998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111297227.3A Pending CN114170425A (en) 2021-11-02 2021-11-02 Model training method, image classification method, server and storage medium

Country Status (1)

Country Link
CN (1) CN114170425A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114663714A (en) * 2022-05-23 2022-06-24 阿里巴巴(中国)有限公司 Image classification and ground object classification method and device
CN117541894A (en) * 2024-01-04 2024-02-09 支付宝(杭州)信息技术有限公司 Training method and device for multi-mode model
CN117541894B (en) * 2024-01-04 2024-04-16 支付宝(杭州)信息技术有限公司 Training method and device for multi-mode model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination