CN114419400B - Training method, recognition method, apparatus, medium and device for an image recognition model

Training method, recognition method, apparatus, medium and device for an image recognition model

Info

Publication number
CN114419400B
Authority
CN
China
Prior art keywords
training
image
statistic
training sample
sample set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210309902.8A
Other languages
Chinese (zh)
Other versions
CN114419400A (en)
Inventor
边成
李永会
杨延展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202210309902.8A priority Critical patent/CN114419400B/en
Publication of CN114419400A publication Critical patent/CN114419400A/en
Application granted granted Critical
Publication of CN114419400B publication Critical patent/CN114419400B/en
Priority to PCT/CN2023/082355 priority patent/WO2023185516A1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a training method, a recognition method, an apparatus, a medium, and a device for an image recognition model. The method includes: obtaining a plurality of training sample sets whose data distributions are not completely consistent; for each training image, determining the gradient of the training image according to the training image and the training recognition result corresponding to the training image; determining a first statistic and a second statistic of each training sample set according to the gradients of the training images; determining a statistic loss function according to the first statistic and the second statistic; and updating a preset model according to the statistic loss function to obtain an image recognition model. Because the preset model is updated according to a statistic loss function determined from the first statistic and the second statistic, an image recognition model with high generalization performance is obtained without additional fine-tuning, the overfitting problem is avoided, and the recognition accuracy of the image recognition model is improved.

Description

Training method, recognition method, device, medium and equipment of image recognition model
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a training method, a recognition method, an apparatus, a medium, and a device for an image recognition model.
Background
Colorectal cancer is one of the malignant tumors with the highest incidence in China, but early diagnosis and appropriate treatment can bring a cure rate of about 90%. Regular colonoscopy screening can identify adenomatous polyps and prevent cancer. During endoscopy, it is important to identify the ileocecal portion in endoscopic images.
Currently, recognition of endoscopic images is mainly based on deep neural networks (e.g., convolutional neural networks), and a large amount of training data needs to be collected for training in order to achieve good generalization performance. The training data may come from the same medical center or from different medical centers. However, methods in the related art neglect the generalization of the model to new centers and do not exploit the extra knowledge contained in multi-center training data. As a result, each time the model is deployed to a new center, data must be collected from that center to fine-tune the trained model in order to ensure its generalization performance; otherwise, the accuracy with which the model identifies endoscopic images is affected. Moreover, fine-tuning the trained model at every deployment is a complex process and may cause problems such as overfitting, which in turn degrades the recognition accuracy of the model.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a training method for an image recognition model, the method including:
obtaining a plurality of training sample sets; the training sample set comprises training images and training recognition results corresponding to the training images, and the data distribution of each training sample set is not completely consistent;
for each training image, determining the gradient of the training image according to the training image and a training recognition result corresponding to the training image;
determining a first statistic of each training sample set and a second statistic of each training sample set according to the gradient of each training image; the first statistic is used for representing a mean vector corresponding to the training sample set, and the second statistic is used for representing a covariance matrix corresponding to the training sample set;
determining a statistic loss function according to the first statistic and the second statistic;
and updating a preset model according to the statistic loss function to obtain an image recognition model.
In a second aspect, the present disclosure provides an image recognition method, the method comprising:
acquiring an image to be identified;
inputting the image to be recognized into a pre-trained image recognition model to obtain a recognition result of the image to be recognized; the image recognition model is obtained by training through the training method of the image recognition model in the first aspect.
In a third aspect, the present disclosure provides an apparatus for training an image recognition model, including:
a first acquisition module, configured to acquire a plurality of training sample sets; the training sample set comprises training images and training recognition results corresponding to the training images, and the data distribution of each training sample set is not completely consistent;
the determining module is used for determining the gradient of the training image according to the training image and the training recognition result corresponding to the training image aiming at each training image;
the determining module is further configured to determine a first statistic of each training sample set and a second statistic of each training sample set according to the gradient of each training image; the first statistic is used for representing a mean vector corresponding to the training sample set, and the second statistic is used for representing a covariance matrix corresponding to the training sample set;
the determining module is further configured to determine a statistic loss function according to the first statistic and the second statistic;
and the updating module is used for updating the preset model according to the statistic loss function to obtain the image recognition model.
In a fourth aspect, the present disclosure provides an image recognition apparatus comprising:
the second acquisition module is used for acquiring an image to be identified;
the processing module is used for inputting the image to be recognized into a pre-trained image recognition model to obtain a recognition result of the image to be recognized; wherein the image recognition model is obtained by training through the training device of the image recognition model of the third aspect.
In a fifth aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first or second aspect of the present disclosure.
In a sixth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method of the first or second aspect of the disclosure.
According to the above technical scheme, a plurality of training sample sets comprising training images and training recognition results are first obtained. Then, for each training image, the gradient of the training image is determined according to the training image and the training recognition result corresponding to the training image, and a first statistic and a second statistic of each training sample set are determined according to the gradients of the training images. Finally, a statistic loss function is determined according to the first statistic and the second statistic, and the preset model is updated according to the statistic loss function to obtain the image recognition model. Because the statistic loss function determined from the first statistic and the second statistic is used to update the preset model, the preset model can use the training images of the plurality of training sample sets to learn image features with center invariance, capture discriminative information relevant to image recognition, and ignore the noise of any specific training sample set, thereby yielding an image recognition model with high generalization performance.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow diagram illustrating a method of training an image recognition model in accordance with an exemplary embodiment;
FIG. 2 is a flowchart of step 102 of the embodiment shown in FIG. 1;
FIG. 3 is a flowchart of step 103 of the embodiment shown in FIG. 1;
FIG. 4 is a flowchart of step 104 of the embodiment shown in FIG. 1;
FIG. 5 is a flow diagram illustrating an image recognition method according to an exemplary embodiment;
FIG. 6 is a block diagram illustrating an apparatus for training an image recognition model in accordance with an exemplary embodiment;
FIG. 7 is a block diagram of a determination module shown in accordance with the embodiment shown in FIG. 6;
FIG. 8 is a block diagram illustrating an image recognition device according to an exemplary embodiment;
FIG. 9 is a block diagram of an electronic device shown in accordance with an example embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and the embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will understand them as meaning "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
FIG. 1 is a flow diagram illustrating a method of training an image recognition model according to an exemplary embodiment. As shown in fig. 1, the method may include the steps of:
step 101, a plurality of training sample sets are obtained. The training sample set comprises training images and training recognition results corresponding to the training images, and the data distribution of each training sample set is not completely consistent.
Illustratively, deep learning approaches perform well in image recognition; however, this relies on the training data and the test data having consistent distributions. When a neural network model is trained for image recognition, it seeks shortcuts during optimization and tends to rely on simple features in the training data; that is, the neural network model preferentially memorizes simple bias information in the training data. For example, in a scene in which the ileocecal portion is recognized in endoscopic images, the neural network model preferentially memorizes simple cues such as the age of the patient, the model of the image acquisition apparatus, or the body position at the time of shooting. However, the distributions of training data from multiple data centers may not be completely consistent (i.e., there is a multi-center data distribution shift), and because test data from a new center carries different bias information, the generalization ability of the model can be greatly reduced at test time. In order to improve the generalization performance of the model, training data from a plurality of data centers can be used to train the image recognition model, so that the model learns image features with center invariance and captures discriminative information related to image recognition while its sensitivity to the data distribution of any specific data center is reduced, thereby ensuring the generalization performance of the image recognition model on a new data center.
Specifically, a plurality of training sample sets collected by different data centers may first be obtained, where the data distributions of the training sample sets are not completely consistent and each training sample set corresponds to one data center. A training sample set may include training images and training recognition results corresponding to the training images. Taking the recognition of the ileocecal portion in endoscopic images as an example, the data center may be a medical center, the training images may be endoscopic images acquired by the image acquisition devices of the medical center during endoscopy over a historical time period, and the training recognition result may be a manually labeled classification result of each endoscopic image (for example, whether or not the endoscopic image shows the ileocecal portion); the data distributions of the training sample sets collected by different medical centers are not completely consistent. A minimal sketch of how such multi-center data could be organized is given below.
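As an illustrative sketch (not part of the disclosure), the multi-center training sample sets could be organized as one PyTorch dataset per medical center; the directory layout, file naming, and binary labels below are assumptions:

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class CenterDataset(Dataset):
    """One training sample set: endoscopic images collected by a single
    medical center, each paired with a manually labeled binary result
    (1 = ileocecal image, 0 = not ileocecal)."""

    def __init__(self, root, transform=None):
        # Hypothetical layout: root/<label>/<image>.png, label in {"0", "1"}
        self.items = []
        for label in ("0", "1"):
            folder = os.path.join(root, label)
            for name in sorted(os.listdir(folder)):
                self.items.append((os.path.join(folder, name), int(label)))
        self.transform = transform

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        path, label = self.items[idx]
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, label

# One dataset per data center; their data distributions need not coincide.
# centers = [CenterDataset(f"data/center_{e}") for e in range(3)]
```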
Step 102, for each training image, determining the gradient of the training image according to the training image and the training recognition result corresponding to the training image.
Step 103, determining a first statistic of each training sample set and a second statistic of each training sample set according to the gradient of each training image. The first statistic is used for representing a mean vector corresponding to the training sample set, and the second statistic is used for representing a covariance matrix corresponding to the training sample set.
For example, a preset model for recognizing images may be constructed in advance. After the plurality of training sample sets are obtained, each training image in all the training sample sets is input into the preset model to obtain a predicted recognition result for each training image. The gradient of each training image may then be calculated based on the predicted recognition result of the training image and its training recognition result. The gradient of a training image can be understood as a depth feature of the training image obtained by combining the image with its training recognition result.
Then, for each training sample set, a first statistic characterizing the mean vector corresponding to the training sample set and a second statistic characterizing the covariance matrix corresponding to the training sample set may be calculated from the gradients of its training images. The data distribution shift between a training sample set and a test set typically includes a diversity shift and a correlation shift. Diversity shift means that data seen during model training and testing come from different data centers and therefore have different characteristics (for example, two medical centers use different image acquisition equipment, which leads to differences in the resolution and color appearance of endoscopic imaging); correlation shift means that the correlation information among data in the test set differs from the correlation information among data in the training sample set. The first statistic effectively measures the diversity shift corresponding to the training sample set, and the second statistic measures the correlation shift corresponding to the training sample set.
Step 104, determining a statistic loss function according to the first statistic and the second statistic.
Step 105, updating the preset model according to the statistic loss function to obtain an image recognition model.
For example, after the first statistic and the second statistic of each training sample set are determined, a statistic loss function corresponding to every two training sample sets may be determined according to the first statistics of the two training sample sets and the second statistics of the two training sample sets. The statistic loss function may include a first statistic loss function characterizing the difference between the first statistics of the two training sample sets and a second statistic loss function characterizing the difference between the second statistics of the two training sample sets. Then, the first statistic loss function and the second statistic loss function corresponding to every two training sample sets may be minimized simultaneously with the initial loss function of the preset model to update the model parameters of the preset model, so as to obtain the image recognition model.
It should be noted that the first statistic may be a first-order statistic and the second statistic may be a second-order statistic, and together the first-order and second-order statistics summarize most features of a data distribution. Therefore, using the first-order and second-order statistics in the gradient space makes it possible to explicitly measure the gradient distribution distance between two data centers and to minimize the gradient distribution difference between data of different data centers, so that the gradient distributions of different data centers are brought as close as possible and the dependence on the data distribution of any particular data center is removed. The model is thereby forced, during training, to learn and capture cross-center invariant discriminative information (i.e., image features with center invariance) from the data of the multiple data centers, which improves the generalization ability of the model on new data centers.
In summary, in the present disclosure, a plurality of training sample sets comprising training images and training recognition results are first obtained. Then, for each training image, the gradient of the training image is determined according to the training image and the training recognition result corresponding to the training image, and a first statistic and a second statistic of each training sample set are determined according to the gradients of the training images. Finally, a statistic loss function is determined according to the first statistic and the second statistic, and the preset model is updated according to the statistic loss function to obtain the image recognition model. Because the statistic loss function determined from the first statistic and the second statistic is used to update the preset model, the preset model can use the training images of the plurality of training sample sets to learn image features with center invariance, capture discriminative information relevant to image recognition, and ignore the noise of any specific training sample set. An image recognition model with high generalization performance is thereby obtained, the accuracy of recognizing images to be recognized is ensured, no additional fine-tuning of the image recognition model is needed, the overfitting problem is avoided, and the recognition accuracy of the image recognition model is improved.
Fig. 2 is a flowchart of step 102 of the embodiment shown in fig. 1. As shown in fig. 2, the preset model may include a feature extraction network and a classifier, and step 102 may include the following steps:
step 1021, preprocessing the training image to obtain a preprocessed training image.
For example, in the process of training the image recognition model, each training image may be preprocessed in advance; for example, random data enhancement may be performed on each training image to obtain a preprocessed training image. The random data enhancement may include at least one of random scaling, random cropping, random flipping (including random horizontal/vertical flipping), and random color dithering (including brightness, contrast, and the like), as sketched below.
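A minimal sketch of this preprocessing with torchvision; the specific crop scale and jitter magnitudes are assumptions rather than values fixed by the disclosure (the 448×448 input size is taken from the training setup described later):

```python
from torchvision import transforms

# Random scaling + cropping, flipping, and color dithering, as listed above.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(448),      # random scaling and cropping
    transforms.RandomHorizontalFlip(),      # random horizontal flip
    transforms.RandomVerticalFlip(),        # random vertical flip
    transforms.ColorJitter(brightness=0.4,  # random color dithering
                           contrast=0.4),
    transforms.ToTensor(),
])
```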
Step 1022, inputting the preprocessed training image into the feature extraction network to obtain the image features of the training image.
And step 1023, inputting the image characteristics of the training image into a classifier to obtain a prediction recognition result of the training image.
In one scenario, the preset model may include a feature extraction network $F$ and a classifier $W$. A certain training sample set may be recorded as $D_e=\{(x_i^e, y_i^e)\}_{i=1}^{N_e}$, where $e \in E$ is the data center corresponding to the training sample set, $E$ is the set of all data centers, $x_i^e$ is the $i$-th training image in the training sample set, $y_i^e$ is the corresponding manually labeled training recognition result of $x_i^e$ (e.g., a one-hot classification label can be adopted), and $N_e$ is the number of training images included in the training sample set. After the plurality of training sample sets are obtained, the preprocessed training image corresponding to each training image may be input into the feature extraction network $F$ to obtain the image features of each training image; the image features extracted by $F$ are noted as $z_i^e=F(x_i^e)$. Then, the image features of each training image may be input into the classifier $W$ to obtain a predicted recognition result for each training image. A fully connected layer activated by softmax can be used as the classifier $W$, whose parameters $w$ can be expressed as $w \in \mathbb{R}^{K \times C}$; the predicted recognition result of each training image, i.e., the classification probability predicted by the classifier $W$, can then be expressed as $\hat{y}_i^e=\sigma(w^{\top} z_i^e)$, where $C$ is the number of categories, $K$ is the feature dimension, and $\sigma(\cdot)$ is the softmax operation.
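The following sketch instantiates the preset model described above in PyTorch; the ResNet-50 backbone is an assumed choice of feature extraction network $F$ (the disclosure does not name one), and the bias-free linear layer matches the parameterization $w \in \mathbb{R}^{K \times C}$:

```python
import torch
import torch.nn as nn
from torchvision import models

class PresetModel(nn.Module):
    """Feature extraction network F plus a softmax-activated
    fully connected classifier W."""

    def __init__(self, num_classes=2):
        super().__init__()
        backbone = models.resnet50(weights=None)    # assumed backbone choice
        self.feature_dim = backbone.fc.in_features  # K
        backbone.fc = nn.Identity()                 # strip the built-in head
        self.F = backbone
        self.W = nn.Linear(self.feature_dim, num_classes, bias=False)

    def forward(self, x):
        z = self.F(x)                         # image features, shape (N, K)
        logits = self.W(z)                    # w^T z, shape (N, C)
        probs = torch.softmax(logits, dim=1)  # predicted classification probability
        return z, logits, probs
```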
And step 1024, determining the gradient of the training image according to the prediction recognition result, the training recognition result and the image characteristics of the training image.
In this step, the gradient of each training image may be determined according to the predicted recognition result, the training recognition result, and the image features of the training image. For the $i$-th training image $x_i^e$ from the training sample set of data center $e$, its gradient can be understood as the gradient, with respect to the classifier parameters $w$, of the classification loss of the classifier $W$ taking $x_i^e$ and its corresponding training recognition result $y_i^e$ as input, i.e., the gradient used when the network parameters are optimized by gradient descent. For example, in the case where the classifier $W$ is a softmax classifier using a cross-entropy loss function, the gradient of the $i$-th training image from data center $e$ can be expressed as:

$$g_i^e=\nabla_w \ell\left(x_i^e, y_i^e\right)=z_i^e \otimes\left(\hat{y}_i^e-y_i^e\right)$$

where $\otimes$ denotes the outer product.
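Because the softmax cross-entropy gradient with respect to $w$ has this closed form (the outer product of the image features and the probability error), the per-image gradients can be computed for a whole batch without autograd; a sketch reusing the names from the model above:

```python
import torch
import torch.nn.functional as F

def per_image_gradients(z, probs, labels):
    """Gradient of the cross-entropy classification loss w.r.t. the
    classifier parameters w, one gradient per training image.

    z:      (N, K) image features
    probs:  (N, C) softmax classification probabilities
    labels: (N,)   integer training recognition results
    returns (N, K*C) flattened per-image gradients
    """
    onehot = F.one_hot(labels, num_classes=probs.shape[1]).float()
    # Outer product z_i (p_i - y_i)^T for every image in the batch.
    grads = torch.einsum("nk,nc->nkc", z, probs - onehot)
    return grads.flatten(start_dim=1)
```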
Fig. 3 is a flowchart of step 103 of the embodiment shown in fig. 1. As shown in fig. 3, step 103 may include the steps of:
step 1031, determining a first statistic of each training sample set according to the gradients of all training images included in the training sample set.
Step 1032, determining a second statistic of each training sample set according to the gradients of all training images included in the training sample set and the first statistic of the training sample set.
For example, for the training sample set from data center $e$, the gradients of all training images included in the training sample set may be recorded as a matrix $G_e=[g_1^e, g_2^e, \ldots, g_{N_e}^e]$. For convenience of representation, each gradient may be flattened into a vector, so that $G_e \in \mathbb{R}^{N_e \times KC}$. The first statistic of the training sample set may then be expressed as:

$$\mu_e=\frac{1}{N_e} \sum_{i=1}^{N_e} g_i^e$$

i.e., the first statistic of the training sample set is represented as a vector of length $KC$. The second statistic of the training sample set may be expressed as:

$$\Sigma_e=\frac{1}{N_e-1} \sum_{i=1}^{N_e}\left(g_i^e-\mu_e\right)\left(g_i^e-\mu_e\right)^{\top}$$

i.e., the second statistic of the training sample set is represented as a matrix of size $KC \times KC$.
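A sketch of both statistics computed from the flattened per-image gradients; the unbiased $1/(N_e-1)$ normalization follows the covariance formula above:

```python
import torch

def gradient_statistics(grads):
    """grads: (N_e, K*C) per-image gradients of one training sample set.
    Returns the first statistic (mean vector of length K*C) and the
    second statistic (covariance matrix of size K*C x K*C)."""
    mu = grads.mean(dim=0)                                  # first statistic
    centered = grads - mu
    sigma = centered.t() @ centered / (grads.shape[0] - 1)  # second statistic
    return mu, sigma
```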
Fig. 4 is a flowchart of step 104 of the embodiment shown in fig. 1. As shown in fig. 4, where the statistic loss function includes a first statistic loss function and a second statistic loss function, step 104 may include the following steps:
Step 1041, determining, for every two training sample sets, a first statistic loss function corresponding to the two training sample sets according to the first statistics of the two training sample sets.
Step 1042, determining, for every two training sample sets, a second statistic loss function corresponding to the two training sample sets according to the second statistics of the two training sample sets.
For example, after the first statistic and the second statistic of each training sample set are determined, the first statistic loss function and the second statistic loss function corresponding to every two training sample sets may be further determined. For the training sample sets from two different data centers $e$ and $f$, the corresponding first statistic loss function $L_{1st}$ can be expressed as:

$$L_{1st}=\left\|\mu_e-\mu_f\right\|_2^2$$

and the corresponding second statistic loss function $L_{2nd}$ can be expressed as:

$$L_{2nd}=\left\|\Sigma_e-\Sigma_f\right\|_F^2$$

where $\|\cdot\|$ denotes the corresponding vector (or Frobenius) norm.
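A sketch of the two statistic loss terms as the squared distances above; since the exact norms are only partially recoverable from the description, the squared L2/Frobenius choice should be treated as an assumption:

```python
def statistic_losses(mu_e, sigma_e, mu_f, sigma_f):
    """First/second statistic loss between two training sample sets."""
    l_1st = (mu_e - mu_f).pow(2).sum()        # squared L2 distance of means
    l_2nd = (sigma_e - sigma_f).pow(2).sum()  # squared Frobenius distance
    return l_1st, l_2nd
```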
Optionally, step 105 may be implemented in the following manner:
and performing minimization processing on the first statistical loss function and the second statistical loss function corresponding to each two training sample sets and the initial loss function of the preset model to obtain the image recognition model.
For example, for every two training sample sets, the preset model is trained by using each training image in the two training sample sets as an input of the preset model and the training recognition result corresponding to each training image as the target output of the preset model. During this training, the three loss functions, namely the first statistic loss function and the second statistic loss function corresponding to every two training sample sets and the initial loss function of the preset model, are minimized simultaneously to update the model parameters of the preset model, thereby obtaining the image recognition model.
Simultaneously minimizing the first statistic loss function, the second statistic loss function, and the initial loss function is in fact equivalent to minimizing a single target loss function. For example, in the case where the preset model includes a feature extraction network and a classifier, the initial loss function may be the classification loss function of the classifier, and the target loss function may be expressed as:

$$L=L_{cls}+\lambda_1 L_{1st}+\lambda_2 L_{2nd}$$

where $L_{cls}$ is the classification loss function, and $\lambda_1$ and $\lambda_2$ can be understood as preset hyper-parameters used to balance the proportions of the respective terms.
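Putting the preceding sketches together, the target loss could be assembled over every pair of training sample sets as below; the helper functions are the hypothetical ones defined earlier, and both the default weights of 1.0 and summing (rather than averaging) the pairwise terms are assumptions:

```python
import itertools
import torch.nn.functional as F

def target_loss(model, batches, lambda_1=1.0, lambda_2=1.0):
    """batches: a list of (images, labels) pairs, one batch per data center.
    Returns L = L_cls + lambda_1 * L_1st + lambda_2 * L_2nd."""
    stats = []
    l_cls = 0.0
    for images, labels in batches:
        z, logits, probs = model(images)
        l_cls = l_cls + F.cross_entropy(logits, labels)  # initial loss L_cls
        grads = per_image_gradients(z, probs, labels)    # per-image gradients
        stats.append(gradient_statistics(grads))         # (mu_e, Sigma_e)
    l_1st = l_2nd = 0.0
    # Statistic losses for every pair of training sample sets.
    for (mu_e, sig_e), (mu_f, sig_f) in itertools.combinations(stats, 2):
        a, b = statistic_losses(mu_e, sig_e, mu_f, sig_f)
        l_1st, l_2nd = l_1st + a, l_2nd + b
    return l_cls + lambda_1 * l_1st + lambda_2 * l_2nd
```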
It should be noted that training of the image recognition model may be implemented with PyTorch, and the training parameters may be set as follows: learning rate: 5e-5; batch size: 256; optimizer: AdamW; epochs: 100 for the first training period and 20 for the second; input image size: 448×448.
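A sketch of a training loop under the stated setup (PyTorch, AdamW, learning rate 5e-5); drawing one batch per center at each step and the absence of a learning-rate schedule are assumptions:

```python
import torch
from torch.utils.data import DataLoader

model = PresetModel(num_classes=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One loader per center; per-center batch sizes summing to 256 overall.
# loaders = [DataLoader(ds, batch_size=256 // len(centers), shuffle=True)
#            for ds in centers]

def train_epoch(model, loaders, optimizer):
    model.train()
    for batches in zip(*loaders):  # one batch from every data center per step
        loss = target_loss(model, batches)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```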
In summary, in the present disclosure, a plurality of training sample sets comprising training images and training recognition results are first obtained. Then, for each training image, the gradient of the training image is determined according to the training image and the training recognition result corresponding to the training image, and a first statistic and a second statistic of each training sample set are determined according to the gradients of the training images. Finally, a statistic loss function is determined according to the first statistic and the second statistic, and the preset model is updated according to the statistic loss function to obtain the image recognition model. Because the statistic loss function determined from the first statistic and the second statistic is used to update the preset model, the preset model can use the training images of the plurality of training sample sets to learn image features with center invariance, capture discriminative information relevant to image recognition, and ignore the noise of any specific training sample set. An image recognition model with high generalization performance is thereby obtained, the accuracy of recognizing images to be recognized is ensured, no additional fine-tuning of the image recognition model is needed, the overfitting problem is avoided, and the recognition accuracy of the image recognition model is improved.
FIG. 5 is a flow chart illustrating a method of image recognition according to an exemplary embodiment. As shown in fig. 5, the method may include the steps of:
step 201, acquiring an image to be identified.
Step 202, inputting the image to be recognized into a pre-trained image recognition model to obtain a recognition result of the image to be recognized. The image recognition model is obtained by training through the training method of the image recognition model shown in any one of the embodiments.
For example, after training of the image recognition model is completed, the trained image recognition model may be deployed to a designated data center for use. The image to be recognized collected by the designated data center can then be obtained and input into the trained image recognition model to obtain the recognition result of the image to be recognized output by the image recognition model. Taking an image recognition model for recognizing the ileocecal portion in endoscopic images as an example, in the case that the image to be recognized is an endoscopic image, the endoscopic image can be input into the image recognition model to obtain a recognition result indicating whether the endoscopic image shows the ileocecal portion.
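For illustration, deployment-time inference might look like the following sketch; the checkpoint path, file names, deterministic resize, and the class-index convention are all assumptions:

```python
import torch
from PIL import Image
from torchvision import transforms

eval_transform = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])

model = PresetModel(num_classes=2)
model.load_state_dict(torch.load("image_recognition_model.pt"))  # hypothetical path
model.eval()

image = Image.open("endoscope_frame.png").convert("RGB")  # image to be recognized
with torch.no_grad():
    _, _, probs = model(eval_transform(image).unsqueeze(0))
is_ileocecal = probs[0, 1].item() > 0.5  # class 1 = ileocecal image (assumed)
```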
It should be noted that the image recognition model in the present disclosure is not limited to recognizing the ileocecal portion in endoscopic images, and may also be applied to any other image recognition scene (for example, recognizing people, objects, etc. in images); the present disclosure does not specifically limit this.
In summary, the present disclosure first obtains an image to be recognized, and inputs the image to be recognized into a pre-trained image recognition model to obtain a recognition result of the image to be recognized. The image recognition method and the image recognition system can ensure the accuracy of recognizing the image to be recognized by utilizing the pre-trained image recognition model with high generalization performance and high recognition accuracy to perform image recognition.
FIG. 6 is a block diagram illustrating an apparatus for training an image recognition model according to an example embodiment. As shown in fig. 6, the training apparatus 300 for an image recognition model includes:
a first obtaining module 301, configured to obtain a plurality of training sample sets. The training sample set comprises training images and training recognition results corresponding to the training images, and the data distribution of each training sample set is not completely consistent.
The determining module 302 is configured to determine, for each training image, a gradient of the training image according to the training image and a training recognition result corresponding to the training image.
The determining module 302 is further configured to determine a first statistic of each training sample set and a second statistic of each training sample set according to the gradient of each training image. The first statistic is used for representing a mean vector corresponding to the training sample set, and the second statistic is used for representing a covariance matrix corresponding to the training sample set.
The determining module 302 is further configured to determine a statistical quantity loss function according to the first statistical quantity and the second statistical quantity.
And the updating module 303 is configured to update the preset model according to the statistical loss function to obtain an image recognition model.
FIG. 7 is a block diagram illustrating a determination module according to the embodiment shown in FIG. 6. The preset model includes a feature extraction network and a classifier, and as shown in fig. 7, the determining module 302 includes:
the processing submodule 3021 is configured to perform preprocessing on the training image to obtain a preprocessed training image.
The extracting submodule 3022 is configured to input the preprocessed training image into the feature extraction network, so as to obtain an image feature of the training image.
The classification submodule 3023 is configured to input the image features of the training image into the classifier, and obtain a prediction recognition result of the training image.
A gradient determining submodule 3024, configured to determine a gradient of the training image according to the predicted recognition result, the training recognition result, and the image feature of the training image.
Optionally, the processing submodule 3021 is configured to:
and performing random data enhancement on the training image to obtain a preprocessed training image. Wherein the random data enhancement comprises at least one of random scaling, random cropping, random flipping, and random color dithering.
Optionally, the determining module 302 is configured to:
a first statistical measure for each training sample set is determined based on the gradients of all training images included in the training sample set.
And determining a second statistic of each training sample set according to the gradients of all the training images included in the training sample set and the first statistic of the training sample set.
Optionally, the statistical loss function comprises a first statistical loss function and a second statistical loss function. The determination module 302 is configured to:
and determining a first statistical loss function corresponding to each two training sample sets according to the first statistical quantities of the two training sample sets.
And determining a second statistic loss function corresponding to each two training sample sets according to the second statistic of each two training sample sets.
Optionally, the updating module 303 is configured to perform minimization processing on the first statistic loss function and the second statistic loss function corresponding to every two training sample sets and on the initial loss function of the preset model, so as to obtain the image recognition model.
Fig. 8 is a block diagram illustrating an image recognition apparatus according to an exemplary embodiment. As shown in fig. 8, the image recognition apparatus 400 includes:
and a second obtaining module 401, configured to obtain an image to be identified.
The processing module 402 is configured to input the image to be recognized into a pre-trained image recognition model, so as to obtain a recognition result of the image to be recognized. The image recognition model is obtained by training with the training apparatus 300 for image recognition model.
Optionally, the processing module 402 is configured to, in a case that the image to be recognized is an endoscopic image, input the endoscopic image into the image recognition model to obtain a recognition result indicating whether the endoscopic image shows the ileocecal portion.
In summary, in the present disclosure, an image to be recognized is first obtained and input into an image recognition model to obtain a recognition result of the image to be recognized, where the image recognition model is trained in the following manner: a plurality of training sample sets comprising training images and training recognition results are obtained; for each training image, the gradient of the training image is determined according to the training image and the training recognition result corresponding to the training image; a first statistic and a second statistic of each training sample set are determined according to the gradients of the training images; finally, a statistic loss function is determined according to the first statistic and the second statistic, and a preset model is updated according to the statistic loss function to obtain the image recognition model. Because the statistic loss function determined from the first statistic and the second statistic is used to update the preset model, the preset model can use the training images of the plurality of training sample sets to learn image features with center invariance, capture discriminative information relevant to image recognition, and ignore the noise of any specific training sample set. An image recognition model with high generalization performance is thereby obtained, the accuracy of recognizing images to be recognized is ensured, no additional fine-tuning of the image recognition model is needed, the overfitting problem is avoided, and the recognition accuracy of the image recognition model is improved.
Referring now to fig. 9, a schematic diagram of an electronic device 600 (e.g., a terminal device or a server) suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., car navigation terminals), as well as stationary terminals such as digital TVs and desktop computers. The electronic device shown in fig. 9 is only an example and should not impose any limitation on the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 9, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 9 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: obtaining a plurality of training sample sets; the training sample set comprises training images and training recognition results corresponding to the training images, and the data distribution of each training sample set is not completely consistent; for each training image, determining the gradient of the training image according to the training image and a training recognition result corresponding to the training image; determining a first statistic of each training sample set and a second statistic of each training sample set according to the gradient of each training image; the first statistic is used for representing a mean vector corresponding to the training sample set, and the second statistic is used for representing a covariance matrix corresponding to the training sample set; determining a statistic loss function according to the first statistic and the second statistic; and updating a preset model according to the statistic loss function to obtain an image recognition model.
Alternatively, the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring an image to be identified; inputting the image to be recognized into a pre-trained image recognition model to obtain a recognition result of the image to be recognized; the image recognition model is obtained by training through the training method of the image recognition model.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of the module does not in some cases constitute a limitation of the module itself, and for example, the acquisition module may also be described as a "module that acquires an image to be recognized".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a method of training an image recognition model, the method comprising: acquiring a plurality of training sample sets; the training sample set comprises training images and training recognition results corresponding to the training images, and the data distribution of each training sample set is not completely consistent; for each training image, determining the gradient of the training image according to the training image and a training recognition result corresponding to the training image; determining a first statistic of each training sample set and a second statistic of each training sample set according to the gradient of each training image; the first statistic is used for representing a mean vector corresponding to the training sample set, and the second statistic is used for representing a covariance matrix corresponding to the training sample set; determining a statistic loss function according to the first statistic and the second statistic; and updating a preset model according to the statistic loss function to obtain an image recognition model.
Example 2 provides the method of example 1, the pre-set model comprising a feature extraction network and a classifier, in accordance with one or more embodiments of the present disclosure; determining the gradient of the training image according to the training image and the training recognition result corresponding to the training image comprises: preprocessing the training image to obtain a preprocessed training image; inputting the preprocessed training image into the feature extraction network to obtain the image features of the training image; inputting the image characteristics of the training image into the classifier to obtain a prediction recognition result of the training image; and determining the gradient of the training image according to the prediction recognition result, the training recognition result and the image characteristics of the training image.
Example 3 provides the method of example 2, wherein preprocessing the training image to obtain a preprocessed training image, according to one or more embodiments of the present disclosure includes: carrying out random data enhancement on the training image to obtain the preprocessed training image; the random data enhancement includes at least one of random scaling, random cropping, random flipping, and random color dithering.
Example 4 provides, in accordance with one or more embodiments of the present disclosure, the method of example 1, wherein determining a first statistic of each training sample set and a second statistic of each training sample set according to the gradient of each training image includes: determining the first statistic of each training sample set according to the gradients of all training images included in the training sample set; and determining the second statistic of each training sample set according to the gradients of all training images included in the training sample set and the first statistic of the training sample set.
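With the per-image gradients of one training sample set stacked row-wise, both statistics follow from the standard sample formulas; a minimal sketch (PyTorch assumed):

    import torch

    def gradient_statistics(grads):
        # grads: (N, D) tensor, one flattened gradient per training image in the set.
        mu = grads.mean(dim=0)                                # first statistic: mean vector
        centered = grads - mu
        sigma = centered.T @ centered / (grads.shape[0] - 1)  # second statistic: covariance matrix
        return mu, sigma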
Example 5 provides, in accordance with one or more embodiments of the present disclosure, the method of example 1, wherein the statistic loss function comprises a first statistic loss function and a second statistic loss function, and determining a statistic loss function based on the first statistic and the second statistic comprises: determining the first statistic loss function corresponding to each two training sample sets according to the first statistics of the two training sample sets; and determining the second statistic loss function corresponding to each two training sample sets according to the second statistics of the two training sample sets.
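One natural construction, offered only as an assumption (the disclosure says each loss is determined from the two sets' statistics without fixing a distance), is a squared distance between the statistics of every pair of sets:

    import itertools

    def statistic_losses(mus, sigmas):
        # mus: list of (D,) mean vectors; sigmas: list of (D, D) covariance
        # matrices, one of each per training sample set.
        pairs = list(itertools.combinations(range(len(mus)), 2))
        loss_mu = sum((mus[i] - mus[j]).pow(2).sum() for i, j in pairs)           # first statistic loss
        loss_sigma = sum((sigmas[i] - sigmas[j]).pow(2).sum() for i, j in pairs)  # second statistic loss
        return loss_mu, loss_sigma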
Example 6 provides, in accordance with one or more embodiments of the present disclosure, the method of example 5, wherein updating the preset model according to the statistic loss function to obtain an image recognition model comprises: minimizing the first statistic loss function and the second statistic loss function corresponding to every two training sample sets, together with the initial loss function of the preset model, to obtain the image recognition model.
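Tying the sketches from examples 2, 4, and 5 together, one training step could jointly minimize the initial (task) loss and both statistic losses. The alpha/beta weights and the backbone/head attribute names are hypothetical; the disclosure does not specify how the three terms are combined beyond joint minimization:

    import torch
    import torch.nn.functional as F

    def training_step(model, optimizer, batches, alpha=1.0, beta=1.0):
        mus, sigmas, task_loss = [], [], 0.0
        for images, labels in batches:                    # one batch per training sample set
            feats = model.backbone(images)                # feature extraction network
            logits = model.head(feats)                    # classifier
            task_loss = task_loss + F.cross_entropy(logits, labels)  # initial loss
            grads = torch.stack([per_image_gradient(f, z, int(y))
                                 for f, z, y in zip(feats, logits, labels)])
            mu, sigma = gradient_statistics(grads)
            mus.append(mu)
            sigmas.append(sigma)
        loss_mu, loss_sigma = statistic_losses(mus, sigmas)
        total = task_loss + alpha * loss_mu + beta * loss_sigma
        optimizer.zero_grad()
        total.backward()
        optimizer.step()
        return float(total)

Because the per-image gradients are computed in closed form from differentiable tensors, the statistic losses remain differentiable and backpropagate through the whole model.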
Example 7 provides, in accordance with one or more embodiments of the present disclosure, an image recognition method, the method comprising: acquiring an image to be recognized; and inputting the image to be recognized into a pre-trained image recognition model to obtain a recognition result of the image to be recognized, wherein the image recognition model is obtained by training through the training method of the image recognition model described in any one of examples 1 to 6.
Example 8 provides, in accordance with one or more embodiments of the present disclosure, the method of example 7, wherein inputting the image to be recognized into a pre-trained image recognition model to obtain a recognition result of the image to be recognized includes: in the case that the image to be recognized is an endoscope image, inputting the endoscope image into the image recognition model to obtain a recognition result indicating whether the endoscope image shows the ileocecal region.
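A minimal inference sketch for the endoscope case, assuming a binary classifier in which class index 1 means "shows the ileocecal region" (both the class convention and the preprocess callable are assumptions):

    import torch

    @torch.no_grad()
    def shows_ileocecal_region(model, endoscope_image, preprocess):
        x = preprocess(endoscope_image).unsqueeze(0)  # add a batch dimension
        logits = model(x)                             # (1, num_classes) scores
        return bool(logits.argmax(dim=1).item() == 1)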
Example 9 provides, in accordance with one or more embodiments of the present disclosure, an apparatus for training an image recognition model, comprising: a first acquisition module, used for acquiring a plurality of training sample sets, wherein each training sample set comprises training images and training recognition results corresponding to the training images, and the data distributions of the training sample sets are not completely consistent; a determining module, used for determining, for each training image, the gradient of the training image according to the training image and the training recognition result corresponding to the training image, the determining module being further used for determining a first statistic and a second statistic of each training sample set according to the gradient of each training image, wherein the first statistic represents the mean vector corresponding to the training sample set and the second statistic represents the covariance matrix corresponding to the training sample set, and for determining a statistic loss function according to the first statistic and the second statistic; and an updating module, used for updating a preset model according to the statistic loss function to obtain the image recognition model.
Example 10 provides, in accordance with one or more embodiments of the present disclosure, an image recognition apparatus, comprising: a second acquisition module, used for acquiring an image to be recognized; and a processing module, used for inputting the image to be recognized into a pre-trained image recognition model to obtain a recognition result of the image to be recognized, wherein the image recognition model is obtained by training through the training apparatus of the image recognition model of example 9.
Example 11 provides, in accordance with one or more embodiments of the present disclosure, a computer-readable medium on which a computer program is stored, the program, when executed by a processing apparatus, implementing the steps of the methods described in examples 1-6 or examples 7-8.
Example 12 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising: a storage device having a computer program stored thereon; and a processing apparatus for executing the computer program in the storage device to implement the steps of the methods of examples 1-6 or examples 7-8.
The foregoing description is merely an illustration of the preferred embodiments of the present disclosure and of the principles of the technology employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combinations of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, a technical solution formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (10)

1. A method for training an image recognition model, the method comprising:
obtaining a plurality of training sample sets, wherein each training sample set comprises training images and training recognition results corresponding to the training images, the data distributions of the training sample sets are not completely consistent, the training sample sets are collected by different data centers, and each data center corresponds to one training sample set;
for each training image, determining the gradient of the training image according to the training image and a training recognition result corresponding to the training image;
determining a first statistic of each training sample set and a second statistic of each training sample set according to the gradient of each training image; the first statistic is used for representing a mean vector corresponding to the training sample set, and the second statistic is used for representing a covariance matrix corresponding to the training sample set;
determining a statistic loss function according to the first statistic and the second statistic;
updating a preset model according to the statistic loss function to obtain an image recognition model;
wherein determining a first statistic of each training sample set and a second statistic of each training sample set according to the gradient of each training image comprises:
determining the first statistic of each training sample set according to the gradients of all training images included in the training sample set; and
determining the second statistic of each training sample set according to the gradients of all training images included in the training sample set and the first statistic of the training sample set;
wherein the statistic loss function comprises a first statistic loss function and a second statistic loss function, and determining a statistic loss function based on the first statistic and the second statistic comprises:
determining the first statistic loss function corresponding to each two training sample sets according to the first statistics of the two training sample sets; and
determining the second statistic loss function corresponding to each two training sample sets according to the second statistics of the two training sample sets.
2. The method of claim 1, wherein the preset model comprises a feature extraction network and a classifier, and determining the gradient of the training image according to the training image and the training recognition result corresponding to the training image comprises:
preprocessing the training image to obtain a preprocessed training image;
inputting the preprocessed training image into the feature extraction network to obtain the image features of the training image;
inputting the image features of the training image into the classifier to obtain a prediction recognition result of the training image; and
determining the gradient of the training image according to the prediction recognition result, the training recognition result, and the image features of the training image.
3. The method of claim 2, wherein preprocessing the training image to obtain a preprocessed training image comprises:
performing random data enhancement on the training image to obtain the preprocessed training image, wherein the random data enhancement includes at least one of random scaling, random cropping, random flipping, and random color jittering.
4. The method of claim 1, wherein updating the preset model according to the statistic loss function to obtain the image recognition model comprises:
performing minimization processing on the first statistic loss function and the second statistic loss function corresponding to every two training sample sets, together with the initial loss function of the preset model, to obtain the image recognition model.
5. An image recognition method, characterized in that the method comprises:
acquiring an image to be recognized;
inputting the image to be recognized into a pre-trained image recognition model to obtain a recognition result of the image to be recognized; wherein the image recognition model is obtained by training through the training method of the image recognition model according to any one of claims 1-4.
6. The method according to claim 5, wherein inputting the image to be recognized into a pre-trained image recognition model to obtain the recognition result of the image to be recognized comprises:
in the case that the image to be recognized is an endoscope image, inputting the endoscope image into the image recognition model to obtain a recognition result indicating whether the endoscope image shows the ileocecal region.
7. An apparatus for training an image recognition model, comprising:
a first acquisition module, used for acquiring a plurality of training sample sets, wherein each training sample set comprises training images and training recognition results corresponding to the training images, the data distributions of the training sample sets are not completely consistent, the training sample sets are collected by different data centers, and each data center corresponds to one training sample set;
a determining module, used for determining, for each training image, the gradient of the training image according to the training image and the training recognition result corresponding to the training image;
the determining module is further configured to determine a first statistic of each training sample set and a second statistic of each training sample set according to the gradient of each training image; the first statistic is used for representing a mean vector corresponding to the training sample set, and the second statistic is used for representing a covariance matrix corresponding to the training sample set;
the determining module is further configured to determine a statistic loss function according to the first statistic and the second statistic;
the updating module is used for updating a preset model according to the statistic loss function to obtain an image recognition model;
wherein the determining module is used for:
determining the first statistic of each training sample set according to the gradients of all training images included in the training sample set; and
determining the second statistic of each training sample set according to the gradients of all training images included in the training sample set and the first statistic of the training sample set;
wherein the statistic loss function comprises a first statistic loss function and a second statistic loss function, and the determining module is further used for:
determining the first statistic loss function corresponding to each two training sample sets according to the first statistics of the two training sample sets; and
determining the second statistic loss function corresponding to each two training sample sets according to the second statistics of the two training sample sets.
8. An image recognition apparatus, characterized in that the image recognition apparatus comprises:
a second acquisition module, used for acquiring an image to be recognized; and
a processing module, used for inputting the image to be recognized into a pre-trained image recognition model to obtain a recognition result of the image to be recognized, wherein the image recognition model is obtained by training through the training apparatus of the image recognition model according to claim 7.
9. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processing apparatus, carries out the steps of the method of any one of claims 1-4 or 5-6.
10. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage device to carry out the steps of the method according to any one of claims 1-4 or 5-6.
CN202210309902.8A 2022-03-28 2022-03-28 Training method, recognition method, device, medium and equipment of image recognition model Active CN114419400B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210309902.8A CN114419400B (en) 2022-03-28 2022-03-28 Training method, recognition method, device, medium and equipment of image recognition model
PCT/CN2023/082355 WO2023185516A1 (en) 2022-03-28 2023-03-17 Method and apparatus for training image recognition model, and recognition method and apparatus, and medium and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210309902.8A CN114419400B (en) 2022-03-28 2022-03-28 Training method, recognition method, device, medium and equipment of image recognition model

Publications (2)

Publication Number Publication Date
CN114419400A CN114419400A (en) 2022-04-29
CN114419400B (en) 2022-07-29

Family

ID=81264319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210309902.8A Active CN114419400B (en) 2022-03-28 2022-03-28 Training method, recognition method, device, medium and equipment of image recognition model

Country Status (2)

Country Link
CN (1) CN114419400B (en)
WO (1) WO2023185516A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419400B (en) * 2022-03-28 2022-07-29 Beijing ByteDance Network Technology Co Ltd Training method, recognition method, device, medium and equipment of image recognition model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695209A (en) * 2020-05-13 2020-09-22 Southeast University Rotary machine small sample health assessment method driven by meta-depth learning
CN112801054A (en) * 2021-04-01 2021-05-14 Tencent Technology (Shenzhen) Co Ltd Face recognition model processing method, face recognition method and device
CN113706526A (en) * 2021-10-26 2021-11-26 Beijing ByteDance Network Technology Co Ltd Training method and device for endoscope image feature learning model and classification model

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10078812B2 (en) * 2011-10-03 2018-09-18 Avocent Huntsville, Llc Data center infrastructure management system having real time enhanced reality tablet
JP6632193B2 (en) * 2015-01-16 2020-01-22 Canon Inc Information processing apparatus, information processing method, and program
CN111476309B (en) * 2020-04-13 2023-05-23 Douyin Vision Co Ltd Image processing method, model training method, device, equipment and readable medium
CN112749663B (en) * 2021-01-15 2023-07-07 Jinling Institute of Technology Agricultural fruit maturity detection system based on Internet of things and CCNN model
CN113268833B (en) * 2021-06-07 2023-07-04 Chongqing University Migration fault diagnosis method based on depth joint distribution alignment
CN113505820B (en) * 2021-06-23 2024-02-06 Beijing Yueshi Intelligent Technology Co Ltd Image recognition model training method, device, equipment and medium
CN114240867A (en) * 2021-12-09 2022-03-25 Xiaohe Medical Instrument (Hainan) Co Ltd Training method of endoscope image recognition model, endoscope image recognition method and device
CN114419400B (en) * 2022-03-28 2022-07-29 Beijing ByteDance Network Technology Co Ltd Training method, recognition method, device, medium and equipment of image recognition model

Also Published As

Publication number Publication date
CN114419400A (en) 2022-04-29
WO2023185516A1 (en) 2023-10-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant