WO2023030520A1 - Training method and apparatus of endoscope image classification model, and image classification method - Google Patents

Training method and apparatus of endoscope image classification model, and image classification method

Info

Publication number
WO2023030520A1
Authority
WO
WIPO (PCT)
Prior art keywords
classification model
image classification
endoscopic image
network
loss function
Prior art date
Application number
PCT/CN2022/117043
Other languages
French (fr)
Chinese (zh)
Inventor
边成
李永会
杨延展
Original Assignee
北京字节跳动网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司
Publication of WO2023030520A1 publication Critical patent/WO2023030520A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10068Endoscopic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30101Blood vessel; Artery; Vein; Vascular

Definitions

  • Embodiments of the present disclosure relate to a training method of an endoscope image classification model integrated with knowledge distillation, an image classification method, a device, and a computer-readable medium.
  • Colorectal cancer is the third most common cancer and the fourth most deadly cancer in the world, and more than 95% of colorectal cancers are caused by colonic polyps.
  • Among detected polyps, adenomas account for the majority, about 10.86% to 80%. It is generally believed that colorectal cancer originates from adenomatous polyps, whose cancerous rate is 1.4% to 9.2%.
  • Other types of polyps, such as hyperplastic polyps and inflammatory polyps (2.32% to 13.8%), account for only a small proportion, showing a long-tailed distribution.
  • Embodiments of the present disclosure provide a method for training an endoscopic image classification model, an endoscopic image classification method, an apparatus, and a computer-readable medium.
  • Embodiments of the present disclosure provide a multi-expert-decision-based training method for an endoscopic image classification model, wherein the endoscopic image classification model includes a plurality of expert sub-networks, and the method includes: obtaining a training data set, the training data set including a plurality of endoscopic images and annotation labels of the plurality of endoscopic images, wherein the training data set presents a long-tail distribution; and training the endoscopic image classification model based on the training data set until a target loss function of the endoscopic image classification model converges, to obtain a trained endoscopic image classification model, wherein the target loss function is determined based at least on the corresponding output results of the plurality of expert sub-networks.
  • training the endoscope image classification model based on the training data set includes: inputting image samples in the training image sample set into each of the plurality of expert sub-networks; using the plurality of expert sub-networks to generate a corresponding plurality of expert sub-network output results for the image sample; generating a final output result of the endoscopic image classification model based on the plurality of expert sub-network output results; and calculating a loss value through a target loss function based on at least the plurality of expert sub-network output results and the final output result, and adjusting parameters of the endoscopic image classification model based on the loss value.
  • the endoscopic image classification model further includes a shared sub-network
  • training the endoscopic image classification model based on the training data set includes: inputting image samples in the training image sample set into the shared sub-network to extract shallow feature representations; based on the extracted shallow feature representations, using the multiple expert sub-networks to generate corresponding output results of the multiple expert sub-networks for the image sample; generating a final output result of the endoscopic image classification model based on the output results of the multiple expert sub-networks; and calculating a loss value through a target loss function based on at least the output results of the multiple expert sub-networks and the final output result, and adjusting parameters of the endoscopic image classification model based on the loss value.
  • the target loss function of the endoscope image classification model includes: a cross-entropy loss function determined based on the final output result of the endoscope image classification model and the annotation labels of the image samples, and a KL divergence determined based on the output results of the multiple expert sub-networks.
  • generating the final output result of the endoscope image classification model includes: fusing the output results of the plurality of expert sub-networks as the final output result of the endoscope image classification model.
  • merging the output results of the multiple expert sub-networks includes: performing a weighted average on the output results of the multiple expert sub-networks.
  • the endoscopic image classification model further includes a student network having the same structure as the expert sub-networks, wherein the plurality of expert sub-networks form a teacher network, and the teacher network is used to train the student network based on knowledge distillation, the method further comprising utilizing the student network to generate a corresponding student network output result for the image sample.
  • calculating the loss value through the target loss function includes: calculating the loss value through the target loss function based on the output results of the plurality of expert sub-networks, the final output result, and the output result of the student network.
  • the target loss function is a weighted sum of the loss function of the teacher network and the loss function of the student network.
  • the sum of the weight value of the loss function of the teacher network and the weight value of the loss function of the student network is 1, wherein the weight value of the loss function of the teacher network decreases continuously with the training iterations until it finally decreases to 0, and the weight value of the loss function of the student network increases continuously with the training iterations until it finally increases to 1.
  • the loss function of the teacher network includes: a cross-entropy loss function determined based on the final output result of the endoscope image classification model and the annotation labels of the image samples, and a KL divergence determined based on the output results of the multiple expert sub-networks;
  • the loss function of the student network includes: a cross-entropy loss function determined based on the student network output result of the student network and the final output result of the endoscope image classification model, and a KL divergence determined based on the student network output result of the student network and the plurality of expert sub-network output results generated by the plurality of expert sub-networks.
  • each of the plurality of expert sub-networks includes a multi-layer Transformer encoder connected in sequence, and a classifier.
  • a method for classifying endoscopic images, including: acquiring an endoscopic image to be identified; and inputting the endoscopic image into a trained endoscopic image classification model to obtain a classification result of the endoscopic image; wherein the trained endoscopic image classification model is obtained based on the training method of the endoscopic image classification model as described above.
  • a method for classifying endoscopic images, including: acquiring an endoscopic image to be recognized; and obtaining a classification result of the endoscopic image based on the student network in the trained endoscopic image classification model; wherein the trained endoscopic image classification model is obtained based on the above-mentioned training method of the endoscopic image classification model.
  • an endoscope image classification system including: an image acquisition component used to acquire an endoscope image to be recognized; a processing component used to obtain a classification result of the endoscope image based on a trained endoscope image classification model; and an output component used to output the classification result of the image to be recognized, wherein the trained endoscopic image classification model is obtained by the training method of the endoscopic image classification model as described above.
  • an endoscope image classification system including: an image acquisition component used to acquire an endoscope image to be recognized; a processing component used to obtain a classification result of the endoscope image based on the student network in a trained endoscope image classification model; and an output component used to output the classification result of the image to be recognized, wherein the trained endoscopic image classification model is obtained by the above-mentioned training method of the endoscopic image classification model.
  • a training device for an endoscopic image classification model based on multi-expert decision-making, wherein the endoscopic image classification model includes a plurality of expert sub-networks, and the device includes: a training data set acquisition component used to obtain a training data set, the training data set including a plurality of endoscopic images and annotation labels of the plurality of endoscopic images, wherein the training data set presents a long-tail distribution; and a training component for training the endoscope image classification model based on the training data set until the target loss function of the endoscope image classification model converges, to obtain a trained endoscope image classification model, wherein the target loss function is determined based at least on the corresponding output results of the multiple expert sub-networks.
  • An embodiment of the present disclosure also provides an electronic device, including a memory and a processor, wherein the memory stores program codes readable by the processor, and when the processor executes the program codes, the above-mentioned method is performed.
  • Embodiments of the present disclosure also provide a computer-readable storage medium, on which computer-executable instructions are stored, and the computer-executable instructions are used to execute the method as described above.
  • a method based on multi-expert joint decision-making is proposed to learn the unbalanced data distribution in line with the actual situation. It does not need to know the data distribution in advance, and can improve the model's prediction accuracy on both head and tail data at the same time without introducing bias. In addition, the model is compressed by knowledge distillation to make it more concise.
  • Fig. 1 shows a schematic diagram of the application architecture of the endoscopic image classification model training and the endoscopic image classification method in the embodiment of the present disclosure
  • Fig. 2 shows an exemplary block diagram of Vision Transformer (ViT);
  • Fig. 3 shows a schematic diagram of ViT in Fig. 2 flattening the original picture into a sequence
  • Fig. 4 shows a polyp imaging image according to an embodiment of the present disclosure
  • FIG. 5A shows a schematic structure of an endoscopic image classification model 500A according to an embodiment of the present disclosure
  • FIG. 5B shows a schematic structure of an endoscope image classification model 500B according to another embodiment of the present disclosure
  • FIG. 5C shows a schematic structure of an endoscopic image classification model 500C using Transformer as a feature extractor according to yet another embodiment of the present disclosure
  • FIG. 6A shows a flowchart of a method for training an endoscopic image classification model according to one embodiment of the present disclosure
  • FIG. 6B shows a more specific exemplary description of step S603 in FIG. 6A;
  • FIG. 7A shows a schematic diagram of an endoscopic image classification model 700A incorporating knowledge distillation according to an embodiment of the present disclosure
  • FIG. 7B shows a schematic diagram of an endoscopic image classification model 700B incorporating knowledge distillation according to another embodiment of the present disclosure
  • FIG. 7C shows a schematic diagram of an endoscopic image classification model 700C incorporating knowledge distillation according to yet another embodiment of the present disclosure
  • FIG. 8 shows a flowchart of a method for training an endoscopic image classification model incorporating knowledge distillation according to one embodiment of the present disclosure
  • FIG. 9 depicts a flowchart of a method for classifying endoscopic images according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic structural diagram of an endoscope image classification system in an embodiment of the present disclosure.
  • FIG. 11 shows a training device for an endoscopic image classification model according to an embodiment of the present disclosure.
  • FIG. 12 shows a schematic diagram of a storage medium according to an embodiment of the disclosure.
  • any number of different modules may be used and run on the user terminal and/or the server.
  • the modules are illustrative only, and different aspects of the systems and methods may use different modules.
  • images of lesions inside the gastrointestinal tract are usually obtained based on diagnostic tools such as endoscopes, and relevant medical personnel judge the type of lesions by observing with human eyes.
  • some work has tried to use deep learning to automatically identify lesion categories, but these lesion types usually have long-tail distribution characteristics.
  • Existing work usually does not consider the characteristics of the polyp type distribution: either a convolutional neural network is trained directly, or the distribution of the data set is rebalanced before training, which is obviously not in line with reality.
  • Given the properties of polyp data, direct training without considering the imbalance of the data easily makes the model unable to identify the tail data well, while rebalancing the data set before training easily leads to overfitting of the tail data, causing certain losses to the accuracy on the head data.
  • Therefore, this disclosure proposes a multi-expert joint algorithm that adapts to the long-tail data distribution and can improve the accuracy of the head and tail at the same time, and uses knowledge distillation to integrate it into a more compact model.
  • FIG. 1 shows a schematic diagram of the application architecture of the endoscopic image classification model training method and the endoscopic image classification method according to an embodiment of the present disclosure, including a server 100 and a terminal device 200.
  • the terminal device 200 may be a medical device, for example, a user may view endoscopic image classification results based on the terminal device 200 .
  • the terminal device 200 and the server 100 may be connected through the Internet to realize mutual communication.
  • the aforementioned Internet uses standard communication technologies and/or protocols.
  • the network is usually the Internet, but can be any network, including but not limited to a Local Area Network (LAN), Metropolitan Area Network (MAN), Wide Area Network (WAN), mobile, wired or wireless network, private network, or any combination of virtual private networks.
  • data exchanged over a network is represented using technologies and/or formats including Hyper Text Markup Language (HTML), Extensible Markup Language (XML), and the like.
  • In addition, all or some links may be encrypted using conventional encryption technologies such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec).
  • customized and/or dedicated data communication technologies may also be used to replace or supplement the above data communication technologies.
  • the server 100 may provide various network services for the terminal device 200, wherein the server 100 may be a server, a server cluster composed of several servers, or a cloud computing center.
  • the server 100 may include a processor 110 (Central Processing Unit, CPU), a memory 120, an input device 130, an output device 140, etc.
  • the input device 130 may include a keyboard, a mouse, a touch screen, etc.
  • the output device 140 may include a display device, such as a liquid crystal display (Liquid Crystal Display, LCD), a cathode ray tube (Cathode Ray Tube, CRT), and so on.
  • the memory 120 may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides program instructions and data stored in the memory 120 to the processor 110 .
  • the memory 120 may be used to store the program of the endoscope image classification model training method or of the endoscope image classification method based on the trained endoscope image classification model in the embodiments of the present disclosure.
  • the processor 110 calls the program instructions stored in the memory 120 and, according to the obtained program instructions, executes the steps of any endoscopic image classification model training method or of any endoscopic image classification method based on a trained endoscopic image classification model in the embodiments of the present disclosure.
  • the endoscopic image classification model training method or the endoscopic image classification method based on the trained endoscopic image classification model is mainly executed on the server 100 side. For example, for the endoscopic image classification method, the terminal device 200 can send collected images of gastrointestinal lesions (for example, polyps) to the server 100; the server 100 identifies the type of the lesion images and can return the lesion classification result to the terminal device 200.
  • the application architecture shown in FIG. 1 is described by taking the application on the server 100 side as an example.
  • the endoscopic image classification method in the embodiments of the present disclosure can also be executed by the terminal device 200; for example, the terminal device 200 can download from the server 100 side the trained endoscopic image classification model fused with knowledge distillation, and, based on the student network in that model, identify the types of lesion images and obtain the lesion classification result.
  • the disclosed embodiments are not limited in this respect.
  • Various embodiments of the present disclosure are schematically described below by taking the application architecture diagram shown in FIG. 1 as an example.
  • Knowledge distillation usually adopts a teacher-student architecture, using the knowledge learned by a large model (the teacher) to guide the training of a small model (the student), so that the small model achieves performance comparable to the large model while the number of parameters is greatly reduced, enabling model compression and acceleration.
  • The full name of KL divergence is Kullback-Leibler divergence, which is generally used to measure the "distance" between two probability distribution functions. For two probability distributions P and Q of a discrete random variable, their KL divergence is defined as:
    D_KL(P ‖ Q) = Σ_x P(x) · log(P(x) / Q(x))
  • KL divergence is a commonly used loss function in the field of machine learning.
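As a concrete illustration, the following is a minimal sketch of this definition in plain Python; the two distributions are arbitrary example values, not data from the disclosure:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete probability distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl_divergence(p, q))   # > 0; note D_KL(P||Q) != D_KL(Q||P) in general
print(kl_divergence(p, p))   # 0.0: the divergence vanishes only for identical distributions
```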
  • a typical Transformer includes a multi-head attention (Multi-head Attention) module and a multi-layer perceptron (MLP, Multilayer Perceptron) module.
  • a multi-head attention module helps the encoder to look at other words while encoding a specific word.
  • Each module is preceded by a layer normalization (Layer Normalization) module, and residual connections are used to connect the modules.
  • The layer normalization module imposes constraints on the "scale" problem that the accumulation of multiple embeddings may bring to the Transformer learning process, which is equivalent to imposing constraints on the polysemy space expressing each word, and effectively reduces the variance of the model.
  • Vision Transformer is a technology that transfers Transformer from natural language processing to image processing.
  • FIG. 2 shows an exemplary block diagram of ViT. Similar to the series of word embeddings used when applying the Transformer to text, ViT divides the original image into a grid of squares; each square is flattened into a single vector by concatenating all pixel channels in the square and then linearly projecting the result to the desired input dimension using a linear mapper. ViT is agnostic to the structure of the input elements, so a position encoder is further used to add a learnable position embedding to each square vector, enabling the model to understand the image structure. Finally, the flattened sequence is input into the encoder part of the original Transformer model (such as the m-layer Transformer encoder block shown in FIG. 2) for feature extraction, and a fully connected layer is connected at the end to perform tasks such as image classification or segmentation.
  • Fig. 3 shows a schematic diagram of ViT in Fig. 2 flattening the original picture into a sequence.
  • the image input into ViT is a polyp white light image of H × W × C, where H and W are the numbers of pixels in the length and width directions, respectively, and C is the number of channels.
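To make the flattening step concrete, the following is a minimal sketch in PyTorch; the library choice, the 16×16 patch size, the 768 embedding dimension, and the 224×224 input are illustrative assumptions, not values from the disclosure:

```python
import torch
import torch.nn as nn

H, W, C, patch, dim = 224, 224, 3, 16, 768
n_patches = (H // patch) * (W // patch)                    # 14 * 14 = 196 squares

to_patches = nn.Unfold(kernel_size=patch, stride=patch)    # flatten each square's pixels
linear_mapper = nn.Linear(C * patch * patch, dim)          # project to the input dimension
pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))   # learnable position embeddings

x = torch.randn(1, C, H, W)                # e.g. one polyp white-light image tensor
seq = to_patches(x).transpose(1, 2)        # (1, n_patches, C*patch*patch)
tokens = linear_mapper(seq) + pos_embed    # sequence fed into the m-layer encoder
print(tokens.shape)                        # torch.Size([1, 196, 768])
```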
  • the Vision Transformer can be used as a backbone network (backbone) to extract features, so as to obtain key information in the image more accurately.
  • In computer vision tasks, image features are generally extracted first. This part is the foundation of the entire CV task, because subsequent downstream tasks (such as classification, generation, etc.) are based on the extracted image features, so this part of the network structure is called the backbone network.
  • embodiments of the present disclosure may also use other network architectures as the backbone network, such as VggNet and ResNet architectures, etc., and the present disclosure is not limited here.
  • FIG. 4 shows a polyp imaging image according to an embodiment of the present disclosure.
  • the endoscope enters the human body through the natural orifice of the human body or through a small surgical incision to obtain images of the lesion, which are subsequently used for diagnosis and treatment of the disease.
  • FIG. 4 shows polyp images captured by the endoscope. The image on the left is an observation of the polyp obtained by the endoscope operating in white light (WL) imaging mode, and the image on the right is another observation of the same polyp obtained with the endoscope operating in Narrow Band Imaging (NBI) mode.
  • the broadband spectrum of white light is composed of three kinds of light, R/G/B (red/green/blue), and their wavelengths are 605nm, 540nm, and 415nm respectively.
  • the narrow-band light mode uses a narrow-band filter to replace the traditional broadband filter to limit the light of different wavelengths, leaving only the green and blue narrow-band light waves with wavelengths of 540nm and 415nm.
  • the image generated under the narrow-band light mode has significantly enhanced contrast between blood vessels and mucosa, which is suitable for observing the morphology of blood vessels and mucosal structure on the surface of the mucosa.
  • The existing automatic recognition work for endoscopic image classification is basically based on ordinary convolutional neural networks, usually using an off-the-shelf convolutional neural network such as ResNet, VGG, or Inceptionv3. However, these works all use only traditional training methods and do not take into account the uneven distribution of certain endoscopic image types. For example, among detected polyps, adenomas usually account for the majority, while other polyp types such as hyperplastic polyps and inflammatory polyps account for only a small proportion, showing a long-tailed distribution.
  • this disclosure proposes a multi-expert joint algorithm that is suitable for long-tail data distribution and can improve the accuracy of the head and tail at the same time.
  • white light images of polyps are used to construct a data set exhibiting a long-tailed distribution.
  • the trained endoscope image classification model can better identify polyp images exhibiting long-tail distribution.
  • any other endoscopic images of gastrointestinal lesions with uneven distribution can also be used to construct a data set on which, for example, the endoscopic image classification model is trained according to the method of the present disclosure.
  • These endoscopic images may be images acquired by the endoscope in any suitable mode, such as narrow-band light images, autofluorescence images, I-SCAN images, and the like.
  • the above various modal images may also be mixed to construct a data set, which is not limited in the present disclosure.
  • the embodiment of the present disclosure aims at the long-tail distribution of polyp images, and proposes a multi-expert decision-making endoscopic image classification model.
  • On the one hand, the overall prediction accuracy is improved by fusing the decision results of multiple experts; on the other hand, maximizing the distribution distance between the prediction results of the experts allows different experts to pay attention to different data distributions, thereby improving the learning ability on unbalanced data sets.
  • FIG. 5A shows a schematic structure of an endoscopic image classification model 500A according to one embodiment of the present disclosure.
  • an endoscopic image classification model 500A includes n expert sub-networks, where n is an integer greater than 2, for example.
  • Each expert sub-network consists of a feature extractor and a classifier.
  • each expert sub-network here can have the same network structure, and the structure of each expert sub-network can be any deep learning network structure that can be used to perform classification tasks. Such a network structure usually includes a feature extractor for extracting feature representations and a classifier for classification.
  • the feature extractor here can be the Vision Transformer shown in Figure 2.
  • the input image is first flattened into N one-dimensional vectors based on the linear mapping module and the position encoder, and then feature extraction is performed through the m-layer Transformer encoder block.
  • the classifier here can be a multi-head normalized classifier; based on the feature representation of the image sample received from the Vision Transformer, the classifier can output the predicted classification probability value of the image sample.
  • the feature extractor and classifier in the multi-expert sub-network in the embodiment of the present disclosure may be any other structures that can perform similar functions.
  • the feature extractor here can also be a deep residual network (Deep residual network, ResNet), for example, the classifier here can also be the convolutional layer part of the ResNet network, and this disclosure is not limited here.
  • the final optimization objective of the endoscopic image classification model can be determined as the following two: one is to minimize the loss between the final classification prediction output by the endoscopic image classification model and the real label, so as to improve the prediction accuracy of the endoscope image classification model.
  • the other is to maximize the distribution distance between the classification prediction values output by multiple experts, so that multiple experts can focus on different data distributions of the dataset.
  • the loss between the final output classification prediction value of the endoscope image classification model and the true label may be calculated based on a cross-entropy loss function.
  • the difference between different experts can be maximized by maximizing the KL divergence between the classification prediction values output by different experts.
  • the embodiment of the present disclosure constructs the target loss function for training the endoscopic image classification model based on the cross-entropy loss function and KL divergence.
  • the target loss function is continuously optimized until it is minimized and converges, at which point the training of the endoscope image classification model is complete.
  • Since each expert sub-network in the above-mentioned endoscopic image classification model 500A starts from the original image, it first extracts a shallow feature representation based on the shallower layers of the network, and then extracts task-specific deeper feature representations based on the deeper network structure.
  • Because shallow feature representations are usually generic, these expert sub-networks can share the shallow feature representation extracted by the same shallow feature extractor, and then further learn specific deep features for the classification task based on their own deep feature extractors.
  • the present disclosure proposes a variation of the endoscopic image classification model 500A, as shown in FIG. 5B .
  • In the endoscopic image classification model 500B of FIG. 5B, multiple expert sub-networks share a shallow feature extractor, and each expert sub-network has its own deep feature extractor and final classifier. By sharing some common shallow feature extractors, the endoscopic image classification model 500B has a more compact structure than the endoscopic image classification model 500A.
  • the shallow feature extractors here may be some common shallow structures in the feature extractors of multiple expert sub-networks in the endoscopic image classification model 500A of FIG. 5A .
  • the shallow feature extractor here can be the linear mapper layer, the position encoder layer, and one Transformer encoder block of the Vision Transformer.
  • These expert sub-networks can share this common shallow feature extractor to obtain common shallow features, and use the remaining (m-1) layers of Transformer encoder blocks as deep feature extractors to extract specific deep features, as shown in the endoscope classification model 500C in FIG. 5C.
  • the shared sub-network and deep feature extractor here can also be any other suitable feature extractors for extracting image features.
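For concreteness, the following is a minimal sketch of this shared-trunk multi-expert structure; PyTorch, the single-layer encoder blocks, the token pooling, and the default dimensions are all illustrative assumptions (in the disclosure the shared and deep extractors are the Vision Transformer layers described above):

```python
import torch
import torch.nn as nn

class MultiExpertClassifier(nn.Module):
    def __init__(self, dim=768, n_experts=3, n_classes=4):
        super().__init__()
        # shared sub-network: extracts the common shallow feature representation
        self.shared = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        # each expert: its own deep feature extractor plus its own classifier
        self.deep = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(n_experts))
        self.heads = nn.ModuleList(nn.Linear(dim, n_classes) for _ in range(n_experts))

    def forward(self, tokens):                    # tokens: (B, N, dim) patch sequence
        shallow = self.shared(tokens)             # shallow features, computed once
        return [head(deep(shallow).mean(dim=1))   # pool tokens, then classify
                for deep, head in zip(self.deep, self.heads)]

model = MultiExpertClassifier()
logits_list = model(torch.randn(2, 196, 768))     # one logits tensor per expert
print(len(logits_list), logits_list[0].shape)     # 3 torch.Size([2, 4])
```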
  • FIG. 6A shows a flowchart of a method 600 for training an endoscopic image classification model according to one embodiment of the present disclosure.
  • the endoscopic image classification model here is the endoscopic image classification model 500A shown above with reference to FIG. 5A .
  • the training method 600 of the endoscopic image classification model 500A can be executed by a server, which can be the server 100 shown in FIG. 1 .
  • a training data set is obtained; the training data set includes a plurality of endoscopic images and annotation labels of the plurality of endoscopic images, wherein the training data set presents a long-tail distribution.
  • the training data set here can be prepared by simulating the long-tailed distribution of polyp types in the real situation.
  • the training data set here may include 2131 white light images of polyps with four kinds of labels, namely adenoma, hyperplasia, inflammation and cancer, where images with the adenoma label account for the majority (e.g., 65%), while images with other label types such as hyperplastic polyps, inflammatory polyps, and cancer account for only a small proportion (e.g., only 13%, 12%, and 10%, respectively), so that the entire training data set presents a long-tailed distribution.
  • the training data set here may be obtained by operating an endoscope, downloaded from a network, or obtained in other ways, which is not limited in the embodiments of the present disclosure.
  • embodiments of the present disclosure may also be applicable to image classification of other digestive tract lesions other than polyps, such as inflammation, ulcer, vascular malformation, and diverticulum, and the present disclosure is not limited thereto.
  • step S603 the endoscope image classification model is trained based on the training data set until the target loss function of the endoscope image classification model converges, so as to obtain a trained endoscope image classification model.
  • The goal here is, on the one hand, to improve the overall accuracy of the prediction by fusing the decision results of multiple experts and, on the other hand, to maximize the distribution distance between the prediction results of the multiple experts so that different experts can focus on different data distributions, thereby improving the learning ability on data sets with imbalanced distribution. Therefore, based on the multi-expert-decision endoscopic image classification model 500A, minimizing the cross-entropy loss between the final output classification prediction and the real label, and maximizing the KL divergence between the classification predictions output by different expert sub-networks, are used as the training targets for training the endoscopic image classification model according to the embodiment of the present application.
  • the training of the endoscope image classification model based on the training data set in step S603 may include the following sub-steps S603_1-S603_4.
  • step S603_1 the image samples in the training image sample set are input into each of the plurality of expert sub-networks.
  • step S603_2 using the plurality of expert sub-networks to generate corresponding output results of the plurality of expert sub-networks for the image sample.
  • Let the input image be x.
  • The feature extractor here is the Vision Transformer as described above, represented by a function F(·; θ_i), where θ_i represents the parameters of the i-th expert sub-network.
  • The extracted features are then expressed as F(x; θ_i).
  • When the shared sub-network is used, the extracted features can also be expressed as F_i(f(x); θ_i), where f(x) represents the output of the shared sub-network and F_i represents the deep feature extractor of the i-th expert sub-network.
  • The classifier here can be a multi-head normalized classifier. Based on the multi-head normalized classifier, the logits of the i-th expert sub-network are calculated as in equation (1), where γ and τ are parameters, K is the number of heads, and w_i is the weight parameter of the classifier in the i-th expert sub-network.
  • Normalizing the logits of equation (1) with softmax gives the probability value of the predicted classification, as shown in equation (2) below:
    p_i(x; θ_i) = softmax(logits_i)    (2)
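The following sketch shows one plausible form of such a multi-head normalized (cosine) classifier followed by the softmax of equation (2); the exact parameterization of γ and τ in the disclosure's equation (1) is not reproduced, so the classifier body is an illustrative assumption (PyTorch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadNormClassifier(nn.Module):
    def __init__(self, dim=768, n_classes=4, K=2, gamma=30.0):
        super().__init__()
        self.K, self.gamma = K, gamma
        # w_i: one weight matrix per head (an assumed layout)
        self.w = nn.Parameter(torch.randn(K, n_classes, dim))

    def forward(self, feat):                      # feat: (B, dim) expert features
        f = F.normalize(feat, dim=-1)             # normalize the feature vector
        logits = 0.0
        for k in range(self.K):                   # average cosine logits over K heads
            w = F.normalize(self.w[k], dim=-1)    # normalize each class weight vector
            logits = logits + (self.gamma / self.K) * (f @ w.t())
        return logits

clf = MultiHeadNormClassifier()
p = F.softmax(clf(torch.randn(2, 768)), dim=-1)   # equation (2)
print(p.sum(dim=-1))                              # each row sums to 1
```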
  • step S603_3 based on the output results of the plurality of expert sub-networks, a final output result of the endoscopic image classification model is generated.
  • the output results of multiple expert sub-networks can be fused to obtain the final result of the endoscopic image classification model.
  • the fusion here can be a linear average, as shown in equation (3) below:
    p_soft(x) = (1/n) · Σ_{i=1..n} p_i(x; θ_i)    (3)
  • n is the number of expert sub-networks in the endoscopic image classification model
  • p soft (x) is the final prediction result of the endoscopic image classification model.
  • step S603_4 a loss value is calculated through a target loss function, and parameters of the endoscopic image classification model are adjusted based on the loss value.
  • One goal is that the final result of multi-expert fusion be closer to the real label; the other goal is to maximize the distribution distance between the output results of the multiple experts, so that the experts can focus on different distributions of the data.
  • the objective function can include two parts. The first part is the cross-entropy loss function between the fused classification prediction probability and the real label of the image sample, for example, as shown in equation (4) below:
    L_1 = L_ce(p_soft(x), y)    (4)
  • where L_ce represents the cross-entropy loss function, p_soft(x) is the final prediction result of the endoscopic image classification model obtained after fusing the prediction results of the multiple expert sub-networks, and y is the true label of the image sample.
  • The second part of the objective function is the negative KL divergence between the classification prediction probabilities output by the multiple expert sub-networks.
  • The smaller the KL divergence, the closer the distances between different distributions. Since the ultimate optimization goal of a loss function is minimization, the difference between the output distributions of the expert sub-networks is increased by minimizing the negative KL divergence, for example as in equation (5):
    L_KL(θ_i) = (1/(n-1)) · Σ_{j≠i} D_KL(p_i(x; θ_i) ‖ p_j(x; θ_j))    (5)
  • Equation (5) above expresses the average of the KL divergences between the output of the i-th expert sub-network and the outputs of the remaining (n-1) expert sub-networks, where n represents the number of expert sub-networks, θ_i represents the parameters of the i-th expert sub-network, and c is the number of label categories over which each divergence is summed.
  • Based on the two parts above, the total loss function of the training method of the endoscopic image classification model according to one embodiment of the present disclosure can be defined as shown in equation (7) below:
    L_total = L_1 − (1/n) · Σ_{i=1..n} L_KL(θ_i)    (7)
  • the parameters of the endoscope image classification model of the embodiment of the present disclosure can be adjusted, so that as the iterative training continues, the total loss function is minimized to obtain a trained endoscope image classification model.
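Putting equations (3)–(7) together, the following is a minimal sketch of this target loss; PyTorch is an assumption, and `logits_list` stands for the outputs of the n expert sub-networks:

```python
import torch
import torch.nn.functional as F

def multi_expert_loss(logits_list, target):
    probs = [F.softmax(l, dim=-1) for l in logits_list]
    p_soft = torch.stack(probs).mean(dim=0)                  # equation (3): fused prediction
    l_ce = F.nll_loss(torch.log(p_soft + 1e-12), target)     # equation (4): CE to true label
    n, kl = len(probs), 0.0
    for i in range(n):                                       # equation (5): pairwise KL
        for j in range(n):
            if i != j:
                # F.kl_div(log q, p) computes D_KL(p || q)
                kl = kl + F.kl_div(torch.log(probs[j] + 1e-12), probs[i],
                                   reduction="batchmean")
    kl = kl / (n * (n - 1))
    return l_ce - kl                                         # equation (7): minimize CE, maximize KL

logits = [torch.randn(2, 4) for _ in range(3)]               # 3 experts, 4 classes
print(multi_expert_loss(logits, torch.tensor([0, 2])))
```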
  • The embodiment of the present disclosure is thus based on multi-expert joint decision-making, taking as training goals that the fused final result of the multiple experts be closest to the real label and that the distribution distance between the output results of the multiple experts be maximized, so that the trained endoscope image classification model can adapt to the data distribution and can improve the prediction accuracy on the head and the tail simultaneously.
  • this disclosure further compresses the endoscopic image classification model structure composed of multiple expert sub-networks based on knowledge distillation, so that it is integrated into a more concise student network.
  • FIG. 7A shows a schematic diagram of an endoscopic image classification model 700A incorporating knowledge distillation according to another embodiment of the present disclosure.
  • an endoscopic image classification model 700A incorporating knowledge distillation includes two sub-networks, namely a teacher network 703A and a student network 705A.
  • the teacher network 703A here may be a plurality of expert sub-networks in the endoscopic image classification model 500A described in FIG. 5A .
  • the student network 705A here may have the same structure as each expert sub-network.
  • a student network 705A with the same structure as each expert sub-network is designed. Based on the principle of knowledge distillation, the multiple expert sub-networks are used as the teacher network to train the student network, so that a trained student network is finally obtained which, compared with the original multi-expert network structure, has a simpler structure and fewer parameters while achieving an accuracy close to that of the multi-expert classification network.
  • FIG. 7B shows a schematic diagram of an endoscopic image classification model 700B incorporating knowledge distillation according to another embodiment of the present disclosure.
  • the endoscopic image classification model 700B incorporating knowledge distillation includes a shared sub-network 701B in addition to a teacher network 703B and a student network 705B.
  • the teacher network 703B here may be a plurality of expert sub-networks constituting the endoscopic image classification model 500B described in FIG. 5B .
  • Both the teacher network 703B and the student network 705B are connected to a shared sub-network 701B, and further deep feature extraction is performed based on the shallow feature representation extracted by the shared sub-network 701B to perform classification tasks.
  • the shallow feature extractor in the shared subnetwork 701B and the deep feature extractors in the multiple expert subnetworks here may also be any other suitable feature extractors for extracting image features.
  • FIG. 7C shows an exemplary endoscopic image classification model 700C incorporating knowledge distillation using Transformer as a feature extractor.
  • the shared sub-network 701C here can be a Vision Transformer, which includes a linear mapper layer, a position encoder layer and a traditional Transformer encoder block.
  • These expert sub-networks in the teacher network 703C and the student network 705C can share this common shallow feature extractor (i.e., the shared sub-network 701C) to obtain common shallow features, and use multiple layers of traditional Transformer encoder blocks (e.g., the layers shown in FIG. 7C) as deep feature extractors to extract specific deep features for classification and recognition, as shown in FIG. 7C.
  • FIG. 8 shows a flowchart of a method 800 for training an endoscopic image classification model incorporating knowledge distillation according to one embodiment of the present disclosure.
  • step S801 the image samples in the training image sample set are input into each of the plurality of expert sub-networks of the teacher network and into the student network.
  • the endoscopic image classification model fused with knowledge distillation here may be the model 700A shown in FIG. 7A .
  • step S803 use the multiple expert sub-networks to generate corresponding output results of multiple expert sub-networks for the image sample, and use the student network to generate a corresponding student network for the image sample Output the result.
  • the process of generating the network output result here is similar to step S603_2 in FIG. 6B , and its repeated description will be omitted here.
  • step S805 a final output result of the teacher network is generated based on the output results of the plurality of expert sub-networks.
  • the process of generating the final output result of the teacher network here is similar to step S603_3 in FIG. 6B , and its repeated description will be omitted here.
  • step S807 a loss value is calculated through a target loss function, and parameters of the endoscopic image classification model fused with knowledge distillation are adjusted based on the loss value.
  • the training method 800 for an endoscope image classification model incorporating knowledge distillation uses the model 500A, 500B or 500C as a teacher network, and trains a student network with a relatively simplified structure and parameters based on knowledge distillation.
  • in addition to the goals 1) and 2) above, the training method 800 of the endoscopic image classification model fused with knowledge distillation is also expected to achieve the following two further goals: 3) make the output results of the student network closer to the output results of the teacher network, and 4) make the output distribution of the student network closer to the distribution of the output results of each expert sub-network in the teacher network.
  • the embodiment of the present disclosure constructs the loss function of the teacher network based on the above objectives 1) and 2), as shown in equation (8):
    L_teacher = L_ce(p_soft(x), y) − (1/n) · Σ_{i=1..n} L_KL(θ_i)    (8)
  • and constructs the loss function of the student network based on the above objectives 3) and 4), as shown in equation (9) below:
    L_student = L_ce(p_stu(x), p_soft(x)) + (1/n) · Σ_{i=1..n} D_KL(p_i(x; θ_i) ‖ p_stu(x))    (9)
  • where p_soft is the classification prediction probability finally output by the teacher network, n is the number of expert sub-networks in the teacher network, and p_stu(x) = softmax(logits_stu), logits_stu being the logits output by the student network; those skilled in the art will understand that normalizing the logits by softmax yields the probability distribution of the predicted classification.
  • Based on equations (8) and (9), the total loss function of the training method of the endoscopic image classification model incorporating knowledge distillation according to an embodiment of the present disclosure can be defined as shown in equation (10) below:
    L_total = α · L_teacher + (1 − α) · L_student    (10)
  • where α is the weight parameter, which is set to 1 in the initial stage, gradually decreases as training proceeds, and finally drops to 0.
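The following sketch shows how equations (9) and (10) can be combined in a training step, with the teacher weight decaying from 1 to 0; PyTorch and the linear decay schedule are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def student_loss(stu_logits, expert_probs, p_soft):
    """Equation (9): CE to the teacher's fused output plus mean KL to each expert."""
    log_p_stu = F.log_softmax(stu_logits, dim=-1)
    ce = -(p_soft * log_p_stu).sum(dim=-1).mean()            # soft-label cross-entropy
    kl = sum(F.kl_div(log_p_stu, p, reduction="batchmean")   # D_KL(p_i || p_stu)
             for p in expert_probs) / len(expert_probs)
    return ce + kl

def total_loss(teacher_l, student_l, step, total_steps):
    """Equation (10): alpha decays from 1 to 0 over the training iterations."""
    alpha = max(0.0, 1.0 - step / total_steps)
    return alpha * teacher_l + (1.0 - alpha) * student_l

# usage sketch: expert_probs come from the teacher's n experts, p_soft is their average
expert_probs = [F.softmax(torch.randn(2, 4), dim=-1) for _ in range(3)]
p_soft = torch.stack(expert_probs).mean(dim=0)
print(total_loss(torch.tensor(1.0),
                 student_loss(torch.randn(2, 4), expert_probs, p_soft),
                 step=500, total_steps=1000))
```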
  • Based on the total loss function of equation (10), the parameters of the endoscopic image classification model fused with knowledge distillation can be adjusted, so that as iterative training continues the total loss function is minimized, thereby obtaining a trained endoscopic image classification model fused with knowledge distillation. In the model obtained by this training, the student network has a small number of parameters and a relatively simple structure, yet achieves prediction accuracy close to that of the complex teacher network, so subsequent classification applications can be based directly on the trained student network.
  • an embodiment of the present disclosure also provides a method for classifying endoscopic images.
  • the method includes:
  • step S901 an endoscopic image to be identified is acquired.
  • for example, the acquired endoscopic image to be identified is a collected polyp image.
  • step S903 the endoscopic image to be recognized is input into a trained endoscopic image classification model to obtain a classification result of the endoscopic image.
  • the endoscope image classification model here may be the endoscope image classification model 500A, 500B or 500C trained by the above method.
  • when the trained endoscopic image classification model is the model shown in FIG. 5B, the endoscopic image to be recognized can first be input to the shared sub-network in the trained endoscopic image classification model to extract shallow features, which are then fed into the expert sub-networks of the trained endoscopic image classification model.
  • for an endoscope image classification model fused with knowledge distillation, such as the above-mentioned endoscope image classification model 700A, 700B or 700C: because of the small number of parameters of the student network, its relatively simple model structure, and its ability to achieve prediction accuracy close to that of the complex teacher network, the endoscopic image to be recognized can be directly input into the student network in the trained endoscopic image classification model fused with knowledge distillation.
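As a usage illustration, inference then reduces to a single forward pass through the student network; in this minimal sketch, `student`, `preprocess`, and the class names are hypothetical placeholders:

```python
import torch

class_names = ["adenoma", "hyperplastic", "inflammatory", "cancer"]  # assumed label set

@torch.no_grad()
def classify(student, image_tensor):
    """Single forward pass through the distilled student network."""
    student.eval()
    probs = torch.softmax(student(image_tensor.unsqueeze(0)), dim=-1)[0]
    idx = int(probs.argmax())
    return class_names[idx], float(probs[idx])

# label, confidence = classify(student, preprocess(endoscopic_image))
```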
  • FIG. 10 is a schematic structural diagram of an endoscope image classification system 1000 in an embodiment of the present disclosure.
  • the endoscopic image classification system 1000 at least includes an image acquisition unit 1001 , a processing unit 1002 and an output unit 1003 .
  • the image acquisition unit 1001, the processing unit 1002, and the output unit 1003 are related medical devices, which can be integrated into the same medical device, or can be divided among multiple devices that are connected and communicate with each other to form a medical system for use, etc.
  • the image acquisition unit 1001 can be an endoscope
  • the processing unit 1002 and the output unit 1003 can be a computer device in communication with the endoscope, etc.
  • the image acquiring component 1001 is used to acquire an image to be recognized.
  • the processing component 1002 is, for example, configured to execute the method steps shown in FIG. 9 , extract image feature information of the image to be recognized, and obtain a lesion classification result of the image to be recognized based on the feature information of the image to be recognized.
  • the output unit 1003 is used to output the classification result of the image to be recognized.
  • FIG. 11 shows a training device 1100 for an endoscope image classification model according to an embodiment of the present disclosure, which specifically includes a training data set acquisition component 1101 and a training component 1103 .
  • the training data set acquisition component 1101 is used to acquire a training data set, the training data set including a plurality of endoscopic images and annotation labels of the plurality of endoscopic images, wherein the training data set presents a long-tail distribution; and the training component 1103 is configured to train the endoscope image classification model based on the training data set until the target loss function of the endoscope image classification model converges, so as to obtain a trained endoscope image classification model.
  • the target loss function is determined based at least on the corresponding multiple output results of the multiple expert sub-networks.
  • the training component 1103 includes: an input subcomponent 1103_1, which is used to input image samples in the training image sample set into each of the plurality of expert sub-networks; an output result generation subcomponent 1103_2, which utilizes the plurality of expert sub-networks to generate a corresponding plurality of expert sub-network output results for the image sample and generates a final output result of the endoscopic image classification model based on the plurality of expert sub-network output results;
  • a loss function calculation subcomponent 1103_3, which calculates a loss value through a target loss function based on at least the output results of the plurality of expert sub-networks and the final output result; and a parameter adjustment subcomponent 1103_4, which adjusts the parameters of the endoscope image classification model based on the loss value.
  • the endoscope image classification model also includes a shared sub-network, wherein the training component 1103 includes: an input subcomponent 1103_1, which inputs image samples in the training image sample set into the shared sub-network to extract shallow feature representations; an output result generation subcomponent 1103_2, which, based on the extracted shallow feature representations, utilizes the multiple expert sub-networks to generate corresponding output results of the multiple expert sub-networks for the image sample and generates the final output result of the endoscopic image classification model based on the output results of the multiple expert sub-networks; a loss function calculation subcomponent 1103_3, which calculates a loss value through the target loss function based on at least the output results of the multiple expert sub-networks and the final output result; and a parameter adjustment subcomponent 1103_4, which adjusts the parameters of the endoscopic image classification model based on the loss value.
  • the target loss function of the endoscope image classification model includes: a cross-entropy loss function determined based on the final output result of the endoscope image classification model and the annotation labels of the image samples, and a KL divergence determined based on the output results of the multiple expert sub-networks.
  • the output result generating subcomponent 1103_2 fuses the output results of the plurality of expert sub-networks as the final output result of the endoscope image classification model.
  • the fusing of the output results of the multiple expert sub-networks by the output result generation subcomponent 1103_2 includes performing a weighted average on the output results of the multiple expert sub-networks.
  • the endoscopic image classification model further includes a student network with the same structure as the expert sub-networks, wherein the plurality of expert sub-networks constitute a teacher network and the teacher network is used to train the student network; the output result generation subcomponent 1103_2 further utilizes the student network to generate a corresponding student network output result for the image sample.
  • the loss function calculation subcomponent 1103_3 calculates the loss value through the target loss function based on the output results of the plurality of expert subnetworks, the final output result and the output result of the student network, and the parameter adjustment subcomponent 1103_4 is based on The loss value adjusts parameters of the endoscopic image classification model.
  • the target loss function is a weighted sum of the loss function of the teacher network and the loss function of the student network.
  • the sum of the weight value of the loss function of the teacher network and the weight value of the loss function of the student network is 1, wherein the weight value of the loss function of the teacher network decreases continuously over training iterations until it finally reaches 0, and the weight value of the loss function of the student network increases continuously over training iterations until it finally reaches 1.
  • the loss function of the teacher network includes: a cross-entropy loss function determined based on the final output result of the endoscopic image classification model and the annotation labels of the image samples, and a KL divergence determined based on the output results of the multiple expert sub-networks.
  • the loss function of the student network includes: a cross-entropy loss function determined based on the student network output result and the final output result of the endoscopic image classification model, and a KL divergence determined based on the student network output result and the plurality of expert sub-network output results generated by the plurality of expert sub-networks.
  • each of the plurality of expert sub-networks includes multiple sequentially connected Transformer encoder layers and a classifier.
  • an electronic device is also provided in another exemplary embodiment of the present disclosure.
  • the electronic device in the embodiments of the present disclosure may include a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein, when the processor executes the program, the steps of the method for training an endoscopic image classification model or the method for endoscopic image classification in the above embodiments may be implemented.
  • for example, the electronic device may be the server 100 in FIG. 1.
  • Embodiments of the present disclosure also provide a computer-readable storage medium.
  • FIG. 12 shows a schematic diagram 1200 of a storage medium according to an embodiment of the disclosure.
  • computer-executable instructions 1201 are stored on the computer-readable storage medium 1200.
  • when the computer-executable instructions 1201 are executed, the training method of the endoscopic image classification model incorporating knowledge distillation and the endoscopic image classification method according to the embodiments of the present disclosure described with reference to the above figures can be performed.
  • the computer-readable storage medium includes, for example but not limited to, volatile memory and/or non-volatile memory.
  • the volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache).
  • the non-volatile memory may include, for example, a read-only memory (ROM), a hard disk, a flash memory, and the like.
  • Embodiments of the present disclosure also provide a computer program product or computer program, the computer program product or computer program including computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the training method of the endoscopic image classification model incorporating knowledge distillation and the endoscopic image classification method according to the embodiments of the present disclosure.

Abstract

A training method and apparatus of an endoscope image classification model, and an image classification method. The endoscope image classification model comprises a plurality of expert sub-networks. The method comprises: obtaining a training data set, wherein the training data set comprises a plurality of endoscopic images and annotation labels of the plurality of endoscopic images, and the training data set presents a long-tail distribution; and training the endoscope image classification model on the basis of the training data set until a target loss function of the endoscope image classification model converges, to obtain a trained endoscope image classification model, wherein the target loss function is determined at least on the basis of a corresponding plurality of output results of the plurality of expert sub-networks.

Description

Training Method of Endoscopic Image Classification Model, Image Classification Method and Apparatus
This application claims priority to Chinese Patent Application No. 202111039189.1 filed on September 6, 2021, the entire disclosure of which is incorporated herein by reference as part of this application.
Technical Field
Embodiments of the present disclosure relate to a training method of an endoscopic image classification model incorporating knowledge distillation, an image classification method, an apparatus, and a computer-readable medium.
Background Art
Colorectal cancer has the third-highest incidence and fourth-highest mortality among cancers worldwide, and more than 95% of colorectal cancers arise from the malignant transformation of colonic polyps. Among detected polyps, adenomas account for the majority, roughly 10.86% to 80%. Colorectal cancer is generally believed to originate from adenomatous polyps, whose malignant transformation rate is 1.4% to 9.2%. Other polyp types, such as hyperplastic polyps and inflammatory polyps (accounting for 2.32% to 13.8%), each make up only a small proportion, presenting a long-tail distribution.
To reduce the burden on doctors, some work has attempted to automatically identify polyp types using deep learning. Existing work on polyp classification is basically based on ordinary convolutional neural networks, usually an off-the-shelf network such as ResNet, VGG, or Inception-v3. However, these approaches only use conventional training methods and do not take into account the imbalanced distribution of polyp types.
A great deal of research has addressed the long-tail problem. For example, some studies solve it by resampling the data set, including undersampling the head classes, oversampling the tail classes, or sampling in a class-balanced manner according to the distribution of each class. However, these methods assume prior knowledge of the future data distribution, which does not match reality, and they easily cause overfitting to the tail data. Some studies address the long-tail problem by assigning different weights to different classes or samples, modifying the loss to give higher weights to the tail data. Although such methods are simpler than resampling-based ones, they face the same problems: they easily cause underfitting/overfitting to the head/tail data and do not match real-world conditions. Some studies transfer features learned from the head data to the underrepresented tail data, but such methods usually involve complex models and heavy computation. Still other works try to combine the above methods or tackle the long-tail problem from other angles, for example by modifying the momentum of the classifier update and removing its bias toward the head data. However, this approach cannot guarantee that the accuracy on part of the head data will not be sacrificed.
Existing methods and studies for classifying polyps usually do not consider the long-tail distribution of polyp types: they either train a convolutional neural network directly or adjust the distribution of the data set before training, which clearly does not match the characteristics of real polyp data. Training directly without considering the data imbalance tends to leave the model unable to recognize the tail data well, while rebalancing the data set before training tends to overfit the tail data and causes a certain loss of accuracy on the head data.
Therefore, it is desirable to propose an improved polyp classification method that adapts to long-tail data distributions and can simultaneously improve the accuracy on both head and tail classes.
Summary of the Invention
Embodiments of the present disclosure provide a method for training an endoscopic image classification model, an endoscopic image classification method, an apparatus, and a computer-readable medium.
Embodiments of the present disclosure provide a training method for an endoscopic image classification model based on multi-expert decision-making, wherein the endoscopic image classification model includes a plurality of expert sub-networks. The method includes: acquiring a training data set, the training data set including a plurality of endoscopic images and annotation labels of the plurality of endoscopic images, wherein the training data set presents a long-tail distribution; and training the endoscopic image classification model based on the training data set until a target loss function of the endoscopic image classification model converges, to obtain a trained endoscopic image classification model, wherein the target loss function is determined based at least on corresponding multiple output results of the plurality of expert sub-networks.
For example, training the endoscopic image classification model based on the training data set includes: inputting image samples in the training image sample set into each of the plurality of expert sub-networks; using the plurality of expert sub-networks to generate corresponding expert sub-network output results for the image sample; generating a final output result of the endoscopic image classification model based on the plurality of expert sub-network output results; and calculating a loss value through the target loss function based on at least the plurality of expert sub-network output results and the final output result, and adjusting parameters of the endoscopic image classification model based on the loss value.
For example, the endoscopic image classification model further includes a shared sub-network, and training the endoscopic image classification model based on the training data set includes: inputting image samples in the training image sample set into the shared sub-network to extract shallow feature representations; using the plurality of expert sub-networks to generate corresponding expert sub-network output results for the image sample based on the extracted shallow feature representations; generating a final output result of the endoscopic image classification model based on the plurality of expert sub-network output results; and calculating a loss value through the target loss function based on at least the plurality of expert sub-network output results and the final output result, and adjusting parameters of the endoscopic image classification model based on the loss value.
For example, the target loss function of the endoscopic image classification model includes: a cross-entropy loss function determined based on the final output result of the endoscopic image classification model and the annotation labels of the image samples, and a KL divergence determined based on the output results of the multiple expert sub-networks.
For example, generating the final output result of the endoscopic image classification model based on the plurality of expert sub-network output results includes: fusing the plurality of expert sub-network output results to serve as the final output result of the endoscopic image classification model.
For example, fusing the plurality of expert sub-network output results includes: performing a weighted average of the plurality of expert sub-network output results.
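For illustration only, a minimal sketch of such a fusion step is given below (equal weights are used as the default; the disclosure only requires some weighted average, so the function name and the equal-weight choice are assumptions):

```python
import torch

def fuse_expert_outputs(expert_probs, weights=None):
    """Weighted average of per-expert class-probability tensors of shape (B, K)."""
    if weights is None:
        weights = [1.0 / len(expert_probs)] * len(expert_probs)  # equal weighting
    stacked = torch.stack(expert_probs)            # (n_experts, B, K)
    w = torch.tensor(weights).view(-1, 1, 1)       # broadcastable fusion weights
    return (w * stacked).sum(dim=0)                # (B, K) fused final output
```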
For example, the endoscopic image classification model further includes a student network with the same structure as the expert sub-networks, wherein the plurality of expert sub-networks constitute a teacher network, and the teacher network is used to train the student network based on knowledge distillation. The method further includes using the student network to generate a corresponding student network output result for the image sample.
For example, calculating the loss value through the target loss function based on at least the plurality of expert sub-network output results and the final output result includes: calculating the loss value through the target loss function based on the plurality of expert sub-network output results, the final output result, and the student network output result.
For example, the target loss function is a weighted sum of the loss function of the teacher network and the loss function of the student network.
For example, the sum of the weight value of the loss function of the teacher network and the weight value of the loss function of the student network is 1, wherein the weight value of the loss function of the teacher network decreases continuously over training iterations until it finally reaches 0, and the weight value of the loss function of the student network increases continuously over training iterations until it finally reaches 1.
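A minimal sketch of one such weight schedule follows; the disclosure does not fix a particular decay rule, so the linear ramp and the function name are assumptions:

```python
def loss_weights(epoch, total_epochs):
    """Teacher weight decays linearly from 1 to 0; student weight rises from 0 to 1.
    The two weights always sum to 1, matching the constraint described above."""
    w_teacher = max(0.0, 1.0 - epoch / total_epochs)
    w_student = 1.0 - w_teacher
    return w_teacher, w_student
```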
For example, the loss function of the teacher network includes: a cross-entropy loss function determined based on the final output result of the endoscopic image classification model and the annotation labels of the image samples, and a KL divergence determined based on the output results of the multiple expert sub-networks. The loss function of the student network includes: a cross-entropy loss function determined based on the student network output result and the final output result of the endoscopic image classification model, and a KL divergence determined based on the student network output result and the plurality of expert sub-network output results generated by the plurality of expert sub-networks.
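The following PyTorch-style sketch combines the teacher and student terms just described into one objective. It is an illustration under stated assumptions: the diversity weight, the numerical epsilon, and the exact pairwise form of the KL terms are not the disclosure's verbatim formulas.

```python
import torch
import torch.nn.functional as F

def distillation_target_loss(expert_logits, student_logits, labels,
                             w_teacher, w_student, diversity_weight=0.1):
    eps = 1e-8
    expert_probs = [F.softmax(z, dim=-1) for z in expert_logits]
    fused = torch.stack(expert_probs).mean(dim=0)        # final (teacher-side) output
    student_probs = F.softmax(student_logits, dim=-1)

    # Teacher loss: CE(final output, hard labels) minus pairwise expert KL,
    # so that minimizing the loss maximizes diversity between experts.
    ce_teacher = F.nll_loss((fused + eps).log(), labels)
    diversity = sum(F.kl_div((q + eps).log(), p, reduction='batchmean')
                    for i, p in enumerate(expert_probs)
                    for j, q in enumerate(expert_probs) if i != j)
    loss_teacher = ce_teacher - diversity_weight * diversity

    # Student loss: soft cross-entropy against the fused final output, plus KL
    # between the student's prediction and each expert's prediction.
    ce_student = -(fused * (student_probs + eps).log()).sum(dim=-1).mean()
    distill = sum(F.kl_div((student_probs + eps).log(), p, reduction='batchmean')
                  for p in expert_probs)
    loss_student = ce_student + distill

    return w_teacher * loss_teacher + w_student * loss_student
```

The weights `w_teacher` and `w_student` would follow the schedule sketched above, so the objective shifts smoothly from training the multi-expert teacher to distilling the student.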
For example, the shared sub-network includes a Vision Transformer, and each of the plurality of expert sub-networks includes multiple sequentially connected Transformer encoder layers and a classifier.
According to another embodiment of the present disclosure, an endoscopic image classification method is provided, including: acquiring an endoscopic image to be recognized; and obtaining a classification result of the endoscopic image based on a trained endoscopic image classification model, wherein the trained endoscopic image classification model is obtained based on the training method of the endoscopic image classification model as described above.
According to another embodiment of the present disclosure, an endoscopic image classification method is provided, including: acquiring an endoscopic image to be recognized; and obtaining a classification result of the endoscopic image based on the student network in a trained endoscopic image classification model, wherein the trained endoscopic image classification model is obtained based on the training method of the endoscopic image classification model as described above.
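For illustration, inference with the distilled student network might look as follows. This is a hypothetical sketch: `StudentNet` is a stand-in placeholder for the student network (which, per the disclosure, has the same structure as a single expert sub-network), and the input size is an assumed value; only the student network is needed at test time.

```python
import torch
import torch.nn as nn

class StudentNet(nn.Module):
    """Placeholder student network; in practice this would mirror one expert."""
    def __init__(self, dim=768, n_classes=4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim), nn.GELU())
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        return self.head(self.backbone(x))

model = StudentNet()
# model.load_state_dict(torch.load("student_net.pt"))  # in practice: load trained weights (path assumed)
model.eval()
with torch.no_grad():
    image = torch.rand(1, 3, 224, 224)              # placeholder endoscopic image tensor
    probs = model(image).softmax(dim=-1)
    predicted_class = probs.argmax(dim=-1).item()   # index of the predicted polyp type
```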
According to another embodiment of the present disclosure, an endoscopic image classification system is provided, including: an image acquisition component configured to acquire an endoscopic image to be recognized; a processing component configured to obtain a classification result of the endoscopic image based on a trained endoscopic image classification model; and an output component configured to output the classification result of the image to be recognized, wherein the trained endoscopic image classification model is obtained based on the training method of the endoscopic image classification model as described above.
According to another embodiment of the present disclosure, an endoscopic image classification system is provided, including: an image acquisition component configured to acquire an endoscopic image to be recognized; a processing component configured to obtain a classification result of the endoscopic image based on the student network in a trained endoscopic image classification model; and an output component configured to output the classification result of the image to be recognized, wherein the trained endoscopic image classification model is obtained based on the training method of the endoscopic image classification model as described above.
According to another embodiment of the present disclosure, a training apparatus for an endoscopic image classification model based on multi-expert decision-making is provided, wherein the endoscopic image classification model includes a plurality of expert sub-networks. The apparatus includes: a training data set acquisition component configured to acquire a training data set, the training data set including a plurality of endoscopic images and annotation labels of the plurality of endoscopic images, wherein the training data set presents a long-tail distribution; and a training component configured to train the endoscopic image classification model based on the training data set until a target loss function of the endoscopic image classification model converges, to obtain a trained endoscopic image classification model, wherein the target loss function is determined based at least on corresponding multiple output results of the plurality of expert sub-networks.
Embodiments of the present disclosure also provide an electronic device, including a memory and a processor, wherein the memory stores processor-readable program code, and when the processor executes the program code, the method as described above is performed.
Embodiments of the present disclosure also provide a computer-readable storage medium on which computer-executable instructions are stored, the computer-executable instructions being used to execute the method as described above.
In view of real-world conditions, the training method of the endoscopic image classification model according to the embodiments of the present disclosure proposes a multi-expert joint decision-making approach to learn imbalanced data distributions. It does not require prior knowledge of the data distribution and can simultaneously improve the model's prediction accuracy on both head and tail data without introducing bias. In addition, the model is compressed through knowledge distillation, making it more compact.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings of the embodiments are briefly introduced below. Obviously, the drawings described below relate only to some embodiments of the present disclosure and do not limit the present disclosure.
FIG. 1 shows a schematic diagram of the application architecture of the endoscopic image classification model training and endoscopic image classification methods in an embodiment of the present disclosure;
FIG. 2 shows an exemplary block diagram of a Vision Transformer (ViT);
FIG. 3 shows a schematic diagram of the ViT in FIG. 2 flattening an original image into a sequence;
FIG. 4 shows a polyp image according to an embodiment of the present disclosure;
FIG. 5A shows a schematic structure of an endoscopic image classification model 500A according to an embodiment of the present disclosure;
FIG. 5B shows a schematic structure of an endoscopic image classification model 500B according to another embodiment of the present disclosure;
FIG. 5C shows a schematic structure of an endoscopic image classification model 500C using a Transformer as the feature extractor according to yet another embodiment of the present disclosure;
FIG. 6A shows a flowchart of a method for training an endoscopic image classification model according to an embodiment of the present disclosure;
FIG. 6B shows a more specific exemplary illustration of step S603 in FIG. 6A;
FIG. 7A shows a schematic diagram of an endoscopic image classification model 700A incorporating knowledge distillation according to an embodiment of the present disclosure;
FIG. 7B shows a schematic diagram of an endoscopic image classification model 700B incorporating knowledge distillation according to another embodiment of the present disclosure;
FIG. 7C shows a schematic diagram of an endoscopic image classification model 700C incorporating knowledge distillation according to yet another embodiment of the present disclosure;
FIG. 8 shows a flowchart of a method for training an endoscopic image classification model incorporating knowledge distillation according to an embodiment of the present disclosure;
FIG. 9 depicts a flowchart of an endoscopic image classification method according to an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of an endoscopic image classification system in an embodiment of the present disclosure;
FIG. 11 shows a training apparatus for an endoscopic image classification model according to an embodiment of the present disclosure; and
FIG. 12 shows a schematic diagram of a storage medium according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments of the present application, all other embodiments obtained by persons of ordinary skill in the art without creative effort also fall within the protection scope of the present application.
The terms used in this specification are general terms currently widely used in the art in view of the functions of the present disclosure, but the terms may change according to the intention of those of ordinary skill in the art, precedents, or new technologies in the art. In addition, specific terms may be selected by the applicant, in which case their detailed meanings will be described in the detailed description of the present disclosure. Therefore, the terms used in the specification should not be understood as simple names, but interpreted based on the meanings of the terms and the overall description of the present disclosure.
Although this application makes various references to certain modules in the system according to the embodiments of this application, any number of different modules may be used and run on the user terminal and/or the server. The modules are illustrative only, and different aspects of the systems and methods may use different modules.
Flowcharts are used in this application to illustrate the operations performed by the system according to the embodiments of this application. It should be understood that the preceding or following operations are not necessarily performed in exact order; instead, various steps may be processed in reverse order or concurrently, as needed. Other operations may also be added to these processes, or one or more steps may be removed from them.
Regarding the diagnosis of gastrointestinal diseases, images of lesions inside the gastrointestinal tract are usually obtained with diagnostic tools such as endoscopes, and medical staff judge the lesion category by visual observation. To reduce the burden on doctors, some work has attempted to automatically identify lesion categories using deep learning; however, these lesion types usually exhibit a long-tail distribution. For example, among detected polyps, adenomas account for the majority, roughly 10.86% to 80%; colorectal cancer is generally believed to originate from adenomatous polyps, whose malignant transformation rate is 1.4% to 9.2%. Other polyp types, such as hyperplastic polyps and inflammatory polyps (2.32% to 13.8%), each account for only a small proportion, presenting a long-tail distribution. Existing methods for classifying polyps usually do not consider this characteristic of the polyp type distribution: they either train a convolutional neural network directly or adjust the distribution of the data set before training, which clearly does not match the characteristics of real polyp data. Training directly without considering the data imbalance tends to leave the model unable to recognize the tail data well, while rebalancing the data set before training tends to overfit the tail data, causing a certain loss of accuracy on the head data.
Therefore, in view of the long-tail distribution of polyp image data, the present disclosure proposes a multi-expert joint algorithm that adapts to long-tail data distributions and can simultaneously improve head and tail accuracy, and integrates it into a more compact model through an end-to-end knowledge distillation method.
FIG. 1 shows a schematic diagram of the application architecture of the endoscopic image classification model training method and the endoscopic image classification method according to an embodiment of the present disclosure, including a server 100 and a terminal device 200.
The terminal device 200 may be a medical device; for example, a user may view endoscopic image classification results on the terminal device 200.
The terminal device 200 and the server 100 may be connected through the Internet to communicate with each other. Optionally, the above Internet uses standard communication technologies and/or protocols. The network is usually the Internet, but may be any network, including but not limited to a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), or any combination of mobile, wired or wireless networks, private networks, or virtual private networks. In some embodiments, technologies and/or formats including Hyper Text Markup Language (HTML), Extensible Markup Language (XML), and the like are used to represent data exchanged over the network. In addition, conventional encryption technologies such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec) may be used to encrypt all or some links. In other embodiments, customized and/or dedicated data communication technologies may be used in place of or in addition to the above data communication technologies.
The server 100 may provide various network services for the terminal device 200, and the server 100 may be a single server, a server cluster composed of several servers, or a cloud computing center.
Specifically, the server 100 may include a processor 110 (Central Processing Unit, CPU), a memory 120, an input device 130, an output device 140, and the like. The input device 130 may include a keyboard, a mouse, a touch screen, and so on; the output device 140 may include a display device, such as a Liquid Crystal Display (LCD) or a Cathode Ray Tube (CRT).
The memory 120 may include a read-only memory (ROM) and a random access memory (RAM), and provides the processor 110 with the program instructions and data stored in the memory 120. In the embodiments of the present disclosure, the memory 120 may be used to store the program of the endoscopic image classification model training method or of the endoscopic image classification method based on a trained endoscopic image classification model.
By calling the program instructions stored in the memory 120, the processor 110 is configured to execute, according to the obtained program instructions, the steps of any endoscopic image classification model training method or endoscopic image classification method based on a trained endoscopic image classification model in the embodiments of the present disclosure.
For example, in the embodiments of the present disclosure, the endoscopic image classification model training method or the endoscopic image classification method based on a trained model is mainly executed on the server 100 side. For example, for the endoscopic image classification method, the terminal device 200 may send collected images of gastrointestinal lesions (for example, polyps) to the server 100, the server 100 performs type recognition on the lesion images, and the lesion classification result may be returned to the terminal device 200.
The application architecture shown in FIG. 1 is described by taking application on the server 100 side as an example. Of course, the endoscopic image classification method in the embodiments of the present disclosure may also be executed by the terminal device 200. For example, the terminal device 200 may obtain the trained endoscopic image classification model incorporating knowledge distillation from the server 100 side, and perform type recognition on lesion images based on the student network in that model to obtain the lesion classification results. The embodiments of the present disclosure are not limited in this respect.
In addition, the application architecture diagrams in the embodiments of the present disclosure are intended to illustrate the technical solutions of the embodiments more clearly and do not constitute a limitation on those technical solutions. Of course, for other application architectures and business applications, the technical solutions provided by the embodiments of the present disclosure are also applicable to similar problems.
Various embodiments of the present disclosure are schematically described by taking application to the application architecture shown in FIG. 1 as an example.
First, in order to enable those skilled in the art to understand the principles of the present disclosure more clearly, some technical terms involved in the present disclosure are briefly described below.
Knowledge distillation: Knowledge distillation usually adopts a teacher-student architecture, using the knowledge learned by a large model (the teacher) to guide the training of a small model (the student), so that the small model achieves performance comparable to the large model with a greatly reduced number of parameters, thereby achieving model compression and acceleration.
KL divergence: The full name of KL divergence is Kullback-Leibler divergence. It is generally used to measure the "distance" between two probability distribution functions. For two probability distributions P and Q of a discrete random variable, their KL divergence is defined as:
$D_{KL}(P\|Q)=\sum_{x}P(x)\log\frac{P(x)}{Q(x)}$
Minimizing the KL divergence makes the distributions P and Q close to each other; likewise, minimizing the negative KL divergence maximizes the distance between the distributions of P and Q. KL divergence is a commonly used loss function in the field of machine learning.
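As a quick worked example with illustrative numbers (not taken from the disclosure), for $P=(0.5,0.5)$ and $Q=(0.9,0.1)$: $D_{KL}(P\|Q)=0.5\ln\frac{0.5}{0.9}+0.5\ln\frac{0.5}{0.1}\approx-0.294+0.805=0.511$ nats, whereas $D_{KL}(P\|P)=0$; the divergence grows as the two distributions move apart.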
Transformer: The Transformer was proposed in the Google paper "Attention is All You Need" to solve natural language translation problems. It is based on the attention mechanism to improve model training speed. A typical Transformer includes a Multi-head Attention module and a Multilayer Perceptron (MLP) module. The multi-head attention module helps the encoder look at other words while encoding a specific word. Each module is preceded by a Layer Normalization module, and residual connections are used to connect the modules. The layer normalization module imposes constraints on the "scale" problem that may arise from the accumulation of multiple token embeddings during Transformer learning, which is equivalent to constraining the space expressing the polysemy of each word, effectively reducing model variance.
Vision Transformer (ViT): The Vision Transformer is a technique that transfers the Transformer from natural language processing to image processing.
FIG. 2 shows an exemplary block diagram of a ViT. Similar to the series of word embeddings used when applying the Transformer to text, the ViT divides the original image into a grid of patches, flattens each patch into a single vector by concatenating all pixel channels in the patch, and then linearly projects it to the desired input dimension with a linear mapper. Since the ViT is agnostic to the structure of the input elements, a position encoder is further used to add a learnable position embedding to each patch vector, enabling the model to understand the image structure. Finally, the flattened sequence is input into the encoder part of the original Transformer model (for example, the m-layer Transformer encoder block shown in FIG. 2) for feature extraction, and a fully connected layer is finally attached to perform tasks such as image classification or segmentation.
FIG. 3 shows a schematic diagram of the ViT in FIG. 2 flattening the original image into a sequence.
As shown in FIG. 3, the image input into the ViT is an H×W×C polyp white-light image, where H and W are the numbers of pixels in the height and width directions, respectively, and C is the number of channels. The image is first divided into patches and then flattened. Assuming each patch has size P×P, the number of patches is N = H×W/(P×P). Each image patch is then flattened into a one-dimensional vector of size P×P×C, so the total input of the N patches is transformed into N×(P×P×C). A linear mapper then applies a linear transformation (i.e., a fully connected layer) to each vector to reshape the matrix and compress the dimension to D; this is called patch embedding. This yields an N×D embedding sequence, where N is the length of the resulting embedding sequence and D is the dimension of each vector in the sequence. Thus, the three-dimensional H×W×C image is converted into a two-dimensional (N×D) input. Subsequently, a position encoder adds position information to the sequence, and the sequence with position information can then be input into the Transformer encoder for feature extraction. It should be understood that the structures of the Transformer and the Vision Transformer and their feature extraction techniques are well known in the art and will not be described in detail here.
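Purely as an illustration of the flattening just described (patch size, image size, and embedding dimension are assumed values, not fixed by the disclosure), a PyTorch-style sketch might be:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Flattens an HxWxC image into N = H*W/(P*P) patch vectors of size P*P*C,
    linearly projects each to dimension D, and adds a learnable position embedding."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_channels, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, x):                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        x = x.unfold(2, P, P).unfold(3, P, P)      # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        return self.proj(x) + self.pos_embed       # (B, N, D) embedding sequence
```

For instance, `PatchEmbedding()(torch.rand(1, 3, 224, 224))` yields a (1, 196, 768) sequence ready for the Transformer encoder.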
According to an embodiment of the present disclosure, the Vision Transformer may be used as the backbone network to extract features, so as to capture the key information in the image more accurately. In neural networks, especially in the Computer Vision (CV) field, features are generally extracted from the image first. This part is the foundation of the entire CV task, because subsequent downstream tasks (such as classification, generation, and so on) are performed based on the extracted image features; this part of the network structure is therefore called the backbone network.
Of course, it should be noted that the embodiments of the present disclosure may also use other network architectures as the backbone network, such as VggNet and ResNet, and the present disclosure is not limited in this respect.
FIG. 4 shows a polyp image according to an embodiment of the present disclosure.
An endoscope enters the human body through a natural orifice or through a small surgical incision to obtain images of lesions, and these images are subsequently used for the diagnosis and treatment of diseases. FIG. 4 shows polyp images captured by an endoscope: the left image is an observation of a polyp obtained by an endoscope operating in white light (WL) imaging mode, and the right image is another observation of the same polyp obtained by an endoscope operating in Narrow Band Imaging (NBI) mode.
The broadband spectrum of white light is composed of three kinds of light, R/G/B (red/green/blue), with wavelengths of 605 nm, 540 nm, and 415 nm, respectively. The white light imaging mode presents a high-brightness, sharp white-light endoscopic image, which is conducive to observing the deep structure of the mucosa. The narrow-band light mode uses a narrow-band filter instead of the traditional broadband filter to restrict light of different wavelengths, leaving only the green and blue narrow-band light waves at 540 nm and 415 nm. In images generated in the narrow-band light mode, the contrast of blood vessels relative to the mucosa is significantly enhanced, which is suitable for observing the vascular morphology and mucosal structure of the mucosal surface.
To reduce the burden on doctors, some existing work attempts to use deep learning to automatically identify the lesion category of lesions in images acquired by endoscopy. However, existing automatic recognition work on endoscopic image classification is basically based on ordinary convolutional neural networks, usually an off-the-shelf network such as ResNet, VGG, or Inception-v3. These approaches only use conventional training methods and do not take into account the imbalanced distribution of certain endoscopic image types. For example, among detected polyps, adenomas usually account for the majority, while other polyp types such as hyperplastic polyps and inflammatory polyps each account for only a small proportion, presenting a long-tail distribution.
Therefore, in view of the long-tail distribution of polyp image data, the present disclosure proposes a multi-expert joint algorithm that adapts to long-tail data distributions and can simultaneously improve head and tail accuracy.
In the following, the technical solutions of the embodiments of the present disclosure are schematically described by taking the polyp image classification problem as an example. It should be noted that the technical solutions provided by the embodiments of the present disclosure are also applicable to some other endoscopic images with imbalanced distributions.
For example, according to an embodiment of the present disclosure, white-light images of polyps are used to construct a data set exhibiting a long-tail distribution. By using the training method of the endoscopic image classification model proposed in this application, the trained endoscopic image classification model can better recognize polyp images exhibiting a long-tail distribution.
It should be understood that if other endoscopic images of gastrointestinal lesions with imbalanced distributions are to be classified and recognized, any other such endoscopic images may also be used to construct the data set and train the endoscopic image classification model according to the embodiments of the present disclosure. These endoscopic images may be images acquired by the endoscope in any suitable mode, such as narrow-band light images, autofluorescence images, and I-SCAN images. For example, the above various modality images may also be mixed to construct the data set, and the present disclosure is not limited in this respect.
Aiming at the long-tail distribution of polyp images, the embodiments of the present disclosure propose a multi-expert decision-making endoscopic image classification model. On the one hand, the overall prediction accuracy is improved by fusing the decision results of multiple experts; on the other hand, by maximizing the distribution distance between the prediction results of the multiple experts, different experts can focus on different data distributions, thereby improving the ability to learn from imbalanced data sets.
FIG. 5A shows a schematic structure of an endoscopic image classification model 500A according to an embodiment of the present disclosure.
As shown in FIG. 5A, the endoscopic image classification model 500A according to an embodiment of the present disclosure includes n expert sub-networks, where n is, for example, an integer greater than 2. Each expert sub-network includes a feature extractor and a classifier.
According to the embodiments of the present disclosure, each expert sub-network here may have the same network structure, and the structure of each expert sub-network may be any deep learning network structure that can be used to perform classification tasks; such a network structure usually includes a feature extractor for extracting feature representations and a classifier for classification.
For example, the feature extractor here may be the Vision Transformer shown in FIG. 2. For example, when the Vision Transformer of FIG. 2 is used as the feature extractor, the input image is first flattened into N one-dimensional vectors based on the linear mapping module and the position encoder, and features are then extracted through the m layers of Transformer encoder blocks.
For example, the classifier here may be a multi-head normalized classifier. Based on the feature representation of an image sample received from the Vision Transformer, the classifier can output the predicted classification probability values of the image sample.
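The disclosure does not spell out the normalization. One common reading of a "normalized classifier", used here purely as an assumption, L2-normalizes both the feature vector and the class weight vectors, so the logits become scaled cosine similarities (often helpful under class imbalance):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedClassifier(nn.Module):
    """Sketch of a normalized (cosine) classifier head; dim/scale are assumed."""
    def __init__(self, dim=768, n_classes=4, scale=16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, dim))
        self.scale = scale

    def forward(self, x):                               # x: (B, dim) features
        return self.scale * F.linear(F.normalize(x, dim=-1),
                                     F.normalize(self.weight, dim=-1))
```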
It should be understood that the feature extractor and classifier in the multi-expert sub-networks of the embodiments of the present disclosure may be any other structures that can perform similar functions. For example, the feature extractor here may also be a deep residual network (ResNet), and the classifier here may also be the convolutional layer part of a ResNet network; the present disclosure is not limited in this respect.
For example, the final optimization objectives of the endoscopic image classification model can be determined here as the following two. One is to minimize the loss between the final classification prediction output by the endoscopic image classification model and the true labels, so as to improve the prediction accuracy of the endoscopic image classification model. The other is to maximize the distribution distance between the classification predictions output by the multiple experts, so that the multiple experts can focus on different data distributions of the data set.
For example, according to the embodiments of the present disclosure, the loss between the final classification prediction output by the endoscopic image classification model and the true labels may be calculated based on a cross-entropy loss function. For example, according to the embodiments of the present disclosure, the difference between different experts may be maximized by maximizing the KL divergence between the classification predictions output by the different experts.
In this way, the embodiments of the present disclosure construct the target loss function for training the endoscopic image classification model based on the cross-entropy loss function and the KL divergence. During training, the target loss function is continuously optimized until it is minimized and converges, at which point training of the endoscopic image classification model is determined to be complete.
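For illustration, the sketch below combines these two objectives for the model of FIG. 5A in PyTorch style; the diversity weight, the numerical epsilon, and the exact pairwise form of the KL term are assumptions, not the disclosure's verbatim formulas:

```python
import torch
import torch.nn.functional as F

def multi_expert_target_loss(expert_logits, labels, diversity_weight=0.1):
    """Cross-entropy on the fused prediction, minus a pairwise-KL term so that
    minimizing the loss *maximizes* the divergence between experts."""
    probs = [F.softmax(z, dim=-1) for z in expert_logits]
    fused = torch.stack(probs).mean(dim=0)              # weighted-average fusion
    ce = F.nll_loss((fused + 1e-8).log(), labels)       # CE(final output, labels)
    kl = sum(F.kl_div((q + 1e-8).log(), p, reduction='batchmean')
             for i, p in enumerate(probs)
             for j, q in enumerate(probs) if i != j)
    return ce - diversity_weight * kl                   # encourage expert diversity
```

Note the minus sign in front of the KL term: because the optimizer minimizes the loss, subtracting the divergence drives the experts' predicted distributions apart.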
此外,由于上述内窥镜图像分类模型500A中的每个专家子网络都需要从原始图片开始,先基于网络的较浅的层次来提取浅层特征表示,再基于更深层次的网络结构来提取具有特异性的更深层次的特征表示。事实上,由于浅层特征表示对分类决策的影响不大,为了进一步简化模型复杂度,这些专家子网络可以共享同一个浅层特征提取器所提取的浅层特征表示,再基于深层特征提取器来进一步地学习特异性的深层特征,以进行分类任务。In addition, since each expert sub-network in the above-mentioned endoscopic image classification model 500A needs to start from the original image, it first extracts the shallow feature representation based on the shallower layer of the network, and then extracts the feature representation based on the deeper network structure. Deeper feature representation of specificity. In fact, since the shallow feature representation has little influence on the classification decision, in order to further simplify the model complexity, these expert sub-networks can share the shallow feature representation extracted by the same shallow feature extractor, and then based on the deep feature extractor To further learn specific deep features for classification tasks.
因此,本公开提出了内窥镜图像分类模型500A的一个变型,如图5B所示。在图5B的内窥镜图像分类模型500B中,多个专家子网络共享一个浅层特征提取器,同时每个专家子网络具有各自的深层次的特征提取器,以及最后的一个分类器,通过共享一些共同的浅层的特征提取器,内窥镜图像分类模型500B具有比内窥镜图像分类模型500A更简洁的结构。Accordingly, the present disclosure proposes a variation of the endoscopic image classification model 500A, as shown in FIG. 5B . In the endoscopic image classification model 500B of Fig. 5B, multiple expert sub-networks share a shallow feature extractor, and each expert sub-network has its own deep-level feature extractor, and the last classifier, through Sharing some common shallow feature extractors, the endoscopic image classification model 500B has a more compact structure than the endoscopic image classification model 500A.
For example, the shallow feature extractor here may be the common shallow structure of the feature extractors of the multiple expert sub-networks of the endoscopic image classification model 500A of FIG. 5A.
For example, when the feature extractor in each expert sub-network of the endoscopic image classification model 500A is the Vision Transformer shown in FIG. 2, the shallow feature extractor here may be the linear mapper layer, the position encoder layer, and one Transformer encoder block of that Vision Transformer. The expert sub-networks can share this common shallow feature extractor to obtain common shallow features, and use the remaining (m-1) Transformer encoder blocks as deep feature extractors to extract expert-specific deep features, as shown in the endoscope classification model 500C of FIG. 5C. Alternatively, the shared sub-network and the deep feature extractors here may be any other feature extractor suitable for extracting image features.
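To make the shared-stem design concrete, the following PyTorch-style sketch wires one shared shallow encoder block to several expert branches, each with its own deeper blocks and classifier. It is a minimal illustration under assumed sizes; the class name, the patch projection, the class-token readout, and all hyperparameters are assumptions for illustration rather than the disclosed implementation.

```python
import torch
import torch.nn as nn

class SharedStemExperts(nn.Module):
    """Sketch of models 500B/500C: a shared shallow ViT stem (patch projection,
    position codes, one encoder block) feeding n expert branches of (m-1) deeper
    blocks plus a classifier each. Sizes and names are illustrative assumptions."""
    def __init__(self, n_experts=3, m=4, dim=384, heads=6, n_patches=196, n_classes=4):
        super().__init__()
        self.proj = nn.Linear(16 * 16 * 3, dim)              # linear mapper for 16x16 patches
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # position codes
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # class token
        block = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.shared = block()                                # one shared shallow block
        self.experts = nn.ModuleList(
            nn.Sequential(*[block() for _ in range(m - 1)]) for _ in range(n_experts))
        self.heads = nn.ModuleList(nn.Linear(dim, n_classes) for _ in range(n_experts))

    def forward(self, patches):                              # patches: (B, n_patches, 768)
        x = self.proj(patches)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
        shallow = self.shared(x)                             # common shallow features
        return [h(e(shallow)[:, 0]) for e, h in zip(self.experts, self.heads)]
```

Each forward pass yields one logits tensor per expert; fusing them and computing the losses follows in the training step sketched later.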
FIG. 6A shows a flowchart of a method 600 for training an endoscopic image classification model according to one embodiment of the present disclosure. For example, the endoscopic image classification model here is the endoscopic image classification model 500A described above with reference to FIG. 5A. The training method 600 may be executed by a server, for example the server 100 shown in FIG. 1.
First, in step S601, a training data set is acquired. The training data set includes a plurality of endoscopic images and annotation labels of the plurality of endoscopic images, and presents a long-tailed distribution.
The training data set here may be prepared to mimic the long-tailed distribution of polyp types observed in practice. For example, in one specific implementation of an embodiment of the present disclosure, the training data set includes 2131 white-light images of polyps with four annotation labels: adenoma, hyperplasia, inflammation, and cancer. Images labeled adenoma form the majority (for example, 65%), while the other polyp types, such as hyperplastic polyps, inflammatory polyps, and cancer, each account for only a small proportion (for example, 13%, 12%, and 10%, respectively), so that the training data set as a whole presents a long-tailed distribution.
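For concreteness, the stated percentages imply roughly the following per-class image counts; the exact split is not given in the disclosure, so this is only an illustrative back-of-the-envelope check.

```python
# Illustrative class counts implied by the example ratios over 2131 images.
total = 2131
ratios = {"adenoma": 0.65, "hyperplastic": 0.13, "inflammatory": 0.12, "cancer": 0.10}
counts = {name: round(total * r) for name, r in ratios.items()}
print(counts)  # {'adenoma': 1385, 'hyperplastic': 277, 'inflammatory': 256, 'cancer': 213}
```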
It should be understood that the size of the training data set and the proportions of the labels used to train the endoscopic image classification model according to the embodiments of the present disclosure may be adjusted according to the actual situation; the present disclosure places no limit on this.
For example, the training data set here may be obtained by operating an endoscope, downloaded over a network, or acquired in other ways; the embodiments of the present disclosure place no limit on this.
It should be understood that the embodiments of the present disclosure are equally applicable to the image classification of digestive tract lesions other than polyps, such as inflammation, ulcers, vascular malformations, and diverticula; the present disclosure places no limit on this.
In step S603, the endoscopic image classification model is trained based on the training data set until the target loss function of the endoscopic image classification model converges, so as to obtain a trained endoscopic image classification model.
As described above, the goal here is, on the one hand, to improve the overall prediction accuracy by fusing the decisions of multiple experts and, on the other hand, to maximize the distribution distance between the experts' predictions so that different experts attend to different data distributions, thereby improving the ability to learn from datasets with imbalanced distributions. Accordingly, minimizing the cross-entropy loss between the final classification prediction of the multi-expert endoscopic image classification model 500A and the ground-truth labels, together with maximizing the KL divergence between the classification predictions of different expert sub-networks, may serve as the training objective for the endoscopic image classification model according to the embodiments of the present application.
Referring to FIG. 6B, the step of training the endoscopic image classification model based on the training data set in step S603 is described below in a more specific, exemplary manner.
As shown in FIG. 6B, training the endoscopic image classification model based on the training data set in step S603 may include the following sub-steps S603_1-S603_4.
Specifically, in step S603_1, the image samples of the training image sample set are input into each of the plurality of expert sub-networks.
As an alternative embodiment, when classification training is performed with the endoscopic image classification model 500B shown in FIG. 5B, the shallow features of the image sample may first be extracted by a shared sub-network, and these shallow features (rather than the original image sample itself) are then input into each of the multiple expert sub-networks of the model 500B. As described above, by sharing a common shallow feature extractor, the endoscopic image classification model 500B has a more compact structure than the endoscopic image classification model 500A.
Next, in step S603_2, the plurality of expert sub-networks are used to generate corresponding expert sub-network output results for the image sample.
For example, let the input image be x. Each expert sub-network first extracts a feature representation $u_i$ of the image sample with its feature extractor (for example, the feature extractor here is the Vision Transformer described above, denoted by the function $f_{\theta_i}(\cdot)$, where $\theta_i$ denotes the parameters of the i-th expert sub-network). The extracted feature representation is then:

$$u_i = f_{\theta_i}(x)$$
As an alternative embodiment, when classification training is performed with the endoscopic image classification model 500B shown in FIG. 5B, the extracted features may instead be expressed as:

$$u_i = g_{\theta_i}\big(f(x)\big)$$

where $f(x)$ denotes the shared sub-network and $g_{\theta_i}(\cdot)$ denotes the deep feature extractor of the i-th expert sub-network.
Then, based on the feature representation $u_i$, a classifier is used to classify the image sample. For example, the classifier here may be a multi-head normalized classifier, with which the logits of the i-th expert sub-network are computed as:

$$z_i = \frac{\gamma}{K}\sum_{k=1}^{K}\frac{(w_i^k)^{\top} u_i^k}{\tau\,\lVert w_i^k\rVert\,\lVert u_i^k\rVert} \tag{1}$$

where $\gamma$ and $\tau$ are parameters, K is the number of heads, $w_i^k$ is the weight parameter of the k-th head of the classifier in the i-th expert sub-network, $u_i^k$ is the corresponding slice of the feature representation, and $z_i$ is the logits computed by the i-th expert sub-network for the input image sample. As is known to those skilled in the art, normalizing these logits with softmax yields the predicted classification probabilities, as shown in equation (2) below:

$$p_i(x) = \operatorname{softmax}(z_i) = \frac{\exp(z_i)}{\sum_{c=1}^{C}\exp\big(z_i^{(c)}\big)} \tag{2}$$
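As a sketch of how such a classifier might be realized, the snippet below implements a cosine-similarity reading of equation (1); since the equation is reconstructed from context, the exact placement of γ, τ, and the per-head normalization is an assumption, as are the class name and default values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadNormClassifier(nn.Module):
    """Multi-head normalized classifier in the spirit of equation (1):
    per-head cosine similarity between feature slices and class weights,
    averaged over K heads and scaled by gamma / tau."""
    def __init__(self, dim, n_classes, n_heads=2, gamma=30.0, tau=1.0):
        super().__init__()
        assert dim % n_heads == 0
        self.K, self.gamma, self.tau = n_heads, gamma, tau
        self.w = nn.Parameter(torch.randn(n_heads, n_classes, dim // n_heads))

    def forward(self, u):                          # u: (batch, dim) feature of one expert
        logits = 0.0
        for k, uk in enumerate(u.chunk(self.K, dim=-1)):
            logits = logits + F.normalize(uk, dim=-1) @ F.normalize(self.w[k], dim=-1).t()
        return self.gamma / (self.K * self.tau) * logits   # z_i; softmax gives p_i(x)
```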
In step S603_3, the final output result of the endoscopic image classification model is generated based on the output results of the plurality of expert sub-networks. For example, the outputs of the multiple expert sub-networks may be fused to obtain the final result of the endoscopic image classification model; the fusion here may be a linear average, as shown in equation (3) below:

$$p_{\text{soft}}(x) = \frac{1}{n}\sum_{i=1}^{n} p_i(x) \tag{3}$$

where n is the number of expert sub-networks in the endoscopic image classification model and $p_{\text{soft}}(x)$ is the final prediction of the endoscopic image classification model.
In step S603_4, a loss value is calculated through the target loss function, and the parameters of the endoscopic image classification model are adjusted based on the loss value.
As described above, the model optimization here has two goals: one is that the final fused multi-expert result be closer to the ground-truth label, and the other is that the distribution distance between the outputs of the multiple experts be maximized, so that the experts attend to different distributions of the data.
Accordingly, the objective function may include two parts. The first part is the cross-entropy loss function between the fused classification prediction probability and the ground-truth label of the image sample, for example as shown in equation (4) below:

$$L_{ce}\big(p_{\text{soft}}(x),\, y\big) = -\sum_{c=1}^{C} y_c \log p_{\text{soft}}^{(c)}(x) \tag{4}$$

where $L_{ce}$ denotes the cross-entropy loss function, $p_{\text{soft}}(x)$ is the final prediction of the endoscopic image classification model obtained by fusing the predictions of the multiple expert sub-networks, and $y$ is the ground-truth label of the image sample.
The second part of the objective function is the negative KL divergence between the classification prediction probabilities output by the multiple expert sub-networks. As those skilled in the art will appreciate, the smaller the KL divergence, the closer two distributions are. Since optimization against a loss function ultimately minimizes that loss, the differences between the output distributions of the expert sub-networks are enlarged here by minimizing the negative KL divergence, for example as in equation (5) below:
$$L_{KL}^{(i)} = \frac{1}{n-1}\sum_{\substack{j=1 \\ j\neq i}}^{n} D_{KL}\big(p_i(x)\,\Vert\, p_j(x)\big) \tag{5}$$

Equation (5) above averages the KL divergence between the output of the i-th expert sub-network and the outputs of the remaining (n-1) expert sub-networks, where

$$D_{KL}\big(p_i \Vert p_j\big) = \sum_{c=1}^{C} p_i^{(c)}(x)\,\log\frac{p_i^{(c)}(x)}{p_j^{(c)}(x)}$$
The divergence loss function over all expert sub-networks is then defined as shown in equation (6):

$$L_{div}(\theta_1,\dots,\theta_n) = -\frac{1}{n}\sum_{i=1}^{n} L_{KL}^{(i)} \tag{6}$$

where n denotes the number of expert sub-networks, $\theta_i$ denotes the parameters of the i-th expert sub-network, and C is the number of label categories.
Therefore, the total loss function of the training method of the endoscopic image classification model according to one embodiment of the present disclosure can be defined as shown in equation (7):

$$L_{total} = L_{ce}\big(p_{\text{soft}}(x),\, y\big) + L_{div} \tag{7}$$
Based on the above total loss function, the parameters of the endoscopic image classification model of the embodiments of the present disclosure can be adjusted so that the total loss function is minimized as iterative training proceeds, yielding a trained endoscopic image classification model.
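The following sketch ties equations (2)-(7) together in one loss routine; the function name and the numerical-stability epsilon are illustrative assumptions, and the double loop is written for clarity rather than speed.

```python
import torch
import torch.nn.functional as F

def multi_expert_loss(expert_logits, labels, eps=1e-12):
    """Total loss of equation (7): cross-entropy on the fused prediction,
    equations (3)-(4), plus the negated mean KL divergence, equations (5)-(6)."""
    probs = [F.softmax(z, dim=-1) for z in expert_logits]      # p_i(x), equation (2)
    p_soft = torch.stack(probs).mean(dim=0)                    # fusion, equation (3)
    l_ce = F.nll_loss((p_soft + eps).log(), labels)            # equation (4)

    n, l_div = len(probs), 0.0
    for i in range(n):                                         # equations (5)-(6):
        for j in range(n):                                     # minimize -KL to push
            if i != j:                                         # the experts apart
                kl = (probs[i] * ((probs[i] + eps) / (probs[j] + eps)).log()).sum(-1).mean()
                l_div = l_div - kl
    l_div = l_div / (n * (n - 1))
    return l_ce + l_div                                        # equation (7)
```

A training step would run the image batch through every expert, call this routine on the list of logits, and backpropagate the returned scalar.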
The embodiments of the present disclosure rely on joint multi-expert decision-making and take as the training objectives that the final fused multi-expert result be closest to the ground-truth label and that the distribution distance between the outputs of the multiple experts be maximized, so that the trained endoscopic image classification model adapts to the data distribution and improves prediction accuracy on the head and tail classes simultaneously.
Furthermore, because the number of expert sub-networks is large and the model correspondingly complex, the present disclosure further compresses the endoscopic image classification model composed of multiple expert sub-networks by means of knowledge distillation, condensing it into a more compact student network.
FIG. 7A shows a schematic diagram of an endoscopic image classification model 700A incorporating knowledge distillation according to another embodiment of the present disclosure.
As shown in FIG. 7A, the endoscopic image classification model 700A incorporating knowledge distillation according to an embodiment of the present disclosure includes two sub-networks: a teacher network 703A and a student network 705A.
For example, the teacher network 703A here may be the multiple expert sub-networks of the endoscopic image classification model 500A described with reference to FIG. 5A, and the student network 705A may have the same structure as each expert sub-network.
The embodiments of the present disclosure design a student network 705A with the same structure as the expert sub-networks and, based on the principle of knowledge distillation, use the multiple expert sub-networks as a teacher network to train it. The trained student network ultimately obtained has a simpler structure and fewer parameters than the original multi-expert network, while achieving an accuracy close to that of the multi-expert classification network.
Similarly, each expert sub-network of the teacher network 703A in FIG. 7A, as well as the student network, must start from the original image, first extracting a shallow feature representation with the shallower layers and then extracting a deeper, network-specific feature representation with the deeper network structure. In fact, because the shallow feature representation has little influence on classification, in a variant of the endoscopic image classification model 700A incorporating knowledge distillation according to the embodiments of the present disclosure, the teacher network and the student network may share the same shallow feature extractor and then learn specific deep features with their own deep feature extractors for the classification task, which further reduces model complexity. FIG. 7B shows a schematic diagram of such an endoscopic image classification model 700B incorporating knowledge distillation according to another embodiment of the present disclosure.
As shown in FIG. 7B, the endoscopic image classification model 700B incorporating knowledge distillation includes, in addition to a teacher network 703B and a student network 705B, a shared sub-network 701B.
As described with reference to FIG. 5B, the teacher network 703B here may be the multiple expert sub-networks constituting the endoscopic image classification model 500B of FIG. 5B. Both the teacher network 703B and the student network 705B are connected to the shared sub-network 701B and perform further deep feature extraction on the shallow feature representation it extracts in order to carry out the classification task.
Alternatively, the shallow feature extractor in the shared sub-network 701B and the deep feature extractors in the multiple expert sub-networks may be any other feature extractor suitable for extracting image features.
FIG. 7C shows an exemplary endoscopic image classification model 700C incorporating knowledge distillation and using a Transformer as the feature extractor. For example, the shared sub-network 701C here may be a Vision Transformer comprising a linear mapper layer, a position encoder layer, and one conventional Transformer encoder block. The expert sub-networks of the teacher network 703C and the student network 705C may share this common shallow feature extractor (that is, the shared sub-network 701C) to obtain common shallow features, and use multiple layers of conventional Transformer encoder blocks (for example, three layers as shown in FIG. 7C, though other numbers of layers are possible and the present disclosure is not limited in this respect) as deep feature extractors to extract specific deep features for classification and recognition, as shown in FIG. 7C.
FIG. 8 shows a flowchart of a method 800 for training an endoscopic image classification model incorporating knowledge distillation according to one embodiment of the present disclosure.
First, in step S801, the image samples of the training image sample set are input into each of the plurality of expert sub-networks of the teacher network and into the student network.
For example, the endoscopic image classification model incorporating knowledge distillation here may be the model 700A shown in FIG. 7A.
As an alternative embodiment, when classification training is performed with the endoscopic image classification model 700B incorporating knowledge distillation shown in FIG. 7B, the shallow features of the image sample may first be extracted by a shared sub-network, and these shallow features (rather than the original image sample itself) are then input into each of the plurality of expert sub-networks and into the student network, which further apply their deep feature extractors to extract more specific deep features.
Next, in step S803, the plurality of expert sub-networks are used to generate corresponding expert sub-network output results for the image sample, and the student network is used to generate a corresponding student network output result for the image sample. The generation of these network outputs is similar to step S603_2 of FIG. 6B, and its repeated description is omitted here.
In step S805, the final output result of the teacher network is generated based on the output results of the plurality of expert sub-networks. The generation of the final output result of the teacher network is similar to step S603_3 of FIG. 6B, and its repeated description is omitted here.
In step S807, a loss value is calculated through the target loss function, and the parameters of the endoscopic image classification model incorporating knowledge distillation are adjusted based on the loss value.
As described above, the optimization of the endoscopic image classification model 500A, 500B, or 500C has two goals: 1) the final fused multi-expert result should be closer to the ground-truth label, and 2) the distribution distance between the outputs of the multiple experts should be maximized so that the experts attend to different distributions of the data. The training method 800 for the endoscopic image classification model incorporating knowledge distillation takes the model 500A, 500B, or 500C as the teacher network and trains, by knowledge distillation, a student network that is leaner in both structure and parameters. Therefore, in addition to goals 1) and 2) above, the training method 800 is also expected to achieve the following two further goals: 3) the output of the student network should be closer to the output of the teacher network, and 4) the output distribution of the student network should be closer to the distribution of the outputs of the individual expert sub-networks of the teacher network.
Based on goals 1) and 2) above, the embodiments of the present disclosure construct the loss function of the teacher network as shown in equation (8):

$$L_{tea} = L_{ce}\big(p_{\text{soft}}(x),\, y\big) + L_{div} \tag{8}$$

Here $L_{ce}\big(p_{\text{soft}}(x), y\big)$ is the cross-entropy loss function, described above with reference to FIG. 6B, between the final output of the teacher network obtained by fusing the outputs of the multiple expert sub-networks (for example, the classification prediction probability) and the ground-truth label of the image sample, and $L_{div}$ is the divergence loss function over the outputs of the multiple expert sub-networks described above with reference to FIG. 6B.
Based on goals 3) and 4) above, the embodiments of the present disclosure construct the loss function of the student network as shown in equation (9):

$$L_{stu} = L_{ce}\big(p_{stu}(x),\, p_{\text{soft}}(x)\big) + \frac{1}{n}\sum_{i=1}^{n} D_{KL}\big(p_{stu}(x)\,\Vert\, \operatorname{softmax}(z_i)\big) \tag{9}$$

where $p_{\text{soft}}$ is the final classification prediction probability output by the teacher network and $p_{stu}$ is the classification prediction probability output by the student network. $L_{ce}\big(p_{stu}(x), p_{\text{soft}}(x)\big)$ denotes the cross-entropy loss function between the classification prediction probability output by the student network and the final classification prediction probability output by the teacher network. $z_i$ is the logits output by the i-th expert sub-network of the teacher network, and n is the number of expert sub-networks in the teacher network; as those skilled in the art will appreciate, normalizing logits (including the student network's own logits) with softmax yields the predicted probability distribution. The second term of equation (9) is thus the KL divergence between the output distribution of the student network and the multiple outputs of the multiple expert sub-networks of the teacher network.
Therefore, the total loss function of the training method of the endoscopic image classification model incorporating knowledge distillation according to one embodiment of the present disclosure can be defined as shown in equation (10):

$$L = \alpha\, L_{tea} + (1-\alpha)\, L_{stu} \tag{10}$$

where $\alpha$ is a weight parameter that is set to 1 at the start of training, decreases gradually as training proceeds, and finally falls to 0.
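A hedged sketch of how equations (8)-(10) could be combined in one distillation step follows; the helper names and the linear α schedule are assumptions chosen to match the stated 1-to-0 decay, not the disclosed implementation.

```python
import torch
import torch.nn.functional as F

def kl(p, q, eps=1e-12):
    """KL(p || q) for batched class-probability tensors."""
    return (p * ((p + eps) / (q + eps)).log()).sum(-1).mean()

def distillation_loss(expert_logits, student_logits, labels, alpha, eps=1e-12):
    """Equation (10): alpha * teacher loss (8) + (1 - alpha) * student loss (9)."""
    probs = [F.softmax(z, dim=-1) for z in expert_logits]
    p_soft = torch.stack(probs).mean(dim=0)                 # fused teacher output, eq. (3)
    p_stu = F.softmax(student_logits, dim=-1)

    # Teacher loss, equation (8): cross-entropy of the fused output
    # plus the negated mean pairwise KL divergence between the experts.
    n = len(probs)
    l_div = -sum(kl(probs[i], probs[j]) for i in range(n)
                 for j in range(n) if i != j) / (n * (n - 1))
    l_tea = F.nll_loss((p_soft + eps).log(), labels) + l_div

    # Student loss, equation (9): cross-entropy to the fused teacher output
    # plus the mean KL divergence from the student to each expert.
    l_stu = -(p_soft * (p_stu + eps).log()).sum(-1).mean() \
            + sum(kl(p_stu, p) for p in probs) / n

    return alpha * l_tea + (1 - alpha) * l_stu              # equation (10)

def alpha_schedule(step, total_steps):
    """Assumed linear decay of alpha from 1 to 0 over training."""
    return max(0.0, 1.0 - step / total_steps)
```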
Based on the above total loss function, the parameters of the endoscopic image classification model incorporating knowledge distillation of the embodiments of the present disclosure can be adjusted so that the total loss function is minimized as iterative training proceeds, yielding a trained endoscopic image classification model incorporating knowledge distillation. In the trained model, the student network has few parameters and a relatively simple structure, yet achieves prediction accuracy close to that of the complex teacher network; subsequent classification applications can therefore run directly on the trained student network alone.
Based on the student network trained in the above manner, the embodiments of the present disclosure further provide an endoscopic image classification method. A flowchart of the endoscopic image classification method in the embodiments of the present disclosure is described with reference to FIG. 9; the method includes:
In step S901, an endoscopic image to be recognized is acquired.
For example, if the image classification model was trained for polyp type recognition, the acquired endoscopic image to be recognized is a captured polyp image.
In step S903, the endoscopic image to be recognized is input into a trained endoscopic image classification model to obtain a classification result of the endoscopic image.
For example, the endoscopic image classification model here may be the endoscopic image classification model 500A, 500B, or 500C trained with the methods described above.
For example, alternatively, if the trained endoscopic image classification model is the model shown in FIG. 5B, the endoscopic image to be recognized may first be input into the shared sub-network of the trained endoscopic image classification model to extract shallow features, which are then fed into the remainder of the trained endoscopic image classification model.
For example, alternatively, the trained model may be an endoscopic image classification model incorporating knowledge distillation, such as the model 700A, 700B, or 700C described above. Because the student network has few parameters, a relatively simple structure, and prediction accuracy close to that of the complex teacher network, the endoscopic image to be recognized can be input directly into the student network of the trained endoscopic image classification model incorporating knowledge distillation.
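At inference time only the distilled student is needed, as the following minimal sketch shows; the preprocessing, the model handle, and the class names are illustrative assumptions.

```python
import torch

@torch.no_grad()
def classify_endoscopic_image(student, image, class_names):
    """Run one preprocessed image tensor through the trained student network."""
    student.eval()
    probs = torch.softmax(student(image.unsqueeze(0)), dim=-1)[0]
    return class_names[int(probs.argmax())], probs

# Hypothetical usage with the four polyp labels from the training set:
# label, probs = classify_endoscopic_image(
#     student, image, ["adenoma", "hyperplastic", "inflammatory", "cancer"])
```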
Based on the above embodiments, FIG. 10 is a schematic structural diagram of an endoscopic image classification system 1000 in an embodiment of the present disclosure. The endoscopic image classification system 1000 includes at least an image acquisition component 1001, a processing component 1002, and an output component 1003. In the embodiments of the present disclosure, the image acquisition component 1001, the processing component 1002, and the output component 1003 are related medical devices; they may be integrated in the same medical device, or divided among multiple devices that connect and communicate with one another to form a medical system. For example, for the diagnosis of digestive tract diseases, the image acquisition component 1001 may be an endoscope, while the processing component 1002 and the output component 1003 may be computer equipment communicating with the endoscope.
Specifically, the image acquisition component 1001 is used to acquire the image to be recognized. The processing component 1002 is used, for example, to execute the method steps shown in FIG. 9, extracting image feature information of the image to be recognized and obtaining a lesion classification result of the image based on that feature information. The output component 1003 is used to output the classification result of the image to be recognized.
FIG. 11 shows a training apparatus 1100 for an endoscopic image classification model according to an embodiment of the present disclosure, which specifically includes a training data set acquisition component 1101 and a training component 1103.
The training data set acquisition component 1101 is used to acquire a training data set, the training data set including a plurality of endoscopic images and annotation labels of the plurality of endoscopic images, wherein the training data set presents a long-tailed distribution. The training component 1103 is used to train the endoscopic image classification model based on the training data set until the target loss function of the endoscopic image classification model converges, so as to obtain a trained endoscopic image classification model.
For example, the target loss function is determined based at least on the corresponding output results of the plurality of expert sub-networks.
For example, the training component 1103 includes: an input sub-component 1103_1 for inputting the image samples of the training image sample set into each of the plurality of expert sub-networks; an output result generation sub-component 1103_2 for generating, with the plurality of expert sub-networks, corresponding expert sub-network output results for the image sample, and for generating the final output result of the endoscopic image classification model based on those expert sub-network output results; a loss function calculation sub-component 1103_3 for calculating a loss value through the target loss function based on at least the expert sub-network output results and the final output result; and a parameter adjustment sub-component 1103_4 for adjusting the parameters of the endoscopic image classification model based on the loss value.
For example, the endoscopic image classification model further includes a shared sub-network, in which case the training component 1103 includes: an input sub-component 1103_1 for inputting the image samples of the training image sample set into the shared sub-network to extract shallow feature representations; an output result generation sub-component 1103_2 for generating, with the plurality of expert sub-networks and based on the extracted shallow feature representations, corresponding expert sub-network output results for the image sample, and for generating the final output result of the endoscopic image classification model based on those expert sub-network output results; a loss function calculation sub-component 1103_3 for calculating a loss value through the target loss function based on at least the expert sub-network output results and the final output result; and a parameter adjustment sub-component 1103_4 for adjusting the parameters of the endoscopic image classification model based on the loss value.
For example, the target loss function of the endoscopic image classification model includes a cross-entropy loss function determined based on the final output result of the endoscopic image classification model and the annotation labels of the image samples, and a KL divergence determined based on the output results of the multiple expert sub-networks.
For example, the output result generation sub-component 1103_2 fuses the output results of the plurality of expert sub-networks to serve as the final output result of the endoscopic image classification model.
For example, the fusing of the output results of the plurality of expert sub-networks by the output result generation sub-component 1103_2 includes performing a weighted average of those output results.
For example, the endoscopic image classification model further includes a student network with the same structure as the expert sub-networks, the plurality of expert sub-networks constitute a teacher network, the teacher network is used to train the student network based on knowledge distillation, and the output result generation sub-component 1103_2 further uses the student network to generate a corresponding student network output result for the image sample.
For example, the loss function calculation sub-component 1103_3 calculates a loss value through the target loss function based on the expert sub-network output results, the final output result, and the student network output result, and the parameter adjustment sub-component 1103_4 adjusts the parameters of the endoscopic image classification model based on the loss value.
For example, the target loss function is a weighted sum of the loss function of the teacher network and the loss function of the student network.
For example, the weight value of the loss function of the teacher network and the weight value of the loss function of the student network sum to 1; the weight value of the loss function of the teacher network decreases continually as training iterates, eventually reaching 0, while the weight value of the loss function of the student network increases continually, eventually reaching 1.
For example, the loss function of the teacher network includes a cross-entropy loss function determined based on the final output result of the endoscopic image classification model and the annotation labels of the image samples, and a KL divergence determined based on the output results of the multiple expert sub-networks; the loss function of the student network includes a cross-entropy loss function determined based on the student network output result and the final output result of the endoscopic image classification model, and a KL divergence determined based on the student network output result and the multiple expert sub-network output results generated by the plurality of expert sub-networks.
For example, the shared sub-network includes a Vision Transformer, and each of the plurality of expert sub-networks includes multiple sequentially connected Transformer encoders and a classifier.
Based on the above embodiments, an embodiment of the present disclosure further provides an electronic device of another exemplary implementation. In some possible implementations, the electronic device in the embodiments of the present disclosure may include a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, may implement the steps of the endoscopic image classification model training method or the endoscopic image recognition method of the above embodiments.
For example, taking the electronic device as the server 100 in FIG. 1 of the present disclosure, the processor in the electronic device is the processor 110 in the server 100, and the memory in the electronic device is the memory 120 in the server 100.
Embodiments of the present disclosure also provide a computer-readable storage medium. FIG. 12 shows a schematic diagram 1200 of a storage medium according to an embodiment of the disclosure. As shown in FIG. 12, computer-executable instructions 1201 are stored on the computer-readable storage medium 1200. When the computer-executable instructions 1201 are run by a processor, the training method of the endoscopic image classification model incorporating knowledge distillation and the endoscopic image classification method according to the embodiments of the present disclosure described with reference to the above figures may be executed. The computer-readable storage medium includes, but is not limited to, volatile memory and/or non-volatile memory; the volatile memory may include, for example, random access memory (RAM) and/or cache memory, and the non-volatile memory may include, for example, read-only memory (ROM), a hard disk, or flash memory.
Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the training method of the endoscopic image classification model incorporating knowledge distillation and the endoscopic image classification method according to the embodiments of the present disclosure.
Those skilled in the art will understand that the content disclosed in the present disclosure admits many variations and improvements. For example, the various devices or components described above may be implemented in hardware, or in software, firmware, or a combination of some or all of the three.
Furthermore, although the present disclosure makes various references to certain units of a system according to embodiments of the present disclosure, any number of different units may be used and run on a client and/or server. The units described are illustrative only, and different aspects of the systems and methods may use different units.
Those of ordinary skill in the art will understand that all or part of the steps of the above methods may be completed by instructing relevant hardware through a program, and the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or part of the steps of the above embodiments may also be implemented using one or more integrated circuits. Correspondingly, each module or unit in the above embodiments may be implemented in the form of hardware or in the form of a software functional module. The present disclosure is not limited to any specific combination of hardware and software.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It should also be understood that terms such as those defined in common dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant technology, and should not be interpreted in an idealized or excessively formalized sense unless expressly so defined herein.
The above is a description of the present disclosure and should not be considered a limitation thereof. Although several exemplary embodiments of the present disclosure have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without departing from the novel teachings and advantages of this disclosure. Accordingly, all such modifications are intended to be included within the scope of this disclosure as defined in the claims. It should be understood that the above describes the present disclosure and should not be considered limited to the particular embodiments disclosed; modifications to the disclosed embodiments, as well as other embodiments, are intended to be within the scope of the appended claims. The present disclosure is defined by the claims and their equivalents.

Claims (19)

  1. A training method of an endoscopic image classification model based on multi-expert decision-making, wherein the endoscopic image classification model comprises a plurality of expert sub-networks, the method comprising:
    acquiring a training data set, the training data set comprising a plurality of endoscopic images and annotation labels of the plurality of endoscopic images, wherein the training data set presents a long-tailed distribution; and
    training the endoscopic image classification model based on the training data set until a target loss function of the endoscopic image classification model converges, so as to obtain a trained endoscopic image classification model,
    wherein the target loss function is determined based at least on corresponding output results of the plurality of expert sub-networks.
  2. The method according to claim 1, wherein training the endoscopic image classification model based on the training data set comprises:
    inputting image samples of the training image sample set into each of the plurality of expert sub-networks;
    generating, with the plurality of expert sub-networks, corresponding expert sub-network output results for the image samples;
    generating a final output result of the endoscopic image classification model based on the plurality of expert sub-network output results; and
    calculating a loss value through the target loss function based on at least the plurality of expert sub-network output results and the final output result, and adjusting parameters of the endoscopic image classification model based on the loss value.
  3. The method according to claim 1, wherein the endoscopic image classification model further comprises a shared sub-network, and wherein training the endoscopic image classification model based on the training data set comprises:
    inputting image samples of the training image sample set into the shared sub-network to extract shallow feature representations;
    generating, with the plurality of expert sub-networks and based on the extracted shallow feature representations, corresponding expert sub-network output results for the image samples;
    generating a final output result of the endoscopic image classification model based on the plurality of expert sub-network output results; and
    calculating a loss value through the target loss function based on at least the plurality of expert sub-network output results and the final output result, and adjusting parameters of the endoscopic image classification model based on the loss value.
  4. The method according to claim 2 or 3, wherein the target loss function of the endoscopic image classification model comprises: a cross-entropy loss function determined based on the final output result of the endoscopic image classification model and the annotation labels of the image samples, and a Kullback-Leibler divergence determined based on the output results of the plurality of expert sub-networks.
  5. The method according to any one of claims 2-4, wherein generating the final output result of the endoscopic image classification model based on the plurality of expert sub-network output results comprises:
    fusing the plurality of expert sub-network output results to serve as the final output result of the endoscopic image classification model.
  6. The method according to claim 5, wherein fusing the plurality of expert sub-network output results comprises:
    performing a weighted average of the plurality of expert sub-network output results.
  7. The method according to any one of claims 2-6, wherein the endoscopic image classification model further comprises a student network having the same structure as the expert sub-networks, wherein the plurality of expert sub-networks constitute a teacher network and the teacher network is used to train the student network based on knowledge distillation, the method further comprising:
    generating, with the student network, a corresponding student network output result for the image samples.
  8. The method according to claim 7, wherein calculating a loss value through the target loss function based on at least the plurality of expert sub-network output results and the final output result comprises:
    calculating the loss value through the target loss function based on the plurality of expert sub-network output results, the final output result, and the student network output result.
  9. The method according to claim 8, wherein the target loss function is a weighted sum of a loss function of the teacher network and a loss function of the student network.
  10. The method according to claim 9, wherein the weight value of the loss function of the teacher network and the weight value of the loss function of the student network sum to 1, and wherein the weight value of the loss function of the teacher network decreases continually with training iterations until it finally reaches 0, while the weight value of the loss function of the student network increases continually with training iterations until it finally reaches 1.
  11. The method according to claim 9 or 10, wherein
    the loss function of the teacher network comprises: a cross-entropy loss function determined based on the final output result of the endoscopic image classification model and the annotation labels of the image samples, and a Kullback-Leibler divergence determined based on the output results of the plurality of expert sub-networks, and
    the loss function of the student network comprises: a cross-entropy loss function determined based on the student network output result of the student network and the final output result of the endoscopic image classification model, and a Kullback-Leibler divergence determined based on the student network output result of the student network and the plurality of expert sub-network output results generated by the plurality of expert sub-networks.
  12. The method according to claim 3, wherein the shared sub-network comprises a Vision Transformer, and each of the plurality of expert sub-networks comprises multiple sequentially connected Transformer encoders and a classifier.
  13. An endoscopic image classification method, comprising:
    acquiring an endoscopic image to be recognized; and
    obtaining a classification result of the endoscopic image based on a trained endoscopic image classification model,
    wherein the trained endoscopic image classification model is obtained with the training method of the endoscopic image classification model according to any one of claims 1-12.
  14. An endoscopic image classification method, comprising:
    acquiring an endoscopic image to be recognized; and
    obtaining a classification result of the endoscopic image based on a student network in a trained endoscopic image classification model,
    wherein the trained endoscopic image classification model is obtained with the training method of the endoscopic image classification model according to any one of claims 7-12.
  15. An endoscopic image classification system, comprising:
    an image acquisition component configured to acquire an endoscopic image to be recognized;
    a processing component configured to obtain a classification result of the endoscopic image based on a trained endoscopic image classification model; and
    an output component configured to output the classification result of the image to be recognized,
    wherein the trained endoscopic image classification model is obtained with the training method of the endoscopic image classification model according to any one of claims 1-12.
  16. An endoscopic image classification system, comprising:
    an image acquisition component configured to acquire an endoscopic image to be recognized;
    a processing component configured to obtain a classification result of the endoscopic image based on a student network in a trained endoscopic image classification model; and
    an output component configured to output the classification result of the image to be recognized,
    wherein the trained endoscopic image classification model is obtained with the training method of the endoscopic image classification model according to any one of claims 7-12.
  17. A training apparatus for an endoscopic image classification model based on multi-expert decision-making, wherein the endoscopic image classification model comprises a plurality of expert sub-networks, the apparatus comprising:
    a training data set acquisition component configured to acquire a training data set, the training data set comprising a plurality of endoscopic images and annotation labels of the plurality of endoscopic images, wherein the training data set presents a long-tailed distribution; and
    a training component configured to train the endoscopic image classification model based on the training data set until a target loss function of the endoscopic image classification model converges, so as to obtain a trained endoscopic image classification model,
    wherein the target loss function is determined at least based on the respective output results of the plurality of expert sub-networks.
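Operationally, the training component of claim 17 amounts to an optimize-until-convergence loop over the long-tailed training set. A schematic sketch reusing the multi-expert model above; the optimizer, learning rate, and the crude loss-plateau convergence test are assumptions, since the claim only requires iterating until the target loss function converges:

```python
import torch

def train_until_convergence(model, loader, target_loss_fn,
                            max_epochs=50, tol=1e-4):
    """Train the multi-expert classification model until the target
    loss (built from the expert sub-network outputs) stops improving."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        running = 0.0
        for images, labels in loader:
            final, expert_logits = model(images)
            loss = target_loss_fn(final, expert_logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
            running += loss.item()
        running /= len(loader)
        if abs(prev_loss - running) < tol:  # loss plateau ~ convergence
            break
        prev_loss = running
    return model
```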
  18. An electronic device, comprising a memory and a processor, wherein the memory stores program code readable by the processor, and when the processor executes the program code, the method according to any one of claims 1-14 is performed.
  19. A computer-readable storage medium having computer-executable instructions stored thereon, the computer-executable instructions being used to perform the method according to any one of claims 1-14.
PCT/CN2022/117043 2021-09-06 2022-09-05 Training method and apparatus of endoscope image classification model, and image classification method WO2023030520A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111039189.1A CN113486990B (en) 2021-09-06 2021-09-06 Training method of endoscope image classification model, image classification method and device
CN202111039189.1 2021-09-06

Publications (1)

Publication Number Publication Date
WO2023030520A1

Family

ID=77946539

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/117043 WO2023030520A1 (en) 2021-09-06 2022-09-05 Training method and apparatus of endoscope image classification model, and image classification method

Country Status (2)

Country Link
CN (1) CN113486990B (en)
WO (1) WO2023030520A1 (en)


Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486990B (en) * 2021-09-06 2021-12-21 北京字节跳动网络技术有限公司 Training method of endoscope image classification model, image classification method and device
CN113706526B (en) * 2021-10-26 2022-02-08 北京字节跳动网络技术有限公司 Training method and device for endoscope image feature learning model and classification model
CN113822373B (en) * 2021-10-27 2023-09-15 南京大学 Image classification model training method based on integration and knowledge distillation
CN113743384B (en) * 2021-11-05 2022-04-05 广州思德医疗科技有限公司 Stomach picture identification method and device
CN113822389B (en) * 2021-11-24 2022-02-22 紫东信息科技(苏州)有限公司 Digestive tract disease classification system based on endoscope picture
CN114464152B (en) * 2022-04-13 2022-07-19 齐鲁工业大学 Music genre classification method and system based on visual transformation network
CN115019183B (en) * 2022-07-28 2023-01-20 北京卫星信息工程研究所 Remote sensing image model migration method based on knowledge distillation and image reconstruction
CN115905533B (en) * 2022-11-24 2023-09-19 湖南光线空间信息科技有限公司 Multi-label text intelligent classification method
CN116152612B (en) * 2023-04-21 2023-08-15 粤港澳大湾区数字经济研究院(福田) Long-tail image recognition method and related device
CN117455878A (en) * 2023-11-08 2024-01-26 中国医学科学院北京协和医院 CCTA image-based coronary vulnerable plaque identification method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523522A (en) * 2018-10-30 2019-03-26 腾讯科技(深圳)有限公司 Processing method, device, system and the storage medium of endoscopic images
CN110288597A (en) * 2019-07-01 2019-09-27 哈尔滨工业大学 Wireless capsule endoscope saliency detection method based on attention mechanism
CN111666998A (en) * 2020-06-03 2020-09-15 电子科技大学 Endoscope intelligent intubation decision-making method based on target point detection
CN113034500A (en) * 2021-05-25 2021-06-25 紫东信息科技(苏州)有限公司 Digestive tract endoscope picture focus identification system based on multi-channel structure
CN113486990A (en) * 2021-09-06 2021-10-08 北京字节跳动网络技术有限公司 Training method of endoscope image classification model, image classification method and device

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180268292A1 (en) * 2017-03-17 2018-09-20 Nec Laboratories America, Inc. Learning efficient object detection models with knowledge distillation
CN109726619A (en) * 2017-10-31 2019-05-07 深圳市祈飞科技有限公司 A kind of convolutional neural networks face identification method and system based on parameter sharing
CN108280488B (en) * 2018-02-09 2021-05-07 哈尔滨工业大学 Grippable object identification method based on shared neural network
US10643602B2 (en) * 2018-03-16 2020-05-05 Microsoft Technology Licensing, Llc Adversarial teacher-student learning for unsupervised domain adaptation
CN110059717A (en) * 2019-03-13 2019-07-26 山东大学 Convolutional neural networks automatic division method and system for breast molybdenum target data set
CN110472730A (en) * 2019-08-07 2019-11-19 交叉信息核心技术研究院(西安)有限公司 A kind of distillation training method and the scalable dynamic prediction method certainly of convolutional neural networks
CN110680326B (en) * 2019-10-11 2022-05-06 北京大学第三医院(北京大学第三临床医学院) Pneumoconiosis identification and grading judgment method based on deep convolutional neural network
CN111062951B (en) * 2019-12-11 2022-03-25 华中科技大学 Knowledge distillation method based on semantic segmentation intra-class feature difference
CN111782937A (en) * 2020-05-15 2020-10-16 北京三快在线科技有限公司 Information sorting method and device, electronic equipment and computer readable medium
CN111695698B (en) * 2020-06-12 2023-09-12 北京百度网讯科技有限公司 Method, apparatus, electronic device, and readable storage medium for model distillation
CN112183818A (en) * 2020-09-02 2021-01-05 北京三快在线科技有限公司 Recommendation probability prediction method and device, electronic equipment and storage medium
CN112200795A (en) * 2020-10-23 2021-01-08 苏州慧维智能医疗科技有限公司 Large intestine endoscope polyp detection method based on deep convolutional network
CN112862095B (en) * 2021-02-02 2023-09-29 浙江大华技术股份有限公司 Self-distillation learning method and device based on feature analysis and readable storage medium
CN113065558B (en) * 2021-04-21 2024-03-22 浙江工业大学 Lightweight small target detection method combined with attention mechanism
CN113239985B (en) * 2021-04-25 2022-12-13 北京航空航天大学 Distributed small-scale medical data set-oriented classification detection method
CN113344206A (en) * 2021-06-25 2021-09-03 江苏大学 Knowledge distillation method, device and equipment integrating channel and relation feature learning


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116703953A (en) * 2023-03-17 2023-09-05 南通大学 Inferior mesenteric artery segmentation method based on CT image
CN116703953B (en) * 2023-03-17 2024-05-24 南通大学 Inferior mesenteric artery segmentation method based on CT image
CN116168255A (en) * 2023-04-10 2023-05-26 武汉大学人民医院(湖北省人民医院) Retina OCT (optical coherence tomography) image classification method with robust long tail distribution
CN116168255B (en) * 2023-04-10 2023-12-08 武汉大学人民医院(湖北省人民医院) Retina OCT (optical coherence tomography) image classification method with robust long tail distribution
CN116258914A (en) * 2023-05-15 2023-06-13 齐鲁工业大学(山东省科学院) Remote sensing image classification method based on machine learning and local and global feature fusion
CN116258914B (en) * 2023-05-15 2023-08-25 齐鲁工业大学(山东省科学院) Remote Sensing Image Classification Method Based on Machine Learning and Local and Global Feature Fusion
CN116612336B (en) * 2023-07-19 2023-10-03 浙江华诺康科技有限公司 Method, apparatus, computer device and storage medium for classifying smoke in endoscopic image
CN117056678A (en) * 2023-10-12 2023-11-14 北京宝隆泓瑞科技有限公司 Machine pump equipment operation fault diagnosis method and device based on small sample
CN117056678B (en) * 2023-10-12 2024-01-02 北京宝隆泓瑞科技有限公司 Machine pump equipment operation fault diagnosis method and device based on small sample
CN117197472A (en) * 2023-11-07 2023-12-08 四川农业大学 Efficient teacher and student semi-supervised segmentation method and device based on endoscopic images of epistaxis
CN117197472B (en) * 2023-11-07 2024-03-08 四川农业大学 Efficient teacher and student semi-supervised segmentation method and device based on endoscopic images of epistaxis

Also Published As

Publication number Publication date
CN113486990A (en) 2021-10-08
CN113486990B (en) 2021-12-21

Similar Documents

Publication Publication Date Title
WO2023030520A1 (en) Training method and apparatus of endoscope image classification model, and image classification method
WO2023030521A1 (en) Endoscope image classification model training method and device, and endoscope image classification method
WO2023071680A1 (en) Endoscope image feature learning model training method and apparatus, and endoscope image classification model training method and apparatus
EP3876190A1 (en) Endoscopic image processing method and system and computer device
WO2020098539A1 (en) Image processing method and apparatus, computer readable medium, and electronic device
TW201922174A (en) Image diagnosis assistance apparatus, data collection method, image diagnosis assistance method, and image diagnosis assistance program
WO2021103938A1 (en) Medical image processing method, apparatus and device, medium and endoscope
CN113470029B (en) Training method and device, image processing method, electronic device and storage medium
WO2020224153A1 (en) Nbi image processing method based on deep learning and image enhancement, and application thereof
Oliveira et al. Deep transfer learning for segmentation of anatomical structures in chest radiographs
JP7363883B2 (en) Image processing methods, devices and computer readable storage media
EP4120186A1 (en) Computer-implemented systems and methods for object detection and characterization
Jia et al. Face spoofing detection under super-realistic 3D wax face attacks
TWI728369B (en) Method and system for analyzing skin texture and skin lesion using artificial intelligence cloud based platform
Lin et al. A desmoking algorithm for endoscopic images based on improved U‐Net model
CN113177940A (en) Gastroscope video part identification network structure based on Transformer
WO2023165332A1 (en) Tissue cavity positioning method, apparatus, readable medium, and electronic device
EP4241650A1 (en) Image processing method, and electronic device and readable storage medium
CN115100723A (en) Face color classification method, device, computer readable program medium and electronic equipment
Gangrade et al. Colonoscopy polyp segmentation using deep residual u-net with bottleneck attention module
CN106296631A (en) A kind of gastroscope video summarization method based on attention priori
CN117338378B (en) Articulated laparoscopic forceps and rapid abdominal image segmentation method based on SBB U-NET
US20240087115A1 (en) Machine learning enabled system for skin abnormality interventions
Pan et al. Real‐time coloring method of laser surgery video based on generative adversarial network
US20240013509A1 (en) Computer-implemented systems and methods for intelligent image analysis using spatio-temporal information

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE