WO2022057078A1

WO2022057078A1 - Real-time colonoscopy image segmentation method and device based on ensemble and knowledge distillation

Info

Publication number: WO2022057078A1
Application number: PCT/CN2020/130114
Authority: WO
Inventors: 李坚强; 陈杰; 黄志超
Original assignee: 深圳大学
Priority date: 2020-09-21
Filing date: 2020-11-19
Publication date: 2022-03-24
Also published as: CN111932561A

Abstract

Disclosed in the present invention are a real-time colonoscopy image segmentation method and device based on ensemble and knowledge distillation. The method comprises: acquiring a plurality of training images, the training images being classified into a plurality training image sets, and the training images in the same training image set being from the same data set; first training teacher models, different teacher models obtaining first segmented images respectively according to different training image sets; and then extracting a student model by using trained teacher models jointly. The training images are colonoscopy image screenshots, and a trained student model can generate a real-time colonoscopy image segmented image according to a real-time colonoscopy image. Therefore, the problem that data sets among different hospitals are discontinuous and cannot be collected together to train a colonoscopy automatic image segmentation model is solved.

Description

Real-time colonoscopy image segmentation method and device based on integrated knowledge distillation

technical field

The invention relates to the field of image segmentation, in particular to a real-time colonoscopy image segmentation method and device based on integrated knowledge distillation.

Background technique

Minimally invasive surgery has limited field of view, especially in colonoscopy, there are often blind spots in the field of vision. Therefore, automatic image segmentation of real-time colonoscopy plays an important role in intestinal surgery. The training of existing automatic image segmentation models for colonoscopy usually requires datasets from different hospitals. However, the datasets between different hospitals are discontinuous and cannot be pooled together to train an automatic image segmentation model for colonoscopy.

Therefore, the existing technology still needs to be improved.

SUMMARY OF THE INVENTION

The technical problem to be solved by the present invention is to provide a real-time colonoscopy image segmentation method and storage medium based on integrated knowledge distillation, aiming at solving the problem that the data sets between different hospitals cannot be collected in the prior art. The problem of training an automatic image segmentation model for colonoscopy.

The technical scheme adopted by the present invention to solve the problem is as follows:

In a first aspect, an embodiment of the present invention provides a real-time colonoscopy image segmentation method based on integrated knowledge distillation, wherein the method includes:

Acquiring training images; the training images are screenshots of colonoscopy images used to train the teacher model and the student model; the training images are divided into multiple training image sets, and the training images of the same training image set are from the same data set;

Inputting the training image to the teacher model to obtain a first segmentation map; wherein, the number of the teacher models is greater than or equal to two; different teacher models obtain the first segmentation map according to different training image sets;

Modify the parameters of the teacher model according to the first segmentation map and the first real label, and continue to perform the steps of inputting the training image into the teacher model to obtain the first segmentation map until the The preset training conditions of the teacher model to obtain the trained teacher model; the first real label is used to reflect the real classification situation corresponding to the pixels on the training image under the first preset classification conditions;

Inputting the training image to the student model to obtain a second segmentation map;

Modify the parameters of the student model according to the second segmentation map, the teacher label and the second real label, and continue to perform the steps of inputting the training image into the student model to obtain the second segmentation map, until Meet the preset training conditions of the student model to obtain the trained student model; the second real label is used to reflect the real classification situation corresponding to the pixels on the training image under the second preset classification condition; the The teacher label is used to reflect the classification situation of the training image in the trained teacher model;

Real-time colonoscopy images are input to the trained student model to generate real-time colonoscopy image segmentation maps.

In one embodiment, the acquired training images are screenshots of colonoscopy images used to train the teacher model and the student model, including:

Take screenshots of colonoscopy images;

Compression processing is performed according to the screenshot of the colonoscopy image to obtain the training image; the height, width and number of channels of the training image are all constant.

In one embodiment, the teacher model includes a first down-sampling encoder and a first up-sampling decoder; the inputting the training image into the teacher model to obtain the first segmentation map includes:

Perform feature extraction on the training image according to the first down-sampling encoder to obtain a first feature map; the first feature map includes feature information of the training image;

Analyze the first feature map according to the first upsampling decoder to obtain the first segmentation map;

The first segmentation map includes the first standard probability and the first abnormal probability corresponding to the pixels in the training image; the first standard probability is the probability that the pixel belongs to the standard under the first preset classification condition , the first abnormal probability is the probability that the pixel belongs to abnormal under the first preset classification condition; the sum of the probability value of the first abnormal probability and the first standard probability is 1.

In an embodiment, the parameters of the teacher model are modified according to the first segmentation map and the first real label, and the training image is input into the teacher model continuously to obtain the first The step of dividing the graph, until the preset training conditions of the teacher model are met, to obtain the trained teacher model, including:

Calculate a first loss value according to the first segmentation map and the first ground truth;

Adjust parameters of the first upsampling decoder according to the first loss value to update the teacher model;

The step of inputting the training image into the teacher model to obtain the first segmentation map is continued until the preset training conditions of the teacher model are satisfied, so as to obtain the trained teacher model.

In one embodiment, the student model includes a second down-sampling encoder and a second up-sampling decoder; the inputting the training image into the student model to obtain the second segmentation map includes:

The second feature map is output after feature extraction is performed on the training image according to the second down-sampling encoder; the second feature map includes feature information of the training image;

Analyze the second feature map according to the second upsampling decoder to obtain the second segmentation map;

Wherein; the second segmentation map includes the second standard probability and the second abnormal probability corresponding to the pixels in the training image; the second standard probability is the probability that the pixel belongs to the standard under the second preset classification condition , the second abnormality probability is the probability that the pixel belongs to abnormality under the second preset classification condition; the number of categories of the second preset classification condition is more than the number of categories of the first preset classification condition.

In an embodiment, the parameters of the student model are modified according to the second segmentation map, the teacher label and the second real label, and the step of generating a second segmentation map according to the training image is continued. , until the preset training conditions of the student model are met to obtain a trained student model, including:

calculating a second loss value according to the second segmentation map, the teacher label and the second ground truth;

Adjust parameters of the second upsampling decoder according to the second loss value to update the student model;

Continue to perform the step of generating a second segmentation map according to the training image until the preset training conditions of the student model are met, so as to obtain a trained student model.

In one embodiment, the calculating the second loss value according to the second segmentation map, the teacher label and the second ground truth includes:

Obtain the total probability value according to the first segmentation map output by all trained teacher models;

The teacher labels are obtained by adjusting the first segmentation maps output by all trained teacher models according to the probability total value.

In a second aspect, an embodiment of the present invention further provides an apparatus for real-time colonoscopy image segmentation based on integrated knowledge distillation, wherein the apparatus includes:

an image acquisition module, the image acquisition module is used to acquire training images;

a teacher model unit, the teacher model unit is used to obtain a first segmentation map according to the training image;

a first parameter correction module, which is used to correct the parameters of the teacher model according to the first segmentation map and the first real label;

a student model unit, the student model unit is used to obtain a second segmentation map according to the training image;

The second parameter correction module, the second parameter correction module is used to modify the parameters of the student model according to the second segmentation map, the teacher label and the second real label;

The teacher model unit also includes:

a first down-sampling encoder module, where the first down-sampling encoder module is used to perform feature extraction on the training image to obtain a first feature map;

a first upsampling decoder module, the first upsampling decoder module is configured to parse the first feature map to obtain the first segmentation map;

The student model unit also includes:

a second down-sampling encoder module, the second down-sampling encoder module is configured to perform feature extraction on the training image to obtain a second feature map;

A second up-sampling decoder module, the second up-sampling decoder module is configured to parse the second feature map to obtain the second segmentation map.

In a third aspect, an embodiment of the present invention also provides a terminal, which is characterized in that it includes a memory and one or more programs, wherein one or more programs are stored in the memory and configured to be composed of one or more programs. Execution of the one or more programs by a processor includes performing any of the methods described above.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, any one of the above-mentioned real-time integrated knowledge distillation-based real-time Steps of the colonoscopy image segmentation method.

Beneficial effects of the present invention: the present invention obtains a plurality of training images, the training images are divided into a plurality of training image sets, and the training images of the same training image set are from the same data set; Obtain the first segmentation map according to different training image sets; then use the trained teacher model to jointly refine a student model. The training image is a screenshot of a colonoscopy image, and the trained student model can generate a real-time colonoscopy image segmentation map according to the real-time colonoscopy image. This solves the problem that the data sets between different hospitals are discontinuous and cannot be pooled together to train an automatic image segmentation model for colonoscopy.

Description of drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments or the prior art. Obviously, the accompanying drawings in the following description are only These are some embodiments described in the present invention. For those of ordinary skill in the art, other drawings can also be obtained based on these drawings without any creative effort.

FIG. 1 is a first schematic flowchart of a real-time colonoscopy image segmentation method based on integrated knowledge distillation provided by an embodiment of the present invention.

FIG. 2 is a second schematic flowchart of a real-time colonoscopy image segmentation method based on integrated knowledge distillation provided by an embodiment of the present invention.

3 is a third schematic flowchart of a real-time colonoscopy image segmentation method based on integrated knowledge distillation provided by an embodiment of the present invention.

FIG. 4 is a schematic diagram of a connection relationship between a down-sampling encoder and an up-sampling decoder provided by an embodiment of the present invention.

5 is a fourth schematic flowchart of a real-time colonoscopy image segmentation method based on integrated knowledge distillation provided by an embodiment of the present invention.

FIG. 6 is a fifth schematic flowchart of a real-time colonoscopy image segmentation method based on integrated knowledge distillation provided by an embodiment of the present invention.

FIG. 7 is a sixth schematic flowchart of a real-time colonoscopy image segmentation method based on integrated knowledge distillation provided by an embodiment of the present invention.

FIG. 8 is a seventh schematic flowchart of a real-time colonoscopy image segmentation method based on integrated knowledge distillation provided by an embodiment of the present invention.

FIG. 9 is a schematic structural diagram of a hardware operating environment involved in the solution of an embodiment of the present invention.

FIG. 10 is a schematic diagram of the internal structure of a real-time enteroscopic image segmentation device based on integrated knowledge distillation provided by an embodiment of the present invention.

FIG. 11 is a prediction effect diagram of the student model provided by the solution according to the embodiment of the present invention.

detailed description

In order to make the objectives, technical solutions and advantages of the present invention clearer and clearer, the present invention will be further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention.

It should be noted that if there are directional indications (such as up, down, left, right, front, back, etc.) involved in the embodiments of the present invention, the directional indications are only used to explain a certain posture (as shown in the accompanying drawings). If the specific posture changes, the directional indication also changes accordingly.

With the development of minimally invasive surgery, artificial intelligence-assisted surgery represented by robotic systems has become more and more widely used. Assisted surgery is a system that uses robots to assist doctors in performing surgical operations, and its purpose is mainly to overcome the limitations of the existing minimally invasive surgical field of view. Adjunctive surgery is especially important during colonoscopy. Colonoscopy technique is one of the important techniques in intestinal surgery, however, in colonoscopy, many colon lesions usually have similar characteristics to normal mucosa, such as similar color, or too flat, this deceptive colon Lesions are often difficult to detect without special methods. Therefore, automatic image segmentation of real-time colonoscopy plays an important role in colonoscopy technology. Specifically, in the past studies, many people tried to use deep neural network models for natural images, or convolutional networks for biomedical image segmentation for medical image detection, such as automatic endoscopic detection and classification of colorectal polyps, and Achieving impressive detection results: studies have shown that deep learning can locate and identify polyps in real-time with high accuracy in screening colonoscopies (e.g., using YOLO to detect polyps in colonoscopy videos can be located in real-time with 96% accuracy) and identifying polyps).

At present, the training of automatic image segmentation models for colonoscopy usually requires datasets from different hospitals. However, the datasets between different hospitals are discontinuous and cannot be directly pooled together to train the model. In addition, prior art studies on automatic image segmentation for colonoscopy mainly focus on polyp detection, and studies on automatic detection of ulcers, hemorrhages and Merkel's diverticulum are lacking.

Based on the above-mentioned defects of the prior art, the present invention provides a real-time colonoscopy image segmentation method based on integrated knowledge distillation. The real-time colonoscopy image segmentation method is a technology for assisting doctors in judging a patient's colon image. Simply put, due to the difference in morphology between diseased tissue (lesion) and normal tissue in colon images, such as differences in features such as color, outline, and texture. Therefore, after learning a large number of colon examination images with known results, the automatic detection model can perform image analysis on the input colon examination images and give prediction results, thereby giving doctors corresponding reference opinions. However, the automatic detection model requires a lot of computing resources in the training process, so that information can be extracted from very large and highly redundant data sets. It is deployed in practice, so model compression becomes an important issue. Knowledge distillation is a model compression method. Its main idea is to train a small network model to imitate a pre-trained large network or an integrated network. In knowledge distillation, the method for teachers to impart knowledge to students is: In the process of training students, a loss function that targets the probability distribution of the teacher's predicted results is added.

To put it simply, the present invention firstly trains multiple binary classification models through data sets of different hospitals, and the multiple binary classification models can respectively detect polyps, ulcers, bleeding and Merkel's diverticulum, which are common in colonoscopy examinations. Then use multiple trained binary classification models to jointly refine a multivariate classification model, so that the multivariate classification model can automatically detect polyps, ulcers, bleeding and Merkel's diverticulum. In the present invention, the teacher model is a binary classification model, the student model is a multivariate classification model, and the purpose of training the teacher model and the student model is to determine the best model between the teacher model and the student model. optimal parameters to achieve the best classification effect. Different teacher models are trained through different training image sets to obtain teacher models with different classification conditions. The training images of the same teacher model are all from the same hospital dataset, which solves the problem that the datasets between different hospitals are discontinuous and cannot be pooled together to train an automatic image segmentation model for colonoscopy. Finally, the knowledge distillation method is used (that is, the knowledge contained in the trained model is distilled and extracted into another model), and the knowledge contained in the trained teacher model is distilled and extracted into the student model, thereby effectively compressing the student model. Reduce the size of the student model.

As shown in FIG. 1 , the real-time colonoscopy image segmentation method based on integrated knowledge distillation provided in this embodiment includes the following steps:

S100. Acquire a training image; the training image is a screenshot of a colonoscopy image used for training the teacher model and the student model; the training image is divided into multiple training image sets, and the training images of the same training image set are from the same data set.

To put it simply, the first step in training a model is to obtain available training images, and then use the training images to train the teacher model and the student model. The data in the same training image set comes from the same hospital, which solves the problem that the data sets between different hospitals are discontinuous and cannot be pooled together to train an automatic image segmentation model for colonoscopy.

In an implementation manner, the step S100 as shown in FIG. 2 further includes the following steps:

S110. Obtain a screenshot of the colonoscopy image;

S120. Perform compression processing according to the screenshot of the colonoscopy image to obtain the training image; the height, width, and number of channels of the training image are all constant.

Specifically, since the dimensions of the network parameters in the teacher model and the student model are fixed, the size of the input images of the teacher model and the student model must be adapted to the dimensions of the network parameters, in order to To avoid dynamic changes of the network in the model, the size of the input images of the teacher model and the student model needs to be fixed. The size of the input image is mainly related to the height, width and number of channels of the image, so the size of the input image is fixed, that is, the height, width and number of channels of the input image are all constant. In a specific implementation, after the screenshot of the colonoscopy image is obtained, the screenshot of the colonoscopy image is compressed so that the height, width, and number of channels conform to the input image standards of the teacher model and the student model. The compressed colonoscopy image screenshot becomes the training image, which can be directly input into the teacher model and the student model for training the teacher model and the student model. In addition, the training images can be divided into multiple training image sets, and the data of the same training image set are all from the same hospital data set.

After completing the step S100, as shown in FIG. 1, the method further includes step S200, inputting the training image into the teacher model to obtain a first segmentation map; wherein the number of the teacher models is greater than or equal to two ; Different teacher models obtain the first segmentation map according to different training image sets.

Because the present invention firstly trains a plurality of binary classification models, and then uses the trained binary classification model to jointly refine a multivariate classification model, namely, firstly trains the teacher model, and then uses the trained teacher model to jointly refine a student model. Therefore, the teacher model needs to be trained first. Different teacher models obtain the first segmentation map according to different training image sets. For example, there are currently two teacher models A and B. Teacher model A is used to automatically detect polyps, and teacher model B is used to automatically detect bleeding. A and B are trained through different training image sets respectively. The data in the training image set of teacher model A can come from the hospital with the strongest local polyp treatment technology, and the data in the training image set of teacher model B can come from the local hospital with the strongest treatment technology for colonic hemorrhage. Hospital. The teacher model classifies the pixels on the training image according to preset classification conditions by collecting specific feature information in the training image, and outputs a classification result, where the classification result is the first segmentation map .

The specific classification process is as follows, in one implementation, the teacher model includes the first down-sampling encoder and the first up-sampling decoder, as shown in Figure 3, the step S200 also includes the following steps:

Step S210, performing feature extraction on the training image according to the first down-sampling encoder to obtain a first feature map; the first feature map includes feature information of the training image;

Step S220, analyzing the first feature map according to the first upsampling decoder to obtain the first segmentation map;

In short, the teacher model mainly adopts the stochastic gradient descent algorithm during training. The teacher model is mainly composed of a first down-sampling encoder and a first up-sampling decoder. The first down-sampling encoder and the first up-sampling decoder are composed of four down-sampling layers and four up-sampling layers, respectively, and the first down-sampling encoder and the first up-sampling decoder are composed of four downsampling layers. There is a connection relationship between the four down-sampling layers and the four up-sampling layers are connected one-to-one, and the outputs of the four down-sampling layers are respectively added to their corresponding up-sampling layers to participate in the up-sampling process. , to maintain the gradient of the teacher model. After the training image is input into the teacher model, the first feature map is obtained by sequentially passing through the four downsampling layers of the first downsampling encoder. Then, the first feature map passes through the four upsampling layers of the first upsampling decoder in sequence, and the final output result is the first segmentation map.

In addition, since deep learning models usually rely on powerful computing power, it is difficult to deploy in devices with limited computing resources and limited storage space. In order to solve this problem, when constructing the teacher model, a lightweight network MobileNetv2 is used to construct the first down-sampling encoder. The lightweight network MobileNetv2 is a lightweight model proposed for devices with limited computing resources. It uses depthwise separable convolution to build a lightweight deep neural network, which simplifies the network structure, so it has high accuracy and good model compression. ability. In an implementation manner, as shown in FIG. 4 , in this embodiment, four layers in MobileNetv2 are used as the first downsampling layer 10 , the second downsampling layer 20 , the third downsampling layer 20 and the third downsampling layer in the first downsampling encoder, respectively. The down-sampling layer 30 and the fourth down-sampling layer 40 . The training image 1 is input into the first down-sampling encoder, the training image 1 is firstly input into the first down-sampling layer 10, and then the output of the first down-sampling layer 10 is used as the second down-sampling The input of layer 20, and so on, take the output image of the previous down-sampling layer as the input image of the next down-sampling layer, and continue to perform the step of extracting features from the input image until the fourth down-sampling layer 40 completes the After the feature extraction of the input image, the first feature map 2 is output. Wherein, each downsampling layer in the first downsampling encoder is composed of an inverse residual module, and the inverse residual module is constructed by depthwise separable convolution. Specifically, the inverse residual module first performs a point-by-point convolution calculation on the input image to expand the number of channels of the image; then extracts image features by performing a depthwise convolution calculation; and then performs a point-by-point convolution calculation , the number of channels of the compressed image. Therefore, the size of the teacher model can be reduced without losing the accuracy of the image feature extraction by the teacher model.

The first down-sampling encoder performs feature extraction on the input training image and outputs the first feature map, and uses the first feature map as the input image of the first up-sampling decoder, and then executes the Step S220.

Specifically, each of the four upsampling layers in the upsampling decoder is composed of a transposed convolutional layer and a normalization layer respectively. The transposed convolutional layer is used for expanding the image and extracting image features. The normalization layer Layers are used to avoid parameter interaction between different upsampling layers. For example, as shown in FIG. 4 , after the first feature map 2 is input to the up-sampling decoder, it passes through the first up-sampling layer 50 to combine the output result of the first up-sampling layer 50 with the down-sampling layer. The output results of the fourth downsampling layer 40 in the encoder are combined as the input image to the second upsampling layer 60 and into the second upsampling layer 60 . Continue to combine the output result of the second upsampling layer 60 with the output result of the third downsampling layer 30 in the downsampling encoder as the input image of the third upsampling layer 70 and input the third upsampling layer 70 Sampling layer 70. Then continue to combine the output result of the third upsampling layer 70 with the output result of the second downsampling layer 20 in the downsampling encoder as the input image of the fourth upsampling layer 80 and input the first Four upsampling layers 80 . Finally, the output result of the fourth up-sampling layer 80 is combined with the output result of the first down-sampling layer 10 in the down-sampling encoder to obtain the first segmentation diagram 3 .

In an implementation manner, the first segmentation map includes a first standard probability and a first abnormal probability corresponding to a pixel in the training image; the first standard probability is the pixel in a first preset classification condition The probability that the pixel belongs to the standard, the first abnormal probability is the probability that the pixel is abnormal under the first preset classification condition; the sum of the probability value of the first abnormal probability and the first standard probability is 1. For example, when the teacher model is a model for automatically detecting polyps, the corresponding first preset classification condition is whether polyps are present, and the first standard probability is the probability that the pixels in the training image correspond to normal (that is, no polyps). The probability of ), the first abnormal probability is the probability that the pixel in the training image corresponds to having a polyp.

Specifically, after the training image is input into the teacher model, the first downsampling encoder performs downsampling according to the training image and obtains a first feature map of size W'×H'×N, where W ', H', N are the height, width and number of channels of the first feature map. The first upsampling decoder performs upsampling according to the first feature map and obtains a first segmentation map

Among them, T _i ,i∈{1,2,3...,k} represents the ith teacher model, k is the number of teacher models, and each teacher model corresponds to a specific classification category.

j∈{1,2},

is the probability that each pixel is predicted to be normal under the first preset classification condition, that is, the first standard probability,

is the probability that each pixel is predicted to be abnormal under the first preset classification condition, that is, the first abnormal probability,

and

The sum is 1. j is an image channel, j=1 means channel 1, outputting the first standard probability; j=2 means channel 2, outputting the first abnormal probability, and R is a set of real numbers.

For example, there are currently 4 teacher models: A teacher model, B teacher model, C teacher model, D teacher model, the corresponding classification conditions are whether there is polyp, whether there is Merck's diverticulum, whether there is ulcer and whether With hemorrhage, used to train automatic detection of polyps, Merck's diverticulum, ulcers, and hemorrhages, respectively. After inputting the training image into the A teacher model, the first segmentation map (0.1, 0.9) is obtained, then 0.1 is the probability that the predicted pixel output by channel 1 is normal, and 0.9 is the probability that the predicted pixel output by channel 2 has polyps After the training image is input into the B teacher model, the first segmentation map (0.2, 0.8) is obtained, then 0.2 is the probability that the predicted pixel output by channel 1 is normal, and 0.8 is the predicted pixel output by channel 2 suffers from Merck The probability of diverticulum; after inputting the training image into the C teacher model, the first segmentation map (0.3, 0.7) is obtained, then 0.3 is the probability that the predicted pixel output by channel 1 is normal, and 0.7 is the predicted pixel output by channel 2. There is a probability of ulcer; after inputting the training image into the C teacher model, the first segmentation map (0.4, 0.6) is obtained, then 0.4 is the probability that the predicted pixel output by channel 1 is normal, and 0.6 is the predicted pixel output by channel 2. Probability of suffering from bleeding.

In order to know the correctness of the prediction of the teacher model during training, the method further includes step S300, modifying the parameters of the teacher model according to the first segmentation map and the first real label, and continuing to perform the The training image is input into the teacher model, and the first segmentation map is obtained until the preset training conditions of the teacher model are met, so as to obtain the trained teacher model; the first real label is used to reflect the training image. The corresponding real results of the pixels above under the first preset classification conditions.

In the actual training process, each training image has its corresponding real label to evaluate the classification effect (prediction effect) of the model. The real label used for training the teacher model is the first real label, to indicate the corresponding real result of the training image under the first preset classification condition. The purpose of training is to keep the output of the teacher model close to the real label, so the teacher model will continuously perform parameter correction during the training process, so as to control the training process and guide the training process towards The optimal direction converges.

As shown in FIG. 5 , the step S300 specifically includes the following steps:

Step S310, calculating a first loss value according to the first segmentation map and the first real label;

Step S320, adjusting the parameters of the first upsampling decoder according to the first loss value to update the teacher model;

Step S330: Continue to perform the step of inputting the training image into the teacher model to obtain the first segmentation map, until the preset training conditions of the teacher model are met, so as to obtain the trained teacher model.

By continuously comparing the first segmentation map and the first real label, the gap between the prediction result of the teacher model and the real result can be obtained, so that the teacher model can calculate the difference between the prediction result and the real result according to the teacher model. The gap determines how to perform parameter correction and achieve a better prediction effect. Specifically, the teacher model classifies the training image according to the first preset classification condition to obtain a first segmentation map, and substitutes the first segmentation map and the first true label into the first loss value In the calculation formula of , the first loss value is obtained, and the first loss value can represent the gap between the first segmentation map and the first real label. The calculation formula of the first loss value is as follows:

Since the first loss value refers to the gap between the first segmentation map and the first real label, the larger the value of the first loss value, the greater the value of the first segmentation map and the The larger the gap between the first true labels, the worse the classification effect of the teacher model; the smaller the value of the first loss value, that is, the difference between the first segmentation map and the first true label. The smaller the gap is, the better the classification effect of the teacher model is.

in

is the first segmentation map of the pixels in the wth row and hth column of the jth channel output by the teacher model T _i

represents the prediction result of the teacher model _Ti for this pixel.

are the corresponding ground truth labels of the pixels in the wth row and hth column of the jth channel of the training image

It can indicate the true classification situation of the pixel under the classification condition corresponding to the teacher model T _i .

j∈{1,2}, where

is the normal label of this pixel in the teacher model _Ti , and

is the abnormal label ( _diseased label) of this pixel in the teacher model Ti. the true label

It is a one-hot vector (one-hot vector), that is, the result corresponding to only one channel in the same label is not 0, and the results corresponding to other channels are all 0. In other words, the real labels of the teacher model are only (1, 0) and There are two forms of (0, 1), (1, 0) indicates that the real situation corresponding to the pixel is normal, and (0, 1) indicates that the real situation corresponding to the pixel is abnormal. According to the true label

to evaluate the first segmentation map output by the teacher model

The predicted effect is determined by the

and

Substitute into the calculation formula of the first loss value to obtain the first loss value of the teacher model, and evaluate the prediction effect of the teacher model according to the size of the obtained first loss value. The worse the prediction effect of the teacher model is; the smaller the obtained first loss value is, the better the prediction effect of the teacher model is.

After the training of the teacher model is completed, the trained teacher model is obtained, and the trained teacher model can be used to refine the student model, and the process of refining the student model is the training process of the student model.

Therefore, as shown in FIG. 1 , the method further includes step S400 , inputting the training image into the student model to obtain a second segmentation map. Specifically, the student model classifies the pixels on the training image according to preset classification conditions by collecting specific feature information in the training image, and outputs a classification result, where the classification result is the first classification result. Two-part diagram.

As shown in Figure 6, the step S400 specifically includes the following steps:

Step S410 , outputting a second feature map after performing feature extraction on the training image according to the second down-sampling encoder; the second feature map includes feature information of the training image;

Step S420: Analyze the second feature map according to the second upsampling decoder to obtain the second segmentation map.

Specifically, the structure of the student model is similar to that of the teacher model, and both include a down-sampling encoder and an up-sampling decoder. The downsampling encoder in the student model is a second downsampling encoder, the upsampling decoder in the student model is a second upsampling decoder, the second downsampling encoder and the second upsampling encoder are The sampling decoder is also composed of four down-sampling layers and four up-sampling layers respectively, the second down-sampling encoder and the second up-sampling decoder have a connection relationship, and the four down-sampling layers are connected to the The four up-sampling layers are connected in one-to-one correspondence, and the outputs of the four down-sampling layers are respectively added to their corresponding up-sampling layers to participate in the up-sampling process to maintain the gradient of the student model. After the training image is input into the student model, the second feature map is obtained by sequentially passing through four downsampling layers of the second downsampling encoder. Then, the second feature map passes through the four upsampling layers of the second upsampling decoder in sequence, and the final output result is the first segmentation map. The main difference between the student model and the teacher model is that the number of categories of the classification conditions of the student model is more than the number of categories of the classification conditions of the teacher model, so that the prediction results output by the student model have more dimensions. The dimension of the prediction result output by the teacher model. And the number of channels in the middle layer in the student model is smaller than the number of channels in the middle layer in the teacher model, thereby reducing the overall size of the student model.

Specifically, this embodiment also uses the four layers in MobileNetv2 as the four downsampling layers in the second downsampling encoder, and uses the output image of the previous downsampling layer as the input image of the next downsampling layer , and continue to perform the feature extraction step of the input image until the fourth downsampling layer completes the feature extraction of the input image and then outputs the second feature map (for a detailed process, please refer to step S210 ). Likewise, each downsampling layer in the second downsampling encoder consists of an inverse residual module constructed from depthwise separable convolutions. Specifically, the reverse residual module first performs a point-by-point convolution calculation on the input image to expand the number of channels of the image. Image features are then extracted by performing depthwise convolution calculations. Then, the number of channels of the image is compressed by performing a point-by-point convolution calculation. Thus, the size of the student model can be reduced without losing the accuracy of the image feature extraction by the student model.

The second down-sampling encoder performs feature extraction on the input training image and outputs the second feature map, and uses the second feature map as the input image of the second up-sampling decoder, and then executes the Step S420.

During specific implementation, the outputs of the four down-sampling layers in the second down-sampling encoder are added to the corresponding four up-sampling layers in the second up-sampling decoder, respectively, to participate in the up-sampling process. , to preserve the gradient of the student model. The second feature map sequentially passes through the four upsampling layers of the second upsampling decoder, and the final output result is the second segmentation map (for a detailed process, please refer to step S220).

In an implementation manner, the second segmentation map includes a second standard probability and a second abnormal probability corresponding to a pixel in the training image; the second standard probability is the pixel in a second preset classification condition The probability that the pixel belongs to the standard, and the second abnormal probability is the probability that the pixel is abnormal under the second preset classification condition; the number of categories of the second preset classification condition is more than the first preset classification condition number of categories.

Specifically, after the training image is input into the student model, the second downsampling encoder performs downsampling according to the training image and obtains a second feature map of size W'×H'×N, where W ', H', N are the height, width and number of channels of the second feature map. The second upsampling decoder performs upsampling according to the second feature map and obtains a second segmentation map

where S represents the student model,

n∈{normal,1,2,...N}, n is the image channel (for example, n=1 means channel 1), N is a positive integer and N is greater than or equal to 2, the value of N is related to the number of teacher models, R is the set of real numbers.

is the minimum standard probability corresponding to the pixel under the second preset classification condition,

represents the probability that the pixel belongs to the first type of abnormality under the second preset classification condition,

Indicates the probability that the pixel belongs to the second type of abnormality under the second preset classification condition, and so on.

For example, since the present invention distills the knowledge contained in multiple trained teacher models into the student model, the classification conditions set by the student model are related to the classification conditions of the multiple trained teacher models. For example, there are currently 4 teacher models: A teacher model, B teacher model, C teacher model, D teacher model, the corresponding classification conditions are whether there is polyp, whether there is Merck's diverticulum, whether there is ulcer and whether Bleeding, used to train automatic detection of polyps, Merck's diverticula, ulcers and bleeding, respectively. Then the second preset classification conditions of the student model extracted by the above four teacher models are divided into four categories: the first classification condition is whether there is a polyp, the second classification condition is whether there is a Merkel's diverticulum, and the third The first class classification condition is whether there is ulceration, and the fourth class classification condition is whether there is bleeding. After inputting the training image into the student model, a second segmentation map (0.2, 0.8, 0.7, 0.5, 0.3) is obtained, which means that the predicted pixel has a probability of polyp of 0.8, a Merkel's diverticulum with a probability of 0.7, and a The probability of having an ulcer is 0.5 and the probability of having a bleeding is 0.3. From the above probability of disease, the probability of not suffering from polyps is 1-0.8=0.2, the probability of not suffering from polyps is 1-0.7=0.3, the probability of not suffering from polyps is 1-0.5=0.5, and the probability of not suffering from polyps is 1 -0.3=0.7, since the probability of not suffering from polyps is the minimum value of the normal probability among all diseases, the probability of not suffering from polyps 0.2 is retained as the normal probability corresponding to the pixel in the second segmentation map, so as to avoid the existence of excessive normal Probability causes the prediction effect of the student model to be inaccurate.

In the specific implementation, the prediction effect diagram of the student model is shown in Figure 11, where column A is the training image input into the student model; column B is the corresponding second real label (real category map); column C is the corresponding The output of the second segmentation map (prediction effect map).

In order to know the correctness of the classification of the student model during training, the method further includes:

Step S500, modify the parameters of the student model according to the second segmentation map, the teacher label and the second real label, and continue to input the training image into the student model to obtain the second segmentation map. step, until the preset training conditions of the student model are met, to obtain the trained student model; the second real label is used to reflect the corresponding real classification of the pixels on the training image under the second preset classification condition situation; the teacher label is used to reflect the classification situation of the training image in the trained teacher model.

The real label used for training the student model is the second real label, to indicate the corresponding real classification situation of the pixels in the training image under the second preset classification condition. The purpose of training is to keep the output of the student model close to the real label, so the student model will continuously perform parameter correction during the training process, so as to control the training process and guide the training process toward The optimal direction converges.

As shown in Figure 7, the step S500 specifically includes the following steps:

Step S510, calculating a second loss value according to the second segmentation map, the teacher label and the second real label;

Step S520, adjusting the parameters of the second upsampling decoder according to the second loss value to update the student model;

Step S530: Continue to perform the step of generating a second segmentation map according to the training image until the preset training conditions of the student model are met, so as to obtain a trained student model.

By continuously comparing the second segmentation map and the second real label, the gap between the predicted result of the student model and the real classification result can be obtained, so that the student model can compare the predicted result and the real classification result according to the student model. The gap between them determines how to make parameter corrections and achieve better prediction results. Specifically, the student model classifies the training image according to the second preset classification condition to obtain a second segmentation map, and substitutes the second segmentation map and the second true label into the second loss value In the calculation formula of , the second loss value is obtained, and the second loss value can represent the gap between the second segmentation map and the second real label. The calculation formula of the second loss value is as follows:

Since the second loss value refers to the gap between the second segmentation map and the second true label, the larger the value of the second loss value, the greater the The larger the gap between the second true labels, the worse the classification effect of the student model; the smaller the value of the second loss value, the smaller the difference between the second segmentation map and the second true label. The smaller the gap, the better the classification effect of the student model.

in,

is the second segmentation map of the pixels in the wth row and hth column of the jth channel output by the student model

is the teacher label of the pixel in the wth row and hth column of the jth channel of the training image, which can represent the prediction result of the pixel in the trained teacher model,

is the real label of the pixel in the wth row and hth column of the jth channel of the training image, that is, the second real label, which can indicate the real label of the pixel in the training image under the second preset classification condition Classification situation.

Specifically, the second real label

in the form of

where y _normal is the label of the pixel that belongs to the standard under the second preset classification condition, the y ₁ ^S pixel belongs to the label of the first type of abnormality under the second preset classification condition, and the y ₂ ^S pixel belongs to the label of the first type of abnormality under the second preset classification condition. Two labels belong to the second type of abnormality under the preset classification conditions, and y _n ^S pixels belong to the labels of the nth type of abnormality under the second preset classification conditions. The second real label is also a single-hot vector, that is, the result corresponding to only one channel in the same label is not 0, and the results corresponding to other channels are all 0. In other words, the same label can only refer to the pixel as normal or as One of the n types of exceptions.

In an implementation manner, the teacher label is obtained according to the first segmentation map output by all trained teacher models. As shown in FIG. 8 , the step S510 includes the following steps:

Step S511, obtain the probability total value according to the first segmentation map output by all the trained teacher models;

Step S512: Adjust the first segmentation maps output by all trained teacher models according to the probability total value to obtain a teacher label.

Since the dimensions of the first segmentation map output by the teacher model and the second segmentation map output by the student model are different (equivalent to a different number of channels), it is necessary to first adjust the first segmentation maps output by all teacher models to Fit the teacher labels to the dimensions of the output of the student model. This embodiment adjusts the first segmentation maps output by all teacher models through the following formula, and obtains the teacher label:

where D is the total probability value and p ^T is the teacher label. Specifically, first, according to the first segmentation map output by the i teacher models

get the first vector

In the first vector, the values in the first segmentation map output by the i teacher models are reserved.

and retain the smallest probability value in the first segmentation map output by the i teacher models

as p _normal . According to the formula for calculating the probability total value D, the p _normal in the first vector is compared with all the

Add up to get the total probability. Then the first vector

Divide by the probability total value D to obtain a second vector, which is the teacher label p ^T .

For example, there are currently four teacher models A, B, C, and D. The first segmentation maps output by the teacher models A, B, C, and D are (0.1, 0.9), (0.2, 0.8), (0.3, 0.7) respectively. , (0.4, 0.6), then firstly obtain the first vector (0.1, 0.9, 0.8, 0.7, 0.6) according to the first segmentation map output by the A, B, C, D teacher models, and then calculate all the probabilities in the first vector Add to get the total probability value of 3.1, that is, 0.1+0.9+0.8+0.7+0.6=3.1, and then divide the first vector by the total probability value, which is equivalent to dividing all the probabilities in the first vector by 3.1 , get the second vector, then the second vector is the teacher label.

The second vector is represented as

After the training of the student model is completed, the trained student model is obtained, and the trained student model can be used for real-time colonoscopy image segmentation, that is, as shown in FIG. 1 , the method further includes step S600: The endoscopy images are input to the trained student model to generate real-time colonoscopy image segmentation maps.

Based on the above embodiment, as shown in FIG. 10 , the present invention further provides an apparatus for real-time colonoscopy image segmentation based on integrated knowledge distillation, wherein the apparatus includes: an image acquisition module 120, and the image acquisition module 120 is used for Acquire a training image; a teacher model unit 130, the teacher model unit 130 is used to obtain a first segmentation map according to the training image; a first parameter correction module 110, the first parameter correction module 110 is used to The segmentation map and the first real label, modify the parameters of the teacher model; the student model unit 170, the student model unit 170 is used to obtain the second segmentation map according to the training image; the second parameter correction module 160, the The second parameter modification module 160 is configured to modify the parameters of the student model according to the second segmentation map, the teacher label and the second real label;

The teacher model unit 130 further includes: a first downsampling encoder module 90, the first downsampling encoder module 90 is configured to perform feature extraction on the training image to obtain a first feature map; a first upsampling decoding a decoder module 100, the first upsampling decoder module 100 is configured to parse the first feature map to obtain the first segmentation map;

The student model unit 170 further includes: a second downsampling encoder module 140, the second downsampling encoder module 140 is configured to perform feature extraction on the training image to obtain a second feature map; a second upsampling decoding The second upsampling decoder module 150 is configured to analyze the second feature map to obtain the second segmentation map.

Based on the above embodiments, the present invention also provides a non-transitory computer-readable storage medium, where a data storage program is stored on the non-transitory computer-readable storage medium, and the data storage program is implemented as described above when executed by a processor Each step of the described real-time colonoscopy image segmentation method based on integrated knowledge distillation.

Any reference to a memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Road (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

Based on the above embodiments, the present invention also provides a terminal including a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be processed by the one or more programs The execution of the one or more programs by the processor includes performing the method of real-time colonoscopy image segmentation based on ensemble knowledge distillation as described in any of the above. A functional block diagram of the terminal may be shown in FIG. 9 . The terminal includes a processor, a memory, and a network interface connected through a system bus. The processor of the terminal is used to provide computing and control capabilities. The memory of the terminal includes a non-volatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium. The network interface of the terminal is used to communicate with external terminals through a network connection. The computer program, when executed by the processor, implements a real-time colonoscopy image segmentation method based on integrated knowledge distillation.

Those skilled in the art can understand that the principle block diagram shown in FIG. 9 is only a block diagram of a partial structure related to the solution of the present invention, and does not constitute a limitation on the intelligent terminal to which the solution of the present invention is applied. More or fewer components than shown in the figures may be included, or some components may be combined, or have a different arrangement of components. In addition, the realization of the real-time colonoscopy image segmentation method based on integrated knowledge distillation described in any of the above can be accomplished by instructing the relevant hardware through a computer program, and the computer program can be stored in a non-volatile computer. In reading the storage medium, when the computer program is executed, it may include the processes of the embodiments of the above-mentioned methods. Wherein, any reference to memory, storage, database or other medium used in the various embodiments provided by the present invention may include non-volatile and/or volatile memory. Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Road (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

To sum up, the present invention obtains multiple training images, which are divided into multiple training image sets, and the training images of the same training image set come from the same data set; first, the teacher model is trained, and different teacher models are The first segmentation map is obtained from different training image sets; then a student model is jointly refined with the trained teacher model. The training image is a screenshot of a colonoscopy image, and the trained student model can generate a real-time colonoscopy image segmentation map according to the real-time colonoscopy image. This solves the problem that the data sets between different hospitals are discontinuous and cannot be pooled together to train an automatic image segmentation model for colonoscopy.

It should be understood that the application of the present invention is not limited to the above examples. For those of ordinary skill in the art, improvements or transformations can be made according to the above descriptions, and all these improvements and transformations should belong to the protection scope of the appended claims of the present invention.

Claims

A real-time colonoscopy image segmentation method based on integrated knowledge distillation, characterized in that the method comprises:

Acquiring training images; the training images are screenshots of colonoscopy images used to train the teacher model and the student model; the training images are divided into multiple training image sets, and the training images of the same training image set are from the same data set;

Inputting the training image to the teacher model to obtain a first segmentation map; wherein, the number of the teacher models is greater than or equal to two; different teacher models obtain the first segmentation map according to different training image sets;

Modify the parameters of the teacher model according to the first segmentation map and the first real label, and continue to perform the steps of inputting the training image into the teacher model to obtain the first segmentation map until the The preset training conditions of the teacher model to obtain the trained teacher model; the first real label is used to reflect the real classification situation corresponding to the pixels on the training image under the first preset classification conditions;

Inputting the training image to the student model to obtain a second segmentation map;

Modify the parameters of the student model according to the second segmentation map, the teacher label and the second real label, and continue to perform the steps of inputting the training image into the student model to obtain the second segmentation map, until Meet the preset training conditions of the student model to obtain the trained student model; the second real label is used to reflect the real classification situation corresponding to the pixels on the training image under the second preset classification condition; the The teacher label is used to reflect the classification situation of the training image in the trained teacher model;

Real-time colonoscopy images are input to the trained student model to generate real-time colonoscopy image segmentation maps.
The method according to claim 1, wherein the acquiring training image is a screenshot of a colonoscopy image used for training the teacher model and the student model, comprising:

Take screenshots of colonoscopy images;

Compression processing is performed according to the screenshot of the colonoscopy image to obtain the training image; the height, width and number of channels of the training image are all constant.
The method according to claim 1, wherein the teacher model comprises a first down-sampling encoder and a first up-sampling decoder; the training image is input to the teacher model to obtain a first segmentation Figures include:

Perform feature extraction on the training image according to the first down-sampling encoder to obtain a first feature map; the first feature map includes feature information of the training image;

Analyze the first feature map according to the first upsampling decoder to obtain the first segmentation map;

The first segmentation map includes the first standard probability and the first abnormal probability corresponding to the pixels in the training image; the first standard probability is the probability that the pixel belongs to the standard under the first preset classification condition , the first abnormal probability is the probability that the pixel belongs to abnormal under the first preset classification condition; the sum of the probability value of the first abnormal probability and the first standard probability is 1.
The method according to claim 3, wherein, according to the first segmentation map and the first real label, the parameters of the teacher model are modified, and the input of the training image to the The teacher model, the step of obtaining the first segmentation map, until the preset training conditions of the teacher model are met, to obtain the trained teacher model, including:

Calculate a first loss value according to the first segmentation map and the first ground truth;

Adjust parameters of the first upsampling decoder according to the first loss value to update the teacher model;

The step of inputting the training image into the teacher model to obtain the first segmentation map is continued until the preset training conditions of the teacher model are satisfied, so as to obtain the trained teacher model.
The method according to claim 1, wherein the student model comprises a second down-sampling encoder and a second up-sampling decoder; the training image is input to the student model to obtain a second segmentation Figures include:

The second feature map is output after feature extraction is performed on the training image according to the second down-sampling encoder; the second feature map includes feature information of the training image;

Analyze the second feature map according to the second upsampling decoder to obtain the second segmentation map;

Wherein; the second segmentation map includes the second standard probability and the second abnormal probability corresponding to the pixels in the training image; the second standard probability is the probability that the pixel belongs to the standard under the second preset classification condition , the second abnormal probability is the probability that the pixel is abnormal under the second preset classification condition; the number of categories of the second preset classification condition is more than the number of categories of the first preset classification condition.
The method according to claim 5, wherein, according to the second segmentation map, the teacher label and the second real label, the parameters of the student model are modified, and the generation according to the training image is continued. The step of the second segmentation map, until the preset training conditions of the student model are met, to obtain the trained student model, including:

calculating a second loss value according to the second segmentation map, the teacher label and the second ground truth;

Adjust parameters of the second upsampling decoder according to the second loss value to update the student model;

Continue to perform the step of generating a second segmentation map according to the training image until the preset training conditions of the student model are met, so as to obtain a trained student model.
The method according to claim 6, wherein the calculating the second loss value according to the second segmentation map, the teacher label and the second true label comprises:

Obtain the total probability value according to the first segmentation map output by all trained teacher models;

The teacher labels are obtained by adjusting the first segmentation maps output by all trained teacher models according to the probability total value.
A device for real-time colonoscopy image segmentation based on integrated knowledge distillation, characterized in that the device comprises:

an image acquisition module, the image acquisition module is used to acquire training images;

a teacher model unit, the teacher model unit is used to obtain a first segmentation map according to the training image;

a first parameter correction module, which is used to correct the parameters of the teacher model according to the first segmentation map and the first real label;

a student model unit, the student model unit is used to obtain a second segmentation map according to the training image;

The second parameter correction module, the second parameter correction module is used to modify the parameters of the student model according to the second segmentation map, the teacher label and the second real label;

The teacher model unit also includes:

a first down-sampling encoder module, where the first down-sampling encoder module is used to perform feature extraction on the training image to obtain a first feature map;

a first up-sampling decoder module, the first up-sampling decoder module is configured to parse the first feature map to obtain the first segmentation map;

The student model unit also includes:

a second down-sampling encoder module, the second down-sampling encoder module is configured to perform feature extraction on the training image to obtain a second feature map;

A second up-sampling decoder module, the second up-sampling decoder module is configured to parse the second feature map to obtain the second segmentation map.
A terminal comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors to execute the one or more programs Contains for performing the method of any of claims 1-7.
A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the real-time colonoscopy image segmentation based on integrated knowledge distillation according to any one of claims 1 to 7 is realized steps of the method.