CN111932561A

CN111932561A - Real-time enteroscopy image segmentation method and device based on integrated knowledge distillation

Info

Publication number: CN111932561A
Application number: CN202010997859.XA
Authority: CN
Inventors: 李坚强; 陈杰; 黄志超
Original assignee: Shenzhen University
Current assignee: Shenzhen University
Priority date: 2020-09-21
Filing date: 2020-09-21
Publication date: 2020-11-13
Also published as: WO2022057078A1

Abstract

The invention discloses a real-time enteroscopy image segmentation method and device based on integrated knowledge distillation. The method comprises the following steps: acquiring a plurality of training images, wherein the training images are divided into a plurality of training image sets, and the training images of the same training image set come from the same data set; firstly, training teacher models, wherein different teacher models respectively obtain a first segmentation graph according to different training image sets; and then the trained teacher model is used for refining a student model together. The training image is an enteroscope image screenshot, and the trained student model can generate a real-time enteroscope image segmentation image according to the real-time enteroscope image. Therefore, the problem that data sets among different hospitals are discontinuous and cannot be gathered together to train the colonoscopy automatic image segmentation model is solved.

Description

Real-time enteroscopy image segmentation method and device based on integrated knowledge distillation

Technical Field

The invention relates to the field of image segmentation, in particular to a real-time enteroscopy image segmentation method and device based on integrated knowledge distillation.

Background

Minimally invasive surgery has limitations on the field of view, particularly where blind fields of view often exist in colonoscopy. Real-time colonoscopy automated image segmentation therefore plays an important role for intestinal surgery. The existing automatic image segmentation model for colonoscopy usually needs data sets from different hospitals when training, however, the data sets have discontinuity between different hospitals and can not be collected together to train the automatic image segmentation model for colonoscopy.

The prior art therefore remains to be improved.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a real-time enteroscopy image segmentation method based on integrated knowledge distillation and a storage medium thereof, aiming at solving the problem that in the prior art, data sets between different hospitals cannot be collected to train an automatic image segmentation model for colonoscopy.

The technical scheme adopted by the invention for solving the problems is as follows:

in a first aspect, an embodiment of the present invention provides a real-time enteroscopy image segmentation method based on integrated knowledge distillation, where the method includes:

acquiring a training image; the training image is a colonoscopy image screenshot used for training a teacher model and a student model; the training images are divided into a plurality of training image sets, and the training images of the same training image set come from the same data set;

inputting the training image into the teacher model to obtain a first segmentation graph; wherein the number of the teacher models is greater than or equal to two; different teacher models respectively obtain a first segmentation graph according to different training image sets;

correcting parameters of the teacher model according to the first segmentation graph and the first real label, and continuously executing the step of inputting the training image to the teacher model to obtain the first segmentation graph until preset training conditions of the teacher model are met to obtain a trained teacher model; the first real label is used for reflecting a real classification condition corresponding to the pixels on the training image under a first preset classification condition;

inputting the training image into the student model to obtain a second segmentation graph;

correcting parameters of the student model according to the second segmentation graph, a teacher label and a second real label, and continuously executing the step of inputting the training image to the student model to obtain a second segmentation graph until preset training conditions of the student model are met to obtain a trained student model; the second real label is used for reflecting the real classification condition corresponding to the pixels on the training image under a second preset classification condition; the teacher label is used for reflecting the classification condition of the training images in the trained teacher model;

and inputting the real-time enteroscopy image into the trained student model to generate a real-time enteroscopy image segmentation map.

In one embodiment, the acquiring a training image, which is a colonoscopy video screenshot for training a teacher model and a student model, includes:

acquiring an enteroscope image screenshot;

compressing according to the enteroscope image screenshot to obtain the training image; the height, the width and the number of channels of the training images are all constant.

In one embodiment, the teacher model includes a first down-sampling encoder and a first up-sampling decoder; the inputting the training image into the teacher model to obtain a first segmentation graph includes:

extracting features of the training image according to the first down-sampling encoder to obtain a first feature map; the first feature map contains feature information of the training image;

analyzing the first feature map according to the first up-sampling decoder to obtain the first segmentation map;

wherein the first segmentation map comprises first standard probabilities and first abnormal probabilities corresponding to pixels in the training image; the first standard probability is the probability that the pixel belongs to a standard under a first preset classification condition, and the first abnormal probability is the probability that the pixel belongs to an abnormal under the first preset classification condition; the sum of the probability values of the first anomaly probability and the first standard probability is 1.

In one embodiment, the modifying the parameters of the teacher model according to the first segmentation graph and the first real label, and continuing to perform the step of inputting the training image to the teacher model to obtain the first segmentation graph until a preset training condition of the teacher model is met to obtain a trained teacher model includes:

calculating a first loss value from the first segmentation map and the first real label;

adjusting parameters of the first upsampling decoder according to the first loss value to update the teacher model;

and continuing to execute the step of inputting the training image into the teacher model to obtain a first segmentation graph until a preset training condition of the teacher model is met, so as to obtain a trained teacher model.

In one embodiment, the student model includes a second downsampling encoder and a second upsampling decoder; the inputting the training image into the student model to obtain a second segmentation map comprises:

performing feature extraction on the training image according to the second downsampling encoder and outputting a second feature map; the second feature map contains feature information of the training image;

analyzing the second feature map according to the second up-sampling decoder to obtain the second segmentation map

Wherein; the second segmentation map comprises second standard probabilities and second abnormal probabilities corresponding to pixels in the training image; the second standard probability is the probability that the pixel belongs to the standard under a second preset classification condition, and the second abnormal probability is the probability that the pixel belongs to the abnormal under the second preset classification condition; the number of categories of the second preset classification condition is more than that of the first preset classification condition.

In one embodiment, the modifying the parameters of the student model according to the second segmentation map, the teacher label and the second real label, and continuing to perform the step of generating the second segmentation map according to the training image until a preset training condition of the student model is met to obtain a trained student model includes:

calculating a second loss value according to the second segmentation graph, the teacher label and a second real label;

adjusting parameters of the second upsampling decoder according to the second loss value to update the student model;

and continuing to execute the step of generating a second segmentation graph according to the training image until the preset training condition of the student model is met, so as to obtain the trained student model.

In one embodiment, said calculating a second loss value from said second segmentation graph, said teacher label and a second true label comprises:

obtaining a total probability value according to the first segmentation graphs output by all the trained teacher models;

and adjusting the first segmentation chart output by all the trained teacher models according to the total probability value to obtain a teacher label.

In a second aspect, an embodiment of the present invention further provides an apparatus for real-time enteroscopic image segmentation based on integrated knowledge distillation, wherein the apparatus includes:

the image acquisition module is used for acquiring a training image;

a teacher model unit for obtaining a first segmentation graph from the training image;

the first parameter correction module is used for correcting the parameters of the teacher model according to the first segmentation graph and the first real label;

the student model unit is used for obtaining a second segmentation graph according to the training image;

the second parameter correction module is used for correcting the parameters of the student model according to the second segmentation chart, the teacher label and the second real label;

the teacher model unit further includes:

the first down-sampling encoder module is used for extracting the features of the training image to obtain a first feature map;

a first upsampling decoder module, configured to parse the first feature map to obtain the first segmentation map;

the student model unit further includes:

the second downsampling encoder module is used for extracting the features of the training image to obtain a second feature map;

and the second up-sampling decoder module is used for analyzing the second characteristic diagram to obtain the second segmentation diagram.

In a third aspect, an embodiment of the present invention also provides a terminal, which includes a memory and one or more programs, where the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors include a processor configured to execute any of the methods described above.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement any of the steps of the method for real-time enteroscopic image segmentation based on integrated knowledge distillation described above.

The invention has the beneficial effects that: according to the method, a plurality of training images are obtained, the training images are divided into a plurality of training image sets, and the training images of the same training image set are from the same data set; firstly, training teacher models, wherein different teacher models respectively obtain a first segmentation graph according to different training image sets; and then the trained teacher model is used for refining a student model together. The training image is an enteroscope image screenshot, and the trained student model can generate a real-time enteroscope image segmentation image according to the real-time enteroscope image. Therefore, the problem that data sets among different hospitals are discontinuous and cannot be gathered together to train the colonoscopy automatic image segmentation model is solved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a first flowchart of a real-time enteroscopy image segmentation method based on integrated knowledge distillation according to an embodiment of the present invention.

Fig. 2 is a second flowchart of the real-time enteroscopy image segmentation method based on integrated knowledge distillation according to the embodiment of the present invention.

Fig. 3 is a third flowchart of the real-time enteroscopy image segmentation method based on integrated knowledge distillation according to the embodiment of the present invention.

Fig. 4 is a schematic diagram of a connection relationship between a down-sampling encoder and an up-sampling decoder according to an embodiment of the present invention.

Fig. 5 is a fourth flowchart of a real-time enteroscopy image segmentation method based on integrated knowledge distillation according to an embodiment of the present invention.

Fig. 6 is a fifth flowchart illustrating a real-time enteroscopy image segmentation method based on integrated knowledge distillation according to an embodiment of the present invention.

Fig. 7 is a sixth flowchart of a real-time enteroscopy image segmentation method based on integrated knowledge distillation according to an embodiment of the present invention.

Fig. 8 is a seventh flowchart illustrating a real-time enteroscopy image segmentation method based on integrated knowledge distillation according to an embodiment of the present invention.

Fig. 9 is a schematic structural diagram of a hardware operating environment according to an embodiment of the present invention.

Fig. 10 is a schematic diagram of an internal structure of a real-time enteroscopy image segmentation device based on integrated knowledge distillation according to an embodiment of the present invention.

Fig. 11 is a diagram of the predicted effect of the student model provided by the embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer and clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

It should be noted that, if directional indications (such as up, down, left, right, front, and back … …) are involved in the embodiment of the present invention, the directional indications are only used to explain the relative positional relationship between the components, the movement situation, and the like in a specific posture (as shown in the drawing), and if the specific posture is changed, the directional indications are changed accordingly.

With the development of minimally invasive surgery, the application of artificial intelligence assisted surgery represented by a robot system is more and more widespread. The auxiliary surgery is a system for assisting a doctor to complete a surgery by using a robot, and the purpose of the auxiliary surgery is mainly to overcome the limitation of the visual field of the existing minimally invasive surgery. Assisted surgery is particularly important in colonoscopy. Colonoscopy techniques are one of the important techniques in intestinal surgery, however, in colonoscopy many colon lesions often have similar properties to normal mucosa, such as similar color, or too flat, and such colon lesions with deceptive nature are often difficult to find without the aid of special methods. Real-time colonoscopy automated image segmentation therefore plays an important role in colonoscopy techniques. In particular, in past studies, many have attempted to perform medical image detection, such as automatic endoscopic detection and classification of colorectal polyps, using deep neural network models of natural images, or convolutional networks of biomedical image segmentation, and have achieved considerable detection results: studies have shown that deep learning can locate and identify polyps in screening colonoscopy with high accuracy in real time (e.g., using YOLO to detect polyps in colonoscopy video can locate and identify polyps in real time with 96% accuracy).

Currently, colonoscopy automatic image segmentation models usually require data sets from different hospitals when training, however, data sets between different hospitals have discontinuity and cannot be directly aggregated together to train the models. Furthermore, prior art studies of automated image segmentation for colonoscopy have focused primarily on polyp detection, and lack studies on automated detection of ulcers, bleeding, and merkel diverticulum.

Based on the defects of the prior art, the invention provides a real-time enteroscopy image segmentation method based on integrated knowledge distillation, which is a technology for assisting a doctor to judge a colon image of a patient. Briefly, in the colon image, the diseased tissue (lesion) is morphologically different from the normal tissue, such as differences in color, contour, texture, and other features. Therefore, after the automatic detection model learns a large number of colon examination images with known results, corresponding reference opinions can be given to doctors by carrying out image analysis on the input colon examination images and giving prediction results. However, in the training process of the automatic detection model, a large amount of computing resources are needed, so that information is extracted from a very large and highly redundant data set, and therefore, the scale of the trained model is very large, and the large-scale model is inconvenient to deploy in actual application, so that the model is compressed to be an important problem. The knowledge distillation is a model compression method, the main idea is to train a small network model to simulate a large network or an integrated network which is trained in advance, and in the knowledge distillation, a teacher teaches knowledge to students by the following method: and adding a loss function taking probability distribution of the teacher predicted result as a target in the process of training the student.

Briefly, the method includes the steps that firstly, a plurality of binary classification models are trained through data sets of different hospitals respectively, the binary classification models can detect common problems in colonoscopy such as polyps, ulcers, bleeding and merkel diverticula respectively, and then the trained binary classification models are used for refining a multi-element classification model together, so that the multi-element classification model can automatically detect the polyps, the ulcers, the bleeding and the merkel diverticula. In the invention, the teacher model is a binary classification model, the student model is a multivariate classification model, and the purpose of training the teacher model and the student model is to determine the optimal parameters of the teacher model and the student model and realize the optimal classification effect. And training different teacher models through different training image sets respectively to obtain the teacher models with different classification conditions. Training images of the same teacher model are from the data set of the same hospital, so that the problem that the data sets of different hospitals are discontinuous and cannot be gathered together to train the colonoscopy automatic image segmentation model is solved. And finally, distilling and extracting the knowledge contained in the trained teacher model into the student models by using a knowledge distilling method (namely distilling and extracting the knowledge contained in the trained models into the other models), thereby effectively compressing the student models and reducing the sizes of the student models.

As shown in fig. 1, the real-time enteroscopy image segmentation method based on integrated knowledge distillation provided in this embodiment includes the following steps:

s100, acquiring a training image; the training image is a colonoscopy image screenshot used for training a teacher model and a student model; the training images are divided into a plurality of training image sets, and the training images of the same training image set come from the same data set.

Briefly, the first step in training a model is to acquire available training images and then use the training images to train the teacher model and the student models. The data in the same training image set all come from the same hospital, so that the problem that data sets among different hospitals are discontinuous and cannot be gathered together to train the colonoscopy automatic image segmentation model is solved.

In one implementation, the step S100 shown in fig. 2 further includes the following steps:

s110, acquiring a colonoscopy image screenshot;

s120, compressing according to the enteroscope image screenshot to obtain the training image; the height, the width and the number of channels of the training images are all constant.

Specifically, since the dimensions of the network parameters in the teacher model and the student model are fixed, the sizes of the input images of the teacher model and the student model must be adapted to the dimensions of the network parameters to avoid causing dynamic changes in the network in the models, i.e., the sizes of the input images of the teacher model and the student model need to be fixed. The size of the input image is mainly related to the height, the width and the number of channels of the image, so that the size of the input image is fixed, namely the height, the width and the number of channels of the input image are all constant. In specific implementation, after the enteroscope image screenshot is obtained, the enteroscope image screenshot is compressed, so that the height, the width and the channel number of the enteroscope image screenshot meet the standards of the input images of the teacher model and the student model. And the enteroscope image screenshot after the compression processing is finished is formed into the training image, and the training image can be directly input into the teacher model and the student model and is used for training the teacher model and the student model. In addition, the training images can be divided into a plurality of training image sets, and the data of the same training image set are all from the data set of the same hospital.

After the step S100 is completed, as shown in fig. 1, the method further includes a step S200 of inputting the training image to the teacher model to obtain a first segmentation graph; wherein the number of the teacher models is greater than or equal to two; and different teacher models respectively obtain a first segmentation graph according to different training image sets.

The invention firstly trains a plurality of binary classification models, and then uses the trained binary classification models to jointly refine a multi-element classification model, namely, firstly trains a teacher model, and then uses the trained teacher model to jointly refine a student model. It is therefore necessary to train the teacher model first. For example, there are currently two teacher models A, B, teacher model a is used for automatically detecting polyps, teacher model B is used for automatically detecting bleeding, teacher model A, B is trained through different training image sets, data in the training image set of teacher model a may come from a hospital with the strongest local polyp treatment technology, and data in the training image set of teacher model B may come from a hospital with the strongest local colonic bleeding treatment technology. The teacher model classifies pixels on the training images according to preset classification conditions by collecting specific characteristic information in the training images, and outputs a classification result, wherein the classification result is the first segmentation graph.

The specific classification process is as follows, in one implementation, the teacher model includes a first down-sampling encoder and a first up-sampling decoder, and as shown in fig. 3, the step S200 further includes the following steps:

step S210, extracting the features of the training image according to the first down-sampling encoder to obtain a first feature map; the first feature map contains feature information of the training image;

step S220, analyzing the first feature map according to the first up-sampling decoder to obtain the first segmentation map;

Briefly, the teacher model mainly adopts a random gradient descent algorithm during training. The teacher model is mainly composed of a first down-sampling encoder and a first up-sampling decoder. The first downsampling encoder and the first upsampling decoder are respectively composed of four downsampling layers and four upsampling layers, a connection relation exists between the first downsampling encoder and the first upsampling decoder, the four downsampling layers are connected with the four upsampling layers in a one-to-one correspondence mode, the output of the four downsampling layers is added into the corresponding upsampling layers respectively, and an upsampling process is participated, so that the gradient of the teacher model is kept. After the training image is input into the teacher model, the training image sequentially passes through four downsampling layers of the first downsampling encoder to obtain the first feature map. And then the first characteristic diagram sequentially passes through four upsampling layers of the first upsampling decoder, and the final output result is the first segmentation diagram.

Furthermore, since deep learning models generally need to rely on powerful computing power as support, it is difficult to deploy them in devices with limited computing resources and limited storage space. To address this issue, the first downsampling encoder is constructed using a lightweight network MobileNetv2 when constructing the teacher model. The lightweight network MobileNetv2 is a lightweight model provided for equipment with limited computing resources, and a lightweight deep neural network is constructed by using deep separable convolution, so that the network structure is simplified, and the lightweight network MobileNetv2 has high accuracy and good model compression capability. In one implementation, as shown in fig. 4, the present embodiment adopts four layers in MobileNetv2 as the first downsampling layer 10, the second downsampling layer 20, the third downsampling layer 30, and the fourth downsampling layer 40 in the first downsampling encoder, respectively. Inputting a training image 1 into the first downsampling encoder, wherein the training image 1 is firstly input into the first downsampling layer 10, then the output of the first downsampling layer 10 is used as the input of the second downsampling layer 20, and by analogy, the output image of the last downsampling layer is used as the input image of the next downsampling layer, and the step of performing feature extraction on the input image is continuously performed until the fourth downsampling layer 40 finishes feature extraction on the input image and then outputs the first feature map 2. Wherein each downsampled layer in the first downsampling encoder is comprised of an inverse residual module constructed from a depth separable convolution. Specifically, the reverse residual error module performs point-by-point convolution calculation on the input image to expand the number of channels of the image; then, extracting image features by executing depth convolution calculation; then, the number of channels of the image is compressed by performing point-by-point convolution calculation. Therefore, the size of the teacher model is reduced, and meanwhile, the accuracy of the teacher model for extracting the image features is not lost.

The first down-sampling encoder performs feature extraction on an input training image, outputs the first feature map, uses the first feature map as an input image of the first up-sampling decoder, and then performs the step S220.

Specifically, four upsampling layers in the upsampling decoder respectively consist of a transposed convolutional layer and a standard layer, the transposed convolutional layer is used for expanding an image and extracting image features, and the standard layer is used for avoiding the mutual influence of parameters between different upsampling layers. For example, the following steps are carried out: as shown in fig. 4, after the first feature fig. 2 is inputted to the up-sampling decoder, the output result of the first up-sampling layer 50 is combined with the output result of the fourth down-sampling layer 40 in the down-sampling encoder through the first up-sampling layer 50, and is inputted to the second up-sampling layer 60 as an input image of the second up-sampling layer 60. The combining of the output result of the second upsampling layer 60 with the output result of the third downsampling layer 30 in the downsampling encoder is continuously performed as an input image of the third upsampling layer 70 and is input to the third upsampling layer 70. The combining of the output of the third upsampling layer 70 with the output of the second downsampling layer 20 in the downsampling encoder is then performed again as an input image for the fourth upsampling layer 80 and input to the fourth upsampling layer 80. Finally, the output result of the fourth upsampling layer 80 is combined with the output result of the first downsampling layer 10 in the downsampling encoder to obtain the first segmentation map 3.

In one implementation, the first segmentation map includes a first standard probability and a first abnormal probability corresponding to pixels in the training image; the first standard probability is the probability that the pixel belongs to a standard under a first preset classification condition, and the first abnormal probability is the probability that the pixel belongs to an abnormal under the first preset classification condition; the sum of the probability values of the first anomaly probability and the first standard probability is 1. For example, if the teacher model is a model for automatically detecting polyps, the corresponding first predetermined classification condition is whether polyps are present, the first standard probability is a probability that pixels in the training image are corresponding to normal (i.e., not having polyps), and the first abnormal probability is a probability that pixels in the training image are corresponding to polyps.

Specifically, after the training image is input into the teacher model, the first downsampling encoder downsamples the training image to obtain a downsampled training image with a size of

The first characteristic diagram of (1), wherein

，

，

Is the height, width and number of channels of the first signature. The first up-sampling decoder performs up-sampling according to the first feature map to obtain a first segmentation map

. Wherein,

represents the first

And k is the number of the teacher models, and each teacher model corresponds to a specific classification category.

，

Each pixel is predicted to be normal under a first predetermined classification conditionThe probability of (a) is the first standard probability,

is the probability that each pixel is predicted to be abnormal under the first preset classification condition, i.e. the first abnormal probability,

and

the sum of (1).

For the image channel, j =1 denotes channel 1, outputting a first standard probability; j =2 denotes channel 2, which outputs a first anomaly probability, R being a real number set.

For example, there are currently 4 teacher models: the teacher model A, the teacher model B, the teacher model C and the teacher model D respectively have corresponding classification conditions of whether polyps are suffered or not, whether merck diverticulums are suffered or not, whether ulcers are suffered or not and whether bleeding is suffered or not, and the classification conditions are respectively used for training to automatically detect the polyps, the merck diverticulums, the ulcers and the bleeding. Inputting the training image into the teacher model A to obtain a first segmentation graph (0.1, 0.9), wherein 0.1 is the probability that the prediction pixel output by the channel 1 is normal, and 0.9 is the probability that the prediction pixel output by the channel 2 is polyp; inputting the training image into the teacher model B to obtain a first segmentation graph (0.2, 0.8), wherein 0.2 is the probability that the predicted pixel output by the channel 1 is normal, and 0.8 is the probability that the predicted pixel output by the channel 2 has a Merck diverticulum; inputting the training image into the teacher model C to obtain a first segmentation graph (0.3, 0.7), wherein 0.3 is the probability that the predicted pixel output by the channel 1 is normal, and 0.7 is the probability that the predicted pixel output by the channel 2 is ulcer; and inputting the training image into the teacher C model to obtain a first segmentation graph (0.4, 0.6), wherein 0.4 is the probability that the predicted pixel output by the channel 1 is normal, and 0.6 is the probability that the predicted pixel output by the channel 2 has bleeding.

In order to obtain the correctness of the teacher model prediction during training, the method further comprises a step S300 of correcting parameters of the teacher model according to the first segmentation graph and the first real label, and continuing to execute the step of inputting the training image into the teacher model to obtain the first segmentation graph until a preset training condition of the teacher model is met, so as to obtain a trained teacher model; the first real label is used for reflecting a real result corresponding to the pixel on the training image under a first preset classification condition.

In the actual training process, each training image has its corresponding real label to evaluate the classification effect (prediction effect) of the model. The real label used for training the teacher model is the first real label to indicate a corresponding real result of the training image under the first preset classification condition. The training is to make the output result of the teacher model approach the real label continuously, so that the teacher model will perform parameter correction continuously during the training process, thereby controlling the training process and guiding the training process to converge towards the optimal direction.

As shown in fig. 5, the step S300 specifically includes the following steps:

step S310, calculating a first loss value according to the first segmentation chart and the first real label;

step S320, adjusting parameters of the first up-sampling decoder according to the first loss value to update the teacher model;

and S330, continuing to input the training image into the teacher model to obtain a first segmentation graph until a preset training condition of the teacher model is met to obtain a trained teacher model.

By continuously comparing the first segmentation graph with the first real label, the difference between the prediction result and the real result of the teacher model can be obtained, so that the teacher model can determine how to correct the parameters according to the difference between the prediction result and the real result, and a better prediction effect is achieved. Specifically, the teacher model classifies the training images according to the first preset classification condition to obtain a first segmentation graph, and substitutes the first segmentation graph and the first real label into a calculation formula of the first loss value to obtain the first loss value, where the first loss value may represent a difference between the first segmentation graph and the first real label. The calculation formula of the first loss value is as follows:

since the first loss value refers to the difference between the first segmentation graph and the first real label, the larger the value of the first loss value, i.e. the larger the difference between the first segmentation graph and the first real label, the poorer the classification effect of the teacher model; the smaller the value of the first loss value is, the smaller the difference between the first segmentation graph and the first real label is, the better the classification effect of the teacher model is.

Wherein

Is a teacher model

Output the first

First of the channel

And row and column

First segmentation of column pixels

To show the teacher model

The predicted junction of the pixelAnd (5) fruit.

Is the first of the training image

First of the channel

And row and column

Corresponding real label of column pixel

The pixel may be indicated in the teacher model

True classification under the corresponding classification conditions.

Wherein

Is that the pixel is in the teacher model

A normal tag of (1), and

is that the pixel is in the teacher model

The abnormal signature (diseased signature) in (1). The real label

Is a one-hot vector, i.e. only one channel in the same label corresponds to a result that is not 0, the other channels all correspond to a result that is 0,in other words, the real label of the teacher model has only two forms, namely (1, 0) and (0, 1), wherein (1, 0) indicates that the real condition corresponding to the pixel is normal, and (0, 1) indicates that the real condition corresponding to the pixel is abnormal. According to the real label

To evaluate the first segmentation graph output by the teacher model

The predicted effect is obtained by

And

substituting the first loss value into the calculation formula of the first loss value to obtain a first loss value of the teacher model, and evaluating the prediction effect of the teacher model according to the obtained first loss value, wherein the larger the obtained first loss value is, the worse the prediction effect of the teacher model is; the smaller the first loss value obtained, the better the prediction effect of the teacher model.

And obtaining a trained teacher model after the teacher model is trained, wherein the trained teacher model can be used for refining the student model, and the process of refining the student model is the training process of the student model.

Therefore, as shown in fig. 1, the method further includes step S400 of inputting the training image to the student model to obtain a second segmentation map. Specifically, the student model collects specific feature information in the training image, classifies pixels on the training image according to preset classification conditions, and outputs a classification result, wherein the classification result is the second segmentation map.

As shown in fig. 6, the step S400 specifically includes the following steps:

step S410, extracting the features of the training image according to the second down-sampling encoder and outputting a second feature map; the second feature map contains feature information of the training image;

and step S420, analyzing the second feature map according to the second up-sampling decoder to obtain the second segmentation map.

Specifically, the student model is similar to the teacher model in construction, and each student model comprises a down-sampling encoder and an up-sampling decoder. The down-sampling encoder in the student model is the second down-sampling encoder, the up-sampling decoder in the student model is the second up-sampling decoder, the second down-sampling encoder with the second up-sampling decoder constitutes by four down-sampling layers and four up-sampling layers respectively equally, the second down-sampling encoder with there is the relation of connection between the second up-sampling decoder, four down-sampling layers with four up-sampling layer one-to-one are connected, and will in the up-sampling layer that its correspondence was added respectively to the output on four down-sampling layers, participate in the up-sampling process, in order to keep the gradient of student model. And after the training image is input into the student model, the training image sequentially passes through four down-sampling layers of the second down-sampling encoder to obtain the second feature map. And then the second feature map sequentially passes through four upsampling layers of the second upsampling decoder, and the final output result is the first segmentation map. The student model is mainly different from the teacher model in that the number of classes of the classification condition of the student model is greater than that of the class of the classification condition of the teacher model, so that the dimension of the prediction result output by the student model is greater than that of the prediction result output by the teacher model. And the number of channels of the middle layer in the student model is smaller than that of the channels of the middle layer in the teacher model, so that the whole size of the student model is reduced.

Specifically, the present embodiment also adopts four layers in MobileNetv2 as four downsampling layers in the second downsampling encoder, and uses the output image of the previous downsampling layer as the input image of the next downsampling layer, and continues to perform the step of performing feature extraction on the input image until the fourth downsampling layer finishes feature extraction on the input image, and then outputs the second feature map (the detailed process may refer to step S210). Likewise, each downsampled layer in the second downsampled encoder is comprised of an inverse residual module constructed from depth separable convolutions. Specifically, the inverse residual error module performs point-by-point convolution calculation on the input image to expand the number of channels of the image. Then, by performing a depth convolution calculation, image features are extracted. Then, the number of channels of the image is compressed by performing point-by-point convolution calculation. Therefore, the size of the student model is reduced, and meanwhile, the accuracy of the student model for extracting the image features is not lost.

The second downsampling encoder performs feature extraction on the input training image, outputs the second feature map, uses the second feature map as an input image of the second upsampling decoder, and then performs the step S420.

In specific implementation, the outputs of the four down-sampling layers in the second down-sampling encoder are added into the corresponding four up-sampling layers in the second up-sampling decoder one by one to participate in the up-sampling process, so as to maintain the gradient of the student model. The second feature map passes through the four upsampling layers of the second upsampling decoder in sequence, and the final output result is the second segmentation map (the detailed process may refer to step S220).

In one implementation, the second segmentation map includes a second standard probability and a second abnormal probability corresponding to pixels in the training image; the second standard probability is the probability that the pixel belongs to the standard under a second preset classification condition, and the second abnormal probability is the probability that the pixel belongs to the abnormal under the second preset classification condition; the number of categories of the second preset classification condition is more than that of the first preset classification condition.

Specifically, after the training image is input into the student model, the second down-sampling encoder performs down-sampling according to the training image to obtain a value of

A second characteristic diagram of

，

，

Is the height, width and number of channels of the second signature. The second up-sampling decoder performs up-sampling according to the second feature map to obtain a second segmentation map

. Wherein,Srepresents a model of a student,

，

n is an image channel (for example, N =1 indicates a channel 1), N is a positive integer and N is equal to or greater than 2, the value of N is related to the number of teacher models, and R is a real number set.

Is the minimum standard probability that the pixel corresponds under the second preset classification condition,

indicating pixels in a second predetermined classification barThe probability of an anomaly under condition belonging to the first class,

representing the probability that the pixel belongs to the second class of anomaly under the second preset classification condition, and so on.

For example, since the invention extracts the knowledge contained in the trained teacher models into the student models, the classification conditions set by the student models are all related to the classification conditions of the trained teacher models. For example, there are currently 4 teacher models: the teacher model A, the teacher model B, the teacher model C and the teacher model D respectively have corresponding classification conditions of whether polyps are suffered or not, whether merck diverticulums are suffered or not, whether ulcers are suffered or not and whether bleeding is suffered or not, and the classification conditions are respectively used for training to automatically detect the polyps, the merck diverticulums, the ulcers and the bleeding. The second preset classification conditions of the student models distilled and extracted from the above four teacher models are four types: the first category of classification conditions is whether polyps are present, the second category of classification conditions is whether merkel diverticula are present, the third category of classification conditions is whether ulcers are present, and the fourth category of classification conditions is whether bleeding is present. Inputting the training image into the student model to obtain a second segmentation chart

It means that the probability of the predicted pixel having polyp is 0.8, the probability of having merkel diverticulum is 0.7, the probability of having ulcer is 0.5, and the probability of having hemorrhage is 0.3. The probability of not suffering from polyps is 1-0.8=0.2, the probability of not suffering from polyps is 1-0.7=0.3, the probability of not suffering from polyps is 1-0.5=0.5, the probability of not suffering from polyps is 1-0.3=0.7, and the probability of not suffering from polyps is the minimum value of normal probabilities in all diseases, so that the probability of not suffering from polyps 0.2 is kept in the second segmentation map as the normal probability corresponding to pixels, and therefore the inaccurate prediction effect of the student model caused by the existence of the overhigh normal probability is avoided.

In specific implementation, the predicted effect graph of the student model is shown in fig. 11, where a column is a training image input into the student model; column B is the corresponding second true label (true category graph); column C is the second segmentation map (predicted effect map) corresponding to the output.

In order to obtain the correctness of the classification of the student model during training, the method further comprises the following steps:

step S500, correcting parameters of the student model according to the second segmentation graph, a teacher label and a second real label, and continuously executing the step of inputting the training image to the student model to obtain a second segmentation graph until preset training conditions of the student model are met to obtain a trained student model; the second real label is used for reflecting the real classification condition corresponding to the pixels on the training image under a second preset classification condition; the teacher label is used for reflecting the classification condition of the training images in the trained teacher model.

And the real label used for training the student model is a second real label to indicate the real classification condition corresponding to the pixel in the training image under the second preset classification condition. The training is to make the output result of the student model approach the real label continuously, so that the student model can perform parameter correction continuously in the training process, thereby controlling the training process and guiding the training process to converge towards the optimal direction.

As shown in fig. 7, the step S500 specifically includes the following steps:

step S510, calculating a second loss value according to the second segmentation chart, the teacher label and a second real label;

step S520, adjusting parameters of the second upsampling decoder according to the second loss value to update the student model;

and step S530, continuing to execute the step of generating a second segmentation graph according to the training image until the preset training condition of the student model is met, so as to obtain the trained student model.

By continuously comparing the second segmentation graph with the second real label, the difference between the prediction result and the real classification result of the student model can be obtained, so that the student model can determine how to correct the parameters according to the difference between the prediction result and the real classification result, and a better prediction effect is achieved. Specifically, the student model classifies the training images according to the second preset classification condition to obtain a second segmentation map, and substitutes the second segmentation map and the second real label into a calculation formula of a second loss value to obtain the second loss value, where the second loss value can represent a difference between the second segmentation map and the second real label. The calculation formula of the second loss value is as follows:

since the second loss value refers to the difference between the second segmentation map and the second real label, the larger the value of the second loss value, i.e. the larger the difference between the second segmentation map and the second real label, the poorer the classification effect of the student model; the smaller the value of the second loss value is, the smaller the difference between the second segmentation map and the second real label is, the better the classification effect of the student model is.

Wherein,

is the output of the student model

First of the channel

And row and column

Second segmentation of column pixels

。

Is the first of the training image

First of the channel

And row and column

The teacher label of a column of pixels, which may represent the predicted result of that pixel in the trained teacher model,

is the first of the training image

First of the channel

And row and column

The true label of a column pixel, i.e. the second true label, may indicate the true classification condition of the pixel in the training image under the second preset classification condition.

In particular, the second authentic tag

In the form of

. Wherein

Is the label of the pixel belonging to the criterion under said second predetermined classification condition, y₁ ^SThe pixels belong to a label of a first class anomaly, y, under said second preset classification condition₂ ^SThe pixel is at the secondLabels belonging to a second class of anomalies under two predetermined classification conditions, y_n ^SAnd the pixels belong to the labels of the n-th class of abnormity under the second preset classification condition. The second real label is also a single-time thermal vector, that is, the result corresponding to only one channel in the same label is not 0, and the results corresponding to other channels are all 0, in other words, the same label can only refer to one of normal pixels or n-type anomalies.

In one implementation, the teacher label is derived from the first segmentation map output by all the trained teacher models, as shown in fig. 8, and the step S510 includes the following steps:

step S511, obtaining a probability total value according to the first segmentation graphs output by all the trained teacher models;

and S512, adjusting the first segmentation graphs output by all the trained teacher models according to the total probability value to obtain teacher labels.

Since the dimensions of the first segmentation graph output by the teacher model and the second segmentation graph output by the student model are different (which is equivalent to different channel numbers), the first segmentation graphs output by all the teacher models need to be adjusted first so that the teacher label is matched with the dimensions of the output result of the student model. In this embodiment, the first segmentation graph output by all teacher models is adjusted by the following formula, and a teacher label is obtained:

wherein D is the total probability value,

is a teacher label. Specifically, first, a first segmentation chart is output according to i teacher models

Obtaining a first vector

The first vector retains the i teacher model outputs in the first segmentation graph

And keeping the smallest probability value in the first segmentation graphs output by the i teacher models

As

. According to a formula for calculating the total probability value D, the first vector is divided into a plurality of first vectors

And all of the first vector

And adding to obtain a probability total value. Then, the first vector is measured

Dividing the total probability value D to obtain a second vector, wherein the second vector is the teacher label

。

For example, currently, there are A, B, C, D four teacher models, and the first segmentation maps output by A, B, C, D teacher models are (0.1, 0.9), (0.2, 0.8), (0.3, 0.7), (0.4, 0.6), respectively, then first a first vector (0.1, 0.9, 0.8, 0.7, 0.6) is obtained according to the first segmentation map output by A, B, C, D teacher model, then all probabilities in the first vector are added to obtain a total probability value of 3.1, i.e., 0.1+0.9+0.8+0.7+0.6=3.1, and then the first vector is divided by the total probability value, i.e., it is equivalent to divide all probabilities in the first vector by 3.1, respectively, to obtain a second vector, which is the teacher label.

The second vector is represented as

。

After the training of the student model is completed, a trained student model is obtained, and the trained student model can be used for real-time enteroscopy image segmentation, namely as shown in fig. 1.

Based on the above embodiment, as shown in fig. 10, the present invention further provides a device for real-time enteroscopy image segmentation based on integrated knowledge distillation, wherein the device comprises: an image acquisition module 120, wherein the image acquisition module 120 is configured to acquire a training image; a teacher model unit 130, the teacher model unit 130 configured to obtain a first segmentation map from the training image; a first parameter modification module 110, where the first parameter modification module 110 is configured to modify parameters of the teacher model according to the first segmentation map and the first real label; a student model unit 170, wherein the student model unit 170 is configured to obtain a second segmentation map according to the training image; a second parameter modification module 160, wherein the second parameter modification module 160 is configured to modify parameters of the student model according to the second segmentation map, the teacher label, and the second real label;

the teacher model unit 130 further includes: a first downsampling encoder module 90, where the first downsampling encoder module 90 is configured to perform feature extraction on the training image to obtain a first feature map; a first upsampling decoder module 100, where the first upsampling decoder module 100 is configured to parse the first feature map to obtain the first segmentation map;

the student model unit 170 further includes: a second downsampling encoder module 140, where the second downsampling encoder module 140 is configured to perform feature extraction on the training image to obtain a second feature map; a second upsampling decoder module 150, wherein the second upsampling decoder module 150 is configured to parse the second feature map to obtain the second segmentation map.

Based on the above embodiments, the present invention also provides a non-transitory computer readable storage medium, on which a data storage program is stored, the data storage program, when executed by a processor, implements the steps of the integrated knowledge distillation based real-time enteroscopy image segmentation method as described above.

Any reference to memory, storage, database or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

Based on the foregoing embodiments, the present invention further provides a terminal, which includes a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors include a method for performing the integrated knowledge distillation-based real-time enteroscopy image segmentation method as described in any one of the above. A functional block diagram of the terminal may be as shown in fig. 9. The terminal comprises a processor, a memory and a network interface which are connected through a system bus. Wherein the processor of the terminal is configured to provide computing and control capabilities. The memory of the terminal comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the terminal is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a real-time enteroscopy image segmentation method based on integrated knowledge distillation.

It will be understood by those skilled in the art that the block diagram of fig. 9 is only a block diagram of a part of the structure related to the solution of the present invention, and does not constitute a limitation to the intelligent terminal to which the solution of the present invention is applied, and a specific intelligent terminal may include more or less components than those shown in the figure, or combine some components, or have different arrangements of components. In addition, the method for real-time segmentation of an enteroscopy image based on distillation of integrated knowledge as described in any one of the above embodiments may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the above embodiments of the method. Any reference to memory, storage, databases, or other media used in embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

In summary, in the present invention, a plurality of training images are obtained, the training images are divided into a plurality of training image sets, and the training images of the same training image set are from the same data set; firstly, training teacher models, wherein different teacher models respectively obtain a first segmentation graph according to different training image sets; and then the trained teacher model is used for refining a student model together. The training image is an enteroscope image screenshot, and the trained student model can generate a real-time enteroscope image segmentation image according to the real-time enteroscope image. Therefore, the problem that data sets among different hospitals are discontinuous and cannot be gathered together to train the colonoscopy automatic image segmentation model is solved.

It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims

1. A method for real-time enteroscopy image segmentation based on integrated knowledge distillation, the method comprising:

2. The method of claim 1, wherein the obtaining of the training images, which are screenshots of the colonoscopy images for training the teacher model and the student model, comprises:

acquiring an enteroscope image screenshot;

3. The method of claim 1, wherein the teacher model comprises a first downsampling encoder and a first upsampling decoder; the inputting the training image into the teacher model to obtain a first segmentation graph includes:

4. The method of claim 3, wherein the modifying the parameters of the teacher model according to the first segmentation map and the first real labels and continuing to input the training image into the teacher model to obtain the first segmentation map until a preset training condition of the teacher model is met to obtain the trained teacher model comprises:

5. The method of claim 1, wherein the student model comprises a second downsampling encoder and a second upsampling decoder; the inputting the training image into the student model to obtain a second segmentation map comprises:

6. The method of claim 5, wherein the modifying the parameters of the student model according to the second segmentation map, the teacher label and the second real label and continuing to perform the step of generating the second segmentation map according to the training image until a preset training condition of the student model is met to obtain the trained student model comprises:

7. The method of claim 6, wherein said calculating a second loss value from said second segmentation graph, said teacher label, and a second true label comprises:

8. An apparatus for real-time enteroscopic image segmentation based on integrated knowledge distillation, the apparatus comprising:

the image acquisition module is used for acquiring a training image;

the teacher model unit further includes:

the student model unit further includes:

9. A terminal comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and wherein the one or more programs being configured to be executed by the one or more processors comprises instructions for performing the method of any of claims 1-7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for real-time enteroscopic image segmentation based on integrated knowledge distillation of any one of claims 1-7.