WO2020078252A1

WO2020078252A1 - Method, apparatus and system for automatic diagnosis

Info

Publication number: WO2020078252A1
Application number: PCT/CN2019/110353
Authority: WO
Inventors: Chung Yan Carmen Poon; Ruikai Zhang; Yuqi JIANG
Original assignee: The Chinese University Of Hong Kong
Priority date: 2018-10-16
Filing date: 2019-10-10
Publication date: 2020-04-23
Also published as: CN113302649A

Abstract

A method (2000), an apparatus (100) and a system (900, 1000) for automatic diagnosis are disclosed. The method (2000) comprises: receiving, in sequence, a predetermined number of frames of medical video data by a CNN group (S201); predicting, by the CNN group, lesion prediction for each of the frames (S202); and outputting the frames each of which is labeled with the lesion prediction (S203). For each inputted frame, the predicting comprises: extracting feature from the inputted frame and determining a prediction for the inputted frame by a first CNN of the CNN group; extracting, by each of the CNNs after the first CNN, feature from input thereof, wherein the input is generated by concatenating at least one latest output from a previous CNN; and determining a prediction for the inputted frame, wherein a prediction from a last CNN of the CNN group is outputted as the lesion prediction for the inputted frame.

Description

METHOD, APPARATUS AND SYSTEM FOR AUTOMATIC DIAGNOSIS

TECHNICAL FIELD

Embodiments of the present disclosure relate generally to a field of automatic diagnosis. Particularly, embodiments of the disclosure relate to a method, an apparatus and a system for automatic diagnosis.

BACKGROUND

Gastrointestinal cancers such as colorectal cancer (CRC) are one of the commonest cancers worldwide. CRC is preventable and endoscopy is an effective way to detect it. Endoscopists can visually check the lower gastrointestinal tract using an endoscope and resect polyps that have a high risk of developing into colorectal cancer. Nevertheless, this endoscopic procedure mainly relies on a human visual inspection and thus is a highly experience-dependent. During such endoscopic procedure, lesions may be miss-detected or miss-diagnosed.

Therefore, it is desired to develop an automatic diagnosis system that is capable of assisting the endoscopists to locate and classify the polyps and reduce polyp miss-detection rate and costs of unnecessary pathology assessment.

SUMMARY OF THE APPLICATION

An aspect of the present invention provides a method for automatic diagnosis, comprising: receiving, in sequence, a predetermined number of frames of medical video data by a CNN group that comprises a plurality of CNNs connected in series; predicting, by the CNN group, a lesion prediction for each of the frames; and outputting the frames each of which is labeled with the lesion prediction, wherein, for each inputted frame, the predicting comprising: extracting, by a first CNN of the CNN group, feature from the inputted frame; determining, by the first CNN, a prediction for the inputted frame based on the extracted feature; extracting, by each of the CNNs after the first CNN, feature from input thereof, wherein the input to each CNN after the first CNN is generated by concatenating at least one latest output from a previous CNN, the output being at least one of the extracted feature and the determined prediction; and determining, by each CNN after the first CNN, a prediction for the inputted frame based on the extracted feature thereof, wherein a prediction from a last CNN of the CNN group is outputted as the lesion prediction for the inputted frame.

Another aspect of the present invention provides an automatic diagnosis apparatus, comprising: a processor; and a memory coupled to the processor to store instructions executable by the processor to perform operations: receiving, in sequence, a predetermined number of frames of medical video data by a CNN group that comprises a plurality of CNNs connected in series; predicting, by the CNN group, a lesion prediction for each of the frames; and outputting the frames each of which is labeled with the lesion prediction, wherein, for each inputted frame, the predicting comprising: extracting, by a first CNN of the CNN group, feature from the inputted frame; determining, by the first CNN, a prediction for the inputted frame based on the extracted feature; extracting, by each of the CNNs after the first CNN, feature from input thereof, wherein the input to each CNN after the first CNN is generated by concatenating at least one latest output from a previous CNN, the output being at least one of the extracted feature and the determined prediction; and determining, by each CNN after the first CNN, a prediction for the inputted frame based on the extracted feature thereof, wherein a prediction from a last CNN of the CNN group is outputted as the lesion prediction for the inputted frame.

Yet another aspect of the present invention provides an automatic diagnosis system, comprising: an endoscopy obtaining endoscopy data; a diagnosis apparatus receiving the endoscopy data and comprising: a processor; and a memory coupled to the processor to store instructions executable by the processor to perform operations: receiving, in sequence, a predetermined number of frames of medical video data by a CNN group that comprises a plurality of CNNs connected in series; predicting, by the CNN group, a lesion prediction for each of the frames; and outputting the frames each of which is labeled with the lesion prediction, wherein, for each inputted frame, the predicting comprising: extracting, by a first CNN of the CNN group, feature from the inputted frame; determining, by the first CNN, a prediction for the inputted frame based on the extracted feature; extracting, by each of the CNNs after the first CNN, feature from input thereof, wherein the input to each CNN after the first CNN is generated by concatenating at least one latest output from a previous CNN, the output being at least one of the extracted feature and the determined prediction; and determining, by each CNN after the first CNN, a prediction for the inputted frame based on the extracted feature thereof, wherein a prediction from a last CNN of the CNN group is outputted as the lesion prediction for the inputted frame.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other features of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the accompanying drawings, in which:

Fig. 1 is a schematic view that shows an automatic diagnosis system having an automatic diagnosis apparatus according to one embodiment of the present disclosure.

Fig. 2 shows a flow diagram of a method for automatic diagnosis according to one embodiment of the present disclosure.

Fig. 3 shows a schematic structure of the CNN group according to one embodiment of the present application.

Figs. 4A-4C show an example for illustrating the operation of the CNN group according to one embodiment of the present application.

Fig. 5 shows another example for illustrating the operation of the CNN group according to one embodiment of the present application.

Fig. 6 shows a structure of the CNN according to one embodiment of the present application.

Figs. 7A-7C show three candidate structures of feature extractor according to one embodiment of the present application.

Figs. 8A-8B show two candidate structures of predictor in the CNN according to one embodiment of the present application.

Fig. 9 is a schematic diagram illustrating a system adapted to implement the present application.

DETAILED DESCRIPTION

Various embodiments and aspects of the disclosures will be described with reference to details discussed below, and the accompanying drawings will illustrate the various embodiments. The following description and drawings are illustrative of the disclosure and are not to be construed as limiting the disclosure. Numerous specific details are described to provide a thorough understanding of various embodiments of the present disclosure. However, in certain instances, well-known or conventional details are not described in order to provide a concise discussion of embodiments of the present disclosures.

Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in conjunction with the embodiment can be included in at least one embodiment of the disclosure. The appearances of the phrase “in one embodiment” in various places in the specification do not necessarily all refer to the same embodiment.

The method, apparatus and system for automatic diagnosis of the present application is used to automatic diagnosis disease according to medical video data and is especially suitable to cancers screening with an endoscopy. The system for automatic diagnosis of the present application may be connected between a medical video acquiring device and a display device, to automatically diagnose lesions and display the diagnostic results in real-time on the display device. The system may assist clinicians in enhancing the quality of endoscopy and reducing unnecessary tissue resection.

Fig. 1 is a schematic view that shows an automatic diagnosis system 1000 according to one embodiment of the present disclosure. Fig. 1 illustrates a scenario where the automatic diagnosis apparatus 100 is used to diagnose lesions according to endoscopy data, but it should be noted that the present application is not limited thereto. Referring to Fig. 1, the automatic diagnosis apparatus 100 is connected between an endoscopy 200 and a display device 300. The automatic diagnosis apparatus 100 may receive endoscopy video data from the endoscopy 200, diagnose lesions according to the endoscopy video data, and then output the endoscopy video data with diagnostic prediction to the display device 300. Endoscopists may view the endoscopy video data and the diagnostic prediction of the automatic diagnosis apparatus 100 via the display device 300. In the present application, the automatic diagnosis apparatus 100 may apply a method for automatic diagnosis described later according to Figs. 2-8B. In the following description, the method for automatic diagnosis of the present application will be described in detail.

Fig. 2 shows a flow diagram of a method 2000 for automatic diagnosis according to one embodiment of the present disclosure. As shown, the method 2000 starts with step S201 to receive a predetermined number of frames of medical video data in sequence by a CNN (convolutional neural network) group (to be described in detail later) that comprises a plurality of CNNs connected in series. The medical video data may acquire by the medical video acquiring device such as an endoscopy or a storage device, and may comprise a plurality of frames. Then, at step S202, the CNN group may predict a lesion prediction for each of the frames. Finally, at step S203, the frames each of which is labeled with the lesion prediction are outputted. For example, the outputted frames may be labeled with a bounding box indicating the lesion, classification of the lesion and the like. The labeled frames may be outputted to a display device such that the doctors may view the displayed information as a diagnose reference.

Hereinafter, the CNN group will be described in detail referring to Figs. 3-8B.

Referring to Fig. 3, the CNN group comprises a plurality of CNNs 1 to n that will be described in detail referring to Fig. 6, wherein each of a previous CNN is connected to a next CNN, that is, the plurality of CNNs 1-n are connected in series. The received frames at the step S201 are inputted into the CNN group in sequence. Fig. 3 shows a case that the frames 1 to n have been inputted into the CNN group.

Hereinafter, the frame n will be taken as an example to describe the prediction operation of the CNN group. It should be note that the predictions of the other frames by the CNN group may be same as the frame n.

The frame n is firstly inputted into the CNN 1. The CNN 1 extracts feature from the inputted frame n, and then determines a prediction for the inputted frame n based on the extracted feature. After extracting and determining operations of CNN 1, at least one of the extracted feature and the prediction may be outputted.

For CNN 2 in the CNN group, an input thereof may be a sum of at least one latest output from the previous CNN 1. For example, the outputs of CNN 1 respectively based on frames n-2, n-1 and n may be concatenated and inputted into CNN 2. As described above, the output of each CNN in the CNN group may be at least one of the extracted feature and the prediction. The extracted feature and the prediction may be of a vector form or matrix form, and the concatenation between different feature or prediction may be performed by linking a plurality of vectors or matrixes into one vector or matrix with a higher dimensionality.

The CNNs after CNN 1 in the CNN group may operate similarly to CNN 2. That is, the input to each CNN after CNN 1 is generated by concatenating at least one latest output from a previous CNN.

After the last CNN n of the CNN group determining a prediction for the inputted frame n, the prediction from the CNN n is outputted as the lesion prediction for the inputted frame n.

Figs. 4A-4C show a specific example for illustrating the operation of the CNN group according to one embodiment of the present application. In the example shown in Figs. 4A-4C, predicting of three frames 1-3 are described, and the CNN group comprises two CNNs (CNN 1 and CNN 2) . However, it should be noted the present application is not limited thereto, and the number of frames and the CNNs may be variously changed.

Fig. 4A shows the predicting process for frame 1. Since frame 1 is the first frame of the inputted frames, only one CNN is used to predict frame 1. As shown, frame 1 is firstly inputted into CNN 1. Then, CNN 1 extracts feature from frame 1, and determines a prediction for frame 1 based on the extracted feature. The determined prediction 1 may be outputted as a lesion prediction for frame 1, meanwhile, CNN 1 may also output at least one of the extracted feature and determined prediction to assist the predicting process for the subsequent frames.

Fig. 4B shows the predicting process for frame 2. Frame 2 is firstly inputted into CNN 1. Then, CNN 1 extracts feature from the frame 2, and determines prediction for frame 1 based on the extracted feature. CNN 1 may output at least one of the extracted feature and determined prediction. Next, the outputs (output 1 and output 2) of CNN 1 obtained respectively based on

frames

1 and 2 may be concatenated together and inputted into CNN 2, wherein output 1 has been obtained by CNN 1 during the predicting process of frame 1. CNN 2 determines a prediction 2 for frame 2 based on the input thereof, and this prediction may be outputted as a lesion prediction for frame 2.

Fig. 4C shows the predicting process for frame 3. Frame 3 is firstly inputted into CNN 1, and the process performed by CNN 1 is same as that described referring to Fig. 4B. During the process shown in Fig. 4C, the input of CNN 2 is concatenated outputs (outputs 1-3) of CNN 1 obtained respectively based on frames 1-3, wherein outputs 1-2 have been obtained by CNN 1 during the predicting processes for

frames

1 and 2 respectively. Then, CNN 2 may determine a prediction 3 for frame 3 based on the input thereof, and this prediction may be outputted as a lesion prediction for frame 3. It should be noted that, although Figs. 4B and 4C show a plurality of CNNs 1, it is only for schematically illustrating how the concatenated outputs from CNN 1 are obtained, and there is only one CNN 1 exists in the CNN group.

Fig. 5 shows another example for illustrating the operation of the CNN group according to one embodiment of the present application. In the example shown in Fig. 5, predicting of frame 4 is described, and the CNN group comprises three CNNs (CNN 1, CNN 2 and CNN 3) . However, it should be noted the present application is not limited thereto, and the number of frames and the CNNs may be variously changed.

As shown in Fig. 5, frame 4 is firstly inputted into CNN 1. The process performed by CNN 1 is same as that described referring to Fig. 4B. Next, concatenated outputs of CNN 1 obtained respectively based on frames 2-4 is inputted into CNN 2. CNN 2 may extract feature based on the input thereof and determine a prediction for frame 3 based on the extracted feature, and then output at least one of the extracted feature and determined prediction. Then, concatenated outputs of CNN 2 obtained respectively based on frames 3-4 is inputted into CNN 3, wherein the output of CNN 2 obtained based on frame 3 is generated by inputting concatenated outputs of CNN 1 obtained respectively based on frames 1-3 into CNN 2, extracting feature and determining the prediction for frame 3 by CNN 2, and outputting at least one of the extracted feature and determined prediction. Finally, CNN 3 may determine a prediction 4 for frame 4 based on the input thereof, and this prediction may be outputted as a lesion prediction for frame 4. In the embodiment shown in Fig. 5, because the CNN 1 continuously process inputted frames, the outputs of CNN 1 based on frames 1-4 are continuously obtained. Therefore, outputs of CNN 1 obtained respectively based on frames 2-4 are latest three outputs of CNN 1 when predicting frame 4. Similarly, outputs of CNN 2 obtained respectively based on frames 3-4 are latest two outputs of CNN 2 when predicting frame 4. In other word, the input to each CNN after the first CNN is generated by concatenating at least one latest output from a previous CNN. Further, it should be noted that, although Fig. 5 only shows the predicting process for frame 4, other frames may be predicted by the similar process.

Because the information obtained from previous frames is used in the predicting process for the subsequent frames, the information contained in the medical video data may be fully utilized, and thus the accuracy of the prediction may be improved.

Fig. 6 shows a structure of the CNN according to one embodiment of the present application. In the CNN of the present disclosure, a plurality of feature extractor layers are used, each previous feature extractor layer is connected to a next feature extractor layer. Each feature extractor layer comprises a plurality of feature extractors each of which is configured to extract feature of the input thereof. In addition, each of the feature extractor may have a structure that is shown in Figs. 7A-7C. Further, the CNN may comprise a predictor following the extract feature layers, and the predictor is configured to predict lesion prediction according to the output of the previous feature extractors. As shown in Fig. 6, each feature extractor layer is composed of a plurality of feature extractors connected in parallel. For example, feature extractor layer L1 comprises feature extractors E11-Em1 connected in parallel, feature extractor layer L2 comprises feature extractors E21-Em2 connected in parallel, and feature extractor layer Ln comprises feature extractors E2n-Emn connected in parallel. Outputs of the feature extractors in one feature extractor layer may be concatenated and inputted into each feature extractor in the next feature extractor layer. For example, outputs of feature extractors E11-Em1 in the feature extractor layer L1 are concatenated to be feature F1, and then feature F1 is inputted to each of feature extractor (feature extractor E12-Em2) in feature extractor layer L2. Each of feature extractors E12-Em2 may extract feature from input thereof, and then outputs of feature extractors E12-Em2 may be concatenated to be feature F2 and inputted into each feature extractor in the next feature extractor layer. The feature extractors in the following feature extractor layers may operate similarly. Outputs of feature extractors E1n-Emn in the last feature extractor layer Ln may be concatenated to be feature Fn and inputted into the predictor, and the predictor may determine a prediction for the inputted frame based on the inputted feature. In one embodiment, at least one of features F1-Fn and prediction may be used as the output of the CNN.

In the present application, the feature extractors in one feature extractor layer may extract features in different aspect, this may take fully use of the information contained in the inputted frame.

Figs. 7A-7C show three candidate structures for the feature extractor.

In Fig. 7A, a first structure for the feature extractor comprises a plurality of convolution layers, such as the convolution layers C11-C15. In the first structure, each of convolution layers is connected to following layers of the convolution layers. Specifically, taking the convolution layers C11-C15 as an example, the convolution layer C11 is not only connected to the convolution layer C12, but also connected to the convolution layers C13-C15. In other word, the convolution layer C11 is connected to all of the convolution layers following it. Likely, the convolution layers C12-C15 may be connected to the following convolution layers in a similar manner.

In Fig. 6B, a second structure for the feature extractor comprises a plurality of convolution layers, such as the convolution layers C21-C25. In the second structure, each of the convolution layers is connected to next layer or next two layers of the convolution layers. Specifically, taking the convolution layers C21-C25 as an example, the convolution layer C21 is connected to the convolution layers C22 and C23 that closely follow the convolution layer C21 (i.e., the next two layers of the layer C21) . The convolution layer C22 is connected to the convolution layer C23 that closely follow the convolution layer C22 (i.e., the next layer of the layer C21) . The convolution layers C23 and 25 may be connected to the following convolution layers thereof in a similar manner to the convolution layer C21, and the convolution layer C24 may be connected to the following convolution layer thereof in a similar manner to the convolution layer C22.

In Fig. 6C, a third structure for the feature extractor comprises the structures described in Fig. 6A-6B in parallel, where the outputs from both structures are concatenated as output. For example, in the third structure, the first structure composed of the convolution layers C11-C15 and the second structure composed of the convolution layers C21-C25 are connected in parallel.

In the structures shown in Figs. 6A-6C, in the convolution layers connected to each other, the feature extracted by the previous layers may be concatenated and outputted as the input of the following convolution layer.

It should be note that although Figs. 6A-6C shows the first and second structures comprising five convolution layers, the number of the convolution layers is not limited thereto. It should also be noted that, although the third structure in Fig. 6C is shown to comprise of one first structure in Fig. 6A and one second structure in Fig. 6B, the present disclosure is not limited thereto. In addition, the feature extractors in the CNN may has one of the first to third structures, and the structures of the feature extractors in one CNN may be different from each other. Further, structures of the CNNs in one CNN group may be different from each other.

In one embodiment, in the CNN, each of the convolution layers may be followed by a pooling layer, and a fully connected layer may be located after all of the convolution layers. The convolution layers perform the convolution operation on the inputted image or feature and generate a plurality of feature maps, and then the pooling layer performs the pooling operation on the feature maps to obtain the features that have translation invariance from the feature maps. The fully connected layer combines the outputs of the pooling layers, and then generates a representation for the inputted image or feature. The representation may be of vector form.

The CNN as shown in Fig. 6 may comprise one or more predictors.

Fig, 8A shows an example that the CNN comprises one predictor. The predictor can generate one output vector. The output vector is a combination of multiple predictions, of which each prediction is associated to a specific task. The task may be at least one of lesion recognition, lesion detection, lesion localization determination, lesion segmentation and disease diagnosis.

Fig, 8A shows an example that the CNN comprises a plurality of predictors. The multiple predictions may be generated by the plurality of predictor, of which each predictor is associated to one prediction for a specific task.

Referring to Figs. 3 and 6-7B again, in one embodiment, the feature extractors in the CNNs except for the last CNN of the CNN group may have the first structure shown in Fig. 7A, and feature extractors in the last CNN of the CNN group may have the second structure shown Fig. 7B.

Because the convolution layers in the first structure are connected to all of the following convolution layers thereof, the information obtained by one of the convolution layers from the frames may be transmitted to all of the following convolution layers. In this way, the CNNs may take full advantage of each frame and the accuracy of the prediction may be improved. Further, because input of the last CNN is the concatenated output of previous CNNs, the information quantity of information to be processed is relatively small. In this instance, it is not necessary that each of the convolution layers in the last CNN is connected to all of the following convolution layers. Therefore, the structure of last CNN uses the second structure in Fig. 7B to reduce the computing cost while maintaining the accuracy of the prediction.

In some embodiments, the CNN group may be trained before applying to the actual prediction task. The CNN group may be trained by:

a) inputting a predetermined number of training frames of medical video data to the CNN group;

b) predicting, by the CNNs except for the last CNN in the CNN group, frame lesion prediction candidate for each of the training frames;

c) comparing the frame lesion prediction candidate and ground truth for each of the training frames to obtain first training error for each of the training frames;

d) predicting, by the last CNN in the CNN group, final frame lesion prediction candidate for each of the training frames, according to the concatenated outputs of the previous CNNs;

e) comparing the final frame lesion prediction candidate and final ground truth for each of the training frames to obtain second training error for each of the training frames;

f) backward-propagating the first and second training errors to the CNN group to adjust parameters of the CNNs; and

g) repeating steps a) -f) until the first and second training errors converge.

At the step a) , the medical video data for training may be obtained from a database of medical video or downloaded from the Internet, but the present disclosure is not limited thereto. In the present embodiment, each of the training frame has a corresponding ground truth (used in step c) ) , and the ground truth presents the reference answer about the lesion in the frames. In some embodiments, the ground truth may comprise at least one of presence of interested lesion, size of interested lesion, location of interested lesion, histology type of interested lesion, area of interested lesion and the diagnosis related to interested lesion. The frame lesion prediction candidates determined by the CNNs may correspond to the ground truth, for example, the frame lesion prediction candidates may also comprise at least one of presence of interested lesion, size of interested lesion, location of interested lesion, histology type of interested lesion, area of interested lesion and the diagnosis related to interested lesion. In other embodiment, the frame lesion prediction candidates may be at least one of lesion recognition, lesion detection, lesion localization, lesion segmentation and disease diagnosis. In this instance, the ground truth is at least one of lesion recognition, lesion detection, lesion localization, lesion segmentation and disease diagnosis, accordingly.

At the step b) , the CNNs except for the last CNN in each of the CNN groups are used to predict frame lesion prediction candidate for each of the training frames. The predicting process of each used CNN may be similar to the process described referring to Fig. 6, and the information transmission between the used CNNs may be similar to that described referring to Fig. 3.

At the step c) , the frame lesion prediction candidate and ground truth for each of the training frame are compared to obtain first training error for each of the training frames. In the present embodiment, the first training error represent the difference between the predictions of the used CNNs and the reference answers of the frames, which may reflect the training degree of the used CNNs.

At the step d) , the outputs of the used CNNs in the step b) are inputted to the last CNN in each of the CNN groups, and the last CNN in each of the CNN groups may predict final frame lesion prediction candidates for each of the frames according to the concatenated outputs of the previous CNNs. The prediction process may be similar to the process described referring to Fig. 6.

At the step e) , the final frame lesion prediction candidate and final ground truth for each of the training frames may be compared to obtain second training error for each of the training frames. In the present disclosure, the final ground truth may represent the reference answer about the whole training medical video or segments of medical video. In some embodiment, the final ground truth may be the ground truth of the last frame of the medical video.

At the step f) , the first and second training errors are backward-propagated to the CNN group to adjust parameters of the CNNs in the CNN group. In the present embodiment, the parameters of the CNNs may be the connection weights between neurons of the CNNs. Further, in some embodiments, a sum of the first and second training errors may be backward-propagated to the CNN group to adjust connection weights between neurons of the CNNs.

The steps a) -f) may be repeated until the first and second training errors converge. By repeating the steps a) -g) , the parameters of the CNNs in the CNN group may be continually optimized according to the backward-propagated errors, and the errors reduce gradually. When all errors cannot further reduce, the repeating may be stop and the training process finishes.

In some embodiments, the method of the present application may further comprise pre-training the CNNs except for the last CNN in the CNN group. The pre-training process may comprise:

h) inputting training images to the CNNs except for the last CNN in the CNN group;

i) predicting, by the CNNs except for the last CNN in each of the CNN group, image prediction candidate for each of the inputted training images;

j) comparing the image prediction candidate with ground truth for each of the inputted training images to obtain image error for each of the inputted training images;

k) backward propagating the image error to the CNNs except for the last CNN and adjust parameters thereof; and

l) repeating steps h) -k) until the image error converges.

The steps h) -j) are similar to the steps a) -c) described above, and the difference between them is that, the pre-training process uses the images as training data. However, it should be noted that the pre-training process also can use video data as training data. For the step k) , because the pre-training process only training the CNNs except for the last CNN in the CNN group, only one set of errors (i.e., the detection and localization errors of image) is backward-propagated. Similar to the above described training process, the pre-training process is also repeated for a plurality of times until the set of errors converges.

In some embodiments, the pre-training process may be performed by inputting public non-medical images and public medical images respectively as the training images. Specifically, the CNNs except for the last CNN in the CNN group are firstly pre-trained by public non-medical images, and then are pre-trained by public medical images. In such way, the CNNs except for the last CNN in the CNN group may be gradually optimized to be suitable for medical prediction.

After pre-training using public non-medical images and public medical images respectively, all of the CNNs in the CNN group may be trained using a specific medical video data via the training steps a) -g) . The specific medical video date may be specific target medical video data. After these three stages, all of the CNNs may be trained to be suitable for the specific medical diagnose.

In some embodiments, during the training process composed of the steps a) -g) and the pre-training composed of the steps h) -l) , the number of the training frames and the precisions of the CNNs may be dynamically changed according to a device applying the method. For example, if the device applying the method of the present disclosure has a limited computing resource, the number of the training frames and the precisions of the CNNs may be reduced; if the device applying the method of the present disclosure has a sufficient computing resource, the number of the training frames and the precisions of the CNNs may be increased. This may make the method of the present disclosure more suitable for devices with different computing resource.

In some embodiments, in order to make the frame clearer and remove some noises that may disturb the prediction, the method of the present disclosure may comprise pre-processing frames of the medical video data by at least one of scaling, brightness and contrast adjustment, color transformation, sharpening and blurring.

In some embodiments, in addition to output the medical video data labeled with the final prediction in real-time to the display device, it may also be sent a peripheral device, such as a general-purpose IO device, a wireless transceiver, a USB dongle or a peripheral storage, so as to adapt different applying scene. For example, the medical video data labeled with the final prediction may be sent to a peripheral storage for the purpose of backup, or the medical video data labeled with the final prediction may be sent to a wireless transceiver to be shared with the endoscopists in other place.

The present disclosure also provides an automatic diagnosis apparatus, the apparatus comprises a processor; and a memory coupled to the processor to store instructions executable by the processor to perform operations: receiving, in sequence, a predetermined number of frames of medical video data by a CNN group that comprises a plurality of CNNs connected in series; predicting, by the CNN group, a lesion prediction for each of the frames; and outputting the frames each of which is labeled with the lesion prediction, wherein, for each inputted frame, the predicting comprising: extracting, by a first CNN of the CNN group, feature from the inputted frame; determining, by the first CNN, a prediction for the inputted frame based on the extracted feature; extracting, by each of the CNNs after the first CNN, feature from input thereof, wherein the input to each CNN after the first CNN is generated by concatenating at least one latest output from a previous CNN, the output being at least one of the extracted feature and the determined prediction; and determining, by each CNN after the first CNN, a prediction for the inputted frame based on the extracted feature thereof, wherein a prediction from a last CNN of the CNN group is outputted as the lesion prediction for the inputted frame.

In one embodiment, each of the CNNs in the CNN group comprises a plurality of feature extractor layers connected in series and at least one predictor following the last one of the feature extractor layers, wherein, each of the feature extractor layers comprises a plurality of feature extractors connected in parallel, and each of the feature extractors extracts feature for input thereof, an input for each of the feature extractors in a first feature extractor layer is an input of the corresponding CNN, an input for each of the feature extractors in the feature extractor layers after the first feature extractor layer is a sum of features extracted by all of the feature extractors in a previous feature extractor layer, and an input of each predictor is a sum of features extracted by all of the feature extractors in a last feature extractor layer, and the predictor determines the prediction based on the input thereof.

In one embodiment, each of the feature extractors has a plurality of convolution layers and has at least one of a first structure, a second structure and a parallel structure, in the first structure, each of the convolution layers is connected to all of the following layers of the convolution layers; in the second structure, each of convolution layers is connected to next one layer or next two layers of the convolution layers; and in the parallel structure, the first structure and the second structure are connected in parallel.

In one embodiment, each of the CNNs in the CNN group comprises one predictor, and the prediction determined by the one predictor is a combination of multiple predictions used for different prediction tasks.

In one embodiment, each of the CNNs in the CNN group comprises a plurality of predictors, and the predictions determined by the plurality of predictors respectively used for different prediction tasks.

In one embodiment, the operations further comprising: training the CNN group by: a) inputting a predetermined number of training frames of medical video data to the CNN group; b) predicting, by the CNNs except for the last CNN in the CNN group, frame lesion prediction candidate for each of the training frames; c) comparing the frame lesion prediction candidate and ground truth for each of the training frames to obtain first training error for each of the training frames; d) predicting, by the last CNN in the CNN group, final frame lesion prediction candidate for each of the training frames, according to the concatenated outputs of the previous CNNs; e) comparing the final frame lesion prediction candidate and final ground truth for each of the training frames to obtain second training error for each of the training frames; f)backward-propagating the first and second training errors to the CNN group to adjust parameters of the CNNs; and g) repeating steps a) -f) until the first and second training errors converge.

In one embodiment, the backward-propagating comprises: backward-propagating a sum of the first and the second training errors to the CNN group to adjust the parameters of CNNs.

In one embodiment, the operations further comprising: pre-training the CNNs except for the last CNN in the CNN group by: h) inputting training images to the CNNs except for the last CNN in the CNN group; i) predicting, by the CNNs except for the last CNN in each of the CNN group, image prediction candidate for each of the inputted training images; j) comparing the image prediction candidate with ground truth for each of the inputted training images to obtain image error for each of the inputted training images; k) backward propagating the image error to the CNNs except for the last CNN and adjust parameters thereof; and l) repeating steps h) -k) until the image error converges.

In one embodiment, the pre-training is performed by inputting public non-medical images and public medical images respectively as the training images.

In one embodiment, the training medical video data is specific target medical video data.

In one embodiment, the final ground truth is the ground truth of the last training frame.

In one embodiment, each of the frame lesion prediction candidates and the final prediction candidate comprises at least one of lesion recognition, lesion detection, lesion localization, lesion segmentation and disease diagnosis.

In one embodiment, the ground truth for each of the training frames comprises at least one of presence of interested lesion, size of interested lesion, location of interested lesion, histology type of interested lesion, area of interested lesion and the diagnosis related to interested lesion.

In one embodiment, the number and the data precisions of the plurality of CNNs are dynamically changed.

In one embodiment, the operations further comprising: pre-processing frames of the medical video data by at least one of scaling, brightness and contrast adjustment, color transformation, sharpening and blurring.

In one embodiment, the operations further comprising: outputting the medical video data labeled with the lesion prediction in real-time to a peripheral device.

The system 900 may be a mobile terminal, a personal computer (PC) , a tablet computer, a server, etc. In Fig. 9, the system 900 includes one or more processors, a communication portion, etc. The one or more processors may be: one or more central processing units (CPUs) 901 and/or one or more image processor (GPUs) 913 and/or one or more domain specific deep learning accelerator (XPUs) , etc. The processor may perform various suitable actions and processes in accordance with executable instructions stored in the read-only memory (ROM) 902 or executable instructions loaded from the storage unit 908 into the random access memory (RAM) 903. The communication portion 912 may include, but is not limited to a network card and/or specific media receivers. The network card may include, but is not limited to an IB (Infiniband) network card. The specific media receivers may include, but is not limited to a high definition SDI image/video receiver. The processor may communicate with the read-only memory 902 and/or the RAM 903 to execute the executable instructions, connect to the communication portion 912 through the bus 904 and communicate with other target devices through the communication portion 912 to complete the corresponding step in the present application. In a specific example of the present application, the steps performed by the processor includes: receiving, in sequence, a predetermined number of frames of medical video data by a CNN group that comprises a plurality of CNNs connected in series; predicting, by the CNN group, a lesion prediction for each of the frames; and outputting the frames each of which is labeled with the lesion prediction, wherein, for each inputted frame, the predicting comprising: extracting, by a first CNN of the CNN group, feature from the inputted frame; determining, by the first CNN, a prediction for the inputted frame based on the extracted feature; extracting, by each of the CNNs after the first CNN, feature from input thereof, wherein the input to each CNN after the first CNN is generated by concatenating at least one latest output from a previous CNN, the output being at least one of the extracted feature and the determined prediction; and determining, by each CNN after the first CNN, a prediction for the inputted frame based on the extracted feature thereof, wherein a prediction from a last CNN of the CNN group is outputted as the lesion prediction for the inputted frame.

In addition, in the RAM 903, various programs and data required by operation of the apparatus may also be stored. The CPU 901, the ROM 902 and the RAM 903 are connected to each other through the bus 904. Where RAM 903 exists, the ROM 902 is an optional module. The RAM 903 stores executable instructions or writes executable instructions to the ROM 902 during operation, and the executable instructions cause the central processing unit 901 to perform the steps included in the method of any of the embodiments of the present application. The input/output (I/O) interface 905 is also connected to the bus 904. The communication portion 912 may be integrated, and may also be provided with a plurality of sub-modules (e.g., a plurality of IB network cards) and connected to the bus 904, respectively.

The following components are connected to the I/O interface 905: an input unit 906 including a keyboard, a mouse, and the like; an output unit 907 including such as a cathode ray tube (CRT) , a liquid crystal display (LCD) and a loudspeaker, and the like; a storage unit 908 including a hard disk, and the like; and a communication unit 909 including a network interface card such as a LAN card, a modem, and the like. The communication unit 909 performs communication processing via a network such as the Internet and/or an USB interface and/or a PCIE interface. A driver 910 also connects to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, is installed on the driver 910 as needed so that the computer programs read therefrom are installed in the storage unit 908 as needed.

It should be noted that the architecture shown in Fig. 9 is only an alternative implementation. During the specific practice process, the number and types of parts as shown in Fig. 9 may be selected, deleted, added or replaced according to actual needs. Upon setting different functional parts, implementations such as separate setting or integrated setting may also be adopted, for example, the GPU and the CPU may be set separately, and again for the same reason, the GPU may be integrated on the CPU, the communication portion may be set separately, and may also be set integrally on the CPU or GPU. These alternative implementations all fall within the protection scope of the present application.

In particular, according to the embodiments of the present application, the process described above with reference to the flowchart may be implemented as a computer software program, for example, the embodiments of the present application include a computer program product, which includes a computer program tangible included in a machine-readable medium. The computer program includes a program code for performing the steps shown in the flowchart. The program code may include corresponding instructions to perform correspondingly the steps in the method provided by any of the embodiments of the present application, including: receiving, in sequence, a predetermined number of frames of medical video data by a CNN group that comprises a plurality of CNNs connected in series; predicting, by the CNN group, a lesion prediction for each of the frames; and outputting the frames each of which is labeled with the lesion prediction, wherein, for each inputted frame, the predicting comprising: extracting, by a first CNN of the CNN group, feature from the inputted frame; determining, by the first CNN, a prediction for the inputted frame based on the extracted feature; extracting, by each of the CNNs after the first CNN, feature from input thereof, wherein the input to each CNN after the first CNN is generated by concatenating at least one latest output from a previous CNN, the output being at least one of the extracted feature and the determined prediction; and determining, by each CNN after the first CNN, a prediction for the inputted frame based on the extracted feature thereof, wherein a prediction from a last CNN of the CNN group is outputted as the lesion prediction for the inputted frame.

In such embodiments, the computer program may be downloaded and installed from the network through the communication unit 909, and/or installed from the removable medium 911. When the computer program is executed by the central processing unit (CPU) 901 and/or GPU 913 and/or XPU, the above-described instruction described in the present application is executed.

As will be appreciated by one skilled in the art, the disclosure may be embodied as a system, a method or an apparatus with domain specific hardware and computer program product. Accordingly, the disclosure may take the form of an entirely hardware embodiment and hardware aspects that may all generally be referred to herein as a “unit” , “circuit, ” “module” or “system. ” Much of the inventive functionality and many of the inventive principles when implemented, are best supported with or integrated circuits (ICs) , such as a digital signal processor and software therefore or application specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the disclosure, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments. In addition, the present invention may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc. ) or an embodiment combining software. For example, the system may comprise a memory that stores executable components and a processor, electrically coupled to the memory to execute the executable components to perform operations of the system, as discussed in reference to Figs. 1-8B. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.

The present disclosure also provides a non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations for automatic diagnosis, the operations comprising: receiving, in sequence, a predetermined number of frames of medical video data by a CNN group that comprises a plurality of CNNs connected in series; predicting, by the CNN group, a lesion prediction for each of the frames; and outputting the frames each of which is labeled with the lesion prediction, wherein, for each inputted frame, the predicting comprising: extracting, by a first CNN of the CNN group, feature from the inputted frame; determining, by the first CNN, a prediction for the inputted frame based on the extracted feature; extracting, by each of the CNNs after the first CNN, feature from input thereof, wherein the input to each CNN after the first CNN is generated by concatenating at least one latest output from a previous CNN, and the output from the each CNN is at least one of the extracted feature and the determined prediction; and determining, by each CNN after the first CNN, a prediction for the inputted frame based on the extracted feature thereof, wherein a prediction from a last CNN of the CNN group is outputted as the lesion prediction for the inputted frame.

Although the preferred examples of the disclosure have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims are intended to be considered as comprising the preferred examples and all the variations or modifications fell into the scope of the disclosure.

Obviously, those skilled in the art can make variations or modifications to the disclosure without departing the spirit and scope of the disclosure. As such, if these variations or modifications belong to the scope of the claims and equivalent technique, they may also fall into the scope of the disclosure.

Claims

A method for automatic diagnosis, comprising:

receiving, in sequence, a predetermined number of frames of medical video data by a CNN group that comprises a plurality of CNNs connected in series;

predicting, by the CNN group, a lesion prediction for each of the frames; and

outputting the frames each of which is labeled with the lesion prediction,

wherein, for each inputted frame, the predicting comprising:

extracting, by a first CNN of the CNN group, feature from the inputted frame;

determining, by the first CNN, a prediction for the inputted frame based on the extracted feature;

extracting, by each of the CNNs after the first CNN, feature from input thereof, wherein the input to each CNN after the first CNN is generated by concatenating at least one latest output from a previous CNN, and the output from the each CNN is at least one of the extracted feature and the determined prediction; and

determining, by each CNN after the first CNN, a prediction for the inputted frame based on the extracted feature thereof, wherein a prediction from a last CNN of the CNN group is outputted as the lesion prediction for the inputted frame.
The method of claim 1, wherein each of the CNNs in the CNN group comprises a plurality of feature extractor layers connected in series and at least one predictor following the last one of the feature extractor layers,

wherein, each of the feature extractor layers comprises a plurality of feature extractors connected in parallel, and each of the feature extractors extracts feature for input thereof,

an input for each of the feature extractors in a first feature extractor layer is an input of the corresponding CNN,

an input for each of the feature extractors in the feature extractor layers after the first feature extractor layer is a sum of features extracted by all of the feature extractors in a previous feature extractor layer, and

an input of each predictor is a sum of features extracted by all of the feature extractors in a last feature extractor layer, and the predictor determines the prediction based on the input thereof.
The method of claim 2, wherein each of the feature extractors has a plurality of convolution layers and has at least one of a first structure, a second structure and a parallel structure,

in the first structure, each of the convolution layers is connected to all of the following layers of the convolution layers;

in the second structure, each of convolution layers is connected to next one layer or next two layers of the convolution layers; and

in the parallel structure, the first structure and the second structure are connected in parallel.
The method of claim 2, wherein each of the CNNs in the CNN group comprises one predictor, and the prediction determined by the one predictor is a combination of multiple predictions used for different prediction tasks.
The method of claim 2, wherein each of the CNNs in the CNN group comprises a plurality of predictors, and the predictions determined by the plurality of predictors respectively used for different prediction tasks.
The method of claim 2, further comprising:

training the CNN group by:

a) inputting a predetermined number of training frames of medical video data to the CNN group;

b) predicting, by the CNNs except for the last CNN in the CNN group, frame lesion prediction candidate for each of the training frames;

c) comparing the frame lesion prediction candidate and ground truth for each of the training frames to obtain first training error for each of the training frames;

d) predicting, by the last CNN in the CNN group, final frame lesion prediction candidate for each of the training frames, according to the concatenated outputs of the previous CNNs;

e) comparing the final frame lesion prediction candidate and final ground truth for each of the training frames to obtain second training error for each of the training frames;

f) backward-propagating the first and second training errors to the CNN group to adjust parameters of the CNNs; and

g) repeating steps a) -f) until the first and second training errors converge.
The method of claim 6, wherein the backward-propagating comprises:

backward-propagating a sum of the first and the second training errors to the CNN group to adjust the parameters of CNNs.
The method of claim 2, further comprising:

pre-training the CNNs except for the last CNN in the CNN group by:

h) inputting training images to the CNNs except for the last CNN in the CNN group;

i) predicting, by the CNNs except for the last CNN in each of the CNN group, image prediction candidate for each of the inputted training images;

j) comparing the image prediction candidate with ground truth for each of the inputted training images to obtain image error for each of the inputted training images;

k) backward propagating the image error to the CNNs except for the last CNN and adjust parameters thereof; and

l) repeating steps h) -k) until the image error converges.
The method of claim 8, wherein the pre-training is performed by inputting public non-medical images and public medical images respectively as the training images.
The method of claim 6, wherein the training medical video data is specific target medical video data.
The method of claim 6, wherein the final ground truth is the ground truth of the last training frame.
The method of claim 6, wherein each of the frame lesion prediction candidates and the final prediction candidate comprises at least one of lesion recognition, lesion detection, lesion localization, lesion segmentation and disease diagnosis.
The method of claim 6, wherein the ground truth for each of the training frames comprises at least one of presence of interested lesion, size of interested lesion, location of interested lesion, histology type of interested lesion, area of interested lesion and the diagnosis related to interested lesion.
The method of claim 6, wherein the number and the data precisions of the plurality of CNNs are dynamically changed according to a device applying the method.
The method of claim 1, further comprising:

pre-processing the frames of the medical video data by at least one of scaling, brightness and contrast adjustment, color transformation, sharpening and blurring.
The method of claim 1, further comprising:

outputting the medical video data labeled with the lesion prediction in real-time to a peripheral device.
An automatic diagnosis apparatus, comprising:

a processor; and

a memory coupled to the processor to store instructions executable by the processor to build a CNN group and perform operations:

receiving, in sequence, a predetermined number of frames of medical video data by the CNN group that comprises a plurality of CNNs connected in series;

predicting, by the CNN group, a lesion prediction for each of the frames; and

outputting the frames each of which is labeled with the lesion prediction,

wherein, for each inputted frame, the predicting comprising:

extracting, by a first CNN of the CNN group, feature from the inputted frame;

determining, by the first CNN, a prediction for the inputted frame based on the extracted feature;

extracting, by each of the CNNs after the first CNN, feature from input thereof, wherein the input to each CNN after the first CNN is generated by concatenating at least one latest output from a previous CNN, the output being at least one of the extracted feature and the determined prediction; and

determining, by each CNN after the first CNN, a prediction for the inputted frame based on the extracted feature thereof, wherein a prediction from a last CNN of the CNN group is outputted as the lesion prediction for the inputted frame.
The apparatus of claim 17, wherein each of the CNNs in the CNN group comprises a plurality of feature extractor layers connected in series and at least one predictor following the last one of the feature extractor layers,

wherein, each of the feature extractor layers comprises a plurality of feature extractors connected in parallel, and each of the feature extractors extracts feature for input thereof,

an input for each of the feature extractors in a first feature extractor layer is an input of the corresponding CNN,

an input for each of the feature extractors in the feature extractor layers after the first feature extractor layer is a sum of features extracted by all of the feature extractors in a previous feature extractor layer, and

an input of each predictor is a sum of features extracted by all of the feature extractors in a last feature extractor layer, and the predictor determines the prediction based on the input thereof.
The apparatus of claim 18, wherein each of the feature extractors has a plurality of convolution layers and has at least one of a first structure, a second structure and a parallel structure,

in the first structure, each of the convolution layers is connected to all of the following layers of the convolution layers;

in the second structure, each of convolution layers is connected to next one layer or next two layers of the convolution layers; and

in the parallel structure, the first structure and the second structure are connected in parallel.
The apparatus of claim 18, wherein each of the CNNs in the CNN group comprises one predictor, and the prediction determined by the one predictor is a combination of multiple predictions used for different prediction tasks.
The apparatus of claim 18, wherein each of the CNNs in the CNN group comprises a plurality of predictors, and the predictions determined by the plurality of predictors respectively used for different prediction tasks.
The apparatus of claim 18, the operations further comprising:

training the CNN group by:

a) inputting a predetermined number of training frames of medical video data to the CNN group;

b) predicting, by the CNNs except for the last CNN in the CNN group, frame lesion prediction candidate for each of the training frames;

c) comparing the frame lesion prediction candidate and ground truth for each of the training frames to obtain first training error for each of the training frames;

d) predicting, by the last CNN in the CNN group, final frame lesion prediction candidate for each of the training frames, according to the concatenated outputs of the previous CNNs;

e) comparing the final frame lesion prediction candidate and final ground truth for each of the training frames to obtain second training error for each of the training frames;

f) backward-propagating the first and second training errors to the CNN group to adjust parameters of the CNNs; and

g) repeating steps a) -f) until the first and second training errors converge.
The apparatus of claim 22, wherein the backward-propagating comprises:

backward-propagating a sum of the first and the second training errors to the CNN group to adjust the parameters of CNNs.
The apparatus of claim 18, the operations further comprising:

pre-training the CNNs except for the last CNN in the CNN group by:

h) inputting training images to the CNNs except for the last CNN in the CNN group;

i) predicting, by the CNNs except for the last CNN in each of the CNN group, image prediction candidate for each of the inputted training images;

j) comparing the image prediction candidate with ground truth for each of the inputted training images to obtain image error for each of the inputted training images;

k) backward propagating the image error to the CNNs except for the last CNN and adjust parameters thereof; and

l) repeating steps h) -k) until the image error converges.
The apparatus of claim 24, wherein the pre-training is performed by inputting public non-medical images and public medical images respectively as the training images.
The apparatus of claim 22, wherein the training medical video data is specific target medical video data.
The apparatus of claim 22, wherein the final ground truth is the ground truth of the last training frame.
The apparatus of claim 22, wherein each of the frame lesion prediction candidates and the final prediction candidate comprises at least one of lesion recognition, lesion detection, lesion localization, lesion segmentation and disease diagnosis.
The apparatus of claim 22, wherein the ground truth for each of the training frames comprises at least one of presence of interested lesion, size of interested lesion, location of interested lesion, histology type of interested lesion, area of interested lesion and the diagnosis related to interested lesion.
The apparatus of claim 22, wherein the number and the data precisions of the plurality of CNNs are dynamically changed.
The apparatus of claim 17, wherein the operations further comprise:

pre-processing frames of the medical video data by at least one of scaling, brightness and contrast adjustment, color transformation, sharpening and blurring.
The apparatus of claim 17, wherein the operations further comprise:

outputting the medical video data labeled with the lesion prediction in real-time to a peripheral device.
An automatic diagnosis system, comprising:

an endoscopy obtaining endoscopy data;

a diagnosis apparatus receiving the endoscopy data and comprising:

a processor; and

a memory coupled to the processor to store instructions executable by the processor to build a CNN group and perform operations:

receiving, in sequence, a predetermined number of frames of medical video data by the CNN group that comprises a plurality of CNNs connected in series;

predicting, by the CNN group, a lesion prediction for each of the frames; and

outputting the frames each of which is labeled with the lesion prediction,

wherein, for each inputted frame, the predicting comprising:

extracting, by a first CNN of the CNN group, feature from the inputted frame;

determining, by the first CNN, a prediction for the inputted frame based on the extracted feature;

extracting, by each of the CNNs after the first CNN, feature from input thereof, wherein the input to each CNN after the first CNN is generated by concatenating at least one latest output from a previous CNN, the output being at least one of the extracted feature and the determined prediction; and

determining, by each CNN after the first CNN, a prediction for the inputted frame based on the extracted feature thereof, wherein a prediction from a last CNN of the CNN group is outputted as the lesion prediction for the inputted frame.
A non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations for automatic diagnosis, the operations comprising:

receiving, in sequence, a predetermined number of frames of medical video data by a CNN group that comprises a plurality of CNNs connected in series;

predicting, by the CNN group, a lesion prediction for each of the frames; and

outputting the frames each of which is labeled with the lesion prediction,

wherein, for each inputted frame, the predicting comprising:

extracting, by a first CNN of the CNN group, feature from the inputted frame;

determining, by the first CNN, a prediction for the inputted frame based on the extracted feature;

extracting, by each of the CNNs after the first CNN, feature from input thereof, wherein the input to each CNN after the first CNN is generated by concatenating at least one latest output from a previous CNN, and the output from the each CNN is at least one of the extracted feature and the determined prediction; and

determining, by each CNN after the first CNN, a prediction for the inputted frame based on the extracted feature thereof, wherein a prediction from a last CNN of the CNN group is outputted as the lesion prediction for the inputted frame.