CN112419246B - Depth detection network for quantifying esophageal mucosa IPCLs blood vessel morphological distribution - Google Patents
- Publication number
- CN112419246B (application CN202011263459.2A)
- Authority
- CN
- China
- Prior art keywords
- network
- region
- cancer
- channels
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 49
- 210000004877 mucosa Anatomy 0.000 title claims abstract description 6
- 210000004204 blood vessel Anatomy 0.000 title claims description 25
- 230000000877 morphologic effect Effects 0.000 title claims description 3
- 206010028980 Neoplasm Diseases 0.000 claims abstract description 44
- 201000011510 cancer Diseases 0.000 claims abstract description 44
- 238000003384 imaging method Methods 0.000 claims abstract description 31
- 238000003745 diagnosis Methods 0.000 claims abstract description 27
- 230000003902 lesion Effects 0.000 claims abstract description 13
- 238000000605 extraction Methods 0.000 claims abstract description 12
- 238000011176 pooling Methods 0.000 claims abstract description 11
- 238000012800 visualization Methods 0.000 claims abstract description 8
- 239000003086 colorant Substances 0.000 claims abstract description 7
- 238000012549 training Methods 0.000 claims description 19
- 230000006870 function Effects 0.000 claims description 8
- 238000010586 diagram Methods 0.000 claims description 6
- 238000004422 calculation algorithm Methods 0.000 claims description 4
- 239000000284 extract Substances 0.000 claims description 4
- 238000005457 optimization Methods 0.000 claims description 4
- 238000002474 experimental method Methods 0.000 claims description 3
- 230000000644 propagated effect Effects 0.000 claims 1
- 238000005070 sampling Methods 0.000 claims 1
- 230000001629 suppression Effects 0.000 claims 1
- 238000000034 method Methods 0.000 abstract description 7
- 206010041823 squamous cell carcinoma Diseases 0.000 abstract description 5
- 230000002792 vascular Effects 0.000 abstract description 4
- 238000012545 processing Methods 0.000 abstract description 2
- 238000013527 convolutional neural network Methods 0.000 description 8
- 230000000694 effects Effects 0.000 description 6
- 230000008595 infiltration Effects 0.000 description 6
- 238000001764 infiltration Methods 0.000 description 6
- 230000035945 sensitivity Effects 0.000 description 4
- 238000012360 testing method Methods 0.000 description 4
- 230000000007 visual effect Effects 0.000 description 4
- 230000002159 abnormal effect Effects 0.000 description 3
- 230000011218 segmentation Effects 0.000 description 3
- 206010017993 Gastrointestinal neoplasms Diseases 0.000 description 2
- 230000002776 aggregation Effects 0.000 description 2
- 238000004220 aggregation Methods 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 230000010365 information processing Effects 0.000 description 2
- 230000001537 neural effect Effects 0.000 description 2
- 238000003909 pattern recognition Methods 0.000 description 2
- 238000012216 screening Methods 0.000 description 2
- 238000001356 surgical procedure Methods 0.000 description 2
- 208000000461 Esophageal Neoplasms Diseases 0.000 description 1
- 206010030155 Oesophageal carcinoma Diseases 0.000 description 1
- 208000005718 Stomach Neoplasms Diseases 0.000 description 1
- 208000003464 asthenopia Diseases 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000003759 clinical diagnosis Methods 0.000 description 1
- 238000004195 computer-aided diagnosis Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000010339 dilation Effects 0.000 description 1
- 201000004101 esophageal cancer Diseases 0.000 description 1
- 230000004927 fusion Effects 0.000 description 1
- 206010017758 gastric cancer Diseases 0.000 description 1
- 230000002496 gastric effect Effects 0.000 description 1
- 230000036541 health Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 230000009854 mucosal lesion Effects 0.000 description 1
- 230000001575 pathological effect Effects 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 238000004393 prognosis Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000000717 retained effect Effects 0.000 description 1
- 201000011549 stomach cancer Diseases 0.000 description 1
- 230000004083 survival effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
- G06T7/0012—Biomedical image inspection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10068—Endoscopic image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20092—Interactive image processing based on input by user
- G06T2207/20104—Interactive definition of region of interest [ROI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
- G06T2207/30096—Tumor; Lesion
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Informatics (AREA)
- Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
- Radiology & Medical Imaging (AREA)
- Quality & Reliability (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the technical field of medical image processing and specifically relates to a depth detection network for quantifying the vascular morphology distribution of esophageal mucosa IPCLs. The network comprises a feature extraction network, a feature pyramid, a region candidate network, a cancer-focus classification network with region-of-interest pooling and a self-embedded cluster-distribution prior, and a system for visualization on narrow-band imaging endoscopic images. The feature extraction network extracts a feature map from the input image; the feature pyramid fuses features of different scales; the region candidate network proposes possible lesion regions; region-of-interest pooling pools the features of suspicious lesion regions; the classification network with the self-embedded cluster-distribution prior classifies the cancer foci; finally, the results are visualized on the narrow-band imaging endoscopic image, with cancer foci framed and marked in different colors. The invention can detect and diagnose early esophageal squamous cell carcinoma foci in the image, effectively improving diagnostic efficiency and helping doctors reach higher diagnostic accuracy.
Description
Technical Field
The invention belongs to the technical field of medical image processing, and particularly relates to a depth detection network for quantifying vascular morphology distribution of esophageal mucosa IPCLs.
Background
The prognosis of esophageal cancer and gastric cancer is poor, with 5-year relative survival rates of only 20.9% and 27.4%, respectively, placing a serious burden on health care [11,13-14]. Standardized screening, treatment and follow-up of upper gastrointestinal cancer are effective means of reducing cancer morbidity and mortality, and narrow-band imaging (NBI) endoscopic screening is the first-line means of finding upper gastrointestinal cancer. Under a narrow-band imaging endoscope, the pathological type and infiltration depth of an esophageal mucosal lesion are judged mainly from the distinctive vascular morphology of the intrapapillary capillary loops (IPCLs).
According to the typing standard proposed by Inoue and Arima [15], IPCL vessels can generally be classified into types A, B1, B2 and B3. Type A means no abnormal blood vessels are observed; type B1 means loop-shaped abnormal vessels are observed, dilated, serpentine, of varying caliber and non-uniform shape, 20-30 μm in diameter, with an infiltration depth of M1-M2; type B2 means non-loop vessels are observed, irregularly dendritic or multi-layered, with an infiltration depth of M3-SM1; type B3 means large green vessels are observed, highly dilated, with an infiltration depth of SM2.
The type, number and distribution of IPCL vessels play an important guiding role in clinical treatment decisions. For example, a large aggregation of IPCLs with deep infiltration may suggest that the esophageal lesion has entered the middle or late stage, making it unsuitable for minimally invasive treatment or even surgery; conversely, if IPCLs with deeper infiltration are scattered, the patient may still have an opportunity for surgery.
Clinically, observation of IPCLs is strongly affected by subjective human factors because, unlike conventional gastrointestinal endoscopic imaging, it requires magnifying the lesion surface 10-50 times with a magnifying gastroscope in NBI mode. As with a microscope, the doctor sees close to 200 fine structures per field of view in zoom mode. Under these conditions a clinician must observe all structures and easily develops visual fatigue; with limited clinical experience, after observing 5-10 fields of view the clinician remembers only the particularly striking parts, lacks an objective and quantifiable picture, and can easily misjudge the condition and make erroneous medical decisions.
This work frees clinicians from the influence of such subjective factors (fatigue, oversights, insufficient experience caused by large amounts of fine observation): the clinician only needs to magnify the lesion, and computer analysis yields IPCL predictions for all fields of view, including the number, proportion and aggregation of each vessel type, helping the clinician judge the lesion more accurately.
Deep convolutional neural networks are a machine-learning technology that can effectively avoid human factors and automatically learn to extract rich, representative visual features from large amounts of annotated data. The technology uses the back-propagation optimization algorithm, by which the machine updates its internal parameters and learns the mapping from input image to label. In recent years, deep convolutional neural networks have greatly improved performance on various computer-vision tasks.
In 2012, Krizhevsky et al. [1] first applied a deep convolutional neural network in the ImageNet [2] image classification competition and won with a Top-5 error rate of 15.3%, setting off the wave of deep learning. In 2015, Simonyan et al. [3] proposed the 16- and 19-layer neural networks VGG-16 and VGG-19, increasing the number of network parameters and further improving results on the ImageNet classification task. In 2016, He et al. [4] used the 152-layer residual network ResNet to achieve classification performance exceeding that of the human eye.
Deep convolutional neural networks perform excellently not only on image classification but also on structured-output tasks such as object detection [5-7] and semantic segmentation [8,9]. Applied to computer-aided diagnosis, they can assist doctors in making better medical diagnoses, enabling early detection and early treatment and improving treatment outcomes.
The invention provides a detection network with a self-embedded cluster-distribution prior, which fully mines the latent cluster-distribution prior of cancer foci, extracts rich features, and simultaneously performs detection and diagnosis of early esophageal squamous cell carcinoma foci.
Disclosure of Invention
The invention aims to provide a depth detection network with a self-embedded cluster-distribution prior for quantifying the vascular morphology distribution of esophageal mucosa IPCLs, which eliminates the influence of human factors and achieves automatic diagnosis of narrow-band imaging endoscopic images.
The invention provides a detection network with a self-embedded cluster-distribution prior, based on an object-detection neural network, specifically comprising: a feature extraction backbone network, a feature pyramid network, a region candidate network, a cancer-focus classification network with region-of-interest pooling and a self-embedded cluster-distribution prior, and an auxiliary diagnosis system for visualization on narrow-band imaging endoscopic images; wherein:
(1) The feature extraction backbone network is built on ResNet-50 [4] and comprises 50 convolutional layers for extracting feature maps of the input image (i.e., it serves as the feature extractor of the feature pyramid). Specifically, feature maps are extracted at the ends of stages 1, 2, 3 and 4 of the ResNet-50 model; the extracted feature maps have 256, 512, 1024 and 2048 channels, and their sizes are 1/4, 1/8, 1/16 and 1/32 of the original image, respectively. The feature maps are fed into the feature pyramid network [12].
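For illustration, the stage-wise extraction just described can be sketched as follows. This is a sketch only: PyTorch/torchvision and the input size are assumptions, since the patent names no framework.

```python
import torch
import torchvision

# ResNet-50 with randomly initialized parameters, as in the patent's training setup
resnet = torchvision.models.resnet50(weights=None)

def extract_features(x):
    x = resnet.maxpool(resnet.relu(resnet.bn1(resnet.conv1(x))))  # stem, 1/4 resolution
    c2 = resnet.layer1(x)   # 256 channels,  1/4 of the input size
    c3 = resnet.layer2(c2)  # 512 channels,  1/8
    c4 = resnet.layer3(c3)  # 1024 channels, 1/16
    c5 = resnet.layer4(c4)  # 2048 channels, 1/32
    return c2, c3, c4, c5

if __name__ == "__main__":
    for f in extract_features(torch.randn(1, 3, 800, 1216)):
        print(tuple(f.shape))  # (1,256,200,304) (1,512,100,152) (1,1024,50,76) (1,2048,25,38)
```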
(2) The feature pyramid network fuses features of different scales. All feature maps are first unified to 256 channels with 1×1 convolutions; then, from top to bottom, the upper-layer features are upsampled to twice their size layer by layer, added to the lower-layer features, and passed through a 3×3 convolution. This yields a multi-scale feature map: 1/4, 1/8, 1/16 and 1/32 of the original image size, each with 256 channels.
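A minimal sketch of this top-down fusion, assuming nearest-neighbour 2× upsampling (the patent only says the upper features are upsampled to twice their size); `SimpleFPN` is an illustrative name:

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 convolutions unify every feature map to 256 channels
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convolutions smooth each fused map
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                      # feats = (c2, c3, c4, c5)
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):  # top-down: upsample and add
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], scale_factor=2, mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]  # P2..P5, all 256 channels
```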
(3) The region candidate network extracts possible lesion regions. First, an anchor generator [5] produces dense rectangular candidate boxes; the candidate boxes come in 5×3 different configurations, combining five sizes (e.g., widths of 32, 64, 128, 256, 512) with three aspect ratios (e.g., 1:1, 1:2, 2:1). The features of each pyramid level pass through a 3×3 convolution and a 1×1 convolution, and Softmax judges whether each candidate box is a positive or a negative sample; finally, a 1×1 convolution with 12 channels performs bounding-box regression for the three shapes (4 coordinates per box, so 3 × 4 = 12 channels), correcting inaccurate candidate boxes.
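The anchor grid and the two convolutional branches can be sketched as follows; `make_anchors` and `RPNHead` are illustrative names, and the aspect-ratio convention (r = w/h) is an assumption:

```python
import torch
import torch.nn as nn

def make_anchors(sizes=(32, 64, 128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """15 base anchors: five sizes x three aspect ratios, centred at the origin."""
    anchors = []
    for s in sizes:
        for r in ratios:
            w, h = s * r ** 0.5, s / r ** 0.5  # keep area ~ s*s while varying shape
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return torch.tensor(anchors)  # tiled over every cell of every pyramid level

class RPNHead(nn.Module):
    def __init__(self, in_ch=256, num_shapes=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch, 3, padding=1)
        self.cls = nn.Conv2d(in_ch, num_shapes * 2, 1)  # positive/negative per shape (Softmax)
        self.reg = nn.Conv2d(in_ch, num_shapes * 4, 1)  # 3 shapes x 4 coords = 12 channels

    def forward(self, x):
        x = self.conv(x).relu()
        return self.cls(x), self.reg(x)
```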
(4) The cancer-focus classification network with region-of-interest pooling and a self-embedded cluster-distribution prior. Region-of-interest pooling pools the features of suspicious lesion regions, and the classification network with the self-embedded cluster-distribution prior classifies the cancer foci. Specifically, each region of interest is framed with a rectangular bounding box parallel to the coordinate axes and given a cancer-focus classification: a normal region (type A) or a lesion region (types B1, B2, B3). The network first extracts regions of interest from the appropriate levels of the feature pyramid, aligns them, and max-pools them to 7×7, so that each region of interest corresponds to a feature of size 256×7×7. The features of each region of interest are then stacked with the features of its K nearest neighbors (i.e., the feature channels are concatenated) into a feature map of shape (256×K)×7×7, which lets the classification network exploit the latent distribution prior of the cancer foci. Two output branches are then produced through fully connected layers: the first branch outputs the position offset of each feature region, further correcting the position of the detection box; the second branch computes the classification probabilities of the features through a Softmax function, yielding the cancer-focus category of the region. The fully connected layer flattens the (256×K)×7×7 feature map into a (12544×K)×1×1 feature and outputs 1024 channels; the first branch outputs 20 channels, i.e., four bounding-box coordinates for each of the 5 categories (5×4 = 20), and the second branch outputs 5 channels, i.e., 5 categories including the negative sample.
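A minimal sketch of this head, assuming that neighbours are chosen by box-centre distance and that the region of interest itself counts among the K grouped features (the patent fixes neither detail); `ClusterPriorHead` is an illustrative name, and `roi_align` (average sampling) stands in for the alignment-plus-pooling step described above:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ClusterPriorHead(nn.Module):
    def __init__(self, k=4, num_classes=5, channels=256):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(channels * k * 7 * 7, 1024)  # flatten (256*K)x7x7 -> 12544*K
        self.cls = nn.Linear(1024, num_classes)          # 5 classes incl. negative (Softmax)
        self.reg = nn.Linear(1024, num_classes * 4)      # 20 = 5 categories x 4 offsets

    def forward(self, feat, boxes, stride=4):
        # feat: (1, 256, H, W), one pyramid level; boxes: (N, 4) xyxy in image coordinates
        rois = torch.cat([boxes.new_zeros(len(boxes), 1), boxes], dim=1)  # prepend batch idx
        f = roi_align(feat, rois, output_size=7, spatial_scale=1.0 / stride)  # (N, 256, 7, 7)
        centers = (boxes[:, :2] + boxes[:, 2:]) / 2
        knn = torch.cdist(centers, centers).topk(self.k, largest=False).indices  # (N, K)
        grouped = f[knn].flatten(start_dim=2).flatten(start_dim=1)  # (N, 12544*K)
        h = self.fc(grouped).relu()
        return self.cls(h), self.reg(h)

head = ClusterPriorHead(k=4)
feat = torch.randn(1, 256, 200, 304)  # a 1/4-scale pyramid level
boxes = torch.tensor([[10., 10., 60., 60.], [30., 20., 90., 80.],
                      [200., 150., 260., 210.], [220., 160., 300., 240.],
                      [400., 300., 460., 380.]])
cls_logits, box_offsets = head(feat, boxes)  # shapes (5, 5) and (5, 20)
```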
(5) The auxiliary diagnosis system for visualization on narrow-band imaging endoscopic images displays the narrow-band imaging endoscopic image and frames and marks cancer foci in different colors. Specifically, the input is a narrow-band imaging endoscopic image; the network detects and diagnoses the cancer foci, and detection boxes of different colors represent the different cancer-focus types: green, red, purple and black for A, B1, B2 and B3, respectively, each box labeled with its classification confidence. The confidences of all detection boxes are then screened: every detection box with confidence below a threshold T1 is removed, after which non-maximum suppression removes redundant overlapping boxes whose intersection-over-union exceeds a threshold T2. T1 and T2 are swept over all values in [0, 1] with a step of 0.05, and the optimal thresholds T1, T2 are determined by comparing F1-scores.
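The threshold search can be sketched as follows; `evaluate_f1` is a hypothetical helper, not part of the patent text, that runs the detector at the given thresholds on annotated images and returns the F1-score:

```python
def search_thresholds(evaluate_f1, step=0.05):
    # all values in [0, 1] with step 0.05, as specified above
    grid = [round(i * step, 2) for i in range(int(1 / step) + 1)]
    best = max((evaluate_f1(t1, t2), t1, t2) for t1 in grid for t2 in grid)
    return best[1], best[2]  # the optimal thresholds T1, T2 by F1-score
```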
The training method of the network model comprises the following steps:
before training, network parameters of the ResNet-50 model are initialized randomly, images in a training set are scaled, the resolution of the images is not more than 800 x 1333, and corresponding bounding boxes are scaled at the same time.
During training, the three channels (R, G, B) of each image are first normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225]. The Adam optimization algorithm [16] is used with an initial learning rate of 10⁻⁴, the two exponential decay rates for the moment estimates set to β1 = 0.9 and β2 = 0.999, and weight decay 0; a mini-batch stochastic gradient descent strategy with batch size 8 is used to minimize the loss function, and training runs for N rounds. Because the vessel types in the training set are unevenly distributed, types B2 and B3 would otherwise be insufficiently trained, so Focal loss is used as the loss function of the cancer-focus classification network, with the weights of the negative sample and of types A, B1, B2 and B3 set to C1, C2, C3, C4 and C5, respectively, determined from the distribution of each vessel type in the training set after several experiments.
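A minimal sketch of this training configuration, again assuming PyTorch; the stand-in model, the focal-loss weights and the focusing parameter γ are placeholders (the patent determines C1-C5 experimentally and does not state γ):

```python
import torch
import torchvision.transforms as T

# channel-wise normalization with the mean/std given above; apply as img = normalize(img)
normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

model = torch.nn.Linear(256, 5)  # stand-in for the full detection network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), weight_decay=0)  # as specified above

# placeholder weights C1..C5 for (negative, A, B1, B2, B3); tuned experimentally
class_weights = torch.tensor([1.0, 1.0, 1.0, 2.0, 2.0])

def focal_loss(logits, targets, gamma=2.0):  # gamma=2 is a common default, an assumption
    # class-weighted focal loss: down-weights easy examples so rare B2/B3 still train
    log_p = torch.log_softmax(logits, dim=-1).gather(1, targets[:, None]).squeeze(1)
    return (-class_weights[targets] * (1 - log_p.exp()) ** gamma * log_p).mean()
```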
In the invention, after a narrow-band imaging endoscopic image is input, the cancer-focus detection and diagnosis result is obtained with only one forward pass.
The invention has the beneficial effects that:
the invention designs a cluster distribution prior self-embedded detection network, which takes a narrow-band imaging endoscope image as input and simultaneously realizes the cancer focus detection and diagnosis of early esophageal squamous carcinoma. The image to be tested can obtain detection and diagnosis results only through one-time forward propagation, and detection and classification tasks share part of network parameters, so that the calculation amount is effectively reduced, and the diagnosis efficiency is improved. Experimental results show that the method can accurately detect the cancer focus area of early esophageal squamous carcinoma, provide an accurate diagnosis result based on the detection frame, reduce the influence of human factors and improve the efficiency and accuracy of clinical diagnosis.
Drawings
FIG. 1 is a network framework diagram of the present invention.
FIG. 2 is a schematic diagram of the detection and diagnosis effect of the invention after a narrow-band imaging endoscopic image is input into the network model: (a) the narrow-band imaging endoscopic image; (b) the result of detecting and classifying cancer foci in the image with the method; (c) the result of detecting and classifying cancer foci in the image by experienced doctors.
FIG. 3 is a comparison of the visualized detection and diagnosis of the invention and of a doctor on a narrow-band imaging endoscopic image.
FIG. 4 is a comparison of the recall of the invention and of a doctor for the different classifications of detection and diagnosis on narrow-band imaging endoscopic images.
FIG. 5 shows feature maps produced by the feature extraction network.
Detailed Description
The embodiments of the present invention are described in detail below, but the scope of the present invention is not limited to the examples.
The invention adopts the network framework shown in FIG. 1 and is trained on 144 narrow-band imaging endoscopic images annotated cooperatively by several highly experienced doctors, yielding a model that automatically detects and diagnoses esophageal squamous cell carcinoma foci on narrow-band imaging endoscopic images. The specific process comprises the following steps:
(1) Before training, the network parameters of the ResNet-50 model are randomly initialized, and the images in the training set are scaled so that their resolution does not exceed 800×1333, the corresponding bounding boxes being scaled in the same proportion.
(2) During training, the three channels (R, G, B) of each image are first normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225]; the Adam optimization algorithm [16] is used with an initial learning rate of 10⁻⁴, β1 = 0.9, β2 = 0.999 and weight decay 0; a mini-batch stochastic gradient descent strategy with batch size 8 is used to minimize the loss function, and training runs for N rounds. Because the vessel types in the training set are unevenly distributed, types B2 and B3 would otherwise be insufficiently trained, so Focal loss is used as the loss function of the cancer-focus classification network, with the weights of the negative sample and of types A, B1, B2 and B3 set to C1, C2, C3, C4 and C5, respectively, determined from the distribution of each vessel type in the training set after several experiments.
(3) During testing, the narrow-band imaging endoscopic image is scaled so that its resolution does not exceed 800×1333 and input into the trained model, which outputs the bounding boxes of all detected vessels, the corresponding cancer-focus categories (the normal category A and the abnormal categories B1, B2 and B3, four in total) and the confidence p of each category. Since a narrow-band imaging endoscopic image contains many vessels, the upper limit on the number of detection boxes per image is set to 250. The threshold T1 is set to 0.3: when p > 0.3 the bounding box is retained, otherwise it is removed. The threshold T2 is set to 0.3, and non-maximum suppression is applied to the remaining bounding boxes, keeping within each neighborhood (intersection-over-union greater than T2) only the box with the highest confidence p.
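This test-time filtering can be sketched as follows, assuming the model's raw boxes, labels and confidences p are already available as tensors:

```python
import torch
from torchvision.ops import nms

MAX_BOXES, T1, T2 = 250, 0.3, 0.3  # limits and thresholds used in this embodiment

def postprocess(boxes, labels, scores):
    # boxes: (N, 4) xyxy; labels: (N,) categories; scores: (N,) confidences p
    order = scores.argsort(descending=True)[:MAX_BOXES]  # at most 250 boxes per image
    boxes, labels, scores = boxes[order], labels[order], scores[order]
    keep = scores > T1                                   # retain only boxes with p > 0.3
    boxes, labels, scores = boxes[keep], labels[keep], scores[keep]
    keep = nms(boxes, scores, iou_threshold=T2)          # keep highest-p box per overlap group
    return boxes[keep], labels[keep], scores[keep]
```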
FIG. 2 illustrates the detection and diagnosis effect of the invention after a narrow-band imaging endoscopic image is input into the network model: (a) the original narrow-band imaging endoscopic image; (b) the bounding boxes obtained by detecting cancer foci in the image, with the corresponding classifications and confidences, where different colors represent the different cancer-focus types, i.e., green, red, purple and white represent A, B1, B2 and B3, respectively; (c) the consensus result of cancer-focus detection and classification by several doctors with many years of clinical practice and rich experience. The figure shows that the system's results are essentially consistent with the joint judgment of several experienced doctors in detecting and classifying cancer foci, demonstrating the strong practical value of the invention.
FIG. 3 compares the visualized detection and diagnosis of the invention with that of a single doctor on narrow-band imaging endoscopic images, where the reference standard for detection and diagnosis was annotated cooperatively by several highly experienced doctors. A single doctor inevitably makes mistakes and omissions and cannot reach high sensitivity, whereas the system of the invention is not only faster (under 1 second per image) but also more accurate than a single doctor.
FIG. 4 compares the per-category recall of the invention with that of a single doctor on narrow-band imaging endoscopic images, using the same cooperatively annotated reference standard. The overall recall of the invention is much higher than that of a single doctor; since recall here is the rate at which real cancer foci are detected and correctly classified, the invention misses far fewer foci than a single doctor.
FIG. 5 shows feature maps produced by the feature extraction network. After feature extraction, the feature values of vessels and non-vessels differ greatly, showing that the feature extraction network effectively extracts the key features for detection and diagnosis from a narrow-band imaging endoscopic image.
Tables 1 and 2 analyze the sensitivity, precision and recall of the invention and of a single doctor on narrow-band imaging endoscopic images. Table 1 gives the performance of the network with K = 4 (i.e., classification fuses the features of 4 neighbors); Table 2 gives the detection and diagnosis results of a single doctor. The reference standard for detection and diagnosis was annotated by several highly experienced doctors. In recall, the invention exceeds the detection and diagnosis level of a single doctor, demonstrating its clinical value.
TABLE 1
Type | TP | FP | FN | Sensitivity | Precision | Recall
---|---|---|---|---|---|---
A | 169 | 267 | 53 | 0.761 | 0.388 | 0.669
B1 | 3248 | 489 | 249 | 0.929 | 0.869 | 0.916
B2 | 98 | 40 | 70 | 0.583 | 0.710 | 0.466
B3 | 20 | 22 | 5 | 0.800 | 0.476 | 0.500
Overall | 3535 | 818 | 377 | 0.904 | 0.812 | 0.884
TABLE 2
Type | TP | FP | FN | Sensitivity | Precision | Recall
---|---|---|---|---|---|---
A | - | - | - | - | - | 0.50
B1 | - | - | - | - | - | 0.70
B2 | - | - | - | - | - | 0.93
B3 | - | - | - | - | - | 1.00
Overall | - | - | - | - | - | 0.67
Reference to the literature
[1] Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 1097-1105 (2012).
[2] Russakovsky, O., Deng, J., Su, H., et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115, 211-252 (2015).
[3] Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (2015).
[4] He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 770-778 (2016).
[5] Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition, 580-587 (2014).
[6] Girshick, R. Fast R-CNN. IEEE International Conference on Computer Vision, 1440-1448 (2015).
[7] Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Neural Information Processing Systems (2015).
[8] Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition, 3431-3440 (2015).
[9] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A. L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 834-848 (2018).
[10] Ervik, M., L, F., Ferlay, J., et al. Cancer Today. Lyon, France: International Agency for Research on Cancer [EB/OL]. [2017-02-26].
[11] Chen Wangqing, Zheng Rongshan, Zhang Wei, et al. Analysis of the morbidity and mortality of malignant tumors in China, 2013 [J]. China Cancer, 2017, 26(1): 1-7.
[12] Lin, T.-Y., Dollár, P., Girshick, R. B., He, K., Hariharan, B. & Belongie, S. J. Feature pyramid networks for object detection. IEEE Conference on Computer Vision and Pattern Recognition, 936-944 (2017).
[13] Zeng, H., Zheng, R., Guo, Y., et al. Cancer survival in China, 2003-2005: a population-based study [J]. Int J Cancer, 2015, 136(8).
[14] Chen, W. Q., Zheng, R. S., Baade, P. D., et al. Cancer statistics in China, 2015 [J]. CA Cancer J Clin, 2016, 66(2): 115-132.
[15] Inoue, H., Kaga, M., Ikeda, H., et al. Magnification endoscopy in esophageal squamous cell carcinoma: a review of the intrapapillary capillary loop classification [J]. Ann Gastroenterol, 2015, 28(1): 41-48.
[16] Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. ICLR (Poster) (2015).
Claims (3)
1. A depth detection system with a self-embedded cluster-distribution prior for quantifying the morphological distribution of esophageal mucosa IPCLs blood vessels, characterized by comprising: a feature extraction backbone network, a feature pyramid network, a region candidate network, a cancer-focus classification network with region-of-interest pooling and a self-embedded cluster-distribution prior, and an auxiliary diagnosis system for visualization on narrow-band imaging endoscopic images; wherein:
(1) the feature extraction backbone network is built on ResNet-50 and comprises 50 convolutional layers for extracting feature maps of the input image; specifically, feature maps are extracted at the ends of stages 1, 2, 3 and 4 of the ResNet-50 model, the extracted feature maps having 256, 512, 1024 and 2048 channels and sizes of 1/4, 1/8, 1/16 and 1/32 of the original image, respectively; the feature maps are fed into the feature pyramid network;
(2) the feature pyramid network fuses features of different scales: all feature maps are first unified to 256 channels with 1×1 convolutions; then, from top to bottom, the upper-layer features are upsampled to twice their size layer by layer, added to the lower-layer features, and passed through a 3×3 convolution, yielding a multi-scale feature map with sizes of 1/4, 1/8, 1/16 and 1/32 of the original image, each with 256 channels;
(3) the region candidate network extracts possible lesion regions: first, an anchor generator produces dense rectangular candidate boxes; the candidate boxes come in 5×3 different configurations, combining five sizes with three shapes; the features of each pyramid level pass through a 3×3 convolution and a 1×1 convolution, and Softmax judges whether each candidate box is a positive or a negative sample; finally, a 1×1 convolution with 12 channels performs bounding-box regression for the three shapes, correcting inaccurate candidate boxes;
(4) in the cancer-focus classification network with region-of-interest pooling and a self-embedded cluster-distribution prior, region-of-interest pooling pools the features of suspicious lesion regions, and the classification network with the self-embedded cluster-distribution prior classifies the cancer foci; specifically, each region of interest is framed with a rectangular bounding box parallel to the coordinate axes and given a cancer-focus classification: a normal type-A region or a lesion region of type B1, B2 or B3; the network first extracts regions of interest from different levels of the feature pyramid, aligns them, and max-pools them to 7×7, so that each region of interest corresponds to a feature of size 256×7×7; the features of each region of interest are then stacked with the features of its K nearest neighbors into a feature map of shape (256×K)×7×7, which lets the classification network exploit the latent distribution prior of the cancer foci; two output branches are then produced through fully connected layers: the first branch outputs the position offset of each feature region, further correcting the position of the detection box; the second branch computes the classification probabilities of the features through a Softmax function, yielding the cancer-focus category of the region; the fully connected layer flattens the (256×K)×7×7 feature map into a (12544×K)×1×1 feature and outputs 1024 channels; the first branch outputs 20 channels, i.e., four bounding-box coordinates for each of the 5 categories (5×4 = 20), and the second branch outputs 5 channels, i.e., 5 categories including the negative sample;
(5) the auxiliary diagnosis system for visualization on narrow-band imaging endoscopic images displays the narrow-band imaging endoscopic image and frames and marks cancer foci in different colors; specifically, the input is a narrow-band imaging endoscopic image; the network detects and diagnoses the cancer foci, and detection boxes of different colors represent the different cancer-focus types, namely green, red, purple and black for A, B1, B2 and B3, respectively, each box labeled with its classification confidence; the confidences of all detection boxes are then screened: every detection box with confidence below a threshold T1 is removed, after which non-maximum suppression removes redundant overlapping boxes whose intersection-over-union exceeds a threshold T2; T1 and T2 are swept over all values in [0, 1] with a step of 0.05, and the optimal thresholds T1, T2 are determined by comparing F1-scores.
2. The depth detection system of claim 1, wherein the network model is trained as follows:
before training, the network parameters of the ResNet-50 model are randomly initialized, and the images in the training set are scaled so that their resolution does not exceed 800×1333, the corresponding bounding boxes being scaled in the same proportion;
during training, the three channels R, G, B of each image are first normalized with mean = [0.485, 0.456, 0.406] and standard deviation = [0.229, 0.224, 0.225]; the Adam optimization algorithm is used with an initial learning rate of 10⁻⁴, the two exponential decay rates for the moment estimates set to β1 = 0.9 and β2 = 0.999, and weight decay 0; a mini-batch stochastic gradient descent strategy with batch size 8 is used to minimize the loss function, and training runs for N rounds; because the vessel types in the training set are unevenly distributed, types B2 and B3 would otherwise be insufficiently trained, so Focal loss is used as the loss function of the cancer-focus classification network, with the weights of the negative sample and of types A, B1, B2 and B3 set to C1, C2, C3, C4 and C5, respectively, determined from the distribution of each vessel type in the training set after several experiments.
3. The depth detection system of claim 2, wherein a narrow-band imaging endoscopic image input into the trained network is propagated forward once to obtain the cancer-focus detection and diagnosis results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011263459.2A CN112419246B (en) | 2020-11-12 | 2020-11-12 | Depth detection network for quantifying esophageal mucosa IPCLs blood vessel morphological distribution |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011263459.2A CN112419246B (en) | 2020-11-12 | 2020-11-12 | Depth detection network for quantifying esophageal mucosa IPCLs blood vessel morphological distribution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112419246A CN112419246A (en) | 2021-02-26 |
CN112419246B (en) | 2022-07-22
Family
ID=74831021
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011263459.2A Active CN112419246B (en) | 2020-11-12 | 2020-11-12 | Depth detection network for quantifying esophageal mucosa IPCLs blood vessel morphological distribution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112419246B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113643291B (en) * | 2021-10-14 | 2021-12-24 | 武汉大学 | Method and device for determining esophagus marker infiltration depth grade and readable storage medium |
CN113706533B (en) * | 2021-10-28 | 2022-02-08 | 武汉大学 | Image processing method, image processing device, computer equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109118485A (en) * | 2018-08-13 | 2019-01-01 | 复旦大学 | Digestive endoscope image classification based on multitask neural network cancer detection system early |
CN111784671B (en) * | 2020-06-30 | 2022-07-05 | 天津大学 | Pathological image focus region detection method based on multi-scale deep learning |
- 2020-11-12: CN application CN202011263459.2A granted as patent CN112419246B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN112419246A (en) | 2021-02-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112102256B (en) | Narrow-band endoscopic image-oriented cancer focus detection and diagnosis system for early esophageal squamous carcinoma | |
Li et al. | A large-scale database and a CNN model for attention-based glaucoma detection | |
CN111985536B (en) | Gastroscopic pathology image classification method based on weakly supervised learning | |
EP2685881B1 (en) | Medical instrument for examining the cervix | |
Roth et al. | A new 2.5 D representation for lymph node detection using random sets of deep convolutional neural network observations | |
CN109858540B (en) | Medical image recognition system and method based on multi-mode fusion | |
CN110288597B (en) | Attention mechanism-based wireless capsule endoscope video saliency detection method | |
CN111899229A (en) | Advanced gastric cancer auxiliary diagnosis method based on deep learning multi-model fusion technology | |
CN108257135A (en) | Auxiliary diagnosis system for understanding medical image features based on deep learning | |
CN110276356A (en) | Fundus image aneurysm recognition method based on R-CNN | |
CN104299242B (en) | Fluoroscopic visualization eye fundus image extracting method based on NGC ACM | |
CN112419246B (en) | Depth detection network for quantifying esophageal mucosa IPCLs blood vessel morphological distribution | |
CN112419248B (en) | Ear sclerosis focus detection and diagnosis system based on small target detection neural network | |
CN114782307A (en) | Enhanced CT image colorectal cancer staging auxiliary diagnosis system based on deep learning | |
CN112102332A (en) | Cancer WSI segmentation method based on local classification neural network | |
CN112270667B (en) | TI-RADS-based integrated deep learning multi-label recognition method | |
US20230005140A1 (en) | Automated detection of tumors based on image processing | |
Sun et al. | A novel gastric ulcer differentiation system using convolutional neural networks | |
Lei et al. | Automated detection of retinopathy of prematurity by deep attention network | |
Yue et al. | Automatic acetowhite lesion segmentation via specular reflection removal and deep attention network | |
CN114398979A (en) | Ultrasonic image thyroid nodule classification method based on feature decoupling | |
CN115019049A (en) | Bone imaging bone lesion segmentation method, system and equipment based on deep neural network | |
Vallée et al. | Accurate small bowel lesions detection in wireless capsule endoscopy images using deep recurrent attention neural network | |
CN112634291A (en) | Automatic burn wound area segmentation method based on neural network | |
Oliver et al. | Automatic diagnosis of masses by using level set segmentation and shape description |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |