CN115018767A - Cross-modal endoscope image conversion and lesion segmentation method based on eigen expression learning - Google Patents

Cross-modal endoscope image conversion and lesion segmentation method based on eigen expression learning

Info

Publication number
CN115018767A
CN115018767A (application CN202210477177.5A)
Authority
CN
China
Prior art keywords
image
white light image
WLI
narrow band
Prior art date
Legal status
Pending
Application number
CN202210477177.5A
Other languages
Chinese (zh)
Inventor
颜波
钟芸诗
谭伟敏
蔡世伦
林青
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University
Priority to CN202210477177.5A
Publication of CN115018767A
Legal status: Pending (Current)

Classifications

    • G06T 7/0012 — Biomedical image inspection (G06T 7/00 Image analysis; G06T 7/0002 Inspection of images, e.g. flaw detection)
    • G06N 3/088 — Non-supervised learning, e.g. competitive learning (G06N 3/02 Neural networks; G06N 3/08 Learning methods)
    • G06T 7/11 — Region-based segmentation (G06T 7/10 Segmentation; Edge detection)
    • G06V 10/26 — Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/764 — Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/774 — Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06T 2207/20081 — Training; Learning
    • G06T 2207/20084 — Artificial neural networks [ANN]
    • G06T 2207/30096 — Tumor; Lesion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of medical image processing and specifically relates to a cross-modal endoscope image conversion and lesion-region segmentation method. The invention converts white light images of the digestive-tract endoscope into high-quality narrow-band images through a constructed neural network based on intrinsic-representation learning; an intrinsic feature extractor trained without supervision provides the essential features of the white light image, and an atrous spatial pyramid pooling (ASPP) network predicts the lesion region from them to obtain the segmentation result. At test time, the white light image under examination needs only a single forward pass together with one auxiliary narrow-band image to obtain its corresponding narrow-band image. The method adopts unsupervised learning, generalizes well, and performs well on different endoscope devices. The invention can provide additional narrow-band imaging for white-light endoscope equipment, giving physicians a better reference for diagnosis, and the narrow-band-image-assisted lesion segmentation can automatically localize the lesion region, greatly improving diagnostic efficiency and reducing morbidity and mortality.

Description

Cross-modal endoscope image conversion and lesion segmentation method based on intrinsic-representation learning
Technical Field
The invention belongs to the technical field of medical image processing and particularly relates to a cross-modal endoscope image conversion and lesion-region segmentation method.
Background
With the rapid development of medical technology, endoscopes have been widely used in clinical diagnosis, treatment, and surgery. An endoscope can be inserted directly into a body cavity for observation and can display the tissue morphology of internal organs [1]. Colon cancer and esophageal cancer are two diseases with high late-stage mortality worldwide. Fortunately, they can be diagnosed at an early stage by colonoscopy and gastroscopy, which can effectively improve survival [2,3].
Among endoscopic imaging techniques, white light imaging (WLI) is the most widely used, giving good visualization of vascular structures and the surrounding mucosa. However, early-cancer lesions are generally confined to the mucosal and submucosal layers, and the diagnostic efficacy of WLI is relatively limited, with low sensitivity and specificity. Narrow-band imaging (NBI) is therefore becoming a new trend for early cancer diagnosis. NBI can reveal the boundary of the lesion margin and clearly show the morphology and distribution of the microvasculature. Under NBI, superficial blood vessels appear brown and deeper vessels appear blue-green, which enhances the visualization of capillary and mucosal morphology. NBI can significantly increase the contrast between a lesion and the surrounding mucosa, thereby improving the accuracy of lesion-region judgment [4].
WLI helps physicians locate lesions, while NBI shows the boundaries and extent of lesions more clearly, so viewing both WLI and NBI simultaneously would combine the advantages of the two modes. In practice, however, most endoscopic devices cannot display WLI and NBI at the same time, since NBI requires much of the light to be filtered out. Converting WLI into NBI with deep learning therefore has great social value.
In computer vision, the image-to-image translation task aims to learn mappings between different domains so as to generate images resembling the target domain. Deep learning methods have made great progress here: Pix2Pix [5] proposed a paired image-translation method based on datasets with pixel-level correspondence, and, because paired datasets are difficult to obtain, CycleGAN [6] proposed an unpaired image-translation method based on cycle consistency. Existing image-translation methods, however, only consider style conversion; they ignore the association between the two modes and cannot learn intrinsic representations related to the medical information, so they cannot serve downstream medical tasks such as anomaly detection and lesion-region segmentation.
The invention provides a novel method for converting WLI images into NBI images based on unsupervised intrinsic-representation learning. It fully learns the intrinsic representation shared by the two modes, can convert a WLI image into a high-quality NBI image, provides an effective basis for physicians' diagnosis, and can be used for lesion-region segmentation to improve the detection rate of digestive-tract diseases.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a cross-modal endoscope image conversion and lesion-region segmentation method based on unsupervised intrinsic-representation learning, so as to eliminate the influence of human factors, realize conversion from WLI endoscope images to NBI endoscope images, and simultaneously segment the lesion region.
The cross-modal endoscope image conversion and lesion-region segmentation method based on unsupervised intrinsic-representation learning provided by the invention comprises the following specific steps:
(I) converting a white light image (WLI) of the digestive-tract endoscope into a high-quality narrow-band image (NBI) through a constructed neural network based on intrinsic-representation learning;
(II) acquiring the essential features of the white light image with the intrinsic feature extractor obtained in step (I), and predicting the lesion region of the white light image through an atrous spatial pyramid pooling (ASPP) network to obtain the segmentation result of the lesion region.
In step (I), for a given white light image (WLI), the goal is to generate the corresponding narrow-band image (NBI). Previous image-translation methods generally treat cross-modal image conversion as a style-transfer task. Although this produces realistic-looking results, it may destroy the original high-level information and basic features of the image, which is unacceptable for medical images: it may not only lose useful information but also affect downstream tasks or the physician's judgment. Based on an illumination model, the invention assumes that endoscopic images can be decoupled into optical information and essential features. The neural network then recombines the optical information of the other mode with the essential features of the current mode to obtain the corresponding cross-modal image.
The neural network has a symmetric structure. A white light image I_WLI is passed through the modality-specific feature encoder E_MS, which extracts optical information, to obtain the white-light optical features; at the same time, I_WLI is passed through the modality-invariant feature encoder E_MI to obtain the essential features of the white light image. Similarly, a narrow-band image I_NBI is passed through E_MS to obtain the NBI optical features and through E_MI to obtain the essential features of the NBI image. The white-light optical features and the NBI essential features are combined and fed into the white light image generator G_WLI to generate a white light image; the NBI optical features and the white-light essential features are combined and fed into the narrow-band image generator G_NBI to generate a narrow-band image. The two sets of essential features are each fed into the intrinsic generator G_Eigen, whose weights are shared between the two branches, and either can produce an intrinsic representation I_Eigen. The generated white light and narrow-band images are sent to a discriminator D_Gen, which distinguishes generated images from real ones and outputs classification results; through this adversarial learning the network is driven to generate realistic medical images.
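The wiring just described can be sketched in a few lines of PyTorch. Everything below except the wiring itself — the plain convolutional encoder/decoder blocks, the channel counts, the 256×256 input size — is an illustrative assumption rather than the architecture disclosed by the patent; only the data flow (shared encoders E_MS and E_MI, generators G_WLI and G_NBI that consume a recombination of optical and essential features) follows the description.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """3x3 conv + instance norm + ReLU; a placeholder building block."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1),
        nn.InstanceNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class Encoder(nn.Module):
    """Stand-in for E_MS (optical features) and E_MI (essential features)."""
    def __init__(self, c_feat=64):
        super().__init__()
        self.net = nn.Sequential(conv_block(3, 32), conv_block(32, c_feat))
    def forward(self, x):
        return self.net(x)

class Generator(nn.Module):
    """Stand-in for G_WLI / G_NBI: decodes a concatenation of optical and essential features."""
    def __init__(self, c_feat=64):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(2 * c_feat, c_feat),
            nn.Conv2d(c_feat, 3, 3, padding=1),
            nn.Tanh(),
        )
    def forward(self, f_optical, f_essential):
        return self.net(torch.cat([f_optical, f_essential], dim=1))

E_MS, E_MI = Encoder(), Encoder()          # modality-specific / modality-invariant encoders
G_WLI, G_NBI = Generator(), Generator()    # white light / narrow-band generators

I_WLI = torch.randn(1, 3, 256, 256)        # stand-ins for a WLI/NBI training pair
I_NBI = torch.randn(1, 3, 256, 256)

f_ms_wli, f_mi_wli = E_MS(I_WLI), E_MI(I_WLI)   # optical / essential features of the WLI image
f_ms_nbi, f_mi_nbi = E_MS(I_NBI), E_MI(I_NBI)   # optical / essential features of the NBI image

# Cross-modal recombination: each generator receives its own modality's optics
# together with the other modality's essential features.
fake_WLI = G_WLI(f_ms_wli, f_mi_nbi)
fake_NBI = G_NBI(f_ms_nbi, f_mi_wli)
```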
In order to increase sample diversity and generalize to different devices and modes, the neural network adopted by the invention has a cycle structure, i.e., it is a cyclic network, and it does not constrain the translated image directly with a pixel-level loss. Specifically, the cycle is as follows: the generated white light image is passed through the modality-specific feature encoder E_MS to obtain new white-light optical features, and through the modality-invariant feature encoder E_MI to obtain new white-light essential features; similarly, new optical features and essential features are obtained from the generated NBI image. The new optical features and essential features are then recombined: one combination is fed into the white light image generator G_WLI to regenerate a white light image, and the other combination is fed into the narrow-band image generator G_NBI to regenerate a narrow-band image. The white light image and narrow-band image obtained after this cycle should be consistent with the originally input white light image I_WLI and narrow-band image I_NBI, so a pixel-level loss can be used to constrain them.
Further, since intrinsic features cannot be constrained directly, the invention adopts an intrinsic adversarial learning strategy to generate reasonable intrinsic images so that meaningful medical intrinsic features are acquired. Because intrinsic images have no ground truth, a cyclic scheme is used to make the network learn an efficient and reasonable representation. First, to ensure that E_MI extracts a meaningful intrinsic representation, an intrinsic image generator G_Eigen is introduced. E_MI is expected to generalize well and to capture the essence of endoscopic images, so that cross-modal and cross-device images remain consistent. Taking WLI as an example, it is assumed that the essential features can effectively express the tissue in the WLI image, from which an intrinsic map with the same deep characteristics as the input WLI can be obtained. Likewise, feeding the intrinsic map back into E_MI should encode intermediate features that have the same essential features as the input WLI. The intrinsic map is generated only from essential features and should not contain any optical information; therefore, feeding the intrinsic map into E_MS should yield an all-zero vector.
Further, although the cyclic network can learn a reasonable representation, deep neural networks tend to become lazy during training, which may make the generated intrinsic images unrealistic. Moreover, owing to lighting conditions and the limitations of display devices, a true intrinsic image cannot be obtained. The invention therefore constructs an intrinsic discriminator D_Eigen to classify the generated intrinsic images. Since intrinsic images reflect the true common features of WLI and NBI images, the intrinsic distribution is expected to be close to both WLI and NBI. D_Eigen distinguishes the generated intrinsic images from real WLI and NBI images, while G_Eigen tries to generate realistic intrinsic images to confuse the discriminator. Through this adversarial learning, not only are more realistic intrinsic images obtained, but E_MI is also encouraged to better capture the essential features of WLI and NBI for downstream medical tasks.
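The intrinsic adversarial game can be sketched as below, again with placeholder module shapes: D_Eigen scores real WLI and NBI images as real and the generated intrinsic image as fake, while G_Eigen (not shown) is trained to fool it. The patch-discriminator architecture and the binary-cross-entropy form of the objective are illustrative assumptions.

```python
class Discriminator(nn.Module):
    """Placeholder patch discriminator, used here to illustrate D_Eigen."""
    def __init__(self, c=64):
        super().__init__()
        self.net = nn.Sequential(conv_block(3, c), conv_block(c, c), nn.Conv2d(c, 1, 1))
    def forward(self, x):
        return self.net(x)

D_Eigen = Discriminator()

def d_eigen_loss(i_wli, i_nbi, i_eigen_fake):
    """Discriminator side: real WLI/NBI images vs. the generated intrinsic image."""
    real = torch.cat([D_Eigen(i_wli), D_Eigen(i_nbi)], dim=0)
    fake = D_Eigen(i_eigen_fake.detach())
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
            + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

# Usage with a random tensor standing in for the output of G_Eigen.
loss_d_eigen = d_eigen_loss(I_WLI, I_NBI, torch.randn(1, 3, 256, 256))
```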
In the invention, during testing of the neural network, an endoscopic white light image I_WLI is input and its essential features are extracted by the intrinsic feature extractor E_MI; an additional endoscopic narrow-band image is input and its narrow-band optical features are extracted by the optical feature extractor. The two sets of features are combined and fed into the narrow-band image generator G_NBI, and the narrow-band image is obtained with a single forward pass.
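The single-forward-pass inference step can be written directly with the modules sketched earlier (all shapes remain illustrative assumptions): the essential features of the test WLI image are combined with the optical features of one auxiliary NBI image and decoded by G_NBI.

```python
@torch.no_grad()
def wli_to_nbi(i_wli_test, i_nbi_aux):
    """Translate a white light image to a narrow-band image in one forward pass."""
    f_essential = E_MI(i_wli_test)   # intrinsic features of the test WLI image
    f_optical = E_MS(i_nbi_aux)      # optical features of the auxiliary NBI image
    return G_NBI(f_optical, f_essential)

fake_nbi = wli_to_nbi(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
```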
In the invention, the losses used to train the neural network are designed as follows:

The two essential features (from the WLI and NBI branches) are fed into the intrinsic generator G_Eigen to produce the white-light intrinsic representation I_Eigen^WLI and the narrow-band intrinsic representation I_Eigen^NBI, respectively. These two intrinsic representations are constrained with the LPIPS perceptual loss:

L_perceptual = LPIPS(I_Eigen^WLI, I_Eigen^NBI),  (1)
In the narrow-band imaging branch, the optical features and the narrow-band essential features are combined and fed into the white light image generator G_WLI to generate a white light image, which is passed through the discriminator D_Gen to compute the GAN loss L_GAN. The white-light intrinsic representation I_Eigen^WLI is fed back into the intrinsic feature extractor E_MI to re-extract an intrinsic representation, which is combined with the optical features of the original white light image and fed into G_WLI to generate a reconstructed white light image; the cycle loss L_cycle is computed between this reconstruction and the original white light image I_WLI. The generated white-light intrinsic representation I_Eigen^WLI and narrow-band intrinsic representation I_Eigen^NBI are fed into the intrinsic discriminator D_Eigen, and the modality-invariance loss is computed as:

L_Eigen = E[log D_Eigen(I_WLI, I_NBI)] + E[log(1 − D_Eigen(I_Eigen))],  (2)
Ideal essential features should not contain optical features; therefore the generated white-light intrinsic representation I_Eigen^WLI is fed back into the optical feature extractor E_MS to extract the optical features of the intrinsic representation, and the feature loss constrains these optical features toward the zero vector:

L_feature = ‖E_MS(I_Eigen^WLI)‖,  (3)
Therefore, when training the neural network, the final loss function is:

L = λ_1 L_perceptual + λ_2 L_Eigen + λ_3 L_feature + λ_4 L_cycle + λ_5 L_GAN,  (4)

where λ_i (i = 1, 2, …, 5) are the weights used to balance the individual loss terms; in the invention they are set empirically to λ_1 = 10, λ_2 = 1, λ_3 = 1, λ_4 = 10, λ_5 = 1.
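The five terms of Eq. (4) can be assembled as below. The LPIPS term uses the public `lpips` package, and the adversarial terms are written in a non-saturating binary-cross-entropy form; these concrete choices, like the argument tensors themselves, are assumptions layered on the hypothetical modules above — the patent fixes only the loss terms and the weights λ_1–λ_5.

```python
import torch
import torch.nn.functional as F
import lpips  # perceptual-similarity package providing the LPIPS metric

lpips_fn = lpips.LPIPS(net="alex")  # expects image-like tensors scaled to [-1, 1]
lam = dict(perceptual=10.0, eigen=1.0, feature=1.0, cycle=10.0, gan=1.0)

def total_loss(eigen_wli, eigen_nbi, d_gen_fake, d_eigen_fake, e_ms_of_eigen, rec_wli, i_wli):
    """Assemble Eq. (4) from tensors produced by the (hypothetical) modules above."""
    L_perceptual = lpips_fn(eigen_wli, eigen_nbi).mean()                  # Eq. (1)
    L_eigen = F.binary_cross_entropy_with_logits(                         # generator side of Eq. (2)
        d_eigen_fake, torch.ones_like(d_eigen_fake))
    L_feature = e_ms_of_eigen.abs().mean()                                # Eq. (3): optics of the eigen image -> 0
    L_cycle = F.l1_loss(rec_wli, i_wli)
    L_gan = F.binary_cross_entropy_with_logits(d_gen_fake, torch.ones_like(d_gen_fake))
    return (lam["perceptual"] * L_perceptual + lam["eigen"] * L_eigen
            + lam["feature"] * L_feature + lam["cycle"] * L_cycle + lam["gan"] * L_gan)
```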
In step (II), the essential features of the white light image are obtained with the intrinsic feature extractor trained in step (I), and the digestive-tract lesion region is segmented by an atrous spatial pyramid pooling (ASPP) network [7]. The specific process is as follows:

The ASPP network is attached after the frozen intrinsic feature extractor E_MI. ASPP applies parallel atrous (dilated) convolutions with different sampling rates to the given input. The endoscopic white light image is passed through E_MI to provide the essential features, which are fed into the ASPP network to directly output the lesion-region segmentation result. Unlike conventional semantic segmentation networks, first, the invention does not take an image as the input of the segmentation network but instead uses the intermediate features extracted by the intrinsic feature extractor E_MI. Second, E_MI is trained in an unsupervised way and is kept frozen when training the semantic segmentation task, which greatly reduces the number of parameters and the training time. Finally, owing to the generalization brought by unsupervised training, the proposed semantic segmentation network can be applied directly to other datasets without retraining or fine-tuning, achieving better cross-device performance.
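A compact sketch of this segmentation head: the intrinsic feature extractor is frozen and an ASPP block — parallel dilated convolutions at several rates — predicts the lesion mask from its output. The dilation rates, channel counts, and single-class output are illustrative assumptions on top of the modules defined earlier.

```python
class ASPP(nn.Module):
    """Atrous spatial pyramid pooling over the intrinsic features."""
    def __init__(self, c_in=64, c_mid=64, rates=(1, 6, 12, 18), n_classes=1):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c_in, c_mid, 3, padding=r, dilation=r) for r in rates)
        self.head = nn.Conv2d(c_mid * len(rates), n_classes, 1)
    def forward(self, f):
        return self.head(torch.cat([F.relu(b(f)) for b in self.branches], dim=1))

aspp = ASPP()
for p in E_MI.parameters():          # the intrinsic feature extractor stays frozen
    p.requires_grad_(False)

mask_logits = aspp(E_MI(torch.randn(1, 3, 256, 256)))  # lesion prediction from intrinsic features
```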
In the invention, the intrinsic feature extractor maps images of different modalities into the same feature space, so similar intrinsic features are produced for images from different devices, and the lesion detection therefore transfers well across devices.
In the invention, the converted narrow-band image can be obtained with the white light image conversion network, and the generated narrow-band image and the original white light image are used together as the input of the segmentation network to output an accurate lesion segmentation result for the white light image. U-Net is used as the basic framework: one encoder extracts features from the white light image, another encoder extracts features from the narrow-band image converted from it, and the two sets of features are fed into a decoder together to predict the lesion-region segmentation result.
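The two-encoder variant can be sketched as a simplified U-Net-style network (skip connections omitted, depths and widths assumed): one encoder takes the white light image, the other takes the translated narrow-band image, and a shared decoder predicts the lesion mask from the fused features.

```python
class TwoStreamSegNet(nn.Module):
    """Two encoders (WLI and translated NBI) feeding one decoder, reusing conv_block from above."""
    def __init__(self, c=32, n_classes=1):
        super().__init__()
        def encoder():
            return nn.Sequential(conv_block(3, c), nn.MaxPool2d(2), conv_block(c, 2 * c))
        self.enc_wli, self.enc_nbi = encoder(), encoder()
        self.decoder = nn.Sequential(
            conv_block(4 * c, 2 * c),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(2 * c, n_classes, 1),
        )
    def forward(self, wli, nbi_generated):
        fused = torch.cat([self.enc_wli(wli), self.enc_nbi(nbi_generated)], dim=1)
        return self.decoder(fused)

seg = TwoStreamSegNet()
mask = seg(I_WLI, fake_NBI.detach())   # translated NBI assists WLI lesion segmentation
```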
The invention also provides the first multi-modal (WLI and NBI) esophageal endoscopic video dataset and a pixel-level-aligned paired esophageal endoscopic image dataset. The dataset was collected with Pentax equipment and includes 34 videos and 8700 pairs of WLI and NBI images; the patients in the test set do not overlap with those in the training set. The construction of the dataset is as follows:
(1) Selection of data objects. The lack of large-scale paired datasets is one of the key challenges hindering the endoscopic image conversion task. Owing to its special structural design, the Pentax endoscope can display WLI and NBI almost synchronously in real time. We therefore collected an esophageal endoscopy video dataset containing 34 videos, including 29 videos of normal esophagus and 5 videos of abnormal esophagus, from an affiliated hospital of a university in Shanghai, spanning April to May 2021. Each video is about 1 to 5 minutes long, with abnormal-esophagus videos generally being longer; the total duration of the normal videos is 39 minutes 57 seconds and that of the abnormal videos is 12 minutes 52 seconds;
(2) Image acquisition by endoscopy. All patients underwent gastroscopy after intravenous anesthesia. The imaging device was a Pentax EPK-i7000, and the endoscopist selected Pentax's dual mode to display WLI and NBI on a dual screen. Esophageal endoscopy was performed and recorded during withdrawal: the endoscopist started the examination from the cardia (40 cm from the incisors) and then slowly withdrew the scope to the beginning of the esophagus (15 cm from the incisors). For suspected lesions, the endoscopist could repeat the observation and recording and shoot from multiple angles. Each set of videos was used as one case and included in the analysis. To ensure the quality of the recorded videos, the endoscopist tried to avoid defocus during shooting and to avoid interference caused by breathing, heartbeat, mucus, air bubbles, blood, and so on; if a video was blurred, it was recorded again;
(3) Construction of the paired multi-modal dataset. The Pentax WLI image is read out from three color signals, while the NBI image is obtained through digital image processing, which takes some time. When the scope moves slowly this processing delay is negligible, so nearly aligned cross-modal images can be obtained; however, when the scope moves quickly, or when the patient's breathing and heartbeat cause severe esophageal jitter, the images of the two modes shown in a video frame exhibit obvious distortion and blur. We therefore pre-process the captured video frames.
First, we manually removed clearly blurred frames, as well as the beginning and ending frames of each video, which typically show abnormal images caused by adjusting the equipment. Second, according to the imaging principle and motion pattern of the video frames, WLI appears before NBI at the same location; three adjacent WLI and NBI frames are therefore compared, and the image residuals are used to find corresponding WLI and NBI images. Based on the video dataset, 11 normal and 5 abnormal videos were used to balance the samples, and a paired multi-modal endoscopy dataset containing 8700 pairs of WLI and NBI images was constructed, in which the WLI and NBI images achieve nearly pixel-level alignment. To our knowledge, this is the first multi-modal (WLI and NBI) esophageal endoscopic video dataset and pixel-level-aligned paired esophageal endoscopic image dataset.
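The frame-pairing rule described above (WLI appears slightly before NBI at the same location, and the matching frame is selected by the smallest image residual among adjacent frames) can be sketched as follows; the three-frame window follows the text, while the grayscale mean-absolute-residual metric is an illustrative assumption.

```python
import numpy as np

def match_nbi_frame(wli_frame, nbi_candidates):
    """Pick, among adjacent NBI frames, the one with the smallest residual to a WLI frame.

    wli_frame: HxW grayscale array; nbi_candidates: list of HxW grayscale arrays
    (e.g. the three NBI frames adjacent to the WLI frame).
    """
    residuals = [np.abs(wli_frame.astype(np.float32) - c.astype(np.float32)).mean()
                 for c in nbi_candidates]
    best = int(np.argmin(residuals))
    return best, residuals[best]

# Usage with dummy arrays standing in for decoded video frames.
wli = np.random.rand(480, 640).astype(np.float32)
candidates = [np.random.rand(480, 640).astype(np.float32) for _ in range(3)]
idx, res = match_nbi_frame(wli, candidates)
```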
The invention has the following beneficial effects: an image conversion network based on intrinsic-representation learning is designed, which can convert a digestive-tract endoscope WLI image into a high-quality NBI image, and the NBI image can assist the WLI image in segmenting the lesion region. The WLI image under test needs only a single forward pass together with one auxiliary NBI image to obtain its corresponding NBI image. The method adopts unsupervised learning, generalizes well, and performs well on different endoscope devices. The invention can provide additional NBI imaging for WLI endoscope equipment, give physicians a better reference for diagnosis, and automatically localize the lesion region through NBI-image-assisted lesion segmentation, greatly improving diagnostic efficiency and reducing morbidity and mortality.
Drawings
FIG. 1 is a network framework diagram of the present invention.
Fig. 2 shows samples from the paired WLI and NBI dataset.
Fig. 3 shows the WLI and NBI image conversion results within the same dataset.
Fig. 4 shows the WLI and NBI image conversion results across datasets.
Fig. 5 shows the results of using the converted NBI images to assist WLI lesion segmentation.
Fig. 6 shows the results of using the intrinsic features for lesion segmentation.
Detailed Description
The embodiments of the present invention are described in detail below, but the scope of the present invention is not limited to the examples.
Adopting the network structure in FIG. 1, the image conversion network was trained on 6947 pairs of WLI and NBI images, yielding a trained image conversion network and intrinsic feature extractor.
The method comprises the following specific steps:
(1) During testing, an endoscopic white light image I_WLI is input and its essential features are extracted by the intrinsic feature extractor E_MI; an additional endoscopic narrow-band image is input and its narrow-band optical features are extracted by the optical feature extractor. The two sets of features are combined and fed into the narrow-band image generator G_NBI, and the narrow-band image is obtained with a single forward pass;
(2) The intrinsic feature extractor E_MI is frozen and an ASPP network is attached for semantic segmentation. The endoscopic white light image is passed through E_MI to provide essential features, which are fed into the ASPP network to directly output the lesion-region segmentation result;
(3) The converted narrow-band image is obtained with the white light image conversion network; the generated narrow-band image and the original white light image are used together as the input of the segmentation network to output an accurate lesion segmentation result for the white light image. U-Net is used as the basic framework: one encoder extracts features from the white light image, another encoder extracts features from the narrow-band image converted from it, and the two sets of features are fed into a decoder together to predict the lesion-region segmentation result.
Figs. 3 and 4 show the image conversion results within the same dataset and across datasets for WLI and NBI. Since Pix2Pix and CycleGAN treat the task directly as style conversion without considering the essential features of medical images, their results resemble a hue conversion with some artifacts and generate no useful information, so their cross-dataset results are poor. The invention performs stably both in within-dataset and in cross-dataset experiments.
Fig. 5 shows the results of using the converted NBI images to assist WLI lesion segmentation. The invention generates more realistic NBI images and accurate lesion segmentation results, indicating that the generated NBI images are closest to real NBI images and that the feature extraction focuses on the important information in the endoscopic images, thereby benefiting downstream tasks.
Fig. 6 shows the results of using the intrinsic features for lesion segmentation. Even supervised end-to-end segmentation methods have difficulty distinguishing certain flat lesions. In the invention, the intrinsic representation of the endoscopic image is obtained by unsupervised training, which ignores the variations that different devices introduce into medical images and maps all images into the same feature domain; the intrinsic features therefore yield good lesion predictions even without fine-tuning.
References
[1] Mamonov A V, Figueiredo I N, Figueiredo P N, et al. Automated polyp detection in colon capsule endoscopy[J]. IEEE Transactions on Medical Imaging, 2014, 33(7): 1488-1502.
[2] Ghatwary N, Zolgharni M, Ye X. Early esophageal adenocarcinoma detection using deep learning methods[J]. International Journal of Computer Assisted Radiology and Surgery, 2019, 14(4): 611-621.
[3] Mesejo P, Pizarro D, Abergel A, et al. Computer-aided classification of gastrointestinal lesions in regular colonoscopy[J]. IEEE Transactions on Medical Imaging, 2016, 35(9): 2051.
[4] Gai W, Jin X F, Du R, et al. Efficacy of narrow-band imaging in detecting early esophageal cancer and risk factors for its occurrence[J]. Indian Journal of Gastroenterology, 2018.
[5] Isola P, Zhu J, Zhou T, Efros A A. Image-to-image translation with conditional adversarial networks[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5967-5976, doi: 10.1109/CVPR.2017.632.
[6] Zhu J, Park T, Isola P, Efros A A. Unpaired image-to-image translation using cycle-consistent adversarial networks[C]. 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2242-2251, doi: 10.1109/ICCV.2017.244.
[7] He K, Zhang X, Ren S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

Claims (6)

1. A cross-mode endoscope image conversion and lesion segmentation method based on intrinsic representation learning is characterized by comprising the following specific steps:
converting a White Light Image (WLI) of a digestive tract endoscope into a high-quality narrow-band image (NBI) through a constructed neural network based on eigen-representation learning;
secondly, acquiring the essential features of the white light image by using the essential feature extractor obtained in the step one through a cavity space convolution pooling pyramid network (ASPP), and predicting the focus area of the white light image to obtain the segmentation result of the focus area;
in step (one), for a given White Light Image (WLI), the goal is to generate a corresponding narrowband image (NBI); according to the light model, it is assumed that the endoscopic image can be decoupled into optical information and essential features; then, recombining the optical information of the other mode with the essential characteristics of the mode by adopting a neural network to obtain a corresponding cross-mode image;
the neural network has a symmetric structure; a white light image I_WLI is passed through the modality-specific feature encoder E_MS, which extracts optical information, to obtain the optical features of I_WLI, and through the modality-invariant feature encoder E_MI to obtain the essential features of I_WLI; likewise, a narrow-band image I_NBI is passed through E_MS to obtain the optical features of I_NBI and through E_MI to obtain the essential features of I_NBI; the white-light optical features and the narrow-band essential features are combined and fed into the white light image generator G_WLI to generate a white light image, and the narrow-band optical features and the white-light essential features are combined and fed into the narrow-band image generator G_NBI to generate a narrow-band image; the two sets of essential features are each fed into the intrinsic generator G_Eigen, whose weights are shared between the two branches, and either can produce an intrinsic representation I_Eigen; the generated white light image and the narrow-band image I_NBI are fed into a discriminator D_Gen that distinguishes generated images from real images to obtain classification results, and realistic medical images are generated through adversarial learning;
the neural network has a cycle structure, i.e., it is a cyclic network, and the cycle proceeds as follows: the generated white light image is passed through the modality-specific feature encoder E_MS to obtain new white-light optical features and through the modality-invariant feature encoder E_MI to obtain new white-light essential features; similarly, new optical features and essential features of the narrow-band image are obtained; the new optical features and essential features are recombined and fed into the white light image generator G_WLI to obtain a white light image and into the narrow-band image generator G_NBI to obtain a narrow-band image; the white light image and narrow-band image obtained after the cycle should be consistent with the originally input white light image I_WLI and narrow-band image I_NBI, i.e., a pixel-level loss can be used to constrain them.
2. The method according to claim 1, characterized in that, in order to better obtain essential medical features, an intrinsic adversarial learning strategy is used to generate reasonable intrinsic images, and a cyclic scheme is used to make the network learn an efficient and reasonable representation, as follows: first, to ensure that E_MI extracts a meaningful intrinsic representation, an intrinsic image generator G_Eigen is used, and E_MI is expected to generalize well and capture the essence of the endoscopic image, so that cross-modal and cross-device images remain consistent; for a white light image (WLI), it is assumed that the essential features can effectively express the tissue in the white light image, from which an intrinsic map with the same deep characteristics as the input white light image (WLI) can be obtained; likewise, feeding the intrinsic map into E_MI encodes intermediate features with the same essential features as the input white light image (WLI); the intrinsic map is generated only from essential features and contains no optical information, so feeding the intrinsic map into E_MS generates an all-zero vector.
3. The method according to claim 2, characterized in that a feature discriminator D_Eigen classifies the generated intrinsic images; since the intrinsic image reflects the true common features of the white light image (WLI) and the narrow-band image (NBI), the intrinsic distribution is expected to be close to both; the feature discriminator D_Eigen distinguishes the generated intrinsic image from real white light images (WLI) and narrow-band images (NBI), while the intrinsic generator G_Eigen tries to generate realistic intrinsic images to confuse the discriminator; through adversarial learning, more realistic intrinsic images are obtained, and E_MI is encouraged to better capture the essential features of WLI and NBI for downstream medical tasks.
4. The method according to claim 3, characterized in that, during testing of the neural network, an endoscopic white light image I_WLI is input and its essential features are extracted by the intrinsic feature extractor E_MI; an additional endoscopic narrow-band image is input and its narrow-band optical features are extracted by the optical feature extractor; the two sets of features are combined and fed into the narrow-band image generator G_NBI, and the narrow-band image is obtained with a single forward pass.
5. The method according to claim 4, characterized in that the losses used in training the neural network are as follows:
the two essential features are fed into the intrinsic generator G_Eigen to respectively generate the white-light intrinsic representation I_Eigen^WLI and the narrow-band intrinsic representation I_Eigen^NBI, which are constrained with the LPIPS loss:
L_perceptual = LPIPS(I_Eigen^WLI, I_Eigen^NBI),  (1)
in the narrow-band imaging branch, the optical features and the narrow-band essential features are combined and fed into the white light image generator G_WLI to generate a white light image, which passes through the discriminator D_Gen to compute the GAN loss L_GAN; the white-light intrinsic representation I_Eigen^WLI is fed back into the intrinsic feature extractor E_MI to re-extract an intrinsic representation, which is combined with the optical features of the original white light image and fed into G_WLI to generate a reconstructed white light image, and the cycle loss L_cycle is computed between the reconstruction and the original white light image I_WLI; the generated white-light intrinsic representation I_Eigen^WLI and narrow-band intrinsic representation I_Eigen^NBI are fed into the intrinsic discriminator D_Eigen, and the modality-invariance loss is computed as:
L_Eigen = E[log D_Eigen(I_WLI, I_NBI)] + E[log(1 − D_Eigen(I_Eigen))],  (2)
ideal essential features should not contain optical features, so the generated white-light intrinsic representation I_Eigen^WLI is fed back into the optical feature extractor E_MS to extract its optical features, and the feature loss constrains them toward the zero vector:
L_feature = ‖E_MS(I_Eigen^WLI)‖,  (3)
therefore, when training the neural network, the final loss function is:
L = λ_1 L_perceptual + λ_2 L_Eigen + λ_3 L_feature + λ_4 L_cycle + λ_5 L_GAN,  (4)
where λ_i is the weight used to balance the individual loss terms, i = 1, 2, …, 5.
6. The method according to claim 5, characterized in that, in step (II), the intrinsic feature extractor trained in step (I) is used to obtain the essential features of the white light image, and the digestive-tract lesion region is segmented by an atrous spatial pyramid pooling (ASPP) network; the specific process is as follows: the ASPP network is attached after the intrinsic feature extractor E_MI; the ASPP applies parallel atrous convolutions with different sampling rates to the given input; the endoscopic white light image is passed through E_MI to provide the essential features, which are fed into the ASPP network to directly output the lesion-region segmentation result.
CN202210477177.5A 2022-05-03 2022-05-03 Cross-modal endoscope image conversion and lesion segmentation method based on eigen expression learning Pending CN115018767A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210477177.5A CN115018767A (en) 2022-05-03 2022-05-03 Cross-modal endoscope image conversion and lesion segmentation method based on eigen expression learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210477177.5A CN115018767A (en) 2022-05-03 2022-05-03 Cross-modal endoscope image conversion and lesion segmentation method based on eigen expression learning

Publications (1)

Publication Number Publication Date
CN115018767A true CN115018767A (en) 2022-09-06

Family

ID=83067163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210477177.5A Pending CN115018767A (en) 2022-05-03 2022-05-03 Cross-modal endoscope image conversion and lesion segmentation method based on eigen expression learning

Country Status (1)

Country Link
CN (1) CN115018767A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117398042A (en) * 2023-12-14 2024-01-16 深圳市博盛医疗科技有限公司 AI-assisted detection 3D endoscope system and imaging method
CN117398042B (en) * 2023-12-14 2024-03-19 深圳市博盛医疗科技有限公司 AI-assisted detection 3D endoscope system and imaging method
CN117726642A (en) * 2024-02-07 2024-03-19 中国科学院宁波材料技术与工程研究所 High reflection focus segmentation method and device for optical coherence tomography image
CN117726642B (en) * 2024-02-07 2024-05-31 中国科学院宁波材料技术与工程研究所 High reflection focus segmentation method and device for optical coherence tomography image
CN117789185A (en) * 2024-02-28 2024-03-29 浙江驿公里智能科技有限公司 Automobile oil hole gesture recognition system and method based on deep learning
CN117789185B (en) * 2024-02-28 2024-05-10 浙江驿公里智能科技有限公司 Automobile oil hole gesture recognition system and method based on deep learning
CN118199941A (en) * 2024-03-04 2024-06-14 北京中科网芯科技有限公司 Network visualization method

Similar Documents

Publication Publication Date Title
JP6657480B2 (en) Image diagnosis support apparatus, operation method of image diagnosis support apparatus, and image diagnosis support program
Ohmori et al. Endoscopic detection and differentiation of esophageal lesions using a deep neural network
CN115018767A (en) Cross-modal endoscope image conversion and lesion segmentation method based on eigen expression learning
Nakagawa et al. Classification for invasion depth of esophageal squamous cell carcinoma using a deep neural network compared with experienced endoscopists
Pogorelov et al. Deep learning and hand-crafted feature based approaches for polyp detection in medical videos
US9514556B2 (en) System and method for displaying motility events in an in vivo image stream
US20220296081A1 (en) Method for real-time detection of objects, structures or patterns in a video, an associated system and an associated computer readable medium
WO2006100808A1 (en) Capsule endoscope image display controller
JP7550409B2 (en) Image diagnosis device, image diagnosis method, and image diagnosis program
CN107705852A (en) Real-time the lesion intelligent identification Method and device of a kind of medical electronic endoscope
Wang et al. Learning two-stream CNN for multi-modal age-related macular degeneration categorization
Masmoudi et al. Optimal feature extraction and ulcer classification from WCE image data using deep learning
Mackiewicz Capsule endoscopy-state of the technology and computer vision tools after the first decade
Pornvoraphat et al. Real-time gastric intestinal metaplasia diagnosis tailored for bias and noisy-labeled data with multiple endoscopic imaging
Lin et al. Lesion-decoupling-based segmentation with large-scale colon and esophageal datasets for early cancer diagnosis
JP7533881B2 (en) Image Classification Method Based on Semantic Segmentation
Bernal et al. Building up the future of colonoscopy–a synergy between clinicians and computer scientists
Phillips et al. Video capsule endoscopy: pushing the boundaries with software technology
WO2022049577A1 (en) Systems and methods for comparing images of event indicators
CN114581408A (en) Gastroscope polyp detection method based on YOLOV5
Kanakatte et al. Precise bleeding and red lesions localization from capsule endoscopy using compact u-net
Auzine et al. Classification of artefacts in endoscopic images using deep neural network
US20230162356A1 (en) Diagnostic imaging device, diagnostic imaging method, diagnostic imaging program, and learned model
Katayama et al. Development of Computer-Aided Diagnosis System Using Single FCN Capable for Indicating Detailed Inference Results in Colon NBI Endoscopy
Nguyen et al. Automatic classification of upper gastrointestinal tract diseases from endoscopic images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination