CN113496489B - Training method of endoscope image classification model, image classification method and device - Google Patents

Training method of endoscope image classification model, image classification method and device

Info

Publication number
CN113496489B
Authority
CN
China
Prior art keywords
image
images
batch
modality
endoscope
Prior art date
Legal status
Active
Application number
CN202111039387.8A
Other languages
Chinese (zh)
Other versions
CN113496489A (en)
Inventor
边成
李永会
杨延展
Current Assignee
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202111039387.8A
Publication of CN113496489A
Application granted
Publication of CN113496489B
Priority to PCT/CN2022/117048 (WO2023030521A1)
Legal status: Active

Classifications

    • G06T 7/0012: Biomedical image inspection (G06T 7/00 Image analysis; G06T 7/0002 Inspection of images, e.g. flaw detection)
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/24: Classification techniques
    • G06N 3/04: Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/08: Neural networks; Learning methods
    • G06T 2207/10068: Endoscopic image (image acquisition modality)
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]


Abstract

A training method of an endoscope image classification model, an image classification method and an image classification device are provided. The method comprises the following steps: acquiring a first image set, which is a set of first-modality images of one or more objects acquired by an endoscope operating in a first modality; acquiring a second image set, which is a set of second-modality images of the one or more objects acquired by an endoscope operating in a second modality different from the first modality, the second-modality images corresponding one-to-one with the first-modality images; and inputting the first image set and the second image set into the endoscope image classification model as training data sets, and training the endoscope image classification model to obtain a trained endoscope image classification model.

Description

Training method of endoscope image classification model, image classification method and device
Technical Field
The application relates to the field of artificial intelligence, in particular to a training method of an endoscope image classification model based on contrast learning, an endoscope image classification method, an endoscope image classification device and a computer readable medium.
Background
Most colorectal cancers begin as neoplasms on the surface of the inner lining of the colorectum, called polyps, some of which may develop into cancer. Early detection and identification of polyp types is therefore critical for the prevention and treatment of cancer. However, visual classification of polyps is challenging: varying endoscope illumination conditions and differences in texture and appearance make identification difficult.
To alleviate the burden on physicians, there have been some efforts to automate polyp type identification using deep learning. However, these efforts are all based on fully supervised approaches, i.e., they require a large amount of annotated data, and annotation is very costly. Furthermore, they are trained using data from only a single modality, whereas in medical imaging the information observed in different modalities differs and is highly important.
Therefore, an improved method for training an endoscopic image classification model is desired, one that can better learn image features at an abstract semantic level and exploit multi-modal feature information when annotated data are limited.
Disclosure of Invention
The present disclosure has been made in view of the above problems. An object of the present disclosure is to provide a training method, apparatus and computer readable medium for semi-supervised training of an endoscopic image classification model based on contrast learning.
The embodiments of the present disclosure provide a training method of an endoscope image classification model based on contrast learning, the method including: acquiring a first image set, which is a set of first-modality images of one or more objects acquired by an endoscope operating in a first modality; acquiring a second image set, which is a set of second-modality images of the one or more objects acquired by an endoscope operating in a second modality different from the first modality, the second-modality images corresponding one-to-one with the first-modality images; and inputting the first image set and the second image set into the endoscope image classification model as training data sets, and training the endoscope image classification model to obtain a trained endoscope image classification model.
For example, in a method according to an embodiment of the present disclosure, the training method is a semi-supervised training method; images of a first subset of the first image set have labels marking endoscope image categories, while the other images of the first image set do not; and the images of a second subset of the second image set, which correspond one-to-one with the images of the first subset, carry the same labels marking endoscope image categories, while the other images of the second image set do not.
For example, in a method according to an embodiment of the present disclosure, the endoscope image classification model comprises: a contrast learning submodel, the contrast learning submodel comprising: a first learning module for receiving the first image set and learning it to obtain a first feature representation and a second feature representation of the first image set; a second learning module for receiving the second image set and learning it to obtain a first feature representation and a second feature representation of the second image set; and a memory queue for storing second feature representations of the first image set generated by the first learning module and second feature representations of the second image set generated by the second learning module; and a classifier submodel comprising: a first classifier submodel for performing classification learning on the first feature representation of the first image set generated by the first learning module to generate a classification prediction probability distribution for each image in the first image set; and a second classifier submodel for performing classification learning on the first feature representation of the second image set generated by the second learning module to generate a classification prediction probability distribution for each image in the second image set.
For example, in a method according to an embodiment of the present disclosure, the first learning module includes a first encoder and a first nonlinear mapper connected in sequence, and the second learning module includes a second encoder and a second nonlinear mapper connected in sequence, wherein the first encoder and the second encoder have the same structure, and the first nonlinear mapper and the second nonlinear mapper have the same structure;
the first classifier submodel comprises a first classifier connected to an output of the first encoder, and the second classifier submodel comprises a second classifier connected to an output of the second encoder, wherein the first classifier and the second classifier are structurally identical.
For example, in a method according to an embodiment of the present disclosure, inputting the first image set and the second image set as a training data set into the endoscope image classification model comprises, at each training iteration: selecting a first batch of first-modality images from the first image set and inputting them into the first learning module; and selecting, from the second image set, a second batch of second-modality images corresponding one-to-one with the first batch of first-modality images and inputting them into the second learning module.
For example, in a method according to an embodiment of the present disclosure, training the endoscope image classification model to obtain a trained endoscope image classification model comprises: training the endoscope image classification model until the joint loss function of the endoscope image classification model converges, to obtain the trained endoscope image classification model.
For example, in a method according to an embodiment of the present disclosure, training the endoscope image classification model until the joint loss function of the endoscope image classification model converges comprises: performing unsupervised contrast learning with the contrast learning submodel to generate a first feature representation and a second feature representation of the first batch for the first batch of first-modality images, and a first feature representation and a second feature representation of the second batch for the second batch of second-modality images; storing the second feature representation of the first batch and the second feature representation of the second batch in the memory queue based on a first-in-first-out rule; performing classification training with the classifier submodel to generate a first classification prediction probability distribution for each image in the first batch of first-modality images, thereby obtaining the first classification prediction probability distribution of the first batch, and a second classification prediction probability distribution for each image in the second batch of second-modality images, thereby obtaining the second classification prediction probability distribution of the second batch; calculating the joint loss function based on the second feature representation of the first batch and the second feature representation of the second batch, and on the first classification prediction probability distribution of the first batch and the second classification prediction probability distribution of the second batch, and adjusting parameters of the endoscope image classification model according to the joint loss function; determining whether credible pseudo-labels are generated for unlabeled images in the first batch of first-modality images and unlabeled images in the second batch of second-modality images; if it is determined that credible pseudo-labels are generated, adding the first-modality images for which credible pseudo-labels were generated and the corresponding second-modality images to the first image set and the second image set respectively to form a new first image set and a new second image set, so as to update the training data set; and continuing to iteratively train the adjusted endoscope image classification model with the new first image set and the new second image set as the new training data set.
For example, in a method according to an embodiment of the present disclosure, if it is determined that no credible pseudo-labels are generated for the unlabeled images in the first batch of first-modality images and the unlabeled images in the second batch of second-modality images, iterative training of the adjusted endoscope image classification model continues with the first image set and the second image set as the training data set.
For example, in a method according to an embodiment of the present disclosure, the joint loss function of the endoscope image classification model is the sum of: the contrast learning loss function, the loss function for classification training of the labeled images in the first batch of first-modality images, and the loss function for classification training of the labeled images in the second batch of second-modality images.
For example, in a method according to an embodiment of the present disclosure, the contrast learning loss function is the noise contrastive estimation loss function InfoNCE, and the loss functions for classification training of the labeled images in the first batch of first-modality images and of the labeled images in the second batch of second-modality images are focal loss functions.
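The disclosure does not spell out the focal loss formula here. Purely as an illustrative sketch, assuming the standard focal loss FL(p_t) = -(1 - p_t)^γ · log(p_t) and a PyTorch-style setting, the joint loss described above could be assembled roughly as follows; the function names, the value of γ and the masking convention (label -1 meaning unlabeled) are assumptions of this sketch, not terms of the disclosure.

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0):
        # Standard focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t), averaged over the batch.
        log_p = F.log_softmax(logits, dim=1)
        log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-probability of the true class
        pt = log_pt.exp()
        return (-(1.0 - pt) ** gamma * log_pt).mean()

    def joint_loss(contrast_loss, logits_wl, logits_nbi, labels):
        # Sum of the contrast loss (InfoNCE) and the focal losses of the labeled images
        # in the first-modality and second-modality batches; label -1 marks an unlabeled image.
        labeled = labels >= 0
        loss = contrast_loss
        if labeled.any():
            loss = loss + focal_loss(logits_wl[labeled], labels[labeled])
            loss = loss + focal_loss(logits_nbi[labeled], labels[labeled])
        return loss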
For example, in a method according to an embodiment of the present disclosure, performing unsupervised contrast learning with the contrast learning submodel to generate the first and second feature representations of the first batch for the first batch of first-modality images and the first and second feature representations of the second batch for the second batch of second-modality images includes: converting each image in the first batch of first-modality images into a first feature representation with the first encoder to obtain the first feature representation of the first batch, and nonlinearly mapping each first feature representation of the first batch with the first nonlinear mapper to obtain the second feature representation of the first batch; and converting each image in the second batch of second-modality images into a first feature representation with the second encoder to obtain the first feature representation of the second batch, and nonlinearly mapping each first feature representation of the second batch with the second nonlinear mapper to obtain the second feature representation of the second batch.
For example, in a method according to an embodiment of the present disclosure, determining whether credible pseudo-labels are generated for unlabeled images in the first batch of first-modality images and unlabeled images in the second batch of second-modality images comprises: for each unlabeled first-modality image, determining a first label prediction value based on the first classification prediction probability distribution generated for that image; for the unlabeled second-modality image corresponding one-to-one with that unlabeled first-modality image, determining a second label prediction value based on the second classification prediction probability distribution generated for it; determining whether the first label prediction value and the second label prediction value are consistent; if they are not consistent, not generating a credible pseudo-label; and if they are consistent, fusing the first label prediction value and the second label prediction value, generating a credible pseudo-label when the fused label prediction value is greater than a preset threshold, and otherwise not generating a credible pseudo-label.
For example, in a method according to an embodiment of the present disclosure, fusing the first label prediction value and the second label prediction value comprises: computing a weighted average of the first label prediction value and the second label prediction value to obtain the fused label prediction value.
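As a minimal sketch of the pseudo-label decision just described (consistency check, weighted-average fusion, confidence threshold); the equal weights and the 0.9 threshold below are assumptions for illustration, not values prescribed by the disclosure.

    import torch

    def credible_pseudo_label(p_wl, p_nbi, threshold=0.9, w_wl=0.5, w_nbi=0.5):
        # p_wl, p_nbi: classification prediction probability distributions of the same unlabeled
        # object under the two modalities. Returns the pseudo-label, or None if it is not credible.
        pred_wl = int(torch.argmax(p_wl))
        pred_nbi = int(torch.argmax(p_nbi))
        if pred_wl != pred_nbi:              # the two label predictions disagree
            return None
        fused = w_wl * p_wl + w_nbi * p_nbi  # weighted average of the two prediction values
        return pred_wl if float(fused[pred_wl]) > threshold else None

    # Example over the classes (hyperplasia, adenoma, cancer):
    label = credible_pseudo_label(torch.tensor([0.05, 0.92, 0.03]),
                                  torch.tensor([0.02, 0.95, 0.03]))   # -> 1 (adenoma)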
For example, according to a method of an embodiment of the present disclosure, the object is a polyp, and the endoscopic image is a polyp endoscopic image.
For example, in a method according to an embodiment of the present disclosure, the label comprises at least one of hyperplasia, adenoma and cancer.
For example, in a method according to an embodiment of the present disclosure, the first-modality image is a white light image and the second-modality image is a narrow-band light image.
For example, in a method according to an embodiment of the present disclosure, the first-modality image is a white light image and the second-modality image is an autofluorescence image.
For example, a method according to an embodiment of the present disclosure, wherein the encoder is a convolutional layer part of a residual neural network ResNet, the nonlinear mapper is composed of a two-layer multi-layer perceptron MLP, and the classifier is composed of a two-layer multi-layer perceptron MLP.
Embodiments of the present disclosure further provide an endoscope image classification method, including: acquiring an endoscope image to be identified; extracting an image feature representation of the endoscope image with an encoder in a trained endoscope image classification model; and inputting the extracted image feature representation into the corresponding classifier in the endoscope image classification model to obtain a classification result for the endoscope image; wherein the trained endoscope image classification model is obtained with the training method of an endoscope image classification model based on contrast learning according to an embodiment of the disclosure.
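As a hedged sketch of this classification method, assuming a trained PyTorch encoder and classifier following the structure of fig. 4 and an already preprocessed input tensor; the attribute and class names used here are illustrative assumptions.

    import torch

    @torch.no_grad()
    def classify_endoscope_image(image, encoder, classifier,
                                 class_names=("hyperplasia", "adenoma", "cancer")):
        # image: preprocessed tensor of shape (3, H, W).
        # Returns the predicted class name and the classification prediction probability distribution.
        encoder.eval(); classifier.eval()
        features = encoder(image.unsqueeze(0))            # image feature representation from the encoder
        features = torch.flatten(features, start_dim=1)   # flatten in case the encoder ends with pooling
        probs = torch.softmax(classifier(features), dim=1).squeeze(0)
        return class_names[int(torch.argmax(probs))], probs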
Embodiments of the present disclosure further provide an endoscope image classification system, comprising: an image acquisition component for acquiring an endoscope image to be recognized; a processing component for extracting an image feature representation of the endoscope image with an encoder in a trained endoscope image classification model and inputting the extracted image feature representation into the corresponding classifier in the endoscope image classification model to obtain a classification result for the endoscope image; and an output component for outputting the classification result for the image to be recognized, wherein the trained endoscope image classification model is obtained with the training method of an endoscope image classification model based on contrast learning according to an embodiment of the disclosure.
Embodiments of the present disclosure also provide a training apparatus for an endoscope image classification model based on contrast learning, the apparatus including: an image acquisition component for acquiring a first image set, which is a set of first-modality images of one or more objects acquired by an endoscope operating in a first modality, and for acquiring a second image set, which is a set of second-modality images of the one or more objects acquired by an endoscope operating in a second modality different from the first modality, the second-modality images corresponding one-to-one with the first-modality images; and a training component for inputting the first image set and the second image set into the endoscope image classification model as training data sets and training the endoscope image classification model to obtain a trained endoscope image classification model.
Embodiments of the present disclosure also provide an electronic device comprising a memory and a processor, wherein the memory has stored thereon a program code readable by the processor, which when executed by the processor performs the method according to any of the above methods.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer-executable instructions for performing the method according to any one of the above-described methods.
The training method of the semi-supervised endoscope image classification model based on contrast learning according to the embodiments of the present disclosure provides a new way of selecting positive and negative examples, making better use of the information in images of different endoscope modalities to enhance the classification accuracy of endoscope images. In addition, unlike the traditional SimCLR-based contrast learning approach, and in order to reduce the computation of the model, the embodiments of the present disclosure add a memory queue for dynamically storing negative examples. Finally, the embodiments of the disclosure provide a new semi-supervised learning mode in which data labels are dynamically added in the form of pseudo-labels to assist training, which saves labeling cost.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments of the present disclosure will be briefly described below. It is to be expressly understood that the drawings in the following description are directed to only some embodiments of the disclosure and are not intended as limitations of the disclosure.
FIG. 1 is a schematic diagram illustrating an architecture for applying the endoscopic image classification model training and the endoscopic image classification method in the embodiment of the present disclosure;
fig. 2 shows a schematic diagram of a conventional SimCLR-based contrast learning network architecture;
FIG. 3 shows images of the same polyp in two modalities according to an embodiment of the present disclosure;
FIG. 4 shows a schematic structure of an endoscopic image classification model 400 based on contrast learning according to an embodiment of the present disclosure;
FIG. 5 shows a flow diagram of a method of training an endoscopic image classification model according to an embodiment of the present disclosure;
FIG. 6 shows a specific exemplary illustration of the implementation described in step S505 of FIG. 5;
FIG. 7 depicts a flow chart of an endoscopic image classification method according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram illustrating the structure of an endoscopic image classification system in an embodiment of the present disclosure;
FIG. 9 illustrates a training apparatus for an endoscopic image classification model according to an embodiment of the present disclosure; and
FIG. 10 shows a schematic diagram of a storage medium according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the accompanying drawings, and obviously, the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without any creative effort also belong to the protection scope of the present application.
The terms used in the present specification are those general terms currently widely used in the art in consideration of functions related to the present disclosure, but they may be changed according to the intention of a person having ordinary skill in the art, precedent, or new technology in the art. Also, specific terms may be selected by the applicant, and in this case, their detailed meanings will be described in the detailed description of the present disclosure. Therefore, the terms used in the specification should not be construed as simple names but based on the meanings of the terms and the overall description of the present disclosure.
Although various references are made herein to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative and different aspects of the systems and methods may use different modules.
Flowcharts are used herein to illustrate the operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in the exact order shown. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Meanwhile, other operations may be added to the processes, or a certain step or several steps may be removed from them.
In the diagnosis of digestive tract diseases, an image of a lesion inside the digestive tract is usually acquired with a diagnostic tool such as an endoscope, and medical staff determine the type of lesion by observing it with the naked eye. To reduce the burden on doctors, some efforts have been made to automatically identify the lesion type by deep learning, but these efforts are based on fully supervised methods, i.e., they require a large amount of labeled image data, and labeling image data is enormously costly. Furthermore, they are trained using data from only a single modality, whereas in medical imaging the information observed in different modalities differs and is highly important.
Therefore, the present disclosure provides a training method for an endoscope image classification model based on contrast learning, which makes better use of the information in images of different endoscope modalities by adopting a new way of selecting positive and negative examples to learn image features at an abstract semantic level, thereby enhancing the classification accuracy of endoscope images. In addition, when labeled data are limited, data labels are dynamically added in the form of pseudo-labels to assist training, which better addresses the cost of manually collecting and labeling large training sets.
Fig. 1 is a schematic diagram illustrating an application architecture of an endoscopic image classification model training and an endoscopic image classification method in an embodiment of the present disclosure, and includes a server 100 and a terminal device 200.
The terminal device 200 may be a medical device, and for example, the user may view the endoscope image classification result based on the terminal device 200.
The terminal device 200 and the server 100 can be connected via a network to communicate with each other. Optionally, the network uses standard communication techniques and/or protocols. The network is typically the Internet, but can be any network, including but not limited to a Local Area Network (LAN), Metropolitan Area Network (MAN), Wide Area Network (WAN), a mobile, wired or wireless network, a private network, or any combination of virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats including Hypertext Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec), and so on. In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
The server 100 may provide various network services for the terminal device 200, wherein the server 100 may be a server, a server cluster composed of several servers, or a cloud computing center.
Specifically, the server 100 may include a processor 110 (CPU), a memory 120, an input device 130, an output device 140, and the like; the input device 130 may include a keyboard, a mouse, a touch screen, and the like, and the output device 140 may include a display device, such as a Liquid Crystal Display (LCD), a Cathode Ray Tube (CRT), and the like.
Memory 120 may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides processor 110 with program instructions and data stored in memory 120. In the embodiment of the present disclosure, the memory 120 may be used to store a program of an endoscopic image classification model training method or an endoscopic image classification method in the embodiment of the present disclosure.
The processor 110 is configured to execute the steps of any one of the endoscope image classification model training methods or endoscope image classification methods according to the obtained program instructions by calling the program instructions stored in the memory 120.
For example, in the embodiment of the present disclosure, the endoscope image classification model training method or the endoscope image classification method is mainly performed by the server 100. For example, for the endoscope image classification method, the terminal device 200 may transmit acquired images of multiple modalities of a digestive tract lesion (e.g., a polyp) to the server 100, the server 100 performs type recognition on the lesion images, and the lesion classification result may be returned to the terminal device 200.
As shown in fig. 1, the application architecture is described taking execution on the server 100 side as an example, but the endoscope image classification method in the embodiments of the present disclosure may of course also be executed by the terminal device 200; for example, the terminal device 200 may obtain a trained endoscope image classification model from the server 100 and perform type recognition on lesion images based on that model to obtain lesion classification results, which is not limited by the embodiments of the present disclosure.
In addition, the application architecture diagram in the embodiment of the present disclosure is for more clearly illustrating the technical solution in the embodiment of the present disclosure, and does not limit the technical solution provided by the embodiment of the present disclosure, and of course, for other application architectures and business applications, the technical solution provided by the embodiment of the present disclosure is also applicable to similar problems.
The various embodiments of the present disclosure are schematically illustrated as applied to the application architecture diagram shown in fig. 1.
First, in order to make the principles of the present disclosure more clearly understood by those skilled in the art, a brief description of the basic concept of contrast learning is given below.
Contrast learning belongs to unsupervised learning; it is characterized by not requiring manually annotated category label information, instead directly using the data themselves as supervision information to learn feature representations of sample data that are then used for downstream tasks, such as classifying the types of polyp images. In contrast learning, representations are learned by making comparisons between input samples. Contrast learning does not learn a signal from a single data sample at a time, but learns by comparing different samples. Comparisons are made between positive pairs of "similar" inputs and negative pairs of "different" inputs. Contrast learning learns by simultaneously maximizing the agreement between different transformed views (e.g., cropping, flipping, color transformation, etc.) of the same image and minimizing the agreement between transformed views of different images. In short, after the same image has been subjected to various transformations, contrast learning should still be able to recognize it, so the similarity between its transformed views is maximized (because they come from the same image). Conversely, if the images are different (even if they appear very similar after various transformations), the similarity between them is minimized. With such contrastive training, the encoder can learn higher-level generic features of the image (e.g., image-level features) rather than a generative model of the image (e.g., pixel-level generation).
Fig. 2 shows a schematic diagram of a conventional SimCLR-based contrast learning network architecture.
As shown in fig. 2, the conventional SimCLR model architecture is composed of two symmetric branches (Branch), each provided with an encoder and a nonlinear mapper. SimCLR proposes a way of constructing positive and negative examples, and its basic idea is as follows: a batch of N images X = {x_1, x_2, x_3, …, x_N} is input (N being a positive integer larger than 1). One of the images x is randomly transformed twice (by image enhancement including, for example, cropping, flipping, color transformation and Gaussian blur) to obtain two images x_i and x_j. Enhancing all N images X of the batch in this way yields two batches of images X_i and X_j; the two batches X_i and X_j each contain N images, and there is a one-to-one correspondence between the images of the two batches. For example, the data pair <x_i, x_j> obtained by transforming the image x are positive examples of each other, and for x_i the remaining 2N-2 images are negative examples. After transformation, the enhanced images are projected into the representation space. Taking the upper branch as an example, the enhanced image x_i first passes through a feature encoder (Encoder, typically using a deep residual network (ResNet) as the model structure, represented here by the function f(·)) and is converted into a corresponding feature representation h_i. A nonlinear projector (Non-linear Projector, consisting of a two-layer multi-layer perceptron (MLP), represented here by the function g(·)) then maps the feature representation h_i further to a vector z_i in another space. Thus, through the two nonlinear transformations g(f(·)), the enhanced image is projected into the representation space. The process of the lower branch is similar and is not repeated here.
Unsupervised learning of image features can be achieved by computing and maximizing the similarity between the mapped features of positive examples while minimizing the similarity between the mapped features of negative examples. In SimCLR the similarity between two enhanced images is computed using cosine similarity: for the two enhanced images x_i and x_j, the cosine similarity is computed on their projected representations z_i and z_j. Ideally, the similarity between an enhanced pair of images (referred to here as a positive example, e.g. <x_i, x_j>) will be high, while the similarity between either image of the pair and the other images in the two batches will be low.
The loss function for contrast learning may be defined based on the similarity between positive and negative examples; SimCLR uses the contrast loss InfoNCE, as shown in equation (1) below:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\left(z_i \cdot z_j / \tau\right)}{\sum_{k \in I,\, k \neq i} \exp\left(z_i \cdot z_k / \tau\right)} \qquad (1)$$

where z_i denotes a feature after the nonlinear mapping, z_j denotes the positive example corresponding to z_i, z_k ranges over all features other than z_i (including the positive example and the negative examples), I denotes all images, "·" denotes the dot product, and τ is a temperature parameter that helps prevent the model from falling into a local optimum early in training and aids convergence as training proceeds.
By optimizing the above contrast loss function InfoNCE, it is possible to maximize the similarity between positive examples and minimize the similarity between negative examples, and the essential features of the image can be learned in an unsupervised environment.
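A minimal PyTorch-style sketch of equation (1), assuming L2-normalized projected features so that the dot product corresponds to the cosine similarity used by SimCLR; the temperature value and tensor shapes are assumptions of this sketch.

    import torch
    import torch.nn.functional as F

    def info_nce(z_i, z_j, z_negatives, tau=0.07):
        # z_i, z_j: projected features of a positive pair, shape (D,).
        # z_negatives: the other projected features (negative examples), shape (M, D).
        z_i, z_j = F.normalize(z_i, dim=0), F.normalize(z_j, dim=0)
        z_negatives = F.normalize(z_negatives, dim=1)
        pos = torch.exp(torch.dot(z_i, z_j) / tau)          # similarity with the positive example
        neg = torch.exp(z_negatives @ z_i / tau).sum()      # similarities with all other features
        return -torch.log(pos / (pos + neg))                # denominator runs over all k != i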
Conventional contrast learning models (such as the SimCLR model introduced above) obtain a positive pair by enhancing the same image. However, image enhancement methods such as cropping, flipping, color transformation and Gaussian blurring are essentially only a data augmentation of the real image, i.e., they generate artificial data that provide no more feature information than the original image. Such conventional image enhancement is ill-suited to classifying endoscope images: varying endoscope illumination conditions and differences in texture and appearance make recognition difficult, and polyps, for example, differ greatly in color, shape and size, show large color variation between polyps, and have limited visibility of surface texture, so polyp inspection based only on image enhancement leads to a high false detection rate.
Since in medical imaging the information observed in different modalities differs and is very important, the present disclosure proposes a new way of selecting positive and negative examples for contrast learning in order to better learn the essential features of endoscope imagery. Specifically, unlike the conventional contrast learning method based on image enhancement, the method of the disclosure uses images of the same digestive tract lesion in different modalities as a pair of positive examples for contrast learning, so that richer features of the same lesion in different modalities can be obtained, which is more conducive to learning the essential features of the lesion. Hereinafter, the technical solutions of the embodiments of the present disclosure are schematically described using polyp images as an example. It should be noted that the technical solutions provided by the embodiments of the present disclosure are also applicable to other endoscope images.
Fig. 3 shows images of the same polyp in two modalities, according to an embodiment of the present disclosure.
As shown in fig. 3, the image on the left is an observation of a polyp acquired by operating the endoscope in White Light (WL) Imaging mode, and the image on the right is another observation of the same polyp acquired by operating the endoscope in Narrow Band Imaging (NBI) mode.
The broadband spectrum of white light is composed of three kinds of light, R/G/B (red/green/blue), with wavelengths of 605 nm, 540 nm and 415 nm respectively. The white light imaging mode presents a bright, sharp white-light endoscope image, which facilitates observation of the structure of the deep mucosal layer. The narrow-band light mode replaces the traditional broadband filter with a narrow-band filter, restricting the light of different wavelengths and leaving only the green and blue narrow-band light waves at 540 nm and 415 nm. The contrast of blood vessels relative to the mucosa in images generated in the narrow-band light mode is markedly enhanced, making this mode suitable for observing the vessel morphology and mucosal structure of the superficial mucosa. The high contrast between blood vessels and surrounding mucosa helps detect and characterize lesions, even suspicious lesions that show high vascularization in deeper tissue layers. Images of the capillaries are less blurred than in white-light endoscopy, reducing the likelihood of missed lesions.
According to one embodiment of the present disclosure, by replacing the conventional enhanced images with images of the same polyp in different modalities (e.g., a white light image and a narrow-band light image), richer features of the polyp can be learned, which benefits classifying polyp images based on the learned features.
It should be understood that the modality image herein may also be any other type of modality image, such as autofluorescence image, I-SCAN image, etc., and the present disclosure is not limited thereto.
Fig. 4 shows a schematic structure of an endoscopic image classification model 400 based on contrast learning according to an embodiment of the present disclosure.
As shown in fig. 4, the structure of the endoscope image classification model 400 according to the embodiment of the present disclosure is divided into a contrast learning submodel 401 and a classifier submodel 402. As shown in the figure, the contrast learning submodel 401 may include, for example, an upper branch and a lower branch. Here, for convenience of description, the upper and lower branches are referred to as a first learning module 401-1 and a second learning module 401-2, respectively. For example, the first learning module 401-1 includes a first encoder and a first nonlinear mapper connected in sequence, and the second learning module 401-2 includes a second encoder and a second nonlinear mapper connected in sequence.
According to an embodiment of the present disclosure, for example, the first encoder and the second encoder may have the same structure. For example, the encoder here may be a convolutional layer part of a ResNet network. For example, the first nonlinear mapper and the second nonlinear mapper may have the same structure. For example, the nonlinear mapper may be a two-layer Multilayer Perceptron (MLP).
In addition, the contrast learning submodel 401 includes a memory queue for storing feature vectors of a plurality of recently trained batches.
The other classifier submodel 402 comprises two classifiers coupled to the outputs of the two encoders in the contrast learning submodel 401, respectively, for performing further classification tasks based on the feature representations generated by the encoders.
According to one embodiment of the present disclosure, the classifiers herein may have the same structure, for example. For example, the classifier here may be a two-layered multi-layered perceptron MLP.
Those skilled in the art will appreciate that the encoder, nonlinear mapper and classifier used here may be replaced with other architectures, and the disclosure is not limited in this respect.
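For concreteness only, one branch of model 400 (encoder = convolutional part of a ResNet, nonlinear mapper = two-layer MLP, classifier = two-layer MLP) could be instantiated roughly as follows; the ResNet depth and the hidden, projection and class dimensions are assumptions rather than values fixed by the disclosure.

    import torch.nn as nn
    from torchvision.models import resnet50

    def build_branch(proj_dim=128, num_classes=3):
        # Returns (encoder, nonlinear mapper, classifier) for one branch of model 400.
        backbone = resnet50(weights=None)
        feat_dim = backbone.fc.in_features                            # 2048 for ResNet-50
        encoder = nn.Sequential(*list(backbone.children())[:-1],      # convolutional layers + pooling
                                nn.Flatten())
        mapper = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
                               nn.Linear(feat_dim, proj_dim))         # two-layer MLP, g(.)
        classifier = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(inplace=True),
                                   nn.Linear(512, num_classes))       # two-layer MLP classifier head
        return encoder, mapper, classifier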
In the following, a method for training an endoscope image classification model and an endoscope classification method provided according to at least one embodiment of the present disclosure are described in a non-limiting manner by using several examples or embodiments, and as described below, different features of these specific examples or embodiments may be combined with each other without mutual conflict, so as to obtain new examples or embodiments, which also belong to the scope of protection of the present disclosure.
Currently, the mainstream methods for automatically identifying polyps based on deep learning are mostly fully supervised learning methods, which rely on manually annotated labels. In practice, however, the polyp images obtained are unlabeled, and labeling the data is enormously costly. Therefore, the present disclosure proposes a semi-supervised training mode that assists training by dynamically adding data labels in the form of pseudo-labels. In addition, by using images of the same polyp in different modalities, richer feature information can be extracted.
Fig. 5 shows a flowchart of a method of training an endoscopic image classification model according to an embodiment of the present disclosure. The endoscopic image classification model is, for example, the endoscopic image classification model 400 as described above with reference to fig. 4. For example, the training method of the endoscope image classification model 400 may be performed by a server, which may be the server 100 shown in fig. 1.
First, in step S501, a first image set is acquired, which is a set of first-modality images of one or more objects acquired by an endoscope operating in a first modality. Next, in step S503, a second image set is acquired, which is a set of second-modality images of the one or more objects acquired by an endoscope operating in a second modality different from the first modality, the second-modality images corresponding one-to-one with the first-modality images.
For example, one or more of the objects here may be polyps. For example, the first-modality image may be a white light image, and the second-modality image may be a narrow-band light image. Of course, other modality images may be used, for example white light imaging for the first modality and autofluorescence imaging or I-SCAN imaging for the second modality, and so on, which is not limited by the present disclosure. For example, the multi-modal images may be obtained by operating an endoscope, by downloading over a network, or in other ways, which is not limited by the embodiments of the present disclosure.
It should be understood that embodiments of the present disclosure may also be equally applicable to image classification of other digestive tract lesions besides polyps, such as inflammation, ulcers, vascular malformations, and diverticula, etc., and the present disclosure is not limited thereto.
For example, in order to mimic the reality that polyp data largely lack labels, a large amount of the data in the first and second image sets is unlabeled; since the first-modality images in the first set and the second-modality images in the second set correspond one-to-one, the presence or absence of a label also corresponds one-to-one. For example, according to the embodiments of the present disclosure, polyps can be classified according to the NICE classification criteria into hyperplastic polyps, adenomas (including mucosal carcinoma and superficial submucosal invasive carcinoma), and deep submucosal invasive carcinoma, and the training data may be labeled briefly as hyperplasia, adenoma and cancer.
For example, in one implementation of the training method of an endoscope image classification model according to an embodiment of the present disclosure, the first and second data sets may include 1302 white light images and the corresponding 1302 narrow-band light images, respectively. To reflect the situation in which a large proportion of labels are absent from real data sets, 90% of the labels can be removed at random and only 10% retained, realizing semi-supervised learning.
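The 90%/10% split above can be emulated as follows; the marker value -1 for unlabeled images and the fixed random seed are assumptions of this sketch.

    import numpy as np

    def mask_labels(labels, keep_ratio=0.10, seed=0):
        # labels: integer class indices. Randomly keep only keep_ratio of the labels and mark the
        # rest as -1 (unlabeled), emulating a real data set in which most images carry no annotation.
        labels = np.asarray(labels).copy()
        rng = np.random.default_rng(seed)
        keep = rng.random(len(labels)) < keep_ratio
        labels[~keep] = -1
        return labels

Because the white light and narrow-band images correspond one-to-one, the same mask applies to both image sets.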
It should be understood that the size of the data sets and the label ratio used for training the endoscope image classification model according to the embodiments of the present disclosure may be adjusted according to the actual situation, and the present disclosure does not limit this. For unlabeled images, the embodiments of the present disclosure dynamically add data labels to assist training by means of pseudo-labels; details are described later with reference to fig. 6.
Next, in step S505, the first image set and the second image set are input as training data sets into the endoscope image classification model, and the endoscope image classification model is trained to obtain a trained endoscope image classification model.
As is well known to those skilled in the art, machine learning algorithms typically rely on a process of maximizing or minimizing an objective function, often referred to as a loss function. For example, in the training method of the endoscope image classification model according to the embodiment of the present disclosure, training the endoscope image classification model to obtain the trained endoscope image classification model may include: and training the endoscope image classification model until the joint loss function of the endoscope image classification model converges to obtain the trained endoscope image classification model.
As described above, in conventional contrast learning, at each training iteration N images are randomly selected from the training set to form a batch, and for each image in the batch a positive pair is constructed by the image enhancement method described above, i.e., two enhanced views are generated for each image. Two batches of images are thus generated, each comprising N images, with a one-to-one correspondence between the images of the two batches, where each pair of images consists of enhanced views of the same original image. In conventional contrast learning, the 2N images of the two batches are obtained by applying image enhancement to the original images, but the data generated this way are artificial. Accordingly, the disclosed embodiments use images of the same digestive tract lesion (e.g., a polyp) in two different modalities instead of the two enhanced views of conventional contrast learning, which provides a richer representation of the lesion's features, so that a network well trained on such a training set can classify polyps more accurately.
For example, at each training iteration, a first batch of first-modality images is selected from the first image set and input into the first learning module 401-1 of fig. 4, and a second batch of second-modality images corresponding one-to-one with the first batch of first-modality images is selected from the second image set and input into the second learning module 401-2 of fig. 4.
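A sketch of how such paired batches could be drawn while keeping the one-to-one correspondence between the two modalities; the class and variable names are assumptions, not terms of the disclosure.

    from torch.utils.data import Dataset, DataLoader

    class PairedModalityDataset(Dataset):
        # Each item is (first-modality image, corresponding second-modality image, shared label);
        # a label of -1 marks an unlabeled pair.
        def __init__(self, wl_images, nbi_images, labels):
            assert len(wl_images) == len(nbi_images) == len(labels)
            self.wl, self.nbi, self.labels = wl_images, nbi_images, labels

        def __len__(self):
            return len(self.wl)

        def __getitem__(self, idx):
            return self.wl[idx], self.nbi[idx], self.labels[idx]

    # Each batch drawn from such a dataset then contains matched pairs for the two learning modules:
    # loader = DataLoader(PairedModalityDataset(wl, nbi, y), batch_size=32, shuffle=True)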
The endoscope classification method based on contrast learning according to the embodiments of the present disclosure adopts a new way of selecting positive and negative examples, makes better use of the information in images of different endoscope modalities to learn image features at an abstract semantic level, and enhances the classification accuracy of endoscope images. When labeled data are limited, data labels are dynamically added in the form of pseudo-labels to assist training, which better addresses the cost of manually collecting and labeling large training sets.
Referring to fig. 6, the implementation described in step S505 is illustrated in a specific, exemplary manner in conjunction with the endoscope image classification model 400 shown in fig. 4.
As shown in fig. 6, in step S601, unsupervised contrast learning is performed using the contrast learning submodel to generate a first feature representation and a second feature representation of the first batch for the first batch of first-modality images, and a first feature representation and a second feature representation of the second batch for the second batch of second-modality images.
For example, the contrast learning process here is generally similar to the conventional SimCLR process described above. Specifically, referring to fig. 4 and taking the first learning module 401-1 (i.e., the upper branch) as an example, after the first batch of first-modality images is selected from the first image set and input into the first learning module 401-1, the first encoder converts each image in the batch into a first feature representation to obtain the first feature representation of the first batch, and the first nonlinear mapper then nonlinearly maps each first feature representation to obtain the second feature representation of the first batch. The first feature representation here may be, for example, the encoder output h described above, and the second feature representation may be, for example, the nonlinearly mapped output z described above.
The processing in the second learning module 401-2 (i.e., the lower branch) is the same: after the second batch of second-modality images is selected from the second image set and input into the second learning module 401-2, the second encoder converts each image in the batch into a first feature representation to obtain the first feature representation of the second batch, and the second nonlinear mapper then nonlinearly maps each first feature representation to obtain the second feature representation of the second batch.
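A minimal sketch of one such branch is shown below, assuming a recent PyTorch/torchvision environment, a ResNet backbone as the encoder, and a two-layer MLP as the nonlinear mapper; the exact backbone, layer sizes, and class name are assumptions rather than details taken from the disclosure.

```python
import torch.nn as nn
import torchvision.models as models

class LearningBranch(nn.Module):
    """One contrast-learning branch: encoder followed by a nonlinear mapper.
    Sketch only; backbone choice and projection size are assumptions."""
    def __init__(self, feature_dim=2048, proj_dim=128):
        super().__init__()
        backbone = models.resnet101(weights=None)
        backbone.fc = nn.Identity()               # expose the pooled features
        self.encoder = backbone                   # produces the first feature representation h
        self.mapper = nn.Sequential(              # produces the second feature representation z
            nn.Linear(feature_dim, feature_dim), nn.ReLU(inplace=True),
            nn.Linear(feature_dim, proj_dim))

    def forward(self, x):
        h = self.encoder(x)
        z = self.mapper(h)
        return h, z
```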
For example, unsupervised contrast learning according to embodiments of the present disclosure employs the unsupervised contrast loss function InfoNCE described above as the loss function. For example, the contrast-learned loss function InfoNCE is based on a similarity between the second feature representation of the first batch and the second feature representation of the second batch and a similarity between the second feature representation of the first batch and a plurality of second feature representations stored in a memory queue generated during a previous iteration of training.
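The following is a minimal sketch of such an InfoNCE computation, where the positive pair is the cross-modality pair and the negatives come from the memory queue; the tensor names and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_with_queue(z_first, z_second, queue_features, temperature=0.07):
    """InfoNCE sketch: z_first / z_second are the second feature representations
    of the two modality batches (N x D); queue_features (K x D) are projections
    stored from earlier iterations and used as negatives."""
    z_first = F.normalize(z_first, dim=1)
    z_second = F.normalize(z_second, dim=1)
    queue_features = F.normalize(queue_features, dim=1)

    pos = torch.sum(z_first * z_second, dim=1, keepdim=True)   # (N, 1) cross-modal positives
    neg = z_first @ queue_features.t()                         # (N, K) queue negatives

    logits = torch.cat([pos, neg], dim=1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)                    # the positive sits at index 0
```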
In step S603, the second feature representation of the first batch and the second feature representation of the second batch are stored in the memory queue based on a first-in-first-out rule.
As described above, at each training iteration the conventional SimCLR takes, within the two input batches of 2N images, the 2N-2 images other than the two enhanced views of the current image as negative examples. Unlike conventional SimCLR, the disclosed embodiments add a memory queue that stores the image features of previously trained batches (e.g., the second feature representation of the first batch and the second feature representation of the second batch) as additional negative examples. More negative examples cover the underlying distribution more effectively and therefore give a better training signal, which helps the model extract good features. The memory queue follows a first-in-first-out rule, that is, it is dynamic: after a new batch of training features is enqueued, the oldest batch is dequeued.
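A first-in-first-out queue of this kind could be sketched as follows; the queue length is an assumed hyperparameter.

```python
from collections import deque
import torch

class MemoryQueue:
    """FIFO store for the second feature representations of previously trained
    batches; the oldest batch is dropped automatically when the queue is full."""
    def __init__(self, max_batches=64):
        self.batches = deque(maxlen=max_batches)

    def enqueue(self, z_batch):
        self.batches.append(z_batch.detach())   # negatives carry no gradient

    def features(self):
        # assumes at least one batch has already been enqueued
        return torch.cat(list(self.batches), dim=0)
```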
In step S605, classification training is performed using the classifier submodel to generate a first classification prediction probability distribution for each image in the first batch of first-modality images, thereby obtaining the first classification prediction probability distributions of the first batch, and to generate a second classification prediction probability distribution for each image in the second batch of second-modality images, thereby obtaining the second classification prediction probability distributions of the second batch.
As shown in fig. 4, the outputs of the two encoders of the contrast learning submodel are connected to two classifiers, respectively; for example, the first classifier may receive the first feature representation of the first batch from the first encoder, and the second classifier may receive the first feature representation of the second batch from the second encoder. The first classifier and the second classifier can then be used for classification training based on the received feature representations.
Each classifier outputs a prediction probability distribution for every input image. Specifically, the first classifier outputs a prediction probability distribution for each image of the first batch of first-modality images based on the first feature representation of the first batch received from the first encoder. Similarly, the second classifier outputs a prediction probability distribution for each image of the second batch of second-modality images based on the first feature representation of the second batch received from the second encoder. For example, suppose polyps are to be classified as hyperplasia, adenoma, or cancer; when an image labeled as hyperplasia is input and the classifier outputs the probability distribution [0.6, 0.3, 0.1], this means the classifier predicts a probability of 0.6 for hyperplasia, 0.3 for adenoma, and 0.1 for cancer.
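A classifier of this kind can be sketched as a small head on top of the encoder output that ends in a softmax; the layer sizes and class count below are assumptions.

```python
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Maps the first feature representation h to a probability distribution
    over the classes (e.g. hyperplasia / adenoma / cancer). Sketch only."""
    def __init__(self, feature_dim=2048, num_classes=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, num_classes))

    def forward(self, h):
        return self.mlp(h).softmax(dim=1)   # e.g. [0.6, 0.3, 0.1]
```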
For a labeled image, the loss function for classification training can be determined from the true label and the predicted probability distribution of the image. Classification prediction is also performed on unlabeled images, but the prediction result is only used later to determine a pseudo label for the unlabeled image; once a trusted pseudo label is determined, the image is added to the training set as labeled data for subsequent iterative training, so no loss value needs to be computed for unlabeled images. This process is described in more detail in the following paragraphs.
For example, because the distribution of polyp classes is imbalanced, embodiments of the present disclosure may use a focal loss function as the loss function for classification training, as shown in equation (2) below.
FL(p_t) = -(1 - p_t)^γ · log(p_t)    (2)

where p_t is the predicted probability of the true class and γ is an adjustable weight.
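A focal loss of the form in equation (2) could be sketched as follows, operating directly on the predicted probability distributions; the default value of gamma is an assumption.

```python
import torch

def focal_loss(probs, targets, gamma=2.0, eps=1e-8):
    """Focal loss sketch: probs is an (N, C) tensor of predicted probability
    distributions, targets is an (N,) tensor of true class indices, and gamma
    is the adjustable weight of equation (2)."""
    p_t = probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # probability of the true class
    return (-(1.0 - p_t) ** gamma * torch.log(p_t + eps)).mean()
```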
Of course, other types of loss functions, such as cross-entropy loss functions, may be adopted according to the distribution of the training set, and the disclosure is not limited thereto.
For example, the focal loss determined by classification training on the white light images is denoted L_WL, and the focal loss determined by classification training on the narrow-band light images is denoted L_NBI.
In step S607, a joint loss function is calculated based on the second feature representation of the first batch and the second feature representation of the second batch, together with the first classification prediction probability distributions of the first batch and the second classification prediction probability distributions of the second batch, and the parameters of the endoscope image classification model are adjusted according to the joint loss function.
For example, the joint loss function herein may be determined as the sum of the loss function of the contrast learning submodel and the loss function of the classifier submodel, as shown in equation (3) below:
L_joint = L_InfoNCE + L_WL + L_NBI    (3)
Accordingly, the parameters of the endoscope image classification model shown in fig. 4 may be adjusted based on the joint loss function above, so that the joint loss function is ultimately minimized as iterative training continues.
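Putting the pieces together, one training step might look like the sketch below, reusing the illustrative info_nce_with_queue, focal_loss, and MemoryQueue helpers above and assuming the memory queue already holds features from earlier iterations; only labeled images contribute to the classification terms, and all argument names are assumptions.

```python
def training_step(z_first, z_second, probs_first, probs_second,
                  labels, labeled_mask, queue, optimizer):
    """One parameter update per equation (3): contrast loss plus the two
    focal classification losses. Sketch only; names are illustrative."""
    loss = info_nce_with_queue(z_first, z_second, queue.features())
    if labeled_mask.any():
        loss = loss + focal_loss(probs_first[labeled_mask], labels[labeled_mask])
        loss = loss + focal_loss(probs_second[labeled_mask], labels[labeled_mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # step S603: store the new projections as future negatives (FIFO)
    queue.enqueue(z_first)
    queue.enqueue(z_second)
    return loss.item()
```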
In step S609, it is determined whether trusted pseudo labels are generated for the unlabeled images in the first batch of first-modality images and the unlabeled images in the second batch of second-modality images.
As mentioned above, because real datasets often contain many unlabeled samples, a semi-supervised training method is proposed herein: during training, trusted pseudo labels are generated for unlabeled data, which are then added to the training set and used as labeled data for continued training.
For example, a trusted pseudo label may be generated for each pair of input images by combining the outputs of the two classifiers. As described above, the first classifier generates first prediction probability distributions for the first batch of white light images, and the second classifier generates second prediction probability distributions for the second batch of narrow-band light images. For an unlabeled image, a label prediction value is first determined from the prediction probability distribution. For example, if for one unlabeled white light image in the first batch the first classifier predicts 60% hyperplasia, 20% adenoma, and 10% cancer, the probability value of the most probable class (here 60%, for hyperplasia) can be taken as the label prediction value of that unlabeled image. Likewise, if for the corresponding unlabeled narrow-band light image the second classifier predicts 60% hyperplasia, 10% adenoma, and 20% cancer, the probability value of the most probable class (again 60%, for hyperplasia) can be taken as its label prediction value.

For each pair of corresponding unlabeled images, it is then judged whether the label prediction values produced by the two classifiers agree. If they do not, no trusted pseudo label is generated for the pair. If they agree (for example, both label prediction values are 60%), the two values are fused, for instance by adding them linearly and dividing by 2; other data fusion schemes may of course be used, and the disclosure is not limited in this respect. A trusted pseudo label is generated when the fused label prediction value exceeds a predetermined threshold (e.g., 0.85), and is not generated otherwise.
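For a single pair of unlabeled images, this check could be sketched as follows; here the agreement test is read as both classifiers predicting the same class, the fusion is the simple average described above, and the 0.85 threshold is the example value from the text.

```python
def trusted_pseudo_label(probs_wl, probs_nbi, threshold=0.85):
    """Pseudo-label sketch for one unlabeled image pair: probs_wl / probs_nbi
    are the two classifiers' probability vectors (torch tensors) for the pair.
    Returns the fused class index, or None if no trusted pseudo label arises."""
    p_wl, cls_wl = probs_wl.max(dim=0)
    p_nbi, cls_nbi = probs_nbi.max(dim=0)
    if int(cls_wl) != int(cls_nbi):
        return None                          # the two predictions disagree
    fused = (p_wl + p_nbi) / 2               # linear fusion of the label prediction values
    return int(cls_wl) if fused > threshold else None
```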
Next, in step S611, if it is determined that trusted pseudo labels have been generated for unlabeled images in the first batch of first-modality images and the corresponding unlabeled images in the second batch of second-modality images, the first-modality images and corresponding second-modality images for which trusted pseudo labels were generated are added to the first image set and the second image set, respectively, to form a new first image set and a new second image set, thereby updating the training data set.
Finally, in step S613, iterative training is continued on the adjusted endoscope image classification model using the new first image set and the new second image set as a new training data set.
The joint loss function is continuously optimized during training; when it is minimized and converges, training of the image classification model is determined to be complete. Of course, if no trusted pseudo label is generated for any unlabeled image in the first batch of first-modality images or any unlabeled image in the second batch of second-modality images, the next training iteration is still performed with the original first image set and second image set as the training set.
The contrast-learning-based endoscope classification method according to embodiments of the present disclosure therefore adopts a new way of selecting positive and negative examples, makes better use of the information carried by images in different endoscope modalities, learns features at an abstract semantic level, and improves the classification accuracy of white light images. Meanwhile, a dynamic memory queue is added to the conventional SimCLR model so that contrast learning can store more negative samples, covering the underlying distribution more effectively and yielding a better training effect. In addition, when labeled data are limited, pseudo labels are used to dynamically add data labels that assist training, which alleviates the cost of manually collecting and labeling a large training set.
Based on the endoscope image classification model trained as above, the embodiments of the present disclosure further provide an endoscope image classification method. Taking a white light image as the image to be recognized as an example, a flowchart of the endoscope image classification method in an embodiment of the present disclosure is described with reference to fig. 7. The method includes:
in step S701, an endoscopic image to be recognized is acquired.
For example, if the trained image classification model is for polyp type recognition, the endoscope image to be recognized is an acquired polyp image.
With the method of training the endoscope image classification model in the above embodiments, the embodiments of the present disclosure classify endoscope images using only the encoder and the classifier in the trained endoscope image classification model, since images of different modalities complement each other in their features and assist recognition during training. For example, if the upper and lower branches were trained on white light images and narrow-band light images respectively, embodiments of the present disclosure use the encoder and classifier of the upper branch or those of the lower branch depending on whether the endoscope image to be recognized is a white light image or a narrow-band light image.
In step S703, an image feature representation of the endoscopic image is extracted based on an encoder in the trained endoscopic image classification model. The encoder here may be, for example, a ResNet101 network. The specific feature representation extraction process is well known to those skilled in the art and will not be described herein.
In step S705, the extracted image feature representation is input to the corresponding classifier in the trained endoscope image classification model to obtain the classification result of the endoscope image.
The encoder and the classifier are obtained through mutually assisted training on endoscope images of different modalities of the same lesion. Specifically, for example, the encoder and classifier in the upper branch, which classify white light images, are trained with the assistance of the encoder and classifier in the lower branch, which are based on narrow-band light images, so the upper-branch encoder and classifier can achieve more accurate and reliable classification results on white light images. For example, when a white light image acquired by an endoscope operating in white light mode is recognized with the trained endoscope image classification model of the present disclosure, the white light image may be input to the first encoder in the upper branch to extract a first feature representation, and that first feature representation may be input to the first classifier connected to the first encoder for classification. For a white light image of an adenoma, for instance, the first classifier may output a predicted probability distribution of 10% hyperplasia, 80% adenoma, and 10% cancer.
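At inference time the upper branch alone is used for a white light image, which might look like the sketch below; the module and function names are illustrative assumptions.

```python
import torch

def classify_white_light_image(image_tensor, first_encoder, first_classifier):
    """Inference sketch: run one white light image through the trained
    upper-branch encoder and classifier only."""
    first_encoder.eval()
    first_classifier.eval()
    with torch.no_grad():
        h = first_encoder(image_tensor.unsqueeze(0))   # first feature representation
        probs = first_classifier(h)                    # e.g. [0.10, 0.80, 0.10]
    return probs.squeeze(0)
```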
Similarly, the encoder and classifier in the lower branch achieve more accurate and reliable classification results when classifying narrow-band light images, and the details are not repeated here. Moreover, if the trained endoscope image classification model was learned on other modality images, for example a first modality of autofluorescence images and a second modality of I-SCAN images, then the upper-branch encoder and its connected classifier achieve more accurate and reliable results when classifying autofluorescence images, and the lower-branch encoder and its connected classifier achieve more accurate and reliable results when classifying I-SCAN images.
Based on the above embodiments, fig. 8 shows a schematic structural diagram of an endoscope image classification system 800 according to an embodiment of the present disclosure. The endoscope image classification system 800 includes at least an image acquisition component 801, a processing component 802, and an output component 803. In the embodiment of the present disclosure, the image acquisition component 801, the processing component 802, and the output component 803 are related medical devices. They may be integrated in the same medical device, or distributed over a plurality of devices that communicate with one another to form a medical system. For example, for diagnosing digestive tract diseases, the image acquisition component 801 may be an endoscope, and the processing component 802 and the output component 803 may be computer devices communicating with the endoscope.
Specifically, the image acquisition section 801 is used to acquire an image to be recognized. The processing component 802 is configured to extract image feature information of an image to be recognized, and obtain a lesion classification result of the image to be recognized based on the feature information of the image to be recognized. The output section 803 is used to output the classification result of the image to be recognized.
Fig. 9 shows a training apparatus of an endoscopic image classification model according to an embodiment of the present disclosure, which specifically includes a training data set acquisition component 901 and a training component 903.
The training data set acquisition section 901 is configured to: acquiring a first set of images, the first set of images being a set of first modality imagery images of one or more objects acquired by an endoscope operating at a first modality; and acquiring a second set of images, the second set of images being a set of second modality imagery images of the one or more objects acquired by an endoscope operating in a second modality different from the first modality, the second modality imagery images corresponding one-to-one to the first modality imagery images; and training component 903 for: and inputting the first image set and the second image set into the endoscope image classification model as training data sets, and training the endoscope image classification model to obtain a trained endoscope image classification model.
For example, the training component 903 is a semi-supervised training component, images of a first subset of the first set of images have labels labeling endoscopic image classes, and other images of the first set of images have no labels labeling endoscopic image classes; and the images of the second subset in the second image set, which correspond to the images of the first subset one by one, have the same label marking the endoscope image category, and the other images of the second image set do not have the label marking the endoscope image category.
For example, wherein the endoscope image classification model comprises: a comparative learning submodel, the comparative learning submodel comprising: a first learning module for receiving the first set of images and learning the first set of images to obtain a first feature representation and a second feature representation of the first set of images; a second learning module for receiving the second set of images and learning the second set of images to obtain a first feature representation and a second feature representation of the second set of images; a memory queue for storing second feature representations of the first set of images generated by the first learning module and second feature representations of the second set of images generated by the second learning module; a classifier submodel comprising: a first classifier submodel for performing classification learning according to the first feature representation of the first image set generated by the first learning module to generate a classification prediction probability distribution of each image in the first image set; and the second classifier submodel is used for performing classification learning according to the first feature representation of the second image set generated by the second learning module so as to generate a classification prediction probability distribution of each image in the second image set.
For example, the first learning module comprises a first encoder and a first nonlinear mapper connected in sequence, and the second learning module comprises a second encoder and a second nonlinear mapper connected in sequence, wherein the first encoder and the second encoder have the same structure and the first nonlinear mapper and the second nonlinear mapper have the same structure; the first classifier submodel comprises a first classifier connected to an output of the first encoder, and the second classifier submodel comprises a second classifier connected to an output of the second encoder, wherein the first classifier and the second classifier have the same structure.
For example, the training component 903 includes an input component 903_1 that, at each iteration of training: the input component 903_1 selects a first batch of first modality image images from the first image set, and inputs the first batch of first modality image images into the first learning module; and the input component 903_1 selects a second batch of second modality image images corresponding to the first batch of first modality image images one by one from the second image set, and inputs the second batch of second modality image images into the second learning module.
For example, the training component 903 training the endoscope image classification model to obtain a trained endoscope image classification model includes: the training component 903 trains the endoscope image classification model until the joint loss function of the endoscope image classification model converges to obtain a trained endoscope image classification model.
For example, the training component 903 further comprises: an unsupervised learning component 903_2, configured to perform unsupervised contrast learning by using the contrast learning submodel to generate a first feature representation of a first batch and a second feature representation of the first batch for the first-batch first-modality image images, and generate a first feature representation of a second batch and a second feature representation of the second batch for the second-batch second-modality image images; a storage unit 903_3 for storing the second characteristic representation of the first batch and the second characteristic representation of the second batch in the memory queue based on a first-in-first-out rule; a classification training component 903_4, configured to perform classification training using the classifier submodel to generate a first classification prediction probability distribution for each image in the first batch of first-modality image images, so as to obtain a first classification prediction probability distribution in the first batch, and generate a second classification prediction probability distribution for each image in the second batch of second-modality image images, so as to obtain a second classification prediction probability distribution in the second batch; a parameter adjusting unit 903_5 that calculates a joint loss function based on the second feature representation of the first lot and the second feature representation of the second lot, and the first classification prediction probability distribution of the first lot and the second classification prediction probability distribution of the second lot, and adjusts a parameter of the endoscopic image classification model according to the joint loss function; a trusted pseudo tag determination unit 903_6 that determines whether or not trusted pseudo tags are generated for the non-tag images in the first-batch first-modality video images and the non-tag images in the second-batch second-modality video images; a training data set updating component 903_7, configured to, if it is determined that a trusted pseudo label is generated for an unlabeled image in the first batch of first-modality image images and an unlabeled image in the second batch of second-modality image images, add the first-modality image and the corresponding second-modality image that generate the trusted pseudo label to the first image set and the second image set respectively to form a new first image set and a new second image set, so as to update a training data set; and the training component 903 continues to iteratively train the adjusted endoscope image classification model using the new first image set and the new second image set as a new training data set.
For example, if the trusted pseudo-label determination component 903_6 determines that no trusted pseudo-label is generated for the unlabeled image in the first batch of first-modality image images and the unlabeled image in the second batch of second-modality image images, then iterative training of the adjusted endoscopic image classification model continues based on the first set of images and the second set of images as training data sets.
For example, the joint loss function of the endoscope image classification model is the sum of the following loss functions: the loss function of the contrast learning, the loss function when performing classification training for the labeled images in the first batch of first-mode image images, and the loss function when performing classification training for the labeled images in the second batch of second-mode image images.
For example, the loss function learned for the contrast is a noise contrast estimation loss function InfoNCE, and the loss function trained for classifying the labeled images in the first-batch first-modality image images and the loss function trained for classifying the labeled images in the second-batch second-modality image images are focus loss functions.
For example, performing unsupervised contrast learning using the contrast learning submodel to generate a first batch of first feature representations and a first batch of second feature representations for the first batch of first-modality imagery images, and a second batch of first feature representations and a second batch of second feature representations for the second batch of second-modality imagery images includes: converting each image in the first batch of first modality image images into a first feature representation based on the first encoder to obtain a first feature representation of a first batch, and nonlinearly mapping each first feature representation in the first feature representation of the first batch based on the first nonlinear mapper to obtain a second feature representation of the first batch; and converting each image in the second batch of second modality image images into a first feature representation based on the second encoder to obtain a first feature representation of the second batch, and performing nonlinear mapping on each first feature representation in the first feature representation of the second batch based on the second nonlinear mapper to obtain a second feature representation of the second batch.
For example, wherein the trusted pseudo tag determining component 903_6 determines whether to generate a trusted pseudo tag for an unlabeled image in the first batch of first-modality imagery images and an unlabeled image in the second batch of second-modality imagery images comprises: for each unlabeled first modality video image, determining a first label prediction value for the unlabeled first modality video image based on a first classification prediction probability distribution generated for the unlabeled first modality video image; and determining a second label prediction value of the unlabeled second modality video image for an unlabeled second modality video image that corresponds one-to-one with the unlabeled first modality video image based on a second classification prediction probability distribution generated for the unlabeled second modality video image; determining whether the first tag prediction value and the second tag prediction value are consistent; if not, not generating the credible pseudo label; and if the predicted value of the first label is consistent with the predicted value of the second label, fusing the predicted value of the first label and the predicted value of the second label, generating the credible pseudo label when the fused predicted value of the label is greater than a preset threshold value, and otherwise, not generating the credible pseudo label.
For example, the fusing the first label prediction value and the second label prediction value by the trusted pseudolabel determination component 903_6 includes: and carrying out weighted average on the first label predicted value and the second label predicted value to obtain the fused label predicted value.
For example, the object is a polyp, and the endoscopic image is a polyp endoscopic image.
For example, wherein the signature comprises at least one of hyperplasia, adenoma, and cancer.
For example, the first modality picture image is a white light picture image and the second modality picture image is a narrow band light picture image.
Based on the above embodiments, the embodiments of the present disclosure also provide electronic devices of another exemplary implementation. In some possible embodiments, an electronic device in the embodiments of the present disclosure may include a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor may implement the steps of the endoscope image classification model training method or the endoscope image recognition method in the embodiments described above when executing the program.
For example, taking an electronic device as the server 100 in fig. 1 of the present disclosure as an example for explanation, a processor in the electronic device is the processor 110 in the server 100, and a memory in the electronic device is the memory 120 in the server 100.
Embodiments of the present disclosure also provide a computer-readable storage medium. Fig. 10 shows a schematic diagram 1000 of a storage medium according to an embodiment of the disclosure. As shown in fig. 10, the computer-readable storage medium 1000 has stored thereon computer-executable instructions 1001. When the computer-executable instructions 1001 are executed by a processor, the training method of the contrast learning-based endoscopic image classification model and the endoscopic image classification method according to the embodiments of the present disclosure described with reference to the above drawings may be performed. The computer-readable storage medium includes, but is not limited to, volatile memory and/or non-volatile memory, for example. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc.
Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the training method of the contrast learning-based endoscopic image classification model and the endoscopic image classification method according to the embodiments of the present disclosure.
Those skilled in the art will appreciate that the disclosure of the present disclosure is susceptible to numerous variations and modifications. For example, the various devices or components described above may be implemented in hardware, or may be implemented in software, firmware, or a combination of some or all of the three.
Further, while the present disclosure makes various references to certain elements of a system according to embodiments of the present disclosure, any number of different elements may be used and run on a client and/or server. The units are illustrative only, and different aspects of the systems and methods may use different units.
It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by instructing the relevant hardware through a program, and the program may be stored in a computer readable storage medium, such as a read only memory, a magnetic or optical disk, and the like. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in the form of hardware, and may also be implemented in the form of a software functional module. The present disclosure is not limited to any specific form of combination of hardware and software.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present disclosure and is not to be construed as limiting thereof. Although illustrative embodiments of the present disclosure have been described, those skilled in the art will readily appreciate that many modifications are possible in the illustrative embodiments without materially departing from the novel teachings and advantages of the present disclosure. Accordingly, all such modifications are intended to be included within the scope of this disclosure as defined in the claims. It is to be understood that the foregoing is illustrative of the present disclosure and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The present disclosure is defined by the claims and their equivalents.

Claims (20)

1. A method of training an endoscopic image classification model based on contrast learning, the method comprising:
acquiring a first set of images, the first set of images being a set of first modality imagery images of one or more objects acquired by an endoscope operating at a first modality;
acquiring a second set of images, the second set of images being a set of second modality imagery images of the one or more objects acquired by an endoscope operating at a second modality different from the first modality, the second modality imagery images corresponding one-to-one with the first modality imagery images; and
inputting the first image set and the second image set into the endoscope image classification model as training data sets, training the endoscope image classification model until a joint loss function of the endoscope image classification model converges to obtain a trained endoscope image classification model,
wherein training the endoscope image classification model until a joint loss function of the endoscope image classification model converges comprises:
carrying out unsupervised contrast learning by utilizing a contrast learning submodel to generate a first characteristic representation of a first batch and a second characteristic representation of the first batch for a first-batch first-mode image, and generate a first characteristic representation of a second batch and a second characteristic representation of the second batch for a second-batch second-mode image;
storing the second feature representation of the first batch and the second feature representation of the second batch into a memory queue based on a first-in-first-out rule;
performing classification training by using a classifier sub-model to generate a first classification prediction probability distribution for each image in the first batch of first modality image images so as to obtain a first classification prediction probability distribution of the first batch, and generate a second classification prediction probability distribution for each image in the second batch of second modality image images so as to obtain a second classification prediction probability distribution of the second batch;
calculating a joint loss function based on the second feature representation of the first batch and the second feature representation of the second batch, and the first classification prediction probability distribution of the first batch and the second classification prediction probability distribution of the second batch, and adjusting parameters of the endoscope image classification model according to the joint loss function;
determining whether a trusted pseudo-tag is generated for an unlabeled image in the first batch of first modality imagery images and an unlabeled image in the second batch of second modality imagery images;
if the credible pseudo labels are determined to be generated for the unlabeled images in the first batch of first-modality image images and the unlabeled images in the second batch of second-modality image images, adding the first-modality image images and the corresponding second-modality image images which generate the credible pseudo labels into the first image set and the second image set respectively to form a new first image set and a new second image set so as to update the training data set; and
using the new first image set and the new second image set as a new training data set to continuously carry out iterative training on the adjusted endoscope image classification model,
wherein determining whether to generate a trusted pseudo-tag for unlabeled images in the first batch of first-modality imagery images and unlabeled images in the second batch of second-modality imagery images comprises:
for each unlabeled first modality video image, determining a first label prediction value for the unlabeled first modality video image based on a first classification prediction probability distribution generated for the unlabeled first modality video image; and
determining a second label prediction value of the unlabeled second modality video image based on a second classification prediction probability distribution generated for the unlabeled second modality video image for an unlabeled second modality video image that corresponds one-to-one with the unlabeled first modality video image;
determining whether the first tag prediction value and the second tag prediction value are consistent;
if not, not generating the credible pseudo label;
and if the predicted value of the first label is consistent with the predicted value of the second label, fusing the predicted value of the first label and the predicted value of the second label, generating the credible pseudo label when the fused predicted value of the label is greater than a preset threshold value, and otherwise, not generating the credible pseudo label.
2. The method of claim 1, wherein the training method is a semi-supervised training method, images of a first subset of the first set of images having labels labeling endoscopic image classes, and other images of the first set of images having no labels labeling endoscopic image classes; and
the images of the second subset in the second image set, which correspond to the images of the first subset one by one, have the same label marking the endoscope image category, and the other images of the second image set do not have the label marking the endoscope image category.
3. The method of claim 1 or 2, wherein the endoscopic image classification model comprises:
a comparative learning submodel, the comparative learning submodel comprising:
a first learning module for receiving the first set of images and learning the first set of images to obtain a first feature representation and a second feature representation of the first set of images;
a second learning module for receiving the second set of images and learning the second set of images to obtain a first feature representation and a second feature representation of the second set of images; and
a memory queue for storing second feature representations of the first set of images generated by the first learning module and second feature representations of the second set of images generated by the second learning module;
a classifier submodel comprising:
a first classifier submodel for performing classification learning according to the first feature representation of the first image set generated by the first learning module to generate a classification prediction probability distribution of each image in the first image set; and
and the second classifier submodel is used for performing classification learning according to the first feature representation of the second image set generated by the second learning module so as to generate a classification prediction probability distribution of each image in the second image set.
4. The method of claim 3, wherein
The first learning module comprises a first encoder and a first nonlinear mapper which are connected in sequence,
The second learning module comprises a second encoder and a second nonlinear mapper which are connected in sequence, wherein the first encoder and the second encoder have the same structure, and the first nonlinear mapper and the second nonlinear mapper have the same structure,
The first classifier submodel comprises a first classifier connected to an output of the first encoder, and
The second classifier submodel comprises a second classifier connected to an output of the second encoder, wherein the first classifier and the second classifier are structurally identical.
5. The method of claim 4, wherein inputting the first set of images and the second set of images as a training data set into an endoscopic image classification model comprises:
at each iterative training:
selecting a first batch of first modality image images from the first image set and inputting the first batch of first modality image images into the first learning module; and
and selecting second-batch second-mode image images which correspond to the first-batch first-mode image images one by one from the second image set, and inputting the second-batch second-mode image images into the second learning module.
6. The method of claim 1, wherein if it is determined that authentic pseudo-labels are not generated for unlabeled images in the first batch of first-modality image images and unlabeled images in the second batch of second-modality image images, continuing iterative training of the adjusted endoscopic image classification model based on the first set of images and the second set of images as a training data set.
7. The method of claim 1, wherein the joint loss function of the endoscope image classification model is a sum of:
the loss function of the contrast learning, the loss function when performing classification training for the labeled images in the first batch of first-mode image images, and the loss function when performing classification training for the labeled images in the second batch of second-mode image images.
8. The method of claim 7, wherein the loss function for the contrast learning is a noise contrastive estimation loss function InfoNCE,
the loss function for classification training of the labeled images in the first batch of first modality image images and the loss function for classification training of the labeled images in the second batch of second modality image images are focus loss functions.
9. The method of claim 5, wherein performing unsupervised contrast learning with the contrast learning submodel to generate a first batch of first feature representations and a first batch of second feature representations for the first batch of first modality imagery images and a second batch of first feature representations and a second batch of second feature representations for the second batch of second modality imagery images comprises:
converting each image in the first batch of first modality image images into a first feature representation based on the first encoder to obtain a first feature representation of a first batch, and nonlinearly mapping each first feature representation in the first feature representation of the first batch based on the first nonlinear mapper to obtain a second feature representation of the first batch; and
based on the second encoder, each image in the second batch of second modality image images is converted into a first feature representation to obtain a first feature representation of the second batch, and based on the second nonlinear mapper, each first feature representation in the first feature representation of the second batch is subjected to nonlinear mapping to obtain a second feature representation of the second batch.
10. The method of claim 1, wherein fusing the first tag predictor and the second tag predictor comprises:
and carrying out weighted average on the first label predicted value and the second label predicted value to obtain the fused label predicted value.
11. The method of claim 1, wherein the object is a polyp and the endoscopic image is a polyp endoscopic image.
12. The method of claim 2, wherein the signature comprises at least one of hyperplasia, adenoma, and cancer.
13. The method of claim 2, wherein the first modality picture image is a white light picture image and the second modality picture image is a narrowband light picture image.
14. The method of claim 2, wherein the first modality imagery image is a white light imagery image and the second modality imagery image is an autofluorescence imagery image.
15. The method of claim 4, wherein the encoder is a convolutional layer portion of a residual neural network ResNet, the nonlinear mapper is comprised of a two-layer multi-layer perceptron MLP, and the classifier is comprised of a two-layer multi-layer perceptron MLP.
16. An endoscopic image classification method comprising:
acquiring an endoscope image to be identified;
extracting an image feature representation of the endoscopic image based on an encoder in a trained endoscopic image classification model;
inputting the extracted image feature representation into a corresponding classifier in a trained endoscope image classification model to obtain a classification result of the endoscope image;
wherein the trained endoscopic image classification model is obtained based on the training method of the contrast learning based endoscopic image classification model according to any one of claims 1-15.
17. An endoscopic image classification system comprising:
an image acquisition section for acquiring an endoscopic image to be recognized;
the processing component is used for extracting image characteristic representations of the endoscope images based on an encoder in the trained endoscope image classification model and inputting the extracted image characteristic representations into corresponding classifiers in the trained endoscope image classification model to obtain classification results of the endoscope images;
an output section for outputting a classification result of the image to be recognized,
wherein the trained endoscopic image classification model is obtained based on the training method of the contrast learning based endoscopic image classification model according to any one of claims 1-15.
18. A training apparatus for an endoscopic image classification model based on contrast learning, the apparatus comprising:
a training data set acquisition component for acquiring a first set of images, the first set of images being a set of first modality imagery images of one or more subjects acquired by an endoscope operating at a first modality; and acquiring a second set of images, the second set of images being a set of second modality imagery images of the one or more objects acquired by an endoscope operating in a second modality different from the first modality, the second modality imagery images corresponding one-to-one to the first modality imagery images; and
a training section configured to input the first image set and the second image set as a training data set into the endoscope image classification model, train the endoscope image classification model until a joint loss function of the endoscope image classification model converges to obtain a trained endoscope image classification model,
wherein training the endoscope image classification model until a joint loss function of the endoscope image classification model converges comprises:
carrying out unsupervised contrast learning by utilizing a contrast learning submodel to generate a first characteristic representation of a first batch and a second characteristic representation of the first batch for a first-batch first-mode image, and generate a first characteristic representation of a second batch and a second characteristic representation of the second batch for a second-batch second-mode image;
storing the second feature representation of the first batch and the second feature representation of the second batch into a memory queue based on a first-in-first-out rule;
performing classification training by using a classifier sub-model to generate a first classification prediction probability distribution for each image in the first batch of first modality image images so as to obtain a first classification prediction probability distribution of the first batch, and generate a second classification prediction probability distribution for each image in the second batch of second modality image images so as to obtain a second classification prediction probability distribution of the second batch;
calculating a joint loss function based on the second feature representation of the first batch and the second feature representation of the second batch, and the first classification prediction probability distribution of the first batch and the second classification prediction probability distribution of the second batch, and adjusting parameters of the endoscope image classification model according to the joint loss function;
determining whether a trusted pseudo-tag is generated for an unlabeled image in the first batch of first modality imagery images and an unlabeled image in the second batch of second modality imagery images;
if the credible pseudo labels are determined to be generated for the unlabeled images in the first batch of first-modality image images and the unlabeled images in the second batch of second-modality image images, adding the first-modality image images and the corresponding second-modality image images which generate the credible pseudo labels into the first image set and the second image set respectively to form a new first image set and a new second image set so as to update the training data set; and
using the new first image set and the new second image set as a new training data set to continuously carry out iterative training on the adjusted endoscope image classification model,
wherein determining whether to generate a trusted pseudo-tag for unlabeled images in the first batch of first-modality imagery images and unlabeled images in the second batch of second-modality imagery images comprises:
for each unlabeled first modality video image, determining a first label prediction value for the unlabeled first modality video image based on a first classification prediction probability distribution generated for the unlabeled first modality video image; and
determining a second label prediction value of the unlabeled second modality video image based on a second classification prediction probability distribution generated for the unlabeled second modality video image for an unlabeled second modality video image that corresponds one-to-one with the unlabeled first modality video image;
determining whether the first tag prediction value and the second tag prediction value are consistent;
if not, not generating the credible pseudo label;
and if the predicted value of the first label is consistent with the predicted value of the second label, fusing the predicted value of the first label and the predicted value of the second label, generating the credible pseudo label when the fused predicted value of the label is greater than a preset threshold value, and otherwise, not generating the credible pseudo label.
19. An electronic device comprising a memory and a processor, wherein the memory has stored thereon program code readable by the processor, which when executed by the processor, performs the method of any of claims 1-16.
20. A computer-readable storage medium having stored thereon computer-executable instructions for performing the method of any of claims 1-16.
CN202111039387.8A 2021-09-06 2021-09-06 Training method of endoscope image classification model, image classification method and device Active CN113496489B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111039387.8A CN113496489B (en) 2021-09-06 2021-09-06 Training method of endoscope image classification model, image classification method and device
PCT/CN2022/117048 WO2023030521A1 (en) 2021-09-06 2022-09-05 Endoscope image classification model training method and device, and endoscope image classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111039387.8A CN113496489B (en) 2021-09-06 2021-09-06 Training method of endoscope image classification model, image classification method and device

Publications (2)

Publication Number Publication Date
CN113496489A CN113496489A (en) 2021-10-12
CN113496489B true CN113496489B (en) 2021-12-24

Family

ID=77997132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111039387.8A Active CN113496489B (en) 2021-09-06 2021-09-06 Training method of endoscope image classification model, image classification method and device

Country Status (2)

Country Link
CN (1) CN113496489B (en)
WO (1) WO2023030521A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113496489B (en) * 2021-09-06 2021-12-24 北京字节跳动网络技术有限公司 Training method of endoscope image classification model, image classification method and device
CN113642537B (en) * 2021-10-14 2022-01-04 武汉大学 Medical image recognition method and device, computer equipment and storage medium
CN113706526B (en) * 2021-10-26 2022-02-08 北京字节跳动网络技术有限公司 Training method and device for endoscope image feature learning model and classification model
CN115719415B (en) * 2022-03-28 2023-11-10 南京诺源医疗器械有限公司 Visual field adjustable double-video fusion imaging method and system
CN114758360B (en) * 2022-04-24 2023-04-18 北京医准智能科技有限公司 Multi-modal image classification model training method and device and electronic equipment
CN114782719B (en) * 2022-04-26 2023-02-03 北京百度网讯科技有限公司 Training method of feature extraction model, object retrieval method and device
CN114937178B (en) * 2022-06-30 2023-04-18 抖音视界有限公司 Multi-modality-based image classification method and device, readable medium and electronic equipment
CN115240036B (en) * 2022-09-22 2023-02-03 武汉珈鹰智能科技有限公司 Training method, application method and storage medium of crack image recognition network
CN116758562B (en) * 2023-08-22 2023-12-08 杭州实在智能科技有限公司 Universal text verification code identification method and system
CN117577258B (en) * 2024-01-16 2024-04-02 北京大学第三医院(北京大学第三临床医学院) PETCT (pulse-based transmission control test) similar case retrieval and prognosis prediction method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948733A (en) * 2019-04-01 2019-06-28 深圳大学 More classification methods, sorter and the storage medium of alimentary tract endoscope image
CN110427994A (en) * 2019-07-24 2019-11-08 腾讯医疗健康(深圳)有限公司 Digestive endoscope image processing method, device, storage medium, equipment and system
CN112381116A (en) * 2020-10-21 2021-02-19 福州大学 Self-supervision image classification method based on contrast learning
CN112668627A (en) * 2020-12-24 2021-04-16 四川大学 Large-scale image online clustering system and method based on contrast learning

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109222865A (en) * 2018-10-17 2019-01-18 卓外(上海)医疗电子科技有限公司 Multi-modality imaging endoscopic system
CN110490856B (en) * 2019-05-06 2021-01-15 腾讯医疗健康(深圳)有限公司 Method, system, machine device, and medium for processing medical endoscope image
CN110689025B (en) * 2019-09-16 2023-10-27 腾讯医疗健康(深圳)有限公司 Image recognition method, device and system and endoscope image recognition method and device
JP7278202B2 (en) * 2019-11-27 2023-05-19 富士フイルム株式会社 Image learning device, image learning method, neural network, and image classification device
CN112741651B (en) * 2020-12-25 2022-11-25 上海交通大学烟台信息技术研究院 Method and system for processing ultrasonic image of endoscope
CN112766323A (en) * 2020-12-30 2021-05-07 清华大学 Image identification method and device
CN112990297B (en) * 2021-03-10 2024-02-02 北京智源人工智能研究院 Training method, application method and device of multi-mode pre-training model
CN113011485B (en) * 2021-03-12 2023-04-07 北京邮电大学 Multi-mode multi-disease long-tail distribution ophthalmic disease classification model training method and device
CN113496489B (en) * 2021-09-06 2021-12-24 北京字节跳动网络技术有限公司 Training method of endoscope image classification model, image classification method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948733A (en) * 2019-04-01 2019-06-28 深圳大学 Multi-classification method, classifier and storage medium for digestive tract endoscope images
CN110427994A (en) * 2019-07-24 2019-11-08 腾讯医疗健康(深圳)有限公司 Digestive endoscope image processing method, device, storage medium, equipment and system
CN112381116A (en) * 2020-10-21 2021-02-19 福州大学 Self-supervised image classification method based on contrastive learning
CN112668627A (en) * 2020-12-24 2021-04-16 四川大学 Large-scale online image clustering system and method based on contrastive learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Contrastive Learning (MoCo, SimCLR, BYOL, SimSiam, SimCSE); 上杉翔二; https://blog.csdn.net/qq_39388410/article/details/108941999; 2020-10-06; pages 1-9 *
Contrastive Multiview Coding; Yonglong Tian; arXiv:1906.05849v5; 2020-12-18; Sections 1-3 *
Systematic Study of Machine Learning: Weakly Supervised Learning (II) - A Survey of Semi-Supervised Learning; Eason.wxd; https://blog.csdn.net/App_12062011/article/details/93314823; 2019-06-22; Section 2 *

Also Published As

Publication number Publication date
CN113496489A (en) 2021-10-12
WO2023030521A1 (en) 2023-03-09

Similar Documents

Publication Publication Date Title
CN113496489B (en) Training method of endoscope image classification model, image classification method and device
CN113706526B (en) Training method and device for endoscope image feature learning model and classification model
CN113486990B (en) Training method of endoscope image classification model, image classification method and device
US11633084B2 (en) Image diagnosis assistance apparatus, data collection method, image diagnosis assistance method, and image diagnosis assistance program
CN109523532B (en) Image processing method, image processing device, computer readable medium and electronic equipment
CN109523522B (en) Endoscopic image processing method, device, system and storage medium
US20210228071A1 (en) System and method of otoscopy image analysis to diagnose ear pathology
Pogorelov et al. Deep learning and hand-crafted feature based approaches for polyp detection in medical videos
Jain et al. Detection of abnormality in wireless capsule endoscopy images using fractal features
Goel et al. Investigating the significance of color space for abnormality detection in wireless capsule endoscopy images
CN113470029B (en) Training method and device, image processing method, electronic device and storage medium
EP4120186A1 (en) Computer-implemented systems and methods for object detection and characterization
Masmoudi et al. Optimal feature extraction and ulcer classification from WCE image data using deep learning
CN114399465A (en) Benign and malignant ulcer identification method and system
CN113781489A (en) Polyp image semantic segmentation method and device
Du et al. Improving the classification performance of esophageal disease on small dataset by semi-supervised efficient contrastive learning
US20230260652A1 (en) Self-Supervised Machine Learning for Medical Image Analysis
CN115511861A (en) Identification method based on artificial neural network
KR20220078495A (en) Method, apparatus and program to read lesion of small intestine based on capsule endoscopy image
US20240087115A1 (en) Machine learning enabled system for skin abnormality interventions
Huang et al. TongueMobile: automated tongue segmentation and diagnosis on smartphones
CN116740475B (en) Digestive tract image recognition method and system based on state classification
Yao Machine Learning and Image Processing for Clinical Outcome Prediction: Applications in Medical Data from Patients with Traumatic Brain Injury, Ulcerative Colitis, and Heart Failure
Manjunath et al. Deep Learning Architectures for Abnormality Detection in Endoscopy Videos.
WO2023285407A1 (en) Computer-implemented systems and methods for object detection and characterization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20211012

Assignee: Xiaohe medical instrument (Hainan) Co.,Ltd.

Assignor: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.

Contract record no.: X2021990000694

Denomination of invention: Training method of endoscope image classification model, image classification method and device

License type: Common License

Record date: 20211117

GR01 Patent grant