CN114078137A - Colposcope image screening method and device based on deep learning and electronic equipment - Google Patents

Colposcope image screening method and device based on deep learning and electronic equipment

Info

Publication number
CN114078137A
CN114078137A (Application CN202111396135.0A)
Authority
CN
China
Prior art keywords
image
features
colposcope
cervical
cervical orifice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111396135.0A
Other languages
Chinese (zh)
Inventor
赵帅
袁莎
曹岗
赵健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhiyuan Artificial Intelligence Research Institute
Original Assignee
Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhiyuan Artificial Intelligence Research Institute filed Critical Beijing Zhiyuan Artificial Intelligence Research Institute
Priority to CN202111396135.0A
Publication of CN114078137A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation

Abstract

The invention discloses a deep-learning-based colposcope image screening method and apparatus, and an electronic device. The method comprises the following steps: acquiring a colposcope image of a patient and segmenting it to obtain a cervical orifice image; extracting low-level image features and high-level semantic features of the cervical orifice image, and text features of the patient information; fusing the low-level image features, the high-level semantic features, and the text features to obtain fused features; and retrieving in a preset colposcope image library based on the fused features to screen out target images associated with the cervical orifice image. By working from a single colposcope image, integrating multi-modal features, and adopting a triplet-loss-based classification model, the technical scheme reduces the complexity of colposcope image interpretation and improves image retrieval accuracy.

Description

Colposcope image screening method and device based on deep learning and electronic equipment
Technical Field
The invention relates to the technical field of image processing, and in particular to a deep-learning-based colposcope image screening method and apparatus, and an electronic device.
The invention is a research result supported by the National Key Research and Development Program of China (2020AAA0105200).
Background
Cervical cancer is a major cause of cancer-related mortality in women. If patients are diagnosed at the precancerous stage or earlier, the cure rate can reach 98% and mortality can be reduced significantly. In recent years, medical microscopic image processing techniques based on computer image processing, artificial intelligence, and related methods have developed rapidly. Currently, cervical cancer diagnosis in China generally follows a three-step procedure: first, HPV testing or primary screening by exfoliated cervical cytology (TCT or Pap smear); then colposcopy and biopsy for patients whose primary screening is positive; and finally confirmation by cervical pathology. The three stages proceed step by step. Colposcopy thus plays a crucial role in diagnosis and is often one of the important tools for diagnosing cervical intraepithelial neoplasia (CIN).
However, in the current colposcopic diagnosis workflow, a colposcopist generally needs to make a comprehensive judgment by sequentially combining images acquired at multiple time points, for example 1 minute and 2 minutes after the application of physiological saline, acetic acid, and iodine solution, and often needs to repeatedly compare the changes between images. The colposcopic diagnosis result therefore depends to a great extent on the operator's subjective experience, and only a few highly experienced colposcopists can reach a relatively accurate conclusion from analyzing the subtle color changes of the acetowhite epithelium in the cervical orifice region. The accuracy and repeatability of colposcopic pathology identification are consequently limited.
In addition, when artificial intelligence is applied to biopsy-region identification in colposcope images, images acquired at multiple time points, such as the saline, acetic acid, and iodine images, generally must be input to a model together to identify the biopsy region. However, long intervals separate the acquisitions, and patient posture adjustments or movement of the acquisition instrument introduce large inconsistencies between images, making it difficult for artificial intelligence methods to analyze the degree of change in a specific area. Moreover, the cervical lesion area generally occupies only a small portion of the colposcope image, which further restricts the recognition performance of this approach.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides the following technical scheme.
The invention provides a colposcope image screening method based on deep learning, which comprises the following steps:
S101, collecting a colposcope image of a patient, and segmenting the image to obtain a cervical orifice image;
S102, extracting low-level image features and high-level semantic features of the cervical orifice image, and extracting text features of the patient information;
S103, fusing the low-level image features, the high-level semantic features, and the text features to obtain fused features;
S104, retrieving in a preset colposcope image library based on the fused features, and screening out a target image associated with the cervical orifice image.
Preferably, segmenting the image to obtain the cervical orifice image includes:
detecting the cervical region of the colposcope image through a preset target detection model;
and segmenting the cervical orifice image of the cervical orifice region from the colposcope image, with a channel attention module added between the skip connections of a U-Net segmentation framework.
Preferably, extracting the low-level image features of the cervical orifice image further comprises:
extracting a plurality of low-level image features of the cervical orifice image and fusing them into a local descriptor of the cervical orifice image.
Preferably, the low-level image features include SIFT features, SURF features, LBP features, and/or histogram information.
Preferably, extracting the high-level semantic features of the cervical orifice image further includes:
performing feature extraction on the cervical orifice image with a triplet-loss-based deep learning model to obtain a high-dimensional feature vector as the high-level semantic features.
Preferably, during training of the deep learning model, P × K training images are read in at a time, where P is the number of classes of randomly selected training images and K is the number of training images randomly selected per class.
Preferably, the triplet loss is expressed as:

$$L_{BH} = \sum_{i=1}^{P}\sum_{a=1}^{K}\left[\operatorname{margin} + \max_{j=1,\dots,K} D\!\left(x_a^i, x_j^i\right) - \min_{\substack{p=1,\dots,P,\; p \neq i \\ n=1,\dots,K}} D\!\left(x_a^i, x_n^p\right)\right]_+$$

where margin is a boundary hyperparameter, $x_a^i$ denotes the a-th image in the i-th category, $x_n^p$ denotes the n-th image in the p-th category, $x_j^i$ denotes the j-th image in the i-th category, and D denotes the distance between the two images.
Preferably, extracting the text features of the patient information further comprises:
encoding the patient's basic information and the text of the examination results with One-Hot encoding to form a one-dimensional feature vector, feeding it into a fully connected neural network comprising a plurality of hidden layers, and taking the output of the last hidden layer as the patient's text feature vector.
Preferably, fusing the low-level image features, the high-level semantic features, and the text features to obtain the fused features includes:
normalizing the low-level image features, the high-level semantic features, and the text features respectively;
and concatenating the normalized features to obtain the fused features.
In another aspect, the present invention provides a deep-learning-based colposcope image screening apparatus, including:
an image preprocessing module for acquiring a colposcope image of a patient and segmenting the image to obtain a cervical orifice image;
a feature extraction module for extracting low-level image features and high-level semantic features of the cervical orifice image and extracting text features of the patient information;
a feature fusion module for fusing the low-level image features, the high-level semantic features, and the text features to obtain fused features;
and a retrieval screening module for retrieving in a preset colposcope image library based on the fused features and screening out a target image associated with the cervical orifice image.
A third aspect of the present invention provides an electronic device comprising a processor and a memory, the memory storing a plurality of instructions, the processor being configured to read the instructions and execute the colposcope image screening method according to the first aspect.
A fourth aspect of the present invention provides a computer-readable storage medium storing a plurality of instructions readable by a processor to perform the colposcope image screening method according to the first aspect.
The invention has the following beneficial effects: screening requires identification from only a single colposcope image, which reduces the complexity of the colposcope image interpretation process; and by integrating multi-modal features and adopting a triplet-loss-based classification model, the essential content of the colposcope image is better represented and the image retrieval accuracy is improved.
Drawings
Fig. 1 is a schematic flowchart of a deep learning-based colposcopic image screening method according to an embodiment of the present invention.
Fig. 2 is a functional framework diagram of the colposcopic image screening method according to the embodiment of the invention.
Fig. 3 is a block diagram of the image segmentation architecture with attention mechanism according to an embodiment of the present invention.
Fig. 4 is a block diagram of a deep learning-based colposcopic image screening apparatus according to the present invention.
Detailed Description
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The method provided by the invention can be implemented in the following terminal environment, and the terminal can comprise one or more of the following components: a processor, a memory, and a display screen. Wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the methods described in the embodiments described below.
A processor may include one or more processing cores. The processor connects the various parts of the terminal using various interfaces and lines, and performs the various functions of the terminal and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory and by calling data stored in the memory.
The memory may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory may be used to store instructions, programs, code sets, or instruction sets.
The display screen is used for displaying user interfaces of all the application programs.
In addition, those skilled in the art will appreciate that the above-described terminal configurations are not intended to be limiting, and that the terminal may include more or fewer components, or some components may be combined, or a different arrangement of components. For example, the terminal further includes a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and other components, which are not described herein again.
In order to overcome the defects of the prior art, the invention provides a deep-learning-based colposcope image screening method that, using only the first colposcope image taken after normal saline application combined with the patient's clinical diagnosis information, can accurately find the images most similar to that first image in a standard colposcope image database, thereby providing auxiliary diagnostic reference information for the colposcopist.
Example one
As shown in fig. 1, an embodiment of the present invention provides a deep learning-based colposcopic image screening method, including:
S101, collecting a colposcope image of a patient, and segmenting the image to obtain a cervical orifice image;
S102, extracting low-level image features and high-level semantic features of the cervical orifice image, and extracting text features of the patient information;
S103, fusing the low-level image features, the high-level semantic features, and the text features to obtain fused features;
S104, retrieving in a preset colposcope image library based on the fused features, and screening out a target image associated with the cervical orifice image.
Correspondingly, as shown in fig. 2, the main algorithm functional framework of the present invention includes an image preprocessing stage, a multi-modal feature extraction stage, a feature fusion stage, and a similarity retrieval stage.
In step S101, the image preprocessing stage, images taken through the colposcope are, under the influence of the shooting environment and body posture, prone to problems such as image distortion and cervical position shift, which complicate the subsequent analysis and processing. The invention therefore completes two tasks in the preprocessing stage: detection of the cervical orifice position, and segmentation of the cervical orifice region.
In step S101, segmenting the image to obtain the cervical orifice image may include: detecting the cervical region of the colposcope image through a preset target detection model; and segmenting the cervical orifice image of the cervical orifice region from the colposcope image, with a channel attention module added between the skip connections of the U-Net segmentation framework.
The cervical orifice can be regarded as the central location of the entire cervix, and the cervical orifice region is also where most lesions occur. The method therefore trains a dedicated target detection model to detect this region, which facilitates the subsequent extraction of deep learning features and low-level image features at that position and provides feature information for the colposcope image retrieval that follows. In a preferred embodiment, the invention adopts a YOLOv5-based architecture and trains the network on a training dataset manually annotated by doctors to find a 200 x 200 box around the cervical center of the colposcope image.
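As a minimal sketch of this detection step, assuming a YOLOv5 model fine-tuned on the doctor-annotated colposcope data and exported to a hypothetical weights file, the crop center could be obtained as follows:

```python
import torch

# Sketch only: "cervix_yolov5.pt" and the input filename are illustrative.
model = torch.hub.load("ultralytics/yolov5", "custom", path="cervix_yolov5.pt")

results = model("colposcope_frame.jpg")
boxes = results.xyxy[0]                      # rows of (x1, y1, x2, y2, conf, class)
if len(boxes):
    x1, y1, x2, y2, conf, cls = boxes[0].tolist()
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2    # center of the detected cervical box
    # The description then crops a fixed 200 x 200 window around (cx, cy).
```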
In general, the original image acquired by the colposcope electronics contains a large amount of non-cervical content that contributes nothing to medical diagnosis, so the invention automatically segments the cervical region of the colposcope image with an improved image segmentation algorithm. Specifically, considering the unbalanced scale and the varied colors and shapes of the cervical orifice region in colposcope images, the invention introduces an attention mechanism into a segmentation network based on the U-Net framework, adding a channel attention (SE) module between the U-Net skip connections to obtain a good cervical orifice region segmentation result. Referring to fig. 3, the ERes blocks at the bottom left are encoder modules and the DRes blocks on the right are decoder modules; each ERes is connected to the corresponding DRes through an SE module, so that the ERes feature map is selectively passed to the DRes. The segmented colposcope image removes the interference of irrelevant factors such as skin, hair, and medical instruments, making the subsequent image retrieval more targeted.
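A minimal PyTorch sketch of an SE channel attention module on a U-Net skip connection is given below; class and function names are illustrative, not the patent's implementation:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention for a skip connection."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pool
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # per-channel gate in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # reweight encoder channels

# On each skip connection, the encoder feature map passes through SE before
# being concatenated with the matching decoder feature map:
def skip_connect(enc_feat, dec_feat, se: SEBlock):
    return torch.cat([se(enc_feat), dec_feat], dim=1)
```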
Experiments show that, compared with the conventional U-Net model, the segmentation precision (IoU) of this model is improved by 3.2% and the accuracy by 4.1%.
In step S102, the multi-modal feature extraction stage, the following three kinds of features are extracted:
1) low-level image feature descriptors of the cervical orifice image;
2) high-level semantic features from a deep neural network; and
3) text features generated from the patient's routine examination.
For low-level image feature extraction, based on the 200 x 200 cervical-center image determined during preprocessing in step S101, a plurality of low-level features of the cervical orifice image may be extracted and fused into a local descriptor, i.e., a description vector, of the cervical orifice image. The low-level image features may include one or more of SIFT features, SURF features, LBP features, and/or histogram information. For ease of explanation, the following embodiments use SIFT features as the low-level image features and aggregate multiple SIFT features into a local description vector of the cervical orifice using VLAD.
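As an illustration of this step, a sketch of SIFT extraction with VLAD aggregation is shown below, assuming OpenCV with SIFT support and a k-means codebook learned offline; the function name is ours:

```python
import cv2
import numpy as np

def vlad_descriptor(image_gray: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Aggregate the SIFT descriptors of one cervical-orifice crop into a VLAD vector.

    codebook: (k, 128) float array of visual-word centers, e.g. from k-means
    over SIFT descriptors of the training set.
    """
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(image_gray, None)      # (n, 128) local descriptors
    if desc is None:
        return np.zeros(codebook.size, dtype=np.float32)   # no keypoints found
    # Assign each descriptor to its nearest visual word.
    nearest = np.linalg.norm(desc[:, None, :] - codebook[None, :, :], axis=2).argmin(axis=1)
    vlad = np.zeros_like(codebook)
    for word in range(codebook.shape[0]):
        members = desc[nearest == word]
        if len(members):
            vlad[word] = (members - codebook[word]).sum(axis=0)  # residual sum
    vlad = vlad.flatten()
    return (vlad / (np.linalg.norm(vlad) + 1e-12)).astype(np.float32)  # L2 normalize
```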
For high-level semantic feature extraction, the invention applies a triplet-loss-based deep learning model to the cervical orifice image and takes the resulting high-dimensional feature vector as the high-level semantic features.
To enable the deep neural network to capture the feature information of colposcope images, the invention designs a multitask deep learning model based on the triplet loss, so that the triplet loss pulls similar images closer together and the feature representation reflects the differences between colposcope images.
The triplet loss is a widely used metric learning loss; compared with other losses (classification loss, contrastive loss), it offers end-to-end training, a clustering property, and a highly discriminative feature embedding. Training with the triplet loss requires three input images per training sample: a fixed image (Anchor) a, a positive sample image (Positive) p, and a negative sample image (Negative) n. Images a and p form a positive pair, and images a and n form a negative pair. The triplet loss is then expressed as:

$$L_t = \left[D(a, p) - D(a, n) + \operatorname{margin}\right]_+$$

where margin is the boundary hyperparameter, D(a, p) denotes the distance between images a and p, and D(a, n) the distance between images a and n. Through the triplet loss, distances between similar colposcope images are pulled closer while distances between images of unrelated classes are pushed apart. The subscript "+" means that when the value in the brackets is greater than 0 it is taken as the loss, and when it is less than 0 the loss is 0.
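A minimal PyTorch sketch of this per-triplet loss follows; the margin value is illustrative:

```python
import torch
import torch.nn.functional as F

def triplet_loss(a: torch.Tensor, p: torch.Tensor, n: torch.Tensor, margin: float = 0.3):
    """L_t = [D(a, p) - D(a, n) + margin]_+ with Euclidean distance D."""
    d_ap = F.pairwise_distance(a, p)    # anchor-positive distances
    d_an = F.pairwise_distance(a, n)    # anchor-negative distances
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```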
Because combinatorial pairing in triplet-loss training generates a very large number of negative sample pairs, the numbers of positive and negative pairs become imbalanced, training can stall, and convergence suffers. Therefore, during training the batch size (the number of images read in at a time) is set to P × K: P classes of images are randomly selected each time, and K images are randomly selected from each class to train the network, where P is the number of classes of randomly selected training images and K is the number of training images randomly selected per class. The triplet loss within each batch is calculated by the following formula:
$$L_{BH} = \sum_{i=1}^{P}\sum_{a=1}^{K}\left[\operatorname{margin} + \max_{j=1,\dots,K} D\!\left(x_a^i, x_j^i\right) - \min_{\substack{p=1,\dots,P,\; p \neq i \\ n=1,\dots,K}} D\!\left(x_a^i, x_n^p\right)\right]_+$$

where margin is a boundary hyperparameter, $x_a^i$ denotes the a-th image in the i-th category, $x_n^p$ denotes the n-th image in the p-th category, $x_j^i$ denotes the j-th image in the i-th category, and $D(\cdot,\cdot)$ denotes the distance between the two images in its arguments. For the a-th image in the i-th category, computing $\max_{j} D(x_a^i, x_j^i)$ yields the maximum distance to images of the same category (the hardest positive), and computing $\min_{p \neq i,\, n} D(x_a^i, x_n^p)$ yields the minimum distance to images of different categories (the hardest negative). The subscript "+" means that when the value in the brackets is greater than 0 it is taken as the loss, and when it is less than 0 the loss is 0.
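The batch-hard selection just described can be sketched in PyTorch as follows, assuming the embeddings of one P × K batch and their class labels; names and the margin value are illustrative:

```python
import torch

def batch_hard_triplet_loss(emb: torch.Tensor, labels: torch.Tensor, margin: float = 0.3):
    """Batch-hard triplet loss over a P x K batch, following the formula above.

    emb: (P*K, d) embedding matrix; labels: (P*K,) class indices.
    """
    d = torch.cdist(emb, emb)                        # pairwise distance matrix
    same = labels[:, None] == labels[None, :]        # same-class mask (incl. self)
    # Hardest positive: farthest same-class sample (self-distance is 0, so it
    # never wins the max when K >= 2).
    hardest_pos = (d * same.float()).max(dim=1).values
    # Hardest negative: closest different-class sample.
    hardest_neg = d.masked_fill(same, float("inf")).min(dim=1).values
    return torch.clamp(margin + hardest_pos - hardest_neg, min=0).mean()
```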
With this training scheme, the least similar positive pair and the most similar negative pair within each batch are selected to compute the loss, so the network learns a stronger feature representation.
In a further embodiment, VGG-19 may be used as the backbone network, with the intermediate feature layers Conv4_3 and Conv7 through Conv11_2 retained. Global max pooling is applied to these feature layers to form one-dimensional vectors, which are concatenated with the final classification vector before classification through a softmax function, so that the final classification result fully accounts for both the high-dimensional and the low-level image feature information of the colposcope image.
It should be noted that the above VGG-19 backbone network is only an example; those skilled in the art will appreciate that other deep learning feature extractors may be used instead, such as a backbone network with an FPN architecture.
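A sketch of such a multi-branch head is shown below; the VGG-19 split points and dimensions are illustrative, not the patent's exact Conv4_3/Conv7..Conv11_2 layers:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class MultiBranchVGG(nn.Module):
    """Global-max-pool several intermediate feature maps, concatenate them,
    and classify, so low-level and high-level information both reach the head."""
    def __init__(self, num_classes: int = 5):
        super().__init__()
        feats = vgg19(weights=None).features
        self.stage1, self.stage2, self.stage3 = feats[:19], feats[19:28], feats[28:]
        self.gmp = nn.AdaptiveMaxPool2d(1)
        self.fc = nn.Linear(256 + 512 + 512, num_classes)   # fused vector -> logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.stage1(x)            # 256-channel intermediate map
        f2 = self.stage2(f1)           # 512-channel deeper map
        f3 = self.stage3(f2)           # 512-channel final map
        fused = torch.cat([self.gmp(f).flatten(1) for f in (f1, f2, f3)], dim=1)
        return self.fc(fused)          # softmax is applied inside the loss
```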
Finally, the loss function combines the cross-entropy loss and the triplet loss:

$$L_{total} = L_{BH} + \alpha L_{cross}$$

where $L_{total}$, $L_{BH}$, and $L_{cross}$ denote the total loss, the triplet loss, and the cross-entropy loss, respectively, and $\alpha$ is the weight of $L_{cross}$. Preferably, $\alpha$ is set to 0.2.
In experiments, 3500 hospital-annotated colposcope images were obtained and divided into five categories: suspected cervical cancer, neoplasm, hemorrhage, regions with red features, and normal (399, 280, 312, 1200, and 1309 images, respectively). An NVIDIA TITAN XP GPU and an Ubuntu 18.04 system with 64 GB of memory served as the training platform; training ran for 200 epochs with a batch size of 8 and an input image size of 224 x 224. A stochastic gradient descent (SGD) optimizer was used, with momentum set to 0.9 and learning rate to 0.001. Training and test data were split 8:2. The final training and test accuracies were 0.92 and 0.855, respectively. Finally, the 1024-dimensional feature vector of the fully connected layer immediately before the softmax, which fuses the low-level and high-dimensional features, is taken as the high-dimensional feature vector analyzed by the invention.
For extracting text features from the patient information accompanying the cervical orifice image, the patient's basic information and the text of the examination results can be encoded by One-Hot encoding into a one-dimensional feature vector, which is fed into a fully connected neural network comprising several hidden layers; the output of the last hidden layer serves as the patient's text feature vector.
For example, the patient information text may include, but is not limited to, age, liquid-based cytology test (TCT) results, HPV results, and menopausal status. The TCT information is divided into 11 categories: missing; negative for intraepithelial lesion or malignancy; low-grade squamous intraepithelial lesion (LSIL); atypical squamous cells of undetermined significance (ASC-US); atypical squamous cells, cannot exclude a high-grade squamous intraepithelial lesion (ASC-H); high-grade squamous intraepithelial lesion; squamous cell carcinoma (SCC); atypical glandular cells (AGC); atypical glandular cells, favor neoplastic; endocervical adenocarcinoma in situ (AIS); and adenocarcinoma (Adca).
HPV results are divided into 5 categories: missing, negative, low-risk positive, other high-risk positive (the 12 non-16/18 high-risk types), and HPV 16/18 positive. Patient age is divided into 6 brackets: under 20, 20-30, 30-40, 40-50, 50-60, and over 60. Menopausal status is divided into two categories, yes and no.
Finally, the text information is One-Hot encoded into a one-dimensional feature vector and fed into a fully connected neural network. For example, the vector is input to a fully connected network comprising 5 hidden layers, with 23 input nodes, hidden layers of 64-128-256-128-64 nodes, and a final positive/negative output layer. The output of the 64 nodes of the last hidden layer is taken as the text feature vector.
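A sketch of this text branch follows, using the 23-dimensional input and the 64-128-256-128-64 hidden layout described above; the exact composition of the 23 input fields is our reading, not stated in the description:

```python
import torch
import torch.nn as nn

class TextFeatureNet(nn.Module):
    """One-Hot patient text -> hidden layers -> 64-d text feature + binary head."""
    def __init__(self, in_dim: int = 23):  # e.g. 11 TCT + 5 HPV + 6 age + menopause (assumed)
        super().__init__()
        dims = [in_dim, 64, 128, 256, 128, 64]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU(inplace=True)]
        self.hidden = nn.Sequential(*layers)
        self.out = nn.Linear(64, 2)          # positive / negative output layer

    def forward(self, onehot: torch.Tensor):
        text_feat = self.hidden(onehot)      # output of the last hidden layer
        return text_feat, self.out(text_feat)
```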
In step S103, the feature fusion stage, the low-level image features, the high-level semantic features, and the text features are each normalized before fusion, and the normalized features are then concatenated to obtain the fused features.
According to clinical statistics, a considerable proportion of colposcopic lesions occur near the cervical orifice. Fusing the SIFT features with the deep features obtained through triplet learning therefore gives the retrieval system stronger expressive power for local information in the colposcope image, especially feature information near the cervical orifice, while the text information helps the colposcopist retrieve cases with more similar age, TCT, and HPV detection results. Specifically, the method first normalizes the low-level SIFT features, the high-level deep learning features, and the text features, expressed as $F_i = N(F_l)$, where $F_l$ denotes a SIFT, deep learning, or text feature before normalization, $N$ denotes the normalization operation, and $F_i$ the normalized feature. The normalized features are then merged by concatenation:

$$F_{total} = [F_{sift}, F_{cnn}, F_{text}]$$

where $F_{sift}$, $F_{cnn}$, and $F_{text}$ denote the normalized SIFT, deep learning, and text features, respectively, and $F_{total}$ denotes the fused feature space, i.e., the cervical sample feature space.
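A minimal sketch of this fusion step; L2 normalization is assumed here, since the description specifies only "normalization":

```python
import numpy as np

def fuse_features(f_sift: np.ndarray, f_cnn: np.ndarray, f_text: np.ndarray) -> np.ndarray:
    """Normalize each modality, then concatenate: F_total = [F_sift, F_cnn, F_text]."""
    l2 = lambda v: v / (np.linalg.norm(v) + 1e-12)
    return np.concatenate([l2(f_sift), l2(f_cnn), l2(f_text)])
```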
In the similarity retrieval stage of step S104, to retrieve the target image closest to the cervical orifice image from the preset colposcope image library, the similarity between the cervical orifice image and each image in the library must be computed. The cosine distance is chosen to measure the similarity between features:

$$\operatorname{sim}(f, m) = \frac{\sum_{i=1}^{n} f_i\, m_i}{\sqrt{\sum_{i=1}^{n} f_i^2}\;\sqrt{\sum_{i=1}^{n} m_i^2}}$$

where $f_i$ denotes a component of a feature vector in the cervical sample feature space, $f \in F_{total}$, $m_i$ denotes a component of the feature vector of the image to be retrieved, and $n$ denotes the dimension of the feature space. The results are then sorted by similarity value, and the images corresponding to the top G preset values are presented as the final result.
In a preferred embodiment, retrieval over the above feature space uses the multi-modal fused feature vector of step S103, i.e., the fused feature formed from the normalized SIFT features, deep learning features, and text features.
In an alternative embodiment, unlike retrieval based on the multi-modal fused features, similarity retrieval may be performed on each single modality, and the retrieval results then fused by weighting. For example, the SIFT features, the deep learning features, and the text features may each be used to measure similarity, yielding retrieval results $S_{sift}$, $S_{cnn}$, and $S_{text}$ respectively, which are then weighted:

$$\operatorname{Similarity} = w_1 S_{sift} + w_2 S_{text} + w_3 S_{cnn}$$

where $w_i$ (i = 1, 2, 3) is the weight of each similarity category.
In a further embodiment, the weight of each similarity category may be set and adjusted according to user requirements. For example, if the colposcope image retrieval results should favor similarity of the cervical region, one may set $(w_1, w_2, w_3) = (1, 0, 0)$; if the results should be ranked entirely by deep learning feature similarity, one may set $(w_1, w_2, w_3) = (0, 0, 1)$.
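A sketch of this weighted fusion of per-modality retrieval scores; array names are ours:

```python
import numpy as np

def weighted_similarity(s_sift, s_text, s_cnn, w=(1.0, 0.0, 0.0)):
    """Similarity = w1*S_sift + w2*S_text + w3*S_cnn, per the formula above.

    s_* are per-library-image similarity arrays; w = (1, 0, 0) ranks purely by
    SIFT (cervical-region) similarity, (0, 0, 1) purely by deep features.
    """
    w1, w2, w3 = w
    return w1 * np.asarray(s_sift) + w2 * np.asarray(s_text) + w3 * np.asarray(s_cnn)
```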
Compared with prior art methods, the deep-learning-based colposcope image screening method of the present invention has the following advantages:
1) Only a single colposcope image taken after normal saline application is required, which greatly speeds up acquisition relative to conventional colposcopy and effectively assists the colposcopist in completing primary screening.
2) By integrating multi-modal features such as low-level SIFT features and patient medical-record text features, the different features complement one another instead of relying solely on the single feature of a deep neural network, improving colposcope image retrieval accuracy.
3) In the image recognition stage, a triplet-loss-based multitask classification model enables the neural network to efficiently learn the commonalities within similar categories and the differences between categories of colposcope images, so the high-dimensional features better express the essential information of the colposcope image.
Example two
As shown in fig. 4, another aspect of the present invention provides a functional module architecture corresponding exactly to the foregoing method flow, i.e., a deep-learning-based colposcope image screening apparatus, including:
an image preprocessing module 201 for acquiring a colposcope image of a patient and segmenting the image to obtain a cervical orifice image;
a feature extraction module 202 for extracting low-level image features and high-level semantic features of the cervical orifice image and extracting text features of the patient information;
a feature fusion module 203 for fusing the low-level image features, the high-level semantic features, and the text features to obtain fused features;
and a retrieval screening module 204 for retrieving in a preset colposcope image library based on the fused features and screening out a target image associated with the cervical orifice image.
The apparatus can be realized through the deep-learning-based colposcope image screening method provided in Example one; for specific implementation details, refer to the description in Example one, which is not repeated here.
The invention also provides an electronic device comprising a processor and a memory connected to the processor, wherein the memory stores a plurality of instructions, and the instructions can be loaded and executed by the processor to enable the processor to execute the method according to the first embodiment.
The invention also provides a computer-readable storage medium storing a plurality of instructions for implementing the method according to the first embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (12)

1. A deep-learning-based colposcope image screening method, characterized by comprising the following steps:
S101, collecting a colposcope image of a patient, and segmenting the image to obtain a cervical orifice image;
S102, extracting low-level image features and high-level semantic features of the cervical orifice image, and extracting text features of the patient information;
S103, fusing the low-level image features, the high-level semantic features, and the text features to obtain fused features;
S104, retrieving in a preset colposcope image library based on the fused features, and screening out a target image associated with the cervical orifice image.
2. The method of claim 1, wherein segmenting the image to obtain the cervical orifice image comprises:
detecting the cervical region of the colposcope image through a preset target detection model;
and segmenting the cervical orifice image of the cervical orifice region from the colposcope image, with a channel attention module added between the skip connections of a U-Net segmentation framework.
3. The method of claim 1, wherein extracting the low-level image features of the cervical orifice image further comprises:
extracting a plurality of low-level image features of the cervical orifice image and fusing them into a local descriptor of the cervical orifice image.
4. The method of claim 3, wherein the low-level image features comprise SIFT features, SURF features, LBP features, and/or histogram information.
5. The method of claim 1, wherein extracting the high-level semantic features of the cervical orifice image further comprises:
performing feature extraction on the cervical orifice image with a triplet-loss-based deep learning model to obtain a high-dimensional feature vector as the high-level semantic features.
6. The method according to claim 5, wherein P × K training images are read in at a time during training of the deep learning model, where P is the number of classes of randomly selected training images and K is the number of training images randomly selected per class.
7. The method of claim 6, wherein the triplet loss is expressed as:

$$L_{BH} = \sum_{i=1}^{P}\sum_{a=1}^{K}\left[\operatorname{margin} + \max_{j=1,\dots,K} D\!\left(x_a^i, x_j^i\right) - \min_{\substack{p=1,\dots,P,\; p \neq i \\ n=1,\dots,K}} D\!\left(x_a^i, x_n^p\right)\right]_+$$

where margin is a boundary hyperparameter, $x_a^i$ denotes the a-th image in the i-th category, $x_n^p$ denotes the n-th image in the p-th category, $x_j^i$ denotes the j-th image in the i-th category, and D denotes the distance between the two images.
8. The method of claim 1, wherein extracting the text features of the patient information further comprises:
encoding the patient's basic information and the text of the examination results with One-Hot encoding to form a one-dimensional feature vector, feeding it into a fully connected neural network comprising a plurality of hidden layers, and taking the output of the last hidden layer as the patient's text feature vector.
9. The method according to claim 1, wherein fusing the low-level image features, the high-level semantic features, and the text features to obtain the fused features comprises:
normalizing the low-level image features, the high-level semantic features, and the text features respectively;
and concatenating the normalized features to obtain the fused features.
10. A deep-learning-based colposcope image screening apparatus, characterized by comprising:
an image preprocessing module for acquiring a colposcope image of a patient and segmenting the image to obtain a cervical orifice image;
a feature extraction module for extracting low-level image features and high-level semantic features of the cervical orifice image and extracting text features of the patient information;
a feature fusion module for fusing the low-level image features, the high-level semantic features, and the text features to obtain fused features;
and a retrieval screening module for retrieving in a preset colposcope image library based on the fused features and screening out a target image associated with the cervical orifice image.
11. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions, the processor being configured to read the instructions and perform the method of any of claims 1-9.
12. A computer-readable storage medium storing a plurality of instructions readable by a processor and performing the method of any one of claims 1-9.
CN202111396135.0A 2021-11-23 2021-11-23 Colposcope image screening method and device based on deep learning and electronic equipment Pending CN114078137A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111396135.0A CN114078137A (en) 2021-11-23 2021-11-23 Colposcope image screening method and device based on deep learning and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111396135.0A CN114078137A (en) 2021-11-23 2021-11-23 Colposcope image screening method and device based on deep learning and electronic equipment

Publications (1)

Publication Number Publication Date
CN114078137A (en) 2022-02-22

Family

ID=80284092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111396135.0A Pending CN114078137A (en) 2021-11-23 2021-11-23 Colposcope image screening method and device based on deep learning and electronic equipment

Country Status (1)

Country Link
CN (1) CN114078137A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024037585A1 (en) * 2022-08-18 2024-02-22 北京数慧时空信息技术有限公司 Remote sensing image overall planning recommendation method based on content understanding



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination