CN115471838A - Cervical squamous lesion cell detection method based on depth self-adaptive feature extraction - Google Patents

Cervical squamous lesion cell detection method based on depth self-adaptive feature extraction

Info

Publication number
CN115471838A
CN115471838A (application CN202211161135.7A)
Authority
CN
China
Prior art keywords
cells
lesion
detection
cervical squamous
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211161135.7A
Other languages
Chinese (zh)
Inventor
李佐勇
彭中华
陈春强
胡蓉
林滨
樊好义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Minjiang University
Original Assignee
Minjiang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Minjiang University filed Critical Minjiang University
Priority to CN202211161135.7A
Publication of CN115471838A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/69 Microscopic objects, e.g. biological cells or cellular parts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/422 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation, for representing the structure of the pattern or shape of an object therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/69 Microscopic objects, e.g. biological cells or cellular parts
    • G06V20/695 Preprocessing, e.g. image segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention relates to a cervical squamous lesion cell detection method based on depth adaptive feature extraction. On the basis of YOLOv5, a Transformer module is used in the high-level feature layer of the backbone network, so that the 4 types of cervical squamous lesion cells (ASCUS, ASCH, HSIL and LSIL) can be distinguished and detected more accurately. In addition, the smear background is complex when detecting cells, being full of free cytoplasm, nuclei and other impurities; to address this, the invention adopts a bidirectional feature pyramid (BiFPN) to adaptively extract features of different levels in the picture, so as to reduce the influence of the complex background on the detection of squamous lesion cells. Furthermore, because cells stack and agglomerate, the detection boxes for independent cells and agglomerated cells differ in size; the invention therefore uses the convolutional block attention module (CBAM) to adaptively learn the size of the final cell detection box. The invention can improve the detection precision of cervical squamous lesion cells.

Description

Cervical squamous lesion cell detection method based on depth adaptive feature extraction
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a cervical squamous lesion cell detection method based on depth adaptive feature extraction.
Background
Currently, cancer is a leading cause of death in people under 70 years of age worldwide. According to data published by GLOBOCAN in 2020, there were nearly 19.3 million new cancer cases and nearly 10 million cancer deaths globally. Cervical cancer accounted for 604,000 new cases and 466,000 deaths; its new cases represent 6.5% of new cancer cases worldwide, ranking 7th among all cancers by incidence and mortality and 4th among female cancers. Studies by the WHO have shown that if women in developing countries could undergo cervical precancer screening every 5 years, the overall prevalence could drop by 60% or more. At present, in the most effective cervical cancer screening method proven by clinical medicine, doctors mainly analyze the morphological structure of cells by observing cervical smears through an optical microscope and finally make a diagnosis according to cytopathology principles. However, this method requires a large number of experienced medical workers, and the task of reviewing smears is tedious and laborious.
Current assisted screening systems for cervical cell smears mainly help from three aspects: cervical cell classification, detection and segmentation. Classification comprises cell classification and smear classification; unlike cell classification, smear classification mainly aims to predict the category of the whole smear. Segmentation includes the strategies of segmenting first and then refining, and detecting first and then finely segmenting. Cervical cell detection is mainly divided into two types. One is one-stage object detection, whose models are mainly variants of YOLO, RetinaNet and EfficientNet; their characteristic is obvious: overall inference is particularly fast because, unlike two-stage methods, there is no ROI proposal and classification step, but the drawback is equally obvious, namely accuracy relatively lower than that of two-stage detection models. The other is two-stage detection, with models mainly based on variants of Fast R-CNN and Faster R-CNN, characterized by higher detection accuracy than one-stage models, as mentioned above. However, thanks to continuous model iteration, the accuracy of some current one-stage detection models is now comparable to two-stage models; given today's extremely large examination volume and the time-sensitivity of smears, a fast and accurate detection model is urgently needed. Early classification and segmentation tasks were based on a few cropped single-cell datasets.
Classification at that time usually analyzed the distribution of cell nucleus and cytoplasm; the usual approach was to train a specific classifier on segmented nucleus and cytoplasm information, while the segmentation methods used were typically algorithms such as threshold segmentation, watershed segmentation and C-means clustering. For example, Thanatip used fuzzy C-means clustering to segment cell images and then trained a classifier; Willian trained a Weka segmentation classifier and combined it with the fuzzy C-means algorithm for classification. These traditional algorithms rely on extracting cell tissue features with a segmentation algorithm in advance; in other words, a segmentation algorithm that can cleanly separate nucleus and cytoplasm is required to guarantee overall classification accuracy. Since CNNs show excellent ability in object feature extraction, and using a CNN to extract cell features does not require a segmentation algorithm to acquire cell tissue information in advance, many CNN-based classification methods have emerged in recent research. These models achieved results superior to earlier studies on the Herlev and HEMLBC (H&E stained manual LBC) classification datasets, but they are difficult to apply in real clinical medicine to assist medical staff, because the datasets they use are relatively simple. Most cervical cell smears contain overlapping cells; in the classification process most cells still have to be cropped into single cells or cell clusters, the overall inference time is proportional to the number of cells, and cropping a smear is very tedious, so the assistance the classification task offers to actual clinical medicine is extremely limited.
Of course, in previous work there are many one-stage object detection models, mostly modified and extended from YOLO, RetinaNet and EfficientNet; their authors used their own datasets to fully train and optimize the models, but the final results on squamous lesion cell detection were not particularly ideal. The features of different squamous lesion cells are so similar that they are hard to distinguish, as shown in FIG. 1, and there is a strong desire in both current model research and clinical medicine to address this problem.
The challenges in image-based detection of cervical squamous lesion cells are: (1) in clinical medicine, squamous lesion cells are very similar in their characteristics and are often hard to distinguish, which complicates subsequent pathological analysis; (2) cervical smears often contain poor-quality samples in which multiple cells are stacked together, making detection difficult; (3) the criteria for judging lesion cells differ across age groups, so it is often necessary to consider the cells surrounding a given cell to decide whether it is a squamous lesion cell; (4) the smear background is complex, full of free cytoplasm, nuclei and other impurities, which affect the detection of cervical squamous lesion cells to different degrees.
Disclosure of Invention
The invention aims to improve the detection precision of cervical squamous lesion cells and provides a cervical squamous lesion cell detection method based on depth adaptive feature extraction.
In order to realize this purpose, the technical scheme of the invention is as follows: a cervical squamous lesion cell detection method based on depth adaptive feature extraction uses a Transformer module in the high-level feature layer of the backbone network on the basis of YOLOv5; the neck network adopts a bidirectional feature pyramid (Bi-FPN) to adaptively extract features of different levels in the picture; and the prediction head module uses the convolutional block attention module (CBAM) to adaptively learn the size of the final cell detection box.
Compared with the prior art, the invention has the following beneficial effects:
in order to enhance the model's ability to acquire global information and better extract cell features adaptively, the invention uses a Transformer module in the high-level feature layer of the backbone network on the basis of YOLOv5, so that the 4 types of cervical squamous lesion cells (ASCUS, ASCH, HSIL and LSIL) can be distinguished and detected more accurately. In addition, the smear background is complex when detecting cells, filled with free cytoplasm, nuclei and other impurities; to address this, the invention adopts a bidirectional feature pyramid (BiFPN) to adaptively extract features of different levels in the picture, so as to reduce the influence of the complex background on the detection of squamous lesion cells. Furthermore, because cells stack and agglomerate, the detection boxes for independent cells and agglomerated cells differ in size; the invention therefore uses the convolutional block attention module (CBAM) to adaptively learn the size of the final cell detection box. A large number of squamous cell detection experiments show that the algorithm of the invention improves, to different degrees, the detection of all 4 types of lesion cells.
Drawings
FIG. 1 is an example of cervical cell images prepared with rapid and standard staining.
Fig. 2 is a block diagram of the algorithm of the present invention.
Fig. 3 is a backbone network: (left) original backbone network structure; (right) backbone network structure after optimization using Transformer Block.
Fig. 4 is a neck network: (a) a feature pyramid; (b) a pixel aggregation network; (c) a feature pyramid network of neural architecture search; (d) a bidirectional feature pyramid network.
Fig. 5 shows a prediction head module: attention convolution module.
FIG. 6 is a random affine example graph: (a) a translation transformation; (b) a scaling transform; (c) a shear transformation; and (d) rotating and transforming.
FIG. 7 is an exemplary illustration of mixing enhancement: (a) original 1; (b) Effect after performing Mosaic enhancement together with other sample pictures; (c) original figure 2; (d) Effect after MixUp enhancement with other sample pictures.
Fig. 8 is sample 1 and the corresponding label.
FIG. 9 shows the results of the detection of the present invention on sample 1.
Fig. 10 is sample 2 and the corresponding label.
FIG. 11 shows the results of the test of the present invention on sample 2.
Fig. 12 shows sample 3 and the corresponding label.
FIG. 13 shows the results of the detection of the present invention on sample 3.
Detailed Description
The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.
The invention relates to a cervical squamous lesion cell detection method based on depth adaptive feature extraction, which uses a Transformer module in the high-level feature layer of the backbone network on the basis of YOLOv5; the neck network adopts a bidirectional feature pyramid (Bi-FPN) to adaptively extract features of different levels in the picture; and the prediction head module uses the convolutional block attention module (CBAM) to adaptively learn the size of the final cell detection box.
The following is a specific implementation process of the present invention.
As shown in FIG. 2, in order to enhance the model's ability to acquire global information and better extract cell features adaptively, the invention uses a Transformer module in the high-level feature layer of the backbone network on the basis of YOLOv5, so that the 4 types of cervical squamous lesion cells (ASCUS, ASCH, HSIL, LSIL) can be distinguished and detected more accurately. In addition, the smear background is complex when detecting cells, filled with free cytoplasm, nuclei and other impurities; to address this, the invention adopts a bidirectional feature pyramid (BiFPN) to adaptively extract features of different levels in the picture, so as to reduce the influence of the complex background on the detection of squamous lesion cells. Furthermore, because cells stack and cluster, the detection boxes for independent cells and clustered cells differ in size; the invention therefore uses the convolutional block attention module (CBAM) to adaptively learn the size of the final cell detection box.
1. Network architecture
1.1, backbone network
The invention is a model optimized through modular modification of YOLOv5. The backbone network uses CSP-Darknet53, but the 9th and 11th layers are modified: a Transformer Block, which better captures global information, replaces the original C3 Block, as shown in FIG. 3. The replacement is made at the end of the backbone because ZFNet demonstrated that the high-dimensional features extracted deep in a neural network represent overall information better than low-dimensional features, and the added computation after the replacement is modest. Furthermore, since ViT applied the Transformer to image classification in computer vision and demonstrated its strong feature extraction capability, we also want to use the Transformer to help the model distinguish the features of different types of squamous lesion cells.
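As an illustrative sketch (not the patent's exact block, which also carries multi-head projections, an MLP and residual connections), the global mixing a Transformer block adds can be reduced to single-head scaled dot-product self-attention over the flattened feature-map tokens:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with Q = K = V = tokens.

    Each output token is a weighted mix of *all* input tokens, which is
    why a Transformer block at the end of the backbone can relate
    features from distant regions of the feature map.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)  # attention weights over all tokens
        out.append([sum(wj * v[i] for wj, v in zip(w, tokens))
                    for i in range(d)])
    return out
```

Unlike a 3x3 convolution, whose receptive field is local, every output row here depends on every input row, which matches the stated goal of strengthening global information.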
1.2 neck network
As described in EfficientDet, although FPN and PAN are widely used for multi-scale feature fusion in various object detection models, neither analyzes the importance of features at different scales in detail; both treat this information almost indiscriminately. To solve this problem, Bi-FPN was designed with a weighting method; its structure is shown in FIG. 4(d). Bi-FPN introduces an extra weight for each input of the network, deletes nodes with only one input edge, and adds a skip connection between input and output nodes of the same scale. In general, Bi-FPN assigns a different weight to each layer during fusion, so the network focuses more on important layers while unnecessary node connections are reduced. Its authors demonstrated the efficiency and accuracy of this design in a large number of experiments. Therefore, when designing the model for detecting cervical squamous lesion cells, the original PAN is replaced with Bi-FPN so that the overall model can better handle features of different scales. The final experiments show that using Bi-FPN improves the recognition accuracy of the model on the different classes while adding as little computation as possible.
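The per-input weighting can be sketched as EfficientDet's "fast normalized fusion"; the function below is a minimal, framework-free illustration on flat feature vectors, where the weights are assumed to be learned parameters:

```python
def fused_feature(inputs, weights, eps=1e-4):
    """EfficientDet-style fast normalized fusion.

    Each input feature map gets a learnable weight, clamped to be
    non-negative and normalized so the weights sum to roughly 1; the
    output is the weighted sum of the inputs.
    """
    ws = [max(w, 0.0) for w in weights]   # ReLU keeps weights >= 0
    total = sum(ws) + eps                 # eps avoids division by zero
    n = len(inputs[0])
    return [sum(w * f[i] for w, f in zip(ws, inputs)) / total
            for i in range(n)]
```

During training the weights drift toward the layers that matter for the task, which is exactly the "focus on important layers" behavior described above.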
1.3 prediction of head modules
CBAM, the Convolutional Block Attention Module, is an attention mechanism that combines spatial and channel attention. Compared with SENet, which attends only to channels, it achieves better results. The module is simple and efficient, extracting attention from the intermediate feature map from different angles, so the features of the input feature map can be refined adaptively. Moreover, because CBAM is an end-to-end structure, it can be applied flexibly to different models, and experiments embedding CBAM in many classical models on different datasets have shown its efficiency in extracting target features. The details of the module are shown in FIG. 5. Since most of the images we detect have complex backgrounds, such as free nuclei, cytoplasm, mucus threads and other impurities, and cells are often heavily stacked, we use the CBAM module to help YOLO focus on the target objects to be detected.
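A minimal sketch of the channel-attention half of CBAM (the spatial half, which pools across channels and applies a convolution, is omitted); the shared MLP is abstracted here as any scalar callable, an assumption for brevity:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feature_maps, mlp):
    """CBAM channel attention on a list of 2-D channel maps.

    Average-pooled and max-pooled descriptors of each channel pass
    through a shared MLP (here any scalar callable); their sum is
    squashed with a sigmoid to give one scale per channel, which
    rescales that channel's map.
    """
    scales = []
    for ch in feature_maps:
        flat = [v for row in ch for v in row]
        avg = sum(flat) / len(flat)           # average pooling
        mx = max(flat)                        # max pooling
        scales.append(sigmoid(mlp(avg) + mlp(mx)))
    return [[[s * v for v in row] for row in ch]
            for ch, s in zip(feature_maps, scales)]
```

With a trained MLP, the per-channel scales suppress background-dominated channels and amplify cell-related ones, which is the refinement role CBAM plays in the prediction head.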
1.4 data enhancement
1.4.1 random affine enhancement
At the beginning of training we assumed the network misidentified cells because deformation of samples might turn positive samples into negative ones, so we initially avoided random affine data enhancement that deforms cells, such as Translation, Scale, Flip, Rotation and Shear, as shown in FIG. 6. But the overall effect was quite poor. On one occasion we mistakenly enabled data enhancement that can cause deformation, and the final effect was particularly good. We then explored the data enhancement best suited to this task and finally adopted the parameters in Table 1, raising the overall mAP by 2.5 points compared with the default data enhancement.
Table 1: random affine data enhancing corresponding parameters
[Table 1 is provided as an image in the original patent filing.]
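Since Table 1's actual parameter values are only available as an image in the filing, the sketch below uses hypothetical defaults; it only illustrates how one 2-D affine transform composes the rotation, scale, shear and translation operations named above:

```python
import math

def affine_point(x, y, angle_deg=0.0, scale=1.0, shear_deg=0.0,
                 tx=0.0, ty=0.0):
    """Apply rotation, isotropic scale, x-shear and translation to a point.

    Illustrative only: the patent's actual parameter ranges (Table 1)
    are published as an image, so the defaults here are hypothetical.
    """
    a = math.radians(angle_deg)
    s = math.radians(shear_deg)
    # rotation + isotropic scale
    xr = scale * (x * math.cos(a) - y * math.sin(a))
    yr = scale * (x * math.sin(a) + y * math.cos(a))
    # shear along x, then translate
    return xr + math.tan(s) * yr + tx, yr + ty
```

Applying such a transform to every pixel coordinate (and to the box corners) is what can elongate or skew a cell, the deformation the text worried might flip a positive sample to a negative one.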
1.4.2, mixed data enhancement
To give the model stronger robustness in complex environments, more complex data enhancement is used to combine different samples: a) Mosaic fuses several pictures into one, increasing sample diversity; b) MixUp fuses two pictures together with a certain transparency, increasing training difficulty and giving the model better semantic analysis capability. The specific effect is shown in FIG. 7.
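MixUp's blending step reduces to a convex combination of two images; a minimal sketch on grayscale rasters, where `lam` plays the role of the transparency factor:

```python
def mixup(img_a, img_b, lam):
    """Blend two same-sized grayscale images pixel-wise:
    out = lam * a + (1 - lam) * b, with lam in [0, 1]."""
    return [[lam * pa + (1.0 - lam) * pb for pa, pb in zip(ra, rb)]
            for ra, rb in zip(img_a, img_b)]
```

In the full method the two label sets are kept (or mixed with the same `lam`), so the network must explain both overlaid scenes at once, which is where the added training difficulty comes from.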
1.5 loss function
The loss function of the invention consists of three parts.
1.5.1 Object Score loss
When detecting objects, the model uses a binary cross-entropy loss to judge whether an object lies within the target box, where o is the label vector and t is the prediction vector:
$$L_{obj} = -\sum_{i}\left[o_{i}\ln\sigma(t_{i}) + (1 - o_{i})\ln\big(1 - \sigma(t_{i})\big)\right]$$
1.5.2 Class Probability Score loss
When detecting classes, as for the Object Score, the model uses a binary cross-entropy loss to judge whether the class of the object in the target box is correct, where o is the label vector and t is the prediction vector:
$$L_{cls} = -\sum_{i}\left[o_{i}\ln\sigma(t_{i}) + (1 - o_{i})\ln\big(1 - \sigma(t_{i})\big)\right]$$
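Both the objectness and class-probability terms are standard binary cross-entropy over sigmoid outputs; a minimal sketch with `labels` as the label vector o and `logits` as the prediction vector t (inputs assumed away from saturation):

```python
import math

def bce_loss(labels, logits):
    """Mean binary cross-entropy over sigmoid(logits).

    labels (o) and logits (t) play the roles of the label vector and
    prediction vector; extreme logits are not numerically guarded.
    """
    total = 0.0
    for o, t in zip(labels, logits):
        p = 1.0 / (1.0 + math.exp(-t))  # sigmoid probability
        total -= o * math.log(p) + (1.0 - o) * math.log(1.0 - p)
    return total / len(labels)
```

A logit of 0 corresponds to p = 0.5, so a positive label there costs ln 2; production implementations fuse the sigmoid and log for numerical stability.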
1.5.3 Bounding Box loss
When judging whether the target object is tightly enclosed by the prediction box, the model uses the better-performing CIoU loss:
$$L_{CIOU} = 1 - IoU + \frac{D_{2}^{2}}{D_{c}^{2}} + \alpha v$$
wherein α is:
$$\alpha = \frac{v}{(1 - IoU) + v}$$
v is:
$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}$$
where $D_2$ is the distance between the center points of the prediction box and the target box, and $D_c$ is the diagonal length of their minimum enclosing rectangle; $w$ and $h$ are the width and height of the prediction box, and $w^{gt}$ and $h^{gt}$ are the width and height of the ground-truth box.
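A minimal sketch of the CIoU loss for axis-aligned boxes in (x1, y1, x2, y2) form, following the terms defined above:

```python
import math

def ciou_loss(pred, gt):
    """CIoU loss: 1 - IoU + D2^2/Dc^2 + alpha*v for two (x1,y1,x2,y2) boxes."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # intersection-over-union
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter)
    # squared center distance D2^2 and enclosing-box diagonal Dc^2
    d2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4.0
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    dc2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency term v and its trade-off weight alpha
    v = 4.0 / math.pi ** 2 * (math.atan((gx2 - gx1) / (gy2 - gy1))
                              - math.atan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / ((1.0 - iou) + v) if v > 0.0 else 0.0
    return 1.0 - iou + d2 / dc2 + alpha * v
```

The center-distance and aspect-ratio terms keep gradients informative even when the boxes do not overlap, which is why CIoU outperforms plain IoU loss for regressing box sizes of isolated versus clustered cells.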
To evaluate the performance of the detection algorithm, this experiment used the CDetector [1] dataset, which contains 7,410 cervical microscopic image patches covering 11 classes, all cropped from whole-slide microscopic images; 6,666 patches form the training set and 744 the test set. The slide micrographs were taken with a Pannoramic MIDI II digital slide scanner; the specimens were prepared with the ThinPrep method and stained with Papanicolaou stain. Experts labeled all image patches according to TBS categories and provided the corresponding bounding boxes and class names for 11 classes: atypical squamous cells, cannot exclude HSIL (ASCH); atypical squamous cells of undetermined significance (ASCUS); high-grade squamous intraepithelial lesion (HSIL); low-grade squamous intraepithelial lesion (LSIL); flora (FLORA); Trichomonas (TRICH); atypical glandular cells (AGC); Actinomyces (ACTIN); Candida (CAND); herpes (HERPS); and squamous cell carcinoma (SCC). We selected only the 4 types of squamous lesion cells (ASCUS, ASCH, HSIL, LSIL) in CDetector for training and analysis. These 4 classes comprise 3,502 pictures in the training set and 505 in the test set, and the training set was further divided into training and validation sets at a ratio of 9:1. The algorithm of the invention was compared with various existing detection methods; the results are shown in Table 2, and the three loss values of the different methods are shown in FIG. 8. The algorithms were then quantitatively compared on the CDetector dataset using the Pascal VOC measure mAP.
The mAP is the mean of the AP values the model obtains on each cell class, where AP is the area under the Precision-Recall curve:
$$AP = \sum_{i=1}^{n-1}(r_{i+1} - r_{i})\,P_{\mathrm{interp}}(r_{i+1})$$
$$mAP = \frac{1}{N}\sum_{j=1}^{N} AP_{j}$$
where $r_1, r_2, \ldots, r_n$ are the recall values, arranged in ascending order, at which the precision curve is interpolated; Precision and Recall are respectively:
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
where TP (True Positive) denotes positive samples predicted as positive by the model, FP (False Positive) denotes negative samples predicted as positive, and FN (False Negative) denotes positive samples predicted as negative.
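The AP computation above can be sketched as the area under the precision-recall curve with monotone interpolation; recalls are assumed sorted in ascending order, matching the $r_1, \ldots, r_n$ in the text:

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve.

    recalls must be ascending; precision is first made monotonically
    non-increasing from right to left (the usual Pascal VOC style
    interpolation), then the curve is integrated over recall.
    """
    ps = list(precisions)
    for i in range(len(ps) - 2, -1, -1):
        ps[i] = max(ps[i], ps[i + 1])      # interpolate precision
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, ps):
        ap += (r - prev_r) * p             # rectangle under the curve
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For mAP@0.5 the TP/FP decisions behind the precision-recall points use an IoU threshold of 0.5; mAP@0.5:0.95 averages the same computation over thresholds from 0.5 to 0.95.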
A higher mAP value represents better detection. The experimental environment of the invention is the Ubuntu 20.04.3 LTS operating system, running on an NVIDIA GeForce RTX 3080 Ti GPU with 12 GB of video memory and 128 GB of RAM; the CPU is an Intel(R) Xeon(R) Silver 4210 CPU @ 2.20 GHz. All models are based on the Python 3.8 runtime environment, the PyTorch 1.11.0 deep learning framework, and the CUDA 11.6 and cuDNN 8.3 acceleration libraries.
2. Effect display
Fig. 8 shows sample 1 and the corresponding labels. FIG. 9 shows the results of the detection of the present invention on sample 1. Fig. 10 is sample 2 and the corresponding label. FIG. 11 shows the results of the test of the present invention on sample 2. Fig. 12 shows sample 3 and the corresponding label. FIG. 13 shows the results of the detection of the present invention on sample 3.
3. Qualitative comparison
In order to qualitatively compare the detection effect of our algorithm with other strong baseline algorithms on the 4 kinds of cervical squamous lesion cells, we performed comparative experiments and show the detection results of our model on ASCUS, ASCH, HSIL and LSIL samples in FIGS. 8-13, which show that our invention can effectively and accurately locate these kinds of lesion cells.
As can be seen from FIGS. 8, 10 and 12, the characteristics of these 4 types of squamous lesion cells in real microscopic smears are very similar, and it is almost impossible for a person without specialized training to distinguish them. However, our model distinguishes the lesion cells in these samples with excellent accuracy; although occasional misjudgments appear, it is still suitable as a cervical lesion cell detection system providing pathology-assisted analysis for current clinical medical testing.
As shown in Table 2, compared with the Faster R-CNN and RetinaNet models with large numbers of parameters, with models of other YOLO series, and with YOLOv5 models of larger scale and parameter count, our model has fewer parameters, higher detection accuracy on the 4 types of squamous lesion cells, and faster inference speed.
TABLE 2 quantitative comparison of the results of the present invention with other excellent models for detecting diseased cells
[Table 2 is provided as an image in the original patent filing.]
4. Quantitative comparison
To quantitatively compare the accuracy of cervical lesion detection across five methods (i.e., Faster R-CNN, RetinaNet, YOLOv3, YOLOv5 and the algorithm of the invention) and corresponding models of different parameter scales, we performed experiments on a dataset consisting of 3,150 cervical cell images stained with Papanicolaou stain and quantitatively evaluated the detection results with two measures, mAP@0.5 and mAP@0.5:0.95. Table 2 shows the quantitative evaluation of the 4-class squamous cell detection results on this dataset, with the best measure in each column shown in bold. It is easy to see that the algorithm of the invention achieves the best detection effect on cervical lesion cells with almost the same parameter count and computation. Compared with other models of larger parameter count and computation, it still performs better. Overall, therefore, the algorithm of the invention has the best detection effect on the four types of cervical lesion cells.
The above are preferred embodiments of the present invention, and all changes made according to the technical scheme of the present invention that produce functional effects do not exceed the scope of the technical scheme of the present invention belong to the protection scope of the present invention.

Claims (6)

1. A cervical squamous lesion cell detection method based on depth adaptive feature extraction, characterized in that a Transformer module is used in the high-level feature layer of the backbone network on the basis of YOLOv5; the neck network adopts a bidirectional feature pyramid (Bi-FPN) to adaptively extract features of different levels in the picture; and the prediction head module uses the convolutional block attention module (CBAM) to adaptively learn the size of the final cell detection box.
2. The method for detecting cervical squamous lesion cells based on depth adaptive feature extraction as claimed in claim 1, wherein the backbone network uses CSP-Darknet53, and a Transformer Block replaces the C3 Block at the 9th and 11th layers.
3. The cervical squamous lesion cell detection method based on depth adaptive feature extraction as claimed in claim 1, wherein the neck network replaces PAN with Bi-FPN designed by a weighted method; the Bi-FPN introduces an extra weight to each input of the network, deleting nodes with only one input edge and adding a jump connection between the input node and the output node of the same scale.
4. The cervical squamous lesion cell detection method based on depth adaptive feature extraction as claimed in claim 1, wherein the convolutional block attention module CBAM assists YOLOv5 in capturing the target object to be detected.
5. The cervical squamous lesion cell detection method based on depth adaptive feature extraction as claimed in claim 1, characterized in that the method adopts a mixed data enhancement mode to combine different samples, specifically: a) A plurality of pictures are fused into one picture by adopting Mosaic, so that the diversity of the sample is increased; b) And the MixUp is adopted to fuse the two pictures together according to the preset transparency, so that the training difficulty is increased, and the model has better semantic analysis capability.
6. The cervical squamous lesion cell detection method based on depth-adaptive feature extraction according to claim 1, wherein the loss function adopted by the method consists of three parts, specifically:
1) Object Score loss
When detecting objects, a binary cross-entropy loss is used to judge whether an object in the target box is present in the box, where o is the label vector and t is the prediction vector:
$$L_{obj} = -\sum_{i}\left[o_i\ln(t_i) + (1-o_i)\ln(1-t_i)\right]$$
2) Class Probability Score loss
When detecting the class, as with the Object Score loss, a binary cross-entropy loss is used to judge whether the target box contains the correct class, where o is the label vector and t is the prediction vector:
$$L_{cls} = -\sum_{i}\left[o_i\ln(t_i) + (1-o_i)\ln(1-t_i)\right]$$
3) Bounding Box loss
To measure whether the prediction box tightly encloses the target object, the CIoU loss is used:
$$L_{CIoU} = 1 - IoU + \frac{D_2^2}{D_c^2} + \alpha v$$
wherein α is:
$$\alpha = \frac{v}{(1 - IoU) + v}$$
v is:
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$
where D_2 is the distance between the center points of the prediction box and the target box, and D_c is the diagonal distance of their minimum enclosing rectangle; w and h are the width and height of the prediction box, and w^{gt} and h^{gt} are the width and height of the ground-truth box.
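For illustration, the Bounding Box loss above can be sketched in plain Python. This is a generic reconstruction of the standard CIoU formula using the symbols defined in the claim (D_2, D_c, w, h, w^gt, h^gt), not code from the patent; the function name and the [x1, y1, x2, y2] box format are assumptions:

```python
import math

def ciou_loss(box_p, box_g, eps=1e-7):
    """CIoU loss for axis-aligned boxes given as [x1, y1, x2, y2]."""
    # Intersection-over-Union of prediction and ground truth
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter + eps)
    # D2^2: squared distance between the two box centers
    d2 = ((box_p[0] + box_p[2]) / 2 - (box_g[0] + box_g[2]) / 2) ** 2 \
       + ((box_p[1] + box_p[3]) / 2 - (box_g[1] + box_g[3]) / 2) ** 2
    # Dc^2: squared diagonal of the minimum enclosing rectangle
    dc = (max(box_p[2], box_g[2]) - min(box_p[0], box_g[0])) ** 2 \
       + (max(box_p[3], box_g[3]) - min(box_p[1], box_g[1])) ** 2 + eps
    # v: aspect-ratio consistency term; alpha: its trade-off weight
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + d2 / dc + alpha * v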
CN202211161135.7A 2022-09-22 2022-09-22 Cervical squamous lesion cell detection method based on depth self-adaptive feature extraction Pending CN115471838A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211161135.7A CN115471838A (en) 2022-09-22 2022-09-22 Cervical squamous lesion cell detection method based on depth self-adaptive feature extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211161135.7A CN115471838A (en) 2022-09-22 2022-09-22 Cervical squamous lesion cell detection method based on depth self-adaptive feature extraction

Publications (1)

Publication Number Publication Date
CN115471838A true CN115471838A (en) 2022-12-13

Family

ID=84334303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211161135.7A Pending CN115471838A (en) 2022-09-22 2022-09-22 Cervical squamous lesion cell detection method based on depth self-adaptive feature extraction

Country Status (1)

Country Link
CN (1) CN115471838A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116704317A (en) * 2023-08-09 2023-09-05 深圳华付技术股份有限公司 Target detection method, storage medium and computer device
CN116704317B (en) * 2023-08-09 2024-04-19 深圳华付技术股份有限公司 Target detection method, storage medium and computer device
CN117152746A (en) * 2023-10-27 2023-12-01 南方医科大学 Method for acquiring cervical cell classification parameters based on YOLOV5 network
CN117152746B (en) * 2023-10-27 2024-03-26 南方医科大学 Method for acquiring cervical cell classification parameters based on YOLOV5 network

Similar Documents

Publication Publication Date Title
US5987158A (en) Apparatus for automated identification of thick cell groupings on a biological specimen
Ramesh et al. Isolation and two-step classification of normal white blood cells in peripheral blood smears
CN110765855B (en) Pathological image processing method and system
CN115471838A (en) Cervical squamous lesion cell detection method based on depth self-adaptive feature extraction
CN111062296B (en) Automatic white blood cell identification and classification method based on computer
EP4290451A1 (en) Deep neural network-based method for detecting living cell morphology, and related product
CN114945941A (en) Non-tumor segmentation for supporting tumor detection and analysis
CN112365471B (en) Cervical cancer cell intelligent detection method based on deep learning
Mao et al. A deep convolutional neural network trained on representative samples for circulating tumor cell detection
CN116580394A (en) White blood cell detection method based on multi-scale fusion and deformable self-attention
CN112784767A (en) Cell example segmentation algorithm based on leukocyte microscopic image
CN111062346A (en) Automatic leukocyte positioning detection and classification recognition system and method
Rachmad et al. Mycobacterium tuberculosis images classification based on combining of convolutional neural network and support vector machine
Du et al. Automatic classification of cells in microscopic fecal images using convolutional neural networks
Zhang et al. Research on application of classification model based on stack generalization in staging of cervical tissue pathological images
CN114387596A (en) Automatic interpretation system for cytopathology smear
Mustafa et al. Malaria parasite diagnosis using computational techniques: a comprehensive review
CN115775226B (en) Medical image classification method based on transducer
Kim et al. Nucleus segmentation and recognition of uterine cervical pap-smears
Angulo et al. Microscopic image analysis using mathematical morphology: Application to haematological cytology
CN114898866A (en) Thyroid cell auxiliary diagnosis method, equipment and storage medium
CN114037868A (en) Image recognition model generation method and device
Song et al. Automatic vaginal bacteria segmentation and classification based on superpixel and deep learning
Boonsiri et al. 3D gray level co-occurrence matrix based classification of favor benign and borderline types in follicular neoplasm images
Ma et al. CHS-NET: A cascaded neural network with semi-focal loss for mitosis detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination