CN115965602A - Abnormal cell detection method based on improved YOLOv7 and Swin-Unet - Google Patents
- Publication number
- CN115965602A (application number CN202211726362.XA)
- Authority
- CN
- China
- Prior art keywords
- cell
- swin
- yolov7
- model
- improved
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Image Analysis (AREA)
- Image Processing (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
The invention discloses an abnormal cell detection method based on improved YOLOv7 and Swin-Unet, comprising the following steps: collecting pathological cell smear images and making an abnormal cell detection data set and a segmentation data set; constructing and training an improved YOLOv7 model; constructing a detection result screening module to classify the cells in the images output by the detection network; building and training a Swin-Unet model for segmenting overlapped cell cluster images, in which a Swin-Transformer module is introduced into the Unet model to sample the local and global relations of the cell image at multiple scales; and performing abnormal cell detection with the improved YOLOv7 and Swin-Unet models. The invention makes full use of context information during cell detection, effectively processes hard-to-detect cell clusters, and can greatly improve accuracy and recall while maintaining detection speed.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an abnormal cell detection method based on improved YOLOv7 and Swin-Unet.
Background
Pathological cytology examination takes cytology specimens, such as exfoliated cells in sputum or liquid-based cells, prepares pathological slides through smearing and sectioning techniques, and then diagnoses disease by observing cell types and morphology under a microscope. Screening for diseases such as breast cancer and cervical cancer mostly relies on pathological cytology examination; cervical cancer in particular is the most common gynecological malignant tumor worldwide and seriously threatens women's lives. A 2016 World Health Organization report noted that more than 500,000 new cervical cancer cases occur globally every year, with China accounting for about 28 percent of the cases in developing countries. Early treatment of the cancer is effective, inexpensive, and straightforward, but the disease shows no obvious early symptoms and is not easy to find. Cytology (including the traditional Pap smear) is the main screening means for common female cancers such as cervical cancer in China, yet the overall screening level is not high, mainly because experienced domestic cytopathologists and auxiliary personnel are scarce. Computer-aided detection of pathological cells is therefore highly necessary and valuable.
Detection methods in the prior art are mainly based on deep learning and include target-detection-based methods and instance-segmentation-based methods. Chinese patent application CN202111048528.2, "An abnormal cell detection method based on an attention-guiding mechanism", uses the advanced target detection network RetinaNet to screen suspicious cells and then classifies them with a Mean-Teacher network equipped with an attention-guiding mechanism. That method effectively suppresses false positives in the detection process and improves detection precision, but it performs poorly as sample noise and the number of overlapped cell clusters increase. The main shortcomings are: (1) overlapping abnormal cells are difficult to detect; (2) when the sample contains non-cell units such as tissue fluid, detection precision drops significantly; (3) the introduced attention mechanism does not sufficiently combine multi-scale information, so detection performance still needs improvement.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing an abnormal cell detection method based on improved YOLOv7 and Swin-Unet. It detects hard-to-find overlapped abnormal cells in stages, exploiting the good large-target performance of target detection and the high precision of instance segmentation to handle difficult cell samples effectively and prevent missed and false detections. The detection head adopts the state-of-the-art dynamic head attention mechanism, which fully fuses attention across the scale, spatial, and task dimensions and can greatly improve detection precision.
In order to solve the technical problems, the invention adopts the following technical scheme.
An abnormal cell detection method based on improved YOLOv7 and Swin-Unet, comprising the following steps:
step 1, collecting a cell smear image in pathological cytology examination, and making an abnormal cell detection data set and an abnormal cell segmentation data set;
step 2, constructing an improved YOLOv7 model and training it, for detecting abnormal cells and overlapped cell clusters;
step 3, building a detection result screening module, classifying the cells in the images output by the detection network, outputting abnormal cell images directly, and passing overlapped cell cluster images on as input to the segmentation model;
step 4, building a Swin-Unet model and training, wherein the Swin-Unet model is used for segmenting the overlapped cell mass images: based on a most commonly used Unet model in the medical field, a Swin-Transformer module is introduced to sample the local relation and the global relation of the cell image under multiple scales; the Swin-Unet model structure comprises an Encoder Encoder, a Neck network Neck and a Decoder Decoder;
and step 5, carrying out abnormal cell detection by using the improved YOLOv7 and Swin-Unet models.
The step 1 process is as follows:
1-1, collecting cell smear images in pathological cytology examination, including cervical cell images and mammary gland cell images, performing sliding window cutting on original cell smear images, wherein the cutting size is 640 multiplied by 640, the overlapping range of the sliding window is 50%, obtaining cell images of small areas, performing rectangular frame labeling on independent abnormal cells and overlapping cell groups by using a LabelImg tool, storing labels as XML files, and making an abnormal cell detection data set for training an improved YOLOv7 model;
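The sliding-window cutting in step 1-1 (640 × 640 tiles with 50% overlap) can be sketched as follows; this is a minimal NumPy sketch, and the function name is illustrative rather than taken from the patent:

```python
import numpy as np

def sliding_window_crops(image, size=640, overlap=0.5):
    """Cut an H x W x C smear image into size x size tiles with the given overlap."""
    stride = int(size * (1 - overlap))  # 50% overlap -> stride of 320 pixels
    h, w = image.shape[:2]
    crops = []
    for y in range(0, max(h - size, 0) + 1, stride):
        for x in range(0, max(w - size, 0) + 1, stride):
            crops.append(image[y:y + size, x:x + size])
    return crops

# A 1280 x 1280 image yields a 3 x 3 grid of 640 x 640 tiles at stride 320.
tiles = sliding_window_crops(np.zeros((1280, 1280, 3), dtype=np.uint8))
```

In practice the border tiles of an image whose size is not a multiple of the stride would also need padding, which this sketch omits.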
and 1-2, screening cell images with labels of overlapped cell groups in the abnormal cell detection data set, subdividing a segmentation area by utilizing a polygonal labeling function of a LabelImg tool, labeling the area, labeling the abnormal cells, storing the labels, and preparing the abnormal cell segmentation data set for training a Swin-Unet model.
Specifically, in step 2, the building of the improved YOLOv7 model includes:
2-1, constructing an abnormal cell detection data preprocessing module, comprising: performing flipping and translation data enhancement on the cell image; and carrying out noise reduction on the cell image with Gaussian filtering, wherein the Gaussian kernel function is:
G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²)) (1)
wherein G(x, y) gives the weighting used to compute the pixel values of the denoised cell image, x and y are the coordinates of the pixel point, and σ is the standard deviation of the Gaussian, which determines the degree of smoothing of the cell image;
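The Gaussian denoising of step 2-1 can be sketched directly from equation (1); this is a plain NumPy sketch (real pipelines would typically call an optimized routine such as OpenCV's GaussianBlur), and the function names are illustrative:

```python
import numpy as np

def gaussian_kernel(ksize=5, sigma=1.0):
    """Build a ksize x ksize kernel from G(x, y) = exp(-(x^2 + y^2) / (2 sigma^2)) / (2 pi sigma^2)."""
    ax = np.arange(ksize) - ksize // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    return g / g.sum()  # renormalize so the smoothed image keeps its overall brightness

def gaussian_filter(image, ksize=5, sigma=1.0):
    """Denoise a single-channel image by convolving it with the Gaussian kernel (zero padding)."""
    k = gaussian_kernel(ksize, sigma)
    pad = ksize // 2
    padded = np.pad(image.astype(float), pad)
    out = np.zeros(image.shape, dtype=float)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            out[y, x] = np.sum(padded[y:y + ksize, x:x + ksize] * k)
    return out
```

A larger σ flattens the kernel and smooths the image more strongly, matching the role of σ described above.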
2-2, constructing the backbone network of the improved YOLOv7 model: the feature map of the input cell image is first convolved by 4 CBS modules, where a CBS module consists of a Conv layer, a BN layer, and a SiLU layer; stacked ELAN and MP modules then output three feature maps. The ELAN module contains several CBS modules; its input and output feature sizes stay unchanged, the number of channels changes in the first two CBS modules, the subsequent input channels all stay consistent with the output channels, and the last CBS module produces the required number of output channels. The MP module splices the output vectors of a Maxpool branch and a CBS branch;
2-3, building the neck network of the improved YOLOv7 model: fusing the three feature maps output by the backbone network with a PAFPN structure;
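The feature fusion in the neck can be illustrated with the top-down half of an FPN-style pass; this is a minimal NumPy sketch (a real PAFPN adds a further bottom-up pass and learned convolutions), and the shapes and function names are illustrative assumptions:

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a C x H x W feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fuse(p3, p4, p5):
    """Top-down pass only: upsample the coarser map and concatenate its channels
    with the finer one, so each scale carries context from above."""
    m4 = np.concatenate([p4, upsample2x(p5)], axis=0)
    m3 = np.concatenate([p3, upsample2x(m4)], axis=0)
    return m3, m4

# Three backbone outputs at strides 8 / 16 / 32 for a 640 x 640 input:
p3 = np.zeros((64, 80, 80)); p4 = np.zeros((128, 40, 40)); p5 = np.zeros((256, 20, 20))
m3, m4 = top_down_fuse(p3, p4, p5)  # m4: (384, 40, 40), m3: (448, 80, 80)
```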
2-4, building the head network of the improved YOLOv7 model: a dynamic head (Dyhead) module is introduced to perform attention fusion on the feature maps. The dynamic head module comprises scale-aware attention, spatial-aware attention, and task-aware attention, fitted together by stacking attention functions. The formula for applying self-attention is:
W(F) = π_C(π_S(π_L(F) · F) · F) · F (2)
wherein F ∈ R^(L×S×C) is the input feature tensor, L is the number of feature levels (scales), S = H × W is the reshaping of the height H and width W dimensions of the feature map, and C is the number of channels; π_L(·), π_S(·), and π_C(·) are the attention functions applied independently on the scale, spatial, and task dimensions, corresponding to formulas (3), (4), and (5):
π_L(F) · F = σ(f((1 / (S·C)) Σ_{S,C} F)) · F (3)
π_S(F) · F = (1 / L) Σ_{l=1}^{L} Σ_{k=1}^{K} w_{l,k} · F(l; p_k + Δp_k; c) · Δm_k (4)
π_C(F) · F = max(α¹(F) · F_c + β¹(F), α²(F) · F_c + β²(F)) (5)
In formula (3), f(·) is a linear function approximated by a 1 × 1 convolution and σ(·) is the Hard-Sigmoid activation function.
In formula (4), K is the number of sparse sampling locations, w_{l,k} is the weighting factor for level l and location k, p_k + Δp_k is a spatially offset position obtained by self-learning, and Δm_k is a self-learned scalar at position p_k.
In formula (5), [α¹, α², β¹, β²]^T = θ(·) is a hyper-function that learns to control the activation threshold, and F_c is the feature slice at the c-th channel.
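The task-aware attention of formula (5) reduces to a per-channel max of two learned affine responses. A minimal NumPy sketch follows; in the actual Dyhead module the α and β coefficients come from a small hyper-network over pooled features, whereas here they are passed in as fixed vectors for illustration:

```python
import numpy as np

def task_aware_attention(F, alpha1, beta1, alpha2, beta2):
    """Formula (5): max of two affine channel responses, applied elementwise.
    F has shape (L, S, C); each alpha/beta is a per-channel vector of length C."""
    return np.maximum(alpha1 * F + beta1, alpha2 * F + beta2)

F = np.array([[[1.0, -1.0]]])  # L = S = 1, C = 2
out = task_aware_attention(F,
                           np.array([1.0, 1.0]), np.array([0.0, 0.0]),
                           np.array([0.0, 0.0]), np.array([0.0, 0.0]))
# With alpha1 = 1, beta1 = 0, alpha2 = beta2 = 0 this reduces to ReLU: [[1., 0.]]
```

This shows why formula (5) generalizes common activations: different learned coefficients recover ReLU, linear, or channel-gating behaviour.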
Specifically, in step 2, the process of training the improved YOLOv7 model is as follows:
Cell images from the abnormal cell detection data set, of size 640 × 640, are input with the batch size set to 16; 180 epochs are trained to obtain the best-performing improved YOLOv7 model, overlapped cell cluster images in the detection results are extracted, and different IoU (intersection-over-union) thresholds are set. Here @x denotes the performance with the IoU threshold set to x; mAP is the mean of the AP computed for each category, with higher values indicating better detection; Recall, also called the recall ratio, is higher when fewer labeled cells are missed.
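The IoU threshold used throughout the evaluation is the standard intersection-over-union of a predicted box and a ground-truth box; a minimal sketch:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 10 x 10 boxes offset by 5 in x: intersection 50, union 150 -> IoU = 1/3
iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
```

A detection counts as a true positive at threshold x only when its IoU with a ground-truth box is at least x, which is what the @x notation above refers to.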
Specifically, the process of building the Swin-Unet model in step 4 includes:
4-1, constructing the encoder part of the Swin-Unet model: the minimum unit of the cell image is first converted from a pixel into a 4 × 4 patch by the Patch Partition layer, and a linear embedding module maintains the structure of the high-dimensional space; a down-sampling process follows, whose initial part still uses the convolution layers of Unet, while the two subsequent down-sampling stages are replaced by paired Swin-Transformer blocks, whose basic formula is:
Attention(Q, K, V) = SoftMax(Q·K^T / √d + B)·V (6)
wherein Q, K, and V are the query, key, and value matrices in self-attention respectively, d is the dimensionality, B is the learnable relative position bias, and Attention(Q, K, V) is the attention function within each patch window;
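The windowed attention of formula (6) can be sketched for a single window; this is a minimal NumPy sketch with a zero bias B for illustration (in Swin-Transformer, B is looked up from a learned relative-position table):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(Q, K, V, B):
    """Formula (6): Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d) + B) V,
    computed among the tokens of one local window; B is the relative position bias."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + B
    return softmax(scores, axis=-1) @ V

n, d = 4, 8  # 4 tokens (a 2 x 2 window), 8-dimensional embeddings
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = window_attention(Q, K, V, B=np.zeros((n, n)))  # shape (4, 8)
```

Restricting attention to windows, and shifting the windows between successive blocks, is what lets Swin-Transformer mix the local and global relations mentioned above at linear cost.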
4-2, building a neck layer of a Swin-Unet model: filtering the high-dimensional characteristic information after down-sampling by using a group of paired Swin-transformers;
4-3, constructing the decoder part of the Swin-Unet model: a network structure mirroring the encoder network. It first applies paired Swin-Transformer + patch expanding twice and applies plain up-sampling + 2× convolution to the last layer of features; in each up-sampling module the feature map is spliced with that of the corresponding encoder stage to form a residual block, the output features undergo one linear projection, and they are sent into a convolution network for classification to obtain the final output result.
Specifically, the training process of the Swin-Unet model in the step 4 is as follows:
The overlapped cell cluster images are taken as input, the batch size is set to 64, 80 epochs are trained, and different IoU (intersection-over-union, here the overlap threshold between the predicted Mask and the Ground-Truth Mask in the segmentation task) thresholds are set.
In step 5, the process of abnormal cell detection using the improved YOLOv7 and Swin-Unet models comprises the following steps:
5-1, obtaining a cell smear image in the pathological cytology examination, performing sliding window cutting on the cell smear image, and inputting the cut cell image into a detection network in sequence;
5-2, extracting the features of the cell image by using the improved backbone network of the YOLOv7 network, sending feature maps of different scales into a neck network, fusing the features, sending the feature maps into a detection head network, and outputting a detection result;
5-3, screening the detection results with the detection result screening module: a result judged to be an abnormal cell is output directly, while one judged to be an overlapped cell cluster is sent into the segmentation network;
and 5-4, the encoder of the Swin-Unet network down-samples the overlapped cell image, the neck network filters it, and it enters the decoder network for up-sampling with the residual structure; the output feature map enters a convolution layer to classify the image segmentation regions, the regions judged to be abnormal cells are output, and together with the abnormal cells from step 3 they form the final output result.
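The staged inference of step 5 can be summarized as a routing loop; this is an illustrative control-flow sketch only, where `detector`, `screener`, and `segmenter` are stand-ins for the trained improved YOLOv7, screening, and Swin-Unet models described above:

```python
def detect_abnormal_cells(tiles, detector, screener, segmenter):
    """Run detection on each tile, keep single abnormal cells directly,
    and route overlapped clusters to the segmentation model (steps 5-1 to 5-4)."""
    results = []
    for tile in tiles:
        for box in detector(tile):
            label = screener(tile, box)
            if label == "abnormal":
                results.append(("cell", box))
            elif label == "cluster":
                results.extend(("cell", region) for region in segmenter(tile, box))
    return results

# Toy stand-ins showing only the control flow:
tiles = ["t1"]
detector = lambda t: ["b1", "b2"]
screener = lambda t, b: "abnormal" if b == "b1" else "cluster"
segmenter = lambda t, b: ["r1", "r2"]
out = detect_abnormal_cells(tiles, detector, screener, segmenter)
# -> [("cell", "b1"), ("cell", "r1"), ("cell", "r2")]
```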
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention adopts a staged method of target detection plus instance segmentation: the high precision of the segmentation model is fully used to process hard-to-detect overlapped cell clusters, while the high performance of the original detection model and the easy annotation of its data are retained for the easily detected single abnormal cells. This effectively solves the problem of cell clusters that are hard to detect and prone to false detection without losing real-time detection performance, improves the recall and accuracy of detection, achieves a good balance between hardware cost and precision, and better meets practical requirements.
2. The invention introduces the state-of-the-art dynamic head module (Dyhead), which fits scale-aware, spatial-aware, and task-aware attention simultaneously through stacked attention functions, so that cell context correlations across multiple dimensions are fully considered in the detection process of the target detection network, matching the real diagnostic process of a pathologist; the model becomes more robust and detection accuracy is improved.
3. The method introduces the Swin-Transformer attention module, which combines the local and global attention of the image through a shifted-window mechanism, effectively enhancing the segmentation performance of the segmentation network, improving the model's recognition accuracy on complex cell clusters, and raising the overall detection precision.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention.
Fig. 2 is a structural diagram of an improved YOLOv7 model according to an embodiment of the present invention.
FIG. 3 is a diagram of a dynamic head module according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of the Swin-Unet model according to an embodiment of the present invention.
Figure 5 is a Swin-Transformer block diagram according to one embodiment of the present invention.
FIG. 6 is a diagram of an overall algorithm implementation process according to an embodiment of the present invention.
Detailed Description
The invention relates to an abnormal cell detection method based on improved YOLOv7 and Swin-Unet, which comprises the following steps: collecting cell smear images from cytopathology examination and making an abnormal cell detection data set and an abnormal cell segmentation data set; building an improved YOLOv7 model, whose robustness is enhanced by adding a dynamic attention head, to detect independent abnormal cells and overlapped cell clusters; performing abnormal cell detection with the best-trained model and screening the detection results; and inputting the overlapped cell clusters from the detection results into a Swin-Unet model for segmentation detection. The invention fully considers context information during cell detection, effectively processes hard-to-detect cell clusters, and effectively improves accuracy and recall while maintaining detection speed.
The present invention will be described in further detail with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention. As shown in fig. 1, the method of this embodiment includes the following steps:
step 1, collecting a cell smear image in pathological cytology examination, and making an abnormal cell detection data set and an abnormal cell segmentation data set;
1-1, collecting cell smear images in pathological cytology examination, including cervical cell images and mammary gland cell images, performing sliding window cutting on original cell smear images, wherein the cutting size is 640 multiplied by 640, the overlapping range of the sliding window is 50%, obtaining cell images of small areas, performing rectangular frame labeling on independent abnormal cells and overlapping cell groups by using a LabelImg tool, storing labels as XML files, and making an abnormal cell detection data set for training an improved YOLOv7 model;
1-2, screening cell images with labels of overlapped cell groups in the abnormal cell detection data set, subdividing a segmentation area by utilizing a polygon labeling function of a LabelImg tool, labeling the area, labeling abnormal cells, storing the labels, and preparing an abnormal cell segmentation data set for training a Swin-Unet model;
And step 2, constructing an improved YOLOv7 model and training it for detecting abnormal cells and overlapped cell clusters. The invention builds on the latest YOLOv7 model and improves the extraction of multi-dimensional feature information; the improved YOLOv7 model structure is shown in FIG. 2, and its overall structure comprises a data preprocessing module (Process), a backbone network (Backbone), a neck network (Neck), and a detection head network (Head). Each network is constructed as follows:
2-1, constructing an abnormal cell detection data preprocessing module, comprising: performing flipping and translation data enhancement on the cell image; and carrying out noise reduction on the cell image with Gaussian filtering, wherein the Gaussian kernel function is:
G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²)) (1)
wherein G(x, y) gives the weighting used to compute the pixel values of the denoised cell image, x and y are the coordinates of the pixel point, and σ is the standard deviation of the Gaussian, which determines the degree of smoothing of the cell image;
2-2, constructing the backbone network of the improved YOLOv7 model: the feature map of the input cell image is first convolved by 4 CBS modules, where a CBS module consists of a Conv layer, a BN layer, and a SiLU layer; stacked ELAN and MP modules then output three feature maps. The ELAN module contains several CBS modules; its input and output feature sizes stay unchanged, the number of channels changes in the first two CBS modules, the subsequent input channels all stay consistent with the output channels, and the last CBS module produces the required number of output channels. The MP module splices the output vectors of a Maxpool branch and a CBS branch;
2-3, building the neck network of the improved YOLOv7 model: fusing the three feature maps output by the backbone network with a PAFPN (Path Aggregation Network with Feature Pyramid Network) structure;
2-4, building the head network of the improved YOLOv7 model: a dynamic head (Dyhead) module is introduced to perform attention fusion on the feature maps. The dynamic head module comprises scale-aware attention, spatial-aware attention, and task-aware attention, fitted together by stacking attention functions. The formula for applying self-attention is:
W(F) = π_C(π_S(π_L(F) · F) · F) · F (2)
wherein F ∈ R^(L×S×C) is the input feature tensor, L is the number of feature levels (scales), S = H × W is the reshaping of the height (H) and width (W) dimensions of the feature map, and C is the number of channels; π_L(·), π_S(·), and π_C(·) are the attention functions applied independently on the scale, spatial, and task dimensions, corresponding to formulas (3), (4), and (5):
π_L(F) · F = σ(f((1 / (S·C)) Σ_{S,C} F)) · F (3)
π_S(F) · F = (1 / L) Σ_{l=1}^{L} Σ_{k=1}^{K} w_{l,k} · F(l; p_k + Δp_k; c) · Δm_k (4)
π_C(F) · F = max(α¹(F) · F_c + β¹(F), α²(F) · F_c + β²(F)) (5)
In formula (3), f(·) is a linear function approximated by a 1 × 1 convolution and σ(·) is the Hard-Sigmoid activation function.
In formula (4), K is the number of sparse sampling locations, w_{l,k} is the weighting factor for level l and location k, p_k + Δp_k is a spatially offset position obtained by self-learning, and Δm_k is a self-learned scalar at position p_k.
In formula (5), [α¹, α², β¹, β²]^T = θ(·) is a hyper-function that learns to control the activation threshold, and F_c is the feature slice at the c-th channel;
2-5, training the improved YOLOv7 model: the abnormal cell detection data set is input, images are cut to a size of 640 × 640, the batch size is set to 16, and 180 epochs are trained to obtain the best-performing improved YOLOv7 model; overlapped cell cluster samples in the detection results are extracted, and different IoU (intersection-over-union, here the overlap between the predicted box and the Ground-Truth box in the target detection task) thresholds are set. The training results are shown in Table 1.
TABLE 1
Here @x denotes the result with the IoU threshold set to x; mAP is the average of the AP (Average Precision) computed for each category, with higher values indicating better detection; Recall, also called the recall rate, is higher when fewer labeled cells are missed.
Step 3, building a detection result screening module, classifying cells in the detection network output image, outputting the abnormal cell image, and inputting the overlapped cell cluster image as a segmentation model;
and 4, building a Swin-Unet model and training the Swin-Unet model for segmenting the overlapped cell mass images. The invention is based on the most commonly used Unet model in the medical field, aiming at the local relation and the global relation of the cell image under the multi-scale, a Swin-Transformer module is introduced for sampling, the structure of the Swin-Unet model is shown as figure 4, the overall structure of the Swin-Unet model comprises an Encoder (Encoder), a Neck network (Neck) and a Decoder (Decoder), and the construction of each network is as follows:
4-1, constructing the encoder part of the Swin-Unet model: the minimum unit of the cell image is first converted from a pixel into a 4 × 4 patch by the Patch Partition layer, and a linear embedding module maintains the structure of the high-dimensional space; a down-sampling process follows, whose initial part still uses the convolution layers of Unet, while the two subsequent down-sampling stages are replaced by paired Swin-Transformer blocks, whose structure is shown in FIG. 5 and whose basic formula is given in formula (6):
Attention(Q, K, V) = SoftMax(Q·K^T / √d + B)·V (6)
wherein Q, K, and V are the query, key, and value matrices in self-attention respectively, d is the dimensionality, B is the learnable relative position bias, and Attention(Q, K, V) is the attention function within each patch window;
4-2, building a neck layer of a Swin-Unet model: filtering the high-dimensional characteristic information after down-sampling by using a group of paired Swin-transformers;
4-3, constructing the decoder part of the Swin-Unet model: a network structure mirroring the encoder network. It first applies paired Swin-Transformer + patch expanding twice and applies plain up-sampling + 2× convolution to the last layer of features; in each up-sampling module the feature map is spliced with that of the corresponding encoder stage to form a residual block, the output features undergo one linear projection, and they are sent into a convolution network for classification to obtain the final output result.
4-4, training the Swin-Unet model: the overlapped cell clusters are input as the segmentation data set, with the batch size set to 64; 80 epochs are trained and different IoU (intersection-over-union, here the overlap between the predicted Mask and the Ground-Truth Mask in the instance segmentation task) thresholds are set. The training results are shown in Table 2.
TABLE 2
The indices are the same as in Table 1.
Step 5, detecting abnormal cells using the improved YOLOv7 and Swin-Unet models; the algorithm flow is shown in FIG. 6, and the detailed process is as follows:
5-1, obtaining a cell smear image in the pathological cytology examination, performing sliding window cutting on the cell smear image, and inputting the cut cell image into a detection network in sequence;
5-2, extracting the features of the cell image by using the improved backbone network of the YOLOv7 network, sending feature maps of different scales into a neck network, fusing the features, sending the feature maps into a detection head network, and outputting a detection result;
5-3, screening the detection result by using a detection result screening module, judging that abnormal cells are directly used as output, and sending the abnormal cells into a segmentation network if the abnormal cells are judged to be overlapped cell clusters;
and 5-4, the encoder of the Swin-Unet network down-samples the overlapped cell image, the neck network filters it, and it enters the decoder network for up-sampling with the residual structure; the output feature map enters a convolution layer to classify the image segmentation regions, the regions judged to be abnormal cells are output, and together with the abnormal cells from step 3 they form the final output result.
Claims (6)
1. An abnormal cell detection method based on improved YOLOv7 and Swin-Unet, which is characterized by comprising the following steps:
step 1, collecting a cell smear image in pathological cytology examination, and making an abnormal cell detection data set and an abnormal cell segmentation data set;
step 2, constructing an improved YOLOv7 model and training it for detecting abnormal cells and overlapped cell clusters; on the basis of the latest YOLOv7 model, the extraction of multi-dimensional feature information is improved to obtain the improved YOLOv7 model structure, which comprises an abnormal cell detection data preprocessing module Process, a backbone network Backbone, a neck network Neck and a detection head network Head;
step 3, building a detection result screening module, classifying the cells in the detection network output images, outputting the abnormal cell images, and feeding the overlapped cell cluster images as input to the segmentation model;
step 4, constructing and training a Swin-Unet model for segmenting the overlapped cell cluster images: based on the Unet model most commonly used in the medical field, a Swin-Transformer module is introduced for sampling, so as to capture both the local and global relations of the cell image at multiple scales; the Swin-Unet model structure comprises an Encoder, a neck network Neck and a Decoder;
step 5, carrying out abnormal cell detection by using improved YOLOv7 and Swin-Unet model;
the step 1 process is as follows:
1-1, collecting cell smear images in pathological cytology examination, including cervical cell images and mammary gland cell images; performing sliding-window cutting on the original cell smear images, with a cutting size of 640 × 640 and a sliding-window overlap of 50%, to obtain small-area cell images; labeling independent abnormal cells and overlapped cell clusters with rectangular boxes using the LabelImg tool; saving the labels as XML files; and making the abnormal cell detection data set for training the improved YOLOv7 model;
1-2, screening out the cell images labeled as overlapped cell clusters in the abnormal cell detection data set, subdividing the segmentation regions using the polygon labeling function of the LabelImg tool, labeling the regions and the abnormal cells, saving the labels, and making the abnormal cell segmentation data set for training the Swin-Unet model.
2. The method for detecting abnormal cells based on improved YOLOv7 and Swin-Unet as claimed in claim 1, wherein in step 2, the construction of the improved YOLOv7 model comprises:
2-1, constructing an abnormal cell detection data preprocessing module, which comprises: performing flip and translation data enhancement on the cell images; and performing noise reduction on the cell images with Gaussian filtering, wherein the Gaussian kernel function is:

G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))   (1)

wherein G(x, y) is the pixel value of the denoised cell image, x and y denote the coordinates of a pixel point, and σ is the standard deviation of the Gaussian, which determines the degree of smoothing of the cell image;
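As a sketch of the Gaussian kernel used for noise reduction in step 2-1, the following builds a discrete kernel from the formula above and normalizes it so the weights sum to 1 (the normalization and the 5 × 5 default size are standard smoothing conventions assumed here, not stated in the patent).

```python
import numpy as np

def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Discrete Gaussian kernel from
    G(x, y) = 1/(2*pi*sigma^2) * exp(-(x^2 + y^2) / (2*sigma^2)),
    normalized so the weights sum to 1."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return g / g.sum()
```

Convolving the cell image with this kernel smooths pixel noise; a larger σ spreads the weights and smooths more strongly.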
2-2, constructing the backbone network of the improved YOLOv7 model: the input cell image is first convolved by 4 CBS modules, where each CBS module consists of a Conv layer, a BN layer and a SiLU activation; then ELAN and MP modules are stacked to output three feature maps; the ELAN module comprises several CBS modules, its input and output feature sizes remain unchanged, the number of channels is changed in the first two CBS modules, the input channels of the subsequent modules are kept consistent with their output channels, and the last CBS module produces the required number of output channels; the MP module concatenates the output vectors of a Maxpool branch and a CBS branch;
2-3, constructing the neck network of the improved YOLOv7 model: the three feature maps output by the backbone network are fused using a PAFPN structure;
2-4, constructing the head network of the improved YOLOv7 model: a dynamic head (DyHead) module is introduced to perform attention fusion on the feature maps; the dynamic head module stacks three attention functions: scale-aware attention, spatial-aware attention and task-aware attention; the formula for applying self-attention is:
W(F) = π_C(π_S(π_L(F) · F) · F) · F   (2)
wherein F ∈ R^{L×S×C} is the input feature tensor, L denotes the number of feature scales, S = H × W is the reshaping of the height H and width W dimensions of the feature map, and C denotes the number of channels of the feature map; π_L(·), π_S(·) and π_C(·) are the attention functions applied independently on the scale, space and task dimensions respectively, corresponding to formulas (3), (4) and (5):
π_L(F) · F = σ(f((1/(S·C)) Σ_{S,C} F)) · F   (3)

π_S(F) · F = (1/L) Σ_{l=1}^{L} Σ_{k=1}^{K} ω_{l,k} · F(l; p_k + Δp_k; c) · Δm_k   (4)

π_C(F) · F = max(α¹(F) · F_C + β¹(F), α²(F) · F_C + β²(F))   (5)
wherein in formula (3), f(·) is a linear function approximated by a 1 × 1 convolution, and σ(·) is the Hard-Sigmoid activation function;
in formula (4), K is the number of sparse sampling locations, ω_{l,k} is the weighting factor for level l and location k, p_k + Δp_k is the sampling position shifted by the self-learned spatial offset Δp_k, and Δm_k is a self-learned scalar at position p_k;
in formula (5), [α¹, α², β¹, β²]^T = θ(·) is a hyper-function that learns to control the activation thresholds, and F_C denotes the feature slice at the C-th channel.
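The task-aware attention of formula (5) can be sketched in a few lines of numpy; treating θ as fixed per-channel vectors (α¹, α², β¹, β²) is an assumption for the sketch only, since in the dynamic head these parameters are produced by a small hyper-network over F.

```python
import numpy as np

def task_aware_attention(F: np.ndarray, theta) -> np.ndarray:
    """Formula (5): pi_C(F)·F = max(a1·F_C + b1, a2·F_C + b2),
    a learnable piecewise-linear activation applied per channel.
    F: (L, S, C) feature tensor; theta = (a1, a2, b1, b2), here
    fixed per-channel vectors of length C (an assumption for this
    sketch; DyHead derives them from F via a hyper-function)."""
    a1, a2, b1, b2 = theta
    return np.maximum(a1 * F + b1, a2 * F + b2)  # broadcast over (L, S)
```

With a1 = 1, b1 = 0 and a2 = b2 = 0 this reduces to a ReLU per channel, showing how the hyper-function can interpolate between activation shapes.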
3. The method for detecting abnormal cells based on improved YOLOv7 and Swin-Unet as claimed in claim 1, wherein in step 2, the process of training the improved YOLOv7 model is:
inputting the 640 × 640 cell images of the abnormal cell detection data set, setting the batch-size to 16, and training 180 epochs to obtain the best-performing improved YOLOv7 model; extracting the overlapped cell cluster images from the detection results and setting different IoU (Intersection over Union) thresholds; wherein @x denotes the performance when the IoU threshold is set to x; mAP is the mean of the AP computed for each category, a higher value indicating a better detection effect; and Recall is the recall ratio, a higher value indicating fewer missed labeled cells.
4. The improved YOLOv7 and Swin-Unet based abnormal cell detection method as claimed in claim 1, wherein the process of constructing the Swin-Unet model in step 4 comprises:
4-1, constructing the encoder part of the Swin-Unet model: the minimum unit of the cell image is first converted from a pixel into a 4 × 4 Patch by the Patch Partition layer, and a linear embedding module maintains the structure of the high-dimensional space; then the down-sampling process is carried out: the initial stage still uses the convolution layers of Unet, and the following two down-sampling stages are replaced by paired Swin-Transformer blocks, whose basic formula is:

Attention(Q, K, V) = SoftMax(Q·K^T / √d + B) · V
wherein Q, K and V denote the query, key and value matrices in self-attention respectively, d denotes the dimensionality, B is the learnable relative position bias, and Attention(Q, K, V) denotes the attention function within each patch window;
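The windowed attention just defined can be sketched directly in numpy; passing B in as a fixed array (rather than a learnable parameter indexed by relative positions) is an assumption made to keep the sketch self-contained.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray,
                     B: np.ndarray) -> np.ndarray:
    """Attention(Q, K, V) = SoftMax(Q·K^T / sqrt(d) + B)·V computed
    inside one window; B is the relative position bias, here passed
    in as a fixed (n, n) array for the sketch."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + B
    return softmax(scores, axis=-1) @ V
```

With all-zero Q, K and B the attention weights become uniform and each output row is simply the mean of the value rows, which is a quick sanity check on the implementation.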
4-2, constructing the neck layer of the Swin-Unet model: the down-sampled high-dimensional feature information is filtered using a group of paired Swin-Transformer blocks;
4-3, constructing the decoder part of the Swin-Unet model: the decoder mirrors the encoder network; it first applies a paired Swin-Transformer block followed by a Patch Expanding (up-sampling) layer, twice; the last-layer features are up-sampled and passed through two convolutions; each up-sampling module concatenates the feature map of the corresponding encoder stage to form a residual block; the output features then undergo one linear projection and are sent into a convolutional network for classification to obtain the final output result.
5. The improved YOLOv7 and Swin-Unet based abnormal cell detection method as claimed in claim 1, wherein the Swin-Unet model training process in step 4 is:
with the overlapped cell cluster images as input, setting the batch-size to 64, training 80 epochs, and setting different IoU (Intersection over Union, the overlap ratio between the predicted Mask and the Ground Truth Mask in the segmentation task) thresholds.
6. The method for detecting abnormal cells based on improved YOLOv7 and Swin-Unet as claimed in claim 1, wherein in step 5, the process of abnormal cell detection using the improved YOLOv7 and Swin-Unet models comprises:
5-1, obtaining a cell smear image in the pathological cytology examination, performing sliding window cutting on the cell smear image, and inputting the cut cell image into a detection network in sequence;
5-2, extracting the features of the cell image by using a backbone network of the improved YOLOv7 network, sending feature maps with different scales into a neck network, sending the feature maps into a detection head network after feature fusion, and outputting a detection result;
5-3, screening the detection results with the detection result screening module: cells judged to be abnormal are output directly, while regions judged to be overlapped cell clusters are sent into the segmentation network;
5-4, the encoder of the Swin-Unet network down-samples the overlapped cell images, the neck network filters them, and the decoder network up-samples them using the residual structure; the output feature map enters a convolution layer to classify the segmented image regions, the regions judged to be abnormal cells are output, and these, together with the abnormal cells output in step 5-3, form the final output result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211726362.XA CN115965602A (en) | 2022-12-29 | 2022-12-29 | Abnormal cell detection method based on improved YOLOv7 and Swin-Unet |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211726362.XA CN115965602A (en) | 2022-12-29 | 2022-12-29 | Abnormal cell detection method based on improved YOLOv7 and Swin-Unet |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115965602A true CN115965602A (en) | 2023-04-14 |
Family
ID=87363239
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211726362.XA Pending CN115965602A (en) | 2022-12-29 | 2022-12-29 | Abnormal cell detection method based on improved YOLOv7 and Swin-Unet |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115965602A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116452574A (en) * | 2023-04-28 | 2023-07-18 | 合肥工业大学 | Gap detection method, system and storage medium based on improved YOLOv7 |
CN116630294A (en) * | 2023-06-08 | 2023-08-22 | 南方医科大学南方医院 | Whole blood sample detection method and device based on deep learning and storage medium |
CN116630294B (en) * | 2023-06-08 | 2023-12-05 | 南方医科大学南方医院 | Whole blood sample detection method and device based on deep learning and storage medium |
CN116844161A (en) * | 2023-09-04 | 2023-10-03 | 深圳市大数据研究院 | Cell detection classification method and system based on grouping prompt learning |
CN116844161B (en) * | 2023-09-04 | 2024-03-05 | 深圳市大数据研究院 | Cell detection classification method and system based on grouping prompt learning |
CN117314898A (en) * | 2023-11-28 | 2023-12-29 | 中南大学 | Multistage train rail edge part detection method |
CN117314898B (en) * | 2023-11-28 | 2024-03-01 | 中南大学 | Multistage train rail edge part detection method |
CN117935236A (en) * | 2024-01-23 | 2024-04-26 | 山东大学 | Dark and weak celestial body searching method based on convolutional neural network |
CN117935236B (en) * | 2024-01-23 | 2024-07-30 | 山东大学 | Dark and weak celestial body searching method based on convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||