CN113469942B

CN113469942B - CT image lesion detection method

Info

Publication number: CN113469942B
Application number: CN202110608053.1A
Authority: CN
Inventors: 侯永宏; 刘传玉; 李岳阳; 王拓; 苏晓雨; 郭子慧
Original assignee: Tianjin University
Current assignee: Sichuan Xuanguang Liying Medical Technology Co Ltd
Priority date: 2021-06-01
Filing date: 2021-06-01
Publication date: 2022-02-22
Anticipated expiration: 2041-06-01
Also published as: CN113469942A

Abstract

The invention provides a CT image lesion detection method, which is used for preprocessing and enhancing data of a medical CT image; the method comprises the steps of constructing a network model for detecting lesions of multiple continuous slices, wherein the model is based on a Mask-RCNN network and comprises a Convolutional Neural Network (CNN), an RPN network and an RCNN network, detecting the lesions in two stages, constructing a multi-scale local axial self-attention MSLASA module in the Convolutional Neural Network (CNN), and realizing information combination of pixel points in different local areas by using the MSLASA module, so that each pixel point in a characteristic diagram can obtain a self weight value according to a proper local space. Through an attention mechanism, information related to pathological changes is enhanced and highlighted to a great extent, and non-pathological features are subjected to desalination and filtration, so that the task of network discrimination at the next stage is relieved, and the number of false positive samples is reduced to a great extent.

Description

CT image lesion detection method

Technical Field

The invention belongs to the fields of computer vision, deep learning and the like, relates to a target detection technology, and particularly relates to a CT image lesion detection method.

Background

Lesion detection in Computed Tomography (CT) images is an important basis and guarantee for doctors in clinical practice, in hospitals, doctors analyze lesions in given CT images based on their own medical knowledge and daily empirical interpretation, and the accuracy of film reading is limited by the experience and knowledge levels of different doctors, thus causing differences. Moreover, for reading a large number of CT pictures, the doctor can influence the mental state to cause misjudgment and the like after working for a long time. Therefore, for the work of taking a lot of time and energy of the radiologist, the present invention hopes to reduce the burden of the radiologist by using a computer-aided system (CAD) to provide objective and real-time reference information to assist the physician in judging the lesion, and the present invention believes that the present invention is a trend of modern technology development, and the CAD system will be used as the best tool of the physician to provide quasi-safe, convenient and fast diagnosis and treatment for the patient in the future.

The advent of Convolutional Neural Networks (CNN) has made it possible to obtain better picture feature representations for natural pictures under their processing, and thus, there is a great breakthrough in the results of detection and segmentation tasks. The medical picture and the natural picture have certain field difference, the main embodiment is that the natural picture is mostly a scene picture in daily life of people, the picture characteristic is obvious, the prospect to be detected generally has higher discrimination, and at least the naked eye can judge rapidly. However, the lesion to be detected and the surrounding area in the medical CT image have high similarity, and even most non-professional persons cannot locate the lesion position from the similarity. The convolutional neural network is used for extracting picture characteristics of the medical image and then carrying out further lesion location and classification, and the powerful CNN is also greatly improved on the medical image and can detect most lesion positions. Most studies use CNNs to detect lesions in specific areas, such as the abdomen, chest or pelvis. However, the present invention expects that the CNN-based CAD system can process CT images from any part, that is, for CT medical images from all parts of the human body, the CNN should well extract image features for further classification and detection processing, which brings great challenges to the CNN designed for natural images.

Although the method of deep learning is able to detect the vast majority of lesions. However, due to the high similarity between the lesion in the CT image and its surrounding background, many non-lesion features passing through the CNN can be misjudged as lesions, resulting in many false positive samples.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a method for detecting lesions in a CT image, which can obtain the pixel weight value of the image characteristic through local attention and enhance the discrimination of the CT image characteristic, thereby reducing the number of false positive samples obtained by a network.

The innovation points of the invention are as follows:

the structure innovation in the convolutional neural network model is that the shallow feature map is used for obtaining the weight values of feature spaces under different levels, and the weight values are applied to the feature map after the shallow feature map and the deep feature map are combined.

The MSLASA module is utilized to realize information combination of the pixel points in different local areas, so that each pixel point in the characteristic diagram can acquire a weight value of the pixel point according to a proper local space.

In the self-attention module, the square area of the local space is converted into high-axis and wide-axis pixel points, so that the calculation complexity is O (N)²) Reduced to o (2N).

And adding position information to the query value and the key value in the self-attention module, so that the spatial information is better reflected.

The technical scheme for realizing the purpose of the invention is as follows:

a CT image lesion detection method comprises the following steps:

(1) the medical CT image is preprocessed and data enhanced, and the window width of each CT slice is converted to be (-1024) -3071HU) so as to cover the characteristic representation of a plurality of positions. The size of each CT slice is basically (512 ), and the data enhancement is realized by cutting, scaling and rotating the picture.

(2) And constructing a network model for detecting the lesion of the plurality of continuous slices, wherein the model is based on a Mask-RCNN network and comprises a Convolutional Neural Network (CNN), an RPN network and an RCNN network. The model has accurate effect by detecting the lesion in two stages.

(3) And constructing a multi-scale local axial self-attention MSLASA module in the convolutional neural network CNN, and enhancing the identification degree between pathological changes and non-pathological changes in the CT image. The CNN adopts a characteristic pyramid form, obtains shallow features and deep features of the CT picture through Bottom-up and Top-down Top-down network modes, and establishes fusion of the deep features and the shallow features through jump connection. In addition, with the shallow features as input, feature weights of the CT image are acquired in the local axial self-attention module mslas at multiple scales and mapped into the output features.

(4) The RPN is responsible for regression and classification of the detection frames in the first stage in the model, a fixed number of anchors are generated at each point in the feature map according to the size of the feature map and the size and proportion of a preset rectangular frame, and the RPN predicts the bias and category regression of each anchor to obtain the regression frame in the first stage.

(5) And selecting a part of the score according to the score of the score in the anchors obtained in the first stage, and sending the part of the score to the RCNN for regression and classification in the second stage. And obtaining the optimal model parameters.

(6) The network model respectively calculates the loss once in two stages, the loss of each stage is regression loss and classification loss, the final loss function is the sum of the loss of the two stages, and the loss function is as follows:

class loss:

wherein L is_clsRepresents the class loss, n represents the total number of samples, x represents the samples, y_iRepresenting the actual label and a the predicted output.

Regression loss:

wherein L is_regRepresents regression loss, n represents total number of samples, x represents samples, y represents actual coordinates, and f (x) represents predicted coordinate values.

Total loss:

L＝L_RPN,cls+L_RPN,reg+L_RCNN,cls+L_RCNN,reg

where L is the final loss of the network model. L is_RPN,clsAnd L_RPN,regRespectively refer to classification and regression in the first stage RPN networkloss，L_RCNN,clsAnd L_RCNN,regRespectively, classification and regression loss in the second stage RCNN network.

(7) Under the two-stage processing, the model outputs the category and the coordinates of the regression frame which is finally predicted, outputs the accuracy of the experiment under different FPs, and outputs the CT medical picture of the detection regression frame which is additionally predicted.

The invention has the advantages and beneficial effects that:

(1) the lesion detection in the CT image based on the attention mechanism is applied to processing the CNN in research, so that the features of the obtained CT image can have higher discrimination, the information related to the lesion is enhanced and highlighted to a great extent through the attention mechanism, and the non-lesion features are subjected to desalination and filtration, so that the task of network discrimination in the next stage is reduced, and the number of false positive samples is reduced to a great extent.

(2) The present invention proposes to adaptively combine multiple local regions with different proportions for each pixel using local self-attention modules at multiple scales to find the best fusion information. The module utilizes the characteristic information under different scales, brings obvious progress in the lesion detection of the CT image, and can be applied to more fields.

(3) The present invention centers on each pixel and assigns a certain number of squares to them to provide local self-attention. In view of computational complexity and memory limitations, the present invention selects 1D axial self-attention instead of 2D axial self-attention.

(4) The present invention develops a position sensitive module to enrich the spatial information of each pixel. Then, similar to the radiologist's diagnostic procedure, the mslas a module of the present invention can help the CNN to exclude many redundant pixels and capture a better characterization. The MSLASA module introduces few parameters and has low calculation overhead, thereby greatly improving the detection precision. Numerous experiments performed on the deep version basis demonstrated the effectiveness of the mslas module. More importantly, the invention provides a new method for implementing local self-attention to a plurality of local spaces.

Drawings

FIG. 1 is a network overall structure diagram of a CT image lesion detection method according to an embodiment of the present invention;

FIG. 2 is a model diagram of a multi-scale local axial self-attention module;

FIG. 3a is a graph of the results of a MULAN method on a picture of a human abdomen;

FIG. 3b is a graph of the results of the MULAN method on a picture of the liver of a human;

FIG. 3c is a graph of the result of the MULAN method on a human pelvis image;

FIG. 4a is a graph showing the results of a test conducted on a picture of the abdomen of a human body according to the method of the present invention;

FIG. 4b is a graph showing the results of the test of the present invention on a picture of the liver;

FIG. 4c is a graph showing the results of the test of the method of the present invention on a human pelvis image.

Detailed Description

The present invention is further illustrated by the following specific examples, which are intended to be illustrative, not limiting and are not intended to limit the scope of the invention.

The invention provides a method for detecting lesions in a CT image, wherein a network used is trained end to end.

As shown in FIG. 1, embodiments of the overall structure of the present invention are described in the following 1 to 5.

1) Since the CT image is a single-channel slice unlike the natural image, the present invention combines 3 single-channel slices as a 3-channel picture, so that the pretrained model of ImageNet can be used in the CNN network, and the data set has 32735 labeled lesion instances, 70% of which is used as the training set, 15% of which is used as the validation set, and the remaining 15% of which is used as the test set.

2) And constructing a feature extraction network model, wherein the network takes DenseNet-121 as a dense residual block as a basic component element and uses a DenseNet-121 pre-training model. Considering that the shallow feature has a larger size to display more picture details and the deep feature has more high-level semantic information to display more picture details as the convolutional neural network deepens, both of which have important roles in judging the pathological changes in the CT picture, the invention adopts a network structure of a Feature Pyramid (FPN) and uses jump connection to fuse the shallow feature and the deep semantic information to obtain a comprehensive feature map. In addition, the invention designs a self-attention Module (MSLASA), obtains weight information on a feature map space through local attention of shallow features, and maps the weight information on the feature map after the fusion of the shallow features and the deep features. The self-attention module will be described with emphasis in fig. 2.

3) And processing the picture characteristics acquired by the network as the input of the RPN network. First, for each point on the feature map, a fixed number of anchors are generated according to the proportion and area of the preset rectangular box. Meanwhile, the feature map is sent to an RPN network to obtain the corresponding category and regression bias of each rectangular frame, so that the invention can obtain the corresponding category and correction coordinate according to the basic coordinate and bias of each rectangular frame, and then 2000 proposals with the maximum probability can be selected by using non-maximum suppression (NMS) to be used as the detection of the first stage.

4) Rpn, finding a region with corresponding size and position on the feature map as a feature of the propsall, sending the region to ROI Align to obtain a feature with a fixed size, similarly obtaining a category score and a regression bias of the propsall through activation functions such as convolution and relu of the RCNN network, and correcting coordinates of the propsals again to obtain a second-stage final coordinate output.

5) It should be noted that not all anchors and propofol participate in the calculation of loss during the two-stage loss calculation. Under the assistance of the labeling information, the anchors obtained by the RPN can obtain the label of each anchor by calculating the cross-over ratio of the anchor to the label (the cross-over ratio is more than 0.5 and is set as 1, otherwise, the cross-over ratio is set as 0), and 32 samples are selected from the samples with the labels of 0 and 1 according to the ratio of 1:1 to calculate the loss. The second stage of RCNN also performs the above processing on propofol in calculating loss, except that the total number of positive and negative samples selected is 64. In addition, both positive and negative samples will participate in the calculation of the classification loss, but only the positive sample will participate in the calculation of the regression loss.

As shown in FIG. 2, the present invention introduces a multi-scale local axial self attention (MSLASA) module implementation. It should be noted that the feature map is used as an input of the mslas a module, and the mslas a module obtains a pixel weight value of each point in the feature map according to local information of a plurality of scales around each pixel point by taking the pixel point as a center.

1) And (3) carrying out 1x1 convolution on the input feature diagram F belonging to BxCxHxW (B: batch size, C: channel, H: height, W: width) to obtain the query value of each point with the feature diagram size unchanged and the channel number C', and dividing the feature diagram into 4 parts according to the channel number, wherein the obtained query value corresponds to (3,5,7,9)4 different scales. And simultaneously, dividing the feature map into 4 equal parts in the channel dimension, corresponding to 4 different scales, performing edge filling on each feature map according to the different scales, wherein the filling is 0, and then convolving each feature map to obtain a key value corresponding to a pixel point under each scale.

2) Because the local information key value corresponding to the query value of each pixel point is a plurality of squares taking the local information key value as the center, the calculation complexity and the subsequent calculation amount are higher, when the key value range is selected, for each pixel point, the invention only uses the high axis and the wide axis which pass through the pixel point and are vertical to the side length of the square, and the complexity is O (N)²) Reduced to O (2N).

3) And respectively adding position information to the query and the key in consideration of the influence of the spatial position information on the query and the key. For query, the coordinates of the upper left corner and the lower right corner of the whole feature map are (-1, -1) and (1,1), position coordinates of each point of the whole feature map are obtained, the position embedding of the feature map is set to be 1x2xHxW (2 represents horizontal and vertical coordinates), the position embedding is carried out with 1x1 convolution, the dimension size and the channel number are not changed, and the query value of each pixel point of the feature map is added with the value obtained after the convolution of the horizontal and vertical coordinates of the corresponding position to serve as a final query value. For the position information of the key value, only the high axis and the wide axis of 3,5,7 and 9 are adopted by the invention. Therefore, for each scale (N is 3,5,7,9), the present invention sets one end to be-1, the other end to be 1, and the position embedding to be 1x1xN, and performs 1-dimensional convolution on the position information to obtain position information, which is added to the key values on the horizontal axis and the vertical axis as the final key value.

4) In each scale, the query value of each point in the feature map and the corresponding key value are subjected to point multiplication, then the result is divided by the number of pixels of the high axis and the wide axis to be taken as a mean value, and the feature maps in all scales are cascaded on a channel to obtain a new feature map G.

5) Through convolution of 1x1, the size of the feature graph G is not changed, the number of channels is reduced to 1, and then relu processing is carried out, so that the final weight information feature graph is obtained.

6) The concrete implementation formula is as follows:

wherein x_oFinal output weight, f, representing o point_ConvRepresents a convolution of 1x1 + relu function, phi represents a channel cascade, p represents a local region at different scales,

representing the calculated quantity at different scales of o points, q_oRepresents the query value of the o point,

and

representing the location information of the horizontal and vertical coordinates, k, corresponding to the query_pRepresenting the key values of the pixel points at different scales,

representing the position information corresponding to the key.

The following is a description of the experimental results of the present invention on the deepversion dataset and the real images:

the invention trains and predicts under a Pythrch deep learning framework. Training and testing were performed on the DeepLesion dataset with sensitivity sensitivities with different FPs as the standard. The results of the experiments on the data set are shown in the table below

It can be seen that the method provided by the invention can obtain results superior to other methods in evaluation indexes. The present invention was tested in real images and visually compared to other mainstream algorithms, as shown in figures 3 a-3 c, and figures 4 a-4 c below.

FIGS. 3 a-3 c show the results of the MULAN method, and FIGS. 4 a-4 c show the results of the experiments according to the present invention. The solid line frame represents the real lesion label, and the dotted line frame represents the predicted lesion position and the predicted score, so that the method reduces a plurality of false positive samples while accurately predicting.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various changes and modifications can be made without departing from the inventive concept, and these changes and modifications are all within the scope of the present invention.

Claims

1. A CT image lesion detection method comprises the following steps:

preprocessing and data enhancing the medical CT image;

constructing a network model for detecting the lesion of a plurality of continuous slices, wherein the model is based on a Mask-RCNN network and comprises a Convolutional Neural Network (CNN), an RPN network and an RCNN network, and the model is divided into two stages for detecting the lesion;

constructing a multi-scale local axial self-attention MSLASA module in a Convolutional Neural Network (CNN), wherein the CNN adopts a characteristic pyramid form, obtains shallow features and deep features of a CT picture in a Bottom-up and Top-down (Top-down) network mode, establishes fusion of the deep features and the shallow features through jump connection, takes the shallow features as input, and obtains the characteristic weight of the CT image in the multi-scale local axial self-attention module MSLASA and maps the characteristic weight to output features;

the RPN is responsible for regression and classification of the detection frames in the first stage in the model, a fixed number of anchors are generated at each point in the feature map according to the size of the feature map and the size and proportion of a preset rectangular frame, and the RPN predicts the bias and category regression of each anchor to obtain the regression frame in the first stage; selecting a part of the score according to the score in the anchors obtained in the first stage, and sending the part of the score into an RCNN network for regression and classification in the second stage to obtain the optimal model parameters;

under the two-stage processing, the model outputs the category and the coordinates of the regression frame which is finally predicted, outputs the accuracy of the experiment under different FPs, and outputs the CT medical picture of the detection regression frame which is additionally predicted.

2. The detection method according to claim 1, characterized in that: the RPN and the RCNN respectively calculate the loss once in two detection stages, the loss in each stage is regression loss and classification loss, the final loss function is the sum of the losses in the two stages, and the loss function is as follows:

class loss:

wherein L is_clsRepresents the class loss, n represents the total number of samples, x represents the samples, y_iRepresents the actual label, a represents the predicted output;

regression loss:

wherein L is_regRepresents regression loss, n represents total number of samples, x represents samples, y represents actual coordinates, and f (x) represents predicted coordinate values;

total loss:

L＝L_RPN,cls+L_RPN,reg+L_RCNN,cls+L_RCNN,reg

where L is the final loss of the network model, L_RPN,clsAnd L_RPN,regRespectively refer to classification and regression loss, L in the first stage RPN network_RCNN,clsAnd L_RCNN,regRespectively, classification and regression loss in the second stage RCNN network.

3. The detection method according to claim 1, characterized in that: the characteristic graph is used as the input of the MSLASA module, the MSLASA module takes each pixel point as the center, and the pixel weight value of each point in the characteristic graph is obtained according to local information of the MSLASA module under a plurality of scales around the MSLASA module.

4. The detection method according to claim 3, characterized in that: the specific implementation steps of the multi-scale local axial self-attention module MSLASA are as follows:

equally dividing a feature graph input into the MSLASA module into a plurality of parts in a channel dimension, correspondingly calculating feature information under a plurality of different scales, and performing fusion calculation on each pixel point and local pixels under the peripheral fixed scale of the pixel point to obtain fusion information of the pixel under the scale;

firstly, carrying out 1x1 convolution on the characteristic graph to obtain a characteristic display query value of each pixel point, simultaneously carrying out 1x1 convolution on the characteristic graph to obtain characteristic key values of different positions, distributing a local area under a fixed scale with the query value as a center to each pixel point on the characteristic graph, carrying out dot multiplication on the query value of each pixel point and the key of the corresponding local area, averaging according to the number of the local pixel points, taking the output of each point as characteristic output under the local area, selecting the pixel points on a horizontal axis and a vertical axis which pass through the center of a square and are perpendicular to the side length of the square as the local area, and adding position embedding information to the pixel points;

each feature map is processed, feature outputs under various scales are obtained for each pixel point, 4 feature maps are cascaded on a channel to serve as one feature map, and features under various scales are fused through convolution of 1x1 and a Relu activation function to obtain weight outputs of each point.

5. The detection method according to claim 4, characterized in that: the implementation formula of the local axial self-attention module mslas at multiple scales is as follows:

wherein x_oFinal output weight, f, representing o point_ConvRepresents a convolution of 1x1 + Relu activation function, phi represents channel concatenation,_prepresenting local regions at different scales of a scale,

and

representing the position information corresponding to the key.