Otosclerosis lesion detection and diagnosis system based on a small-target detection neural network
Technical Field
The invention belongs to the technical field of medical image processing, and in particular relates to an otosclerosis lesion detection and diagnosis system based on a small-target detection neural network.
Background
Otosclerosis is a disease in which the dense lamellar bone of the bony labyrinth is locally replaced by new spongy bone rich in cells and blood vessels. According to the site and extent of the lesion, otosclerosis can be classified into stapedial, cochlear, and mixed types. Cochlear otosclerosis is an advanced form of the disease that is readily diagnosed from its typical clinical and CT manifestations, and treatment is limited to wearing a hearing aid. The earliest and most common site of otosclerosis, however, is anterior to the vestibular window, which leads to stapedial otosclerosis, a conductive hearing loss caused by fixation of the stapes footplate. In addition to diagnosis based on clinical symptoms and audiological examination, the diagnostic value of high-resolution CT (HRCT) is widely recognized: its positive rate reaches 74% to 95.1%, and it is considered the first choice for diagnosing otosclerosis.
On temporal bone CT, stapedial otosclerosis manifests as reduced bone density in the region of the fissula ante fenestram and thickening of the stapes footplate. By reading the CT, a doctor can give a preliminary diagnosis. However, human factors such as inexperience, fatigue, and negligence may directly affect the accuracy of the diagnosis. In addition, the stapes occupies only about 8 × 8 pixels in a CT image with a resolution of 512 × 512 pixels, which makes it difficult for the physician to locate and assess.
The deep convolutional neural network (DCNN) is a machine learning technique that can effectively avoid such human factors by automatically learning to extract rich, representative visual features from large amounts of labeled data. It uses the backpropagation algorithm to update the network's internal parameters and learn the mapping from input image to label. In recent years, DCNNs have greatly improved performance on computer vision tasks.
In 2012, Krizhevsky et al. [1] were the first to apply a deep convolutional neural network in the ImageNet [2] image classification competition, winning with a Top-5 error rate of 15.3% and setting off a surge of interest in deep learning. In 2015, Simonyan et al. [3] proposed the 16-layer and 19-layer neural networks VGG-16 and VGG-19, which increased the number of network parameters and further improved results on the ImageNet image classification task. In 2016, He et al. [4] used the 152-layer residual network ResNet to achieve classification performance exceeding that of the human eye.
DCNNs perform excellently not only in image classification but also in structured output tasks such as object detection [5-7] and semantic segmentation [8,9]. Applied to computer-aided diagnosis (CAD), a DCNN can assist doctors in making better diagnoses, enabling early detection and early treatment and improving treatment outcomes.
However, existing object detection networks generally cannot detect a target as small as the stapes. To address this problem, the invention provides a new noise-robust otosclerosis lesion detection and diagnosis system based on a small-target detection neural network, which fully exploits the characteristics of the training images, extracts rich features, and performs detection and diagnosis of the otosclerosis region simultaneously.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide an otosclerosis lesion detection and diagnosis system based on a small-target detection neural network, which eliminates the influence of human factors and diagnoses temporal bone CT images automatically.
The invention provides an otosclerosis lesion detection and diagnosis system based on a small-target detection neural network, which specifically comprises the following components:
(1) a feature extraction backbone network;
(2) an object detection and classification network;
(3) a noise-robust classification loss function;
(4) a post-processing diagnosis system for multi-slice detection results.
Wherein:
(1) the feature extraction backbone network is a multi-level deep convolutional neural network comprising 9 convolution modules: conv_x1, conv_x2, conv_x3, conv_y1, conv_y2, conv_y3, conv_y4, conv_z1, and conv_z2. Each convolution module consists of two consecutive convolutional layers, and together the modules form a W-shaped network structure. The numbers of output channels of x1, x2, and x3 are 64, 128, and 256 respectively, and their output feature resolution equals that of the input image; y1, y2, y3, and y4 each have 128 output channels, and their output features are down-sampled 4× relative to the input image; z1 and z2 each have 256 output channels, and their output features are down-sampled 8× relative to the input image. Skip connections are provided between x1 and x2, between x2 and x3, between y1 and y2, and between y3 and y4: the earlier feature is concatenated directly with the higher-level feature and then convolved to obtain the later feature. The input of the feature extraction backbone network is a 3D temporal bone CT image (fed in slice by slice), and the output is a feature map of the temporal bone CT image;
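The channel counts and down-sampling factors listed for the backbone can be turned into concrete feature-map sizes with a little arithmetic. The following Python sketch is illustrative bookkeeping only, not the network itself: the names `BACKBONE`, `feature_shape`, and `stapes_extent` are hypothetical, and the 512 × 512 input size and 8-pixel stapes extent are the figures quoted in this description.

```python
# Module name -> (output channels, down-sampling factor), as listed in the text.
BACKBONE = {
    "conv_x1": (64, 1), "conv_x2": (128, 1), "conv_x3": (256, 1),
    "conv_y1": (128, 4), "conv_y2": (128, 4),
    "conv_y3": (128, 4), "conv_y4": (128, 4),
    "conv_z1": (256, 8), "conv_z2": (256, 8),
}


def feature_shape(module, height=512, width=512):
    """Return (channels, height, width) of a module's output feature map."""
    channels, factor = BACKBONE[module]
    return channels, height // factor, width // factor


def stapes_extent(module, size_px=8):
    """Side length, in feature-map cells, of a target that spans size_px input pixels."""
    _, factor = BACKBONE[module]
    return size_px / factor


for name in BACKBONE:
    print(name, feature_shape(name), stapes_extent(name))
```

Under these assumptions an 8 × 8 pixel stapes covers 8 cells per side at the x level, 2 cells at the y level, and a single cell at the z level, which suggests why the full-resolution x branch matters for such a small target.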
(2) the object detection and classification network is the object detection network proposed by Ren et al. [7], with its backbone replaced by the feature extraction backbone network of (1). Besides the backbone, the detection network comprises a region extraction network, a region-of-interest pooling layer, a classification network, and so on. The region extraction network contains two parallel branches: one classifies each feature point as foreground or background through a softmax function, and the other computes bounding-box offsets through a 1 × 1 convolution; the outputs of the two branches are then combined to obtain the extracted feature regions. The classification network produces two output branches through a fully connected layer: the first outputs a position offset for each feature region, used to further refine the position of the detection box; the second computes the classification probabilities of the features through a softmax function to obtain the category of the region. The workflow of the object detection network is as follows:
first, the output feature map of the feature extraction backbone network is fed to the region extraction network to obtain the extracted feature regions;
then, the extracted feature regions pass through the region-of-interest pooling layer for adaptive pooling, which resizes every feature region to a uniform 7 × 7;
finally, the pooled features are fed to the classification network to obtain the category of each region;
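The adaptive pooling step in this workflow can be sketched in plain Python. This is a minimal illustration, assuming max pooling as the reduction (the text says only "adaptive pooling") and a single-channel region stored as a list of rows; the function name `adaptive_max_pool` is hypothetical.

```python
def adaptive_max_pool(region, out_size=7):
    """Pool an H x W region (list of rows) to out_size x out_size.

    Bin i covers rows floor(i*H/out_size) .. ceil((i+1)*H/out_size) - 1,
    so every bin is non-empty even when H or W is smaller than out_size.
    """
    height, width = len(region), len(region[0])
    pooled = []
    for i in range(out_size):
        r0 = (i * height) // out_size
        r1 = -(-(i + 1) * height // out_size)  # integer ceiling
        row = []
        for j in range(out_size):
            c0 = (j * width) // out_size
            c1 = -(-(j + 1) * width // out_size)
            row.append(max(region[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# A 14 x 14 region whose values encode their own position.
region = [[r * 14 + c for c in range(14)] for r in range(14)]
pooled = adaptive_max_pool(region)
print(len(pooled), len(pooled[0]))
```

Because the bin boundaries are derived from the region size, regions of any shape come out as 7 × 7, which is what lets a fully connected classification layer follow.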
(3) the noise-robust classification loss function combines the cross-entropy loss l_ce and the mean absolute error loss l_1, using the network output probability p as a dynamic adaptive weighting coefficient. The loss function is defined as:
where p is the probability that the network assigns to the class given by the label;
According to Wang et al. [10], the cross-entropy loss l_ce is not robust to noise but helps the network converge, whereas the mean absolute error loss l_1 is robust to noise but difficult to converge. They therefore use two fixed parameters α and β as the coefficients of the two loss functions. However, these two coefficients must be tuned manually for each data set to achieve good results. The invention instead weights the two losses adaptively according to the probability p that the network outputs for the labeled class, which avoids manual coefficient tuning and yields better results;
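The contrast between the two component losses can be made concrete with a small Python sketch. It covers only the components and a fixed-coefficient combination of the kind described above, with placeholder values for α and β; it does not attempt the adaptive weighting, whose exact formula is not reproduced in this text. All function names are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """l_ce = -log(p), where p is the probability assigned to the labeled class."""
    return -math.log(probs[label])


def mean_absolute_error(probs, label):
    """l_1 = mean over classes of |p_k - y_k| for a one-hot label y."""
    return sum(abs(p - (1.0 if k == label else 0.0))
               for k, p in enumerate(probs)) / len(probs)


def fixed_weight_loss(probs, label, alpha=1.0, beta=1.0):
    """Fixed-coefficient combination; alpha and beta are placeholder values."""
    return (alpha * cross_entropy(probs, label)
            + beta * mean_absolute_error(probs, label))


# Two-class case (normal vs. otosclerosis): one sample the network agrees
# with, and one where the label may be wrong (p close to 0).
for probs in ([0.2, 0.8], [0.99, 0.01]):
    print(probs, cross_entropy(probs, 1), mean_absolute_error(probs, 1))
```

For two classes the mean absolute error reduces to 1 - p and therefore never exceeds 1, whereas the cross-entropy grows without bound as p approaches 0. A mislabeled sample can thus dominate the cross-entropy term but not the mean absolute error term, which is the trade-off the paragraph above describes.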
(4) the input of the post-processing diagnosis system for multi-slice detection results is the detection results of N slices, each expressed as a pair of terms.
The two terms represent, respectively, the classification category of the detected region (normal or otosclerosis) and the detection confidence (the probability, between 0 and 1, that the result output by the object detection model belongs to that category). The output is the diagnosis result for the sample.
First, all results are sorted by confidence and the k results with the highest confidence are retained. According to statistics on the data, the stapes region usually appears in 3 consecutive CT slices, so k = 3 is the most appropriate value for this step;
Then, the ratio r of the number of slices detected as lesions to the total number of retained slices is computed and compared with a threshold T: when r is greater than T the sample is considered to have otosclerosis, and otherwise it is considered normal. Different choices of the threshold T raise or lower the sensitivity and specificity of the model accordingly, and plotting them gives a receiver operating characteristic (ROC) curve. The performance of the model can be measured by the area under this curve (AUC), which lies between 0 and 1; the larger the value, the better the performance of the model.
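The post-processing rule described above (keep the k most confident slice results, compute the lesion ratio r, compare it with T) can be sketched as follows. This is a minimal illustration under stated assumptions: the category strings, the function name `diagnose`, and the default threshold of 0.5 are hypothetical choices, while k = 3 follows the text.

```python
def diagnose(results, k=3, threshold=0.5):
    """Turn per-slice detections into a sample-level diagnosis.

    results: list of (category, confidence) pairs, one per slice, where
    category is "normal" or "otosclerosis" and confidence lies in [0, 1].
    Returns (diagnosis, r).
    """
    # Keep the k results with the highest confidence.
    top_k = sorted(results, key=lambda item: item[1], reverse=True)[:k]
    if not top_k:
        return "normal", 0.0
    # r = slices detected as lesions / slices retained.
    lesions = sum(1 for category, _ in top_k if category == "otosclerosis")
    ratio = lesions / len(top_k)
    return ("otosclerosis" if ratio > threshold else "normal"), ratio


slices = [
    ("normal", 0.30),
    ("otosclerosis", 0.95),
    ("otosclerosis", 0.90),
    ("normal", 0.85),
    ("normal", 0.40),
]
print(diagnose(slices))
print(diagnose(slices, threshold=0.7))
```

With these made-up detections, two of the three retained slices are lesions, so the sample is classified as otosclerosis at T = 0.5 but as normal at T = 0.7, which illustrates how the threshold trades sensitivity against specificity.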
Further, the network model of the invention is trained as follows:
first, the feature extraction backbone network and the region extraction network are trained, using the binary cross-entropy loss and the smooth L1 loss; then, the classification network is trained on the feature regions extracted by the region extraction network, using the noise-robust classification loss function of (3) and the smooth L1 loss. The whole training process is carried out alternately, twice. The training set comprises at least 1500 pathological images and 1500 normal images.
In the invention, once a test image I is input, the detection and diagnosis result is obtained with a single forward pass.
The beneficial effects of the invention are as follows. The invention designs a small-target detection network that takes temporal bone CT images as input and can detect a stapes of about 8 × 8 pixels in a CT image with a resolution of 512 × 512 pixels, and then detects and diagnoses the otosclerosis lesion region through the designed post-processing algorithm. An image under test yields detection and diagnosis results with a single forward pass, and the detection and classification tasks share the backbone network parameters, which effectively reduces the amount of computation and improves diagnostic efficiency. In addition, training the model requires a large amount of manually labeled data, and during labeling, subjective factors such as a doctor's inexperience or fatigue may cause labeling errors. To address this problem, the invention designs a classification loss function that is robust to noise, which improves the performance of the model. Experimental results show that the invention can accurately detect the otosclerosis lesion region and obtain an accurate diagnosis through post-processing of the detection results, thereby reducing the influence of human factors and improving the efficiency and accuracy of clinical diagnosis.
Drawings
FIG. 1 is a flow chart of a diagnostic system of the present invention.
FIG. 2 is a diagram of the feature extraction backbone network of the present invention; the numbers indicate the number of output channels of each convolution module.
FIG. 3 is a receiver operating characteristic (ROC) curve for the otosclerosis diagnostic classification.
FIGS. 4 and 5 show detection and diagnosis results of the present invention.
Detailed Description
The embodiments of the present invention are described in detail below, but the scope of the present invention is not limited to the examples.
The network structure in FIG. 2 is used as the feature extraction backbone network, and the object detection neural network is trained with 1500 abnormal images and 1500 normal images to obtain an automatic detection and diagnosis model.
The specific steps are as follows:
(1) before training, the network parameters in FIG. 2 are randomly initialized, and the images in the training set are resized to a uniform 512 × 512;
(2) during training, the image values are normalized and the mean is subtracted. The initial learning rate is set to 0.0001, and mini-batch stochastic gradient descent is used to minimize the loss function. The batch size is set to 2, and the network parameters are updated once every 4 batches. First, the feature extraction backbone network and the region extraction network are trained, using the binary cross-entropy loss and the smooth L1 loss; then, the classification network is trained on the feature regions extracted by the region extraction network, using the noise-robust classification loss function and the smooth L1 loss. The whole training process is carried out alternately, twice;
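The update schedule in step (2) can be read as gradient accumulation: gradients from 4 mini-batches of 2 images are averaged before each parameter update, giving an effective batch of 8. The sketch below illustrates that reading on a toy problem. It is an assumption about how "updated once every 4 batches" is realized, and the one-parameter model w·x fitted with a squared error stands in for the detection network purely for illustration; the function name `train` is hypothetical.

```python
def train(samples, lr=0.0001, batch_size=2, accumulate=4, epochs=1):
    """Mini-batch SGD on the toy model y = w * x with gradient accumulation.

    The parameter w is updated once every `accumulate` mini-batches, using
    the average of the accumulated mini-batch gradients. Gradients left
    over at the end of an epoch are discarded in this sketch.
    Returns (w, number_of_updates).
    """
    w, updates = 0.0, 0
    for _ in range(epochs):
        grad_sum, batches_seen = 0.0, 0
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # Gradient of 0.5 * (w*x - y)^2 with respect to w, averaged over the batch.
            grad_sum += sum((w * x - y) * x for x, y in batch) / len(batch)
            batches_seen += 1
            if batches_seen == accumulate:
                w -= lr * grad_sum / accumulate
                updates += 1
                grad_sum, batches_seen = 0.0, 0
    return w, updates


# 16 toy samples -> 8 mini-batches of 2 -> 2 parameter updates.
w, updates = train([(1.0, 1.0)] * 16)
print(w, updates)
```

Accumulating gradients in this way keeps per-step memory at the level of a batch of 2 while smoothing each update over 8 images, a common choice when full-resolution feature maps make larger batches impractical.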
(3) during testing, each slice I of the 3D temporal bone CT image is resized to 512 × 512 and input into the trained model, which outputs a detection box and a confidence p for each slice. All results are sorted by confidence, and the 3 results with the highest confidence are retained. Then, the ratio r of the number of slices detected as lesions to the total number of retained slices is computed and compared with a threshold T: when r is greater than T the sample is considered to have otosclerosis, and otherwise it is considered normal. Different choices of the threshold T raise or lower the sensitivity and specificity of the model accordingly; a receiver operating characteristic (ROC) curve is drawn from these values, and the performance of the model can be measured by the area under the curve (AUC).
FIG. 3 is the ROC curve used to evaluate the classification of otosclerosis by the present invention. The area under the ROC curve (AUC, maximum value 1) reaches 0.954, which indicates that the classification performance of the present invention is excellent.
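For reference, an ROC curve and its AUC can be computed from per-sample lesion ratios r by sweeping the threshold T, as in the following Python sketch. The function names are hypothetical and the ratios and labels are made-up toy values, not data from this document, so the result is unrelated to the 0.954 reported above.

```python
def roc_points(ratios, labels):
    """ROC points (false positive rate, true positive rate) from a threshold sweep.

    ratios: per-sample lesion ratio r; labels: 1 = otosclerosis, 0 = normal.
    Predicting positive when r >= t matches the rule r > T for any T just below t.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for t in sorted(set(ratios), reverse=True):
        tp = sum(1 for r, y in zip(ratios, labels) if r >= t and y == 1)
        fp = sum(1 for r, y in zip(ratios, labels) if r >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data: with k = 3 retained slices, r can only be 0, 1/3, 2/3, or 1.
ratios = [1.0, 1.0, 2 / 3, 1 / 3, 2 / 3, 1 / 3, 0.0, 0.0]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(roc_points(ratios, labels))
print(auc(roc_points(ratios, labels)))
```

Each point gives the sensitivity (true positive rate) and 1 minus the specificity (false positive rate) at one threshold, so the curve summarizes the trade-off across all choices of T.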
FIGS. 4 and 5 show examples of otosclerosis detection by the present invention, showing the detection results for a normal subject and for a patient. In each figure, the left image is the detection result on the input single-slice CT and the right image is a locally enlarged view of that result; the small box marks the detected stapes position, and the classification result and confidence of the model are given below the image. It can be seen that the model locates the extremely small stapes region directly in the original image with high accuracy and gives accurate diagnostic information, demonstrating its effectiveness.
References
[1] Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 1097-1105 (2012).
[2] Russakovsky, O., Deng, J., Su, H. et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115, 211-252 (2015).
[3] Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (2014).
[4] He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 770-778 (2016).
[5] Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition, 580-587 (2014).
[6] Girshick, R. Fast R-CNN. IEEE International Conference on Computer Vision, 1440-1448 (2015).
[7] Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems (2015).
[8] Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition, 3431-3440 (2015).
[9] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A. L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 834-848 (2018).
[10] Wang, Y., Ma, X., Chen, Z. et al. Symmetric cross entropy for robust learning with noisy labels. IEEE International Conference on Computer Vision, 322-330 (2019).