CN112419248B

CN112419248B - Ear sclerosis focus detection and diagnosis system based on small target detection neural network

Info

Publication number: CN112419248B
Application number: CN202011263682.7A
Authority: CN
Inventors: 王云峰; 颜波; 李健; 谭伟敏; 管鹏飞; 陈鹤丹; 吴灵捷; 李吉春
Original assignee: Fudan University
Current assignee: Anhui Yixinhui Technology Co.,Ltd.
Priority date: 2020-11-13
Filing date: 2020-11-13
Publication date: 2022-04-12
Anticipated expiration: 2040-11-13
Also published as: CN112419248A

Abstract

The invention belongs to the technical field of medical image processing, and particularly relates to an ear sclerosis (otosclerosis) focus detection and diagnosis system based on a small target detection neural network. The system comprises a feature extraction backbone network, a target detection and classification network, a noise-robust classification loss function, and a post-processing diagnosis system for multi-layer detection results. The feature extraction backbone network is a multi-level deep convolutional neural network used to extract a feature map of an image. The target detection and classification network comprises the feature extraction backbone network, a region extraction network, a region-of-interest pooling layer and a classification network, and outputs the category of each region. The noise-robust classification loss function combines the cross-entropy loss and the mean absolute error loss, and is less affected by mislabeled training data. The 3D temporal bone CT image is input into the network model slice by slice, and the focus detection and diagnosis result is obtained through a single forward propagation followed by post-processing. The invention can reduce the influence of human factors and improve the efficiency and accuracy of clinical diagnosis.

Description

Ear sclerosis focus detection and diagnosis system based on small target detection neural network

Technical Field

The invention belongs to the technical field of medical image processing, particularly relates to an ear sclerosis focus detection and diagnosis system, and more particularly relates to an ear sclerosis focus detection and diagnosis system based on a small target detection neural network.

Background

Otosclerosis is a disease in which the dense lamellar bone of the bony labyrinth is locally replaced by new spongy bone rich in cells and blood vessels. According to the site and extent of the lesion, otosclerosis is classified into stapedial, cochlear and mixed types. Cochlear otosclerosis is an advanced form of the disease that is readily diagnosed from its typical clinical and CT manifestations, and its treatment is limited to wearing a hearing aid. The earliest and most common site of otosclerosis, however, is the region anterior to the vestibular window, which leads to stapedial otosclerosis, a conductive hearing loss caused by fixation of the stapes footplate. In addition to diagnosis based on clinical symptoms and audiological examination, the diagnostic value of high-resolution CT (HRCT) is widely recognized: its reported positive rate is 74% to 95.1%, and it is considered the first-choice examination for diagnosing otosclerosis.

In patients with stapedial otosclerosis, temporal bone CT shows reduced bone density in the region of the pre-fenestral fissure (fissula ante fenestram) and thickening of the stapes footplate. By reading the CT images, a doctor can give a preliminary diagnosis. However, human factors such as inexperience, fatigue and negligence may directly affect the accuracy of the diagnosis. In addition, the stapes occupies only about 8 × 8 pixels in a CT image with a resolution of 512 × 512 pixels, which makes it difficult for the physician to locate and assess.

A deep convolutional neural network (DCNN) is a machine learning technique that can effectively avoid human factors and automatically learn to extract rich, representative visual features from a large amount of labeled data. It uses the back-propagation optimization algorithm to update its internal parameters and learn the mapping from an input image to a label. In recent years, DCNNs have greatly improved performance on computer vision tasks.

In 2012, Krizhevsky et al. [1] were the first to apply a deep convolutional neural network in the ImageNet [2] image classification competition, winning it with a Top-5 error rate of 15.3% and setting off a wave of interest in deep learning. In 2015, Simonyan et al. [3] proposed the 16-layer and 19-layer networks VGG-16 and VGG-19, which increased the number of network parameters and further improved results on the ImageNet image classification task. In 2016, He et al. [4] used the 152-layer residual network ResNet to achieve classification performance exceeding that of the human eye.

DCNNs perform excellently not only in image classification but also in structured output tasks such as object detection [5-7] and semantic segmentation [8,9]. Applied to computer-aided diagnosis (CAD), a DCNN can assist doctors in making better medical diagnoses, enabling early detection and early treatment and improving treatment outcomes.

However, existing target detection networks generally cannot detect a target as small as the stapes. To address this problem, the invention provides a new noise-robust ear sclerosis focus detection and diagnosis system based on a small target detection neural network, which makes full use of the characteristics of the training images, extracts rich features, and performs detection and diagnosis of the otosclerosis region at the same time.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention aims to provide an ear sclerosis focus detection and diagnosis system based on a small target detection neural network, which eliminates the influence of human factors and realizes the automatic diagnosis of a temporal bone CT image.

The invention provides an ear sclerosis focus detection and diagnosis system based on a small target detection neural network, which specifically comprises the following components:

(1) a feature extraction backbone network;

(2) a target detection and classification network;

(3) a noise-robust classification loss function;

(4) a post-processing diagnosis system for multi-layer detection results.

Wherein:

(1) the feature extraction backbone network is a multi-level deep convolutional neural network comprising 9 convolution modules, namely conv_x1, conv_x2, conv_x3, conv_y1, conv_y2, conv_y3, conv_y4, conv_z1 and conv_z2. Each convolution module consists of two consecutive convolutional layers, and together the modules form a W-shaped network structure. The numbers of output channels of conv_x1, conv_x2 and conv_x3 are 64, 128 and 256 respectively, and their output feature resolution is the same as that of the input image; conv_y1, conv_y2, conv_y3 and conv_y4 each have 128 output channels, and their output features are down-sampled by a factor of 4 relative to the input image; conv_z1 and conv_z2 each have 256 output channels, and their output features are down-sampled by a factor of 8 relative to the input image. Skip connections are provided between conv_x1 and conv_x2, between conv_x2 and conv_x3, between conv_y1 and conv_y2, and between conv_y3 and conv_y4: the earlier feature is concatenated directly with the higher-level feature and then convolved to obtain the later feature. The input of the feature extraction backbone network is a 3D temporal bone CT image (input slice by slice), and the output is a feature map of the temporal bone CT image;

(2) the target detection and classification network is the target detection network proposed by Ren et al. [7], with its backbone replaced by the feature extraction backbone network in (1). Besides the backbone network, the target detection network comprises a region extraction network, a region-of-interest pooling layer, a classification network and the like. The region extraction network comprises two parallel branches: one classifies each feature point as foreground or background through a softmax function, the other calculates the offsets of the bounding box through a 1 × 1 convolution, and the outputs of the two branches are combined to obtain the extracted feature regions. The classification network generates two output branches through a fully connected layer: the first branch outputs the position offset of each feature region, which is used to further correct the position of the detection box; the second branch calculates the classification probability of the features through a softmax function to obtain the category of the region. The workflow of the target detection network is as follows:

first, the output feature map of the feature extraction backbone network is sent to the region extraction network to obtain the extracted feature regions;

then, the extracted feature regions enter the region-of-interest pooling layer for adaptive pooling, which resizes every feature region to a uniform 7 × 7;

finally, the pooled features are sent to the classification network to obtain the category of each region;

(3) the noise-robust classification loss function combines the cross-entropy loss l_ce and the mean absolute error loss l₁, using the probability p output by the network as a dynamic adaptive weighting coefficient. The loss function is defined as:

where p is the probability that the network assigns to the class given by the label;

according to the paper of Wang et al. [10], the cross-entropy loss l_ce is not robust to noise but helps the network converge, whereas the mean absolute error loss l₁ is robust to noise but difficult to converge. They therefore use two fixed parameters α and β as the coefficients of the two loss functions. However, these two coefficients must be tuned manually for each data set to achieve good results. The invention instead weights the two losses adaptively according to the probability p that the network outputs for the labeled class, which avoids manual tuning of the coefficients and achieves better results;

(4) the input of the post-processing diagnosis system for multi-layer detection results is the detection results of N layers (slices), wherein each result is expressed as

The two terms respectively represent the classification category of the detection region (normal or otosclerosis) and the detection confidence (the probability, between 0 and 1, that the result output by the target detection model belongs to that category). The output is the diagnosis result of the sample.

First, all results are sorted by confidence, and the top k results are retained. According to statistics on the data, the stapes region usually appears in 3 consecutive CT slices, so k = 3 is the most appropriate value for this step;

finally, the ratio r of the number of retained slices detected as lesions to the total number of retained slices is calculated and compared with a set threshold T: when r is greater than T the sample is considered to have otosclerosis, otherwise the sample is considered normal. Different choices of the threshold T raise or lower the sensitivity and specificity of the model accordingly. From these values a receiver operating characteristic (ROC) curve is drawn, and the performance of the model is measured by the area under the curve (AUC), which lies between 0 and 1; the larger the value, the better the performance of the model.
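The post-processing of step (4) can be sketched in a few lines of plain Python. This is a minimal illustration, not the patented implementation: the function name `diagnose`, the string labels, and the default threshold of 0.5 are assumptions chosen for the example, since the description leaves T as a tunable value.

```python
def diagnose(slice_results, k=3, threshold=0.5):
    """Turn per-slice detections into one sample-level diagnosis.

    slice_results: list of (category, confidence) pairs, one per CT slice,
                   where category is "normal" or "otosclerosis".
    k:             number of highest-confidence results to keep (3 in the text).
    threshold:     the threshold T; 0.5 is only a placeholder value.
    """
    if not slice_results:
        raise ValueError("no detection results")
    # Keep the k results with the highest detection confidence.
    top_k = sorted(slice_results, key=lambda item: item[1], reverse=True)[:k]
    # r = retained slices detected as lesions / total retained slices.
    r = sum(1 for category, _ in top_k if category == "otosclerosis") / len(top_k)
    return ("otosclerosis" if r > threshold else "normal"), r


results = [("normal", 0.30), ("otosclerosis", 0.95), ("otosclerosis", 0.90),
           ("normal", 0.85), ("normal", 0.20)]
print(diagnose(results))
```

With k = 3 the ratio r can only take the values 0, 1/3, 2/3 and 1, so the threshold T effectively selects how many of the three retained slices must be flagged before the sample is called positive.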

Further, the training method of the network model of the invention is as follows:

first, the feature extraction backbone network and the region extraction network are trained, using the binary cross-entropy loss and the smooth L₁ loss; then, the classification network is trained on the feature regions extracted by the region extraction network, using the noise-robust classification loss function in (3) and the smooth L₁ loss; the whole training process is carried out alternately twice; the training samples comprise at least 1500 pathological images and 1500 normal images.
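The noise-robust classification loss used in the second training stage combines a cross-entropy term and a mean-absolute-error term. The sketch below computes both terms for a single sample and combines them with caller-supplied weights. The function names and the weights `w_ce` and `w_mae` are illustrative assumptions. The description states that the invention derives the weighting from the probability p of the labeled class, but the exact weighting expression is not reproduced in this text, so it is deliberately left to the caller here.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy loss for one sample: -log of the labeled class probability."""
    return -math.log(probs[label])


def mean_absolute_error(probs, label):
    """Mean absolute difference between the probabilities and the one-hot label."""
    return sum(abs(p - (1.0 if i == label else 0.0))
               for i, p in enumerate(probs)) / len(probs)


def weighted_loss(probs, label, w_ce, w_mae):
    """Weighted sum of the two terms; the weights are supplied by the caller."""
    return w_ce * cross_entropy(probs, label) + w_mae * mean_absolute_error(probs, label)


# A confident prediction that contradicts the label, as a mislabeled sample might:
probs, label = [0.0001, 0.9999], 0
print(cross_entropy(probs, label))        # grows without bound as p approaches 0
print(mean_absolute_error(probs, label))  # never exceeds 1
```

The example shows the trade-off the description refers to: the cross-entropy term becomes very large when the network disagrees with a (possibly wrong) label, while the mean-absolute-error term stays bounded.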

In the invention, after the test image I is input, the detection and diagnosis result can be obtained only by one-time forward propagation.
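Within that forward pass, the region-of-interest pooling layer of step (2) resizes every extracted feature region to 7 × 7 before classification. The following is a minimal single-channel sketch of adaptive max pooling in plain Python. The function name and the choice of max pooling with floor/ceiling bin edges are assumptions made for illustration; the description only states that the pooling is adaptive and that the output size is 7 × 7.

```python
def adaptive_max_pool(region, out_size=7):
    """Max-pool a 2D list of numbers of any size to out_size x out_size."""
    height, width = len(region), len(region[0])
    pooled = []
    for i in range(out_size):
        # Bin i covers rows [r0, r1); every bin holds at least one row.
        r0 = (i * height) // out_size
        r1 = max(r0 + 1, ((i + 1) * height + out_size - 1) // out_size)
        row = []
        for j in range(out_size):
            c0 = (j * width) // out_size
            c1 = max(c0 + 1, ((j + 1) * width + out_size - 1) // out_size)
            row.append(max(region[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# A 14 x 14 region is reduced by taking the maximum over 2 x 2 bins.
region = [[r * 14 + c for c in range(14)] for r in range(14)]
pooled = adaptive_max_pool(region)
print(len(pooled), len(pooled[0]))
```

Because the bin edges are computed from the region size, regions smaller than 7 × 7 are handled as well: neighbouring bins then overlap on the same source cells.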

The beneficial effects of the invention are as follows. The invention designs a small target detection network that takes a temporal bone CT image as input, can detect a stapes of about 8 × 8 pixels in a CT image with a resolution of 512 × 512 pixels, and further detects and diagnoses the otosclerosis focus region through the designed post-processing algorithm. An image under test yields detection and diagnosis results through a single forward propagation, and the detection and classification tasks share the backbone network parameters, which effectively reduces the amount of computation and improves diagnostic efficiency. In addition, training the model requires a large amount of manually labeled data, and during labeling, subjective factors such as a doctor's inexperience or fatigue may cause labeling errors. To address this problem, the invention designs a classification loss function that is robust to noise, which improves the performance of the model. Experimental results show that the invention can accurately detect the otosclerosis focus region and, through post-processing based on the detection results, obtain an accurate diagnosis, thereby reducing the influence of human factors and improving the efficiency and accuracy of clinical diagnosis.

Drawings

FIG. 1 is a flow chart of a diagnostic system of the present invention.

FIG. 2 is a diagram of the feature extraction backbone network of the present invention; the numbers indicate the number of output channels of each convolution module.

FIG. 3 is a receiver operating characteristic curve (ROC) for the ear sclerosis diagnostic classification.

FIGS. 4 and 5 are graphs showing the effect of the detection and diagnosis of the present invention.

Detailed Description

The embodiments of the present invention are described in detail below, but the scope of the present invention is not limited to the examples.

The network structure in fig. 2 is used as a feature extraction backbone network, 1500 abnormal images and 1500 normal images are used for training a target detection neural network, and an automatic detection and diagnosis model is obtained.
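As a quick check on the backbone dimensions, the channel counts and down-sampling factors given for the nine convolution modules can be tabulated for a 512 × 512 input. The sketch below is bookkeeping only and does not build the network. The dictionary layout and function names are illustrative, and it assumes that down-sampling by a factor of 4 or 8 means each side of the feature map is divided by that factor.

```python
# (output channels, down-sampling factor) per convolution module,
# taken from the description of the feature extraction backbone network.
MODULES = {
    "conv_x1": (64, 1), "conv_x2": (128, 1), "conv_x3": (256, 1),
    "conv_y1": (128, 4), "conv_y2": (128, 4),
    "conv_y3": (128, 4), "conv_y4": (128, 4),
    "conv_z1": (256, 8), "conv_z2": (256, 8),
}


def output_shape(module, height=512, width=512):
    """Return (channels, height, width) of a module's output feature map."""
    channels, factor = MODULES[module]
    return channels, height // factor, width // factor


def stapes_extent(module, stapes_pixels=8):
    """Side length, in feature-map cells, covered by an 8 x 8 pixel stapes."""
    _, factor = MODULES[module]
    return stapes_pixels / factor


for name in MODULES:
    print(name, output_shape(name), stapes_extent(name))
```

Under this assumption an 8 × 8 pixel stapes covers 2 × 2 cells at the conv_y level and a single cell at the conv_z level, while the conv_x modules keep it at its full 8 × 8 extent.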

The method comprises the following specific steps:

(1) before training, randomly initializing network parameters in FIG. 2, and adjusting images in a training set to be 512 × 512 in a uniform size;

(2) during training, the image values are normalized and the mean value is subtracted. The initial learning rate is set to 0.0001, and mini-batch stochastic gradient descent is used to minimize the loss function. The batch size is set to 2, and the network parameters are updated every 4 batches. First, the feature extraction backbone network and the region extraction network are trained, using the binary cross-entropy loss and the smooth L₁ loss; then, the classification network is trained on the feature regions extracted by the region extraction network, using the noise-robust classification loss function and the smooth L₁ loss; the whole training process is carried out alternately twice;

(3) during testing, each slice I of the 3D temporal bone CT image is resized to 512 × 512 and input into the trained model, which outputs a target detection box and a confidence p for each slice. All results are sorted by confidence, and the top 3 results are retained. Then the ratio r of the number of retained slices detected as lesions to the total number of retained slices is calculated and compared with a set threshold T: when r is greater than T the sample is considered to have otosclerosis, otherwise the sample is considered normal. Different choices of the threshold T raise or lower the sensitivity and specificity of the model accordingly; a receiver operating characteristic (ROC) curve is drawn from these values, and the performance of the model is measured by the area under the curve (AUC).
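The update schedule in step (2), a batch size of 2 with a parameter update every 4 batches, can be read as gradient accumulation, giving an effective batch of 8 images per update. That reading is an assumption. The sketch below demonstrates the schedule on a toy one-parameter model y ≈ w·x with squared error rather than on the detection network, and all names in it are illustrative.

```python
def train_epoch(samples, w, lr=1e-4, batch_size=2, accumulate=4):
    """One pass of mini-batch SGD with gradient accumulation.

    samples: list of (x, y) pairs for the toy model y ~ w * x.
    Returns the updated parameter and the number of parameter updates.
    """
    grad_sum, batches_seen, updates = 0.0, 0, 0
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        # Mean gradient of (w * x - y) ** 2 with respect to w over the mini-batch.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        grad_sum += grad
        batches_seen += 1
        if batches_seen == accumulate:
            # Apply one update using the average of the accumulated gradients.
            w -= lr * grad_sum / accumulate
            grad_sum, batches_seen = 0.0, 0
            updates += 1
    # Gradients from a trailing incomplete group of batches are discarded here.
    return w, updates


data = [(x, 3.0 * x) for x in range(1, 17)]  # 16 samples -> 8 batches -> 2 updates
print(train_epoch(data, w=0.0))
```

Accumulating over several small batches keeps the memory cost of a batch of 2 while smoothing each update over more images.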

FIG. 3 is a ROC curve for evaluating the classification effect of the present invention on ear sclerosis, and it can be seen that the area under the ROC curve (AUC, maximum value of 1) reaches 0.954, which shows that the classification performance of the present invention is superior.
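An ROC curve and its AUC of this kind can be computed from per-sample ratios r by sweeping the threshold T. The sketch below uses plain Python and the trapezoidal rule. The function names and the tiny data set are invented for illustration and are unrelated to the reported value of 0.954.

```python
def roc_points(ratios, labels):
    """Sweep the threshold T over the observed ratios.

    ratios: per-sample ratio r from the post-processing step.
    labels: ground truth, 1 = otosclerosis, 0 = normal.
    Returns sorted (false positive rate, true positive rate) points.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = set()
    # T = -1 classifies every sample as positive; T = max(ratios) classifies none.
    for t in [-1.0] + sorted(set(ratios)):
        tp = sum(1 for r, y in zip(ratios, labels) if r > t and y == 1)
        fp = sum(1 for r, y in zip(ratios, labels) if r > t and y == 0)
        points.add((fp / negatives, tp / positives))
    return sorted(points)


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


ratios = [1.0, 2 / 3, 1 / 3, 0.0]
labels = [1, 1, 0, 0]
print(auc(roc_points(ratios, labels)))
```

Here the true positive rate is the sensitivity and the false positive rate is one minus the specificity, so each threshold T contributes one point of the curve.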

FIGS. 4 and 5 show examples of otosclerosis detection by the present invention, for a normal person and a patient respectively. In each figure, the left image is the detection result on the input single-slice CT and the right image is a partial enlargement of that result; the small box marks the detected stapes position, and the classification result and confidence of the model are given below the image. It can be seen that the model locates the extremely small stapes region directly in the original image with very high accuracy and gives accurate diagnostic information, demonstrating the effectiveness of the model.

References

[1] Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 1097-1105 (2012).

[2] Russakovsky, O., Deng, J., Su, H. et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115, 211-252 (2015).

[3] Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, (2014).

[4] He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 770-778 (2016).

[5] Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition, 580-587 (2014).

[6] Girshick, R. Fast R-CNN. IEEE International Conference on Computer Vision, 1440-1448 (2015).

[7] Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, (2015).

[8] Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition, 3431-3440 (2015).

[9] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A. L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 834-848 (2018).

[10] Wang, Y., Ma, X., Chen, Z. et al. Symmetric cross entropy for robust learning with noisy labels. IEEE International Conference on Computer Vision, 322-330 (2019).

Claims

1. An ear sclerosis focus detection and diagnosis system based on a small target detection neural network is characterized by comprising: a feature extraction backbone network, a target detection and classification network, a classification loss function for noise robustness, and a post-processing diagnosis system for multi-layer detection results; wherein:

(1) the feature extraction backbone network comprises 9 convolution modules, namely conv_x1, conv_x2, conv_x3, conv_y1, conv_y2, conv_y3, conv_y4, conv_z1 and conv_z2, wherein each convolution module is composed of two consecutive convolutional layers, and all the convolution modules form a W-shaped network structure; the numbers of output channels of conv_x1, conv_x2 and conv_x3 are 64, 128 and 256 respectively, and the output feature resolution is the same as that of the input image; conv_y1, conv_y2, conv_y3 and conv_y4 each have 128 output channels, and the output features are down-sampled by a factor of 4 relative to the input image; conv_z1 and conv_z2 each have 256 output channels, and the output features are down-sampled by a factor of 8 relative to the input image; skip connections are arranged between conv_x1 and conv_x2, between conv_x2 and conv_x3, between conv_y1 and conv_y2, and between conv_y3 and conv_y4, and the earlier feature is directly concatenated with the higher-level feature and then convolved to obtain the later feature; the input of the feature extraction backbone network is a 3D temporal bone CT image, input slice by slice; the output is a feature map of the temporal bone CT image;

(2) the target detection and classification network adopts the target detection network of Ren, S., He, K., Girshick, R. & Sun, J., Faster R-CNN: Towards real-time object detection with region proposal networks, Neural Information Processing Systems, 2015, with its backbone network replaced by the feature extraction backbone network in (1); besides the backbone network, the target detection network also comprises a region extraction network, a region-of-interest pooling layer and a classification network; the region extraction network comprises two parallel branches, wherein one branch classifies each feature point as foreground or background through a softmax function, the other branch calculates the offsets of the bounding box through a 1 × 1 convolution, and the outputs of the two branches are combined to obtain the extracted feature regions; the classification network generates two output branches through a fully connected layer: the first branch outputs the position offset of each feature region for further correcting the position of the detection box; the second branch calculates the classification probability of the features through a softmax function to obtain the category of the region; the workflow of the target detection network is as follows:

The two terms respectively represent the classification category and the detection confidence of the detection region; the output is the diagnosis result of the sample; the specific process is as follows:

first, all results are sorted by confidence, and the top k results are retained;

then, the ratio r of the number of retained slices detected as lesions to the total number of retained slices is calculated, a threshold T is set, and the sample is considered to have otosclerosis when r is greater than T, and is otherwise considered normal.

2. The system for detecting and diagnosing ear sclerosis focus based on small target detection neural network as claimed in claim 1, wherein the training process of the network model is as follows:

first, the feature extraction backbone network and the region extraction network are trained, using the binary cross-entropy loss and the smooth L₁ loss; then, the classification network is trained on the feature regions extracted by the region extraction network, using the noise-robust classification loss function in (3) and the smooth L₁ loss; the whole training process is carried out alternately twice; the training samples comprise at least 1500 pathological images and 1500 normal images.

3. The system for detecting and diagnosing ear sclerosis focus based on small target detection neural network as claimed in claim 1, wherein the detection and diagnosis result is obtained through one forward propagation after the test image is inputted.