Disclosure of Invention
In order to overcome the defects of low efficiency, large error and low precision in existing X-ray hand bone region-of-interest extraction methods, the invention provides an automatic X-ray hand bone region-of-interest extraction method based on a deep neural network, which has high efficiency, small error and high precision, and which not only automatically acquires the hand bone region in an X-ray image, but also automatically removes noise and adjusts brightness.
In order to solve the technical problems, the invention adopts the technical scheme that:
an automatic X-ray image hand bone region-of-interest extraction method based on a deep neural network comprises the following steps:
step one, removing the character-embedded parts of the black background on both sides of the original hand bone X-ray grayscale image, so that most of the characters are removed from the original image;
step two, carrying out a brightening operation on the original hand bone X-ray grayscale image: the overall brightness of each image is evaluated, images with insufficient brightness are brightened, and a denoising operation is carried out after brightening, obtaining an image set called Output1;
step three, sampling and training a model M1, wherein the model is used for removing characters near and on the hand bones in the X-ray images of Output1, obtaining character-free hand bone X-ray grayscale images called Output2;
step four, normalizing the sizes of all the images in Output2: to keep height and width consistent, black padding is first applied to both sides; when an image is wider than it is tall, both sides are instead cropped inward; the images are then reduced to 512 × 512, and the new image set is called Output3;
step five, sampling and training a model M2, wherein the model is used for distinguishing three parts in Output3, namely the hand bone, the background, and the intersection of hand bone and background, with a sample size of 16 × 16;
step six, judging each image in Output3 with a sliding window using model M2, accumulating the judged class value on each pixel point, and, according to the counts of the different judgment classes obtained at each pixel point, setting the pixel value of the hand bone part to 255 and the pixel value of the background part to 0, thus obtaining a hand bone binary marker image called Output4;
step seven, obtaining an image containing only the hand bone region on the basis of Output3 by reference to the hand bone binary marker map Output4; since background impurities still exist in this image, a maximum connected region calculation is performed once here to remove the impurities, obtaining Output5;
step eight, because in Output5 the aperture around the image is sometimes judged as hand bone tissue and is connected with the maximum connected region, and because the length of the aperture is much larger than the width of the hand bone, a difference comparison is made on the bottom part of each image in Output5 so as to remove the aperture and obtain the final hand bone region of interest.
This concludes the description of the operation flow.
Further, in the first step, the method for removing the characters embedded in the black background is as follows: the image is first converted into a numerical array, and the columns are scanned from the left and right edges toward the middle; owing to the particularity of the pure-white characters on the pure-black background, when the number of non-pure-black pixels in a column exceeds 10% of the whole column, the scan is judged to have left the character-embedded background strip, and the preceding part is cut off completely.
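A minimal sketch of this column scan, assuming the grayscale image is already loaded as a numpy array and a pixel value of 0 counts as pure black (the function name is illustrative, not from the text):

```python
import numpy as np

def crop_side_text(img, ratio=0.10):
    """Cut off the character-embedded black-background strips on both sides:
    scan columns inward until non-pure-black pixels exceed `ratio` of a column."""
    h, w = img.shape
    nonblack = (img > 0).sum(axis=0)           # non-pure-black count per column
    left = 0
    while left < w and nonblack[left] <= ratio * h:
        left += 1                              # still inside the left strip
    right = w - 1
    while right > left and nonblack[right] <= ratio * h:
        right -= 1                             # still inside the right strip
    return img[:, left:right + 1]
```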
Still further, in the second step, the overall brightness evaluation process is as follows: let the resolution of an original image O be M × N and the value of each pixel be t_ij; the overall brightness of the image is calculated by the formula

Aug = (1 / (M × N)) × Σ_{i=1}^{M} Σ_{j=1}^{N} [t_ij > 120],

that is, only pixel points with a pixel value larger than 120 are counted. Different parameters are then used for brightening at different Aug values.
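A sketch of this evaluation under the formula above; the Aug threshold and the gamma used for brightening are not specified in the text, so the values below are purely illustrative:

```python
import numpy as np

def overall_brightness(img):
    """Aug: fraction of pixels whose value exceeds 120."""
    m, n = img.shape
    return float((img > 120).sum()) / (m * n)

def brighten_if_dark(img, aug_threshold=0.2, gamma=0.6):
    """Brighten images whose Aug falls below the (illustrative) threshold."""
    if overall_brightness(img) < aug_threshold:
        img = (255.0 * (img / 255.0) ** gamma).astype(np.uint8)
    return img
```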
Further, in the third step, for training the model M1, the sample collection process is as follows: 100 × 100 samples containing letters are intercepted from the original grayscale images as positive samples, and samples of the same size without letters are intercepted as negative samples. The process of constructing the two-dimensional convolutional neural network comprises the following steps:
Step 3.1: the input image, of size 100 × 100 with 1 channel, passes through a Conv2D convolution layer that extracts local features, followed by a relu activation function layer and a MaxPooling pooling layer.
Step 3.2: three further Conv2D convolutional layers of different sizes extract features; the activation function layer and pooling layer structure is consistent with step 3.1.
Step 3.3 the 4 convolutional layers described above are connected to the next fully-connected layer via a Flatten layer.
Step 3.4: the above features pass through the first fully-connected layer, whose internal sequence comprises a Dense layer, a relu activation layer and a Dropout layer to prevent overfitting. The second fully-connected layer follows, whose internal sequence comprises a Dense layer and a sigmoid activation layer, giving the output result.
Among the characters, the letter L can be found by the selective search method and accurately localized to a 100 × 100 region because of its particularity. Because the other characters are mainly concentrated in the upper right corner of the image and do not affect the judgment of the hand bone region, after the L is found and removed, the upper right corner region is filled according to the average value of the surrounding background pixels.
In the fifth step, the training process of the model M2 is as follows: after the character interference has been removed in the preceding steps, this model is used to distinguish the background, the hand bones, and the intersection area of the hand bones and the background. The three types are sampled with a sliding window and defined as 0, 1 and 2, corresponding respectively to the background, the hand bone/background intersection area, and the hand bone; the process of constructing the two-dimensional convolutional neural network model comprises the following steps:
step 5.1, extracting local features of the input image through a two-dimensional convolution layer, the input size being 16 × 16 with 1 channel, followed by a relu activation function layer and a MaxPooling pooling layer;
step 5.2, extracting features with a second Conv2D two-dimensional convolution layer of a different size, the structures of the activation function layer and the MaxPooling layer being consistent with step 5.1;
step 5.3, connecting the convolution layers with the following fully-connected layer through a Flatten layer;
step 5.4, the above features produce the output result through the first fully-connected layer, whose internal sequence comprises a Dense layer, a relu activation layer and a Dropout layer to prevent overfitting, followed by the second fully-connected layer, whose internal sequence comprises a Dense layer and a softmax activation layer (the output has three classes).
In the sixth step, the sliding window judging process is as follows:
step 6.1, a 16 × 16 sliding window is applied to the input image with a step size of 1;
step 6.2, each 16 × 16 patch is judged by the model M2 to obtain a class value x; for each of the 16 × 16 pixels in the patch, the count of class x is increased by 1, an array val[512][512][3] being defined so that each pixel keeps a count of how many times each class value x was obtained;
step 6.3, an output array result is defined; for each point in result, only the counts of class 0 and class 2 in the corresponding val array are compared: if class 2 is more numerous, the corresponding point in result is filled with 255, namely white, otherwise with 0, namely black;
since the sliding window adds statistics to every pixel it covers, the complexity is O(n² × m²); because only single-point queries are needed, the complexity is reduced here by means of a tree array: the statistical complexity is reduced to O(n² × log²(n)), and the single-point query complexity is O(log²(n)). A sketch of this tree-array scheme follows.
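One concrete reading of this optimization, as a minimal sketch: a two-dimensional binary indexed (Fenwick) tree maintained over a difference array gives the rectangle "+1" update for each classified window and the single-point count query in O(log²(n)) each. The class below and the classify_patch helper are illustrative, not taken from the text:

```python
import numpy as np

class Fenwick2D:
    """2D binary indexed tree over a difference array:
    rectangle add and single-point query, both O(log^2 n)."""

    def __init__(self, n):
        self.n = n
        self.t = np.zeros((n + 2, n + 2), dtype=np.int32)

    def _add(self, x, y, v):
        i = x
        while i <= self.n + 1:
            j = y
            while j <= self.n + 1:
                self.t[i, j] += v
                j += j & (-j)
            i += i & (-i)

    def add_rect(self, x1, y1, x2, y2, v=1):
        # 1-based inclusive rectangle update via four corner updates
        self._add(x1, y1, v)
        self._add(x1, y2 + 1, -v)
        self._add(x2 + 1, y1, -v)
        self._add(x2 + 1, y2 + 1, v)

    def point_query(self, x, y):
        # prefix sum of the difference array = value at (x, y)
        s, i = 0, x
        while i > 0:
            j = y
            while j > 0:
                s += self.t[i, j]
                j -= j & (-j)
            i -= i & (-i)
        return int(s)

# accumulate per-class votes of the 16 x 16 sliding window;
# classify_patch is a hypothetical wrapper around model M2
N, W = 512, 16
votes = [Fenwick2D(N) for _ in range(3)]       # classes 0, 1, 2
for i in range(N - W + 1):
    for j in range(N - W + 1):
        c = classify_patch(image, i, j)        # hypothetical M2 call
        votes[c].add_rect(i + 1, j + 1, i + W, j + W)

result = np.zeros((N, N), dtype=np.uint8)
for x in range(N):
    for y in range(N):
        if votes[2].point_query(x + 1, y + 1) > votes[0].point_query(x + 1, y + 1):
            result[x, y] = 255                 # hand bone wins over background
```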
In the eighth step, the process of difference comparison is as follows (a code sketch is given after the steps):
step 8.1: the image is first converted into a numerical array; because the light shadow lies at the bottom, only a cross-section (patch) of a certain height is taken from the bottom, the height being determined according to the average height of the light shadow;
step 8.2: white-pixel statistics are computed for each row in the patch, and the row with the most white pixels and the row with the fewest white pixels are retained;
step 8.3: the two retained rows are compared column by column, recording the number t of differing columns, i.e. columns where one row is white and the other is black;
step 8.4: when t is larger than a set difference threshold, it is judged that a light shadow has occurred and the part between the two rows is discarded; if t does not exceed the threshold, the image is kept unchanged;
step 8.5: the maximum connected region calculation is performed again; since the part between the light shadow and the hand bone has been discarded, the hand bone region is retained this time and the light shadow part is removed as well.
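A sketch of steps 8.1 to 8.4 on the binary marker image; the patch height and the difference threshold are illustrative, since the text only says they are chosen from the average shadow height and a set threshold:

```python
import numpy as np

def remove_bottom_shadow(mark, patch_h=40, diff_threshold=100):
    """Difference comparison on a bottom cross-section: if the whitest and
    least-white rows differ in more than `diff_threshold` columns, drop the
    rows between them (the light shadow)."""
    h = mark.shape[0]
    patch = mark[h - patch_h:] == 255          # bottom patch as booleans
    white = patch.sum(axis=1)                  # white count per row
    r_max, r_min = int(white.argmax()), int(white.argmin())
    t = int((patch[r_max] != patch[r_min]).sum())
    if t > diff_threshold:                     # light shadow detected
        lo, hi = sorted((r_max, r_min))
        keep = np.ones(h, dtype=bool)
        keep[h - patch_h + lo : h - patch_h + hi + 1] = False
        mark = mark[keep]
    return mark
```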
The technical conception of the invention is as follows: character information in the X-ray image is detected by a deep neural network, the contents of different areas of the X-ray image are classified by a deep neural network, and the resulting models can automatically and quickly extract the hand bone region of interest.
In the process provided by the invention, the first model M1, through sampling and training, distinguishes the most prominent letters L (representing the left hand) and R (representing the right hand) and some residual character information in the initial X-ray image from the rest of the background, so that the letter information is removed from the original image, laying an important foundation for the subsequent extraction of the hand bone region of interest. The purpose of the second model M2 is to obtain a hand bone marker map that distinguishes the interesting and non-interesting parts of the input picture by different labels; the invention marks them in white and black. Since the characters have already been removed with model M1, at this stage the model is used primarily to distinguish three classes: the background, the hand bone, and the hand bone/background intersection. The input image is sampled by a sliding window, and the detection results are then counted on each pixel point to obtain the marker map.
Compared with the traditional method of manually extracting the hand bone region of interest, the method has the following advantages: 1. the efficiency of obtaining the hand bone region of interest is greatly improved; 2. unified processing reduces errors caused by manual operation; 3. more accurate results than manual extraction can be obtained; 4. with different tasks at different stages, the layered operation flow reduces the risk that extreme pictures paralyze the whole flow, and is easier to maintain and improve.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to figs. 1 to 4, for the sampling operation on Output1, pictures with character information, especially the large letter L, are sampled mainly through a sliding window. By training model M1, the detection of whether a picture contains character information can be completed with almost 100% accuracy. When Output1 is input into the model, a selective search is used to detect the large letters as objects, while a uniform coverage strategy is used for the other characters in the corners. In this way, the time spent sliding over the original large image is saved, and the character information can be removed effectively.
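One way to run the selective-search step is OpenCV's implementation from the opencv-contrib package; a sketch, with the file name and the size filter for 100 × 100 letter candidates chosen for illustration:

```python
import cv2

img = cv2.imread("handbone.png")               # illustrative file name
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()
rects = ss.process()                           # proposals as (x, y, w, h)
# keep proposals near the 100 x 100 letter size and let model M1 judge them
candidates = [r for r in rects if 80 <= r[2] <= 120 and 80 <= r[3] <= 120]
```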
After the character information is removed, the images are normalized to a uniform size in order to reduce the time complexity of the subsequent acquisition of the marker map of the hand bone region of interest. The acquisition of the marker map is described in detail here:
1. The model M2 classifies the 16 × 16 samples into three categories: background, hand bone, and hand bone/background intersection.
2. With a step size of 1, a total of 500 × 500 samples are detected from Output3 by sliding a window.
3. The detection result of each sample is counted on the 16 × 16 pixels of that sample, giving per-category detection counts, so a 512 × 512 × 3 array is needed for the statistics; 0, 1 and 2 correspond respectively to the background, the hand bone/background intersection area, and the hand bone. The statistics are necessary because, supposing one sample is detected as background and the next sample as the intersection area, without statistics all the pixels separated by this step size would eventually be considered background and the boundary inside the intersection area would be cut off. According to the statistics of the three categories at each pixel, if category 0 is the most numerous the pixel is set to 0 (black); otherwise it is set to 255 (white). Finally the hand bone region (ROI) is white and the background noise is all black.
4. Because the time complexity of this algorithm is high, a tree array is used for optimization, the conventional tree array being extended into a two-dimensional tree array.
5. After the marker map is obtained, some white noise inevitably remains; it is removed by calculating and retaining the maximum connected region (a sketch follows). In addition, the hand bone region sometimes overlaps the light shadow at the bottom of the image, and the light shadow is very similar to hand bone tissue, so the model cannot distinguish the two accurately; difference detection is therefore performed on a cross-section extending upward from the bottom of the image, with the following specific steps: the target area is converted into a data array, the row with the most white dots and the row with the fewest white dots are found, and these two rows are compared column by column, a column counting as a difference when one row is white there and the other is black. If the number of differences is greater than the threshold, the difference is judged to be caused by the light shadow and the area between the two rows is discarded; the maximum connected region calculation is then performed again, so that the noise caused by the light shadow is removed.
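The maximum connected region calculation can be sketched with scipy's component labeling, keeping only the largest white component (assuming white is stored as 255):

```python
import numpy as np
from scipy import ndimage

def keep_largest_component(mark):
    """Retain only the largest white connected region of a binary marker map."""
    labels, count = ndimage.label(mark == 255)
    if count == 0:
        return mark
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0                               # label 0 is the background
    keep = sizes.argmax()
    return np.where(labels == keep, 255, 0).astype(np.uint8)
```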
Example: the hand bone X-ray images used cover an age range of 0 to 18 years, 944 specimens in total. 632 of the samples were used as the training set to train models M1 and M2, and the remaining 312 samples were used as the test set. The validation set coincides with the training set. The construction and testing processes of the two models are described below.
Model M1:
step 1.1, constructing a deep convolutional neural network, wherein the specific structure is shown in fig. 2.
Step 1.1.1: the convolutional neural network comprises 4 convolution modules, one Flatten layer and two fully-connected layers.
Step 1.1.2: in the convolutional layers, the convolution kernel size is 1 × 3, the input size is (1,100,100), and the number of convolution kernels increases with the depth of the network: 6, 12, 24 and 48 in sequence. Each convolution module also includes a relu activation layer and a MaxPooling pooling layer with pool_size (1, 2).
Step 1.1.3: before the fully-connected layers, there is a Flatten layer to flatten the output of the convolutional layers.
Step 1.1.4: among the fully-connected layers, the first is provided with a Dropout layer with parameter 0.5 to prevent overfitting, and the following fully-connected layer outputs through a sigmoid activation layer. A code sketch of this structure is given below.
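A sketch of model M1 in Keras following the stated hyperparameters (channels-first input (1,100,100), 1 × 3 kernels with 6/12/24/48 filters, pool_size (1,2), Dropout 0.5, sigmoid output); the width of the first Dense layer is not given in the text, so 64 is an assumption:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, Activation

def build_m1():
    m = Sequential()
    for k, filters in enumerate((6, 12, 24, 48)):   # 4 convolution modules
        if k == 0:
            m.add(Conv2D(filters, (1, 3), input_shape=(1, 100, 100),
                         data_format='channels_first'))
        else:
            m.add(Conv2D(filters, (1, 3), data_format='channels_first'))
        m.add(Activation('relu'))
        m.add(MaxPooling2D(pool_size=(1, 2), data_format='channels_first'))
    m.add(Flatten())
    m.add(Dense(64))                           # width assumed, not in the text
    m.add(Activation('relu'))
    m.add(Dropout(0.5))
    m.add(Dense(1))
    m.add(Activation('sigmoid'))               # letter vs. no letter
    return m
```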
Step 1.2, data sampling and model training
Step 1.2.1: the hand bone X-ray images are all grayscale images with 1 channel. 500 images were sampled, 99 random samples plus one letter-L sample being taken from each. All samples are concatenated into a numpy array whose first dimension is the number of samples, giving a four-dimensional array (50000, 1, 100, 100) as the training set; the test set is treated in the same way, giving (10000, 1, 100, 100). The validation set coincides with the training set.
Step 1.2.2: the model is trained in batches; the number of samples per batch for the training set generator and the validation set generator is 120, training runs for 200 epochs in total, a logarithmic loss function is used, and the rmsprop algorithm is used for optimization. Only the model with the highest accuracy is kept. A training sketch follows.
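A sketch of this training configuration, reusing build_m1 from above; the checkpoint file name is illustrative, and x_train/y_train stand for the (50000, 1, 100, 100) array and its labels:

```python
from keras.callbacks import ModelCheckpoint

model = build_m1()
model.compile(loss='binary_crossentropy',      # logarithmic loss
              optimizer='rmsprop', metrics=['accuracy'])
checkpoint = ModelCheckpoint('m1_best.h5', monitor='val_accuracy',
                             save_best_only=True)   # keep only the best model
model.fit(x_train, y_train, batch_size=120, epochs=200,
          validation_data=(x_train, y_train),  # validation set == training set
          callbacks=[checkpoint])
```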
Step 1.3, model testing
For the function of this model, see the operation flow described above; it is not repeated here.
Model M2:
Step 2.1, constructing a deep convolutional neural network, the specific structure of which is shown in fig. 2.
Step 2.1.1: the convolutional neural network comprises 2 convolution modules, one Flatten layer and two fully-connected layers.
Step 2.1.2: in the convolutional layers, the convolution kernel size is 1 × 3, the input size is (1,16,16), and the number of convolution kernels increases with the depth of the network: 6 and 24 in sequence. Each convolution module also includes a relu activation layer and a MaxPooling pooling layer with pool_size (1, 2).
Step 2.1.3: before the fully-connected layers, there is a Flatten layer to flatten the output of the convolutional layers.
Step 2.1.4: among the fully-connected layers, the first takes the 96 points flattened from the previous 24 feature maps as input and has a Dropout layer with parameter 0.05 to prevent overfitting; the following fully-connected layer outputs the three categories through a softmax activation layer. A code sketch of this structure is given below.
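A sketch of M2 in the same style (channels-first input (1,16,16), 1 × 3 kernels with 6 and 24 filters, pool_size (1,2), Dropout 0.05, three-way softmax output). Note that the 96 flatten inputs mentioned above suggest a pooling arrangement slightly different from what the literal layers below yield, and the first Dense width of 32 is an assumption:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, Activation

def build_m2():
    m = Sequential()
    m.add(Conv2D(6, (1, 3), input_shape=(1, 16, 16),
                 data_format='channels_first'))
    m.add(Activation('relu'))
    m.add(MaxPooling2D(pool_size=(1, 2), data_format='channels_first'))
    m.add(Conv2D(24, (1, 3), data_format='channels_first'))
    m.add(Activation('relu'))
    m.add(MaxPooling2D(pool_size=(1, 2), data_format='channels_first'))
    m.add(Flatten())
    m.add(Dense(32))                           # width assumed, not in the text
    m.add(Activation('relu'))
    m.add(Dropout(0.05))
    m.add(Dense(3))
    m.add(Activation('softmax'))               # background / intersection / bone
    return m
```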
Step 2.2, data sampling and model training
Step 2.2.1: the hand bone X-ray images are all grayscale images with 1 channel. The 500 images were sampled with a 16 × 16 sliding window at a step size of 32; the background being too uniform, a step size of 8 was used for the other classes, and that portion of the samples was doubled to increase its weight in the model training. This gives a four-dimensional array (100000, 1, 16, 16) as the training set, the first dimension being the number of samples; the test set is treated in the same way, giving (20000, 1, 16, 16). The validation set coincides with the training set.
Step 2.2.2: the model is trained in batches; the number of samples per batch for the training set generator and the validation set generator is 120, training runs for 2000 epochs, the loss is the multi-class logarithmic loss function, and the adam optimizer is selected. Only the model with the highest accuracy is kept. A training sketch follows.
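The corresponding training sketch for M2, reusing build_m2 from above; x_train/y_train stand for the (100000, 1, 16, 16) array and its one-hot labels, and the file name is illustrative:

```python
from keras.callbacks import ModelCheckpoint

model = build_m2()
model.compile(loss='categorical_crossentropy',  # multi-class log loss
              optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=120, epochs=2000,
          validation_data=(x_train, y_train),   # validation set == training set
          callbacks=[ModelCheckpoint('m2_best.h5', monitor='val_accuracy',
                                     save_best_only=True)])
```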
Step 2.3, model testing
Through the above steps, the extraction of the hand bone region of interest from hand bone X-ray images can be realized.
The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention, and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.