CN107871316B

CN107871316B - Automatic extraction method of hand bone interest region based on deep neural network

Info

Publication number: CN107871316B
Application number: CN201710975940.6A
Authority: CN
Inventors: 郝鹏翼; 陈易京; 尤堃; 吴福理; 黄玉娇; 白琮
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang Feitu Imaging Technology Co ltd
Priority date: 2017-10-19
Filing date: 2017-10-19
Publication date: 2020-10-27
Anticipated expiration: 2037-10-19
Also published as: CN107871316A

Abstract

A deep neural network-based automatic extraction method for the region of interest of hand bone in X-ray film, removes the part of the black background embedded text on both sides of the original hand bone X-ray film image, and uniformly brightens and removes the original hand bone X-ray film image. Noise operation; sample and train the model M1 to obtain the hand bone X-ray image without text Output2; normalize the size of Output2 to obtain Output3; sample and train the model M2, for the intersection of the hand bone, background, and hand bone background in Output3 Partial judgment; based on the model M2 to judge the image sliding window in Output3, according to the judgment value, the hand bone marker map Output4 is obtained; based on Output3 and Output4, the image Output5 with only the hand bone is obtained; Output5 is optimized to obtain the final hand bone interest area. The invention can automatically acquire the region of interest of the hand bone in the X-ray film.

Description

Automatic extraction method of hand bone interest region based on deep neural network

技术领域technical field

本发明涉及医学图像分析领域及机器学习领域，特别涉及一种应用于人体手骨X光片影像兴趣区域提取的方法，属于基于深度学习的医学影像分析领域。The invention relates to the field of medical image analysis and machine learning, in particular to a method for extracting regions of interest in human hand bone X-ray images, and belongs to the field of deep learning-based medical image analysis.

背景技术Background technique

骨骼年龄，简称骨龄，是由儿童的骨骼钙化程度所决定的。骨龄能较精确地反映人从出生到完全成熟的过程中各年龄阶段的发育水平，不仅如此，在内分泌疾病、发育障碍、营养障碍、遗传性疾病及代谢性疾病的分析与诊断方面，骨龄具有重要的作用。放射科医生通过对比儿童手部的X光片和他们对应年龄的标准状态，来测量儿童的骨龄。这项技术很稳定，已经持续了大约好几十年。随着家长们健康意识的提高，做骨龄检测的儿童数量与日俱增，但医疗机构的现状却并没有发生改变：儿科影像科的执业医生不多，并且读片效率不高。Skeletal age, referred to as bone age, is determined by the degree of calcification of a child's bones. Bone age can more accurately reflect the developmental level of people at all ages from birth to full maturity. Not only that, but also in the analysis and diagnosis of endocrine diseases, developmental disorders, nutritional disorders, hereditary diseases and metabolic diseases. important role. Radiologists measure a child's bone age by comparing X-rays of the child's hand with their age-appropriate standard. The technology is stable and has been around for decades. With the improvement of parents' health awareness, the number of children undergoing bone age testing is increasing day by day, but the status quo of medical institutions has not changed: there are not many licensed doctors in the pediatric imaging department, and the reading efficiency is not high.

随着GPU加速的深度学习技术，通过人工智能来完成对于骨龄的自动检测已经成为了可能。深度学习需要大量手骨X光图片来进行训练，以达到良好的检测效果。但原始的手骨X光图片中具有不少无用的信息，例如，文字，噪声等；而且原始的手骨X光图片中由于每一个人手的摆放不统一，机器的光照不统一，导致X光图片中手的位置以及亮度并不是固定的，这些因素会导致检测模型的学习效果不佳，从而引起较大的判断误差。所以对于基于深度学习的骨龄预测来说，准确地获取X光片中手骨区域而且将这些手骨区域处理为检测模型可以进行学习的良好样本是至关重要的。但是人为进行手部位的标注，不仅效率底下，而且不同的标注者的标注结果会有差异。With GPU-accelerated deep learning technology, automatic detection of bone age through artificial intelligence has become possible. Deep learning requires a large number of hand bone X-ray images for training to achieve good detection results. However, the original hand bone X-ray image has a lot of useless information, such as text, noise, etc. In the original hand bone X-ray image, due to the inconsistent placement of each person's hands and the uneven illumination of the machine, X-ray images are not uniform. The position and brightness of the hand in the light image are not fixed, these factors will lead to poor learning effect of the detection model, resulting in large judgment errors. Therefore, for bone age prediction based on deep learning, it is crucial to accurately obtain hand bone regions in X-ray films and treat these hand bone regions as good samples that the detection model can learn from. However, manual labeling of hand parts is not only inefficient, but also the labeling results of different labelers will be different.

发明内容SUMMARY OF THE INVENTION

为了克服现有X光片手骨兴趣区域提取方式的效率低下、误差较大、精度较低的不足，本发明提出了一种效率较高、误差较小、精度较高的基于深度神经网络的X光片手骨兴趣区域自动提取方法，不仅可以自动获取X光片中的手骨区域，而且还可以自动去除噪声，调整亮度。In order to overcome the shortcomings of low efficiency, large error and low precision of the existing X-ray film hand bone interest region extraction methods, the present invention proposes a deep neural network-based X-ray system with high efficiency, small error and high precision. The automatic extraction method of the region of interest of the hand bone in the light film can not only automatically obtain the hand bone region in the X-ray film, but also automatically remove noise and adjust the brightness.

为了解决上述技术问题本发明所采用的技术方案是：In order to solve the above-mentioned technical problems, the technical scheme adopted in the present invention is:

一种基于深度神经网络的X光片手骨兴趣区域自动提取方法，包括以下步骤：A method for automatically extracting a region of interest of a hand bone in an X-ray film based on a deep neural network, comprising the following steps:

步骤一，对原始手骨X光片灰度影像，去除影像两边的黑色背景嵌入文字的部分，从而使得原图像去除掉大部分的文字；Step 1: For the grayscale image of the original hand bone X-ray film, remove the black background embedded text on both sides of the image, so that most of the text is removed from the original image;

步骤二，对原始手骨X光片灰度影像进行提亮操作，该步骤，先对影像进行整体亮度评估，对于亮度不足的影像进行提亮操作，提亮之后进行去噪操作，得到的影像集称为Output1；The second step is to perform a brightening operation on the grayscale image of the original hand bone X-ray film. In this step, the overall brightness of the image is evaluated first, and the image with insufficient brightness is subjected to a brightening operation. After the brightening, a denoising operation is performed to obtain an image. The set is called Output1;

步骤三，采样并训练模型M1，该模型用于去除Output1中的X光片中手骨附近以及手骨上的文字，得到没有文字的手骨X光片灰度影像Output2；Step 3, sample and train the model M1, which is used to remove the text near the hand bone and on the hand bone in the X-ray film in Output1, and obtain a grayscale image of the hand bone X-ray film Output2 without text;

步骤四，将Output2中所有影像归一化尺寸，为保持高宽大小一致，先进行两侧黑底填充操作，当图片的宽比高的数值大时，采取两侧向内切割的操作，然后再对影像缩小至(512,512)，称新的影像集为Output3；Step 4: Normalize the size of all images in Output2. In order to keep the height and width consistent, first fill in the black background on both sides. When the width ratio of the image is larger than the height, the operation is to cut inward on both sides, and then Then reduce the image to (512,512), and call the new image set Output3;

步骤五，采样并训练模型M2，该模型用于对Output3中的手骨，背景，手骨背景相交的三部分进行判断，手骨背景的大小为16*16；Step 5: Sampling and training the model M2, which is used to judge the intersection of the hand bone, the background, and the hand bone background in Output3, and the size of the hand bone background is 16*16;

步骤六，有了模型M2，对Output3中的影像滑窗判断，将判断的值加到每个像素点中，接着每个像素点根据自己得到的不同判断类型的值的大小，将手骨部分的像素值设为255，背景部分的像素值设为0，从而得到手骨二值标记图，称之为Output4；Step 6, with the model M2, judge the sliding window of the image in Output3, add the judged value to each pixel, and then each pixel according to the size of the value of the different judgement types obtained by itself, put the hand bone part. The pixel value of the hand bone is set to 255, and the pixel value of the background part is set to 0, so as to obtain the hand bone binary map, which is called Output4;

步骤七，对照Output4的手骨二值标记图，在Output3的基础上得到仅有手骨区域的影像，但是同时该影像中依然会存在背景杂质，于是在这里需要进行一次最大联通区域的计算，去除杂质，得到Output5；Step 7, compare the hand bone binary labeling map of Output4, and obtain the image with only the hand bone area on the basis of Output3, but at the same time, there will still be background impurities in the image, so it is necessary to perform a calculation of the maximum connected area here. Remove impurities to get Output5;

步骤八，由于Output5中存在影像周围光圈被判断为手骨组织的现象，并且它与最大联通区域是相连接的；由于光圈的长远远大于手骨的宽，于是，对Output5中每张影像的底部部分做差异比较，从而去除影像中的光圈，得到最终的手骨兴趣区域。Step 8: Because the aperture around the image in Output5 is judged to be hand bone tissue, and it is connected to the largest connected area; since the length of the aperture is much larger than the width of the hand bone, therefore, for each image in Output5 The bottom part is used for difference comparison to remove the halo in the image and get the final hand bone region of interest.

至此，操作流程叙述完毕。So far, the description of the operation flow is completed.

进一步，所述步骤一中，去除黑色背景中嵌入文字部分的方法为：先将影像转化为数值数组，最左右两列开始向中间检测，由于纯白文字与纯黑背景的特殊性，如果存在的非纯黑的个数超过整列的10％，则可以判断这里开始为非黑色背景嵌文字部分，那么之前的部分将被全部切割掉。Further, in the first step, the method of removing the embedded text in the black background is as follows: first convert the image into a numerical array, and then start to detect the two left and right columns toward the middle. Due to the particularity of pure white text and pure black background, if there is If the number of non-pure black is more than 10% of the entire column, it can be judged that this is the non-black background inlaid text part, then the previous part will be cut off.

再进一步，所述步骤二中，整体亮度评估过程为：设一张原影像O的分辨率为M×N，每一个像素点的值为t_ij，通过公式

计算该影像的整体亮度。这里考虑像素值大于120的像素点，对于不同的Aug值，用不同的参数进行提亮。Further, in the second step, the overall brightness evaluation process is as follows: set the resolution of an original image O to be M×N, and the value of each pixel point is t _ij , through the formula

Calculates the overall brightness of the image. Here, pixels with pixel values greater than 120 are considered. For different Aug values, different parameters are used to brighten.

更进一步，所述步骤三中，对于模型M1的训练，样本采集过程为：截取原灰度影像中含字母的100*100的样本，作为正样本；再截取完全不包含字母的同样尺寸样本，作为负样本；构建二维卷积神经网络的过程为：Further, in the third step, for the training of the model M1, the sample collection process is as follows: intercepting 100*100 samples containing letters in the original grayscale image as positive samples; then intercepting samples of the same size that do not contain letters at all, As a negative sample; the process of building a two-dimensional convolutional neural network is:

步骤3.1输入图像经过Conv2D卷积层提取局部特征，输入size为100*100*1，接着是relu激活函数层，Maxpooling池化层。Step 3.1 The input image goes through the Conv2D convolution layer to extract local features, and the input size is 100*100*1, followed by the relu activation function layer and the Maxpooling pooling layer.

步骤3.2提取size不同的三层Conv2D卷积层，其中的激活函数层和池化层结构与3.1一致。Step 3.2 extracts three Conv2D convolution layers with different sizes, and the activation function layer and pooling layer structure are consistent with 3.1.

步骤3.3经过一个Flatten层来连接上述4层卷积层与接下来的全连接层。Step 3.3 connects the above 4-layer convolutional layer with the next fully connected layer through a Flatten layer.

步骤3.4上述特征通过第一个全连接层，内部顺序包括Dense层，relu激活层，Dropout层防止过拟合。接下来是第二个全连接层，内部顺序包括Dense层，以及sigmoid激活层，得到输出结果。Step 3.4 The above features pass through the first fully connected layer, and the internal sequence includes the Dense layer, the relu activation layer, and the Dropout layer to prevent overfitting. Next is the second fully connected layer, the internal sequence includes the Dense layer, and the sigmoid activation layer to get the output.

其中，通过SelectiveSearch(选择性搜索)方法在影像中找寻字母并判断，因为字母的特殊性，字母L均能通过该方法找到，并且精确地定位成100*100的尺寸。由于其他字母主要集中于影像右上角，并且右上角文字不会影响手骨区域的判断，于是在找到L并去除之后，对右上角区域根据背景周边值的平均值进行填充。Among them, the SelectiveSearch (selective search) method is used to find letters in the image and judge, because of the particularity of letters, the letter L can be found by this method, and it is precisely positioned to a size of 100*100. Since other letters are mainly concentrated in the upper right corner of the image, and the text in the upper right corner will not affect the judgment of the hand bone area, after finding L and removing it, the upper right corner area is filled according to the average value of the surrounding background values.

所述步骤五中，模型M2的训练过程为：在通过步骤四去除了文字干扰之后，这一模型用来分辨背景，手骨，手骨背景相交区域。采用滑窗对这三类分别采样，定义为0，1，2三类，分别对应背景，手骨背景相交区域，手骨；构建二维卷积神经网络模型的过程为：In the fifth step, the training process of the model M2 is as follows: after removing the text interference in the fourth step, the model is used to distinguish the background, the hand bone, and the hand-bone-background intersection area. Sliding windows are used to sample these three categories respectively, which are defined as 0, 1, and 2, which correspond to the background, the hand-bone-background intersection area, and the hand bone respectively. The process of constructing a two-dimensional convolutional neural network model is as follows:

步骤5.1输入图像经过二维卷积层提取局部特征，输入尺寸为16*16*1，接着是relu激活函数层，Maxpooling池化层；Step 5.1 The input image is extracted through a two-dimensional convolution layer to extract local features, and the input size is 16*16*1, followed by the relu activation function layer and the Maxpooling pooling layer;

步骤5.2提取尺寸不同的一层Conv2D二维卷积层，其中的激活函数层和Maxpooling层结构与5.1一致；Step 5.2 Extract a layer of Conv2D two-dimensional convolution layers with different sizes, and the activation function layer and Maxpooling layer structure are consistent with 5.1;

步骤5.3经过一个Flatten层来连接上述卷积层与接下来的全连接层；Step 5.3 Connect the above convolutional layer and the next fully connected layer through a Flatten layer;

步骤5.4上述特征通过第一个全连接层，内部顺序包括Dense层，relu激活层，Dropout层防止过拟合，接下来是第二个全连接层，内部顺序包括Dense层，以及softmax激活层(输出为三类)，得到输出结果。Step 5.4 The above features pass through the first fully connected layer, the internal order includes the Dense layer, the relu activation layer, the Dropout layer to prevent overfitting, followed by the second fully connected layer, the internal order includes the Dense layer, and the softmax activation layer ( The output is three categories), and the output result is obtained.

所述步骤六中，滑窗判断的过程如下：In the step 6, the process of sliding window judgment is as follows:

步骤6.1对输入图像进行滑窗，16*16，步长为1；Step 6.1 Perform a sliding window on the input image, 16*16, and the step size is 1;

步骤6.2对于16*16的patch，用模型M2进行判断，得到的值x对于patch中16*16个像素的x值都加1，可以看到这里每个像素都有统计不同x值数量的过程，定义val[512][512][3]来统计每个像素不同x值的数量；Step 6.2 For the 16*16 patch, use the model M2 to judge, and the obtained value x adds 1 to the x value of the 16*16 pixels in the patch. It can be seen that each pixel has a process of counting the number of different x values. , define val[512][512][3] to count the number of different x values for each pixel;

步骤6.3定义一个输出数组result，对于result中的每一个点，对应val数组中，只比较0类和2类的数量，如果2类数量较多，result对应点就填充为255，即为白色，反之就是0，黑色；Step 6.3 Define an output array result. For each point in the result, in the corresponding val array, only compare the number of types 0 and 2. If the number of types 2 is large, the corresponding point of the result is filled with 255, which is white. Otherwise, it is 0, black;

由于滑窗加对应到像素的统计，复杂度为O(n×n×m×m)，由于我们只需要单点查询，这里通过树状数组来进行复杂度的降低，统计复杂度降为O(n×n×log²(n)),单点查询复杂度为log²(n)。Since the sliding window adds the statistics corresponding to the pixels, the complexity is O(n×n×m×m). Since we only need a single-point query, the tree array is used to reduce the complexity, and the statistical complexity is reduced to O (n×n×log ² (n)), the single-point query complexity is log ² (n).

所述步骤八中，差异比较的过程为：In the step 8, the process of difference comparison is:

步骤8.1：图像先转换为数值数组，由于光影存在于底部，因此只对底部向上取一定高度的横截面(patch)进行操作，高度根据光影的平均高度决定；Step 8.1: The image is first converted into a numerical array. Since the light and shadow exist at the bottom, only the cross section (patch) with a certain height upward from the bottom is operated, and the height is determined according to the average height of the light and shadow;

步骤8.2：对patch中每一行进行白色统计，保留最多白色行和最少的白色行；Step 8.2: Perform white statistics on each line in the patch, and keep the most white lines and the least white lines;

步骤8.3：对于保留的这两行，逐列进行比较，一白一黑即为不同，记下不同的数量t；Step 8.3: For the two reserved rows, compare them column by column, one white and one black are different, and note the different quantity t;

步骤8.4：t大于设定的差异阀值时，即判断出现了光影，于是舍弃掉这两行之间的部分，如果不大于阀值，就保持不变；Step 8.4: When t is greater than the set difference threshold, it is judged that there is light and shadow, so the part between the two lines is discarded, and if it is not greater than the threshold, it will remain unchanged;

步骤8.5：再做一次最大连通区域计算，由于舍弃了光影与手骨之间的部分，这一次手骨区域将被保留下来，同时也去除了光影部分。Step 8.5: Do the calculation of the maximum connected area again. Since the part between the light and shadow and the hand bone is discarded, the hand bone area will be retained this time, and the light and shadow part will also be removed.

本发明的技术构思为：采用深度神经网络检测X光片影像中的文字信息，并采用深度神经网络对X光片影像中的不同区域内容进行分类，得到的模型可以自动快速地提取出手骨感兴趣区域。The technical idea of the present invention is as follows: using a deep neural network to detect the text information in the X-ray image, and using the deep neural network to classify the contents of different regions in the X-ray image, the obtained model can automatically and quickly extract the skinnyness of the hand area of interest.

本发明给出的流程中，第一个模型M1，针对初始X光片影像中最显眼的字母L(代表左手)，R(代表右手)和一些残留文字信息，通过采样训练，使其和其他背景区分出来，从而实现从原图中去除字母信息的任务，为之后的手骨兴趣区域提取奠定了重要的基础。第二个模型M2的目的是得到手骨标记映射，该模型对输入图片中感兴趣的部分和不感兴趣部分通过不同的标签区分开来，本发明用的是通过置为白和黑来标记。由于已经在模型M1中去除了文字，所以在这个阶段，模型主要用来区别三类，背景，手骨，手骨背景相交区域。通过对输入图像滑窗采样，之后将检测结果在每一个像素点上统计，得到标记映射。In the process given in the present invention, the first model M1, for the most conspicuous letters L (representing the left hand), R (representing the right hand) and some residual text information in the initial X-ray image, is trained by sampling to make it match other The background is distinguished, so as to realize the task of removing letter information from the original image, which lays an important foundation for the subsequent extraction of the region of interest of the hand bone. The purpose of the second model M2 is to obtain the hand bone labeling map. The model distinguishes the interesting part and the uninteresting part in the input image by different labels. The present invention uses white and black to label. Since the text has been removed in the model M1, at this stage, the model is mainly used to distinguish three categories, the background, the hand bone, and the hand-bone-background intersection area. By sampling the sliding window of the input image, and then counting the detection results on each pixel, the marker map is obtained.

跟传统的人为提取手骨感兴趣区域的方法相比，本发明的优势在于：1.大大地提高了获取手骨兴趣区域的效率。2.统一的处理，可以减少人为操作所带来的误差。3.可以得到比人为提取更加精准的结果。4.不同阶段不同任务，分层的操作流程减少了极端图片所带来的整个流程的瘫痪风险，更易于维护和改进。Compared with the traditional method of manually extracting the region of interest of the hand bone, the present invention has the following advantages: 1. The efficiency of obtaining the region of interest of the hand bone is greatly improved. 2. Unified processing can reduce errors caused by human operation. 3. More accurate results can be obtained than manual extraction. 4. Different tasks in different stages, the layered operation process reduces the risk of paralysis of the entire process caused by extreme pictures, and is easier to maintain and improve.

附图说明Description of drawings

图1基于深度神经网络的X光片手骨兴趣区域提取的流程图。Figure 1. Flow chart of extracting the region of interest of hand bone in X-ray film based on deep neural network.

图2用于去除文字的模型M1结构示意图。FIG. 2 is a schematic structural diagram of a model M1 for removing characters.

图3用来构造手骨标记映射的模型M2结构示意图。Figure 3 is a schematic diagram of the structure of the model M2 used to construct hand bone marker mapping.

图4模型M1和模型M2中的卷积模块结构示意图。Figure 4 is a schematic diagram of the structure of the convolution module in model M1 and model M2.

具体实施方式Detailed ways

下面结合附图对本发明作进一步描述。The present invention will be further described below in conjunction with the accompanying drawings.

参照图1～图4，对于Output1的采样工作，主要是通过滑窗采样带有文字信息的图片，尤其是大字母L。通过训练模型M1能够几乎100％地完成对于图片中是否具有文字信息的检测。在Output1输入到模型中时，采用选择性搜索(SelectiveSearch)将大字母作为对象检测出来，同时对于存在于角落的其他文字，采用统一覆盖的策略。这样做省去了对原始大图像进行滑窗的时间，又能够有效地去除文字信息。Referring to Figures 1 to 4, for the sampling work of Output1, the sliding window is mainly used to sample pictures with text information, especially the large letter L. By training the model M1, the detection of whether there is text information in the picture can be completed almost 100%. When Output1 is input into the model, a selective search (SelectiveSearch) is used to detect large letters as objects, and for other characters existing in the corners, a unified coverage strategy is adopted. This saves the time of sliding the original large image, and can effectively remove text information.

本发明在去除文字信息之后将影像归一化到统一尺寸，这将降低后续获取手骨感兴趣区域的标记映射时的时间复杂度。对于标记映射的获取，这里进行详细的叙述：In the present invention, the image is normalized to a uniform size after removing the text information, which will reduce the time complexity of the subsequent acquisition of the marker map of the region of interest of the hand bone. For the acquisition of tag mapping, here is a detailed description:

1.模型M2是能够对16*16的样本进行分类的，具体分为三类，背景，手骨，手骨背景相交区域。1. Model M2 is able to classify 16*16 samples, which are divided into three categories: background, hand bone, and hand-bone-background intersection area.

2.通过对Output3的滑窗，步长为1，共有500*500的样本待检测。2. Through the sliding window of Output3, the step size is 1, and a total of 500*500 samples are to be detected.

3.每一个样本的检测结果，都会在该样本16*16个像素点上进行统计各个类别的检测结果。所以需要一个512*512*3的数组来进行统计。0,1,2分别对应背景，手骨背景相交区域，手骨。统计是必要的，因为假设前一个样本检测到是背景，下一个样本检测到是相交区域，这一步长所差异的全部像素最终就会被认为是背景，从而相交区域中的边界得以划分。根据像素内三个类别的统计数据，0类别数量多则该像素就置为0(黑色)，反之则为255(白色)。最终我们能得到手骨区域(ROI)是白色的，背景噪音全是黑色。3. The detection results of each sample will be counted on the 16*16 pixel points of the sample for each category of detection results. So an array of 512*512*3 is needed for statistics. 0, 1, and 2 correspond to the background, the intersection area of the hand bone and the background, and the hand bone, respectively. Statistics are necessary because assuming that the previous sample is detected as the background and the next sample is detected as the intersection area, all pixels that differ by this step will eventually be considered as the background, so that the boundaries in the intersection area are divided. According to the statistical data of the three categories in the pixel, if the number of 0 categories is large, the pixel is set to 0 (black), otherwise it is 255 (white). In the end we can get the hand bone region (ROI) to be white and the background noise to be all black.

4.由于该算法的时间复杂度高，因此我们使用了树状数组来进行优化，并将常规的树状数组扩展为二维树状数组。4. Due to the high time complexity of the algorithm, we use a tree-like array for optimization, and expand the regular tree-like array into a two-dimensional tree-like array.

5.得到标记映射之后，难免会有一些白色噪声，这里通过计算并保留最大联通区域去除白色噪声。此外，手骨区域和影像底部的光影，有时会重合，而且光影和手骨组织又极其相似，使得模型并不能将其准确的区分开来，这里通过对影像底部向上部分横截面进行差异检测，具体做法为：先将目标区域转换为数据数组，计算得出最多白点的行与最少白点的行，接着对它们逐列进行比较，要么均为白色，要么均为黑色，否则则视为差异。差异大于阈值的，则视为是光影造成的，将这两行之间的区域舍弃掉，之后再做一次最大联通区域计算，则可以去除掉光影所带来的噪声。5. After the marker map is obtained, it is inevitable that there will be some white noise. Here, the white noise is removed by calculating and retaining the maximum connected area. In addition, the light and shadow of the hand bone area and the bottom of the image sometimes overlap, and the light and shadow are very similar to the hand bone tissue, so that the model cannot accurately distinguish them. The specific method is: first convert the target area into a data array, calculate the row with the most white points and the row with the least white point, and then compare them column by column, either all of them are white or all black, otherwise they are regarded as difference. If the difference is greater than the threshold, it is considered to be caused by light and shadow. Discard the area between the two lines, and then do the calculation of the maximum connected area again to remove the noise caused by light and shadow.

实例：使用的手骨X光影像包含的年龄范围为0岁到18岁，共944个样本。将其中的632例样本作为训练集，训练模型M1和模型M2，剩余312例样本作为测试集。验证集和训练集重合。下面分别介绍两个模型的构建和测试过程。Example: The hand bone X-ray image used contains an age range of 0 to 18 years, with a total of 944 samples. Take the 632 samples as the training set, train the model M1 and the model M2, and use the remaining 312 samples as the test set. The validation set and training set coincide. The construction and testing process of the two models are described below.

模型M1：Model M1:

步骤1.1，构建深度卷积神经网络，具体结构如图2所示。Step 1.1, build a deep convolutional neural network, the specific structure is shown in Figure 2.

步骤1.1.1：此卷积神经网络包含4个卷积模块，一个Flatten层以及两个全连接层组成Step 1.1.1: This convolutional neural network consists of 4 convolutional modules, a Flatten layer and two fully connected layers

步骤1.1.2：卷积层中，卷积核大小为1*3，输入大小为(1,100,100)，卷积核数量随网络深入而增大，依次为6,12,24,48个。每个卷积模块还包括relu激活层，以及Maxpooling池化层，pool_size为(1,2)。Step 1.1.2: In the convolution layer, the size of the convolution kernel is 1*3, the input size is (1, 100, 100), and the number of convolution kernels increases with the depth of the network, which are 6, 12, 24, and 48 in turn. Each convolution module also includes a relu activation layer, and a Maxpooling pooling layer with a pool_size of (1, 2).

步骤1.1.3：在全连接层之前，会有一层flatten层，来压平卷积层的输出。Step 1.1.3: Before the fully connected layer, there will be a flatten layer to flatten the output of the convolutional layer.

步骤1.1.4：全连接层中，第一个全连接层有Dropout层来防止过拟合，参数为0.5，之后的全连接层通过sigmoid激活层输出。Step 1.1.4: In the fully connected layer, the first fully connected layer has a Dropout layer to prevent overfitting, the parameter is 0.5, and the subsequent fully connected layers are output through the sigmoid activation layer.

步骤1.2，数据采样及模型训练Step 1.2, data sampling and model training

步骤1.2.1：手骨X光片影像均为灰度图，通道数为1。对500个图像进行取样，每张图随机取样99个样本加上一个字母L的样本。再将所有样本连接为一个numpy数组，第一个维度是样本数量，由此得到一个(50000,1,100,100)的四维数组作为训练集；测试集采取同样的方式处理为(10000,1,100,100)。验证集和训练集重合。Step 1.2.1: The X-ray images of the hand bone are all grayscale images, and the number of channels is 1. 500 images were sampled, each image was randomly sampled 99 samples plus a sample of the letter L. Then connect all the samples into a numpy array, the first dimension is the number of samples, thus a four-dimensional array of (50000, 1, 100, 100) is obtained as the training set; the test set is processed in the same way as (10000, 1, 100, 100). The validation set and training set coincide.

步骤1.2.2：模型采用batch训练的方式，训练集生成器和验证集生成器每个batch的样本数均为120，共训练200轮次，采用对数损失函数，并采用rmsprop算法进行优化。模型仅保留正确率最高的模型。Step 1.2.2: The model adopts the batch training method. The number of samples in each batch of the training set generator and the verification set generator is 120, and a total of 200 rounds of training are used. The logarithmic loss function is used, and the rmsprop algorithm is used for optimization. The model only keeps the model with the highest accuracy.

步骤1.3，模型测试Step 1.3, Model Testing

对于模型M2的作用，详见操作流程，这里不再赘述。For the role of the model M2, please refer to the operation process for details, and will not be repeated here.

模型M2：Model M2:

步骤2.1，构建深度卷积神经网络，具体结构如图2所示。Step 2.1, build a deep convolutional neural network, the specific structure is shown in Figure 2.

步骤2.1.1：此卷积神经网络包含2个卷积模块，一个Flatten层以及两个全连接层组成Step 2.1.1: This convolutional neural network consists of 2 convolutional modules, a Flatten layer and two fully connected layers

步骤2.1.2：卷积层中，卷积核大小为1*3，输入大小为(1,16,16)，卷积核数量随网络深入而增大，依次为6,24个。每个卷积模块还包括relu的激活层，以及Maxpooling层，pool_size为(1,2)。Step 2.1.2: In the convolution layer, the size of the convolution kernel is 1*3, the input size is (1, 16, 16), and the number of convolution kernels increases with the depth of the network, 6 and 24 in turn. Each convolution module also includes the activation layer of relu, and the Maxpooling layer, and the pool_size is (1, 2).

步骤2.1.3：在全连接层之前，会有一层flatten层，来压平卷基层的输出。Step 2.1.3: Before the fully connected layer, there will be a flatten layer to flatten the output of the volume base layer.

步骤2.1.4：全连接层中，第一个全连接层将之前的24个输出在96个输入点输入，并且有Dropout层来防止过拟合，参数为0.05，之后的全连接层通过softmax的激活层输出，输出为三类。Step 2.1.4: In the fully connected layer, the first fully connected layer inputs the previous 24 outputs at 96 input points, and has a Dropout layer to prevent overfitting, the parameter is 0.05, and the subsequent fully connected layers pass softmax The activation layer output of , the output is three categories.

步骤2.2，数据采样及模型训练Step 2.2, data sampling and model training

步骤2.2.1：手骨X光片影像均为灰度图，通道数为1。对500个图像进行滑窗取样，16*16的滑窗，步长为32，对于背景过于独特的，步长为8，并且会对这部分样本做加倍处理，以增加它在模型训练中的权重。由此得到一个(100000,1,16,16)的四维数组作为训练集，其中第一个维度是样本数量；测试集采取同样的方式处理为(20000,1,16,16)。验证集和训练集重合。Step 2.2.1: The X-ray images of the hand bone are all grayscale images, and the number of channels is 1. Sliding window sampling is performed on 500 images, 16*16 sliding window, step size is 32, for backgrounds that are too unique, step size is 8, and this part of the sample will be doubled to increase its performance in model training. Weights. As a result, a four-dimensional array of (100000, 1, 16, 16) is obtained as the training set, where the first dimension is the number of samples; the test set is processed in the same way as (20000, 1, 16, 16). The validation set and training set coincide.

步骤2.2.2：模型采用batch训练的方式，训练集生成器和验证集生成器每个batch的样本数均为120，共训练2000轮次，loss使用多分类的对数损失函数，optimizer选用adam。模型仅保留正确率最高的模型。Step 2.2.2: The model adopts the batch training method. The number of samples in each batch of the training set generator and the validation set generator is 120, and a total of 2000 rounds of training are used. The loss uses the multi-class logarithmic loss function, and the optimizer selects adam . The model only keeps the model with the highest accuracy.

步骤2.3，模型测试Step 2.3, Model Testing

经过上述步骤的操作，即可实现对手骨X光片影像中手骨兴趣区域的提取。After the operations of the above steps, the extraction of the region of interest of the hand bone in the X-ray image of the hand bone can be realized.

以上所述的具体描述，对发明的目的、技术方案和有益效果进行了进一步详细说明，所应理解的是，以上所述仅为本发明的具体实施例，用于解释本发明，并不用于限定本发明的保护范围，凡在本发明的精神和原则之内，所做的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The above-mentioned specific description further describes the purpose, technical solutions and beneficial effects of the invention in detail. It should be understood that the above-mentioned descriptions are only specific embodiments of the present invention, which are used to explain the present invention and are not intended to be used for The protection scope of the present invention is limited, and any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included within the protection scope of the present invention.

Claims

1. an automatic extraction method of X-ray hand bone region of interest based on deep neural network, is characterized in that: described method comprises the following steps:

Step 1: For the grayscale image of the original hand bone X-ray film, remove the black background embedded text on both sides of the image, so that most of the text is removed from the original image;

The second step is to perform a brightening operation on the grayscale image of the original hand bone X-ray film. In this step, the overall brightness of the image is evaluated first, and the image with insufficient brightness is subjected to a brightening operation. After the brightening, a denoising operation is performed to obtain an image. The set is called Output1;

Step 3, sample and train the model M1, which is used to remove the text near the hand bone and on the hand bone in the X-ray film in Output1, and obtain a grayscale image of the hand bone X-ray film Output2 without text;

Step 4: Normalize the size of all images in Output2. In order to keep the height and width consistent, first fill in the black background on both sides. When the width ratio of the image is larger than the height, the operation is to cut inward on both sides, and then Then reduce the image to (512,512), and call the new image set Output3;

Step 5: Sampling and training the model M2, which is used to judge the intersection of the hand bone, the background, and the hand bone background in Output3, and the size of the hand bone background is 16*16;

Step 6, with the model M2, judge the sliding window of the image in Output3, add the judged value to each pixel, and then each pixel according to the size of the value of the different judgement types obtained by itself, put the hand bone part. The pixel value of the hand bone is set to 255, and the pixel value of the background part is set to 0, so as to obtain the hand bone binary map, which is called Output4;

Step 7, compare the hand bone binary labeling map of Output4, and obtain the image with only the hand bone area on the basis of Output3, but at the same time, there will still be background impurities in the image, so it is necessary to perform a calculation of the maximum connected area here. Remove impurities to get Output5;

Step 8. Since there is a phenomenon that the aperture around the image in Output5 is judged as hand bone tissue, and it is connected to the largest connected area, since the length of the aperture is much larger than the width of the hand bone, the bottom of each image in Output5 is determined. Part of the difference is compared, so as to remove the aperture in the image, and get the final hand bone interest area.

2. The deep neural network-based automatic extraction method for the region of interest of the hand bone in the X-ray film as claimed in claim 1, wherein in the step 1, the method for removing the embedded text part in the black background is: first convert the image into Numerical array, the two most left and right columns start to be detected in the middle. Due to the particularity of pure white text and pure black background, if the number of non-black backgrounds that exist exceeds 10% of the entire column, it can be judged that the non-black background is embedded here. part, then the previous part will be completely cut off.

3. The deep neural network-based automatic extraction method for the hand bone region of interest in X-ray films as claimed in claim 1 or 2, wherein: in the step 2, the overall brightness evaluation process is: set the resolution of an original image O The rate is M×N, and the value of each pixel is t _ij , through the formula

Calculate the overall brightness of the image, consider pixels with pixel values greater than 120, and use different parameters to brighten for different Aug values.

4. The deep neural network-based automatic extraction method for the region of interest of hand bones in X-ray films as claimed in claim 1 or 2, characterized in that: in the step 3, for the training of the model M1, the sample collection process is: intercepting the original ash The 100*100 samples containing letters in the degree image are used as positive samples; then the samples of the same size that do not contain letters at all are taken as negative samples; the process of constructing a two-dimensional convolutional neural network is as follows:

Step 3.1 The input image is extracted through the Conv2D convolution layer to extract local features, and the input size is 100*100*1, followed by the relu activation function layer and the Maxpooling pooling layer;

Step 3.2 extracts three layers of Conv2D convolution layers with different sizes, and the activation function layer and pooling layer structure are consistent with step 3.1;

Step 3.3 Connect the above 4-layer convolutional layer and the next fully connected layer through a Flatten layer;

Step 3.4 The above features pass through the first fully connected layer, the internal sequence includes the Dense layer, the relu activation layer, the Dropout layer to prevent overfitting, followed by the second fully connected layer, the internal sequence includes the Dense layer, and the sigmoid activation layer, get the output result;

The selective search method is used to find letters in the image and judge. Because of the particularity of the letters, the letter L can be found by this method, and it can be accurately located in the size of 100*100; since other letters are mainly concentrated in the upper right corner of the image, and The text in the upper right corner will not affect the judgment of the hand bone area, so after finding L and removing it, the upper right corner area is filled according to the average value of the surrounding background values.

5. The deep neural network-based method for automatically extracting regions of interest in X-rays of hand bones as claimed in claim 1 or 2, characterized in that: in the step 5, the training of the model M2, after removing the word interference through the step 3 , this model is used to distinguish background, hand bone and hand bone background intersection area; use sliding window to sample these three categories respectively, defined as 0, 1, 2 three categories, corresponding to background, hand bone background intersection area, hand bone ; The process of building a two-dimensional convolutional neural network model is:

Step 5.1 The input image is extracted through a two-dimensional convolution layer to extract local features, and the input size is 16*16*1, followed by the relu activation function layer and the Maxpooling pooling layer;

Step 5.2 Extract a layer of Conv2D two-dimensional convolution layers with different sizes, and the activation function layer and Maxpooling layer structure are consistent with 5.1;

Step 5.3 Connect the above convolutional layer and the next fully connected layer through a Flatten layer;

Step 5.4 The above features pass through the first fully connected layer, and the internal sequence includes the Dense layer, the relu activation layer, and the Dropout layer to prevent overfitting; followed by the second fully connected layer, the internal sequence includes the Dense layer, and the sotfmax activation layer, get the output.

6. The method for automatically extracting the region of interest of X-ray hand bone based on deep neural network as claimed in claim 5, it is characterized in that: in described step 6, the process of sliding window judgment is:

Step 6.1 Perform a sliding window on the input image, 16*16, and the step size is 1;

Step 6.2 For the 16*16 patch, use the model M2 to judge, and the obtained value x adds 1 to the x value of the 16*16 pixels in the patch. It can be seen that each pixel has a process of counting the number of different x values. , define val[512][512][3] to count the number of different x values for each pixel;

Step 6.3 Define an output array result. For each point in the result, in the corresponding val array, only compare the number of types 0 and 2. If the number of types 2 is large, the corresponding point of the result is filled with 255, which is white. Otherwise, it is 0, black.

7. The deep neural network-based automatic extraction method for the hand bone region of interest in X-ray films as claimed in claim 1 or 2, wherein in the step 8, the process of difference comparison is:

Step 8.1: The image is first converted into a numerical array. Since the light and shadow exist at the bottom, only the cross-sectional patch with a certain height upward from the bottom is operated, and the height is determined according to the average height of the light and shadow;

Step 8.2: Perform white statistics on each line in the patch, and keep the most white lines and the least white lines;

Step 8.3: For the two reserved rows, compare them column by column, one white and one black are different, and note the different quantity t;

Step 8.4: When t is greater than the set difference threshold, it is judged that there is light and shadow, so the part between the two lines is discarded, and if it is not greater than the threshold, it will remain unchanged;

Step 8.5: Do the calculation of the maximum connected area again. Since the part between the light and shadow and the hand bone is discarded, the hand bone area will be retained this time, and the light and shadow part will also be removed.