CN112102285B

CN112102285B - Bone age detection method based on multi-modal countermeasure training

Info

Publication number: CN112102285B
Application number: CN202010962917.5A
Authority: CN
Inventors: 陈吉; 王星; 林清水; 杜伟; 陈海涛; 沈芷佳
Original assignee: Liaoning Technical University
Current assignee: Liaoning Technical University
Priority date: 2020-09-14
Filing date: 2020-09-14
Publication date: 2024-03-12
Anticipated expiration: 2040-09-14
Also published as: CN112102285A

Abstract

The invention provides a bone age detection method based on multi-modal adversarial ("countermeasure") training, comprising: constructing a bone age prediction data set; constructing and training a bone age detection model based on multi-modal adversarial training; and, in the prediction stage, retaining only the discriminator of the bone age detection model, adding a softmax to its last layer, and loading the weights of the best-trained discriminator to predict the result. The method uses adversarial training to improve the accuracy of model prediction; performs multi-modal bone age detection using medical images and text medical records; is trained on an X-ray data set of Chinese adolescents, so that the bone age results fit Chinese adolescents; and, by combining text information with the images, achieves more accurate bone age identification on multi-modal data.

Description

Bone age detection method based on multi-modal countermeasure training

Technical Field

The invention relates to the technical field of bone age detection, and in particular to a bone age detection method based on multi-modal adversarial (countermeasure) training.

Background

Bone age is an important indicator of growth and development and plays an important role in medicine, sports, forensic identification and other fields. The degree of skeletal calcification in children is the determinant of bone age, which accurately reflects a person's developmental level at each stage of growth. A child's bone age is typically determined by a radiologist who compares X-ray films of the child's hand with the standard appearance for the corresponding age.

Bone age assessment in adolescents plays an important role in diagnosing pediatric endocrine problems and childhood growth disorders. It is commonly used to screen for conditions such as pediatric endocrine disorders, growth retardation and congenital adrenal hyperplasia, and to evaluate the effect of interventions. In addition, bone age can be used to establish the true age of minors, for example in juvenile criminal cases and to confirm the age of athletes in sports competitions.

The radius, metacarpals, capitate, hamate, phalanges and other bones of the subject are X-rayed to obtain X-ray images in three different orientations: distal, middle and proximal. The X-ray images, together with necessary information about the subject (such as the parents' heights and the subject's medical history, date of birth and height, which may be dictated and captured by speech recognition or filled in manually), are processed to obtain multi-source modal data.

Standard methods for assessing skeletal maturity have existed for nearly 100 years. The Tanner-Whitehouse (TW) method and the Greulich-Pyle (GP) method are the two in common use. The G-P atlas was compiled in the 1930s to 1950s from children of upper-class families in the United States, covering birth to adulthood. It was published by Greulich and Pyle in 1950, revised in 1959, and has been widely used ever since. There are two ways to evaluate bone age with the G-P atlas. One evaluates the bone age of each bone separately and takes the average as the bone age of the whole hand. The other, called the whole-film matching method (i.e., the interpolation method), compares the film under test with the atlas plates one by one and takes the age of the closest plate as the bone age; if the film falls between two adjacent plates, the average of their ages is used.

The TW method was originally established in the 1930s for white European children. The second edition of the Tanner-Whitehouse method (TW2) was published in 1983 based on data from the 1950s and 1960s, and was updated in 2001 as the third edition (TW3). Bone age estimated by the TW3 method is slightly lower than that estimated by the TW2 method. The TW method scores the radius, ulna and short bones, and each major bone of the hand contributes to the total score. One meta-analysis suggests that the TW3 method predicts the age of Caucasians more accurately than the TW2 or GP method, whereas for Caucasian children both the TW3 and TW2 methods are more accurate than the GP method. A TW assessment takes 7.9 minutes, and TW is the recommended method for European endocrinologists.

At present, bone age is mainly detected by X-raying the phalanges, metacarpals and carpal bones of the subject's left and right hands and evaluating the images by the Greulich-Pyle (G-P) method or the Tanner-Whitehouse (TW) method. With the G-P method, different analysts may reach different conclusions because of subjective factors. The TW method removes the analyst's subjectivity, but the assessment is relatively time-consuming, so a conclusion is hard to obtain quickly. In addition, existing bone age detection equipment cannot produce a conclusion both accurately and rapidly; most detection standards are based on white children and are not suitable for Asian children such as those in China; and detection can rely only on a single information source.

Traditional bone age detection methods can process only image information and do not make full use of semantic information such as the heights of the subject's parents and the subject's age. Conventional methods also use only a convolutional neural network (CNN), which scales poorly to images obtained from different subjects and different machines.

Disclosure of Invention

In view of the above technical problems, the invention aims to provide a bone age detection method based on multi-modal adversarial training, which performs multi-modal bone age detection using medical images and text medical records and uses adversarial training to improve the accuracy of model prediction.

To achieve the above purpose, the invention provides a bone age detection method based on multi-modal adversarial training, comprising the following steps:

S1: constructing a bone age prediction data set;

S2: constructing and training a bone age detection model based on multi-modal adversarial training;

S3: in the prediction stage, retaining only the discriminator of the bone age detection model, adding a softmax to the last layer, and loading the weights of the best-trained discriminator to predict the result.

Optionally, the data set in step S1 consists of X-ray images and case text summaries in one-to-one correspondence.

Preferably, the X-ray images are obtained by X-raying the subject's radius, metacarpals, capitate, hamate and phalanges in sequence, yielding X-ray images in three different orientations: distal, middle and proximal;

the X-ray images of the subject in different poses, together with the text information retrieved from the electronic medical record database, are fed into the bone age detection model based on multi-modal adversarial training for prediction.

Further, the bone age detection model in step S2 includes a generator for generating new sample data according to a distribution rule of the real data and a discriminator for identifying whether the data is from the real data or from the newly generated sample data.

Further, the generator is constructed as follows:

the first step is a transposed convolution operation: three hyperparameters are first given, namely the batch size, the noise dimension and the pixel size of the initial noise sample, which together form the four-dimensional tensor required by the generator; next, one 2D transposed convolution with a 4*4 kernel and a stride of 1 converts the dimensions to a four-dimensional tensor of (1, 512, 4, 4); three 2D transposed convolutions with 4*4 kernels and a stride of 2 then convert the dimensions to a four-dimensional tensor of (1, 64, 32, 32); finally, one 2D transposed convolution with a 5*5 kernel and a stride of 3 outputs a tensor of (1, 64, 96, 96);

the second step is a downsampling operation: two 2D convolutions with 4*4 kernels and a stride of 2 convert the dimensions to a four-dimensional tensor of (1, 256, 24, 24);

the third step is a residual network operation: the dimensions remain unchanged after passing through a residual network consisting of 6 residual blocks;

the fourth step is an upsampling operation: 2D transposed convolutions with a stride of 2 and three 2D transposed convolutions with a stride of 1 convert the dimensions to a four-dimensional tensor of (1, 64, 102, 102), and finally one 2D convolution with a 7*7 kernel and a stride of 1 outputs a three-channel image of 102 x 102 pixels.

Further, the discriminator is constructed as follows:

first, one sample is input and converted to a four-dimensional tensor of (1, 3, 102, 102);

then, a 1*1 convolution converts the dimensions to a four-dimensional tensor of (1, 64, 102, 102);

four convolutions with 4*4 kernels and a stride of 2 then convert the dimensions to a four-dimensional tensor of (1, 1024, 1, 1);

finally, one convolution with a 6*6 kernel and a stride of 1 converts the tensor to a scalar, which is output.

From the above, the bone age detection method based on multi-modal adversarial training of the invention has at least the following beneficial effects:

(1) Adversarial training improves the accuracy of model prediction;

(2) Multi-modal bone age detection is performed using medical images and text medical records;

(3) Training on an X-ray data set of Chinese adolescents yields bone age results that fit Chinese adolescents;

(4) Bone age identification on multi-modal data is realized, and combining text information makes the result more accurate.

Drawings

FIG. 1 is a flow chart of the bone age detection method based on multi-modal adversarial training of the present invention;

FIG. 2 is a network structure diagram of the bone age detection model of the bone age detection method based on multi-modal adversarial training of the present invention;

FIG. 3 is a network architecture diagram of the image processing part of the bone age detection method based on multi-modal adversarial training of the present invention;

FIG. 4 shows sample data from the data set of the bone age detection method based on multi-modal adversarial training of the present invention, where (a) is an image and (b) is text.

Detailed Description

The bone age detection method based on multi-modal adversarial training according to the present invention is described in detail below with reference to FIG. 1 to FIG. 4.

As shown in FIG. 1, the invention combines multi-modal and adversarial-training ideas to build a bone age detection method based on multi-modal adversarial training. The method first constructs a bone age prediction data set of X-ray images and case text summaries in one-to-one correspondence; next, it constructs and trains a bone age detection model based on multi-modal adversarial training; finally, in the prediction stage, only the discriminator of the model is retained, a softmax is added to the last layer, and the weights of the best discriminator are loaded to predict the result. Specifically, the invention mainly uses an attention-based CGAN network structure to perform multi-modal prediction from medical images and text medical records, and this network structure is more extensible. The details are as follows:

(1) First, the subject fills in the necessary information, such as the parents' heights and the subject's medical history, date of birth and height.

Where this information is dictated, the speech is converted to text using speech recognition technology, which is outlined below:

Speech recognition (Automatic Speech Recognition, ASR): the problem speech recognition solves is letting a computer "understand" human speech and convert it into text. Speech recognition is at the forefront of intelligent human-computer interaction and is a precondition for machine translation, natural language understanding and the like.

(2) Then, the radius, metacarpals, capitate, hamate, phalanges and other bones of the subject are X-rayed in sequence to obtain X-ray images in three different orientations: distal, middle and proximal. The X-ray images capture the distal, middle and proximal orientations of the subject's phalanges, so that the ossification centers are clearly visible with smooth, continuous edges;

(3) The X-ray images of the subject in different poses, together with the text information retrieved from the electronic medical record database, are fed into the bone age detection model based on multi-modal adversarial training for prediction, and the prediction result is displayed within a short time;

(4) Samples from the prediction process, especially samples that were predicted incorrectly, can be added to the data set for the next round of training, so that the prediction accuracy of the model gradually improves.

Because adversarial training can improve model prediction accuracy, the model is built on a CGAN network. The network framework of the model is shown in FIG. 2. The upper half of FIG. 2 is the text information processing part, which retrieves from the electronic medical record database keyword information such as the subject's age, sex, height, father's height and mother's height, and whether the subject has a given disease; this information serves as Con (the condition) in the model. The lower half is the image information processing part, which is combined with the CGAN model to form adversarial training. Here Image denotes the real medical image of the subject, corresponding to the medical text record in the upper half; Z denotes the modeled random noise distribution; G denotes the generator; and D denotes the discriminator. During training, G learns continuously from the feedback given by D until the samples generated from Z fit the distribution of the Image data.
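The conditioning described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record field names, the scaling constants and the zero-padding scheme are all assumptions made for the example; only the idea of concatenating random noise Z with a condition vector built from the medical record, and the default 200-dimensional input (100 of which are the Condition), come from the text.

```python
import random

def encode_condition(record, cond_dim=100):
    """Turn a medical-record dict into a fixed-length condition vector.

    Field names and scaling constants are hypothetical, chosen only so that
    each feature lands roughly in [0, 1].
    """
    features = [
        record["age_years"] / 18.0,
        1.0 if record["sex"] == "male" else 0.0,
        record["height_cm"] / 200.0,
        record["father_height_cm"] / 200.0,
        record["mother_height_cm"] / 200.0,
        1.0 if record["has_disease"] else 0.0,
    ]
    # Zero-pad to the fixed condition length (an assumed encoding choice).
    return features + [0.0] * (cond_dim - len(features))

def generator_input(record, noise_dim=100, cond_dim=100, seed=None):
    """Concatenate random noise Z with the condition vector Con."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + encode_condition(record, cond_dim)

record = {
    "age_years": 12, "sex": "male", "height_cm": 150,
    "father_height_cm": 175, "mother_height_cm": 162, "has_disease": False,
}
vector = generator_input(record, seed=0)
print(len(vector))  # 200
```

In a real model the condition would more likely come from a learned text encoder; the hand-built features here only make the data flow visible.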

As shown in FIG. 3, the upper half of the figure shows the construction of the generator. The first step is a transposed convolution operation. Three hyperparameters are first given, namely the batch size, the noise dimension and the pixel size of the initial noise sample; together they form the four-dimensional tensor required by the generator, which defaults to (1, 200, 1, 1), where 1 is the batch size, 200 is the dimension of the input noise (including the 100-dimensional Condition), and 1*1 is the pixel size of the initial noise sample. Next, one 2D transposed convolution with a 4*4 kernel and a stride of 1 converts the dimensions to a four-dimensional tensor of (1, 512, 4, 4); three 2D transposed convolutions with 4*4 kernels and a stride of 2 convert the dimensions to (1, 64, 32, 32); and finally one 2D transposed convolution with a 5*5 kernel and a stride of 3 outputs a tensor of (1, 64, 96, 96). The second step is a downsampling operation: two 2D convolutions with 4*4 kernels and a stride of 2 convert the dimensions to a four-dimensional tensor of (1, 256, 24, 24). The main purpose of downsampling is to reduce the number of parameters and speed up computation, and the receptive field after downsampling is better suited to feature extraction by the network. The third step is a residual network operation: the dimensions remain unchanged after passing through a residual network consisting of 6 residual blocks. The fourth step is an upsampling operation: 2D transposed convolutions with a stride of 2 and three 2D transposed convolutions with a stride of 1, with 4*4 convolution kernels, convert the dimensions to a four-dimensional tensor of (1, 64, 102, 102), and finally one 2D convolution with a 7*7 kernel and a stride of 1 outputs a three-channel image of 102 x 102 pixels.
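The tensor sizes quoted for the generator can be checked with the usual output-size formulas for convolution and transposed convolution. The sketch below is only arithmetic under stated assumptions: the text gives kernels and strides but no padding, so the padding values are guesses chosen to reproduce the stated sizes, and the intermediate channel counts (256, 128) are likewise assumed. The upsampling step is read here as two stride-2 transposed convolutions followed by three stride-1 transposed convolutions; the 3*3 kernel without padding used for the stride-1 layers is an assumption picked because it grows 96 to 102, not a value taken from the text.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a 2D convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def tconv_out(size, kernel, stride, padding):
    """Spatial output size of a 2D transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def trace_generator():
    shapes = []
    size = 1  # input tensor (1, 200, 1, 1)

    # Step 1: transposed convolutions.
    size = tconv_out(size, 4, 1, 0)
    shapes.append((1, 512, size, size))
    for channels in (256, 128, 64):        # assumed channel progression
        size = tconv_out(size, 4, 2, 1)    # padding 1 is an assumption
    shapes.append((1, channels, size, size))
    size = tconv_out(size, 5, 3, 1)        # padding 1 is an assumption
    shapes.append((1, 64, size, size))

    # Step 2: downsampling with two stride-2 convolutions.
    for channels in (128, 256):
        size = conv_out(size, 4, 2, 1)
    shapes.append((1, channels, size, size))

    # Step 3: six residual blocks keep the shape unchanged.
    shapes.append((1, channels, size, size))

    # Step 4: upsampling (layer counts and stride-1 kernel are assumptions).
    for channels in (128, 64):
        size = tconv_out(size, 4, 2, 1)
    for _ in range(3):
        size = tconv_out(size, 3, 1, 0)
    shapes.append((1, channels, size, size))
    size = conv_out(size, 7, 1, 3)         # padding 3 keeps 102 x 102
    shapes.append((1, 3, size, size))
    return shapes

for shape in trace_generator():
    print(shape)
```

Under these assumptions the trace reproduces every size stated in the text: (1, 512, 4, 4), (1, 64, 32, 32), (1, 64, 96, 96), (1, 256, 24, 24), (1, 64, 102, 102) and the final (1, 3, 102, 102) image.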

The lower half of the figure shows the construction of the discriminator. Its construction is the reverse of the generator's: the discriminator network of the deep residual generative adversarial network converts a four-dimensional tensor into a scalar. First, one sample is input and converted to a four-dimensional tensor of (1, 3, 102, 102); then a 1*1 convolution converts the dimensions to a four-dimensional tensor of (1, 64, 102, 102); four convolutions with 4*4 kernels and a stride of 2 then convert the dimensions to a four-dimensional tensor of (1, 1024, 1, 1); finally, one convolution with a 6*6 kernel and a stride of 1 converts the tensor to a scalar, which is output.
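The same size arithmetic can be applied to the discriminator. As before, this is a sketch under assumptions: padding of 1 for the stride-2 convolutions and channel doubling at each of them are not stated in the text. Under these assumptions the four stride-2 convolutions leave a 6 x 6 map, which is exactly what a 6*6 kernel with stride 1 reduces to a single value; the (1, 1024, 1, 1) figure quoted in the text is not reproduced by this padding choice.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a 2D convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_discriminator(size=102):
    shapes = [(1, 3, size, size)]

    # 1*1 convolution lifts 3 channels to 64 without changing the size.
    size = conv_out(size, 1, 1, 0)
    channels = 64
    shapes.append((1, channels, size, size))

    # Four 4*4, stride-2 convolutions; padding 1 and channel doubling are assumed.
    for _ in range(4):
        channels *= 2
        size = conv_out(size, 4, 2, 1)
        shapes.append((1, channels, size, size))

    # Final 6*6, stride-1 convolution collapses the map to one value.
    size = conv_out(size, 6, 1, 0)
    shapes.append((1, 1, size, size))
    return shapes

for shape in trace_discriminator():
    print(shape)
```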

The middle of the figure is the text information retrieved from the electronic medical record database, which guides the generator in learning the sample distribution and guides the discriminator in telling samples apart.

To capture the fine features of bones more accurately, the image processing part builds the adversarial model from a deep residual generative network and a deep residual dense network. As shown in FIG. 3, the upper half of the model is the generator: random noise (Z) is concatenated with 10 label dimensions into 200-dimensional data as input, and samples are generated by several dense residual blocks built from convolutional networks. The lower half of the model is the discriminator, which judges whether an input image is real or generated through several convolutional residual blocks built from convolutional networks.

A residual network is easy to optimize and can gain accuracy from added depth. Its residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks. However, a residual network does not fully exploit the information of every convolutional layer, since connections exist only after some of the convolution operations. To make full use of the receptive-field information acquired by all layers in the generator while avoiding unnecessary computation, the generator is built from dense residual modules and the discriminator from a residual network.
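The difference between the two connection patterns can be illustrated with a toy example. This is a deliberately simplified sketch, not the patent's network: `toy_layer` is a hypothetical stand-in for a convolution (it just averages its input channels), and `growth` is an assumed parameter. What the sketch shows is the wiring: a residual block adds its input to the transformed output, while a dense block feeds every layer the concatenation of the input and all earlier layer outputs.

```python
def toy_layer(channels, out_channels):
    """Stand-in for a convolution: every output channel is the mean of the inputs."""
    mean = sum(channels) / len(channels)
    return [mean] * out_channels

def residual_block(x):
    """y = x + F(x): a skip connection around one transformation."""
    fx = toy_layer(x, len(x))
    return [a + b for a, b in zip(x, fx)]

def dense_block(x, num_layers=3, growth=2):
    """Each layer sees the input plus the outputs of all earlier layers."""
    features = list(x)
    for _ in range(num_layers):
        features = features + toy_layer(features, growth)
    return features

x = [1.0, 3.0]
print(residual_block(x))    # [3.0, 5.0]
print(len(dense_block(x)))  # 8 channels: 2 inputs + 3 layers * 2 new channels
```

The residual block keeps the channel count fixed, whereas the dense block keeps every earlier feature available to later layers, which is the property the text cites as the reason for using dense residual modules in the generator.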

Overall process of model training and prediction:

(1) Constructing a data set of one-to-one image and text pairs

As shown in FIG. 4, the left side is an X-ray image of a subject and the right side is a text summary of the subject's case. Many such one-to-one pairs make up the bone age prediction data set of the invention. To improve the accuracy of the trained model, the subject's radius, metacarpals, capitate, hamate and phalanges are X-rayed from multiple angles to obtain several images (distal, middle, proximal and so on), which provide bone structure information at different scales (similar in spirit to the difference of Gaussians used in SIFT in traditional image algorithms).

(2) The invention builds a model, BAGAN, based on the ideas of CGAN and InfoGAN. The generator of BAGAN generates new sample data according to the distribution of the real data, and the discriminator identifies whether data comes from the real data or from the newly generated samples. Training BAGAN is a minimax game: the ultimate goal is for the generator to fully capture the distribution of the real data and generate samples from it, while the discriminator judges whether each sample is real. During training, the learning rate of the generator is slightly lower than that of the discriminator, and the two models are trained alternately, i.e., the discriminator is trained first and then the generator, in turn. To make the model easier to train, each deconvolution (transposed convolution) in the generator is followed by Instance Normalization and then by LeakyReLU to ensure nonlinearity; each convolution in the discriminator is followed by Batch Normalization and then by ReLU to ensure nonlinearity. The text information retrieved from the electronic medical record database serves as the Condition in the model, so that the generator learns the probability distribution of the real samples under this constraint.
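The text does not write out the objective of this minimax game. For orientation, the standard conditional GAN objective is shown below as an assumed reference form, not as the patent's own loss; here x is a real image, c is the Condition built from the text record, z is the random noise, G is the generator and D is the discriminator.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x \mid c)\bigr]
  + \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z \mid c) \mid c)\bigr)\bigr]
```

The discriminator is trained to increase V by scoring real pairs high and generated pairs low, while the generator is trained to decrease it; any additional InfoGAN-style term that BAGAN may use is not specified in the text and is omitted here.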

(3) Prediction process

In the prediction stage, only the discriminator is retained. It loads the best model obtained during training, and a softmax is added at the end of the network so that the model, which originally predicted real or fake, outputs a specific class.

Overall model prediction results:
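The softmax step at the end of the retained discriminator can be sketched as follows. The logits and the bone age class labels are hypothetical values made up for the example; the text does not say how many classes the model outputs or how they map to ages.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(logits, labels):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical final-layer outputs for three assumed bone age classes (years).
logits = [0.1, 2.0, -1.0]
labels = [10, 11, 12]
label, prob = predict_class(logits, labels)
print(label)  # 11
```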

The BAGAN network of the invention is a model built on the idea of adversarial training, combining InfoGAN on the basis of a conditional GAN. Using the two time-scale update rule (TTUR) allows the discriminator to capture the distribution of the real samples faster and more fully. Finally, in the prediction stage, the pre-trained discriminator is connected to a softmax layer and used as a classifier.
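The alternating schedule with two learning rates can be sketched as below. The learning-rate values are hypothetical placeholders; the text says only that the generator's rate is slightly lower than the discriminator's and that the discriminator is trained first, then the generator, in turn.

```python
def ttur_schedule(num_steps, lr_d=4e-4, lr_g=1e-4):
    """Yield (step, model, learning_rate) in training order.

    lr_d and lr_g are assumed example values with lr_g < lr_d.
    """
    for step in range(num_steps):
        yield step, "D", lr_d  # update the discriminator first
        yield step, "G", lr_g  # then update the generator

updates = list(ttur_schedule(2))
for step, model, lr in updates:
    print(step, model, lr)
```

In a real training loop each yielded entry would correspond to one optimizer step on the named network, with separate optimizers holding the two learning rates.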

The multi-source modal data is input into the trained classifier. The output class that best matches the bone feature information determines the growth stage of the bones, from which the specific bone age of the subject is determined and the result is analyzed.

The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. A bone age detection method based on multi-modal adversarial training, characterized by comprising the following steps:

S1: constructing a bone age prediction data set;

S3: in the prediction stage, retaining only the discriminator of the bone age detection model, adding a softmax to the last layer, and loading the weights of the best-trained discriminator to predict the result;

the bone age detection model in the step S2 comprises a generator for generating new sample data according to the distribution rule of the real data and a discriminator for identifying whether the data is from the real data or the newly generated sample data;

the steps of building the generator are as follows:

the fourth step is an upsampling operation: 2D transposed convolutions with a stride of 2 and three 2D transposed convolutions with a stride of 1, with 4*4 convolution kernels, convert the dimensions to a four-dimensional tensor of (1, 64, 102, 102), and finally one 2D convolution with a 7*7 kernel and a stride of 1 outputs a three-channel image of 102 x 102 pixels;

the steps of constructing the discriminator are as follows:

2. The bone age detection method based on multi-modal adversarial training according to claim 1, wherein the data set in step S1 consists of X-ray images and case text summaries in one-to-one correspondence.

3. The bone age detection method based on multi-modal adversarial training according to claim 2, wherein the X-ray images are obtained by X-raying the subject's radius, metacarpals, capitate, hamate and phalanges in sequence, yielding X-ray images in three different orientations: distal, middle and proximal;