CN114220516A - Brain CT medical report generation method based on hierarchical recurrent neural network decoding - Google Patents
- Publication number
- CN114220516A (application CN202111548154.0A)
- Authority
- CN
- China
- Prior art keywords
- brain
- semantic
- features
- visual
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G16H15/00—ICT specially adapted for medical reports, e.g. generation or transmission thereof
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/253—Fusion techniques of extracted features
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06T7/0012—Biomedical image inspection
- G06T2207/10081—Computed x-ray tomography [CT]
- G06T2207/30016—Brain
Abstract
The invention discloses a brain CT medical report generation method based on hierarchical recurrent neural network decoding. The method first acquires brain CT images with the corresponding medical-report data and preprocesses them; constructs a feature extractor that encodes the brain CT image data into coding features and fault-block visual features; constructs an orientation keyword predictor that extracts the orientation-keyword semantic features F_s of the brain CT image data; constructs a hierarchical recurrent neural network language model that performs layered decoding using the visual features and F_s and generates the medical report sentence by sentence; and trains and optimizes the model. At prediction time, the brain CT to be read is preprocessed; the feature extractor extracts its coding features and fault-block visual features; the orientation keyword predictor extracts its semantic features; and the language model generates the predicted medical report sentence by sentence from the coding features, the fault-block visual features, and the semantic features.
Description
Technical Field
Drawing on techniques from the fields of computer vision and natural language processing, the invention provides a medical report generation method based on hierarchical recurrent neural network decoding for the automatic report generation task in medical image analysis.
Background
The task of automatic medical report generation takes as input a group of brain CT images with a spatial sequence relationship and asks a computer to automatically generate several sentences describing their content as the corresponding medical report. This requires both comprehension of the brain CT images and the linguistic ability to express their content, and it is currently a research focus in medical image analysis.
With the rise of artificial intelligence techniques, computer-assisted medical image analysis has attracted increasing attention. Machine learning plays an important role in classification, segmentation and related tasks; however, conventional machine learning shows limited capability on large-scale annotated or unlabeled data. Deep learning, one of the most important breakthroughs in artificial intelligence in the last decade, has achieved great success across computer vision and image processing tasks. Medical image analysis methods and models also benefit from its powerful representation learning: the related literature keeps growing, and substantial progress has been made in practical applications such as chest X-ray diagnosis report generation.
Early work on automatic medical report generation was mainly based on models from the image description field, where three approaches dominate: template-based methods, retrieval-based methods, and neural encoder-decoder methods; the last is currently the most common and is widely applied in areas such as intelligent healthcare. The basic idea of the encoder-decoder approach is to first extract the visual features of the image with an encoder, then use a decoder to establish the mapping from vision to language and generate the description. Compared with the first two approaches, it needs neither detailed image annotation nor huge datasets, and the descriptions it generates are natural, so it has become the mainstream framework for image description. Compared with generic image description, however, the brain CT medical report generation task is distinctive in the decoding phase: the medical report is a long text sequence with a fixed structure and many described subjects.
Disclosure of Invention
In order to exploit the data characteristics of medical reports and fully mine the detailed feature information in the brain CT encoding, the invention builds on deep learning models from the image description field and provides a brain CT medical report generation method based on hierarchical recurrent neural network decoding (HLSTMD). It effectively mines the visual features matched with different keywords in the brain CT encoding, improves the language performance indices of the generated reports, and produces higher-quality medical reports. The method mainly comprises an encoder part and a decoder part: the encoder part contains a feature extractor, and the decoder part contains an orientation keyword predictor (KWP) and a hierarchical recurrent neural network language decoding model (TTSM).
The technical scheme adopted by the invention is a brain CT medical report generation method based on hierarchical recurrent neural network decoding, comprising the following steps:
step (1) a training data set for generating a brain CT medical report is manufactured and preprocessed to obtain a standardized three-dimensional brain CT image and corresponding text information, wherein the text information comprises a keyword textReport text
Step (1.1): acquire brain CT images to construct a dataset, where each patient's data comprises the RGB images generated from the brain CT, I = {I_1, …, I_n}, the keyword text, and the brain CT report text. Here n is the number of brain CT sequences per case, I_i is the RGB image of the i-th sequence, and W and H are the width and height of each sequence; the keyword text and report text are word sequences with, respectively, a per-report number of keywords and a per-report number of words.
Step (1.2): divide all patient data into a training set, a validation set and a test set. The training set is used to learn the parameters of the neural network; the validation set is used to determine the hyper-parameters; the test set is used to verify the network's performance.
Step (1.3), data preprocessing: cases with no more than 24 brain CT slices are completed to 24 with an interpolation algorithm, and cases with more than 24 slices are reduced to 24 by uniform sampling.
Step (2): use the brain CT image feature extractor to encode the brain CT image data I. The feature extractor adopts the encoder of the patent "Brain CT medical report generation method based on hierarchical self-attention sequence coding" to extract the three-dimensional coding features and the fault-block visual features of the brain CT image.
Step (3): construct an orientation keyword predictor that performs left/right binary classification of the brain CT image; the binary-classification information then assists generation of the final medical report. The orientation keyword predictor is implemented as a multi-layer perceptron with a single hidden layer; the output layer has 2 neurons, i.e., it classifies the left and right semantic labels, and the left/right semantic information F_s is the value of the hidden-layer neurons. The three-dimensional brain CT coding features pass through the orientation keyword predictor to obtain semantic features containing the left/right keyword labels, where n is the number of brain CT images per case after preprocessing.
Step (4): construct the hierarchical recurrent neural network language decoding model (the language model for short), which generates the brain CT medical report. The language model is divided into two parts: a fusion attention module (FAM) and a hierarchical decoding module. The fusion attention module fuses the semantic features with the visual features; the fused semantic features activate the initial LSTM hidden vectors in the hierarchical decoding module, and the fused visual features guide the LSTM to generate words with the right emphasis. In the training phase, the inputs of the language model are the visual features, the semantic features F_s, the keyword text and the report text; the output is the loss value of step (5), used to optimize the model's performance.
Step (4.1): construct the guided attention model used in step (4.2). A guided attention model mines the correlation of one feature tensor within the space of another related feature tensor. It comes in two forms. In the visual guided attention model (VGA), the query tensor is the visual feature and the key and value tensors are F_s; the computed new features are guided by visual information, so the semantic features are reconstructed in the visual feature space. In the semantic guided attention model (SGA), the query tensor is F_s and the key and value tensors are the visual features; the computed new features are guided by semantic information, so the visual features are reconstructed in the semantic feature space.
Step (4.2): construct the fusion attention module (FAM). The FAM is implemented with two stacked guided attention models together with linear transformation, residual connection and normalization operations. It is used in step (4.3): when the inputs are the semantic features F_s and the visual features, the outputs are the visual fusion feature F_v′ and the semantic fusion feature F_s′; when the inputs are the keyword semantic features (obtained from the T-LSTM of step (4.3)) and the visual features F_v′, the outputs are the visual fusion feature F_v″ and the semantic fusion feature F_s″.
Step (4.3): construct the hierarchical decoding module, which is composed of a keyword recurrent neural network (T-LSTM) and a descriptive-clause recurrent neural network (S-LSTM).
T-LSTM is characterized by:
1) First, the fusion attention model FAM fuses the semantic features F_s with the visual features to obtain the reconstructed features F_v′ and F_s′.
2) The T-LSTM hidden layer and first recurrent unit are initialized with the semantic fusion feature F_s′; the visual fusion feature F_v′ performs visual attention computation with the input word values, so that the T-LSTM focuses on the corresponding visual features when generating each word.
3) The hidden state of the T-LSTM at time t consists of two parts: the hidden state output by the T-LSTM at time t-1, and the hidden state returned by the descriptive-clause recurrent neural network S-LSTM after it generates the t-th clause.
4) Overall, the inputs of the T-LSTM are the semantic features F_s, the visual features, and the keyword-text words; the output is the predicted keyword-text words X = {x_1, …, x_t}, where x_t denotes the input word value and the output word prediction at time t.
S-LSTM is characterized in that:
1) First, the keyword x_t undergoes a linear transformation and dimension expansion to obtain the keyword semantic features; the fusion attention model then fuses these semantic features with the visual features to obtain the reconstructed features F_v″ and F_s″.
2) The S-LSTM hidden layer and first recurrent unit are initialized with the semantic fusion feature F_s″; the visual fusion feature F_v″ performs visual attention computation with the input word values, realizing visual-semantic matching between the brain CT fault-block features and the detailed lesion-description clauses.
3) The hidden state of the S-LSTM at time t is composed of the hidden state returned after the previous descriptive clause and the semantic reconstruction feature F_s″ at time t.
4) Overall, the inputs of the S-LSTM are the semantic features, the visual features, and the descriptive-clause words; the output is the predicted clause words Y = {y_1, …, y_t}, where y_t denotes the input word value and the output word prediction at time t.
Step (5): construct and train the overall loss function, which comprises two parts:

Loss = Loss_sentence + Loss_topic

where the ground-truth label of x_t is the word in the keyword text, the ground-truth label of y_t is the word in the report text, and p_t is the predicted probability of the corresponding word x_t or y_t.
The prediction phase comprises the following steps:
and (6) preprocessing the brain CT to be predicted to obtain a standardized three-dimensional brain CT image, corresponding keyword text information and a medical report.
And (7) extracting the three-dimensional coding features and the fault block visual features of the brain CT image to be predicted by using the trained feature extractor.
And (8) generating left and right semantic features by using the trained direction keyword predictor.
And (9) fusing the semantic features and the visual features by using the trained language model to generate a medical report of the brain CT image to be predicted.
Advantageous effects
Compared with generic image description, the brain CT medical report generation task is distinctive in the decoding phase: the medical report is a long text sequence with a fixed structure and many described subjects. The proposed model exploits these text characteristics while generating the corresponding descriptive clauses from different visual spaces, producing higher-quality brain CT medical reports.
Drawings
FIG. 1: Feature extractor
FIG. 2(a): Self-attention network architecture
FIG. 2(b): Guided attention network architecture
FIG. 3: Fusion attention model
FIG. 4: Sentence decoding model based on hierarchical recurrent neural network
FIG. 5: Comparative example of generated reports
Detailed Description
The specific implementation steps of the invention are explained below, taking 492 cases of data provided by Peking University Third Hospital as an example:
Step (1): acquire and preprocess the brain CT images and the corresponding keyword and report data:
step (1.1) acquires brain CT images to construct a data set, wherein the data set comprises 492 cases of brain CT images with patient sensitive information deleted and corresponding reports, and the image data of each patient comprises a plurality of CT sequences and a corresponding report text. The original brain CT medical image is in a dicom format, is converted into observation views with three scales commonly used by doctors and is used as three channel values of an RGB three-channel color image, the boundary noise CT value-2000 of the image is removed, and finally, the brain CT image data I in a PNG format is obtained1…,IN},Where N represents the number of CT slices per case and W and H represent the width and height, respectively, of each sequence. Then the preprocessing of the corresponding text: removing redundant punctuation marks in the report text data; the terms of the professional nouns are unified; dividing the Chinese words into a plurality of description short sentences by using a 'separation character', dividing the Chinese words into a plurality of description short sentences by using a jieba word segmentation tool to obtain a description dictionary, wherein the size of the dictionary is 244, and finally obtaining the medical report textFinally, the highest frequency words in each short sentence in the report text are combined into a keyword text Representing the ith word in the keyword text,represents the number of keywords in a report,the ith word representing the report text,represent in a reportThe number of words.
Step (1.2): randomly divide the dataset into training, validation and test sets at a ratio of 10:1:1, i.e., 410:41:41 cases.
Step (1.3), data preprocessing: 24 brain CT slices are used per case; cases with fewer than 24 images are completed by interpolation, and cases with more than 24 images are reduced by uniform sampling. Each image is denoised, normalized, and resized to 512 × 512.
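The slice-count normalization described above can be sketched as follows; nearest-neighbour index mapping stands in for the patent's unspecified interpolation algorithm, and the function name is hypothetical.

```python
def normalize_slice_count(slices, target=24):
    """Return exactly `target` slices: uniform sampling when there are too
    many, nearest-neighbour repetition (a simple stand-in for interpolation)
    when there are too few."""
    n = len(slices)
    if n == target:
        return list(slices)
    # map each of the `target` output positions onto a source slice index
    idx = [round(i * (n - 1) / (target - 1)) for i in range(target)]
    return [slices[j] for j in idx]
```

Both the under- and over-sampled cases preserve the first and last slice, so the cranial-to-caudal extent of the scan is kept.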
Step (2): use the brain CT image feature extractor to encode the 492 cases of brain CT image data I. The feature extractor adopts the encoder of the patent "Brain CT medical report generation method based on hierarchical self-attention sequence coding" to extract the three-dimensional coding features and the fault-block visual features of the brain CT image; the specific network architecture is shown in FIG. 1, where these features correspond to the SA Feature and NSA Feature, respectively. The overall formula of the feature extractor is expressed as:
and (3) constructing an orientation keyword predictor for left and right side two-classification of the brain CT image, and then generating a final medical report by using two-classification information in an auxiliary manner. The orientation keyword predictor is realized by multi-layer perception only comprising a hidden layer, the number of neurons of an output layer is 2, namely two classifications of left and right semantic labels are made, and left and right semantic information FsThe value of (b) is the value of a multi-layered perceptron hidden layer neuron. Three-dimensional brain CT coding featuresObtaining a semantic feature containing left and right side keyword labels through an orientation keyword predictorWhere n represents the number of brain CT images of one case after preprocessing, we use 24 brain CT images here, so n is 24.
Step (4): construct the hierarchical recurrent neural network language decoding model (the language model for short), which generates the brain CT medical report. The language model is divided into two parts: a fusion attention module (FAM) and a hierarchical decoding module. The fusion attention module fuses the semantic features with the visual features; the fused semantic features activate the initial LSTM hidden vectors in the hierarchical decoding module, and the fused visual features guide the LSTM to generate words with the right emphasis. In the training phase, the inputs of the language model are the visual features, the semantic features F_s, the keyword text and the report text; the output is the loss value of step (5), used to optimize the model's performance.
Step (4.1): construct the guided attention model. A guided attention model mines the correlation of one feature tensor within the space of another related feature tensor; the fusion attention model is built from guided attention models, so that the language model can attend to the associated visual detail features of the brain CT when generating each descriptive clause. The specific structure of the self-attention model is shown in FIG. 2(a): the same feature vector undergoes linear transformations to obtain the query tensor Q, key tensor K and value tensor V; then Q and K are matrix-multiplied, scaled, passed through a Softmax activation, and matrix-multiplied with V, finally producing attention features that account for the dependencies within the vector. Guided attention is obtained by modifying self-attention, as shown in FIG. 2(b): the query tensor is replaced by another related tensor. When the replaced query is the visual feature and the key and value tensors are the semantic features, this is the visual guided attention model VGA; conversely, when the query is the semantic feature and the key and value tensors are the visual features, this is the semantic guided attention model SGA.
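The guided attention computation described above (with the learned linear projections omitted) can be sketched as scaled dot-product attention whose query comes from a different feature tensor than the keys and values. The function name and the plain-list matrix representation are illustrative.

```python
import math

def guided_attention(Q, K, V):
    """Scaled dot-product attention with the query from a different tensor
    than the key/value pair: visual queries over semantic keys/values give
    the VGA; semantic queries over visual keys/values give the SGA."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]  # QK^T / sqrt(d_k)
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]                                          # row-wise softmax
        out.append([sum(w * v[j] for w, v in zip(weights, V))                    # weighted sum of values
                    for j in range(len(V[0]))])
    return out
```

Setting Q = K = V recovers ordinary self-attention as in FIG. 2(a).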
Step (4.2): construct the fusion attention model (FAM). The FAM is implemented mainly with two stacked guided attention models; FIG. 3 shows the specific process of computing the fused semantic feature F_s′ with the FAM. The inputs of the model are the visual features and the semantic features F_s, whose dimensions are determined by N_v, the number of fault blocks, and N_s, the number of left/right labels. First, the SGA model mines the information in the visual features related to the semantic features F_s, achieving visual feature enhancement; the enhanced features then serve as the visual information of the VGA to guide the semantic features F_s in return. Finally, the semantic features reconstructed by the VGA and SGA models pass through the sequence of linear transformation, residual connection and normalization to obtain the final semantic fusion feature F_s′. The forward propagation formulas of the whole FAM process are as follows:
In the same way as the fusion process above, with the visual features as the main body of the value tensor, the visual fusion features are generated; the corresponding FAM forward propagation formula is:
where W is a parameter matrix of the neural network's linear transformation and Norm is the layer normalization function; the candidate semantic fusion feature yields the final semantic fusion feature F_s′, which is then used to initialize the keyword-generating recurrent neural network, and the candidate visual fusion feature yields the final visual fusion feature F_v′, which is then used to compute the visual attention mechanism in the keyword-generating recurrent neural network.
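A simplified sketch of the FAM forward pass for the semantic fusion feature F_s′, under the assumption that it stacks an SGA and a VGA and then applies a residual connection and layer normalisation; the learned linear projections W are omitted and all function names are hypothetical.

```python
import math

def attend(Q, K, V):
    """Scaled dot-product guided attention (no learned projections)."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def layer_norm(row, eps=1e-6):
    mu = sum(row) / len(row)
    var = sum((v - mu) ** 2 for v in row) / len(row)
    return [(v - mu) / math.sqrt(var + eps) for v in row]

def fam_semantic_fusion(F_v, F_s):
    """SGA enhances the visual features with semantic queries; the enhanced
    features act as visual guidance (VGA) back onto the semantics; residual
    connection and layer normalisation give the fused F_s'."""
    enhanced_v = attend(F_s, F_v, F_v)      # SGA: semantic queries over visual keys/values
    recon_s = attend(enhanced_v, F_s, F_s)  # VGA: enhanced visual queries over semantic keys/values
    return [layer_norm([a + b for a, b in zip(r, s)])  # residual + LayerNorm
            for r, s in zip(recon_s, F_s)]
```

Swapping the roles of the two feature sets, with the visual features as the main body of the value tensor, would sketch the visual fusion feature F_v′ in the same way.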
Step (4.3): construct the hierarchical decoding module, composed of a keyword recurrent neural network T-LSTM and a descriptive-clause recurrent neural network S-LSTM; the specific structure is shown in FIG. 4. The T-LSTM generates the keyword of each descriptive clause, and the S-LSTM generates the corresponding descriptive clause from that keyword.
T-LSTM is characterized by:
1) First, the fusion attention model (FAM) fuses the semantic features F_s with the visual features to obtain the reconstructed features F_v′ and F_s′.
2) The T-LSTM hidden layer and first recurrent unit are initialized with the semantic fusion feature F_s′; the visual fusion feature F_v′ performs visual attention computation with the input word values, so that the T-LSTM focuses on the corresponding visual features when generating each word.
3) The hidden state of the T-LSTM at time t consists of two parts: the hidden state output by the T-LSTM at time t-1, and the hidden state returned by the descriptive-clause recurrent neural network S-LSTM after it generates the t-th clause (the downward dashed line in FIG. 4); the specific formula is as follows:
where α is a hyper-parameter taking values in [0, 1]. By adjusting α, the T-LSTM retains or discards, to different degrees, the descriptive-clause information and the keyword information of the previous time step when generating each keyword. In practice, a bias is added at the forget gate of the recurrent neural network and initialized to 1, to weaken forgetting early in training and avoid exploding or vanishing model outputs at the start of training. The specific formula is shown below, where W is a parameter matrix of the neural network's linear transformation, f_t is the forget-gate vector, h_{t-1} is the hidden state of the previous time step, x_t is the input vector at time t, and b is the bias:
f_t = σ(W_fx x_t + W_fh h_{t-1} + b)
4) Overall, the inputs of the T-LSTM are the semantic features F_s, the visual features, and the keyword-text words; the output is the predicted keyword-text words X = {x_1, …, x_t}, where x_t denotes the input word value and the output word prediction at time t.
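The two T-LSTM details with explicit formulas, the two-part hidden state and the bias-initialised forget gate, can be sketched as below. The convex-combination form of the hidden state is an assumption consistent with the description of α (the patent's exact formula is not recoverable from the source); the forget gate follows the formula given for f_t.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def blend_hidden(h_topic_prev, h_sentence_prev, alpha):
    """Assumed form of the T-LSTM hidden state: a convex combination of its
    own previous hidden state and the state returned by the S-LSTM,
    weighted by the hyper-parameter alpha in [0, 1]."""
    assert 0.0 <= alpha <= 1.0
    return [alpha * a + (1.0 - alpha) * b
            for a, b in zip(h_topic_prev, h_sentence_prev)]

def forget_gate(x_t, h_prev, W_fx, W_fh, bias=1.0):
    """f_t = sigmoid(W_fx x_t + W_fh h_{t-1} + b), with the bias initialised
    to 1 so the network forgets little early in training."""
    pre = [sum(w * x for w, x in zip(rx, x_t))
           + sum(w * h for w, h in zip(rh, h_prev)) + bias
           for rx, rh in zip(W_fx, W_fh)]
    return [sigmoid(p) for p in pre]
```

With zero weights and bias 1, each forget-gate value starts near sigmoid(1) ≈ 0.73, i.e. most of the cell state is retained at the beginning of training.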
S-LSTM is characterized in that:
1) First, the keyword x_t undergoes a linear transformation and dimension expansion to obtain the keyword semantic features; the fusion attention model then fuses these semantic features with the visual features to obtain the reconstructed features F_v″ and F_s″.
2) The S-LSTM hidden layer and first recurrent unit are initialized with the semantic fusion feature F_s″; the visual fusion feature F_v″ performs visual attention computation with the input word values, realizing visual-semantic matching between the brain CT fault-block features and the detailed lesion-description clauses.
3) The hidden state of the S-LSTM at time t is composed of the hidden state returned after the previous descriptive clause and the semantic reconstruction feature F_s″ at time t. During initialization, the memory of the S-LSTM from generating the previous sentence is also introduced; the specific formula is as follows:
where β is a hyper-parameter taking values in [0, 1]. By adjusting β, the S-LSTM retains or discards, to different degrees, the previously generated descriptive clause and the keyword information of the previous time step when generating each clause.
4) Overall, the inputs of the S-LSTM are the semantic features, the visual features, and the descriptive-clause words; the output is the predicted clause words Y = {y_1, …, y_t}, where y_t denotes the input word value and the output word prediction at time t.
Step (5): define the overall loss function and train the brain CT image report generation network. The overall loss comprises two parts:

Loss = Loss_sentence + Loss_topic

where the ground-truth label of x_t is the word in the keyword text, the ground-truth label of y_t is the word in the report text, and p_t is the predicted probability of the corresponding word x_t or y_t. The network adaptively optimizes the loss between the real report and the predicted report under the Adam optimizer. After training, given an input brain CT image, the model automatically generates a coherent and accurate brain CT medical report; thanks to the reconstructed fusion attention mechanism, it attends to different visual content when generating each descriptive clause, demonstrating the method's advantage on long text sequences.
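The overall objective Loss = Loss_sentence + Loss_topic, with each part a word-level cross-entropy over the prediction probabilities p_t, can be sketched as follows. Per-word averaging inside each term is an assumption; the source does not specify the normalisation.

```python
import math

def cross_entropy(pred_probs, targets):
    """Mean negative log-likelihood of the ground-truth word at each step."""
    return -sum(math.log(p[t]) for p, t in zip(pred_probs, targets)) / len(targets)

def total_loss(topic_probs, topic_targets, sent_probs, sent_targets):
    """Loss = Loss_sentence + Loss_topic: cross-entropy over the keyword
    (topic) predictions plus cross-entropy over the clause (sentence)
    predictions, as in the patent's overall objective."""
    return (cross_entropy(sent_probs, sent_targets)
            + cross_entropy(topic_probs, topic_targets))
```

In practice both terms would be minimised jointly with Adam, as described above.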
To verify the feasibility of the method, quantitative experiments and qualitative analyses were performed. The generated reports are evaluated with four language metrics widely used for quantitative evaluation of generated sentences in image description, natural language processing and related fields: BLEU, METEOR, ROUGE-L and CIDEr. FIG. 5 shows examples, including the brain CT images, report examples generated by the compared baseline methods, and report examples generated by the proposed method.
Table 1 model ablation experiment
First, inter-module ablation experiments were performed. The first row of Table 1 uses a single-layer LSTM to decode the medical report, i.e., the baseline model. The LSTM_S method in the second row initializes the decoder's memory unit with the binary semantic information F_s′ and uses it as the starting-word information of the medical report in the test phase. In the third row, the LSTM_FAM method replaces the binary semantic information with its FAM fusion with the fault-block visual features. The HLSTM method below the dividing line uses the hierarchical T-LSTM and S-LSTM as the medical-report decoder. The fifth and sixth rows initialize the recurrent units of the HLSTM with the orientation-keyword semantic information F_s′ and F_s″, respectively.
Table 1 shows that merely adding layers to the recurrent neural network does not necessarily improve the final medical report generation. Because the training sample is small, adding a layer to train on and describe the subject keywords introduces more training parameters, and the language evaluation scores at test time actually drop, indicating that the extra parameters aggravate over-fitting on this data set. However, the added recurrent network does decompose the final report generation task, so that each descriptive sentence is generated with its own capture of the corresponding visual and semantic information; with the fusion of multi-modal information, the model's language scores match or even exceed those of the single-layer recurrent network. Specifically, for both the single-layer LSTM baseline and the HLSTM method, initializing the language model with the binary-classification semantic information, and using the left/right labels produced by the orientation keyword predictor as the initial words of the first clause of the medical report in the test stage, slightly improves the final model, with clear gains in the BLEU-1 and CIDEr scores. In addition, using the FAM module in the HLSTM to fuse the left/right keywords Fs′ and the location keywords Fs″ with the visually encoded features, and using them to enhance the T-LSTM and S-LSTM respectively, brings a large improvement in the performance of final medical report generation.
Finally, in the HLSTM_GT experiment, the left/right semantic information and the semantic information output by the T-LSTM are no longer used; instead, the real keyword labels are used directly to generate the FAM fusion information. The results show that when the quality of the generated keyword labels is very high, the proposed method effectively improves the evaluation scores of the report description sentences.
Table 2 FAM model ablation experiment results
Next, experiments with different stacking schemes of the FAM model were carried out, studying the effectiveness of the FAM model within a given hypothesis space. The visual fusion (reconstruction) feature is denoted Fv′ (Fv″), and the semantic fusion (reconstruction) feature is denoted Fs′ (Fs″).
As can be seen above the dividing line in Table 2, when the same semantic-guided attention model is used, i.e. the SGA and SA_SGA models, the final models fit the data set similarly and the evaluation scores fluctuate little, which indicates that the model's discriminability on the visual fusion features is significantly stronger than on the semantic fusion features. Meanwhile, through cross fusion with the fault-block information, the label detail features contained in the semantic information can be further learned, so that when the model generates a descriptive clause whose subject is the corresponding label, its association with the relevant visual information is strengthened, further improving the quality of the finally generated medical report.
Finally, a qualitative analysis of the model's descriptive ability was carried out. Figure 5 compares medical reports generated by the proposed method and the baseline method. From the generated text, the medical report generated by the single-layer LSTM method describes fewer subjects, and the described subjects are generally adjacent brain tissue structures (adjacent tomogram positions). In contrast, the number of subjects described in the report generated by the HLSTM method increases markedly, more pathological conditions are described, and the generated reports are more diverse; to some extent the method also captures detailed information about the described subject, for example generating the detailed description "submucosal cyst of right maxillary sinus" in a case of paranasal sinusitis.
Overall, in both the quantitative language evaluation and the medical report examples shown in the qualitative analysis, the automatic medical report generation framework proposed by the invention achieves better results and has good prospects for future practical application.
Claims (5)
1. A brain CT medical report generation method based on hierarchical recurrent neural network decoding is characterized by comprising the following steps:
the method comprises two stages of training and predicting,
the training phase comprises the following steps:
(1) making a training data set for brain CT medical report generation and preprocessing it to obtain standardized three-dimensional brain CT images and the corresponding text information, wherein the text information comprises a keyword text and a report text;
(2) completing the encoding of the brain CT image data I with a brain CT image feature extractor to obtain the three-dimensional brain CT encoding features and the brain CT fault-block visual features;
(3) constructing an orientation keyword predictor for left/right binary classification of the brain CT image, whose binary-classification information then assists the generation of the final medical report; the three-dimensional brain CT encoding features are passed through the orientation keyword predictor to obtain semantic features Fs containing left-side and right-side keyword labels, where n represents the number of brain CT images of one case after preprocessing;
(4) constructing a hierarchical recurrent neural network language decoding model, hereinafter referred to as the language model, for generating the brain CT medical report; the language model comprises two parts: a fusion attention module and a hierarchical decoding module; the fusion attention module fuses the semantic features with the visual features; the fused semantic features activate the initial LSTM hidden vectors in the hierarchical decoding module, and the fused visual features guide the LSTM to generate words with the appropriate emphasis; in the training stage, the input of the language model is the visual features, the semantic features Fs, the keyword text and the report text, and the output is the loss function value of step (5), used to optimize the performance of the model;
(5) constructing and training an overall loss function, wherein the overall loss function comprises two parts:
Loss = Loss_sentence + Loss_topic
where x_t denotes the real label, i.e. a word in the keyword text; y_t denotes the real label, i.e. a word in the report text; and p_t denotes the prediction probability of the corresponding word x_t or y_t;
the prediction phase comprises the following steps:
(6) preprocessing a brain CT case to be predicted to obtain a standardized three-dimensional brain CT image and the corresponding keyword text information and medical report;
(7) extracting three-dimensional coding features and fault block visual features of the brain CT image to be predicted by using a trained feature extractor;
(8) generating left and right semantic features with the trained orientation keyword predictor;
(9) fusing the semantic features and the visual features with the trained language model to generate the medical report of the brain CT image to be predicted.
2. The brain CT medical report generation method based on hierarchical recurrent neural network decoding as claimed in claim 1, wherein:
the specific steps in the step (1) are as follows:
step (1.1): acquiring brain CT images to construct a data set, where each patient's data comprises the RGB images I = {I_1, …, I_n} generated from the brain CT scan, a keyword text and a brain CT report, where n represents the number of brain CT sequences in each case, I_i represents the RGB image of the i-th sequence, W and H represent the width and height of a sequence respectively, x_i represents the i-th word in the keyword text, y_i represents the i-th word in the report text, and the lengths of the keyword text and the report text give the number of keywords and the number of words in a report, respectively;
step (1.2) dividing all patient data into a training set, a validation set and a test set; wherein the training set is used for learning parameters of the neural network; the validation set is used for determining the hyper-parameters; the test set is used for verifying the neural network classification effect;
step (1.3) data preprocessing: brain CT scans with no more than 24 sequences are padded using an interpolation algorithm, and brain CT scans with more than 24 sequences are reduced by uniform sampling.
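A minimal sketch of this slice-count normalization, assuming linear interpolation between neighbouring slices for padding and nearest-index uniform sampling for reduction (the patent names the two operations but does not specify the exact interpolation; the function name is illustrative):

```python
import numpy as np

def normalize_slices(volume, target=24):
    """Pad short scans by linear interpolation along the slice axis and
    thin long scans by uniform sampling, so every case has `target` slices.
    volume: array of shape (n, H, W)."""
    n = volume.shape[0]
    if n == target:
        return volume
    # Map the target slice positions onto the original slice axis.
    src = np.linspace(0, n - 1, target)
    if n > target:
        # Uniform sampling: take the nearest original slices.
        idx = np.round(src).astype(int)
        return volume[idx]
    # Interpolation: blend the two neighbouring slices at each position.
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    w = (src - lo)[:, None, None]
    return (1 - w) * volume[lo] + w * volume[hi]
```

Either branch yields a (24, H, W) volume, so the downstream feature extractor always sees a fixed-size input.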
3. The brain CT medical report generation method based on hierarchical recurrent neural network decoding as claimed in claim 1, wherein:
4. The brain CT medical report generation method based on hierarchical recurrent neural network decoding as claimed in claim 1, wherein:
the orientation keyword predictor in step (3) is implemented as a multilayer perceptron with a single hidden layer; the output layer has 2 neurons, i.e. it performs binary classification of the left and right semantic labels, and the value of the left/right semantic information Fs is taken from the hidden-layer neurons of the multilayer perceptron.
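A minimal NumPy sketch of such a predictor; the weight shapes and the ReLU activation are assumptions, and the hidden activations returned as h play the role of the semantic information Fs described above:

```python
import numpy as np

def orientation_predictor(x, W1, b1, W2, b2):
    """One-hidden-layer perceptron with a 2-way (left/right) output.
    Returns the class probabilities and the hidden activations, which
    are reused as the semantic information Fs."""
    h = np.maximum(0.0, x @ W1 + b1)   # hidden layer -> Fs
    logits = h @ W2 + b2               # 2 output neurons: left / right
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum(), h
```

The probabilities drive the left/right classification loss, while h is what the language model consumes.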
5. The brain CT medical report generation method based on hierarchical recurrent neural network decoding as claimed in claim 1, wherein:
the specific steps in the step (4) are as follows:
step (4.1): constructing the guided attention model, which is used in step (4.2); the guided attention model mines the information of one feature tensor that is associated within the space of another, related feature tensor; it further divides into a visual guided attention model VGA and a semantic guided attention model SGA; in the VGA, the query tensor input is the visual features and the key and value tensor inputs are Fs, so the newly computed features are guided by visual information, i.e. the semantic features are reconstructed in the visual feature space; in the SGA, the query tensor input is Fs and the key and value tensor inputs are the visual features, so the newly computed features are guided by semantic information, i.e. the visual features are reconstructed in the semantic feature space;
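Assuming standard scaled dot-product attention (the patent does not give the exact formula), the guided attention step can be sketched as follows, with VGA and SGA differing only in which features are used as the query versus the key/value:

```python
import numpy as np

def guided_attention(query, key, value):
    """Scaled dot-product attention used as a generic 'guiding' step:
    reconstruct the value features in the space addressed by the query.
    VGA: guided_attention(visual, semantic, semantic);
    SGA: guided_attention(semantic, visual, visual)."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    # Row-wise softmax over the key positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ value
```

The output always has as many rows as the query, so SGA yields one reconstructed visual vector per semantic vector, and vice versa for VGA.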
step (4.2): constructing the Fusion Attention Module (FAM): the fusion attention module is implemented with two stacked guided attention models together with a linear transformation, a residual connection and a normalization operation; the fusion attention module is used in step (4.3); when the input is the semantic features Fs and the visual features, the output is the visual fusion feature Fv′ and the semantic fusion feature Fs′, with the specific formula as follows:
where W is the parameter matrix of the neural network's linear transformation, Norm is the Layer Normalization function, and the two candidate terms denote the visual fusion feature candidate and the semantic fusion feature candidate, respectively;
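A sketch of the FAM under stated assumptions: scaled dot-product guided attention for the two stacked passes and one linear map per branch. The exact stacking order and weight shapes appear only as an image in the patent, so every name below is illustrative.

```python
import numpy as np

def _attend(q, k, v):
    # Scaled dot-product attention (the guided attention of step (4.1)).
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def _layer_norm(x, eps=1e-5):
    # Row-wise Layer Normalization (the Norm in the formula).
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def fam(f_v, f_s, W_v, W_s):
    """Fusion Attention Module sketch: two guided-attention passes give
    candidate fusion features; a linear transformation W, a residual
    connection and LayerNorm then produce Fv' and Fs'."""
    cand_v = _attend(f_s, f_v, f_v)  # visual fusion candidate (SGA-guided)
    cand_s = _attend(f_v, f_s, f_s)  # semantic fusion candidate (VGA-guided)
    f_v_prime = _layer_norm(cand_v + cand_v @ W_v)
    f_s_prime = _layer_norm(cand_s + cand_s @ W_s)
    return f_v_prime, f_s_prime
```

Note that each output inherits the row count of its guiding query, which is what lets the decoder attend over one fused vector per semantic (or visual) position.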
and (4.3) constructing a level decoding module: the hierarchical decoding module consists of a keyword recurrent neural network T-LSTM and a descriptive short sentence recurrent neural network S-LSTM;
T-LSTM is characterized by:
1) first, the semantic features Fs and the visual features are fused by the fusion attention model FAM to obtain the reconstructed features Fv′ and Fs′;
2) the T-LSTM hidden layer and first recurrent unit are initialized with the semantic fusion feature Fs′; the visual fusion feature Fv′ performs visual attention calculation with the input word values, so that the T-LSTM focuses on the corresponding visual features when generating the corresponding word value;
3) the hidden state of the T-LSTM at time t is composed of two parts: one is the hidden state output by the T-LSTM at time t−1, and the other is the hidden state returned by the descriptive-sentence recurrent neural network S-LSTM after it generates the t-th descriptive sentence; the specific formula is as follows:
where α is a hyper-parameter taking a value in [0, 1];
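Read together with the two-part description above, a natural reconstruction of the omitted formula is a convex combination weighted by α; this is consistent with the claim's wording but is a reconstruction, not a verbatim copy of the patent's image formula:

```python
import numpy as np

def combine_hidden(h_prev, h_from_s, alpha=0.5):
    """Blend the T-LSTM's own previous hidden state with the state handed
    back by the S-LSTM after it finishes a descriptive clause.
    alpha in [0, 1] is the hyper-parameter from the claim."""
    assert 0.0 <= alpha <= 1.0
    return alpha * h_prev + (1.0 - alpha) * h_from_s
```

With α = 1 the T-LSTM ignores the S-LSTM feedback entirely; with α = 0 it is driven purely by the clause-level state.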
4) the overall input of the T-LSTM is the semantic features Fs, the visual features and the keyword text words; its output is the predicted keyword text words X = {x_1, …, x_t}, where x_t denotes the input word value and the output word prediction at time t;
S-LSTM is characterized in that:
1) first, the keyword x_t is passed through a linear transformation and dimension expansion to obtain the keyword semantic features; the fusion attention model then fuses these semantic features with the visual features to obtain the reconstructed features Fv″ and Fs″;
2) the S-LSTM hidden layer and first recurrent unit are initialized with the semantic fusion feature Fs″; the visual fusion feature Fv″ performs visual attention calculation with the input word values to realize visual-semantic association matching between the brain CT fault-block features and the detailed lesion description clauses;
3) the hidden state of the S-LSTM at time t is composed of the hidden state returned by the previous descriptive clause and the semantic reconstruction feature at time t; the specific formula is as follows:
where β is a hyper-parameter taking a value in [0, 1];
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111548154.0A CN114220516A (en) | 2021-12-17 | 2021-12-17 | Brain CT medical report generation method based on hierarchical recurrent neural network decoding |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114220516A true CN114220516A (en) | 2022-03-22 |
Family
ID=80703490
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111548154.0A Pending CN114220516A (en) | 2021-12-17 | 2021-12-17 | Brain CT medical report generation method based on hierarchical recurrent neural network decoding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114220516A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116631566A (en) * | 2023-05-23 | 2023-08-22 | 重庆邮电大学 | Medical image report intelligent generation method based on big data |
CN117056519A (en) * | 2023-08-17 | 2023-11-14 | 天津大学 | Cross-domain-oriented automatic generation method for comprehensive report of legal opinions |
CN117095187A (en) * | 2023-10-16 | 2023-11-21 | 四川大学 | Meta-learning visual language understanding and positioning method |
CN117095187B (en) * | 2023-10-16 | 2023-12-19 | 四川大学 | Meta-learning visual language understanding and positioning method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||