CN114220516A - Brain CT medical report generation method based on hierarchical recurrent neural network decoding - Google Patents

Brain CT medical report generation method based on hierarchical recurrent neural network decoding

Info

Publication number
CN114220516A
Authority
CN
China
Prior art keywords
brain
semantic
features
visual
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111548154.0A
Other languages
Chinese (zh)
Inventor
张晓丹
胡启鹏
刘颖
王筝
冀俊忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202111548154.0A
Publication of CN114220516A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00: ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/0002: Inspection of images, e.g. flaw detection
    • G06T7/0012: Biomedical image inspection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10072: Tomographic images
    • G06T2207/10081: Computed x-ray tomography [CT]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/30: Subject of image; Context of image processing
    • G06T2207/30004: Biomedical image processing
    • G06T2207/30016: Brain

Abstract

The invention discloses a brain CT medical report generation method based on hierarchical recurrent neural network decoding. The method first acquires brain CT images and the corresponding medical report data and preprocesses them; a feature extractor is constructed to encode the brain CT image data into coding features and fault block visual features; an orientation keyword predictor is constructed to extract orientation keyword semantic features F_s from the brain CT image data; a hierarchical recurrent neural network language model is constructed, which uses the fault block visual features and F_s for hierarchical decoding and generates the medical report sentence by sentence; and the model is trained and optimized. At prediction time, the brain CT to be predicted is preprocessed; the trained feature extractor extracts its coding features and fault block visual features; the trained orientation keyword predictor extracts the semantic features; and the language model generates the predicted medical report sentence by sentence from the coding features, the fault block visual features, and the semantic features.

Description

Brain CT medical report generation method based on hierarchical recurrent neural network decoding
Technical Field
The invention provides a medical report generation method based on hierarchical recurrent neural network decoding, drawing on techniques from the two fields of computer vision and natural language processing and targeting the automatic report generation task in medical image analysis.
Background
The task of automatic medical report generation takes as input a group of brain CT images with a spatial sequence relationship and asks the computer to automatically generate several sentences describing the image content as the corresponding medical report. This requires the computer to possess both an understanding of the brain CT images and the language-organization ability to express their content, and it is currently a research focus in medical image analysis.
With the rise of various artificial intelligence technologies, computer assistance for doctors in medical image analysis has attracted increasing attention. Machine learning plays an important role in tasks such as classification and segmentation; however, traditional machine learning techniques show limited ability when processing large-scale annotated or unlabeled data. Deep learning, one of the most important breakthroughs in artificial intelligence in the last decade, has achieved great success in many computer vision and image processing tasks. Medical image analysis methods and models also benefit from the powerful representation learning capability of deep learning: the related research literature keeps growing, and substantial progress has been made in practical applications such as chest X-ray diagnosis report generation.
Early work on automatic medical report generation was mainly based on models from the image captioning field. Three research approaches currently dominate image captioning: template-based methods, retrieval-based methods, and neural encoder-decoder methods; the last is the most common and is widely applied in fields such as intelligent healthcare. The basic idea of the encoder-decoder approach is to first use an encoder to extract the visual features of an image and then use a decoder to establish the mapping from vision to language and generate the image description. Compared with the first two approaches, it requires neither detailed image annotation nor huge data sets, and the generated descriptions are natural, so it has become the mainstream framework in the image captioning field. Compared with generic image captioning, however, the brain CT medical report generation task has a certain uniqueness in the decoding stage: the corresponding medical report is a long text sequence with a fixed structure and many described subjects.
Disclosure of Invention
In order to exploit the data characteristics of medical reports and fully mine the detailed feature information in the brain CT encoding, the invention builds on related deep learning models from the image captioning field and provides a brain CT medical report generation method based on hierarchical recurrent neural network decoding (HLSTMD). The method effectively mines the visual features matched with different keywords in the brain CT encoding, improves the language performance indices of the finally generated medical report, and produces reports of higher quality. The method mainly comprises an encoder part and a decoder part: the encoder part comprises a feature extractor, and the decoder part comprises an orientation keyword predictor (KWP) and a hierarchical recurrent neural network language decoding model (TTSM).
The technical scheme adopted by the invention is a brain CT medical report generation method based on hierarchical recurrent neural network decoding, which comprises the following steps:
Step (1): a training data set for brain CT medical report generation is constructed and preprocessed to obtain standardized three-dimensional brain CT images and the corresponding text information, where the text information comprises the keyword text K = {k_1, ..., k_{N_k}} and the report text R = {r_1, ..., r_{N_r}}.
Step (1.1): brain CT images are acquired to construct a data set. Each patient's data comprises the RGB images generated from the brain CT images, I = {I_1, ..., I_n}, where each I_i is a W × H × 3 RGB image, the keyword text K = {k_1, ..., k_{N_k}}, and the brain CT report text R = {r_1, ..., r_{N_r}}. Here n denotes the number of brain CT slices in each case, I_i denotes the RGB image of the i-th slice, W and H denote the width and height of a slice, k_i denotes the i-th word in the keyword text, N_k denotes the number of keywords in a report, r_i denotes the i-th word of the report text, and N_r denotes the number of words in a report.
Step (1.2): all patient data are divided into a training set, a validation set and a test set. The training set is used to learn the parameters of the neural network; the validation set is used to determine the hyper-parameters; the test set is used to verify the effect of the trained network.
Step (1.3): data preprocessing. Cases with no more than 24 brain CT slices are completed by an interpolation algorithm, and cases with more than 24 slices are reduced by uniform sampling.
Step (2): a brain CT image feature extractor is used to encode the brain CT image data I. The feature extractor adopts the encoder of the patent "Brain CT medical report generation method based on hierarchical self-attention sequence coding" to extract the three-dimensional coding features of the brain CT images (denoted F_sa below) and the fault block visual features (denoted F_nsa below).
Step (3): an orientation keyword predictor is constructed for left/right binary classification of the brain CT images, and the binary classification information is then used to assist generation of the final medical report. The orientation keyword predictor is implemented as a multi-layer perceptron with a single hidden layer; the output layer has 2 neurons, i.e., it performs binary classification over the left/right semantic labels, and the left/right semantic information F_s takes the values of the hidden-layer neurons of the multi-layer perceptron. The three-dimensional brain CT coding features F_sa are passed through the orientation keyword predictor to obtain the semantic features F_s containing the left/right keyword labels, where n denotes the number of brain CT images of one case after preprocessing.
Step (4): a hierarchical recurrent neural network language decoding model, hereafter simply the language model, is constructed to generate the brain CT medical report. The language model consists of two parts: a fusion attention module (FAM) and a hierarchical decoding module. The fusion attention module fuses the semantic features with the visual features; the fused semantic features are used to activate the initial LSTM hidden vectors in the hierarchical decoding module, and the fused visual features guide the LSTM to generate words with the appropriate emphasis. In the training stage, the inputs of the language model are the visual features F_nsa, the semantic features F_s, the keyword text K and the report text R, and the output value is the loss function value of step (5), which is used to optimize the performance of the model.
Step (4.1): construction of the guided attention model, which is used in step (4.2). The guided attention model mines the information of one feature tensor that is associated with another, related feature tensor space. It can be further divided into a visual guided attention model (VGA) and a semantic guided attention model (SGA). In the VGA the query tensor is the visual feature F_nsa and the key and value tensors are F_s; the newly computed features are guided by the visual information, so the semantic features are reconstructed in the visual feature space. In the SGA the query tensor is F_s and the key and value tensors are F_nsa; the newly computed features are guided by the semantic information, so the visual features are reconstructed in the semantic feature space.
Step (4.2): construction of the fusion attention module (FAM). The fusion attention module is implemented by two stacked guided attention models followed by linear transformation, residual connection and normalization operations. It is used in step (4.3): when the inputs are the semantic features F_s and the visual features F_nsa, the outputs are the visual fusion feature F_v′ and the semantic fusion feature F_s′; when the inputs are the keyword semantic features (obtained from the T-LSTM in step (4.3)) and the visual features F_v′, the outputs are the visual fusion feature F_v″ and the semantic fusion feature F_s″.
Step (4.3): construction of the hierarchical decoding module. The hierarchical decoding module is composed of a keyword recurrent neural network (T-LSTM) and a descriptive short-sentence recurrent neural network (S-LSTM).
The T-LSTM is characterized as follows:
1) First, the fusion attention module FAM fuses the semantic features F_s and the visual features F_nsa to obtain the reconstructed features F_v′ and F_s′.
2) The hidden layer and the first recurrent unit of the T-LSTM are initialized with the semantic fusion feature F_s′, and the visual fusion feature F_v′ is used for visual attention computation with the input word values, so that the T-LSTM focuses on the corresponding visual features when generating the corresponding word values.
3) The hidden state of the T-LSTM at time t, h_t^T, is composed of two parts: the hidden state h_{t-1}^T output by the T-LSTM at time t-1, and the hidden state returned by the descriptive-sentence recurrent neural network S-LSTM after generating the corresponding descriptive sentence.
4) Overall, the inputs of the T-LSTM are the semantic features F_s, the visual features F_nsa and the keyword text words, and the output is the predicted keyword text words X = {x_1, ..., x_t}, where the input word value and the output predicted word at time t are denoted k_t and x_t respectively.
The S-LSTM is characterized as follows:
1) First, the keyword x_t is passed through a linear transformation and dimension expansion to obtain the keyword semantic features F_s^kw; the fusion attention module then fuses F_s^kw with the visual features to obtain the reconstructed features F_v″ and F_s″.
2) The hidden layer and the first recurrent unit of the S-LSTM are initialized with the semantic fusion feature F_s″, and the visual fusion feature F_v″ is used for visual attention computation with the input word values, realizing the visual-semantic association between the brain CT fault block features and the detailed lesion description sentences.
3) The hidden state of the S-LSTM at time t, h_t^S, is composed of the hidden state h_{t-1}^S returned after the previous descriptive sentence and the semantic reconstruction feature F_s″ at time t.
4) Overall, the inputs of the S-LSTM are the semantic features F_s^kw, the visual features and the descriptive-sentence words, and the output is the predicted descriptive-sentence words Y = {y_1, ..., y_t}, where the input word value and the output predicted word at time t are denoted r_t and y_t respectively.
Step (5): the overall loss function is constructed and trained. It comprises two parts:
Loss = Loss_sentence + Loss_topic
where
Loss_topic = -Σ_t log p_t(x_t*), with x_t* the real label of x_t, i.e., the word in the keyword text;
Loss_sentence = -Σ_t log p_t(y_t*), with y_t* the real label of y_t, i.e., the word in the report text; p_t denotes the prediction probability of the corresponding word x_t or y_t.
The prediction phase comprises the following steps:
and (6) preprocessing the brain CT to be predicted to obtain a standardized three-dimensional brain CT image, corresponding keyword text information and a medical report.
And (7) extracting the three-dimensional coding features and the fault block visual features of the brain CT image to be predicted by using the trained feature extractor.
And (8) generating left and right semantic features by using the trained direction keyword predictor.
And (9) fusing the semantic features and the visual features by using the trained language model to generate a medical report of the brain CT image to be predicted.
Advantageous effects
Compared with generic image captioning, the brain CT medical report generation task has a certain uniqueness in the decoding stage: the corresponding medical report is a long text sequence with a fixed structure and many described subjects. While exploiting these text-data characteristics of medical reports, the proposed model can generate the corresponding medical report description sentences from different visual spaces and produce higher-quality brain CT medical reports.
Drawings
FIG. 1: feature extractor
Fig. 2 (a): self-attention network architecture
Fig. 2 (b): attention directing network architecture
FIG. 3: model for fusing attention
FIG. 4: statement decoding model based on hierarchical recurrent neural network
FIG. 5: generated report comparative example
Detailed Description
The specific implementation steps of the invention are explained below, taking 492 cases of data provided by Peking University Third Hospital as an example:
Step (1): acquisition and preprocessing of the brain CT images and the corresponding keyword and report data:
step (1.1) acquires brain CT images to construct a data set, wherein the data set comprises 492 cases of brain CT images with patient sensitive information deleted and corresponding reports, and the image data of each patient comprises a plurality of CT sequences and a corresponding report text. The original brain CT medical image is in a dicom format, is converted into observation views with three scales commonly used by doctors and is used as three channel values of an RGB three-channel color image, the boundary noise CT value-2000 of the image is removed, and finally, the brain CT image data I in a PNG format is obtained1…,IN},
Figure BDA0003416307900000071
Where N represents the number of CT slices per case and W and H represent the width and height, respectively, of each sequence. Then the preprocessing of the corresponding text: removing redundant punctuation marks in the report text data; the terms of the professional nouns are unified; dividing the Chinese words into a plurality of description short sentences by using a 'separation character', dividing the Chinese words into a plurality of description short sentences by using a jieba word segmentation tool to obtain a description dictionary, wherein the size of the dictionary is 244, and finally obtaining the medical report text
Figure BDA0003416307900000072
Finally, the highest frequency words in each short sentence in the report text are combined into a keyword text
Figure BDA0003416307900000073
Figure BDA0003416307900000074
Representing the ith word in the keyword text,
Figure BDA0003416307900000075
represents the number of keywords in a report,
Figure BDA0003416307900000076
the ith word representing the report text,
Figure BDA0003416307900000077
represent in a reportThe number of words.
Step (1.2): the data set is randomly divided into a training set, a validation set and a test set at a ratio of 10:1:1, i.e., 410:41:41 cases.
Step (1.3): data preprocessing. Each case uses 24 brain CT tomographic slices: cases with fewer than 24 slices are completed by an interpolation algorithm, and cases with more than 24 slices are reduced by uniform sampling. Each image is denoised and normalized, and then uniformly resized to 512 × 512.
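For illustration, the slice-count normalization and the window-based RGB conversion described above can be sketched as follows, assuming NumPy/SciPy; the window centers and widths, the replacement value for the -2000 boundary noise, and the function names are assumptions for illustration, not values taken from the patent.

```python
# A minimal preprocessing sketch (assumed windows and helper names, not the patent's exact values).
import numpy as np
from scipy.ndimage import zoom

def window(hu, center, width):
    """Map a Hounsfield-unit slice to [0, 1] under one display window."""
    lo, hi = center - width / 2, center + width / 2
    return np.clip((hu - lo) / (hi - lo), 0.0, 1.0)

def slice_to_rgb(hu_slice):
    """Stack three observation windows as the R, G, B channels of one slice."""
    return np.stack([window(hu_slice, 40, 80),      # brain window (assumed)
                     window(hu_slice, 80, 200),     # subdural window (assumed)
                     window(hu_slice, 600, 2800)],  # bone window (assumed)
                    axis=-1)

def normalize_slice_count(volume, target=24):
    """Pad short series by interpolation, shrink long ones by uniform sampling."""
    n = volume.shape[0]
    if n < target:                       # interpolate along the slice axis
        return zoom(volume, (target / n, 1, 1, 1), order=1)
    idx = np.linspace(0, n - 1, target).round().astype(int)
    return volume[idx]                   # uniform sampling

def preprocess_case(hu_volume, size=512):
    hu_volume = np.where(hu_volume <= -2000, -1024, hu_volume)  # drop boundary noise (assumed fill value)
    rgb = np.stack([slice_to_rgb(s) for s in hu_volume])        # (n, H, W, 3)
    rgb = normalize_slice_count(rgb, 24)
    scale = (1, size / rgb.shape[1], size / rgb.shape[2], 1)
    return zoom(rgb, scale, order=1)                             # (24, 512, 512, 3)
```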
Step (2): the brain CT image feature extractor is used to encode the 492 cases of brain CT image data I. The feature extractor adopts the encoder of the patent "Brain CT medical report generation method based on hierarchical self-attention sequence coding" to extract the three-dimensional coding features F_sa and the fault block visual features F_nsa of the brain CT images. The specific network architecture is shown in FIG. 1; F_sa and F_nsa correspond to the SA Feature and the NSA Feature in FIG. 1, respectively, and the overall mapping of the feature extractor can be written as:
(F_sa, F_nsa) = FeatureExtractor(I)
and (3) constructing an orientation keyword predictor for left and right side two-classification of the brain CT image, and then generating a final medical report by using two-classification information in an auxiliary manner. The orientation keyword predictor is realized by multi-layer perception only comprising a hidden layer, the number of neurons of an output layer is 2, namely two classifications of left and right semantic labels are made, and left and right semantic information FsThe value of (b) is the value of a multi-layered perceptron hidden layer neuron. Three-dimensional brain CT coding features
Figure BDA0003416307900000085
Obtaining a semantic feature containing left and right side keyword labels through an orientation keyword predictor
Figure BDA0003416307900000086
Where n represents the number of brain CT images of one case after preprocessing, we use 24 brain CT images here, so n is 24.
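A minimal sketch of such an orientation keyword predictor, assuming PyTorch; the hidden size and input dimension are hypothetical choices, not values from the patent.

```python
# Orientation keyword predictor: single hidden layer, 2-way left/right output;
# the hidden-layer activations serve as the semantic features F_s.
import torch
import torch.nn as nn

class OrientationKeywordPredictor(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.hidden = nn.Linear(in_dim, hidden_dim)   # hidden layer: its values are F_s
        self.classify = nn.Linear(hidden_dim, 2)      # output layer: left / right labels

    def forward(self, f_sa: torch.Tensor):
        # f_sa: (n, in_dim) coding features of the n (= 24) slices of one case
        f_s = torch.relu(self.hidden(f_sa))           # semantic features F_s, (n, hidden_dim)
        logits = self.classify(f_s)                   # (n, 2) left/right logits
        return f_s, logits

# usage sketch with hypothetical dimensions
f_sa = torch.randn(24, 1024)
f_s, logits = OrientationKeywordPredictor(1024)(f_sa)
```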
Step (4): a hierarchical recurrent neural network language decoding model, hereafter the language model, is constructed to generate the brain CT medical report. The language model consists of two parts: a fusion attention module (FAM) and a hierarchical decoding module. The fusion attention module fuses the semantic features with the visual features; the fused semantic features are used to activate the initial LSTM hidden vectors in the hierarchical decoding module, and the fused visual features guide the LSTM to generate words with the appropriate emphasis. In the training stage, the inputs of the language model are the visual features F_nsa, the semantic features F_s, the keyword text K and the report text R, and the output value is the loss function value of step (5), used to optimize the performance of the model.
Step (4.1): construction of the guided attention model. The guided attention model mines the information of one feature tensor that is associated with another, related feature tensor space; the fusion attention model is built from guided attention models so that, when generating each description sentence, the language model can attend to the associated visual detail features in the brain CT. The self-attention model is shown in FIG. 2(a): the same feature vector is first linearly transformed into a query tensor Q, a key tensor K and a value tensor V; matrix multiplication of Q and K, scaling, a Softmax activation and matrix multiplication with V are then applied in turn, finally yielding attention features that take the dependencies within the vector into account. Guided attention is obtained by modifying self-attention, as shown in FIG. 2(b): the query tensor is replaced by another, related tensor. When the replacing tensor is the visual feature and the value and key tensors are the semantic features, the result is the visual guided attention model VGA; conversely, when the query tensor is the semantic feature and the value and key tensors are the visual features, the result is the semantic guided attention model SGA.
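A minimal sketch of the guided-attention computation described above, assuming PyTorch and a single-head, scaled dot-product formulation; dimensions and class names are illustrative assumptions.

```python
# Guided attention: query from one feature tensor, key/value from the other
# (VGA: query = visual, key/value = semantic; SGA: query = semantic, key/value = visual).
import math
import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, query_feat, guide_feat):
        Q = self.q(query_feat)                          # (Nq, d)
        K = self.k(guide_feat)                          # (Nk, d)
        V = self.v(guide_feat)                          # (Nk, d)
        attn = torch.softmax(Q @ K.t() / math.sqrt(Q.size(-1)), dim=-1)
        return attn @ V                                 # (Nq, d): guide-side values re-weighted per query position

# two instances, one per guidance direction
vga = GuidedAttention(512)   # visual query attends over semantic key/value
sga = GuidedAttention(512)   # semantic query attends over visual key/value
```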
Step (4.2): construction of the fusion attention model (FAM). The FAM is mainly implemented by two stacked guided attention models; as an example, FIG. 3 shows the specific process of computing the fused semantic feature F_s′ with the FAM. The inputs of the model are the visual features F_nsa, of dimension N_v × d, and the semantic features F_s, of dimension N_s × d, where N_v is the number of fault blocks, N_s is the number of left/right labels and d is the feature dimension. First, the SGA model mines the information in the visual features F_nsa that is related to the semantic features F_s, achieving visual feature enhancement; the enhanced features are then used in turn as the visual information of the VGA to guide the semantic features F_s; finally, the semantic features reconstructed by the VGA and SGA models pass through linear transformation, residual connection and normalization, giving the final semantic fusion feature F_s′. Writing the guided attention as GA(query, key, value), the forward propagation of the FAM for the whole process is as follows:
F̂_s′ = VGA(SGA(F_s, F_nsa, F_nsa), F_s, F_s)
F_s′ = Norm(F_s + W_s F̂_s′)
Similarly to the above fusion process, when the visual features are used as the main body of the value tensor the visual fusion feature is generated, and the forward propagation of the FAM is:
F̂_v′ = SGA(VGA(F_nsa, F_s, F_s), F_nsa, F_nsa)
F_v′ = Norm(F_nsa + W_v F̂_v′)
where W_s and W_v are parameter matrices of the neural network's linear transformations and Norm is the layer normalization function (Layer Normalization). F̂_s′ is the candidate semantic fusion feature and F_s′ is the final semantic fusion feature, which is subsequently used to initialize the keyword-generating recurrent neural network; F̂_v′ is the candidate visual fusion feature and F_v′ is the final visual fusion feature, which is subsequently used to compute the visual attention mechanism in the keyword-generating recurrent neural network.
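A minimal sketch of the fusion attention module, assuming PyTorch and the GuidedAttention block sketched above; the exact placement of the linear transformation, residual connection and normalization follows the description but is an assumption, not the patent's verbatim formula.

```python
# FAM: two stacked guided attentions per branch, then linear + residual + layer norm.
import torch
import torch.nn as nn

class FusionAttentionModule(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.sga = GuidedAttention(dim)    # query = semantic, key/value = visual
        self.vga = GuidedAttention(dim)    # query = visual,  key/value = semantic
        self.proj_s = nn.Linear(dim, dim)
        self.proj_v = nn.Linear(dim, dim)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, f_s, f_v):
        # semantic fusion branch: enhance the visual side with SGA, then let it guide the semantics
        enhanced_v = self.sga(f_s, f_v)                  # (N_s, d)
        cand_s = self.vga(enhanced_v, f_s)               # candidate semantic fusion feature
        f_s_fused = self.norm_s(f_s + self.proj_s(cand_s))

        # visual fusion branch: the symmetric stacking, with the visual features as the value-tensor body
        enhanced_s = self.vga(f_v, f_s)                  # (N_v, d)
        cand_v = self.sga(enhanced_s, f_v)               # candidate visual fusion feature
        f_v_fused = self.norm_v(f_v + self.proj_v(cand_v))
        return f_v_fused, f_s_fused                      # F_v', F_s'
```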
Step (4.3): construction of the hierarchical decoding module. The hierarchical decoding module consists of a keyword recurrent neural network T-LSTM and a descriptive short-sentence recurrent neural network S-LSTM; the specific structure is shown in FIG. 4. The T-LSTM generates the keyword of each description sentence, and the S-LSTM generates the corresponding description sentence from that keyword.
The T-LSTM is characterized as follows:
1) First, the fusion attention model (FAM) fuses the semantic features F_s and the visual features F_nsa to obtain the reconstructed features F_v′ and F_s′.
2) The hidden layer and the first recurrent unit of the T-LSTM are initialized with the semantic fusion feature F_s′, and the visual fusion feature F_v′ is used for visual attention computation with the input word values, so that the T-LSTM focuses on the corresponding visual features when generating the corresponding word values.
3) The hidden state of the T-LSTM at time t, h_t^T, is composed of two parts: the hidden state h_{t-1}^T output by the T-LSTM at time t-1, and the hidden state h_{t-1}^S returned by the descriptive-sentence recurrent neural network S-LSTM after generating the preceding descriptive sentence (shown by the downward dashed line in FIG. 4). The specific formula is:
h_t^T = α · h_{t-1}^T + (1 - α) · h_{t-1}^S
where α is a hyper-parameter taking values in [0, 1]. By adjusting α, the T-LSTM can retain or discard, to different degrees, the descriptive-sentence information and the keyword information of the previous moment when generating each keyword. In the actual experiments, a bias is added at the forget gate of the recurrent neural network and initialized to 1, to reduce the forgetting ability of the model at the beginning of training and avoid exploding or vanishing outputs early in training. The specific formula is as follows, where W_fx and W_fh are parameter matrices of the neural network's linear transformation, f_t is the forget gate vector, h_{t-1} is the hidden state vector of the previous time step, x_t is the input vector at time t, and b is the bias:
f_t = σ(W_fx · x_t + W_fh · h_{t-1} + b)
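A minimal sketch of the forget-gate bias initialization and of the hidden-state blending with the hyper-parameter α, assuming PyTorch LSTMCell conventions; α = 0.5 and the layer sizes are illustrative assumptions.

```python
# PyTorch packs LSTM gate parameters as [input, forget, cell, output],
# so the forget-gate slice of each bias vector is [hidden : 2*hidden].
import torch
import torch.nn as nn

def init_forget_bias(cell: nn.LSTMCell, value: float = 1.0):
    h = cell.hidden_size
    for bias in (cell.bias_ih, cell.bias_hh):
        nn.init.zeros_(bias)
        bias.data[h:2 * h].fill_(value)    # forget-gate bias initialized to 1

def blend_hidden(h_topic_prev, h_sent_prev, alpha: float = 0.5):
    """h_t^T = alpha * h_{t-1}^T + (1 - alpha) * hidden state fed back by the S-LSTM."""
    return alpha * h_topic_prev + (1.0 - alpha) * h_sent_prev

t_lstm = nn.LSTMCell(input_size=512, hidden_size=512)
init_forget_bias(t_lstm)
```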
4) Overall, the inputs of the T-LSTM are the semantic features F_s, the visual features F_nsa and the keyword text words, and the output is the predicted keyword text words X = {x_1, ..., x_t}, where the input word value and the output predicted word at time t are denoted k_t and x_t respectively.
The S-LSTM is characterized as follows:
1) First, the keyword x_t is passed through a linear transformation and dimension expansion to obtain the keyword semantic features F_s^kw; the fusion attention model then fuses F_s^kw with the visual features to obtain the reconstructed features F_v″ and F_s″.
2) The hidden layer and the first recurrent unit of the S-LSTM are initialized with the semantic fusion feature F_s″, and the visual fusion feature F_v″ is used for visual attention computation with the input word values, realizing the visual-semantic association between the brain CT fault block features and the detailed lesion description sentences.
3) The hidden state of the S-LSTM at time t, h_t^S, is composed of the hidden state h_{t-1}^S returned after the previous descriptive sentence and the semantic reconstruction feature F_s″ at time t. During initialization, the memory information of the S-LSTM from generating the previous sentence is also introduced; the specific formula is:
h_t^S = β · h_{t-1}^S + (1 - β) · F_s″
where β is a hyper-parameter taking values in [0, 1]. By adjusting β, the S-LSTM can retain or discard, to different degrees, the already generated description sentence and the keyword information of the previous moment when generating each sentence.
4) Overall, the inputs of the S-LSTM are the semantic features F_s^kw, the visual features and the descriptive-sentence words, and the output is the predicted descriptive-sentence words Y = {y_1, ..., y_t}, where the input word value and the output predicted word at time t are denoted r_t and y_t respectively.
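Putting the pieces together, the hierarchical decoding loop can be sketched as follows, assuming PyTorch and reusing the GuidedAttention, FusionAttentionModule and blend_hidden helpers sketched above; greedy decoding, the start-token handling, the mean-pooled visual context and the vocabulary sizes are simplifications for illustration, not the patent's exact procedure.

```python
# Hierarchical decoding: T-LSTM proposes a keyword per sentence, S-LSTM expands it into
# a descriptive sentence, and the S-LSTM hidden state is fed back to the T-LSTM.
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, kw_vocab, word_vocab, dim=512):
        super().__init__()
        self.kw_embed = nn.Embedding(kw_vocab, dim)
        self.word_embed = nn.Embedding(word_vocab, dim)
        self.fam = FusionAttentionModule(dim)
        self.t_lstm = nn.LSTMCell(dim, dim)           # keyword-level LSTM
        self.s_lstm = nn.LSTMCell(dim, dim)           # sentence-level LSTM
        self.kw_out = nn.Linear(dim, kw_vocab)
        self.word_out = nn.Linear(dim, word_vocab)
        self.expand = nn.Linear(dim, dim)             # keyword -> keyword semantic feature

    def forward(self, f_s, f_v, n_sent=6, sent_len=12):
        f_v1, f_s1 = self.fam(f_s, f_v)               # F_v', F_s'
        h_t = c_t = f_s1.mean(0, keepdim=True)        # initialize T-LSTM with fused semantics
        h_s = torch.zeros_like(h_t)
        report = []
        kw = torch.zeros(1, dtype=torch.long)         # hypothetical <start> keyword id
        for _ in range(n_sent):
            h_t = blend_hidden(h_t, h_s)              # feed back the S-LSTM hidden state
            # mean-pooled F_v' stands in for the full visual attention over fault blocks
            h_t, c_t = self.t_lstm(self.kw_embed(kw) + f_v1.mean(0, keepdim=True), (h_t, c_t))
            kw = self.kw_out(h_t).argmax(-1)          # predicted keyword x_t (greedy)
            f_skw = self.expand(self.kw_embed(kw))    # keyword semantic feature
            f_v2, f_s2 = self.fam(f_skw, f_v)         # F_v'', F_s''
            h_s = c_s = f_s2.mean(0, keepdim=True)    # initialize S-LSTM with F_s''
            word, sent = torch.zeros(1, dtype=torch.long), []
            for _ in range(sent_len):
                h_s, c_s = self.s_lstm(self.word_embed(word) + f_v2.mean(0, keepdim=True), (h_s, c_s))
                word = self.word_out(h_s).argmax(-1)  # predicted sentence word y_t (greedy)
                sent.append(word.item())
            report.append(sent)
        return report
```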
Step (5): the overall loss function is defined and the brain CT image report auto-generation network is trained. The overall loss function comprises two parts:
Loss = Loss_sentence + Loss_topic
where
Loss_topic = -Σ_t log p_t(x_t*), with x_t* the real label of x_t, i.e., the word in the keyword text;
Loss_sentence = -Σ_t log p_t(y_t*), with y_t* the real label of y_t, i.e., the word in the report text; p_t denotes the prediction probability of the corresponding word x_t or y_t.
Finally, the network adaptively optimizes the loss between the real report and the predicted report under the Adam optimizer. After training, given an input brain CT image, the model automatically generates a coherent and accurate brain CT medical report; at the same time, thanks to the reconstructed features of the fusion attention mechanism, the model can attend to different visual focal points when generating each description sentence, which reflects the advantage of the method for generating long text sequences.
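A minimal sketch of this two-part loss under teacher forcing, assuming PyTorch; the logit/target tensor names and the padding index are hypothetical.

```python
# Loss = Loss_sentence + Loss_topic, both word-level cross-entropies.
import torch
import torch.nn.functional as F

def report_loss(kw_logits, kw_targets, word_logits, word_targets, pad_idx=0):
    loss_topic = F.cross_entropy(kw_logits.view(-1, kw_logits.size(-1)),
                                 kw_targets.view(-1), ignore_index=pad_idx)
    loss_sentence = F.cross_entropy(word_logits.view(-1, word_logits.size(-1)),
                                    word_targets.view(-1), ignore_index=pad_idx)
    return loss_sentence + loss_topic

# optimisation sketch with Adam, as in the description (learning rate assumed)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```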
To verify the feasibility of the method, quantitative experiments and qualitative result analyses were carried out. The generated reports are evaluated with four language evaluation metrics, BLEU, METEOR, ROUGE-L and CIDEr, which are widely used for quantitative evaluation of generated sentences in image captioning, natural language processing and related fields. FIG. 5 shows examples, including brain CT images, a report example generated by the compared baseline method, and a report example generated by the proposed method.
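As an illustration of the language evaluation, BLEU can be computed as follows, assuming NLTK; the tokenised reference and candidate sentences are made-up examples, and METEOR, ROUGE-L and CIDEr would typically come from packages such as pycocoevalcap.

```python
# BLEU-1..4 for one generated report sentence against one reference (illustrative tokens).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["no", "obvious", "abnormality", "in", "brain", "parenchyma"]]  # example reference
candidate = ["no", "abnormality", "in", "brain", "parenchyma"]               # example prediction

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)
    print(f"BLEU-{n}:", sentence_bleu(reference, candidate,
                                      weights=weights, smoothing_function=smooth))
```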
Table 1 model ablation experiment
An inter-module ablation experiment was first performed. The first row of Table 1 uses a single-layer LSTM to decode the medical report, i.e., the baseline model. The LSTM_S method in the second row uses the binary-classification semantic information F_s′ to initialize the memory unit of the decoder and, in the test stage, as the starting word information of the medical report. In the LSTM_FAM method of the third row, the binary-classification semantic information is replaced by its fusion feature with the fault block visual features, computed by the FAM module. The HLSTM method below the dividing line uses the hierarchical T-LSTM and S-LSTM as the decoder of the medical report. The fifth and sixth rows additionally initialize the recurrent units of HLSTM with the orientation keyword semantic information F_s′ and F_s″ respectively.
From Table 1 it can be seen that merely adding layers to the recurrent neural network does not necessarily improve the final medical report generation. Because the training sample is small, adding a layer of neural network to train the subject-describing keywords introduces more training parameters, and the language evaluation scores at test time are in fact lower, indicating that the extra parameters cause further over-fitting on this data set. However, the newly added recurrent neural network decomposes the final medical report generation into sub-tasks, obtaining for each description sentence its generation process and the capture of the corresponding visual and semantic information; through the fusion of multi-modal information, the language evaluation scores of the model reach or even exceed the performance of the single-layer recurrent neural network. Specifically, for both the single-layer LSTM baseline and the HLSTM method, using the binary-classification semantic information to initialize the language model, and using the left/right labels generated by the orientation keyword predictor as the initial words of the first medical-report sentence in the test stage, slightly improves the final performance, with clear gains in the BLEU-1 and CIDEr scores. In addition, using the FAM module to fuse the left/right keywords F_s′ and the orientation keywords F_s″ with the visually encoded features, and using them to enhance the T-LSTM and the S-LSTM respectively, brings a large improvement in the performance of the final medical report generation. Finally, in the HLSTM_GT experiment, the left/right semantic information and the semantic information output by the T-LSTM are no longer used; instead, the real keyword labels are used directly to generate the FAM fusion information. The result shows that when the quality of the generated keyword labels is very high, the proposed method can effectively improve the performance indices of the report description sentences.
TABLE 2 FAM model ablation experimental results
Experiments with different stacking orders of the FAM model were then carried out to study the effectiveness of the FAM within a certain hypothesis space. The visual fusion (reconstruction) features are F_v′ (F_v″), the semantic fusion (reconstruction) features are F_s′ (F_s″), and the compared variants are named by their stacking order (e.g., SGA and SA_SGA).
As shown above the dividing line in Table 2, when the same semantic guided attention model is used (i.e., the SGA and SA_SGA models), the fitting ability of the final model to the data set is similar and the fluctuation of the evaluation scores is small, indicating that the model's discriminative power over the visual fusion features is clearly stronger than over the semantic fusion features. Meanwhile, through cross fusion with the fault block information, the label detail features contained in the semantic information can be further learned, so that when generating a description sentence whose subject is the corresponding label, the model is more strongly associated with the relevant visual information, further improving the quality of the finally generated medical report.
Finally, the descriptive ability of the model was analysed qualitatively. FIG. 5 compares medical reports generated by the proposed method and by the baseline method. From the generated texts, the medical report generated by the single-layer LSTM method describes fewer subjects, and the described subjects are generally adjacent brain tissue structures (adjacent tomographic positions). The number of subjects described in the medical report generated by the HLSTMD method increases markedly, more pathological conditions are described, the generated reports are more diverse, and detailed information of a described subject can be analysed to a certain extent; for example, in a case of paranasal sinusitis, the detailed description "submucosal cyst of the right maxillary sinus" is generated.
In general, both the quantitative evaluation of language performance and the medical report examples shown in the qualitative analysis indicate that the proposed automatic medical report generation framework achieves good results and has a promising prospect for future practical application.

Claims (5)

1. A brain CT medical report generation method based on hierarchical recurrent neural network decoding is characterized by comprising the following steps:
the method comprises two stages of training and predicting,
the training phase comprises the following steps:
(1) making a training data set for brain CT medical report generation and preprocessing it to obtain standardized three-dimensional brain CT images and the corresponding text information, wherein the text information comprises the keyword text K = {k_1, ..., k_{N_k}} and the report text R = {r_1, ..., r_{N_r}};
(2) using the brain CT image feature extractor to complete the encoding of the brain CT image data I, obtaining the three-dimensional brain CT coding features F_sa and the brain CT fault block visual features F_nsa;
(3) constructing an orientation keyword predictor for left/right binary classification of the brain CT images, and then using the binary classification information to assist generation of the final medical report; the three-dimensional brain CT coding features F_sa are passed through the orientation keyword predictor to obtain the semantic features F_s containing the left/right keyword labels, wherein n represents the number of brain CT images of one case after preprocessing;
(4) constructing a hierarchical recurrent neural network language decoding model, hereinafter the language model, for generating the brain CT medical report; the language model consists of two parts: a fusion attention module and a hierarchical decoding module, wherein the fusion attention module fuses the semantic features with the visual features, the fused semantic features are used to activate the initial LSTM hidden vectors in the hierarchical decoding module, and the fused visual features guide the LSTM to generate words with the appropriate emphasis; in the training stage, the inputs of the language model are the visual features F_nsa, the semantic features F_s, the keyword text K and the report text R, and the output value is the loss function value of step (5), used to optimize the performance of the model;
(5) constructing and training an overall loss function, wherein the overall loss function comprises two parts:
Loss = Loss_sentence + Loss_topic
wherein
Loss_topic = -Σ_t log p_t(x_t*), with x_t* the real label of x_t, i.e., the word in the keyword text;
Loss_sentence = -Σ_t log p_t(y_t*), with y_t* the real label of y_t, i.e., the word in the report text; p_t denotes the prediction probability of the corresponding word x_t or y_t;
the prediction phase comprises the following steps:
(6) preprocessing a brain CT to be predicted to obtain a standardized three-dimensional brain CT image, corresponding keyword text information and a medical report;
(7) extracting three-dimensional coding features and fault block visual features of the brain CT image to be predicted by using a trained feature extractor;
(8) generating left and right semantic features by using the trained direction keyword predictor;
(9) and fusing the semantic features and the visual features by using the trained language model to generate a medical report of the brain CT image to be predicted.
2. The brain CT medical report generation method based on hierarchical recurrent neural network decoding as claimed in claim 1, wherein:
the specific steps in the step (1) are as follows:
step (1.1): acquiring brain CT images to construct a data set, wherein each patient's data comprises the RGB images generated from the brain CT images, I = {I_1, ..., I_n}, each I_i being a W × H × 3 RGB image, the keyword text K = {k_1, ..., k_{N_k}} and the brain CT report text R = {r_1, ..., r_{N_r}}, where n represents the number of brain CT slices in each case, I_i the RGB image of the i-th slice, W and H the width and height of a slice, k_i the i-th word in the keyword text, N_k the number of keywords in a report, r_i the i-th word of the report text, and N_r the number of words in a report;
step (1.2): dividing all patient data into a training set, a validation set and a test set, wherein the training set is used to learn the parameters of the neural network, the validation set is used to determine the hyper-parameters, and the test set is used to verify the effect of the trained network;
step (1.3): data preprocessing: brain CT cases with no more than 24 slices are completed by an interpolation algorithm, and cases with more than 24 slices are reduced by uniform sampling.
3. The brain CT medical report generation method based on hierarchical recurrent neural network decoding as claimed in claim 1, wherein:
the feature extractor in step (2) adopts the encoder of the patent "Brain CT medical report generation method based on hierarchical self-attention sequence coding" to extract the three-dimensional coding features F_sa and the fault block visual features F_nsa of the brain CT images.
4. The brain CT medical report generation method based on hierarchical recurrent neural network decoding as claimed in claim 1, wherein:
the orientation keyword predictor in step (3) is implemented as a multi-layer perceptron with a single hidden layer; the output layer has 2 neurons, i.e., it performs binary classification over the left/right semantic labels, and the left/right semantic information F_s takes the values of the hidden-layer neurons of the multi-layer perceptron.
5. The brain CT medical report generation method based on hierarchical recurrent neural network decoding as claimed in claim 1, wherein:
the specific steps in the step (4) are as follows:
step (4.1): constructing the guided attention model, which is used in step (4.2); the guided attention model is used to mine the information of one feature tensor that is associated with another, related feature tensor space; it can be further divided into a visual guided attention model VGA and a semantic guided attention model SGA, wherein in the VGA the query tensor is the visual feature F_nsa and the key and value tensors are F_s, the newly computed features are guided by the visual information, and the semantic features are thus reconstructed in the visual feature space; in the SGA the query tensor is F_s and the key and value tensors are F_nsa, the newly computed features are guided by the semantic information, and the visual features are thus reconstructed in the semantic feature space;
step (4.2): constructing the fusion attention module (FAM): the fusion attention module is implemented by two stacked guided attention models followed by linear transformation, residual connection and normalization operations; the fusion attention module is used in step (4.3); when the inputs are the semantic features F_s and the visual features F_nsa, the outputs are the visual fusion feature F_v′ and the semantic fusion feature F_s′, with the following specific formulas (writing the guided attention as GA(query, key, value)):
F̂_s′ = VGA(SGA(F_s, F_nsa, F_nsa), F_s, F_s)
F_s′ = Norm(F_s + W_s F̂_s′)
F̂_v′ = SGA(VGA(F_nsa, F_s, F_s), F_nsa, F_nsa)
F_v′ = Norm(F_nsa + W_v F̂_v′)
wherein W_s and W_v are parameter matrices of the neural network's linear transformations, Norm is the Layer Normalization function, and F̂_v′ and F̂_s′ respectively denote the visual fusion feature candidate and the semantic fusion feature candidate;
step (4.3): constructing the hierarchical decoding module: the hierarchical decoding module consists of a keyword recurrent neural network T-LSTM and a descriptive short-sentence recurrent neural network S-LSTM;
the T-LSTM is characterized as follows:
1) first, the fusion attention model FAM fuses the semantic features F_s and the visual features F_nsa to obtain the reconstructed features F_v′ and F_s′;
2) the hidden layer and the first recurrent unit of the T-LSTM are initialized with the semantic fusion feature F_s′, and the visual fusion feature F_v′ is used for visual attention computation with the input word values, so that the T-LSTM focuses on the corresponding visual features when generating the corresponding word values;
3) the hidden state of the T-LSTM at time t, h_t^T, is composed of two parts: the hidden state h_{t-1}^T output by the T-LSTM at time t-1, and the hidden state h_{t-1}^S returned by the descriptive-sentence recurrent neural network S-LSTM after generating the preceding descriptive sentence; the specific formula is:
h_t^T = α · h_{t-1}^T + (1 - α) · h_{t-1}^S
wherein α is a hyper-parameter taking values in [0, 1];
4) overall, the inputs of the T-LSTM are the semantic features F_s, the visual features F_nsa and the keyword text words, and the output is the predicted keyword text words X = {x_1, ..., x_t}, where the input word value and the output predicted word at time t are denoted k_t and x_t respectively;
the S-LSTM is characterized as follows:
1) first, the keyword x_t is passed through a linear transformation and dimension expansion to obtain the keyword semantic features F_s^kw; the fusion attention model then fuses F_s^kw with the visual features to obtain the reconstructed features F_v″ and F_s″;
2) the hidden layer and the first recurrent unit of the S-LSTM are initialized with the semantic fusion feature F_s″, and the visual fusion feature F_v″ is used for visual attention computation with the input word values, realizing the visual-semantic association between the brain CT fault block features and the detailed lesion description sentences;
3) the hidden state of the S-LSTM at time t, h_t^S, is composed of the hidden state h_{t-1}^S returned after the previous descriptive sentence and the semantic reconstruction feature F_s″ at time t; the specific formula is:
h_t^S = β · h_{t-1}^S + (1 - β) · F_s″
wherein β is a hyper-parameter taking values in [0, 1];
4) overall, the inputs of the S-LSTM are the semantic features F_s^kw, the visual features and the descriptive-sentence words, and the output is the predicted descriptive-sentence words Y = {y_1, ..., y_t}, where the input word value and the output predicted word at time t are denoted r_t and y_t respectively.
CN202111548154.0A (priority date 2021-12-17; filing date 2021-12-17): Brain CT medical report generation method based on hierarchical recurrent neural network decoding. Legal status: Pending. Published as CN114220516A.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111548154.0A CN114220516A (en) 2021-12-17 2021-12-17 Brain CT medical report generation method based on hierarchical recurrent neural network decoding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111548154.0A CN114220516A (en) 2021-12-17 2021-12-17 Brain CT medical report generation method based on hierarchical recurrent neural network decoding

Publications (1)

Publication Number Publication Date
CN114220516A true CN114220516A (en) 2022-03-22

Family

ID=80703490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111548154.0A Pending CN114220516A (en) 2021-12-17 2021-12-17 Brain CT medical report generation method based on hierarchical recurrent neural network decoding

Country Status (1)

Country Link
CN (1) CN114220516A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116631566A (en) * 2023-05-23 2023-08-22 重庆邮电大学 Medical image report intelligent generation method based on big data
CN117056519A (en) * 2023-08-17 2023-11-14 天津大学 Cross-domain-oriented automatic generation method for comprehensive report of legal opinions
CN117095187A (en) * 2023-10-16 2023-11-21 四川大学 Meta-learning visual language understanding and positioning method
CN117095187B (en) * 2023-10-16 2023-12-19 四川大学 Meta-learning visual language understanding and positioning method

Similar Documents

Publication Publication Date Title
Ren et al. Cgmvqa: A new classification and generative model for medical visual question answering
Karpathy et al. Deep visual-semantic alignments for generating image descriptions
Ayesha et al. Automatic medical image interpretation: State of the art and future directions
CN114220516A (en) Brain CT medical report generation method based on hierarchical recurrent neural network decoding
Huang et al. Multimodal continuous emotion recognition with data augmentation using recurrent neural networks
CN112614561A (en) Brain CT medical report generation method based on hierarchical self-attention sequence coding
Alami et al. Using unsupervised deep learning for automatic summarization of Arabic documents
Liu et al. Unsupervised temporal video grounding with deep semantic clustering
Islam et al. Exploring video captioning techniques: A comprehensive survey on deep learning methods
CN111666762B (en) Intestinal cancer diagnosis electronic medical record attribute value extraction method based on multitask learning
Bae et al. Flower classification with modified multimodal convolutional neural networks
Hu et al. Advancing medical imaging with language models: A journey from n-grams to chatgpt
Yang et al. Writing by memorizing: Hierarchical retrieval-based medical report generation
CN116779091B (en) Automatic generation method of multi-mode network interconnection and fusion chest image diagnosis report
Pan et al. AMAM: an attention-based multimodal alignment model for medical visual question answering
Guo et al. Implicit discourse relation recognition via a BiLSTM-CNN architecture with dynamic chunk-based max pooling
Polignano et al. A study of Machine Learning models for Clinical Coding of Medical Reports at CodiEsp 2020.
Xu et al. Deep image captioning: A review of methods, trends and future challenges
Lu et al. Sentiment analysis: Comprehensive reviews, recent advances, and open challenges
Dey et al. Deep learning for multimedia content analysis
Hafeth et al. Semantic representations with attention networks for boosting image captioning
CN116843995A (en) Method and device for constructing cytographic pre-training model
CN116881336A (en) Efficient multi-mode contrast depth hash retrieval method for medical big data
Gasimova Automated enriched medical concept generation for chest X-ray images
Dilawari et al. Neural attention model for abstractive text summarization using linguistic feature space

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination