CN114220516A - Brain CT medical report generation method based on hierarchical recurrent neural network decoding - Google Patents

Brain CT medical report generation method based on hierarchical recurrent neural network decoding

Info

Publication number
CN114220516A
Authority
CN
China
Prior art keywords
brain
semantic
features
visual
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111548154.0A
Other languages
Chinese (zh)
Inventor
张晓丹
胡启鹏
刘颖
王筝
冀俊忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202111548154.0A
Publication of CN114220516A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00: ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/0002: Inspection of images, e.g. flaw detection
    • G06T7/0012: Biomedical image inspection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10072: Tomographic images
    • G06T2207/10081: Computed x-ray tomography [CT]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/30: Subject of image; Context of image processing
    • G06T2207/30004: Biomedical image processing
    • G06T2207/30016: Brain

Abstract

The invention discloses a brain CT medical report generation method based on hierarchical recurrent neural network decoding. The method first acquires brain CT images and the corresponding medical report data and preprocesses them; a feature extractor is constructed to encode the brain CT image data into coding features and fault block visual features; an orientation keyword predictor is constructed to extract orientation keyword semantic features F_s from the brain CT image data; a hierarchical recurrent neural network language model is constructed, which uses the fault block visual features and F_s for hierarchical decoding and generates the medical report sentence by sentence; and the model is trained and optimized. At prediction time, the brain CT to be predicted is preprocessed; the trained feature extractor extracts its coding features and fault block visual features; the trained orientation keyword predictor extracts the semantic features; and the language model generates the predicted medical report sentence by sentence from the coding features, the fault block visual features, and the semantic features.

Description

Brain CT medical report generation method based on hierarchical recurrent neural network decoding
Technical Field
The invention provides a medical report generation method based on hierarchical recurrent neural network decoding, drawing on techniques from the two fields of computer vision and natural language processing and targeting the automatic report generation task in medical image analysis.
Background
The task of automatic medical report generation takes as input a group of brain CT images with a spatial sequence relationship and asks the computer to automatically generate several sentences describing the image content as the corresponding medical report. This requires the computer to possess both an understanding of the brain CT images and the language-organization ability to express their content, and it is currently a research focus in medical image analysis.
With the rise of various artificial intelligence technologies, computer assistance for doctors in medical image analysis has attracted increasing attention. Machine learning plays an important role in tasks such as classification and segmentation; however, traditional machine learning techniques show limited ability when processing large-scale annotated or unlabeled data. Deep learning, one of the most important breakthroughs in artificial intelligence in the last decade, has achieved great success in many computer vision and image processing tasks. Medical image analysis methods and models also benefit from the powerful representation learning capability of deep learning: the related research literature keeps growing, and substantial progress has been made in practical applications such as chest X-ray diagnosis report generation.
Early work on automatic medical report generation was mainly based on models from the image captioning field. Three research approaches currently dominate image captioning: template-based methods, retrieval-based methods, and neural encoder-decoder methods; the last is the most common and is widely applied in fields such as intelligent healthcare. The basic idea of the encoder-decoder approach is to first use an encoder to extract the visual features of an image and then use a decoder to establish the mapping from vision to language and generate the image description. Compared with the first two approaches, it requires neither detailed image annotation nor huge data sets, and the generated descriptions are natural, so it has become the mainstream framework in the image captioning field. Compared with generic image captioning, however, the brain CT medical report generation task has a certain uniqueness in the decoding stage: the corresponding medical report is a long text sequence with a fixed structure and many described subjects.
Disclosure of Invention
In order to exploit the data characteristics of medical reports and fully mine the detailed feature information in the brain CT encoding, the invention builds on related deep learning models from the image captioning field and provides a brain CT medical report generation method based on hierarchical recurrent neural network decoding (HLSTMD). The method effectively mines the visual features matched with different keywords in the brain CT encoding, improves the language performance indices of the finally generated medical report, and produces reports of higher quality. The method mainly comprises an encoder part and a decoder part: the encoder part comprises a feature extractor, and the decoder part comprises an orientation keyword predictor (KWP) and a hierarchical recurrent neural network language decoding model (TTSM).
The technical scheme adopted by the invention is a brain CT medical report generation method based on hierarchical recurrent neural network decoding, which comprises the following steps:
Step (1): a training data set for brain CT medical report generation is constructed and preprocessed to obtain standardized three-dimensional brain CT images and the corresponding text information, where the text information comprises the keyword text K = {k_1, ..., k_{N_k}} and the report text R = {r_1, ..., r_{N_r}}.
Step (1.1): brain CT images are acquired to construct a data set. Each patient's data comprises the RGB images generated from the brain CT images, I = {I_1, ..., I_n}, where each I_i is a W × H × 3 RGB image, the keyword text K = {k_1, ..., k_{N_k}}, and the brain CT report text R = {r_1, ..., r_{N_r}}. Here n denotes the number of brain CT slices in each case, I_i denotes the RGB image of the i-th slice, W and H denote the width and height of a slice, k_i denotes the i-th word in the keyword text, N_k denotes the number of keywords in a report, r_i denotes the i-th word of the report text, and N_r denotes the number of words in a report.
Step (1.2): all patient data are divided into a training set, a validation set and a test set. The training set is used to learn the parameters of the neural network; the validation set is used to determine the hyper-parameters; the test set is used to verify the effect of the trained network.
Step (1.3): data preprocessing. Cases with no more than 24 brain CT slices are completed by an interpolation algorithm, and cases with more than 24 slices are reduced by uniform sampling.
Step (2): a brain CT image feature extractor is used to encode the brain CT image data I. The feature extractor adopts the encoder of the patent "Brain CT medical report generation method based on hierarchical self-attention sequence coding" to extract the three-dimensional coding features of the brain CT images (denoted F_sa below) and the fault block visual features (denoted F_nsa below).
Step (3): an orientation keyword predictor is constructed for left/right binary classification of the brain CT images, and the binary classification information is then used to assist generation of the final medical report. The orientation keyword predictor is implemented as a multi-layer perceptron with a single hidden layer; the output layer has 2 neurons, i.e., it performs binary classification over the left/right semantic labels, and the left/right semantic information F_s takes the values of the hidden-layer neurons of the multi-layer perceptron. The three-dimensional brain CT coding features F_sa are passed through the orientation keyword predictor to obtain the semantic features F_s containing the left/right keyword labels, where n denotes the number of brain CT images of one case after preprocessing.
Step (4): a hierarchical recurrent neural network language decoding model, hereafter simply the language model, is constructed to generate the brain CT medical report. The language model consists of two parts: a fusion attention module (FAM) and a hierarchical decoding module. The fusion attention module fuses the semantic features with the visual features; the fused semantic features are used to activate the initial LSTM hidden vectors in the hierarchical decoding module, and the fused visual features guide the LSTM to generate words with the appropriate emphasis. In the training stage, the inputs of the language model are the visual features F_nsa, the semantic features F_s, the keyword text K and the report text R, and the output value is the loss function value of step (5), which is used to optimize the performance of the model.
Step (4.1): construction of the guided attention model, which is used in step (4.2). The guided attention model mines the information of one feature tensor that is associated with another, related feature tensor space. It can be further divided into a visual guided attention model (VGA) and a semantic guided attention model (SGA). In the VGA the query tensor is the visual feature F_nsa and the key and value tensors are F_s; the newly computed features are guided by the visual information, so the semantic features are reconstructed in the visual feature space. In the SGA the query tensor is F_s and the key and value tensors are F_nsa; the newly computed features are guided by the semantic information, so the visual features are reconstructed in the semantic feature space.
Step (4.2): construction of the fusion attention module (FAM). The fusion attention module is implemented by two stacked guided attention models followed by linear transformation, residual connection and normalization operations. It is used in step (4.3): when the inputs are the semantic features F_s and the visual features F_nsa, the outputs are the visual fusion feature F_v′ and the semantic fusion feature F_s′; when the inputs are the keyword semantic features (obtained from the T-LSTM in step (4.3)) and the visual features F_v′, the outputs are the visual fusion feature F_v″ and the semantic fusion feature F_s″.
Step (4.3): construction of the hierarchical decoding module. The hierarchical decoding module is composed of a keyword recurrent neural network (T-LSTM) and a descriptive short-sentence recurrent neural network (S-LSTM).
The T-LSTM is characterized as follows:
1) First, the fusion attention module FAM fuses the semantic features F_s and the visual features F_nsa to obtain the reconstructed features F_v′ and F_s′.
2) The hidden layer and the first recurrent unit of the T-LSTM are initialized with the semantic fusion feature F_s′, and the visual fusion feature F_v′ is used for visual attention computation with the input word values, so that the T-LSTM focuses on the corresponding visual features when generating the corresponding word values.
3) The hidden state of the T-LSTM at time t, h_t^T, is composed of two parts: the hidden state h_{t-1}^T output by the T-LSTM at time t-1, and the hidden state returned by the descriptive-sentence recurrent neural network S-LSTM after generating the corresponding descriptive sentence.
4) Overall, the inputs of the T-LSTM are the semantic features F_s, the visual features F_nsa and the keyword text words, and the output is the predicted keyword text words X = {x_1, ..., x_t}, where the input word value and the output predicted word at time t are denoted k_t and x_t respectively.
The S-LSTM is characterized as follows:
1) First, the keyword x_t is passed through a linear transformation and dimension expansion to obtain the keyword semantic features F_s^kw; the fusion attention module then fuses F_s^kw with the visual features to obtain the reconstructed features F_v″ and F_s″.
2) The hidden layer and the first recurrent unit of the S-LSTM are initialized with the semantic fusion feature F_s″, and the visual fusion feature F_v″ is used for visual attention computation with the input word values, realizing the visual-semantic association between the brain CT fault block features and the detailed lesion description sentences.
3) The hidden state of the S-LSTM at time t, h_t^S, is composed of the hidden state h_{t-1}^S returned after the previous descriptive sentence and the semantic reconstruction feature F_s″ at time t.
4) Overall, the inputs of the S-LSTM are the semantic features F_s^kw, the visual features and the descriptive-sentence words, and the output is the predicted descriptive-sentence words Y = {y_1, ..., y_t}, where the input word value and the output predicted word at time t are denoted r_t and y_t respectively.
Step (5): the overall loss function is constructed and trained. It comprises two parts:
Loss = Loss_sentence + Loss_topic
where
Loss_topic = -Σ_t log p_t(x_t*), with x_t* the real label of x_t, i.e., the word in the keyword text;
Loss_sentence = -Σ_t log p_t(y_t*), with y_t* the real label of y_t, i.e., the word in the report text; p_t denotes the prediction probability of the corresponding word x_t or y_t.
The prediction phase comprises the following steps:
and (6) preprocessing the brain CT to be predicted to obtain a standardized three-dimensional brain CT image, corresponding keyword text information and a medical report.
And (7) extracting the three-dimensional coding features and the fault block visual features of the brain CT image to be predicted by using the trained feature extractor.
And (8) generating left and right semantic features by using the trained direction keyword predictor.
And (9) fusing the semantic features and the visual features by using the trained language model to generate a medical report of the brain CT image to be predicted.
Advantageous effects
Compared with generic image captioning, the brain CT medical report generation task has a certain uniqueness in the decoding stage: the corresponding medical report is a long text sequence with a fixed structure and many described subjects. While exploiting these text-data characteristics of medical reports, the proposed model can generate the corresponding medical report description sentences from different visual spaces and produce higher-quality brain CT medical reports.
Drawings
FIG. 1: feature extractor
Fig. 2 (a): self-attention network architecture
Fig. 2 (b): attention directing network architecture
FIG. 3: model for fusing attention
FIG. 4: statement decoding model based on hierarchical recurrent neural network
FIG. 5: generated report comparative example
Detailed Description
The specific implementation steps of the invention are explained below, taking 492 cases of data provided by Peking University Third Hospital as an example:
Step (1): acquisition and preprocessing of the brain CT images and the corresponding keyword and report data:
step (1.1) acquires brain CT images to construct a data set, wherein the data set comprises 492 cases of brain CT images with patient sensitive information deleted and corresponding reports, and the image data of each patient comprises a plurality of CT sequences and a corresponding report text. The original brain CT medical image is in a dicom format, is converted into observation views with three scales commonly used by doctors and is used as three channel values of an RGB three-channel color image, the boundary noise CT value-2000 of the image is removed, and finally, the brain CT image data I in a PNG format is obtained1…,IN},
Figure BDA0003416307900000071
Where N represents the number of CT slices per case and W and H represent the width and height, respectively, of each sequence. Then the preprocessing of the corresponding text: removing redundant punctuation marks in the report text data; the terms of the professional nouns are unified; dividing the Chinese words into a plurality of description short sentences by using a 'separation character', dividing the Chinese words into a plurality of description short sentences by using a jieba word segmentation tool to obtain a description dictionary, wherein the size of the dictionary is 244, and finally obtaining the medical report text
Figure BDA0003416307900000072
Finally, the highest frequency words in each short sentence in the report text are combined into a keyword text
Figure BDA0003416307900000073
Figure BDA0003416307900000074
Representing the ith word in the keyword text,
Figure BDA0003416307900000075
represents the number of keywords in a report,
Figure BDA0003416307900000076
the ith word representing the report text,
Figure BDA0003416307900000077
represent in a reportThe number of words.
Step (1.2): the data set is randomly divided into a training set, a validation set and a test set at a ratio of 10:1:1, i.e., 410:41:41 cases.
Step (1.3): data preprocessing. Each case uses 24 brain CT tomographic slices: cases with fewer than 24 slices are completed by an interpolation algorithm, and cases with more than 24 slices are reduced by uniform sampling. Each image is denoised and normalized, and then uniformly resized to 512 × 512.
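For illustration, the slice-count normalization and the window-based RGB conversion described above can be sketched as follows, assuming NumPy/SciPy; the window centers and widths, the replacement value for the -2000 boundary noise, and the function names are assumptions for illustration, not values taken from the patent.

```python
# A minimal preprocessing sketch (assumed windows and helper names, not the patent's exact values).
import numpy as np
from scipy.ndimage import zoom

def window(hu, center, width):
    """Map a Hounsfield-unit slice to [0, 1] under one display window."""
    lo, hi = center - width / 2, center + width / 2
    return np.clip((hu - lo) / (hi - lo), 0.0, 1.0)

def slice_to_rgb(hu_slice):
    """Stack three observation windows as the R, G, B channels of one slice."""
    return np.stack([window(hu_slice, 40, 80),      # brain window (assumed)
                     window(hu_slice, 80, 200),     # subdural window (assumed)
                     window(hu_slice, 600, 2800)],  # bone window (assumed)
                    axis=-1)

def normalize_slice_count(volume, target=24):
    """Pad short series by interpolation, shrink long ones by uniform sampling."""
    n = volume.shape[0]
    if n < target:                       # interpolate along the slice axis
        return zoom(volume, (target / n, 1, 1, 1), order=1)
    idx = np.linspace(0, n - 1, target).round().astype(int)
    return volume[idx]                   # uniform sampling

def preprocess_case(hu_volume, size=512):
    hu_volume = np.where(hu_volume <= -2000, -1024, hu_volume)  # drop boundary noise (assumed fill value)
    rgb = np.stack([slice_to_rgb(s) for s in hu_volume])        # (n, H, W, 3)
    rgb = normalize_slice_count(rgb, 24)
    scale = (1, size / rgb.shape[1], size / rgb.shape[2], 1)
    return zoom(rgb, scale, order=1)                             # (24, 512, 512, 3)
```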
Step (2): the brain CT image feature extractor is used to encode the 492 cases of brain CT image data I. The feature extractor adopts the encoder of the patent "Brain CT medical report generation method based on hierarchical self-attention sequence coding" to extract the three-dimensional coding features F_sa and the fault block visual features F_nsa of the brain CT images. The specific network architecture is shown in FIG. 1; F_sa and F_nsa correspond to the SA Feature and the NSA Feature in FIG. 1, respectively, and the overall mapping of the feature extractor can be written as:
(F_sa, F_nsa) = FeatureExtractor(I)
and (3) constructing an orientation keyword predictor for left and right side two-classification of the brain CT image, and then generating a final medical report by using two-classification information in an auxiliary manner. The orientation keyword predictor is realized by multi-layer perception only comprising a hidden layer, the number of neurons of an output layer is 2, namely two classifications of left and right semantic labels are made, and left and right semantic information FsThe value of (b) is the value of a multi-layered perceptron hidden layer neuron. Three-dimensional brain CT coding features
Figure BDA0003416307900000085
Obtaining a semantic feature containing left and right side keyword labels through an orientation keyword predictor
Figure BDA0003416307900000086
Where n represents the number of brain CT images of one case after preprocessing, we use 24 brain CT images here, so n is 24.
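A minimal sketch of such an orientation keyword predictor, assuming PyTorch; the hidden size and input dimension are hypothetical choices, not values from the patent.

```python
# Orientation keyword predictor: single hidden layer, 2-way left/right output;
# the hidden-layer activations serve as the semantic features F_s.
import torch
import torch.nn as nn

class OrientationKeywordPredictor(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.hidden = nn.Linear(in_dim, hidden_dim)   # hidden layer: its values are F_s
        self.classify = nn.Linear(hidden_dim, 2)      # output layer: left / right labels

    def forward(self, f_sa: torch.Tensor):
        # f_sa: (n, in_dim) coding features of the n (= 24) slices of one case
        f_s = torch.relu(self.hidden(f_sa))           # semantic features F_s, (n, hidden_dim)
        logits = self.classify(f_s)                   # (n, 2) left/right logits
        return f_s, logits

# usage sketch with hypothetical dimensions
f_sa = torch.randn(24, 1024)
f_s, logits = OrientationKeywordPredictor(1024)(f_sa)
```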
Step (4): a hierarchical recurrent neural network language decoding model, hereafter the language model, is constructed to generate the brain CT medical report. The language model consists of two parts: a fusion attention module (FAM) and a hierarchical decoding module. The fusion attention module fuses the semantic features with the visual features; the fused semantic features are used to activate the initial LSTM hidden vectors in the hierarchical decoding module, and the fused visual features guide the LSTM to generate words with the appropriate emphasis. In the training stage, the inputs of the language model are the visual features F_nsa, the semantic features F_s, the keyword text K and the report text R, and the output value is the loss function value of step (5), used to optimize the performance of the model.
Step (4.1): construction of the guided attention model. The guided attention model mines the information of one feature tensor that is associated with another, related feature tensor space; the fusion attention model is built from guided attention models so that, when generating each description sentence, the language model can attend to the associated visual detail features in the brain CT. The self-attention model is shown in FIG. 2(a): the same feature vector is first linearly transformed into a query tensor Q, a key tensor K and a value tensor V; matrix multiplication of Q and K, scaling, a Softmax activation and matrix multiplication with V are then applied in turn, finally yielding attention features that take the dependencies within the vector into account. Guided attention is obtained by modifying self-attention, as shown in FIG. 2(b): the query tensor is replaced by another, related tensor. When the replacing tensor is the visual feature and the value and key tensors are the semantic features, the result is the visual guided attention model VGA; conversely, when the query tensor is the semantic feature and the value and key tensors are the visual features, the result is the semantic guided attention model SGA.
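A minimal sketch of the guided-attention computation described above, assuming PyTorch and a single-head, scaled dot-product formulation; dimensions and class names are illustrative assumptions.

```python
# Guided attention: query from one feature tensor, key/value from the other
# (VGA: query = visual, key/value = semantic; SGA: query = semantic, key/value = visual).
import math
import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, query_feat, guide_feat):
        Q = self.q(query_feat)                          # (Nq, d)
        K = self.k(guide_feat)                          # (Nk, d)
        V = self.v(guide_feat)                          # (Nk, d)
        attn = torch.softmax(Q @ K.t() / math.sqrt(Q.size(-1)), dim=-1)
        return attn @ V                                 # (Nq, d): guide-side values re-weighted per query position

# two instances, one per guidance direction
vga = GuidedAttention(512)   # visual query attends over semantic key/value
sga = GuidedAttention(512)   # semantic query attends over visual key/value
```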
Step (4.2): construction of the fusion attention model (FAM). The FAM is mainly implemented by two stacked guided attention models; as an example, FIG. 3 shows the specific process of computing the fused semantic feature F_s′ with the FAM. The inputs of the model are the visual features F_nsa, of dimension N_v × d, and the semantic features F_s, of dimension N_s × d, where N_v is the number of fault blocks, N_s is the number of left/right labels and d is the feature dimension. First, the SGA model mines the information in the visual features F_nsa that is related to the semantic features F_s, achieving visual feature enhancement; the enhanced features are then used in turn as the visual information of the VGA to guide the semantic features F_s; finally, the semantic features reconstructed by the VGA and SGA models pass through linear transformation, residual connection and normalization, giving the final semantic fusion feature F_s′. Writing the guided attention as GA(query, key, value), the forward propagation of the FAM for the whole process is as follows:
F̂_s′ = VGA(SGA(F_s, F_nsa, F_nsa), F_s, F_s)
F_s′ = Norm(F_s + W_s F̂_s′)
Similarly to the above fusion process, when the visual features are used as the main body of the value tensor the visual fusion feature is generated, and the forward propagation of the FAM is:
F̂_v′ = SGA(VGA(F_nsa, F_s, F_s), F_nsa, F_nsa)
F_v′ = Norm(F_nsa + W_v F̂_v′)
where W_s and W_v are parameter matrices of the neural network's linear transformations and Norm is the layer normalization function (Layer Normalization). F̂_s′ is the candidate semantic fusion feature and F_s′ is the final semantic fusion feature, which is subsequently used to initialize the keyword-generating recurrent neural network; F̂_v′ is the candidate visual fusion feature and F_v′ is the final visual fusion feature, which is subsequently used to compute the visual attention mechanism in the keyword-generating recurrent neural network.
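A minimal sketch of the fusion attention module, assuming PyTorch and the GuidedAttention block sketched above; the exact placement of the linear transformation, residual connection and normalization follows the description but is an assumption, not the patent's verbatim formula.

```python
# FAM: two stacked guided attentions per branch, then linear + residual + layer norm.
import torch
import torch.nn as nn

class FusionAttentionModule(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.sga = GuidedAttention(dim)    # query = semantic, key/value = visual
        self.vga = GuidedAttention(dim)    # query = visual,  key/value = semantic
        self.proj_s = nn.Linear(dim, dim)
        self.proj_v = nn.Linear(dim, dim)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, f_s, f_v):
        # semantic fusion branch: enhance the visual side with SGA, then let it guide the semantics
        enhanced_v = self.sga(f_s, f_v)                  # (N_s, d)
        cand_s = self.vga(enhanced_v, f_s)               # candidate semantic fusion feature
        f_s_fused = self.norm_s(f_s + self.proj_s(cand_s))

        # visual fusion branch: the symmetric stacking, with the visual features as the value-tensor body
        enhanced_s = self.vga(f_v, f_s)                  # (N_v, d)
        cand_v = self.sga(enhanced_s, f_v)               # candidate visual fusion feature
        f_v_fused = self.norm_v(f_v + self.proj_v(cand_v))
        return f_v_fused, f_s_fused                      # F_v', F_s'
```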
Step (4.3): construction of the hierarchical decoding module. The hierarchical decoding module consists of a keyword recurrent neural network T-LSTM and a descriptive short-sentence recurrent neural network S-LSTM; the specific structure is shown in FIG. 4. The T-LSTM generates the keyword of each description sentence, and the S-LSTM generates the corresponding description sentence from that keyword.
The T-LSTM is characterized as follows:
1) First, the fusion attention model (FAM) fuses the semantic features F_s and the visual features F_nsa to obtain the reconstructed features F_v′ and F_s′.
2) The hidden layer and the first recurrent unit of the T-LSTM are initialized with the semantic fusion feature F_s′, and the visual fusion feature F_v′ is used for visual attention computation with the input word values, so that the T-LSTM focuses on the corresponding visual features when generating the corresponding word values.
3) The hidden state of the T-LSTM at time t, h_t^T, is composed of two parts: the hidden state h_{t-1}^T output by the T-LSTM at time t-1, and the hidden state h_{t-1}^S returned by the descriptive-sentence recurrent neural network S-LSTM after generating the preceding descriptive sentence (shown by the downward dashed line in FIG. 4). The specific formula is:
h_t^T = α · h_{t-1}^T + (1 - α) · h_{t-1}^S
where α is a hyper-parameter taking values in [0, 1]. By adjusting α, the T-LSTM can retain or discard, to different degrees, the descriptive-sentence information and the keyword information of the previous moment when generating each keyword. In the actual experiments, a bias is added at the forget gate of the recurrent neural network and initialized to 1, to reduce the forgetting ability of the model at the beginning of training and avoid exploding or vanishing outputs early in training. The specific formula is as follows, where W_fx and W_fh are parameter matrices of the neural network's linear transformation, f_t is the forget gate vector, h_{t-1} is the hidden state vector of the previous time step, x_t is the input vector at time t, and b is the bias:
f_t = σ(W_fx · x_t + W_fh · h_{t-1} + b)
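A minimal sketch of the forget-gate bias initialization and of the hidden-state blending with the hyper-parameter α, assuming PyTorch LSTMCell conventions; α = 0.5 and the layer sizes are illustrative assumptions.

```python
# PyTorch packs LSTM gate parameters as [input, forget, cell, output],
# so the forget-gate slice of each bias vector is [hidden : 2*hidden].
import torch
import torch.nn as nn

def init_forget_bias(cell: nn.LSTMCell, value: float = 1.0):
    h = cell.hidden_size
    for bias in (cell.bias_ih, cell.bias_hh):
        nn.init.zeros_(bias)
        bias.data[h:2 * h].fill_(value)    # forget-gate bias initialized to 1

def blend_hidden(h_topic_prev, h_sent_prev, alpha: float = 0.5):
    """h_t^T = alpha * h_{t-1}^T + (1 - alpha) * hidden state fed back by the S-LSTM."""
    return alpha * h_topic_prev + (1.0 - alpha) * h_sent_prev

t_lstm = nn.LSTMCell(input_size=512, hidden_size=512)
init_forget_bias(t_lstm)
```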
4) Overall, the inputs of the T-LSTM are the semantic features F_s, the visual features F_nsa and the keyword text words, and the output is the predicted keyword text words X = {x_1, ..., x_t}, where the input word value and the output predicted word at time t are denoted k_t and x_t respectively.
The S-LSTM is characterized as follows:
1) First, the keyword x_t is passed through a linear transformation and dimension expansion to obtain the keyword semantic features F_s^kw; the fusion attention model then fuses F_s^kw with the visual features to obtain the reconstructed features F_v″ and F_s″.
2) The hidden layer and the first recurrent unit of the S-LSTM are initialized with the semantic fusion feature F_s″, and the visual fusion feature F_v″ is used for visual attention computation with the input word values, realizing the visual-semantic association between the brain CT fault block features and the detailed lesion description sentences.
3) The hidden state of the S-LSTM at time t, h_t^S, is composed of the hidden state h_{t-1}^S returned after the previous descriptive sentence and the semantic reconstruction feature F_s″ at time t. During initialization, the memory information of the S-LSTM from generating the previous sentence is also introduced; the specific formula is:
h_t^S = β · h_{t-1}^S + (1 - β) · F_s″
where β is a hyper-parameter taking values in [0, 1]. By adjusting β, the S-LSTM can retain or discard, to different degrees, the already generated description sentence and the keyword information of the previous moment when generating each sentence.
4) Overall, the inputs of the S-LSTM are the semantic features F_s^kw, the visual features and the descriptive-sentence words, and the output is the predicted descriptive-sentence words Y = {y_1, ..., y_t}, where the input word value and the output predicted word at time t are denoted r_t and y_t respectively.
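Putting the pieces together, the hierarchical decoding loop can be sketched as follows, assuming PyTorch and reusing the GuidedAttention, FusionAttentionModule and blend_hidden helpers sketched above; greedy decoding, the start-token handling, the mean-pooled visual context and the vocabulary sizes are simplifications for illustration, not the patent's exact procedure.

```python
# Hierarchical decoding: T-LSTM proposes a keyword per sentence, S-LSTM expands it into
# a descriptive sentence, and the S-LSTM hidden state is fed back to the T-LSTM.
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, kw_vocab, word_vocab, dim=512):
        super().__init__()
        self.kw_embed = nn.Embedding(kw_vocab, dim)
        self.word_embed = nn.Embedding(word_vocab, dim)
        self.fam = FusionAttentionModule(dim)
        self.t_lstm = nn.LSTMCell(dim, dim)           # keyword-level LSTM
        self.s_lstm = nn.LSTMCell(dim, dim)           # sentence-level LSTM
        self.kw_out = nn.Linear(dim, kw_vocab)
        self.word_out = nn.Linear(dim, word_vocab)
        self.expand = nn.Linear(dim, dim)             # keyword -> keyword semantic feature

    def forward(self, f_s, f_v, n_sent=6, sent_len=12):
        f_v1, f_s1 = self.fam(f_s, f_v)               # F_v', F_s'
        h_t = c_t = f_s1.mean(0, keepdim=True)        # initialize T-LSTM with fused semantics
        h_s = torch.zeros_like(h_t)
        report = []
        kw = torch.zeros(1, dtype=torch.long)         # hypothetical <start> keyword id
        for _ in range(n_sent):
            h_t = blend_hidden(h_t, h_s)              # feed back the S-LSTM hidden state
            # mean-pooled F_v' stands in for the full visual attention over fault blocks
            h_t, c_t = self.t_lstm(self.kw_embed(kw) + f_v1.mean(0, keepdim=True), (h_t, c_t))
            kw = self.kw_out(h_t).argmax(-1)          # predicted keyword x_t (greedy)
            f_skw = self.expand(self.kw_embed(kw))    # keyword semantic feature
            f_v2, f_s2 = self.fam(f_skw, f_v)         # F_v'', F_s''
            h_s = c_s = f_s2.mean(0, keepdim=True)    # initialize S-LSTM with F_s''
            word, sent = torch.zeros(1, dtype=torch.long), []
            for _ in range(sent_len):
                h_s, c_s = self.s_lstm(self.word_embed(word) + f_v2.mean(0, keepdim=True), (h_s, c_s))
                word = self.word_out(h_s).argmax(-1)  # predicted sentence word y_t (greedy)
                sent.append(word.item())
            report.append(sent)
        return report
```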
Step (5): the overall loss function is defined and the brain CT image report auto-generation network is trained. The overall loss function comprises two parts:
Loss = Loss_sentence + Loss_topic
where
Loss_topic = -Σ_t log p_t(x_t*), with x_t* the real label of x_t, i.e., the word in the keyword text;
Loss_sentence = -Σ_t log p_t(y_t*), with y_t* the real label of y_t, i.e., the word in the report text; p_t denotes the prediction probability of the corresponding word x_t or y_t.
Finally, the network adaptively optimizes the loss between the real report and the predicted report under the Adam optimizer. After training, given an input brain CT image, the model automatically generates a coherent and accurate brain CT medical report; at the same time, thanks to the reconstructed features of the fusion attention mechanism, the model can attend to different visual focal points when generating each description sentence, which reflects the advantage of the method for generating long text sequences.
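A minimal sketch of this two-part loss under teacher forcing, assuming PyTorch; the logit/target tensor names and the padding index are hypothetical.

```python
# Loss = Loss_sentence + Loss_topic, both word-level cross-entropies.
import torch
import torch.nn.functional as F

def report_loss(kw_logits, kw_targets, word_logits, word_targets, pad_idx=0):
    loss_topic = F.cross_entropy(kw_logits.view(-1, kw_logits.size(-1)),
                                 kw_targets.view(-1), ignore_index=pad_idx)
    loss_sentence = F.cross_entropy(word_logits.view(-1, word_logits.size(-1)),
                                    word_targets.view(-1), ignore_index=pad_idx)
    return loss_sentence + loss_topic

# optimisation sketch with Adam, as in the description (learning rate assumed)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```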
To verify the feasibility of the method, quantitative experiments and qualitative result analyses were carried out. The generated reports are evaluated with four language evaluation metrics, BLEU, METEOR, ROUGE-L and CIDEr, which are widely used for quantitative evaluation of generated sentences in image captioning, natural language processing and related fields. FIG. 5 shows examples, including brain CT images, a report example generated by the compared baseline method, and a report example generated by the proposed method.
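As an illustration of the language evaluation, BLEU can be computed as follows, assuming NLTK; the tokenised reference and candidate sentences are made-up examples, and METEOR, ROUGE-L and CIDEr would typically come from packages such as pycocoevalcap.

```python
# BLEU-1..4 for one generated report sentence against one reference (illustrative tokens).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["no", "obvious", "abnormality", "in", "brain", "parenchyma"]]  # example reference
candidate = ["no", "abnormality", "in", "brain", "parenchyma"]               # example prediction

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)
    print(f"BLEU-{n}:", sentence_bleu(reference, candidate,
                                      weights=weights, smoothing_function=smooth))
```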
Table 1 model ablation experiment
An inter-module ablation experiment was first performed. The first row of Table 1 uses a single-layer LSTM to decode the medical report, i.e., the baseline model. The LSTM_S method in the second row uses the binary-classification semantic information F_s′ to initialize the memory unit of the decoder and, in the test stage, as the starting word information of the medical report. In the LSTM_FAM method of the third row, the binary-classification semantic information is replaced by its fusion feature with the fault block visual features, computed by the FAM module. The HLSTM method below the dividing line uses the hierarchical T-LSTM and S-LSTM as the decoder of the medical report. The fifth and sixth rows additionally initialize the recurrent units of HLSTM with the orientation keyword semantic information F_s′ and F_s″ respectively.
From Table 1 it can be seen that merely adding layers to the recurrent neural network does not necessarily improve the final medical report generation. Because the training sample is small, adding a layer of neural network to train the subject-describing keywords introduces more training parameters, and the language evaluation scores at test time are in fact lower, indicating that the extra parameters cause further over-fitting on this data set. However, the newly added recurrent neural network decomposes the final medical report generation into sub-tasks, obtaining for each description sentence its generation process and the capture of the corresponding visual and semantic information; through the fusion of multi-modal information, the language evaluation scores of the model reach or even exceed the performance of the single-layer recurrent neural network. Specifically, for both the single-layer LSTM baseline and the HLSTM method, using the binary-classification semantic information to initialize the language model, and using the left/right labels generated by the orientation keyword predictor as the initial words of the first medical-report sentence in the test stage, slightly improves the final performance, with clear gains in the BLEU-1 and CIDEr scores. In addition, using the FAM module to fuse the left/right keywords F_s′ and the orientation keywords F_s″ with the visually encoded features, and using them to enhance the T-LSTM and the S-LSTM respectively, brings a large improvement in the performance of the final medical report generation. Finally, in the HLSTM_GT experiment, the left/right semantic information and the semantic information output by the T-LSTM are no longer used; instead, the real keyword labels are used directly to generate the FAM fusion information. The result shows that when the quality of the generated keyword labels is very high, the proposed method can effectively improve the performance indices of the report description sentences.
TABLE 2 FAM model ablation experimental results
Experiments with different stacking orders of the FAM model were then carried out to study the effectiveness of the FAM within a certain hypothesis space. The visual fusion (reconstruction) features are F_v′ (F_v″), the semantic fusion (reconstruction) features are F_s′ (F_s″), and the compared variants are named by their stacking order (e.g., SGA and SA_SGA).
As shown above the dividing line in Table 2, when the same semantic guided attention model is used (i.e., the SGA and SA_SGA models), the fitting ability of the final model to the data set is similar and the fluctuation of the evaluation scores is small, indicating that the model's discriminative power over the visual fusion features is clearly stronger than over the semantic fusion features. Meanwhile, through cross fusion with the fault block information, the label detail features contained in the semantic information can be further learned, so that when generating a description sentence whose subject is the corresponding label, the model is more strongly associated with the relevant visual information, further improving the quality of the finally generated medical report.
Finally, the descriptive ability of the model was analysed qualitatively. FIG. 5 compares medical reports generated by the proposed method and by the baseline method. From the generated texts, the medical report generated by the single-layer LSTM method describes fewer subjects, and the described subjects are generally adjacent brain tissue structures (adjacent tomographic positions). The number of subjects described in the medical report generated by the HLSTMD method increases markedly, more pathological conditions are described, the generated reports are more diverse, and detailed information of a described subject can be analysed to a certain extent; for example, in a case of paranasal sinusitis, the detailed description "submucosal cyst of the right maxillary sinus" is generated.
In general, both the quantitative evaluation of language performance and the medical report examples shown in the qualitative analysis indicate that the proposed automatic medical report generation framework achieves good results and has a promising prospect for future practical application.

Claims (5)

1. A brain CT medical report generation method based on hierarchical recurrent neural network decoding is characterized by comprising the following steps:
the method comprises two stages of training and predicting,
the training phase comprises the following steps:
(1) making a training data set for brain CT medical report generation and preprocessing it to obtain standardized three-dimensional brain CT images and the corresponding text information, wherein the text information comprises the keyword text K = {k_1, ..., k_{N_k}} and the report text R = {r_1, ..., r_{N_r}};
(2) using the brain CT image feature extractor to complete the encoding of the brain CT image data I, obtaining the three-dimensional brain CT coding features F_sa and the brain CT fault block visual features F_nsa;
(3) constructing an orientation keyword predictor for left/right binary classification of the brain CT images, and then using the binary classification information to assist generation of the final medical report; the three-dimensional brain CT coding features F_sa are passed through the orientation keyword predictor to obtain the semantic features F_s containing the left/right keyword labels, wherein n represents the number of brain CT images of one case after preprocessing;
(4) constructing a hierarchical recurrent neural network language decoding model, hereinafter the language model, for generating the brain CT medical report; the language model consists of two parts: a fusion attention module and a hierarchical decoding module, wherein the fusion attention module fuses the semantic features with the visual features, the fused semantic features are used to activate the initial LSTM hidden vectors in the hierarchical decoding module, and the fused visual features guide the LSTM to generate words with the appropriate emphasis; in the training stage, the inputs of the language model are the visual features F_nsa, the semantic features F_s, the keyword text K and the report text R, and the output value is the loss function value of step (5), used to optimize the performance of the model;
(5) constructing and training an overall loss function, wherein the overall loss function comprises two parts:
Loss = Loss_sentence + Loss_topic
wherein
Loss_topic = -Σ_t log p_t(x_t*), with x_t* the real label of x_t, i.e., the word in the keyword text;
Loss_sentence = -Σ_t log p_t(y_t*), with y_t* the real label of y_t, i.e., the word in the report text; p_t denotes the prediction probability of the corresponding word x_t or y_t;
the prediction phase comprises the following steps:
(6) preprocessing a brain CT to be predicted to obtain a standardized three-dimensional brain CT image, corresponding keyword text information and a medical report;
(7) extracting three-dimensional coding features and fault block visual features of the brain CT image to be predicted by using a trained feature extractor;
(8) generating left and right semantic features by using the trained direction keyword predictor;
(9) and fusing the semantic features and the visual features by using the trained language model to generate a medical report of the brain CT image to be predicted.
2. The brain CT medical report generation method based on hierarchical recurrent neural network decoding as claimed in claim 1, wherein:
the specific steps in the step (1) are as follows:
step (1.1): acquiring brain CT images to construct a data set, wherein each patient's data comprises the RGB images generated from the brain CT images, I = {I_1, ..., I_n}, each I_i being a W × H × 3 RGB image, the keyword text K = {k_1, ..., k_{N_k}} and the brain CT report text R = {r_1, ..., r_{N_r}}, where n represents the number of brain CT slices in each case, I_i the RGB image of the i-th slice, W and H the width and height of a slice, k_i the i-th word in the keyword text, N_k the number of keywords in a report, r_i the i-th word of the report text, and N_r the number of words in a report;
step (1.2): dividing all patient data into a training set, a validation set and a test set, wherein the training set is used to learn the parameters of the neural network, the validation set is used to determine the hyper-parameters, and the test set is used to verify the effect of the trained network;
step (1.3): data preprocessing: brain CT cases with no more than 24 slices are completed by an interpolation algorithm, and cases with more than 24 slices are reduced by uniform sampling.
3. The brain CT medical report generation method based on hierarchical recurrent neural network decoding as claimed in claim 1, wherein:
the feature extractor in step (2) adopts the encoder of the patent "Brain CT medical report generation method based on hierarchical self-attention sequence coding" to extract the three-dimensional coding features F_sa and the fault block visual features F_nsa of the brain CT images.
4. The brain CT medical report generation method based on hierarchical recurrent neural network decoding as claimed in claim 1, wherein:
the orientation keyword predictor in step (3) is implemented as a multi-layer perceptron with a single hidden layer; the output layer has 2 neurons, i.e., it performs binary classification over the left/right semantic labels, and the left/right semantic information F_s takes the values of the hidden-layer neurons of the multi-layer perceptron.
5. The brain CT medical report generation method based on hierarchical recurrent neural network decoding as claimed in claim 1, wherein:
the specific steps in the step (4) are as follows:
step (4.1): constructing the guided attention model, which is used in step (4.2); the guided attention model is used to mine the information of one feature tensor that is associated with another, related feature tensor space; it can be further divided into a visual guided attention model VGA and a semantic guided attention model SGA, wherein in the VGA the query tensor is the visual feature F_nsa and the key and value tensors are F_s, the newly computed features are guided by the visual information, and the semantic features are thus reconstructed in the visual feature space; in the SGA the query tensor is F_s and the key and value tensors are F_nsa, the newly computed features are guided by the semantic information, and the visual features are thus reconstructed in the semantic feature space;
step (4.2): constructing the fusion attention module (FAM): the fusion attention module is implemented by two stacked guided attention models followed by linear transformation, residual connection and normalization operations; the fusion attention module is used in step (4.3); when the inputs are the semantic features F_s and the visual features F_nsa, the outputs are the visual fusion feature F_v′ and the semantic fusion feature F_s′, with the following specific formulas (writing the guided attention as GA(query, key, value)):
F̂_s′ = VGA(SGA(F_s, F_nsa, F_nsa), F_s, F_s)
F_s′ = Norm(F_s + W_s F̂_s′)
F̂_v′ = SGA(VGA(F_nsa, F_s, F_s), F_nsa, F_nsa)
F_v′ = Norm(F_nsa + W_v F̂_v′)
wherein W_s and W_v are parameter matrices of the neural network's linear transformations, Norm is the Layer Normalization function, and F̂_v′ and F̂_s′ respectively denote the visual fusion feature candidate and the semantic fusion feature candidate;
step (4.3): constructing the hierarchical decoding module: the hierarchical decoding module consists of a keyword recurrent neural network T-LSTM and a descriptive short-sentence recurrent neural network S-LSTM;
the T-LSTM is characterized as follows:
1) first, the fusion attention model FAM fuses the semantic features F_s and the visual features F_nsa to obtain the reconstructed features F_v′ and F_s′;
2) the hidden layer and the first recurrent unit of the T-LSTM are initialized with the semantic fusion feature F_s′, and the visual fusion feature F_v′ is used for visual attention computation with the input word values, so that the T-LSTM focuses on the corresponding visual features when generating the corresponding word values;
3) the hidden state of the T-LSTM at time t, h_t^T, is composed of two parts: the hidden state h_{t-1}^T output by the T-LSTM at time t-1, and the hidden state h_{t-1}^S returned by the descriptive-sentence recurrent neural network S-LSTM after generating the preceding descriptive sentence; the specific formula is:
h_t^T = α · h_{t-1}^T + (1 - α) · h_{t-1}^S
wherein α is a hyper-parameter taking values in [0, 1];
4) overall, the inputs of the T-LSTM are the semantic features F_s, the visual features F_nsa and the keyword text words, and the output is the predicted keyword text words X = {x_1, ..., x_t}, where the input word value and the output predicted word at time t are denoted k_t and x_t respectively;
the S-LSTM is characterized as follows:
1) first, the keyword x_t is passed through a linear transformation and dimension expansion to obtain the keyword semantic features F_s^kw; the fusion attention model then fuses F_s^kw with the visual features to obtain the reconstructed features F_v″ and F_s″;
2) the hidden layer and the first recurrent unit of the S-LSTM are initialized with the semantic fusion feature F_s″, and the visual fusion feature F_v″ is used for visual attention computation with the input word values, realizing the visual-semantic association between the brain CT fault block features and the detailed lesion description sentences;
3) the hidden state of the S-LSTM at time t, h_t^S, is composed of the hidden state h_{t-1}^S returned after the previous descriptive sentence and the semantic reconstruction feature F_s″ at time t; the specific formula is:
h_t^S = β · h_{t-1}^S + (1 - β) · F_s″
wherein β is a hyper-parameter taking values in [0, 1];
4) overall, the inputs of the S-LSTM are the semantic features F_s^kw, the visual features and the descriptive-sentence words, and the output is the predicted descriptive-sentence words Y = {y_1, ..., y_t}, where the input word value and the output predicted word at time t are denoted r_t and y_t respectively.
CN202111548154.0A (priority date 2021-12-17; filing date 2021-12-17): Brain CT medical report generation method based on hierarchical recurrent neural network decoding. Legal status: Pending. Published as CN114220516A.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111548154.0A CN114220516A (en) 2021-12-17 2021-12-17 Brain CT medical report generation method based on hierarchical recurrent neural network decoding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111548154.0A CN114220516A (en) 2021-12-17 2021-12-17 Brain CT medical report generation method based on hierarchical recurrent neural network decoding

Publications (1)

Publication Number Publication Date
CN114220516A true CN114220516A (en) 2022-03-22

Family

ID=80703490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111548154.0A Pending CN114220516A (en) 2021-12-17 2021-12-17 Brain CT medical report generation method based on hierarchical recurrent neural network decoding

Country Status (1)

Country Link
CN (1) CN114220516A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116631566A (en) * 2023-05-23 2023-08-22 重庆邮电大学 Medical image report intelligent generation method based on big data
CN117056519A (en) * 2023-08-17 2023-11-14 天津大学 Cross-domain-oriented automatic generation method for comprehensive report of legal opinions
CN117095187A (en) * 2023-10-16 2023-11-21 四川大学 Meta-learning visual language understanding and positioning method
CN117095187B (en) * 2023-10-16 2023-12-19 四川大学 Meta-learning visual language understanding and positioning method

Similar Documents

Publication Publication Date Title
Ren et al. Cgmvqa: A new classification and generative model for medical visual question answering
Karpathy et al. Deep visual-semantic alignments for generating image descriptions
Ayesha et al. Automatic medical image interpretation: State of the art and future directions
CN114220516A (en) Brain CT medical report generation method based on hierarchical recurrent neural network decoding
Huang et al. Multimodal continuous emotion recognition with data augmentation using recurrent neural networks
CN112614561A (en) Brain CT medical report generation method based on hierarchical self-attention sequence coding
Alami et al. Using unsupervised deep learning for automatic summarization of Arabic documents
Liu et al. Unsupervised temporal video grounding with deep semantic clustering
Islam et al. Exploring video captioning techniques: A comprehensive survey on deep learning methods
CN111666762B (en) Intestinal cancer diagnosis electronic medical record attribute value extraction method based on multitask learning
Bae et al. Flower classification with modified multimodal convolutional neural networks
Hu et al. Advancing medical imaging with language models: A journey from n-grams to chatgpt
Yang et al. Writing by memorizing: Hierarchical retrieval-based medical report generation
CN116779091B (en) Automatic generation method of multi-mode network interconnection and fusion chest image diagnosis report
Pan et al. AMAM: an attention-based multimodal alignment model for medical visual question answering
Guo et al. Implicit discourse relation recognition via a BiLSTM-CNN architecture with dynamic chunk-based max pooling
Polignano et al. A study of Machine Learning models for Clinical Coding of Medical Reports at CodiEsp 2020.
Xu et al. Deep image captioning: A review of methods, trends and future challenges
Lu et al. Sentiment analysis: Comprehensive reviews, recent advances, and open challenges
Dey et al. Deep learning for multimedia content analysis
Hafeth et al. Semantic representations with attention networks for boosting image captioning
CN116843995A (en) Method and device for constructing cytographic pre-training model
CN116881336A (en) Efficient multi-mode contrast depth hash retrieval method for medical big data
Gasimova Automated enriched medical concept generation for chest X-ray images
Dilawari et al. Neural attention model for abstractive text summarization using linguistic feature space

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination