CN115293128A - Radiology report generation model training method and system based on multi-modal contrastive learning - Google Patents
Radiology report generation model training method and system based on multi-modal contrastive learning
- Publication number
- CN115293128A CN115293128A CN202210931458.3A CN202210931458A CN115293128A CN 115293128 A CN115293128 A CN 115293128A CN 202210931458 A CN202210931458 A CN 202210931458A CN 115293128 A CN115293128 A CN 115293128A
- Authority
- CN
- China
- Prior art keywords
- sentence
- image
- encoder
- learning
- visual features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H15/00—ICT specially adapted for medical reports, e.g. generation or transmission thereof
Abstract
The invention discloses a radiology report generation model training method and system based on multi-modal contrastive learning, comprising the following steps: sentence representations are learned through a sentence-level training strategy based on contrastive learning, using a self-supervised representation learning method, and image representations are obtained through bidirectional contrastive learning between paired images and texts; the learned image encoder and sentence encoder are then embedded into a radiology report generation model, which generates the Impression part of the report through an encoding-decoding process and recursively generates the Findings part of the report by fusing visual features and semantic features. The method and system optimize the representations of images and texts, realize the radiology report generation task, assist radiologists in making diagnoses, and improve accuracy in the diagnostic process.
Description
Technical Field
The invention relates to a radiology report generation model training method and system based on multi-modal contrastive learning, and belongs to the fields of computing and medicine.
Background
Medical images, such as radiological images, are widely used in the diagnosis of diseases. Reading and understanding medical images is typically performed by professional medical personnel who, by analyzing the images under examination, identify normal and abnormal areas and draw on learned medical knowledge and accumulated work experience to compose radiology reports. However, generated reports may contain errors due to gaps in a radiologist's knowledge, faulty reasoning, staff shortages, excessive workload, and the like. The automated generation of radiology reports has therefore become an attractive research direction for artificial intelligence and clinical medicine, with the potential to reduce the workload of radiologists and minimize the occurrence of errors.
Imaging studies are often accompanied by radiology reports that document the radiologist's observations in routine clinical care; the text, consisting of the Impression and Findings sections, represents the most direct transcription of the imaging study. The Impression part is a conclusive diagnosis and can be regarded as the report's conclusion or topic sentence, while the Findings part is a paragraph consisting of multiple structured sentences, each focusing on a specific medical observation of a certain region in the radiology image. These sentences are typically longer and more complex than the captions in standard image captioning datasets. Many existing image captioning models are therefore not directly suitable for this task, which requires dedicated solutions.
In recent years, much work has explored the automatic generation of radiology reports, and many improvements on traditional generation methods have been proposed. However, existing generation methods generally suffer from several problems: (1) previous studies mostly use CNN encoders pre-trained on ImageNet, which are not well suited to medical images; (2) the sentence representations obtained from the LSTM-based or BERT-based sentence encoders used in previous studies are of low quality; (3) the generated reports are not semantically consistent within the medical domain, and optimizing the clinical accuracy of generated reports comes at the cost of degrading performance on other metrics.
Disclosure of Invention
The purpose of the invention is to realize the radiology report generation task, improve the visual features extracted from medical images and the semantic features extracted from text, and enhance the consistency between images and text.
In order to achieve the above object, one technical solution of the present invention is to provide a method for training a radiology report generation model based on multi-modal contrastive learning, characterized by comprising the following steps:
s100, sample data acquisition:
acquiring radiological images and text data, and transmitting the image data and the corresponding text data to a sample database, wherein each text data comprises a conclusive diagnostic statement Impression and a detailed description paragraph Findings;
s200, multi-modal contrast learning:
using a self-supervised representation learning method, an image encoder for extracting visual features from image data and a sentence encoder for extracting semantic features from text data are learned from the sample database, wherein:
the sentence encoder learns sentence representations through a sentence-level training strategy based on contrastive learning;
the image encoder learns image representations through a bidirectional contrastive learning objective between paired image data and text data;
s300, generating a radiology report:
recursively generating the diagnostic statement Impression and the description paragraph Findings of a radiology report by fusing the visual features extracted by the image encoder and the semantic features extracted by the sentence encoder, specifically comprising the following steps:
S301, generating a single diagnostic statement Impression by an Impression generation module based on an encoder-decoder framework, specifically comprising the following steps:
the image encoder extracts visual features from an input image and then feeds them into the Impression-part sentence decoder, which generates a whole sentence word by word as the diagnostic statement Impression;
S302, the Findings generation module fuses the visual features of the image and the semantic features of the sentences, generates sentences cyclically, and finally produces a long passage containing multiple structured sentences, specifically comprising the following steps:
in the Findings generation module, in order to make each generated sentence focus on describing a different image region, an attention framework is used: the visual features output by the image encoder and the semantic features of the previous sentence output by the sentence encoder are fed into a fully connected layer and then into a softmax layer to obtain weighted visual features; the weighted visual features are fed into the Findings-part sentence decoder, which produces the next sentence; that sentence is passed to the sentence encoder as the new previous sentence, yielding new weighted visual features; this process is repeated until the Findings-part sentence decoder generates an empty sentence, which indicates that the generation of the description paragraph Findings is complete;
wherein the encoding of the previous sentence and the visual features of the image are combined, via the weighted visual features, to guide the generation of the next sentence; the attention network used to compute the weighted visual representation is defined as:
V_w = Attention(v, s)
where V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, v_i ∈ R^{d_v}, are the image features learned by the image encoder, each feature v_i being a d_v-dimensional representation corresponding to one part of the image; s ∈ R^{d_s} is the encoding of the previous sentence, and d_s is the dimension of the semantic features;
given the visual features v and the encoding s of the previous sentence, an attention distribution over the K regions of the image is produced by a single attention distributor, expressed through a single-layer neural network and a softmax function:
z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)
where 1 ∈ R^K is a vector with all elements set to 1; W_v, W_s, and w_h are parameters of the attention network; α ∈ R^K gives the attention weights of the features in v under (v, s);
the weighted visual representation V_w based on the attention distribution is obtained as:
V_w = Σ_{i=1}^{K} α_i v_i
where α_i is the i-th element of α;
s400, report analysis and evaluation:
the generated radiology report is evaluated using the evaluation index:
s500, outputting a result:
combining the separately generated diagnostic statement Impression and description paragraph Findings into a complete radiology report for output, while evaluating the output results using multiple evaluation indexes.
Preferably, in step S100, the image data and the text data in the sample database are in one-to-one correspondence and are divided into a training set, a test set, and a validation set.
Preferably, in step S200, learning sentence representations through the sentence-level training strategy based on contrastive learning comprises the following steps:
for the same sentence in a sentence set, a series of sentence representations of different augmented versions are obtained by applying different data augmentation methods and serve as positive examples in the training, test, or validation set, while the other sentences serve as negative examples;
when training the sentence encoder, the sentence encoder is established and semantic sentence embeddings are constructed by maximizing the agreement between the sentence representations of different augmented versions of the same sample while keeping the sentence vectors of different samples as far apart as possible.
Preferably, in step S200, a sentence set X = {x_1, x_2, …, x_m} is taken, where x_i denotes the i-th sentence and m the total number of sentences in the set; for sentence x_i, two different data augmentation methods f(·) and f′(·) are applied to generate two different versions of the sentence embedding, e_i and e′_i:
e_i = f(x_i)
e′_i = f′(x_i)
where e_i, e′_i ∈ R^{L×D}, L is the length of the sentence embedding, and D is the hidden dimension of the sentence embedding;
the sentence embeddings e_i and e′_i are then encoded to obtain the sentence representations h_i and h′_i.
Preferably, in step S200, learning the image representation through the bidirectional contrastive learning objective between paired images and texts comprises the following steps:
the image encoder is learned from paired inputs (X_v, X_s), where X_v represents an image or a group of images and X_s represents a sequence of sentences describing the imaging information in X_v; each input image X_v and each input sentence sequence X_s are converted by the image encoder f_v(·) and the sentence encoder f_s(·) into fixed-dimension vectors h_v and h_s; the representations h_v and h_s of the two modalities are then projected from their encoder spaces into the same D-dimensional space by projection functions g_v(·) and g_s(·) for contrastive learning;
for N input pairs (X_v, X_s), the corresponding N representation pairs (v, s) are obtained, where:
v = g_v(f_v(X_v))
s = g_s(f_s(X_s))
let (v_i, s_i) denote the i-th representation pair; its training objective comprises two loss functions, an image-to-text contrastive loss l_i^(v→s) and a text-to-image contrastive loss l_i^(s→v), combined as:
L = (1/N) Σ_{i=1}^{N} ( λ l_i^(v→s) + (1 − λ) l_i^(s→v) )
where λ is a scalar weight;
by maximizing the agreement between image-text representation pairs, an image encoder that maps images to a fixed-dimension vector is learned.
Preferably, in step S301, the Impression-part sentence decoder uses an LSTM-based method to generate one word at each time step, conditioned on the context vector, the previous hidden state, and the previously generated word;
the initial hidden state and cell state of the LSTM are set to zero, and the hidden state h_t at time t is modeled as:
h_t = LSTM(x_t, h_{t−1}, m_{t−1})
where x_t is the input vector and m_{t−1} is the memory cell vector at time t − 1.
The visual feature vector, extracted from the last convolutional layer by the image encoder, is used as the initial input of the LSTM to predict the first word of the sentence; before being fed into the LSTM, it is converted by a fully connected layer to the same dimension as the word embeddings, after which the LSTM generates the whole sentence word by word.
Preferably, the step S400 includes:
S401, evaluating the generated report using BLEU and its variants;
S402, evaluating the generated report using METEOR;
S403, evaluating the generated report using ROUGE;
and S404, evaluating the generated report using CIDEr.
Another technical solution of the present invention is to provide a radiology report generation model training system based on multi-modal contrastive learning, characterized by comprising:
the sample database is used for storing the acquired radiological images and text data, wherein each text data comprises a conclusive diagnosis sentence Impression and a detailed description paragraph Findings;
a multi-modal contrastive learning module, which uses a self-supervised representation learning method to learn, from the sample database, an image encoder for extracting visual features from image data and a sentence encoder for extracting semantic features from text data, wherein:
the sentence encoder learns sentence representations through a sentence-level training strategy based on contrastive learning;
the image encoder learns image representations through a bidirectional contrastive learning objective between paired image data and text data;
a radiology report generation module for recursively generating the diagnostic statement Impression and the description paragraph Findings of a radiology report by fusing the visual features extracted by the image encoder and the semantic features extracted by the sentence encoder;
the radiology report generation module further comprises an Impression generation module and a Findings generation module, wherein:
the Impression generation module generates a single diagnostic statement Impression based on an encoder-decoder framework, its implementation comprising the following steps:
the image encoder extracts visual features from an input image and then feeds them into the Impression-part sentence decoder, which generates a whole sentence word by word as the diagnostic statement Impression;
the Findings generation module fuses the visual features of the image and the semantic features of the sentences, generates sentences cyclically, and finally produces a long passage containing multiple structured sentences, its implementation comprising the following steps:
in the Findings generation module, in order to make each generated sentence focus on describing a different image region, an attention framework is used: the visual features output by the image encoder and the semantic features of the previous sentence output by the sentence encoder are fed into a fully connected layer and then into a softmax layer to obtain weighted visual features; the weighted visual features are fed into the Findings-part sentence decoder, which produces the next sentence; that sentence is passed to the sentence encoder as the new previous sentence, yielding new weighted visual features; this process is repeated until the Findings-part sentence decoder generates an empty sentence, which indicates that the generation of the Findings part is complete;
wherein the encoding of the previous sentence and the visual features of the image are combined, via the weighted visual features, to guide the generation of the next sentence; the attention network used to compute the weighted visual representation is defined as:
V_w = Attention(v, s)
where V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, v_i ∈ R^{d_v}, are the image features learned by the image encoder, each feature v_i being a d_v-dimensional representation corresponding to one part of the image; s ∈ R^{d_s} is the encoding of the previous sentence, and d_s is the dimension of the semantic features;
given the visual features v and the encoding s of the previous sentence, an attention distribution over the K regions of the image is produced by a single attention distributor, expressed through a single-layer neural network and a softmax function:
z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)
where 1 ∈ R^K is a vector with all elements set to 1; W_v, W_s, and w_h are parameters of the attention network; α ∈ R^K gives the attention weights of the features in v under (v, s);
the weighted visual representation V_w based on the attention distribution is obtained as:
V_w = Σ_{i=1}^{K} α_i v_i
where α_i is the i-th element of α;
a report analysis evaluation module that evaluates the generated radiology report using the evaluation index;
and a result output module for combining the separately generated diagnostic statement Impression and description paragraph Findings into a complete radiology report for output, while evaluating the output results using multiple evaluation indexes.
Preferably, in the radiology report generation module of the above system, the image encoder is modeled as a CNN; the sentence encoder is modeled as a model obtained by fine-tuning a BERT language model, and semantic representations of sentences are generated through an average pooling layer.
The invention has a reasonable structural design: self-supervised learning is performed using the sample database, radiology images are fed into the system, and corresponding radiology reports are finally generated, assisting radiologists in making diagnoses and greatly improving accuracy in the diagnostic process.
In summary, compared with the prior art, the invention has at least the following advantages:
(1) The invention provides a recursive model based on multi-modal contrastive learning for generating radiology reports. The model combines the visual features of medical images and the semantic features of sentences, and generates the Impression part and the Findings part of a radiology report respectively through a recursive network;
(2) The invention provides a model pre-training method based on multi-modal contrastive learning to improve the expressiveness of visual features and text features;
(3) The invention performs bidirectional contrastive learning using paired medical images and reports and pre-trains the image encoder, so that it can effectively extract visual representations and improve the consistency between image data and text data;
(4) The sentence encoder is built on a sentence-level training objective based on contrastive learning, so that it can construct semantically coherent sentence embeddings for text representation;
(5) The radiology report generation model training method and system based on multi-modal contrastive learning can effectively provide interpretable rationales: self-supervised learning is performed using the sample database, radiology images are fed into the system, and the corresponding radiology report is finally generated, assisting radiologists in decision-making and greatly improving accuracy in the diagnostic process.
Drawings
FIG. 1 is an overall framework diagram of the present invention;
FIG. 2 is sample data according to an embodiment of the present invention;
FIG. 3 is a block diagram of multi-modal contrast learning in accordance with the present invention;
FIG. 4 is a schematic diagram of multi-modal feature fusion in the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
As shown in fig. 1, an embodiment of the present invention provides a method for training a radiology report generation model based on multi-modal contrastive learning, in which a self-supervised representation learning method, namely multi-modal contrastive learning, is embedded in a radiology report generation model to generate a corresponding radiology report for each radiology image. The method comprises the following steps:
s100, sample data acquisition:
the method comprises the steps of collecting a radiological image of the chest of a human body through a nuclear magnetic resonance apparatus, and transmitting image data and corresponding text data to a sample database. Specifically, referring to fig. 2, the image data and the text data in the sample database are in a one-to-one correspondence relationship, and include a training set, a test set, and a verification set. The text data is a radiology report containing a conclusive diagnostic statement Impression and a detailed description paragraph Findings.
S200, multi-modal contrast learning:
Using a self-supervised representation learning method, an image encoder and a sentence encoder are learned from the sample database for extracting visual features from images and semantic features from sentences, respectively. As shown in fig. 3, this specifically comprises the following steps:
S201, sentence representations are learned through a sentence-level training strategy based on contrastive learning.
A sentence set X = {x_1, x_2, …, x_m} is taken, where x_i denotes the i-th sentence and m the total number of sentences in the set. For sentence x_i, two different data augmentation methods f(·) and f′(·) are applied to generate two different versions of the sentence embedding, e_i and e′_i:
e_i = f(x_i)
e′_i = f′(x_i)
where e_i, e′_i ∈ R^{L×D}, L is the length of the sentence embedding and D is the hidden dimension of the sentence embedding.
The sentence embeddings e_i and e′_i are then encoded to obtain the sentence representations h_i and h′_i.
Therefore, for the same sentence, the series of embeddings obtained by applying different data augmentation methods (such as cropping and deletion) are taken as "positive examples", and the other sentences in the same batch as "negative examples". Then, for a mini-batch of N sentences, the training objective l_i for sentence x_i is:
l_i = −log( exp(sim(h_i, h′_i)/τ) / Σ_{j=1}^{N} exp(sim(h_i, h′_j)/τ) )
where sim(·,·) denotes cosine similarity and τ is a temperature parameter. The numerator of the above formula involves the cosine similarity between h_i and h′_i, where h′_i is the positive example; the denominator sums over the similarities between h_i and all h′_j, covering the positive example and all negative examples.
By maximizing the agreement between the different augmented versions of the same sample while keeping the sentence vectors of different samples as far apart as possible, the invention establishes the sentence encoder and constructs semantic sentence embeddings.
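As a concrete illustration of this sentence-level objective, the following NumPy sketch computes the contrastive loss for a small batch. The cosine similarity and the temperature hyperparameter `tau` are standard choices assumed here for illustration, not values fixed by the invention:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sentence_contrastive_loss(H, H_prime, tau=0.05):
    """Average InfoNCE-style loss over a mini-batch of N sentences.

    H[i] and H_prime[i] are representations of two augmented views of
    sentence i (the positive pair); the remaining rows of H_prime act
    as in-batch negatives.
    """
    N = H.shape[0]
    losses = []
    for i in range(N):
        sims = np.array([cosine_sim(H[i], H_prime[j]) for j in range(N)])
        weights = np.exp(sims / tau)
        # Numerator: the positive pair; denominator: positive + negatives.
        losses.append(-np.log(weights[i] / weights.sum()))
    return float(np.mean(losses))
```

Minimizing this loss pulls the two augmented views of each sentence together while pushing apart the representations of different sentences, which is exactly the behaviour described above.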
S202, image representations are learned through a bidirectional contrastive learning objective between paired images and texts.
The image encoder is learned from paired inputs (X_v, X_s), where X_v represents an image or a group of images and X_s represents a sequence of sentences describing the imaging information in X_v. Each input image X_v and each input sentence sequence X_s are converted by the image encoder f_v(·) and the sentence encoder f_s(·) into fixed-dimension vectors h_v and h_s. The representations h_v and h_s of the two modalities are then projected from their encoder spaces into the same D-dimensional space by projection functions g_v(·) and g_s(·) for contrastive learning.
Thus, for N input pairs (X_v, X_s), the corresponding N representation pairs (v, s) can be obtained, where:
v = g_v(f_v(X_v))
s = g_s(f_s(X_s))
Let (v_i, s_i) denote the i-th representation pair; training uses the same medical dataset as the downstream task. The invention uses Info-NCE, a contrastive loss function for self-supervised learning, as the loss function. The training objective for the i-th pair (v_i, s_i) comprises two loss functions, an image-to-text contrastive loss l_i^(v→s) and a text-to-image contrastive loss l_i^(s→v), combined as:
L = (1/N) Σ_{i=1}^{N} ( λ l_i^(v→s) + (1 − λ) l_i^(s→v) )
where λ is a scalar weight.
By maximizing the agreement between image-text representation pairs, the invention learns an image encoder that maps images to a fixed-dimension vector.
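The bidirectional objective can be sketched as follows. This NumPy toy assumes cosine similarity, a temperature `tau`, and the standard Info-NCE form; the specific similarity function and hyperparameter values are implementation choices, not fixed by the invention:

```python
import numpy as np

def info_nce(sims, tau=0.1):
    # sims: N x N similarity matrix whose diagonal holds the positive pairs.
    weights = np.exp(sims / tau)
    return float(np.mean(-np.log(np.diag(weights) / weights.sum(axis=1))))

def bidirectional_contrastive_loss(V, S, lam=0.75, tau=0.1):
    """L = lam * l(v->s) + (1 - lam) * l(s->v), averaged over the batch.

    V[i] and S[i] are the projected image and sentence representations of
    the i-th image-report pair; lam is the scalar weight from the text.
    """
    # Cosine similarity between every image and every sentence projection.
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    sims = Vn @ Sn.T
    loss_v2s = info_nce(sims, tau)      # image-to-text direction
    loss_s2v = info_nce(sims.T, tau)    # text-to-image direction
    return lam * loss_v2s + (1 - lam) * loss_s2v
```

Because both directions share the same similarity matrix (transposed for text-to-image), aligned pairs drive both terms toward zero, which is what "maximizing the correspondence between image-text representation pairs" amounts to here.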
S203, the image encoder and the sentence encoder learned in the above steps are embedded into the radiology report generation model.
S300, generating a radiology report:
recursively generating an Impression part and a Findings part of the radiology report by fusing the visual features extracted by the image encoder and the semantic features extracted by the sentence encoder, comprising the steps of:
S301, a single conclusive diagnostic statement Impression is generated by the Impression generation module, based on a simple encoder-decoder framework.
Specifically, the image encoder first extracts visual features from the input image and then feeds them into the sentence decoder, which generates the whole sentence word by word.
The purpose of the image encoder is to automatically extract visual features from the image, mapping the image into a context vector that serves as the visual input for all subsequent modules; this vector is obtained through multi-modal contrastive pre-training. In particular, the image encoder is parameterized with a fully connected layer, and visual features are extracted from the last convolutional layer. The visual features are then fed into the sentence decoder to generate the Impression part.
The invention adopts an LSTM-based method that generates one word at each time step, conditioned on the context vector, the previous hidden state, and the previously generated word. The initial hidden state and cell state of the LSTM are set to zero, and the hidden state h_t at time t is modeled as:
h_t = LSTM(x_t, h_{t−1}, m_{t−1})
where x_t is the input vector and m_{t−1} is the memory cell vector at time t − 1.
The visual feature vector is used as the initial input of the LSTM to predict the first word of the sentence. Before entering the LSTM, however, a fully connected layer is needed to convert the visual feature vector to the same dimension as the word embeddings. The entire sentence can then be generated word by word.
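A minimal NumPy sketch of this decoding loop is given below. The LSTM parameterization (a single stacked gate matrix `W`, no biases), the stand-in weights `W_out` and `embed`, and the greedy argmax word choice are simplifying assumptions for illustration; the invention only fixes the recurrence h_t = LSTM(x_t, h_{t−1}, m_{t−1}) and the word-by-word generation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, m_prev, W):
    """One LSTM time step: h_t = LSTM(x_t, h_{t-1}, m_{t-1}).

    W packs the four gate weight matrices applied to [x; h_prev];
    biases are omitted for brevity.
    """
    z = W @ np.concatenate([x, h_prev])
    H = h_prev.size
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    m = f * m_prev + i * g          # memory cell update
    h = o * np.tanh(m)              # new hidden state
    return h, m

def greedy_decode(visual_feature, W, W_out, embed, max_len=20, end_id=0):
    """Generate a sentence word by word, starting from the visual feature.

    The visual feature is assumed to already have the word-embedding
    dimension (i.e. after the fully connected conversion layer).
    """
    H = W_out.shape[1]
    h, m = np.zeros(H), np.zeros(H)
    x, words = visual_feature, []
    for _ in range(max_len):
        h, m = lstm_step(x, h, m, W)
        word_id = int(np.argmax(W_out @ h))   # pick the most probable word
        if word_id == end_id:
            break
        words.append(word_id)
        x = embed[word_id]                     # feed the generated word back in
    return words
```

The initial hidden and cell states are zero and the visual feature is the first input, matching the description above; a trained model would additionally use learned biases and usually sample or beam-search rather than take the argmax.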
S302, the Findings generation module fuses the visual features of the image and the semantic features of the sentences, generates sentences cyclically, and finally produces a long passage containing multiple structured sentences.
Specifically, multi-modal feature fusion is shown in fig. 4. The upper half of fig. 4 is the branch producing visual features: the image encoder obtained through contrastive pre-training is modeled as a CNN for extracting visual representations from the input image. The lower half of fig. 4 is the branch producing semantic features: the sentence encoder obtained through contrastive pre-training is modeled as a model fine-tuned from a BERT-style language model, and the semantic representations of sentences are generated through an average pooling layer.
In order to focus the generated sentence on describing different image areas, based on the attention framework, the visual features of the image and the semantic features of the text are input into a fully connected layer and then fed into the SoftMax layer to obtain weighted visual features.
The attention network used to compute the weighted visual representation is defined as:
V_w = Attention(v, s)
where V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, v_i ∈ R^{d_v}, are the image features learned by the image encoder, each feature v_i being a d_v-dimensional representation corresponding to one part of the image; s ∈ R^{d_s} is the encoding of the previous sentence, and d_s is the dimension of the semantic features.
Given the visual features v and the encoding s of the previous sentence, the invention produces an attention distribution over the K regions of the image through a single attention distributor, expressed through a single-layer neural network and a softmax function:
z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)
where 1 ∈ R^K is a vector with all elements set to 1; W_v, W_s, and w_h are parameters of the attention network; α ∈ R^K gives the attention weights of the features in v under (v, s).
The weighted visual representation V_w based on the attention distribution can be obtained as:
V_w = Σ_{i=1}^{K} α_i v_i
where α_i is the i-th element of α.
The input to the sentence decoder is now the weighted visual representation, so the decoder attends to specific regions of the image in order to generate sentences describing different image regions. The learned encoding of the previous sentence and the visual features of the image are combined to guide the generation of the next sentence. This process is repeated until an empty sentence is produced, indicating that generation of the Findings section is complete. In this way, as successive sentences are generated, the model can focus on different areas of the image according to the context of the previous sentence and ensure the coherence and consistency of the medical semantics of the generated report.
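Putting the pieces together, the recursive Findings loop described above can be sketched as follows. Every component is passed in as a callable stand-in; the names `image_encoder`, `sentence_encoder`, `attend`, and `decode_sentence` are illustrative placeholders, not identifiers fixed by the invention:

```python
def generate_findings(image, image_encoder, sentence_encoder,
                      attend, decode_sentence, max_sentences=10):
    """Generate the Findings paragraph sentence by sentence.

    Each iteration attends over the visual features using the previous
    sentence's encoding, decodes one sentence, and stops when the
    decoder emits an empty sentence.
    """
    v = image_encoder(image)            # region-level visual features
    prev = sentence_encoder("")         # encoding of the (empty) previous sentence
    findings = []
    for _ in range(max_sentences):      # safety cap on the recursion
        v_w = attend(v, prev)           # weighted visual representation
        sentence = decode_sentence(v_w)
        if not sentence:                # empty sentence => paragraph finished
            break
        findings.append(sentence)
        prev = sentence_encoder(sentence)
    return " ".join(findings)
```

With trained components, `attend` would be the attention network described above and `decode_sentence` the LSTM-based Findings-part sentence decoder; the control flow, including the empty-sentence stopping condition, is exactly the recursion in the text.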
S400, report analysis and evaluation:
the generated radiology report is evaluated using four common evaluation metrics; a larger metric value indicates better performance of the radiology report generation model. The evaluation specifically comprises the following steps:
S401, evaluating the generated report using BLEU and its variants.

BLEU is a method for automatically evaluating machine translation. Its core idea is n-gram precision, where an n-gram is a run of n consecutive words; according to the n-gram order it can be divided into many variants, the four common ones being BLEU-1, BLEU-2, BLEU-3 and BLEU-4.
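As an illustration of the n-gram idea, the modified n-gram precision at the core of BLEU-n can be computed as follows (a simplified single-reference sketch; full BLEU clips against multiple references, combines several n-gram orders and applies a brevity penalty):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision: candidate n-gram counts are clipped by
    the counts observed in the reference before dividing by the total."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return clipped / total if total else 0.0
```

Clipping prevents a candidate from scoring highly by repeating one reference word many times.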
S402, evaluating the generated report using METEOR.

METEOR is an automatic metric for machine translation evaluation that correlates better with human judgment; unlike BLEU, it takes into account both precision and recall over the whole corpus when computing the final score.
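For illustration, METEOR combines precision and recall with a parameterized harmonic mean that weights recall more heavily than precision (the default α = 0.9 below recovers the original 10PR/(R + 9P) form; the stemming, synonym matching and fragmentation penalty of full METEOR are omitted in this sketch):

```python
def meteor_fmean(precision, recall, alpha=0.9):
    """METEOR's parameterized harmonic mean of precision and recall.

    With alpha = 0.9 this equals 10*P*R / (R + 9*P), the recall-weighted
    F-mean used by the original METEOR metric.
    """
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return precision * recall / (alpha * precision + (1 - alpha) * recall)
```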
S403, evaluating the generated report using ROUGE.

ROUGE is designed to measure the quality of a summary: it measures the similarity between an automatically generated summary and a reference summary and computes a corresponding score.
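ROUGE-L, one common ROUGE variant, measures this similarity via the longest common subsequence of tokens; a minimal single-reference sketch:

```python
def rouge_l(candidate, reference):
    """ROUGE-L F-measure: based on the longest common subsequence (LCS)
    between candidate and reference token sequences."""
    c, r = candidate.split(), reference.split()
    # dp[i][j] = LCS length of c[:i] and r[:j]
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c):
        for j, rw in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if cw == rw else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)
```

Because the LCS preserves word order without requiring contiguity, ROUGE-L rewards fluent sentence-level structure as well as content overlap.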
S404, evaluating the generated report using CIDEr.

CIDEr is a consensus-based image description evaluation metric: it computes, as its score, the cosine similarity between the caption generated by the model and the reference caption.
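The cosine-similarity core of this idea can be sketched as follows (a simplified sketch over raw n-gram counts; full CIDEr applies corpus-level TF-IDF weighting and averages the similarity over n = 1…4):

```python
from collections import Counter
import math

def ngrams(text, n):
    """Counter of n-grams in a whitespace-tokenized, lowercased string."""
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def cosine(c1, c2):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(v * c2[g] for g, v in c1.items())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Identical captions score 1.0 and captions sharing no n-grams score 0.0, with partial overlap falling in between.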
S500, outputting a result:
the Impression part and the Findings part generated separately are combined into a complete radiology report for output, while multiple evaluation metrics are used to evaluate the output results.

The result output module is in signal connection with a display screen and a printer, which realize on-screen display and document printing of the diagnosis report, facilitating analysis of the report by medical personnel.
In this embodiment, the radiological image is acquired by a nuclear magnetic resonance apparatus. Its principle is as follows: the human body is placed in a special magnetic field, and a radio-frequency pulse excites the hydrogen nuclei in the body, causing them to resonate and absorb energy; after the radio-frequency pulse stops, the hydrogen nuclei emit radio signals at a specific frequency, releasing the absorbed energy, which is recorded by a receiver outside the body and processed by a computer to obtain an image.
Claims (9)
1. A model training method for radiology report generation based on multi-modal contrast learning, characterized by comprising the following steps:
s100, sample data acquisition:
acquiring radiological images and text data, and transmitting the image data and the corresponding text data to a sample database, wherein each text data comprises a conclusive diagnosis sentence Impression and a detailed description paragraph Findings;
s200, multimodal contrast learning:
an image encoder for obtaining visual features in image data and a sentence encoder for extracting semantic features in text data are learned from the sample database by adopting a self-supervised representation learning method, wherein:
the sentence encoder learns sentence representations through a training strategy based on sentence levels of contrast learning;
an image encoder learns image representations by a bi-directional contrast learning objective between pairs of imagery data and text data;
s300, generating a radiology report:
recursively generating diagnostic sentences Impression and description paragraphs Findings of a radiology report by fusing visual features extracted by an image encoder and semantic features extracted by a sentence encoder, comprising in particular the steps of:
S301, generating a single diagnostic sentence Impression by the Impression generation module based on an encoder-decoder framework, specifically comprising the following steps:
the image encoder extracts visual features from an input image, then sends the visual features into an Impression partial sentence decoder, and generates a whole sentence word by word as a diagnosis sentence Impression;
S302, fusing the visual features of the image and the semantic features of the sentence by the Findings generation module, generating sentences cyclically, and finally generating a long paragraph containing a plurality of structured sentences, specifically comprising the following steps:
in the Findings generation module, in order to make each generated sentence focus on describing a different image region, based on the attention framework, the visual features output by the image encoder and the semantic features of the previous sentence output by the sentence encoder are input into a fully connected layer and then into a SoftMax layer to obtain weighted visual features; the weighted visual features are input into the Findings partial sentence decoder, which produces the encoding of a sentence; this encoding is fed back into the sentence encoder as the encoding of the previous sentence so as to obtain new weighted visual features; and the process is repeated until the Findings partial sentence decoder generates an empty sentence, indicating that generation of the description paragraph Findings is complete;
wherein the coding of the previous sentence and the visual features of the image are combined by the weighted visual features to guide the generation of the next sentence, the attention network for calculating the weighted visual representation is defined as:
V_w = Attention(v, s)

wherein V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, v_i ∈ R^(d_v), is the set of image features learned by the image encoder, each feature v_i being a d_v-dimensional representation corresponding to one part of the image; s ∈ R^(d_s) is the encoding of the previous sentence, and d_s is the dimension of the semantic features;
given the visual features v and the encoding s of the previous sentence, an attention distribution over the K areas of the image is generated through a single attention distributor, expressed through a single-layer neural network and a SoftMax function:

z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)

in the formula, 1 ∈ R^K is a vector whose elements are all set to 1; W_v, W_s and w_h are the parameters of the attention network; α ∈ R^K contains the attention weights of the features in v given (v, s);
based on the attention distribution, the weighted visual representation V_w is obtained as:

V_w = Σ_{i=1}^{K} α_i v_i

in the formula, α_i is the i-th element of α;
s400, report analysis and evaluation:
the generated radiology report is evaluated using evaluation metrics;
s500, outputting a result:
the diagnosis sentence Impression and the description paragraph Findings generated separately are combined into a complete radiology report for output, while multiple evaluation metrics are used to evaluate the output results.
2. The method for training a radiology report generation model based on multi-modal contrast learning according to claim 1, wherein in step S100, the image data and the text data in the sample database have a one-to-one correspondence, and the sample database comprises a training set, a test set and a validation set.
3. The method as claimed in claim 2, wherein learning sentence representations through a sentence-level training strategy based on contrast learning in step S200 comprises the following steps:

for the same sentence in a sentence set, a series of sentence representations of different augmented versions are obtained by applying different data augmentation methods, serving as positive examples in the training set, test set or validation set, while other sentences serve as negative examples;

when training the sentence encoder, semantic sentence embeddings are established by maximizing the agreement between the sentence representations of different augmented versions of the same sample while keeping the sentence vectors of different samples as far apart as possible.
4. The method of claim 3, wherein in step S200, a sentence set X = {x_1, x_2, …, x_m} is selected, where x_i denotes the i-th sentence and m denotes the total number of sentences in the set; for sentence x_i, two different data augmentation methods f(·) and f′(·) are applied to generate two different versions of sentence embeddings e_i and e′_i:

e_i = f(x_i)
e′_i = f′(x_i)

in the formula, e_i, e′_i ∈ R^(L×D), L is the length of the sentence embedding, and D is the hidden dimension of the sentence embedding;

then, the sentence embeddings e_i and e′_i are encoded to obtain the sentence representations h_i and h′_i;

then, for a mini-batch of N sentences, the training objective l_i for sentence x_i is:

l_i = −log( exp(sim(h_i, h′_i)/τ) / Σ_{j=1}^{N} exp(sim(h_i, h′_j)/τ) )

where sim(·,·) is a similarity function between sentence representations and τ is a temperature parameter.
5. The multi-modal contrast learning radiology report generation model training method of claim 1, wherein learning image representations through bidirectional contrast learning objectives between image-text pairs in step S200 comprises the following steps:

an image encoder is learned from paired inputs (X_v, X_s), where X_v represents an image or a group of images and X_s represents a sentence sequence describing the imaging information of X_v; each input image X_v and each input sentence X_s are converted by the image encoder f_v(·) and the sentence encoder f_s(·) into fixed-dimension vectors h_v and h_s; then the representations h_v and h_s of the two modalities are projected from their encoder spaces into the same d-dimensional space by the projection functions g_v(·) and g_s(·) for contrast learning;
for N input pairs (X_v, X_s), the corresponding N representation pairs (v, s) are obtained, wherein:

v = g_v(f_v(X_v))
s = g_s(f_s(X_s))

let (v_i, s_i) denote the i-th representation pair; its training objective comprises two loss functions, the image-to-text contrastive loss l_i^(v→s) and the text-to-image contrastive loss l_i^(s→v):

l_i^(v→s) = −log( exp(⟨v_i, s_i⟩/τ) / Σ_{k=1}^{N} exp(⟨v_i, s_k⟩/τ) )
l_i^(s→v) = −log( exp(⟨s_i, v_i⟩/τ) / Σ_{k=1}^{N} exp(⟨s_i, v_k⟩/τ) )
L = (1/N) Σ_{i=1}^{N} ( λ l_i^(v→s) + (1 − λ) l_i^(s→v) )

wherein λ is a scalar weight and τ is a temperature parameter;
by maximizing the correspondence between image-text representation pairs, an image encoder is learned that maps images to a fixed-dimension vector.
6. The method as claimed in claim 1, wherein in step S301, the Impression partial sentence decoder is LSTM-based and, at each time step, generates one word of the caption according to the context vector, the previous hidden state and the previously generated words;
the initial hidden state and cell state of the LSTM are set to zero, and the hidden state h_t at time t is modeled as:

h_t = LSTM(x_t, h_{t−1}, m_{t−1})

wherein x_t is the input vector and m_{t−1} is the memory cell vector at time t−1;
the visual feature vector is used as the initial input of the LSTM to predict the first word of the sentence; before being input to the LSTM, the visual feature vector output by the image encoder is converted by a fully connected layer into the same dimension as the word embeddings; the visual features are extracted from the last convolutional layer, and the whole sentence is then generated word by word.
7. The method as claimed in claim 1, wherein the step S400 includes:
S401, evaluating the generated report by using BLEU and its variants;
s402, evaluating the generated report by using METEOR;
s403, evaluating the generated report by using the ROUGE;
and S404, evaluating the generated report by using CIDER.
8. A radiology report generation model training system based on multimodal contrast learning, comprising:
the sample database is used for storing the acquired radiological images and text data, wherein each text data comprises a conclusive diagnosis sentence Impression and a detailed description paragraph Findings;
the multi-modal contrast learning module adopts a self-supervised representation learning method and learns, based on the sample database, an image encoder for obtaining visual features in image data and a sentence encoder for extracting semantic features in text data, wherein:
the sentence encoder learns sentence representations through a training strategy based on sentence levels of contrast learning;
an image encoder learns image representations by a bi-directional contrast learning objective between pairs of imagery data and text data;
a radiology report generation module for recursively generating the diagnosis sentence Impression and the description paragraph Findings of a radiology report by fusing the visual features extracted by the image encoder and the semantic features extracted by the sentence encoder;

the radiology report generation module further comprises an Impression generation module and a Findings generation module, wherein:

the Impression generation module generates a single diagnostic sentence Impression based on an encoder-decoder framework, and its implementation comprises the following steps:
the image encoder extracts visual features from an input image, then sends the visual features into an Impression partial sentence decoder, and generates a whole sentence word by word as a diagnosis sentence Impression;
the method comprises the following steps of fusing visual features of an image and semantic features of a sentence by a Findings generation module, generating the sentence in a circulating manner, and finally generating a long section containing a plurality of structural sentences, wherein the Findings generation module comprises the following steps:
in a Findings generation module, in order to make the generated sentence focus on describing different image regions, based on the attention framework, the visual features output by the image encoder and the semantic features of the previous sentence output by the sentence encoder are input into a full link layer and then into a SoftMax layer to obtain weighted visual features, the weighted visual features are input into a Findings partial sentence decoder, the coding of the sentence is obtained by the Findings partial sentence decoder, the coding of the sentence is used as the coding of the previous sentence input to the sentence encoder, which is input to the sentence encoder, so as to obtain new weighted visual features, and the process is repeated until the Findings partial sentence decoder generates an empty sentence, which indicates that the Findings partial sentence generation is completed;
wherein the coding of the previous sentence and the visual features of the image are combined by the weighted visual features to guide the generation of the next sentence, the attention network for calculating the weighted visual representation is defined as:
V_w = Attention(v, s)

wherein V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, v_i ∈ R^(d_v), is the set of image features learned by the image encoder, each feature v_i being a d_v-dimensional representation corresponding to one part of the image; s ∈ R^(d_s) is the encoding of the previous sentence, and d_s is the dimension of the semantic features;
given the visual features v and the encoding s of the previous sentence, an attention distribution over the K areas of the image is generated through a single attention distributor, expressed through a single-layer neural network and a SoftMax function:

z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)

in the formula, 1 ∈ R^K is a vector whose elements are all set to 1; W_v, W_s and w_h are the parameters of the attention network; α ∈ R^K contains the attention weights of the features in v given (v, s);
based on the attention distribution, the weighted visual representation V_w is obtained as:

V_w = Σ_{i=1}^{K} α_i v_i

in the formula, α_i is the i-th element of α;
a report analysis and evaluation module, which evaluates the generated radiology report using evaluation metrics;

a result output module for combining the separately generated diagnosis sentence Impression and description paragraph Findings into a complete radiology report for output, while using multiple evaluation metrics to evaluate the output results.
9. The multi-modal contrast learning-based radiology report generation model training system of claim 8, wherein in the radiology report generation module, the image encoder is modeled as a CNN, and the sentence encoder is modeled as a fine-tuned BERT language model, with the semantic representation of a sentence generated by average pooling.
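The bidirectional contrastive objective recited in claims 4 and 5 can be sketched as follows (an illustrative sketch: the similarity matrix `S` between image and sentence projections is assumed precomputed, and the function names and the default values for the scalar weight `lam` and temperature `tau` are placeholders):

```python
import math

def info_nce(sim_row, i, tau=0.1):
    """InfoNCE term: negative log-softmax of the i-th (positive) similarity
    in a row of similarities against all candidates in the batch."""
    logits = [s / tau for s in sim_row]
    m = max(logits)                                      # stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[i] - log_z)

def bidirectional_loss(S, lam=0.75, tau=0.1):
    """Weighted sum of image->text and text->image contrastive losses.

    S[i][j] is the similarity between image representation v_i and
    sentence representation s_j; the diagonal entries are the true pairs.
    """
    N = len(S)
    l_v2s = sum(info_nce(S[i], i, tau) for i in range(N)) / N              # rows
    l_s2v = sum(info_nce([S[j][i] for j in range(N)], i, tau) for i in range(N)) / N  # columns
    return lam * l_v2s + (1 - lam) * l_s2v
```

A similarity matrix whose diagonal dominates (true pairs most similar) yields a much smaller loss than one where the off-diagonal entries dominate.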
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210931458.3A CN115293128A (en) | 2022-08-04 | 2022-08-04 | Model training method and system based on multi-modal contrast learning radiology report generation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115293128A true CN115293128A (en) | 2022-11-04 |
Family
ID=83825591
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115293128A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116151232A (en) * | 2023-04-24 | 2023-05-23 | 北京龙智数科科技服务有限公司 | Method and device for generating model by multi-stage training text title |
CN116151232B (en) * | 2023-04-24 | 2023-08-29 | 北京龙智数科科技服务有限公司 | Method and device for generating model by multi-stage training text title |
CN116843778A (en) * | 2023-05-23 | 2023-10-03 | 北京邮电大学 | Method and system for generating X-ray chest radiography image based on radiology report |
CN116843778B (en) * | 2023-05-23 | 2024-03-26 | 北京邮电大学 | Method and system for generating X-ray chest radiography image based on radiology report |
CN116797889A (en) * | 2023-08-24 | 2023-09-22 | 青岛美迪康数字工程有限公司 | Updating method and device of medical image recognition model and computer equipment |
CN116797889B (en) * | 2023-08-24 | 2023-12-08 | 青岛美迪康数字工程有限公司 | Updating method and device of medical image recognition model and computer equipment |
CN117174240A (en) * | 2023-10-26 | 2023-12-05 | 中国科学技术大学 | Medical image report generation method based on large model field migration |
CN117174240B (en) * | 2023-10-26 | 2024-02-09 | 中国科学技术大学 | Medical image report generation method based on large model field migration |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||