CN115293128A - Radiology report generation model training method and system based on multi-modal contrastive learning - Google Patents

Radiology report generation model training method and system based on multi-modal contrastive learning

Info

Publication number
CN115293128A
Authority
CN
China
Prior art keywords
sentence
image
encoder
learning
visual features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210931458.3A
Other languages
Chinese (zh)
Inventor
武星
李婧雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202210931458.3A priority Critical patent/CN115293128A/en
Publication of CN115293128A publication Critical patent/CN115293128A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 15/00 ICT specially adapted for medical reports, e.g. generation or transmission thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Image Analysis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a radiology report generation model training method and system based on multi-modal contrastive learning, comprising the following steps: using a self-supervised representation learning method, sentence representations are learned through a sentence-level training strategy based on contrastive learning, and image representations are obtained through bidirectional contrastive learning between paired images and texts; the learned image encoder and sentence encoder are then embedded into a radiology report generation model, which generates the Impression part of the report through an encoding-decoding process and recursively generates the Findings part of the report by fusing visual features and semantic features. The radiology report generation model training method and system based on multi-modal contrastive learning optimize the representations of images and texts, realize the radiology report generation task, assist radiologists in making diagnoses, and improve accuracy in the diagnostic process.

Description

Radiology report generation model training method and system based on multi-modal contrastive learning
Technical Field
The invention relates to a radiology report generation model training method and system based on multi-modal contrastive learning, and belongs to the fields of computer science and medicine.
Background
Medical images, such as radiological images, are widely used for the diagnosis of diseases. Reading and understanding medical images is typically performed by professional medical personnel, who analyze the images under examination, identify normal and abnormal areas in them, and use learned medical knowledge and accumulated working experience to compose radiology reports. However, the generated reports can contain errors due to gaps in the radiologist's knowledge, faulty reasoning, staff shortages, excessive workload, and other factors. Therefore, the automatic generation of radiology reports has become an attractive research direction for artificial intelligence and clinical medicine, aiming to reduce the workload of radiologists and minimize the occurrence of errors.
Imaging studies are typically accompanied by radiology reports that document the radiologist's observations in routine clinical care; such a report consists of the Impression and Findings sections and is the most direct transcription of the imaging study. The Impression part is a conclusive diagnosis and can be regarded as the report's conclusion or topic sentence, while the Findings part is a paragraph consisting of multiple structured sentences, each focusing on a specific medical observation of a particular region in the radiology image. These sentences are typically longer and more complex than the captions in standard image captioning datasets. Therefore, many existing image captioning models are not directly suitable for this task, which requires dedicated solutions.
In recent years, much work has explored the automatic generation of radiology reports, and many new ideas building on traditional generation methods have been proposed. However, existing generation methods generally suffer from several problems: (1) previous studies mostly use CNN encoders pre-trained on ImageNet, which are not well suited to medical images; (2) the sentence representations produced by the LSTM-based or BERT-based sentence encoders used in previous studies are of low quality; (3) the generated reports are not semantically coherent in the medical domain, and optimizing the clinical accuracy of the generated reports comes at the cost of degrading performance on other metrics to some extent.
Disclosure of Invention
The purpose of the invention is to realize the radiology report generation task, to improve the visual features extracted from medical images and the semantic features extracted from texts, and to enhance the consistency between images and texts.
In order to achieve the above object, one technical solution of the present invention is to provide a radiology report generation model training method based on multi-modal contrastive learning, characterized by comprising the following steps:
s100, sample data acquisition:
acquiring radiological images and text data, and transmitting the image data and the corresponding text data to a sample database, wherein each text data comprises a conclusive diagnostic statement Impression and a detailed description paragraph Findings;
s200, multi-modal contrast learning:
using a self-supervised representation learning method, an image encoder for extracting visual features from image data and a sentence encoder for extracting semantic features from text data are learned from the sample database, wherein:
the sentence encoder learns sentence representations through a sentence-level training strategy based on contrastive learning;
the image encoder learns image representations through a bidirectional contrastive learning objective between paired image data and text data;
s300, generating a radiology report:
recursively generating the diagnostic sentence Impression and the description paragraph Findings of the radiology report by fusing the visual features extracted by the image encoder and the semantic features extracted by the sentence encoder, specifically comprising the following steps:
s301, generating a single diagnosis statement Impression by an Impression generating module based on an encoder-decoder framework, and specifically comprising the following steps:
the image encoder extracts visual features from an input image, then sends the visual features into an Impression part sentence decoder, and generates a whole sentence word by word as a diagnosis sentence Impression;
s302, fusing the visual features of the image and the semantic features of the sentences by a Findings generation module, generating sentences cyclically, and finally generating a long paragraph containing a plurality of structured sentences, specifically comprising the following steps:
in the Findings generation module, in order to make each generated sentence focus on describing a different image region, the visual features output by the image encoder and the semantic features of the previous sentence output by the sentence encoder are fed, based on an attention framework, into a fully connected layer and then into a SoftMax layer to obtain weighted visual features; the weighted visual features are input into the Findings sentence decoder to generate a sentence; the generated sentence is then fed into the sentence encoder as the previous sentence, yielding new weighted visual features; this process is repeated until the Findings sentence decoder generates an empty sentence, indicating that generation of the description paragraph Findings is complete;
wherein the encoding of the previous sentence and the visual features of the image are combined through the weighted visual features to guide the generation of the next sentence, and the attention network used to compute the weighted visual representation is defined as:

V_w = Attention(v, s)

where V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, with each v_i ∈ R^{d_v} an image feature learned by the image encoder, a d_v-dimensional representation corresponding to one region of the image; s ∈ R^{d_s} is the encoding of the previous sentence, and d_s is the dimension of the semantic features;

given the visual features v ∈ R^{d_v×K} and the encoding of the previous sentence, an attention distribution over the K image regions is generated by a single attention distributor, expressed as a single-layer neural network followed by a SoftMax function:

z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)

where 1 ∈ R^K is a vector with all elements set to 1; W_v ∈ R^{k×d_v}, W_s ∈ R^{k×d_s} and w_h ∈ R^k are the parameters of the attention network; and α ∈ R^K contains the attention weights over the features in v;

the weighted visual representation V_w is then obtained from the attention distribution as:

V_w = Σ_{i=1}^{K} α_i v_i

where α_i is the i-th element of α;
s400, report analysis and evaluation:
the generated radiology report is evaluated using the evaluation index:
s500, outputting a result:
and combining the separately generated diagnostic sentence Impression and description paragraph Findings into a complete radiology report for output, while using multiple evaluation indexes to evaluate the output results.
Preferably, in step S100, the image data and the text data in the sample database are in one-to-one correspondence, and the sample database includes a training set, a test set and a validation set.
Preferably, in step S200, learning sentence representations through the sentence-level training strategy based on contrastive learning comprises the following steps:
for the same sentence in a sentence set (from the training set, test set or validation set), a series of sentence representations of different augmented versions is obtained by applying different data augmentation methods and used as positive examples, while the other sentences are used as negative examples;
when training the sentence encoder, the sentence encoder is built and semantic sentence embeddings are constructed by maximizing the agreement between the sentence representations of different augmented versions of the same sample while keeping the sentence vectors of different samples as far apart as possible.
Preferably, in step S200, a set of sentences X = {x_1, x_2, …, x_m} is taken, where x_i denotes the i-th sentence and m is the total number of sentences in the set; for each sentence x_i, two different data enhancement methods f(·) and f'(·) are applied to generate two different versions of its sentence embedding, e_i and e'_i:

e_i = f(x_i)
e'_i = f'(x_i)

where e_i, e'_i ∈ R^{L×D}, L is the length of the sentence embedding and D is its hidden dimension;

the sentence embeddings e_i and e'_i are then encoded to obtain the sentence representations h_i and h'_i;

then, for a mini-batch of N sentences, the training objective ℓ_i for sentence x_i is as follows:

ℓ_i = -log( exp(sim(h_i, h'_i)/τ) / Σ_{j=1}^{N} exp(sim(h_i, h'_j)/τ) )

where τ is a temperature hyperparameter and sim(·,·) is the cosine similarity, sim(h_i, h'_j) = h_i^T h'_j / (‖h_i‖·‖h'_j‖);

the final contrastive loss L_sent is the average of all N in-batch losses:

L_sent = (1/N) Σ_{i=1}^{N} ℓ_i
preferably, in step S200, learning the image representation by a bidirectional contrast learning target between the paired image and text, comprises the following steps:
by paired input (X) v ,X s ) Learning image encoder, wherein X v Representing an image or a group of images, X s Representative description of X v A sentence sequence of mid-imaging information; for each input image X v And each input sentence X d They are encoded by an image encoder f v () And a sentence encoder f s () Conversion into a fixed dimension vector h v And h s (ii) a Then, a representation h of the two modes v 、h s By projecting a function g v () And g s () Project from their encoder space to the same D-dimensional space for comparative learning;
for N input pairs (X) v ,X s ),Resulting in corresponding N representation pairs (v, s), wherein:
v=g v (f v (X v ))
s=g s (f s (X s ))
in the formula, v,
Figure BDA0003781724040000047
with (v) i ,s i ) Represent the ith pair of representations whose training objectives include two loss functions: loss of image-to-text contrast
Figure BDA0003781724040000048
And loss of text-to-image contrast
Figure BDA0003781724040000049
Figure BDA0003781724040000051
Figure BDA0003781724040000052
Ultimate loss of training
Figure BDA0003781724040000053
Is a weighted combination of two penalties:
Figure BDA0003781724040000054
wherein λ is a scalar weight;
by maximizing the correspondence between image-text representation pairs, an image encoder is learned that maps images to a fixed-dimension vector.
Preferably, in step S301, the Impression sentence decoder adopts an LSTM-based method, generating one word at each time step to produce the caption according to the context vector, the previous hidden state and the previously generated word;

the initial hidden state and cell state of the LSTM are set to zero, and the hidden state h_t at time t is modeled as:

h_t = LSTM(x_t, h_{t-1}, m_{t-1})

where x_t is the input vector and m_{t-1} is the memory cell vector at time t-1.

The visual feature vector is used as the initial input of the LSTM to predict the first word of the sentence; before being input to the LSTM, the visual feature vector output by the image encoder is converted by a fully connected layer into the same dimension as the word embeddings; the visual features are extracted from the last convolutional layer, and the LSTM then generates the whole sentence word by word.
Preferably, the step S400 includes:
s401, evaluating the generated report by using the BLEU and the variant thereof;
s402, evaluating the generated report by using METEOR;
s403, evaluating the generated report by using the ROUGE;
and S404, evaluating the generated report by using CIDEr.
Another technical solution of the present invention is to provide a radiology report generation model training system based on multi-modal contrastive learning, characterized by comprising:
the sample database is used for storing the acquired radiological images and text data, wherein each text data comprises a conclusive diagnosis sentence Impression and a detailed description paragraph Findings;
the multi-modal contrastive learning module, which adopts a self-supervised representation learning method and learns, based on the sample database, an image encoder for extracting visual features from image data and a sentence encoder for extracting semantic features from text data, wherein:
the sentence encoder learns sentence representations through a sentence-level training strategy based on contrastive learning;
the image encoder learns image representations through a bidirectional contrastive learning objective between paired image data and text data;
a radiology report generation module for recursively generating the diagnostic sentence Impression and the description paragraph Findings of a radiology report by fusing the visual features extracted by the image encoder and the semantic features extracted by the sentence encoder;
the radiology report generation module further comprises an Impression generation module and a Findings generation module, wherein:
the Impression generation module generates a single diagnostic statement Impression based on an encoder-decoder framework; its implementation comprises the following steps:
the image encoder extracts visual features from an input image, then sends the visual features into an Impression part sentence decoder, and generates a whole sentence word by word as a diagnosis sentence Impression;
the Findings generation module fuses the visual features of the image and the semantic features of the sentences, generates sentences cyclically, and finally generates a long paragraph containing a plurality of structured sentences; its implementation comprises the following steps:
in the Findings generation module, in order to make each generated sentence focus on describing a different image region, the visual features output by the image encoder and the semantic features of the previous sentence output by the sentence encoder are fed, based on an attention framework, into a fully connected layer and then into a SoftMax layer to obtain weighted visual features; the weighted visual features are input into the Findings sentence decoder to generate a sentence; the generated sentence is then fed into the sentence encoder as the previous sentence, yielding new weighted visual features; this process is repeated until the Findings sentence decoder generates an empty sentence, indicating that generation of the Findings part is complete;
wherein the encoding of the previous sentence and the visual features of the image are combined through the weighted visual features to guide the generation of the next sentence, and the attention network used to compute the weighted visual representation is defined as:

V_w = Attention(v, s)

where V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, with each v_i ∈ R^{d_v} an image feature learned by the image encoder, a d_v-dimensional representation corresponding to one region of the image; s ∈ R^{d_s} is the encoding of the previous sentence, and d_s is the dimension of the semantic features;

given the visual features v ∈ R^{d_v×K} and the encoding of the previous sentence, an attention distribution over the K image regions is generated by a single attention distributor, expressed as a single-layer neural network followed by a SoftMax function:

z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)

where 1 ∈ R^K is a vector with all elements set to 1; W_v ∈ R^{k×d_v}, W_s ∈ R^{k×d_s} and w_h ∈ R^k are the parameters of the attention network; and α ∈ R^K contains the attention weights over the features in v;

the weighted visual representation V_w is then obtained from the attention distribution as:

V_w = Σ_{i=1}^{K} α_i v_i

where α_i is the i-th element of α;
a report analysis evaluation module that evaluates the generated radiology report using the evaluation index;
and the result output module, which combines the separately generated diagnostic sentence Impression and description paragraph Findings into a complete radiology report for output, while using multiple evaluation indexes to evaluate the output results.
Preferably, in the radiology report generation module, the image encoder is modeled as a CNN, and the sentence encoder is modeled as a BERT language model after fine-tuning, with semantic representations of sentences generated by mean pooling.
The invention has a reasonable structural design: it performs self-supervised learning using the sample database, takes radiology images as input, and finally generates the corresponding radiology reports, assisting radiologists in making diagnoses and greatly improving accuracy in the diagnostic process.
In summary, compared with the prior art, the invention has at least the following advantages:
(1) The invention provides a recursive model based on multi-modal contrastive learning for generating radiology reports. The model combines the visual features of medical images with the semantic features of sentences and generates the Impression part and the Findings part of the radiology report through a recursive network;
(2) The invention provides a model pre-training method based on multi-modal contrastive learning to improve the expressiveness of visual features and textual features;
(3) The invention performs bidirectional contrastive learning with paired medical images and reports and pre-trains the image encoder, so that it can effectively extract visual representations and improve the consistency between image data and text data.
(4) The sentence encoder is built on a sentence-level training objective based on contrastive learning, so that it can construct semantically coherent sentence embeddings for text representation;
(5) The radiology report generation model training method and system based on multi-modal contrastive learning can effectively provide interpretable rationales; self-supervised learning is carried out using the sample database, radiology images are fed into the system, and the corresponding radiology report is finally generated, assisting radiologists in making decisions and greatly improving accuracy in the diagnostic process.
Drawings
FIG. 1 is an overall framework diagram of the present invention;
FIG. 2 shows sample data according to an embodiment of the present invention;
FIG. 3 is a block diagram of multi-modal contrastive learning in the present invention;
FIG. 4 is a schematic diagram of multi-modal feature fusion in the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
As shown in FIG. 1, an embodiment of the present invention provides a radiology report generation model training method based on multi-modal contrastive learning, in which a self-supervised representation learning method, i.e., multi-modal contrastive learning, is embedded into the radiology report generation model so that a corresponding radiology report is generated for each radiology image. The method includes the following steps:
s100, sample data acquisition:
the method comprises the steps of collecting a radiological image of the chest of a human body through a nuclear magnetic resonance apparatus, and transmitting image data and corresponding text data to a sample database. Specifically, referring to fig. 2, the image data and the text data in the sample database are in a one-to-one correspondence relationship, and include a training set, a test set, and a verification set. The text data is a radiology report containing a conclusive diagnostic statement Impression and a detailed description paragraph Findings.
S200, multi-modal contrast learning:
Using a self-supervised representation learning method, an image encoder and a sentence encoder are learned from the sample database, used respectively to extract visual features from images and semantic features from sentences; as shown in FIG. 3, this specifically comprises the following steps:
S201, sentence representations are learned through a sentence-level training strategy based on contrastive learning.
A set of sentences X = {x_1, x_2, …, x_m} is taken, where x_i denotes the i-th sentence and m is the total number of sentences in the set. For each sentence x_i, two different data enhancement methods f(·) and f'(·) are applied to generate two different versions of its sentence embedding, e_i and e'_i:

e_i = f(x_i)
e'_i = f'(x_i)

where e_i, e'_i ∈ R^{L×D}, L is the length of the sentence embedding and D is its hidden dimension.

The sentence embeddings e_i and e'_i are then encoded to obtain the sentence representations h_i and h'_i.
Therefore, for the same sentence, by applying different data augmentation methods such as cropping and deletion, the invention takes the series of resulting embeddings as "positive examples" and the other sentences in the same batch as "negative examples". Then, for a mini-batch of N sentences, the training objective ℓ_i for sentence x_i is as follows:

ℓ_i = -log( exp(sim(h_i, h'_i)/τ) / Σ_{j=1}^{N} exp(sim(h_i, h'_j)/τ) )

where τ is a temperature hyperparameter and sim(·,·) is the cosine similarity, sim(h_i, h'_j) = h_i^T h'_j / (‖h_i‖·‖h'_j‖). The numerator of the above formula contains the cosine similarity between h_i and its positive example h'_i, while the denominator is the sum of the cosine similarities between h_i and all h'_j, covering both the positive example and the negative examples.

The final contrastive loss L_sent is the average of all N in-batch losses:

L_sent = (1/N) Σ_{i=1}^{N} ℓ_i
by furthest improving the consistency between different enhanced versions of the same sample and simultaneously keeping sentence vectors of different samples as far as possible, the invention establishes a sentence encoder and constructs semantic sentence embedding.
S202, image representations are learned through a bidirectional contrastive learning objective between paired images and texts.
The image encoder is learned from paired inputs (X_v, X_s), where X_v denotes an image or a group of images and X_s denotes the sentence sequence describing the imaging information in X_v. Each input image X_v and each input sentence X_s is converted into a fixed-dimension vector, h_v and h_s, by the image encoder f_v(·) and the sentence encoder f_s(·), respectively. The two modality representations h_v and h_s are then projected from their encoder spaces into the same D-dimensional space by the projection functions g_v(·) and g_s(·) for contrastive learning.

Thus, for N input pairs (X_v, X_s), the corresponding N representation pairs (v, s) are obtained as:

v = g_v(f_v(X_v))
s = g_s(f_s(X_s))

where v, s ∈ R^D. Let (v_i, s_i) denote the i-th representation pair; training uses the same medical data set as the downstream task. The invention uses Info-NCE, a contrastive loss function for self-supervised learning, as the loss. The training objective of the i-th pair (v_i, s_i) includes two loss functions: an image-to-text contrastive loss ℓ_i^{(v→s)} and a text-to-image contrastive loss ℓ_i^{(s→v)}:

ℓ_i^{(v→s)} = -log( exp(sim(v_i, s_i)/τ) / Σ_{j=1}^{N} exp(sim(v_i, s_j)/τ) )

ℓ_i^{(s→v)} = -log( exp(sim(s_i, v_i)/τ) / Σ_{j=1}^{N} exp(sim(s_i, v_j)/τ) )

where sim(·,·) is the cosine similarity and τ is a temperature hyperparameter. The final training loss L_img is a weighted combination of the two losses:

L_img = (1/N) Σ_{i=1}^{N} ( λ ℓ_i^{(v→s)} + (1 - λ) ℓ_i^{(s→v)} )
where λ is a scalar weight.
By maximizing the correspondence between image-text representation pairs, the present invention learns an image encoder that maps images to a fixed-dimension vector.
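A minimal PyTorch sketch of this bidirectional Info-NCE objective follows; v and s are assumed to be the already-projected D-dimensional image and sentence representations, and τ and λ are placeholder values.

```python
# Minimal sketch of the bidirectional Info-NCE objective described above.
# v and s are assumed to be the already-projected D-dimensional image and
# sentence representations; tau and lam are placeholder values.
import torch
import torch.nn.functional as F


def bidirectional_contrastive_loss(v, s, tau=0.07, lam=0.5):
    """v, s: (N, D) projected representations of N image-text pairs."""
    v = F.normalize(v, dim=-1)
    s = F.normalize(s, dim=-1)
    logits = v @ s.t() / tau                         # logits[i, j] = sim(v_i, s_j) / tau
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2s = F.cross_entropy(logits, targets)      # image-to-text term
    loss_s2v = F.cross_entropy(logits.t(), targets)  # text-to-image term
    return lam * loss_v2s + (1.0 - lam) * loss_s2v   # weighted combination
```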
S203, the image encoder and the sentence encoder learned through the above steps are embedded into the radiology report generation model.
S300, generating a radiology report:
recursively generating an Impression part and a Findings part of the radiology report by fusing the visual features extracted by the image encoder and the semantic features extracted by the sentence encoder, comprising the steps of:
S301, a single conclusive diagnostic statement Impression is generated by the Impression generation module based on a simple encoder-decoder framework.
Specifically, the image encoder first extracts visual features from the input image, and then feeds them to the sentence decoder, generating the whole sentence word by word.
The purpose of the image encoder is to automatically extract visual features from the image and map the image into a context vector that serves as the visual input for all subsequent modules; this vector is obtained through the multi-modal contrastive pre-training. Specifically, the mapping is parameterized with a fully connected layer applied to visual features extracted from the last convolutional layer. The visual features are then fed into the sentence decoder to generate the Impression part.
The invention adopts an LSTM-based method: given the context vector, the previous hidden state and the previously generated word, one word is generated at each time step to produce the caption. The initial hidden state and cell state of the LSTM are set to zero, and the hidden state h_t at time t is modeled as:

h_t = LSTM(x_t, h_{t-1}, m_{t-1})

where x_t is the input vector and m_{t-1} is the memory cell vector at time t-1.

The visual feature vector is used as the initial input of the LSTM to predict the first word of the sentence. Before entering the LSTM, however, a fully connected layer is needed to convert the visual feature vector to the same dimension as the word embeddings. The entire sentence can then be generated word by word.
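The following sketch illustrates one possible form of such an Impression sentence decoder: the visual feature vector is projected to the word-embedding dimension and fed as the first LSTM input, after which words are generated step by step. The dimensions, the greedy decoding and the vocabulary handling are illustrative assumptions, not specifics of the embodiment.

```python
# Illustrative sketch of the Impression sentence decoder: the visual feature
# vector, projected to the word-embedding dimension, is the first LSTM input,
# after which words are generated one per time step. The dimensions, greedy
# decoding and vocabulary handling are assumptions made for this example.
import torch
import torch.nn as nn


class ImpressionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, visual_dim=2048):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)   # match word-embedding size
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feat, max_len=30):
        batch = visual_feat.size(0)
        # initial hidden state and cell state are set to zero
        h = visual_feat.new_zeros(batch, self.lstm.hidden_size)
        c = visual_feat.new_zeros(batch, self.lstm.hidden_size)
        x = self.visual_proj(visual_feat)           # first input: projected visual vector
        words = []
        for _ in range(max_len):
            h, c = self.lstm(x, (h, c))
            next_word = self.out(h).argmax(dim=-1)  # greedy choice, for illustration only
            words.append(next_word)
            x = self.embed(next_word)               # previously generated word as next input
        return torch.stack(words, dim=1)            # (batch, max_len) token ids
```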
S302, the Findings generation module fuses the visual features of the image and the semantic features of the sentences, generates sentences cyclically, and finally produces a long paragraph containing multiple structured sentences.
Specifically, for multi-modal feature fusion, refer to FIG. 4. The upper half of FIG. 4 is the branch that produces visual features: the image encoder obtained through contrastive pre-training is modeled as a CNN that extracts visual representations from the input image. The lower half of FIG. 4 is the branch that produces semantic features: the sentence encoder obtained through contrastive pre-training is modeled as a BERT-like language model after fine-tuning, and the semantic representation of a sentence is generated by mean pooling.
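A hedged sketch of the two branches in FIG. 4 is given below. A ResNet-50 backbone and a Hugging Face BERT checkpoint are assumptions made for illustration; the embodiment only specifies "a CNN" and "a BERT-like language model" whose sentence representation is obtained by mean pooling.

```python
# Hedged sketch of the two branches in FIG. 4. A ResNet-50 backbone and a
# Hugging Face BERT checkpoint are assumptions made for illustration; the
# embodiment only specifies "a CNN" and "a BERT-like language model" whose
# sentence representation is obtained by mean pooling.
import torch.nn as nn
from torchvision import models
from transformers import AutoModel, AutoTokenizer


class ImageEncoder(nn.Module):
    """CNN branch: spatial features taken from the last convolutional stage."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)   # weights come from contrastive pre-training
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, images):                     # images: (B, 3, H, W)
        feat = self.features(images)               # (B, 2048, h, w)
        return feat.flatten(2).transpose(1, 2)     # (B, K, 2048): K region features


class SentenceEncoder(nn.Module):
    """BERT-style branch: mean pooling over token states gives the sentence vector."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.bert = AutoModel.from_pretrained(model_name)

    def forward(self, sentences):
        tokens = self.tokenizer(sentences, padding=True, truncation=True,
                                return_tensors="pt")
        states = self.bert(**tokens).last_hidden_state         # (B, T, d_s)
        mask = tokens["attention_mask"].unsqueeze(-1)
        return (states * mask).sum(1) / mask.sum(1)             # masked mean pooling
```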
In order to focus the generated sentence on describing different image areas, based on the attention framework, the visual features of the image and the semantic features of the text are input into a fully connected layer and then fed into the SoftMax layer to obtain weighted visual features.
The attention network used to compute the weighted visual representation is defined as:
V_w = Attention(v, s)

where V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, with each v_i ∈ R^{d_v} an image feature learned by the image encoder, a d_v-dimensional representation corresponding to one region of the image; s ∈ R^{d_s} is the encoding of the previous sentence, and d_s is the dimension of the semantic features.

Given the visual features v ∈ R^{d_v×K} and the encoding of the previous sentence, the invention generates an attention distribution over the K image regions through a single attention distributor, expressed as a single-layer neural network followed by a SoftMax function:

z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)

where 1 ∈ R^K is a vector with all elements set to 1; W_v ∈ R^{k×d_v}, W_s ∈ R^{k×d_s} and w_h ∈ R^k are the parameters of the attention network; and α ∈ R^K contains the attention weights over the features in v.

The weighted visual representation V_w based on the attention distribution can then be obtained as:

V_w = Σ_{i=1}^{K} α_i v_i

where α_i is the i-th element of α.
The input to the sentence decoder is now a weighted visual representation, so that the decoder attends to specific regions of the image in order to generate sentences describing different image regions. The learned encoding of the previous sentence and the visual features of the image are combined to guide the generation of the next sentence. This process is repeated until an empty sentence is produced, indicating that generation of the Findings section has been completed. In this way, as different sentences are generated, the model can focus on different areas of the image according to the context of the previous sentence and ensure the coherence and consistency of the medical semantics of the generated report.
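The attention distributor and the recursive Findings loop described above might be sketched as follows. The hidden size k, the empty-string stopping test, and the interfaces of the sentence decoder (one decoded string per call, batch size one) and sentence encoder are simplifying assumptions made for this example.

```python
# Hedged sketch of the attention distributor and the recursive Findings loop.
# The hidden size k, the empty-string stopping test, and the interfaces of the
# sentence decoder (one decoded string per call) and sentence encoder are
# simplifying assumptions for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAttention(nn.Module):
    """Single-layer attention over K image regions, conditioned on the
    encoding of the previous sentence: V_w = Attention(v, s)."""

    def __init__(self, d_v, d_s, k_hidden=512):
        super().__init__()
        self.W_v = nn.Linear(d_v, k_hidden, bias=False)
        self.W_s = nn.Linear(d_s, k_hidden, bias=False)
        self.w_h = nn.Linear(k_hidden, 1, bias=False)

    def forward(self, v, s):
        # v: (B, K, d_v) region features, s: (B, d_s) previous-sentence encoding
        z = self.w_h(torch.tanh(self.W_v(v) + self.W_s(s).unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(z, dim=-1)                 # attention weights over K regions
        v_w = (alpha.unsqueeze(-1) * v).sum(dim=1)   # weighted visual representation
        return v_w, alpha


def generate_findings(image_encoder, sentence_encoder, attention, sent_decoder,
                      image, max_sentences=10):
    """Recursive generation: stop when the decoder emits an empty sentence."""
    v = image_encoder(image)                         # (1, K, d_v)
    prev = sentence_encoder([""])                    # empty previous sentence to start
    findings = []
    for _ in range(max_sentences):
        v_w, _ = attention(v, prev)
        sentence = sent_decoder(v_w)                 # assumed to return one decoded string
        if sentence.strip() == "":                   # empty sentence terminates the paragraph
            break
        findings.append(sentence)
        prev = sentence_encoder([sentence])          # feed back as the previous sentence
    return findings
```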
S400, report analysis and evaluation:
The generated radiology report is evaluated using four common evaluation indexes; the larger the value of an evaluation index, the better the performance of the radiology report generation model. The evaluation specifically comprises the following steps:
s401, evaluating the generated report using the BLEU and its variants.
BLEU is a method for the automatic evaluation of machine translation whose core idea is precision; it has many variants depending on the "n-gram" used, the four common indicators being BLEU-1, BLEU-2, BLEU-3 and BLEU-4, where an n-gram is a sequence of n consecutive words.
And S402, evaluating the generated report by using METEOR.
METEOR is an automatic metric for machine translation evaluation that correlates better with human judgment; unlike BLEU, it takes into account both precision and recall over the whole corpus to obtain the final score.
And S403, evaluating the generated report by using the ROUGE.
ROUGE is designed to measure summary quality: it measures the "similarity" between an automatically generated summary and a reference summary and computes a corresponding score.
And S404, evaluating the generated report by using CIDEr.
CIDEr is a consensus-based image description evaluation metric that computes, as its score, the cosine similarity between a reference caption and the caption generated by the model.
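As an illustration of step S401, BLEU-1 through BLEU-4 can be computed with NLTK's corpus_bleu; METEOR, ROUGE and CIDEr would normally come from their own implementations (for example the pycocoevalcap package). The reference and hypothesis tokens below are made-up examples, not data from the embodiment.

```python
# Illustration of step S401: BLEU-1 to BLEU-4 computed with NLTK's corpus_bleu.
# METEOR, ROUGE and CIDEr would normally come from their own implementations
# (for example the pycocoevalcap package). The tokens below are made-up examples.
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu


def bleu_scores(references, hypotheses):
    """references: one list of reference token lists per sample;
    hypotheses: one generated token list per sample."""
    smooth = SmoothingFunction().method1
    weights = {
        "BLEU-1": (1.0, 0.0, 0.0, 0.0),
        "BLEU-2": (0.5, 0.5, 0.0, 0.0),
        "BLEU-3": (1 / 3, 1 / 3, 1 / 3, 0.0),
        "BLEU-4": (0.25, 0.25, 0.25, 0.25),
    }
    return {name: corpus_bleu(references, hypotheses, weights=w,
                              smoothing_function=smooth)
            for name, w in weights.items()}


# Example usage on a single (hypothetical) generated sentence:
refs = [[["no", "acute", "cardiopulmonary", "abnormality"]]]
hyps = [["no", "acute", "cardiopulmonary", "abnormality"]]
print(bleu_scores(refs, hyps))
```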
S500, outputting a result:
and combining the respectively generated Impression part and Findings part into a complete radiology report for outputting, and simultaneously realizing the evaluation of output results by using a plurality of evaluation indexes.
In this embodiment, the radiological image is acquired by a nuclear magnetic resonance apparatus. Its principle is that the human body is placed in a special magnetic field and a radio-frequency pulse is used to excite the hydrogen nuclei in the body, causing them to resonate and absorb energy; after the radio-frequency pulse stops, the hydrogen nuclei emit radio signals at a specific frequency and release the absorbed energy, which is recorded by a receiver outside the body and processed by a computer to obtain the image.
In this embodiment, the result output module is in signal connection with a display screen and a printer. Through this signal connection, screen display and document printing of the diagnostic report are achieved, facilitating analysis of the report by medical personnel.

Claims (9)

1. A radiology report generation model training method based on multi-modal contrastive learning, characterized by comprising the following steps:
s100, sample data acquisition:
acquiring radiological images and text data, and transmitting the image data and the corresponding text data to a sample database, wherein each text data comprises a conclusive diagnosis sentence Impression and a detailed description paragraph Findings;
s200, multimodal contrast learning:
using a self-supervised representation learning method, an image encoder for extracting visual features from image data and a sentence encoder for extracting semantic features from text data are learned from the sample database, wherein:
the sentence encoder learns sentence representations through a sentence-level training strategy based on contrastive learning;
the image encoder learns image representations through a bidirectional contrastive learning objective between paired image data and text data;
s300, generating a radiology report:
recursively generating the diagnostic sentence Impression and the description paragraph Findings of the radiology report by fusing the visual features extracted by the image encoder and the semantic features extracted by the sentence encoder, specifically comprising the following steps:
s301, generating a single diagnostic statement Impression by the Impression generation module based on the encoder-decoder framework, specifically comprising the following steps:
the image encoder extracts visual features from an input image, then sends the visual features into an Impression partial sentence decoder, and generates a whole sentence word by word as a diagnosis sentence Impression;
s302, fusing the visual features of the image and the semantic features of the sentences by a Findings generation module, generating sentences cyclically, and finally generating a long paragraph containing a plurality of structured sentences, specifically comprising the following steps:
in the Findings generation module, in order to make each generated sentence focus on describing a different image region, the visual features output by the image encoder and the semantic features of the previous sentence output by the sentence encoder are fed, based on an attention framework, into a fully connected layer and then into a SoftMax layer to obtain weighted visual features; the weighted visual features are input into the Findings sentence decoder to generate a sentence; the generated sentence is then fed into the sentence encoder as the previous sentence, yielding new weighted visual features; this process is repeated until the Findings sentence decoder generates an empty sentence, indicating that generation of the description paragraph Findings is complete;
wherein the encoding of the previous sentence and the visual features of the image are combined through the weighted visual features to guide the generation of the next sentence, and the attention network used to compute the weighted visual representation is defined as:

V_w = Attention(v, s)

where V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, with each v_i ∈ R^{d_v} an image feature learned by the image encoder, a d_v-dimensional representation corresponding to one region of the image; s ∈ R^{d_s} is the encoding of the previous sentence, and d_s is the dimension of the semantic features;

given the visual features v ∈ R^{d_v×K} and the encoding of the previous sentence, an attention distribution over the K image regions is generated by a single attention distributor, expressed as a single-layer neural network followed by a SoftMax function:

z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)

where 1 ∈ R^K is a vector with all elements set to 1; W_v ∈ R^{k×d_v}, W_s ∈ R^{k×d_s} and w_h ∈ R^k are the parameters of the attention network; and α ∈ R^K contains the attention weights over the features in v;

the weighted visual representation V_w is then obtained from the attention distribution as:

V_w = Σ_{i=1}^{K} α_i v_i

where α_i is the i-th element of α;
s400, report analysis and evaluation:
the generated radiology report is evaluated using evaluation indexes;
S500, outputting a result:
combining the separately generated diagnostic sentence Impression and description paragraph Findings into a complete radiology report for output, while using multiple evaluation indexes to evaluate the output results.
2. The radiology report generation model training method based on multi-modal contrastive learning of claim 1, wherein in step S100 the image data and the text data in the sample database are in one-to-one correspondence, and the sample database includes a training set, a test set and a validation set.
3. The method as claimed in claim 2, wherein in step S200 learning sentence representations through the sentence-level training strategy based on contrastive learning comprises the following steps:
for the same sentence in a sentence set (from the training set, test set or validation set), a series of sentence representations of different augmented versions is obtained by applying different data augmentation methods and used as positive examples, while the other sentences are used as negative examples;
when training the sentence encoder, the sentence encoder is built and semantic sentence embeddings are constructed by maximizing the agreement between the sentence representations of different augmented versions of the same sample while keeping the sentence vectors of different samples as far apart as possible.
4. The method of claim 3, wherein in step S200 a set of sentences X = {x_1, x_2, …, x_m} is taken, where x_i denotes the i-th sentence and m is the total number of sentences in the set; for each sentence x_i, two different data enhancement methods f(·) and f'(·) are applied to generate two different versions of its sentence embedding, e_i and e'_i:

e_i = f(x_i)
e'_i = f'(x_i)

where e_i, e'_i ∈ R^{L×D}, L is the length of the sentence embedding and D is its hidden dimension;

the sentence embeddings e_i and e'_i are then encoded to obtain the sentence representations h_i and h'_i;

then, for a mini-batch of N sentences, the training objective ℓ_i for sentence x_i is as follows:

ℓ_i = -log( exp(sim(h_i, h'_i)/τ) / Σ_{j=1}^{N} exp(sim(h_i, h'_j)/τ) )

where τ is a temperature hyperparameter and sim(·,·) is the cosine similarity, sim(h_i, h'_j) = h_i^T h'_j / (‖h_i‖·‖h'_j‖);

the final contrastive loss L_sent is the average of all N in-batch losses:

L_sent = (1/N) Σ_{i=1}^{N} ℓ_i.
5. The multi-modal contrastive learning radiology report generation model training method of claim 1, wherein learning image representations through the bidirectional contrastive learning objective between paired images and texts in step S200 comprises the following steps:

the image encoder is learned from paired inputs (X_v, X_s), where X_v denotes an image or a group of images and X_s denotes the sentence sequence describing the imaging information in X_v; each input image X_v and each input sentence X_s is converted into a fixed-dimension vector, h_v and h_s, by the image encoder f_v(·) and the sentence encoder f_s(·), respectively; the two modality representations h_v and h_s are then projected from their encoder spaces into the same D-dimensional space by the projection functions g_v(·) and g_s(·) for contrastive learning;

for N input pairs (X_v, X_s), the corresponding N representation pairs (v, s) are obtained as:

v = g_v(f_v(X_v))
s = g_s(f_s(X_s))

where v, s ∈ R^D; letting (v_i, s_i) denote the i-th representation pair, its training objective includes two loss functions, an image-to-text contrastive loss ℓ_i^{(v→s)} and a text-to-image contrastive loss ℓ_i^{(s→v)}:

ℓ_i^{(v→s)} = -log( exp(sim(v_i, s_i)/τ) / Σ_{j=1}^{N} exp(sim(v_i, s_j)/τ) )

ℓ_i^{(s→v)} = -log( exp(sim(s_i, v_i)/τ) / Σ_{j=1}^{N} exp(sim(s_i, v_j)/τ) )

where sim(·,·) is the cosine similarity and τ is a temperature hyperparameter; the final training loss L_img is a weighted combination of the two losses:

L_img = (1/N) Σ_{i=1}^{N} ( λ ℓ_i^{(v→s)} + (1 - λ) ℓ_i^{(s→v)} )
wherein λ is a scalar weight;
by maximizing the correspondence between image-text representation pairs, an image encoder is learned that maps images to a fixed-dimension vector.
6. The method as claimed in claim 1, wherein in step S301 the Impression sentence decoder adopts an LSTM-based method, generating one word at each time step to produce the caption according to the context vector, the previous hidden state and the previously generated word;

the initial hidden state and cell state of the LSTM are set to zero, and the hidden state h_t at time t is modeled as:

h_t = LSTM(x_t, h_{t-1}, m_{t-1})

where x_t is the input vector and m_{t-1} is the memory cell vector at time t-1;

the visual feature vector is used as the initial input of the LSTM to predict the first word of the sentence; before being input to the LSTM, the visual feature vector output by the image encoder is converted by a fully connected layer into the same dimension as the word embeddings; the visual features are extracted from the last convolutional layer, and the LSTM then generates the whole sentence word by word.
7. The method as claimed in claim 1, wherein the step S400 includes:
s401, evaluating the generated report by using the BLEU and the variant thereof;
s402, evaluating the generated report by using METEOR;
s403, evaluating the generated report by using the ROUGE;
and S404, evaluating the generated report by using CIDEr.
8. A radiology report generation model training system based on multi-modal contrastive learning, comprising:
the sample database is used for storing the acquired radiological images and text data, wherein each text data comprises a conclusive diagnosis sentence Impression and a detailed description paragraph Findings;
the multi-modal contrastive learning module, which adopts a self-supervised representation learning method and learns, based on the sample database, an image encoder for extracting visual features from image data and a sentence encoder for extracting semantic features from text data, wherein:
the sentence encoder learns sentence representations through a sentence-level training strategy based on contrastive learning;
the image encoder learns image representations through a bidirectional contrastive learning objective between paired image data and text data;
a radiology report generation module for recursively generating the diagnostic sentence Impression and the description paragraph Findings of a radiology report by fusing the visual features extracted by the image encoder and the semantic features extracted by the sentence encoder;
the radiology report generation module further comprises an Impression generation module and a Findings generation module, wherein:
the Impression generation module generates a single diagnostic statement Impression based on an encoder-decoder framework; its implementation comprises the following steps:
the image encoder extracts visual features from an input image, then sends the visual features into an Impression partial sentence decoder, and generates a whole sentence word by word as a diagnosis sentence Impression;
the Findings generation module fuses the visual features of the image and the semantic features of the sentences, generates sentences cyclically, and finally generates a long paragraph containing a plurality of structured sentences; its implementation comprises the following steps:
in the Findings generation module, in order to make each generated sentence focus on describing a different image region, the visual features output by the image encoder and the semantic features of the previous sentence output by the sentence encoder are fed, based on an attention framework, into a fully connected layer and then into a SoftMax layer to obtain weighted visual features; the weighted visual features are input into the Findings sentence decoder to generate a sentence; the generated sentence is then fed into the sentence encoder as the previous sentence, yielding new weighted visual features; this process is repeated until the Findings sentence decoder generates an empty sentence, indicating that generation of the Findings part is complete;
wherein the encoding of the previous sentence and the visual features of the image are combined through the weighted visual features to guide the generation of the next sentence, and the attention network used to compute the weighted visual representation is defined as:

V_w = Attention(v, s)

where V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, with each v_i ∈ R^{d_v} an image feature learned by the image encoder, a d_v-dimensional representation corresponding to one region of the image; s ∈ R^{d_s} is the encoding of the previous sentence, and d_s is the dimension of the semantic features;

given the visual features v ∈ R^{d_v×K} and the encoding of the previous sentence, an attention distribution over the K image regions is generated by a single attention distributor, expressed as a single-layer neural network followed by a SoftMax function:

z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)

where 1 ∈ R^K is a vector with all elements set to 1; W_v ∈ R^{k×d_v}, W_s ∈ R^{k×d_s} and w_h ∈ R^k are the parameters of the attention network; and α ∈ R^K contains the attention weights over the features in v;

the weighted visual representation V_w is then obtained from the attention distribution as:

V_w = Σ_{i=1}^{K} α_i v_i

where α_i is the i-th element of α;
a report analysis evaluation module that evaluates the generated radiology report using the evaluation index;
and the result output module, which combines the separately generated diagnostic sentence Impression and description paragraph Findings into a complete radiology report for output, while using multiple evaluation indexes to evaluate the output results.
9. The multi-modal contrastive learning-based radiology report generation model training system of claim 8, wherein in the radiology report generation module the image encoder is modeled as a CNN, and the sentence encoder is modeled as a BERT language model after fine-tuning, with semantic representations of sentences generated by mean pooling.
CN202210931458.3A 2022-08-04 2022-08-04 Model training method and system based on multi-modal contrast learning radiology report generation Pending CN115293128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210931458.3A CN115293128A (en) 2022-08-04 2022-08-04 Model training method and system based on multi-modal contrast learning radiology report generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210931458.3A CN115293128A (en) 2022-08-04 2022-08-04 Model training method and system based on multi-modal contrast learning radiology report generation

Publications (1)

Publication Number Publication Date
CN115293128A true CN115293128A (en) 2022-11-04

Family

ID=83825591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210931458.3A Pending CN115293128A (en) 2022-08-04 2022-08-04 Model training method and system based on multi-modal contrast learning radiology report generation

Country Status (1)

Country Link
CN (1) CN115293128A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116151232A (en) * 2023-04-24 2023-05-23 北京龙智数科科技服务有限公司 Method and device for generating model by multi-stage training text title
CN116151232B (en) * 2023-04-24 2023-08-29 北京龙智数科科技服务有限公司 Method and device for generating model by multi-stage training text title
CN116843778A (en) * 2023-05-23 2023-10-03 北京邮电大学 Method and system for generating X-ray chest radiography image based on radiology report
CN116843778B (en) * 2023-05-23 2024-03-26 北京邮电大学 Method and system for generating X-ray chest radiography image based on radiology report
CN116797889A (en) * 2023-08-24 2023-09-22 青岛美迪康数字工程有限公司 Updating method and device of medical image recognition model and computer equipment
CN116797889B (en) * 2023-08-24 2023-12-08 青岛美迪康数字工程有限公司 Updating method and device of medical image recognition model and computer equipment
CN117174240A (en) * 2023-10-26 2023-12-05 中国科学技术大学 Medical image report generation method based on large model field migration
CN117174240B (en) * 2023-10-26 2024-02-09 中国科学技术大学 Medical image report generation method based on large model field migration

Similar Documents

Publication Publication Date Title
CN115293128A (en) Model training method and system based on multi-modal contrast learning radiology report generation
CN113241135A (en) Disease risk prediction method and system based on multi-mode fusion
CN110503635B (en) Hand bone X-ray film bone age assessment method based on heterogeneous data fusion network
CN112541066B (en) Text-structured-based medical and technical report detection method and related equipment
CN113555077B (en) Suspected infectious disease prediction method and device
CN112489740A (en) Medical record detection method, training method of related model, related equipment and device
CN117077786A (en) Knowledge graph-based data knowledge dual-drive intelligent medical dialogue system and method
CN117253576B (en) Outpatient electronic medical record generation method based on Chinese medical large model
Sirshar et al. Attention based automated radiology report generation using CNN and LSTM
CN111524570B (en) Ultrasonic follow-up patient screening method based on machine learning
CN115205880A (en) Medical image report generation method and device
CN113159134A (en) Intelligent diagnosis evaluation method based on mammary gland structural report
CN113555078A (en) Intelligent generation method and system for mode-driven gastroscopy report
CN114708976A (en) Method, device, equipment and storage medium for assisting diagnosis technology
CN112216379A (en) Disease diagnosis system based on intelligent joint learning
CN112749277A (en) Medical data processing method and device and storage medium
CN116797572A (en) Rheumatoid arthritis activity grading device based on multi-mode data
Hartsock et al. Vision-language models for medical report generation and visual question answering: A review
CN115295133A (en) Code checking method for surgical operation
Ihor et al. Exploring Multimodal Data Approach in Natural Language Processing Based on Speech Recognition Algorithms
CN110289065A (en) A kind of auxiliary generates the control method and device of medical electronic report
CN118098482A (en) Intelligent medical management system and method based on 5G technology
CN118072899A (en) Bone mineral density report generation platform based on diffusion model text generation technology
CN117954041A (en) Medical image report generation method, system and computer storage medium
CN114548081A (en) Method and system for automatically generating medical ultrasonic text diagnosis result

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination