CN116127056A - Medical dialogue abstracting method with multi-level characteristic enhancement

Medical dialogue abstracting method with multi-level characteristic enhancement

Info

Publication number
CN116127056A
Authority
CN
China
Prior art keywords
medical
word
abstract
dialogue
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211692317.7A
Other languages
Chinese (zh)
Inventor
张天宝
冯时
杨振飞
王大玲
张一飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202211692317.7A
Publication of CN116127056A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • G06F40/35Discourse or dialogue representation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention provides a medical dialogue summarization method with multi-level feature enhancement, relating to the technical field of medical dialogue summarization. First, medical dialogue summary data are acquired and preprocessed so that they meet the unified requirements of the model; an automatic medical dialogue summarization model is then built. The model uses a pointer generation network as its basic framework and adapts to the medical dialogue scenario with a multi-level enhanced input feature representation that integrates intra-attention, speaker embeddings and utterance semantics. Finally, the constructed automatic medical dialogue summarization model is trained and tested. The method effectively improves the performance of the medical dialogue summarization model and enhances the accuracy of the generated summaries.

Description

Medical dialogue abstracting method with multi-level characteristic enhancement
Technical Field
The invention relates to the technical field of medical dialogue abstracts, in particular to a medical dialogue abstracting method with multi-level characteristic enhancement.
Background
The safety and convenience of online medical treatment are increasingly prominent. Through text communication, patients can describe their condition, and doctors can then provide diagnoses and advice. After each session, some online platforms require doctors to compose summaries of the critical information, including disease diagnosis and treatment advice. These summaries not only provide important medical advice for the current patient, but also serve as a valuable reference for subsequent treatment and for patients with similar diseases. However, because medical dialogues are lengthy and highly specialized, summarization is a repetitive and burdensome task for doctors that greatly reduces the efficiency of online medical services. To relieve doctors' heavy workload, automatic medical dialogue summarization has emerged, which automatically summarizes the doctor's disease diagnosis and treatment advice from the medical dialogue.
In 2020, the ACL paper 2020.acl-main.703 proposed BART, which generates text summaries abstractively using a pretrain-then-fine-tune approach. BART is a denoising autoencoder that maps corrupted text back to the original text from which it was derived. It is implemented as a sequence-to-sequence model with a bidirectional encoder over the corrupted text and a left-to-right autoregressive decoder.
BART pre-trains a model that combines bidirectional and autoregressive Transformers. Pre-training has two stages: text is first corrupted with an arbitrary noising function, and a sequence-to-sequence model is then learned to reconstruct the original text. BART is thus trained by corrupting text and optimizing the reconstruction loss, i.e., the cross-entropy loss between the decoder output and the original text. Because BART has an autoregressive decoder, it can be fine-tuned directly for sequence generation tasks such as abstractive summarization, where copying information from the input is closely related to the denoising pre-training objective. The encoder receives the input sequence and the decoder generates the output autoregressively. Medical dialogue summaries are typically copied from the doctor's original words, so the generated summary should prioritize faithfulness over creativity, and valuable medical terms should not be paraphrased but kept in full. BART, however, is a purely abstractive summarization method: it cannot produce the summary by copying from the original text, so key diagnosis and treatment information is easily lost, causing factual errors.
The hierarchical encoder-tagger model HET, proposed in 2020.coling-main.63, generates a medical dialogue summary by identifying and extracting important utterances: each utterance in the dialogue is labeled with an importance tag, and these tags are treated as a silver standard on which an extractive summarization model can be trained. That work sets a threshold for judging a sentence's importance to the summary: if a sentence's ROUGE-1 score against the summary exceeds the threshold, the sentence is considered important to the summary.
The hierarchical encoder-tagger HET consists of three parts: a word-level encoder, a memory module, and an utterance-level encoder. The word-level encoder uses BERT, takes the output representation of the [CLS] token as the representation of an utterance, and feeds it into the memory module. The memory module is an end-to-end memory network whose goal is to enhance the representation of the current utterance with information from related utterances in the dialogue context, thereby better capturing contextual information and enabling better tagging. An LSTM word-level encoder encodes every utterance in the dialogue separately to obtain a vector representation of each utterance, which serves as the values of the memory network. Then, based on the similarity between the current utterance and the other utterances, the corresponding values are weighted; the weighted sum is concatenated to the BERT representation of the current utterance, and the resulting vector is fed to the utterance-level encoder. The utterance-level encoder is an LSTM whose output, after a linear transformation, is passed to a softmax or conditional-random-field tagger that labels the importance of each utterance.
Although the medical dialogue summary overlaps the original medical dialogue to a high degree, the extractive method HET is not fully applicable, since extracted utterances may contain redundant and non-critical information, such as "okay" or "all right then", which carries no substantive medical meaning and can make the summary redundant and hard to read.
Compared with text summarization and other dialogue summarization tasks, medical dialogue summarization has unique characteristics and challenges. First, the summary typically copies the doctor's original words, so it should prioritize faithfulness over creativity. Meanwhile, valuable medical terms should not be casually paraphrased but retained in full.
Although the summary overlaps the original dialogue to a high degree, extractive methods are not entirely suitable, as the extracted utterances may contain redundant and non-critical information. Second, both the patient and the doctor may produce several utterances before the other speaker responds, so the model should distinguish whether an utterance was spoken by the patient or by the doctor. Finally, the key information is scattered across the dialogue, and not every doctor utterance is valuable for generating the summary; the meaningless parts include questions and answers unrelated to diagnosis and treatment advice, and polite expressions such as greetings and thanks. The model should therefore be able to recognize the semantics of a particular utterance in order to focus on valuable information.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a medical dialogue summarization method with multi-level feature enhancement, which generates accurate and concise summaries of doctors' diagnoses and suggestions.
In order to solve the technical problems, the invention adopts the following technical scheme: a medical dialogue abstracting method with multi-level characteristic enhancement comprises the following steps:
step 1: obtaining medical dialogue abstract data;
step 1.1: medical dialogue acquisition; the raw medical dialogue data are crawled from the "classic Q&A" section of an online medical platform; in a medical session, the patient consults an online doctor about some health problem, and the doctor helps the patient determine the nature of the problem, provides treatment advice, or refers the patient to another medical institution for further treatment; these data are complete conversations between patient and doctor covering the whole process; in addition to the dialogue content, each medical dialogue includes additional information about the doctor and the patient;
step 1.2: summary acquisition; the summary is appended after the medical dialogue and comprises two parts, "problem description" and "analysis and advice"; the "problem description" part states the patient's medical problem, and the "analysis and advice" part outlines the doctor's diagnosis or treatment advice;
step 2: preprocessing medical dialogue abstract data; the medical dialogue and abstract data obtained in the step 1 are preprocessed respectively, so that the data meet the unified model requirement, and the specific method is as follows:
step 2.1: preprocessing overall data;
first, the medical dialogue summary data are cleaned, removing examples that are missing the utterances of either the doctor or the patient, or that are missing the summary;
step 2.2: preprocessing a medical dialogue;
first, the doctor and patient utterances are concatenated in the original dialogue order to form a medical dialogue passage; the jieba word segmentation tool is used to segment words, special symbols and stop words are removed, word frequencies are counted, and a medical dialogue vocabulary is constructed;
step 2.3: preprocessing the abstract;
the summary is segmented with the jieba word segmentation tool, and special symbols and stop words are removed;
step 2.4: dividing the data set;
the preprocessed medical dialogue summary data are randomly shuffled and divided into three parts, a training set, a validation set and a test set; the average number of words in the medical dialogues and in the summaries is counted for all three sets (a preprocessing sketch follows);
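Steps 2.1 to 2.4 can be illustrated with a short preprocessing sketch. The sketch below assumes jieba for Chinese word segmentation; the stop-word file, the record field names, and the 80/10/10 split ratio (consistent with the 32631/4079/4079 division reported later) are illustrative assumptions, not prescriptions of the invention.

```python
# Preprocessing sketch for steps 2.2-2.4: splice turns in dialogue order,
# segment with jieba, drop stop words, build a vocabulary, shuffle and split.
# Field names and the stop-word file are assumptions for illustration.
import random
from collections import Counter

import jieba

def preprocess(records, stopword_path="stopwords.txt"):
    stopwords = set(open(stopword_path, encoding="utf-8").read().split())
    counter, processed = Counter(), []
    for rec in records:  # rec: {"turns": [(speaker, text), ...], "summary": str}
        dialogue = []
        for _speaker, text in rec["turns"]:
            words = [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]
            dialogue.extend(words)
        counter.update(dialogue)
        summary = [w for w in jieba.lcut(rec["summary"])
                   if w.strip() and w not in stopwords]
        processed.append({"dialogue": dialogue, "summary": summary})
    random.shuffle(processed)  # step 2.4: random shuffle before splitting
    n = len(processed)
    train = processed[: int(0.8 * n)]
    valid = processed[int(0.8 * n): int(0.9 * n)]
    test = processed[int(0.9 * n):]
    return train, valid, test, counter
```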
step 3: constructing an automatic medical dialogue summarization model; based on the preprocessing results of the medical dialogues and summaries, a model capable of automatically summarizing medical dialogues is constructed;
step 3.1: using the pointer generation network as a basic architecture of a medical dialogue summary model;
step 3.1.1: an encoder and a decoder for setting a medical dialogue abstract model; the medical dialogue summary model uses a bi-directional LSTM as an encoder and a uni-directional LSTM as a decoder;
step 3.1.2: use intra-attention in place of the pointer generation network's coverage loss to reduce the generation of repeated words;

define $e_{ti}$ as the attention score of the encoder hidden state $h^e_i$ at decoding time step $t$; the model penalizes input words that already received high attention scores in previous decoding steps, and defines a new encoder attention score $e'_{ti}$, computed as formula (1):

$$e'_{ti}=\begin{cases}\exp(e_{ti}), & t=1\\ \dfrac{\exp(e_{ti})}{\sum_{j=1}^{t-1}\exp(e_{ji})}, & t>1\end{cases}\qquad(1)$$

the encoder attention scores are then normalized and used to obtain the encoder context vector $c^e_t$:

$$\alpha^e_{ti}=\frac{e'_{ti}}{\sum_j e'_{tj}},\qquad c^e_t=\sum_i \alpha^e_{ti}\,h^e_i$$

for each decoding step $t$, the model also calculates a new decoder attention score $e^d_{tt'}$ to reduce the re-generation of previously generated words, and uses it to calculate the decoder context vector $c^d_t$; the specific calculation formulas are as follows:

$$e^d_{tt'}=\left(h^d_t\right)^{\top}W^d_{attn}\,h^d_{t'}\qquad(2)$$

$$\alpha^d_{tt'}=\frac{\exp\left(e^d_{tt'}\right)}{\sum_{j=1}^{t-1}\exp\left(e^d_{tj}\right)}\qquad(3)$$

$$c^d_t=\sum_{t'=1}^{t-1}\alpha^d_{tt'}\,h^d_{t'}\qquad(4)$$

where $h^d_t$ is the decoder hidden state vector at time $t$, $W^d_{attn}$ is a weight matrix, $h^d_{t'}$ is the decoder hidden state vector at time $t'$, and $\alpha^d_{tt'}$ is the normalized decoder attention score; a brief sketch of the intra-temporal normalization in formula (1) follows;
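As a concrete reading of formula (1), the following PyTorch sketch keeps a running sum of past exponentiated scores and divides by it from the second decoding step onward. The interface is an illustrative assumption, not the invention's exact implementation.

```python
# Sketch of intra-temporal encoder attention (formula (1)) and its
# normalization: positions attended to in earlier steps are penalized.
import torch

def intra_temporal_attention(e_t, past_exp_sum=None):
    """e_t: (B, L) raw attention scores at the current decoding step.
    past_exp_sum: (B, L) running sum of exp(e) over previous steps, or None at t=1."""
    exp_e = torch.exp(e_t)
    e_prime = exp_e if past_exp_sum is None else exp_e / past_exp_sum  # formula (1)
    alpha_e = e_prime / e_prime.sum(dim=1, keepdim=True)  # normalized score
    new_sum = exp_e if past_exp_sum is None else past_exp_sum + exp_e
    return alpha_e, new_sum
```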
the summary generation probability distribution of the generation layer is calculated with the softmax function:

$$P_{gen}(w)=\mathrm{softmax}\left(W_{gen}\left[h^d_t;\,c^e_t;\,c^d_t\right]+b_{gen}\right)\qquad(5)$$

where $P_{gen}(w)$ is the generation probability distribution of the summary generation layer, $W_{gen}$ is the weight matrix of the generation layer, and $b_{gen}$ is the bias vector of the generation layer;

at the same time, the pointer mechanism uses the normalized encoder attention score $\alpha^e_{ti}$ as the probability of copying the original input word $w_i$:

$$P_{copy}(w_i)=\alpha^e_{ti}\qquad(6)$$

where $P_{copy}(w_i)$ is the copy probability of the original input word $w_i$;

the probability of using the copy mechanism at decoding step $t$ is calculated as follows:

$$p_{copy}=\sigma\left(W_{copy}\left[h^d_t;\,c^e_t;\,c^d_t\right]+b_{copy}\right)\qquad(7)$$

where $p_{copy}$ is the probability of using the copy mechanism at decoding step $t$, $W_{copy}$ is the weight matrix of the copy mechanism, and $b_{copy}$ is the bias vector of the copy mechanism;

the final probability distribution of the output word is obtained as a weighted sum of the copy (attention) distribution over the original dialogue and the generation distribution, calculated as follows (a code sketch of formulas (5) to (8) follows):

$$P(w)=p_{copy}\sum_{i:\,w_i=w}\alpha^e_{ti}+\left(1-p_{copy}\right)P_{gen}(w)\qquad(8)$$
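A minimal PyTorch sketch of formulas (5) to (8) follows. The concatenated feature [h_t; c_e; c_d] fed to W_gen and W_copy, and all tensor shapes, are assumptions consistent with pointer generation networks rather than details fixed by the text.

```python
# Sketch of formulas (5)-(8): generation distribution, copy distribution,
# copy gate, and the final mixture. B = batch, H = hidden, L = source length,
# V = vocabulary size.
import torch
import torch.nn.functional as F

def output_distribution(h_t, c_e, c_d, alpha_e, src_ids,
                        W_gen, b_gen, W_copy, b_copy, vocab_size):
    feat = torch.cat([h_t, c_e, c_d], dim=-1)             # (B, 3H)
    p_gen = F.softmax(feat @ W_gen + b_gen, dim=-1)       # formula (5): (B, V)
    p_copy = torch.sigmoid(feat @ W_copy + b_copy)        # formula (7): (B, 1)
    copy_dist = torch.zeros(alpha_e.size(0), vocab_size,
                            device=alpha_e.device)        # formula (6): scatter the
    copy_dist.scatter_add_(1, src_ids, alpha_e)           # attention onto the vocab
    return p_copy * copy_dist + (1.0 - p_copy) * p_gen    # formula (8)
```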
step 3.2: enhance the feature representation of the input words by adding speaker-level feature embeddings; establish a trainable speaker embedding vector, where the speakers comprise the two roles of doctor and patient; the speaker embedding vector is added to the word embedding vectors of that speaker's utterance to obtain the final encoder input embedding (a code sketch follows), calculated as:

$$E_{input}=E_{speaker}+E_{token}\qquad(9)$$

where $E_{input}$ is the final input embedding vector, $E_{speaker}$ is the speaker embedding vector, and $E_{token}$ is the word embedding vector;
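The speaker-level embedding of formula (9) is a one-line addition in code; the module below is a minimal sketch with illustrative sizes.

```python
# Sketch of formula (9): E_input = E_speaker + E_token.
import torch.nn as nn

class SpeakerAwareEmbedding(nn.Module):
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, emb_dim)
        self.speaker_emb = nn.Embedding(2, emb_dim)  # two roles: doctor, patient

    def forward(self, token_ids, speaker_ids):
        # token_ids, speaker_ids: (B, L); output: (B, L, emb_dim)
        return self.token_emb(token_ids) + self.speaker_emb(speaker_ids)
```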
step 3.3: introduce RoBERTa semantic representations; utterance semantics are introduced to enhance the feature representation of the input words from the utterance level; each utterance is fed separately into the Chinese pre-trained language model RoBERTa, with the classification symbol [CLS] inserted in front of it; the output corresponding to [CLS] is taken as the semantic representation of the utterance, calculated as:

$$r_i=\mathrm{RoBERTa}\left([CLS],\,w_{i1},\,w_{i2},\,\ldots,\,w_{il}\right)\qquad(10)$$

where $r_i$ is the semantic representation of the $i$-th utterance and $w_{i1},w_{i2},\ldots,w_{il}$ are the words of the $i$-th utterance;

then each word is paired with the semantic vector of the utterance containing it, and the encoder attention score of each input word is calculated using this utterance semantic vector, as follows:

$$e^t_{il}=v^{\top}\tanh\left(W_e\,h^e_{il}+W_d\,h^d_t+W_r\,r_i\right)\qquad(11)$$

where $e^t_{il}$ is the attention score of word $w_{il}$ at time $t$, $v^{\top}$ is the inner-product vector used to calculate the attention score, and $W_e$, $W_d$ and $W_r$ are the weight matrices corresponding to the encoder hidden state $h^e_{il}$, the decoder hidden state $h^d_t$ and the utterance semantic vector $r_i$, respectively (a sketch of the utterance-encoding step follows);
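Extracting the utterance representations $r_i$ of formula (10) with the Hugging Face transformers library might look as follows. The specific checkpoint hfl/chinese-roberta-wwm-ext is an assumption (the patent names only a "Chinese pre-training language model RoBERTa"); that family of checkpoints is loaded with the BERT classes.

```python
# Sketch of formula (10): take the [CLS] output of a Chinese RoBERTa as the
# semantic representation r_i of each utterance. Checkpoint name is assumed.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
model.eval()

def utterance_semantics(utterances):
    reps = []
    for utt in utterances:
        inputs = tokenizer(utt, return_tensors="pt")  # tokenizer prepends [CLS]
        with torch.no_grad():
            out = model(**inputs)
        reps.append(out.last_hidden_state[:, 0])      # [CLS] position -> r_i
    return torch.cat(reps, dim=0)                     # (num_utterances, hidden)
```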
step 4: training the constructed automatic medical dialogue abstract model and testing;
step 4.1: initialize the encoder input; the words of the segmented medical dialogues are randomly initialized as word embedding vectors, with values drawn from the normal distribution N(0, 1), using the previously constructed medical dialogue vocabulary; the speaker embedding vectors are randomly initialized in the same way, with values drawn from N(0, 1), using a vocabulary containing only the two tokens "doctor" and "patient"; the two embeddings are set to the same dimension, and the word embedding and the speaker embedding are added together as the encoder input;
step 4.2: initialize the decoder input; the words of the segmented summary are mapped to One-Hot vectors using the previously constructed medical dialogue vocabulary, and the special start symbol <SOS> is added at the beginning as the start marker of the decoder input;
step 4.3: construct the reference decoder output; the words of the segmented summary are mapped to One-Hot vectors using the previously constructed medical dialogue vocabulary and used as the reference output of the decoder;
step 4.4: training a model;
step 4.4.1: set the loss function; the loss function is the cross-entropy loss, calculated as follows:

$$loss=-\frac{1}{T}\sum_{t=1}^{T}\log P\left(w^{*}_{t}\right)\qquad(12)$$

where $loss$ is the loss function, $T$ is the total number of decoding time steps, and $P(w^{*}_{t})$ is the generation probability of the target word $w^{*}_{t}$;
step 4.4.2: set the training mode; the model is trained with Teacher Forcing;
step 4.4.3: mini-batch gradient descent; set the batch size, feed the divided mini-batches into the iteration loop, compute the average loss, and train the model by gradient descent (a minimal training-step sketch follows);
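A minimal teacher-forcing training step combining steps 4.4.1 to 4.4.3 might look like this; the model signature and padding handling are illustrative assumptions.

```python
# Sketch of one mini-batch training step with Teacher Forcing and the
# cross-entropy loss of formula (12). The model interface is assumed.
import torch.nn.functional as F

def train_step(model, optimizer, src_ids, speaker_ids, dec_in, dec_target, pad_id=0):
    # dec_in is the reference summary shifted right with <SOS>; Teacher Forcing
    # feeds the gold token at each step instead of the model's own prediction.
    logits = model(src_ids, speaker_ids, dec_in)  # (B, T, V)
    loss = F.cross_entropy(logits.transpose(1, 2), dec_target,
                           ignore_index=pad_id)   # mean over the batch, cf. (12)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```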
step 4.5: test the model's effectiveness.
The beneficial effects of the above technical scheme are as follows: to cope with the high copy rate, the multi-level feature-enhanced medical dialogue summarization method provided by the invention first adopts a pointer generation network as its basic framework, which can selectively copy the original text while retaining the ability to generate abstractively. To prevent the repeated generation of the same words, intra-attention is introduced into the pointer network in place of the original coverage-loss mechanism.
Second, the proposed medical dialogue summarization model assigns a speaker embedding vector to the patient and doctor roles in order to distinguish the speaker of each utterance. The model adds the speaker embedding vector directly to the token embedding vector and feeds the resulting embedding to the encoder.
Third, the pre-trained language model RoBERTa is used to recognize valuable utterances that contain key information. Each utterance is fed separately into RoBERTa, and the output at the [CLS] position is taken as the semantic representation of that utterance. Each token is then paired with the semantic representation vector of its utterance, which participates in the attention computation. Because the attention computation considers both the hidden states and the utterance semantics, the model can focus on key information.
The medical dialogue summarization model of the invention uses a pointer generation network as its basic framework and adapts to the medical dialogue scenario with a multi-level enhanced input feature representation that integrates intra-attention, speaker embeddings and utterance semantics. According to the automatic evaluation metrics, the model outperforms all baselines, and every module contributes to its performance.
Drawings
FIG. 1 is a flowchart of a medical dialogue summarization method with multi-level feature enhancement provided by an embodiment of the invention;
fig. 2 is a block diagram of a medical dialogue summary model according to an embodiment of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
In this embodiment, a medical dialogue summarization method with multi-level feature enhancement, as shown in fig. 1, includes the following steps:
step 1: obtaining medical dialogue abstract data resources;
this embodiment uses a medical dialogue summary dataset from the Chunyu Doctor platform; data acquisition can be divided into two parts, obtaining the medical dialogues and the corresponding summaries, as follows:
step 1.1: medical dialogue acquisition; in this embodiment, the raw medical dialogue data are crawled from the "classic Q&A" section of the online medical platform Chunyu Doctor. In these dialogues, the patient consults an online doctor about some health problem, and the doctor helps determine the nature of the problem, provides treatment advice, or refers the patient to another medical institution for further help. These data are complete conversations between patient and doctor covering the whole process. In addition to the dialogue content, each conversation contains additional information such as the type of illness, the corresponding hospital department, and the speaker of each utterance;
step 1.2: summary acquisition; many conversations include a summary that the doctor appends after the conversation, comprising the two parts "problem description" and "analysis and advice". The "problem description" part states the patient's medical problem; the "analysis and advice" part outlines the doctor's diagnosis or treatment advice. This embodiment uses the "analysis and advice" part as the reference summary.
This embodiment crawls 44900 dialogues from 23 hospital departments, covering 923 diseases; these constitute the original corpus of this embodiment. The dataset retains only the medical dialogues and their corresponding summaries; an example is shown in Table 1.
Table 1 Chinese medical dialogue summary dataset example
Step 2: preprocessing medical dialogue abstract data resources; the medical dialogue and abstract data resources obtained in the step 1 are preprocessed respectively, so that the data resources meet the unified model requirements, and the specific method is as follows:
step 2.1: preprocessing overall data;
first, the medical dialogue summary dataset is cleaned, removing examples missing the utterances of either the doctor or the patient, or missing the summary, leaving a total of 40789 complete examples;
step 2.2: preprocessing a medical dialogue;
first, the doctor and patient utterances are concatenated in the original dialogue order to form a medical dialogue passage; the jieba word segmentation tool is used to segment words, special symbols and stop words are removed, word frequencies are counted, and a medical dialogue vocabulary is constructed;
step 2.3: preprocessing the abstract;
the summary is segmented with the jieba word segmentation tool, and special symbols and stop words are removed;
step 2.4: dividing the data set;
the preprocessed medical dialogue summary dataset is randomly shuffled and divided into a training set, a validation set and a test set; the average number of tokens in the medical dialogues and in the summaries is counted for each set to ensure that differences between the sets do not affect the experimental conclusions. The dataset division and statistics are shown in Table 2.
Table 2 Dataset division and statistics
Statistic | Training set | Validation set | Test set
Number of examples | 32631 | 4079 | 4079
Average token count of medical dialogues | 293.0 | 291.9 | 288.1
Average token count of summaries | 95.4 | 94.2 | 93.6
step 3: constructing an automatic medical dialogue summarization model; based on the preprocessing results of the medical dialogues and summaries, a model capable of automatically generating summaries from medical dialogues is constructed; the model architecture is shown in fig. 2, and the specific method is as follows:
step 3.1: use the pointer generation network as the basic architecture of the medical dialogue summarization model; this embodiment uses a pointer generation network as the base model, which uses a generation probability to arbitrate between copying and abstractive generation;
step 3.1.1: an encoder and a decoder for setting a medical dialogue abstract model; the medical dialogue summary model uses a bi-directional LSTM as an encoder and a uni-directional LSTM as a decoder;
step 3.1.2: replace the coverage loss with intra-attention; the medical dialogue summarization model uses intra-attention in place of the pointer generation network's coverage loss, which better reduces the generation of repeated words;

this embodiment defines $e_{ti}$ as the attention score of the encoder hidden state $h^e_i$ at decoding time step $t$; the model penalizes input words that already received high attention scores in previous decoding steps, and defines a new encoder attention score $e'_{ti}$, computed as formula (1):

$$e'_{ti}=\begin{cases}\exp(e_{ti}), & t=1\\ \dfrac{\exp(e_{ti})}{\sum_{j=1}^{t-1}\exp(e_{ji})}, & t>1\end{cases}\qquad(1)$$

the encoder attention scores are then normalized and used to obtain the encoder context vector $c^e_t$:

$$\alpha^e_{ti}=\frac{e'_{ti}}{\sum_j e'_{tj}},\qquad c^e_t=\sum_i \alpha^e_{ti}\,h^e_i$$

for each decoding step $t$, the model also calculates a new decoder attention score $e^d_{tt'}$ to reduce the re-generation of previously generated words, and uses it to calculate the decoder context vector $c^d_t$; the specific calculation formulas are as follows:

$$e^d_{tt'}=\left(h^d_t\right)^{\top}W^d_{attn}\,h^d_{t'}\qquad(2)$$

$$\alpha^d_{tt'}=\frac{\exp\left(e^d_{tt'}\right)}{\sum_{j=1}^{t-1}\exp\left(e^d_{tj}\right)}\qquad(3)$$

$$c^d_t=\sum_{t'=1}^{t-1}\alpha^d_{tt'}\,h^d_{t'}\qquad(4)$$

where $h^d_t$ is the decoder hidden state vector at time $t$, $W^d_{attn}$ is a weight matrix, $h^d_{t'}$ is the decoder hidden state vector at time $t'$, and $\alpha^d_{tt'}$ is the normalized decoder attention score;
The abstract generation probability distribution of the abstract generation layer is calculated by using the softmax function, and the calculation formula is as follows:
Figure SMS_41
wherein ,
Figure SMS_42
generating probability distribution for abstraction of abstraction generation layer, W gen Weight matrix for abstract generation layer, b gen A bias vector that is an abstract generation layer;
at the same time, the pointer mechanism uses encoder attention scores
Figure SMS_43
As duplicate original input word w i The calculation formula is as follows:
Figure SMS_44
wherein ,
Figure SMS_45
to the original input word w i Copy probability of (2);
the probability of using the replication mechanism for the decoding step t is calculated as follows:
Figure SMS_46
wherein ,
Figure SMS_47
to use the probability of the replication mechanism for decoding step t, W copy Weight matrix for replication mechanism, b copy A bias vector that is a replication mechanism;
the final probability distribution of the output word is obtained using a weighted sum of the attention probability distribution of the original dialog and the abstract generation probability distribution, calculated as follows:
Figure SMS_48
step 3.2: enhance the feature representation of the input words by adding speaker-level feature embeddings; in order to adapt to the multi-speaker scenario of medical dialogue, speaker-level embeddings are introduced to distinguish the speakers; a trainable speaker embedding vector is established, where the speakers comprise the two roles of doctor and patient; the speaker embedding vector is added to the word embedding vectors of that speaker's utterance to obtain the final encoder input embedding, calculated as:

$$E_{input}=E_{speaker}+E_{token}\qquad(9)$$

where $E_{input}$ is the final input embedding vector, $E_{speaker}$ is the speaker embedding vector, and $E_{token}$ is the word embedding vector;
step 3.3: introduce RoBERTa semantic representations; since word-level embeddings alone are not sufficient to locate key words, utterance semantics are introduced to enhance the feature representation of the input words from the utterance level; to obtain the semantic representation of each utterance, the Chinese pre-trained language model RoBERTa is introduced; each utterance is fed separately into RoBERTa, with the classification symbol [CLS] inserted in front of it; the output corresponding to [CLS] is taken as the semantic representation of the utterance, calculated as:

$$r_i=\mathrm{RoBERTa}\left([CLS],\,w_{i1},\,w_{i2},\,\ldots,\,w_{il}\right)\qquad(10)$$

where $r_i$ is the semantic representation of the $i$-th utterance and $w_{i1},w_{i2},\ldots,w_{il}$ are the words of the $i$-th utterance;

then each word is paired with the semantic vector of the utterance containing it, and the encoder attention score of each input word is calculated using this utterance semantic vector, as follows:

$$e^t_{il}=v^{\top}\tanh\left(W_e\,h^e_{il}+W_d\,h^d_t+W_r\,r_i\right)\qquad(11)$$

where $e^t_{il}$ is the attention score of word $w_{il}$ at time $t$, $v^{\top}$ is the inner-product vector used to calculate the attention score, and $W_e$, $W_d$ and $W_r$ are the weight matrices corresponding to the encoder hidden state $h^e_{il}$, the decoder hidden state $h^d_t$ and the utterance semantic vector $r_i$, respectively;
step 4: training the constructed automatic medical dialogue abstract model and testing; the specific method comprises the following steps:
step 4.1: initialize the encoder input; the words of the segmented medical dialogues are randomly initialized as word embedding vectors, with values drawn from the normal distribution N(0, 1), using the previously constructed medical dialogue vocabulary; the speaker embedding vectors are randomly initialized in the same way, with values drawn from N(0, 1), using a vocabulary containing only the two tokens "doctor" and "patient"; the two embeddings are set to the same dimension, and the word embedding and the speaker embedding are added together as the encoder input;
step 4.2: initialize the decoder input; the words of the segmented summary are mapped to One-Hot vectors using the previously constructed medical dialogue vocabulary, and the special start symbol <SOS> is added at the beginning as the start marker of the decoder input;
step 4.3: construct the reference decoder output; the words of the segmented summary are mapped to One-Hot vectors using the previously constructed medical dialogue vocabulary and used as the reference output of the decoder;
step 4.4: training a model; the specific method comprises the following steps:
step 4.4.1: set the loss function; the loss function is the cross-entropy loss, calculated as follows:

$$loss=-\frac{1}{T}\sum_{t=1}^{T}\log P\left(w^{*}_{t}\right)\qquad(12)$$

where $loss$ is the loss function, $T$ is the total number of decoding time steps, and $P(w^{*}_{t})$ is the generation probability of the target word $w^{*}_{t}$;
step 4.4.2: set the training mode; the model is trained with Teacher Forcing: during training, instead of feeding the model's own output at the previous time step as the input at the next step, the corresponding previous token of the summary's reference answer (ground truth) is fed directly as the next input;
step 4.4.3: mini-batch gradient descent; set the batch size, feed the divided mini-batches into the iteration loop, compute the average loss, and train the model by gradient descent;
step 4.5: the model effect is tested by the following specific method:
step 4.5.1: decoding uses the Beam Search algorithm at test time; the beam width is set and the summary is generated. Beam search does not greedily take the single most probable token at each step; instead, it keeps all possible extensions of each partial hypothesis and selects the best token sequences by log-probability. The beam size is set to 4 (a compact sketch follows);
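Step 4.5.1 can be sketched as a compact log-probability beam search; step_fn (a callable returning next-token log-probabilities for a given prefix) and the token ids are illustrative assumptions.

```python
# Compact beam search sketch (beam size = 4): keep the best `beam_size`
# partial sequences by cumulative log-probability at every step.
import torch

def beam_search(step_fn, sos_id, eos_id, beam_size=4, max_len=100):
    beams = [([sos_id], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:          # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            log_probs = step_fn(seq)       # (V,) next-token log-probabilities
            topv, topi = torch.topk(log_probs, beam_size)
            for lp, idx in zip(topv.tolist(), topi.tolist()):
                candidates.append((seq + [idx], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    return beams[0][0]  # best-scoring sequence
```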
step 4.5.2: use the ROUGE score as the evaluation metric; the standard automatic evaluation metric for text summarization, the ROUGE score (n-gram overlap ratio), is used to judge model performance. This embodiment uses the ROUGE-1, ROUGE-2 and ROUGE-L scores as automatic evaluation metrics, which measure the accuracy of unigrams, bigrams and the longest common subsequence, respectively. A commonly accepted third-party library computes the ROUGE scores, and the resulting averages are taken as the final results (a sketch follows).
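With the open-source rouge package (one possible "commonly accepted third-party library"; the patent does not name the library it used), computing the averaged ROUGE-1/2/L F-scores looks like this, assuming word-segmented, space-joined hypothesis and reference strings:

```python
# Sketch of ROUGE evaluation: average ROUGE-1/2/L F-scores over the test set.
from rouge import Rouge

def evaluate(hypotheses, references):
    # hypotheses/references: lists of space-joined segmented summaries
    scores = Rouge().get_scores(hypotheses, references, avg=True)
    return (scores["rouge-1"]["f"], scores["rouge-2"]["f"], scores["rouge-l"]["f"])
```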
Step 4.5.3: selecting a plurality of strong base lines to be compared with the result indexes of the medical dialogue abstract model, and proving the effectiveness of the model;
to verify the effectiveness of the medical dialogue summarization model, this embodiment selects a variety of strong baselines from previous studies: Lead-3, which extracts the doctor's first three utterances as the summary; a Random extraction method, which randomly extracts three doctor utterances as the summary; Extractive Oracle, which extracts the three utterances with the highest ROUGE-1 scores against the reference summary; the ranking-based extraction method TextRank; the pointer generation network PGNet; ML+RL, which combines maximum-likelihood training and reinforcement learning; the hierarchical extraction model HET, designed specifically for medical dialogue summary extraction and using the pre-trained language model ZEN; the Transformer-based model Longformer, which focuses on processing long input text; and the pre-trained language model BART. The test results are shown in Table 3:
table 3 model comparison experiment results
As shown in Table 3, the medical dialogue summarization model of the invention outperforms all baseline models on all ROUGE metrics. The similar results of Lead-3 and Random indicate that the position of the doctor's diagnosis and advice is not fixed. The high performance of Extractive Oracle indicates that the summary and the doctor's original utterances overlap heavily. Longformer and BART perform poorly on all metrics, showing that direct abstractive generation does not work here. PGNet's moderate performance indicates that the copy mechanism is effective, but that traditional text summarization models need to be adapted to the medical dialogue setting. The HET results indicate that the extractive method also takes redundant words into the summary when extracting valuable utterances. ML+RL uses a copy mechanism together with various optimization techniques, so it performs well on all metrics.
This embodiment also carries out an ablation experiment to verify the effectiveness of the medical dialogue summarization model. As shown in Table 4, after removing intra-attention, speaker embedding and utterance semantics respectively, the ROUGE-1 score drops by 2.33, 4.04 and 2.88 points. The results show that every module effectively improves the performance of the medical dialogue summarization model.
Table 4 Ablation experiment results
Model | ROUGE-1 | ROUGE-2 | ROUGE-L
no intra-attention | 87.19 | 80.12 | 86.75
no speaker embedding | 85.48 | 78.05 | 84.49
no utterance semantics | 86.64 | 78.82 | 85.76
Full model (ours) | 89.52 | 82.86 | 88.79
It should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.

Claims (9)

1. A medical dialogue abstracting method with multi-level characteristic enhancement is characterized in that: the method comprises the following steps:
step 1: obtaining medical dialogue abstract data; the medical dialogue abstract data comprises two parts of medical dialogue and abstract;
step 2: preprocessing medical dialogue abstract data; preprocessing the acquired medical dialogue and abstract data respectively to enable the data to meet the unified model requirement;
step 3: constructing an automatic medical dialogue abstract model; based on the pretreatment results of the medical dialogue and the abstracts, constructing a model capable of automatically summarizing abstracts from the medical dialogue;
step 4: training the constructed automatic medical dialogue abstract model and testing.
2. The method for medical dialogue summarization with multi-level feature enhancement according to claim 1, wherein: the raw medical dialogue data are crawled from the "classic Q&A" section of an online medical platform; in a medical session, the patient consults an online doctor about some health problem, and the doctor helps the patient determine the nature of the problem, provides treatment advice, or refers the patient to another medical institution for further treatment; these data are complete conversations between patient and doctor covering the whole process; in addition to the dialogue content, each medical dialogue includes additional information about the doctor and the patient;
the summary is appended after the medical dialogue and comprises two parts, "problem description" and "analysis and advice"; the "problem description" part states the patient's medical problem, and the "analysis and advice" part outlines the doctor's diagnosis or treatment advice.
3. The method for medical dialogue summarization with multi-level feature enhancement according to claim 1, wherein: the specific method of the step 2 is as follows:
step 2.1: preprocessing overall data;
first, the medical dialogue summary data are cleaned, removing examples that are missing the utterances of either the doctor or the patient, or that are missing the summary;
step 2.2: preprocessing a medical dialogue;
first, the doctor and patient utterances are concatenated in the original dialogue order to form a medical dialogue passage; the jieba word segmentation tool is used to segment words, special symbols and stop words are removed, word frequencies are counted, and a medical dialogue vocabulary is constructed;
step 2.3: preprocessing the abstract;
the summary is segmented with the jieba word segmentation tool, and special symbols and stop words are removed;
step 2.4: dividing the data set;
the preprocessed medical dialogue summary data are randomly shuffled and divided into three parts, a training set, a validation set and a test set, and the average number of words in the medical dialogues and in the summaries is counted for all three sets.
4. The method for medical dialogue summarization with multi-level feature enhancement according to claim 1, wherein: the specific method of the step 3 is as follows:
step 3.1: using the pointer generation network as a basic architecture of a medical dialogue summary model;
step 3.2: enhancing the feature representation of the input word by adding speaker-level feature embedding;
step 3.3: introducing a RoBERTa semantic representation; introducing utterance semantics enhances the feature representation of the input word from the utterance level.
5. The method for multi-level feature enhanced medical session summarization of claim 4, wherein: the specific method of the step 3.1 is as follows:
step 3.1.1: an encoder and a decoder for setting a medical dialogue abstract model; the medical dialogue summary model uses a bi-directional LSTM as an encoder and a uni-directional LSTM as a decoder;
step 3.1.2: use intra-attention in place of the pointer generation network's coverage loss to reduce the generation of repeated words;

define $e_{ti}$ as the attention score of the encoder hidden state $h^e_i$ at decoding time step $t$; the model penalizes input words that already received high attention scores in previous decoding steps, and defines a new encoder attention score $e'_{ti}$, computed as formula (1):

$$e'_{ti}=\begin{cases}\exp(e_{ti}), & t=1\\ \dfrac{\exp(e_{ti})}{\sum_{j=1}^{t-1}\exp(e_{ji})}, & t>1\end{cases}\qquad(1)$$

the encoder attention scores are then normalized and used to obtain the encoder context vector $c^e_t$:

$$\alpha^e_{ti}=\frac{e'_{ti}}{\sum_j e'_{tj}},\qquad c^e_t=\sum_i \alpha^e_{ti}\,h^e_i$$

for each decoding step $t$, the model also calculates a new decoder attention score $e^d_{tt'}$ to reduce the re-generation of previously generated words, and uses it to calculate the decoder context vector $c^d_t$; the specific calculation formulas are as follows:

$$e^d_{tt'}=\left(h^d_t\right)^{\top}W^d_{attn}\,h^d_{t'}\qquad(2)$$

$$\alpha^d_{tt'}=\frac{\exp\left(e^d_{tt'}\right)}{\sum_{j=1}^{t-1}\exp\left(e^d_{tj}\right)}\qquad(3)$$

$$c^d_t=\sum_{t'=1}^{t-1}\alpha^d_{tt'}\,h^d_{t'}\qquad(4)$$

where $h^d_t$ is the decoder hidden state vector at time $t$, $W^d_{attn}$ is a weight matrix, $h^d_{t'}$ is the decoder hidden state vector at time $t'$, and $\alpha^d_{tt'}$ is the normalized decoder attention score;
the summary generation probability distribution of the generation layer is calculated with the softmax function:

$$P_{gen}(w)=\mathrm{softmax}\left(W_{gen}\left[h^d_t;\,c^e_t;\,c^d_t\right]+b_{gen}\right)\qquad(5)$$

where $P_{gen}(w)$ is the generation probability distribution of the summary generation layer, $W_{gen}$ is the weight matrix of the generation layer, and $b_{gen}$ is the bias vector of the generation layer;

at the same time, the pointer mechanism uses the normalized encoder attention score $\alpha^e_{ti}$ as the probability of copying the original input word $w_i$:

$$P_{copy}(w_i)=\alpha^e_{ti}\qquad(6)$$

where $P_{copy}(w_i)$ is the copy probability of the original input word $w_i$;

the probability of using the copy mechanism at decoding step $t$ is calculated as follows:

$$p_{copy}=\sigma\left(W_{copy}\left[h^d_t;\,c^e_t;\,c^d_t\right]+b_{copy}\right)\qquad(7)$$

where $p_{copy}$ is the probability of using the copy mechanism at decoding step $t$, $W_{copy}$ is the weight matrix of the copy mechanism, and $b_{copy}$ is the bias vector of the copy mechanism;

the final probability distribution of the output word is obtained as a weighted sum of the copy (attention) distribution over the original dialogue and the generation distribution, calculated as follows:

$$P(w)=p_{copy}\sum_{i:\,w_i=w}\alpha^e_{ti}+\left(1-p_{copy}\right)P_{gen}(w)\qquad(8)$$
6. the method for multi-level feature enhanced medical session summarization of claim 5, wherein: the specific method of the step 3.2 is as follows:
a trainable speaker embedding vector is established, where the speakers comprise the two roles of doctor and patient; the speaker embedding vector is added to the word embedding vectors of that speaker's utterance to obtain the final encoder input embedding, calculated as:

$$E_{input}=E_{speaker}+E_{token}\qquad(9)$$

where $E_{input}$ is the final input embedding vector, $E_{speaker}$ is the speaker embedding vector, and $E_{token}$ is the word embedding vector.
7. The method for multi-level feature enhanced medical session summarization of claim 6 wherein: the specific method of the step 3.3 is as follows:
each utterance is fed separately into the Chinese pre-trained language model RoBERTa, with the classification symbol [CLS] inserted in front of it; the output corresponding to [CLS] is taken as the semantic representation of the utterance, calculated as:

$$r_i=\mathrm{RoBERTa}\left([CLS],\,w_{i1},\,w_{i2},\,\ldots,\,w_{il}\right)\qquad(10)$$

where $r_i$ is the semantic representation of the $i$-th utterance and $w_{i1},w_{i2},\ldots,w_{il}$ are the words of the $i$-th utterance;

then each word is paired with the semantic vector of the utterance containing it, and the encoder attention score of each input word is calculated using this utterance semantic vector, as follows:

$$e^t_{il}=v^{\top}\tanh\left(W_e\,h^e_{il}+W_d\,h^d_t+W_r\,r_i\right)\qquad(11)$$

where $e^t_{il}$ is the attention score of word $w_{il}$ at time $t$, $v^{\top}$ is the inner-product vector used to calculate the attention score, and $W_e$, $W_d$ and $W_r$ are the weight matrices corresponding to the encoder hidden state $h^e_{il}$, the decoder hidden state $h^d_t$ and the utterance semantic vector $r_i$, respectively.
8. The method for multi-level feature enhanced medical session summarization of claim 7 wherein: the specific method of the step 4 is as follows:
step 4.1: initialize the encoder input; the words of the segmented medical dialogues are randomly initialized as word embedding vectors, with values drawn from the normal distribution N(0, 1), using the previously constructed medical dialogue vocabulary; the speaker embedding vectors are randomly initialized in the same way, with values drawn from N(0, 1), using a vocabulary containing only the two tokens "doctor" and "patient"; the two embeddings are set to the same dimension, and the word embedding and the speaker embedding are added together as the encoder input;
step 4.2: initialize the decoder input; the words of the segmented summary are mapped to One-Hot vectors using the previously constructed medical dialogue vocabulary, and the special start symbol <SOS> is added at the beginning as the start marker of the decoder input;
step 4.3: construct the reference decoder output; the words of the segmented summary are mapped to One-Hot vectors using the previously constructed medical dialogue vocabulary and used as the reference output of the decoder;
step 4.4: training a model;
step 4.5: test the model's effectiveness.
9. The method for multi-level feature enhanced medical session summarization of claim 8, wherein: the specific method of the step 4.4 is as follows:
step 4.4.1: set the loss function; the loss function is the cross-entropy loss, calculated as follows:

$$loss=-\frac{1}{T}\sum_{t=1}^{T}\log P\left(w^{*}_{t}\right)\qquad(12)$$

where $loss$ is the loss function, $T$ is the total number of decoding time steps, and $P(w^{*}_{t})$ is the generation probability of the target word $w^{*}_{t}$;
step 4.4.2: set the training mode; the model is trained with Teacher Forcing;
step 4.4.3: mini-batch gradient descent; set the batch size, feed the divided mini-batches into the iteration loop, compute the average loss, and train the model by gradient descent.
CN202211692317.7A 2022-12-28 2022-12-28 Medical dialogue abstracting method with multi-level characteristic enhancement Pending CN116127056A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211692317.7A CN116127056A (en) 2022-12-28 2022-12-28 Medical dialogue abstracting method with multi-level characteristic enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211692317.7A CN116127056A (en) 2022-12-28 2022-12-28 Medical dialogue abstracting method with multi-level characteristic enhancement

Publications (1)

Publication Number Publication Date
CN116127056A true CN116127056A (en) 2023-05-16

Family

ID=86309397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211692317.7A Pending CN116127056A (en) 2022-12-28 2022-12-28 Medical dialogue abstracting method with multi-level characteristic enhancement

Country Status (1)

Country Link
CN (1) CN116127056A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116541505A (en) * 2023-07-05 2023-08-04 华东交通大学 Dialogue abstract generation method based on self-adaptive dialogue segmentation
CN116541505B (en) * 2023-07-05 2023-09-19 华东交通大学 Dialogue abstract generation method based on self-adaptive dialogue segmentation
CN116759077A (en) * 2023-08-18 2023-09-15 北方健康医疗大数据科技有限公司 Medical dialogue intention recognition method based on intelligent agent
CN117009501A (en) * 2023-10-07 2023-11-07 腾讯科技(深圳)有限公司 Method and related device for generating abstract information
CN117009501B (en) * 2023-10-07 2024-01-30 腾讯科技(深圳)有限公司 Method and related device for generating abstract information
CN117370535A (en) * 2023-12-05 2024-01-09 粤港澳大湾区数字经济研究院(福田) Training method of medical dialogue model, medical query method, device and equipment
CN117370535B (en) * 2023-12-05 2024-04-16 粤港澳大湾区数字经济研究院(福田) Training method of medical dialogue model, medical query method, device and equipment

Similar Documents

Publication Publication Date Title
CN109670179B (en) Medical record text named entity identification method based on iterative expansion convolutional neural network
CN109697285B (en) Hierarchical BilSt Chinese electronic medical record disease coding and labeling method for enhancing semantic representation
CN116127056A (en) Medical dialogue abstracting method with multi-level characteristic enhancement
CN109635280A (en) A kind of event extraction method based on mark
CN109522546A (en) Entity recognition method is named based on context-sensitive medicine
CN110427486B (en) Body condition text classification method, device and equipment
CN109977199A (en) A kind of reading understanding method based on attention pond mechanism
Hu et al. PLANET: Dynamic content planning in autoregressive transformers for long-form text generation
CN113408430B (en) Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework
CN113128203A (en) Attention mechanism-based relationship extraction method, system, equipment and storage medium
Zhang et al. Improving Stack Overflow question title generation with copying enhanced CodeBERT model and bi-modal information
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN111241397A (en) Content recommendation method and device and computing equipment
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
CN111125333A (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
Kim et al. Automatic classification of the Korean triage acuity scale in simulated emergency rooms using speech recognition and natural language processing: a proof of concept study
CN112613322A (en) Text processing method, device, equipment and storage medium
CN116992007A (en) Limiting question-answering system based on question intention understanding
US20220189333A1 (en) Method of generating book database for reading evaluation
CN113705207A (en) Grammar error recognition method and device
Mahajan et al. IBMResearch at MEDIQA 2021: toward improving factual correctness of radiology report abstractive summarization
CN115964475A (en) Dialogue abstract generation method for medical inquiry
CN115062603A (en) Alignment enhancement semantic parsing method, alignment enhancement semantic parsing device and computer program product
KR102418260B1 (en) Method for analyzing customer consultation record
CN113012685B (en) Audio recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination