CN114757188A

CN114757188A - Standard medical text rewriting method based on generation of confrontation network

Info

Publication number: CN114757188A
Application number: CN202210550303.5A
Authority: CN
Inventors: 汪祖民; 徐畅; 季长清; 秦静
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2022-05-20
Filing date: 2022-05-20
Publication date: 2022-07-15

Abstract

The invention discloses a standard medical text rewriting method based on generation of an antagonistic network, which comprises the following steps: extracting spoken and normalized medical question and answer corpora for processing to obtain a data set; adopting a Transformer model to construct a normalized medical text generator and a spoken medical text generator, and pre-training through a user health term mapping table to obtain a normalized medical text; constructing a standardized medical text discriminator and a spoken medical text discriminator by adopting an LSTM neural network; respectively optimizing a normalized medical text discriminator and a spoken medical text discriminator by using a loss function in combination with the medical text characteristics; and optimizing the normalized medical text generator and the spoken medical text generator by adopting a reinforcement learning mode. The method realizes the mutual migration and rewriting between the spoken text and the normalized text, solves the problem of excessive dependence of the traditional text migration model on the labeled corpus, ensures that the model is still reliable under the condition of no parallel corpus, and reduces the workload required by manual labeling of data.

Description

Standard medical text rewriting method based on generation of confrontation network

Technical Field

The invention relates to the technical field of natural language processing, in particular to a standardized medical text rewriting method based on a generation countermeasure network.

Background

Text style migration has always been a hot issue in the field of natural language generation. The meaning of the method is that on the basis of keeping the semantic content of the original text unchanged, the text is converted or generated into the text with another specific style or attribute, and the smoothness and the vividness of the newly generated text are ensured. The text style migration removes writing style migration or emotion migration of texts, and can be applied to fields of a dialogue question-answering system of a chat robot, text rewriting, specification examination or generation of professional documents and texts and the like. Most of the existing text generation models have the problems of difficult model training and grammar error or semantic missing of generated contents, and the application of the text style migration model can flexibly reduce the training difficulty of the text generation model.

In recent years, the development of deep learning technology enables the natural language processing to be widely applied to various scenes and complex tasks. In the medical field, the online inquiry technology is gradually popularized, and the establishment of various medical health websites enables patients to perform self-diagnosis in an inquiry and answer mode without going home. However, due to lack of professional medical knowledge, the problem that the disease condition is unclear and the expression content is spoken when the users use the platform tools often occurs, so that the AI-assisted diagnosis has an understanding barrier for the information provided by the users. The obstacle is not only in the aspect of machine reading and understanding, but also is often bidirectional, and communication obstacles exist among doctors and patients due to the fact that the patients have spoken description or specialized terms of doctors, and online inquiry is inefficient. Therefore, the application of the text style migration technology in the aspects of text rewriting and text normalization provides a good scheme for solving the problems.

Currently, text style migration methods can be generally divided into two types: supervised learning and unsupervised learning. The supervised learning is similar to the machine translation mode, the style conversion is carried out by using the parallel data set, and the text converted by the method has high precision and good conversion effect. The existing text style migration model also mostly adopts the end-to-end model similar to statistical machine translation, but the model lacks a labeling data corpus, and manual labeling data needs to consume a large amount of manpower and material resources, so the research aiming at the text style migration model is transferred to an unsupervised learning mode.

Compared with a supervised style migration model similar to machine translation, the unsupervised learning model can effectively separate the attributes and contents of the text, and the model can be trained without a large amount of paired data to obtain an ideal generated text. However, at present, the research progress of an unsupervised text style migration model is far slower than that of image style migration, because the text discreteness problem exists when the style migration is applied to the text. The text discreteness enables the text to generate the loss of text fluency and content integrity in the migration process, and the model has the problems of low text generation quality and poor generalization. Secondly, the quality of the model is difficult to evaluate, and different from an image style discrimination mode, the language style definition is fuzzy, so that the method is more challenging.

Disclosure of Invention

The invention aims to provide a standardized medical text rewriting method based on a generation countermeasure network, which can realize bidirectional conversion between patient spoken disease description and professional standardized terms used by doctors and AI auxiliary diagnosis.

In order to achieve the above object, the present application provides a method for rewriting a normative medical text based on a generation countermeasure network, including:

extracting spoken and normalized medical question and answer corpora for processing to obtain a data set;

method for constructing standardized medical text generator by adopting Transformer model

Spoken medical text generator

Pre-training is carried out through a user health term mapping table to obtain a standardized medical text;

construction of a normalized medical text discriminator D Using the LSTM neural network_Φ1(Y)(Y) and spoken medical text discriminator D_Φ2(X)(X)；

Respectively optimizing and standardizing medical text discriminator D by using loss function in combination with medical text characteristics_Φ1(Y)(Y) and spoken medical text discriminator D_Φ2(X)(X)；

Optimization and standardization medical text generator by adopting reinforcement learning mode

Spoken medical text generator

Further, the spoken sentences in the data set are used as X-style samples, and the sentences containing normalized words are used as pseudo-parallel samples of Y target styles to be converted; marking the oral statement which can be mapped with terms in the test set through the user health term mapping table and providing the oral statement as a hidden layer to the normalized medical text generator

Further, the data set includes: data set X ═ X₁,x₂,…,x_i,…,x_nY, data set Y ═ Y₁,y₂,…,y_i,…,y_nI represents an ith sample, n represents n samples in total, and x and y represent a sample statement of a spoken language style and a sample statement of a normalized style respectively; the sample sentence in spoken style is represented as:

the tth word representing a sentence, wherein T represents the length of the sentence, namely the number of words;

to enable opening ofThe sample sentences of the linguistic style and the sample sentences of the normalized style are correlated, after the medical entities in each sentence are identified through word segmentation, the sample sentences of the non-normalized spoken style are labeled by combining with a user health term mapping table, and the labeling sequences are labeled as

The corresponding position of the sample statement which needs to be normalized is marked as 1, and the corresponding position of the sample statement which does not need to be normalized is marked as 0.

Further, a Transformer model is adopted to construct a standardized medical text generator

Spoken medical text generator

The method comprises the following specific steps: adopts a CycleGAN structure to construct a normalized medical text generator

Spoken medical text generator

The two generators are generated in opposite directions and can form a closed loop after being connected to provide feedback information mutually.

Further, the normalized medical text generator is pre-trained using maximum likelihood estimation

Spoken medical text generator

The concrete method is as follows: setting the longest length of a generated sentence pattern as 30 words, setting the Embedding dimension Embedding _ size value of a given word as 512, and enabling an Encoder Encoder and a Decoder Decoder to be of six-layer structures; setting the user health term mapping table as a generated word table, and using a colloquial style sample sentence and a normalized style sample sentenceAnd pre-training word vectors together with the training sets divided in the sentence to generate an Embedding initial value corresponding to the word.

Further, a normalized medical text discriminator D is constructed by adopting an LSTM neural network_Φ1(Y)(Y) and spoken medical text discriminator D_Φ2(X)(X), specifically:

the last hidden layer H of the LSTM neural network_nInstead of a binary logistic regression layer, it is determined whether the input medical text is a true sample from dataset Y or a sample generated by a normalized medical text generator

And carrying out nonlinear conversion on the input medical text high-dimensional sequence to obtain Embedding of words in the sequence, then inputting the Embedding into each basic unit cell, and obtaining the probability of outputting each word by combining a full-connection hidden layer.

Further, normalized medical text discriminator D is optimized separately using a loss function _Φ1(Y)(Y) and spoken medical text discriminator D_Φ2(X)(X), specifically:

adjustable parameter theta in generator₁Real samples of the fixed case randomly sampled dataset Y and samples generated by the normalized medical text generator

Then minimizing the cross entropy; normalized medical text discriminator D_Φ1(Y)The loss function of (Y) is as follows:

L_all＝β₁L₁+β₂L₂

wherein L is₁To normalize the generation of medical text discriminators against loss, β₁To the loss term coefficient, L₂Noting the loss, beta, for sequences₂To coefficient of loss term, beta₁And beta₂Are all less than 0.5;

adjustable parameter theta in generator₂Real samples of a fixed-case randomly sampled dataset X and samples generated by a spoken medical text generator

Then minimizing the cross entropy; oral medical text discriminator D_Φ2(X)The loss function of (X) is as follows:

L'_all＝β₁L'₁+β₂L'₂

wherein L'₁To generate countermeasure against loss, beta, for spoken medical text discriminators₁Is the loss term coefficient, L'₂Noting the loss, beta, for sequences₂To coefficient of loss term, beta₁And beta₂Are all less than 0.5.

Furthermore, a medical text generator is optimized and normalized in a reinforcement learning mode, and the method specifically comprises the following steps:

the normalized medical text generator is optimized using a minimum cross entropy loss function:

the gradient of the above formula is equal to the following formula:

Combining with reinforcement learning mechanism to obtain standardized medical text generator

The reward function of (a) is:

wherein R is_sTo generate style accuracy of text, R_cFor the semantic retention degree, alpha is a harmonic weight parameter, and the value range of the alpha is between 0 and 1;

calculating the probability value of the generated sentence in the target style distribution for the style accuracy reward of the generated text, and taking the probability value as a reward function, wherein the formula is as follows:

for the reward of the semantic retention degree, the cosine similarity between the generated statement and the original statement is used for calculation and is recorded as:

the normalized medical text generator

The desired reward is expressed as:

wherein the content of the first and second substances,

representing generation of statements by generatorsThe probability of the occurrence of the event,

it is shown that when the first t-1 words are input by the generator, the t-th word is randomly sampled and noted

It indicates the expectation of future rewards to the current t-1 location.

Furthermore, the method for optimizing the spoken medical text generator by adopting the reinforcement learning mode specifically comprises the following steps:

the spoken medical text generator is optimized using a minimum cross entropy loss function:

the gradient of the above formula is equal to the following formula:

combining reinforcement learning mechanism to obtain spoken medical text generator

The reward function of (a) is:

for the style accuracy reward of the generated text, calculating the probability value of the generated sentence in the spoken style distribution, taking the probability value as a reward function, and adopting the formula as follows:

for the reward of the semantic retention degree, the cosine similarity between the generated sentence and the original sentence Embedding is used for calculation and is recorded as:

the spoken medical text generator

The desired reward is expressed as:

wherein the content of the first and second substances,

what is shown is the probability that the generator generates the statement,

It indicates the expectation of future rewards to the current t-1 location.

Furthermore, during the process of optimizing the normalized medical text discriminator, the spoken medical text discriminator, the normalized medical text generator and the spoken medical text generator, discriminators and generators with opposite targets are introduced to perform mutual confrontation until a Nash equilibrium state is reached.

Compared with the prior art, the technical scheme adopted by the invention has the advantages that: the method realizes the mutual migration and rewriting between the spoken text and the normalized text, solves the problem of excessive dependence of the traditional text migration model on the labeled corpus, ensures that the model is still reliable under the condition of no parallel corpus, and reduces the workload required by manual labeling of data. The method uses a Transformer as a generator, introduces a reinforcement learning mode for optimization, solves the problem that the application of the generated confrontation network model cycleGAN in the text field cannot be optimized by a gradient descent method, fundamentally avoids the problem of exposure deviation brought by the traditional maximum likelihood estimation optimization mode, and ensures the stability of the model.

Drawings

FIG. 1 is a schematic diagram of an improved structure for generating a countermeasure network model according to the present invention;

FIG. 2 is a diagram of an improved generator model of the present invention;

FIG. 3 is a graph of the results of the present invention after overwriting based on a portion of the test set sample.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad application, i.e., the embodiments described are only a few examples and not all examples.

Example 1

The embodiment provides a method for rewriting a normative medical text based on a generation countermeasure network, which is characterized by comprising the following steps:

step S1, extracting spoken and normalized medical question and answer corpora for processing to obtain a data set;

specifically, the cMedQQ paraphrase identification data set provided by CBLUE is selected and processed. The purpose of paraphrase recognition is to determine whether two sentences have the same semantics, and express the same meaning in different sentences. cMedQQ consists of 16070 paired training sets and 2000 test sets, and colloquially speaking the data sets The normalized sentences serve as X-style samples, and the sentences containing normalized words serve as pseudo-parallel samples of the Y-target style to be converted. In order to ensure the stability of the generated statements of the model, a user health term mapping table is additionally adopted to label the spoken statements which can be mapped with terms in the test set and serve as a hidden layer to be provided for the normalized medical text generator

The data set includes: data set X ═ X₁,x₂,…,x_i,…,x_nY, data set Y ═ Y₁,y₂,…,y_i,…,y_nI represents the ith sample, n represents n samples in total, and x and y represent a sample statement of a spoken language style and a sample statement of a normalized style respectively; the sample sentence in spoken style is represented as:

the tth word represents the sentence, and T represents the sentence length, namely the number of words;

in order to enable the sample sentences in the spoken language style to be correlated with the sample sentences in the normalized style, after medical entities in each sentence are identified through word segmentation, the sample sentences in the non-normalized spoken language style are labeled by combining with a user health term mapping table, and the labeling sequence is recorded as

And marking the corresponding position of the sample statement needing to be normalized as 1, and marking the corresponding position of the sample statement not needing to be normalized as 0.

Step S2, adopting Transformer model to construct normalized medical text generator

Spoken medical text generator

specifically, since the purpose of the application is to realize the interconversion between the spoken text and the normalized text containing the professional terms, the CycleGAN structure is adopted to construct two generators. Normalized medical text generator

Spoken medical text generator

The generation directions of the two generators are opposite, so that a closed loop can be formed by connecting the two generators to provide feedback mutually so as to optimize the model, the problem that no paired data exists in the supervised learning model is solved, and compared with other unsupervised learning models, the generated sentences are relatively stable, and the text conversion quality is ensured.

Pre-training a normalized medical text generator using maximum likelihood estimation

Spoken medical text generator

The concrete mode is as follows: setting the longest length of a generated sentence pattern as 30 words, setting the Embedding dimension Embedding _ size value of a given word as 512, and enabling an Encoder Encoder and a Decoder Decoder to be of six-layer structures; and setting the user health term mapping table as a generated word table, pre-training word vectors by using training sets divided from the spoken style sample sentences and the normalized style sample sentences, and generating an Embedding initial value corresponding to a word.

Step S3, constructing a normalized medical text discriminator D by adopting an LSTM neural network_Φ1(Y)(Y) and spoken medical text discriminator D_Φ2(X)(X)；

Specifically, the last hidden layer H of the LSTM neural network_nInstead of a binary logistic regression layer, it is determined whether the input medical text is a true sample from dataset Y or a sample generated by a normalized medical text generator

For the input of the discriminator, unlike the traditional one-hot Vector (Onehot Vector) for generating words used by the countermeasure network, the method selects Embedding as the input of the discriminator. The similarity between two different words cannot be shown by the single-hot coding, and the similarity of the single-hot coding vector is 0 and cannot show the association between the two words by taking the words "two lungs" and "right pulmonary portal" as examples. And when the input sequence is long, the dimensionality of the one-hot coding matrix is too high, and the performance of the model is limited to a certain extent. Therefore, the method firstly carries out nonlinear conversion on the input medical text high-dimensional sequence to obtain Embedding of words in the sequence, then inputs the Embedding into each basic unit cell, and combines with the fully-connected hidden layer to obtain the probability of outputting each word.

Step S4, combining the medical text characteristics, respectively optimizing and standardizing the medical text discriminator D by using a loss function _Φ1(Y)(Y) and spoken medical text discriminator D_Φ2(X)(X)；

In particular, the parameter θ is adjustable at the generator₁Real samples of the fixed case randomly sampled dataset Y and samples generated by the normalized medical text generator

L_all＝β₁L₁+β₂L₂

L'_all＝β₁L'₁+β₂L'₂

Step S5, optimizing and standardizing the medical text generator by adopting a reinforcement learning mode

Spoken medical text generator

Specifically, through a reinforcement learning mode, gradient optimization is not relied on, so that the problems that the traditional gradient descent optimization cannot be used and the guidance cannot be conducted when the Cycle GAN structure is applied to the text are solved. Reinforcement learning continuously corrects errors interactively in the environment through the intelligent agent so as to learn a strategy capable of obtaining the maximum reward. However, in general reinforcement learning strategies, the arbiter provides an overall reward value after the sequence is finished, which is still unstable for discrete data such as text, and is easy to cause a problem of disordered word order, or generate sentences completely unrelated to the original semantics. Therefore, at each time period, future output results are also taken into account.

And step S6, repeating the steps S4-S5 until Nash equilibrium is reached.

Specifically, in order to make the normalized effect of the spoken medical text closer to the real text, the above steps are iterated repeatedly to improve the performance of the normalized medical text generator. Therefore, the discriminant and the generator with opposite targets are introduced to confront each other, the performance is improved under continuous confrontation, the model training reaches an ideal Nash equilibrium state, namely the distribution of the samples generated by the generator is consistent with the distribution in the data set, and the discriminant cannot completely distinguish the generated samples from the samples acquired in the real data set, so that the model performance of the mutual rewriting and migration of the spoken text and the normalized text is improved.

The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims

1. A normative medical text rewriting method based on a generation countermeasure network is characterized by comprising the following steps:

Spoken medical text generator

Spoken medical text generator

2. The canonical medical text rewrite method based on generation of countermeasure network of claim 1, wherein the spoken sentence in the data set is used as an X-style sample, and the sentence containing the canonical word is used as a pseudo-parallel sample of a Y-target style to be converted; marking the oral statement which can be mapped with terms in the test set through the user health term mapping table and providing the oral statement as a hidden layer to the normalized medical text generator

3. The canonical medical text rewrite method based on generation of countermeasure networks according to claim 2, characterized in that the data set includes: data set X ═ X₁，x₂，…，x_i，…，x_nY, data set Y ═ Y₁，y₂，…，y_i，…，y_nI represents an ith sample, n represents n samples in total, and x and y represent a sample statement of a spoken language style and a sample statement of a normalized style respectively; the sample sentence in spoken style is represented as:

4. The standardized medical text rewriting method based on generation of countermeasure network according to claim 1, wherein a Transformer model is adopted to construct the standardized medical text generator

Spoken medical text generator

The method specifically comprises the following steps: adopts a CycleGAN structure to construct standardized medical treatment Text generator

Spoken medical text generator

The two generators have opposite generating directions and can form a closed loop to provide feedback information mutually after being connected.

5. The canonical medical text rewrite method of claim 4, wherein the canonical medical text generator is pre-trained using maximum likelihood estimation

Spoken medical text generator

6. The canonical medical text rewrite method of claim 1, wherein the canonical medical text discriminator D is constructed by using LSTM neural network_Φ1(Y)(Y) and spoken medical text discriminator D_Φ2(X)(X), specifically:

And carrying out nonlinear conversion on the input medical text high-dimensional sequence to obtain Embedding of words in the sequence, then inputting the Embedding into each basic unit cell, and combining with a fully-connected hidden layer to obtain the probability of outputting each word.

7. The method of claim 1, wherein the normalized medical text comprises a loss function optimized for the normalized medical text discriminator D_Φ1(Y)(Y) and spoken medical text discriminator D_Φ2(X)(X), specifically:

L_all＝β₁L₁+β₂L₂

adjustable parameter theta in generator₂Real samples of a fixed-case randomly sampled dataset X and generated by a spoken medical text generatorSample(s)

L′_all＝β₁L′₁+β₂L′₂

Wherein L'₁Countering loss, beta, for generation of spoken medical text discriminators₁Is the loss term coefficient, L'₂Noting the loss, beta, for sequences₂To coefficient of loss term, beta₁And beta₂Are all less than 0.5.

8. The normative medical text rewriting method based on generation countermeasure network as claimed in claim 1, wherein the normative medical text generator is optimized by a reinforcement learning method, specifically:

the gradient of the above formula is equal to the following formula:

combining reinforcement learning mechanism to obtain normalized medical text generator

The reward function of (a) is:

wherein R is_sTo generate style accuracy of text, R_cThe semantic retention degree is defined, alpha is a harmonic weight parameter, and the value range of alpha is 0-1;

for the style accuracy reward of the generated text, calculating a probability value of the generated sentence in the target style distribution, taking the probability value as a reward function, and adopting the formula as follows:

the normalized medical text generator

The desired reward is expressed as:

Wherein, the first and the second end of the pipe are connected with each other,

the probability that the generator generates the statement is indicated,

it is indicated that when the former t-1 words are input by the generator, the t-th word is randomly sampled and noted

It indicates the expectation of future rewards to the current t-1 position.

9. The normative medical text rewriting method based on generation of the countermeasure network according to claim 1, wherein a reinforcement learning manner is adopted to optimize the spoken medical text generator, specifically:

the spoken medical text generator is optimized using a minimized cross entropy loss function:

the gradient of the above formula is equal to the following formula:

The reward function of (a) is:

for the style accuracy reward of the generated text, calculating a probability value of the generated sentence in the spoken style distribution, and taking the probability value as a reward function, wherein the formula is as follows:

the spoken medical text generator

The desired reward is expressed as:

the probability that the generator generates the statement is indicated,

It indicates the expectation of future rewards to the current t-1 position.

10. The normative medical text rewriting method based on the generative confrontation network as claimed in claim 1, wherein opposite-target discriminators and generators are introduced in the process of optimizing the normative medical text discriminator, the normative medical text generator and the normative medical text generator, and mutual confrontation is carried out until a nash equilibrium state is reached.