WO2019179100A1 - Medical text generation method based on generative adversarial network technology - Google Patents

Medical text generation method based on generative adversarial network technology

Info

Publication number
WO2019179100A1
WO2019179100A1 PCT/CN2018/112285 CN2018112285W
Authority
WO
WIPO (PCT)
Prior art keywords
medical
text
word
generated
document
Prior art date
Application number
PCT/CN2018/112285
Other languages
English (en)
French (fr)
Inventor
朱斐
叶飞
伏玉琛
陈冬火
Original Assignee
苏州大学张家港工业技术研究院
苏州大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州大学张家港工业技术研究院, 苏州大学 filed Critical 苏州大学张家港工业技术研究院
Publication of WO2019179100A1 publication Critical patent/WO2019179100A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/55Rule-based translation
    • G06F40/56Natural language generation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Definitions

  • the present invention relates to the field of data mining of medical texts, and in particular to a medical text generation method based on generative adversarial network technology.
  • the Generative Adversarial Net (GAN) consists of two parts: a generative model and a discriminative model.
  • the discriminative model, like a classifier, has a decision boundary through which samples are distinguished. For example, an output of 1 means that the sample is real (true), and an output of 0 means that the sample is fake (false). From a probabilistic perspective, this yields the probability that sample x belongs to category y, i.e., the conditional probability P(y|x).
  • the generative model produces data to fit the entire distribution; from a probabilistic perspective this is the probability that sample x is produced under the whole distribution, i.e., the joint probability P(x, y).
  • a generative model and a discriminative model are used: the discriminative model judges whether a given input medical text is "real text"; the task of the generative model is to create, by simulation, as many medical texts as possible that the discriminative model judges to be "real text".
  • at initialization, neither model has been trained; they are trained adversarially together: the generative model produces text to deceive the discriminative model, and the discriminative model judges whether the text is real or fake. The two models keep learning, training, and improving, and finally reach a steady state.
  • LSTM (Long Short-Term Memory)
  • LSTM adds a "processor" that determines whether information is useful or not.
  • the structure in which this processor operates is called a cell.
  • three gates are placed in a cell: an input gate, a forget gate, and an output gate.
  • the gate mechanism is a method for selectively passing information. It consists of a sigmoid neural network layer and a pointwise multiplication operation.
  • the sigmoid layer outputs a value between 0 and 1, describing how much of each component is allowed to pass: 0 means "let nothing through", and 1 means "let everything through".
  • the LSTM network is suited to time-series data and is therefore appropriate for medical text, whose information varies over time.
  • CNN (Convolutional Neural Network)
  • the CNN includes an input layer, a convolutional layer, a pooling layer, and an output layer.
  • the mapping between the input layer and the convolutional layer is called feature mapping.
  • the mapping between the convolutional layer and the pooling layer is called a pooling operation, such as max pooling or L2 pooling.
  • the mapping between the pooling layer and the output layer is generally referred to as a fully-connected operation.
  • CNNs also have many applications in text classification and text modeling. The method of this patent uses a CNN in the discriminator structure to judge whether medical text is "real" or "fake".
  • the object of the present invention is to provide a medical text generation method based on generative adversarial network technology, which generates new medical texts by shuffling and simulating data, for use by machine learning, data mining, artificial intelligence, and similar methods in training, learning, and testing; it addresses the patient-privacy issues that medical texts may involve and the scarcity of medical text.
  • a medical text generation method based on generative adversarial network technology, comprising the following steps:
  • step (5): determining whether the medical document set PD_SET to be preprocessed still has unprocessed documents; if so, randomly reading one medical document D from it and proceeding to step (6); if not, proceeding to step (10);
  • step (8): using the RNN text classifier to determine whether medical document D is a required medical document; if so, proceeding to step (9); if not, medical document D is an unneeded medical document, proceeding to step (5);
  • step (13): determining whether the number m of generated texts is less than the number n of medical documents to be generated; if so, proceeding to step (14); if not, proceeding to step (18);
  • step (16): determining whether GD_BLEU is greater than the set threshold BLEU_MAX; if so, the generated medical text GD is invalid text and GD is discarded, proceeding to step (17); if not, the generated medical text GD is added to the generated medical text set GD_SET, proceeding to step (17);
  • step (1): accessing PubMed, the text database for medicine and the life sciences, and downloading multiple medical documents in a given specialty from the authoritative biomedical literature database MEDLINE;
  • each medical document is saved in txt format, and each medical document is English text.
  • step (7): a skip-gram-based language model is trained to obtain word vectors for all words in each medical document.
  • given a word, the probability that some other word occurs within its window is:

    $$P(u_x \mid v_c) = \frac{e^{(u_x)^T v_c}}{\sum_{j=1}^{K} e^{(u_j)^T v_c}}$$

  • u_x represents the word vector of the x-th word in the window other than the target word;
  • (u_x)^T represents the transpose of u_x;
  • W is the matrix composed of the word vectors of the target word;
  • W′ is the matrix composed of the transposed word vectors of all other words in the window except the target word;
  • e is the natural constant, approximately 2.71828;
  • K represents the number of all words in the window of the target word other than the target word;
  • j represents the index of one of the K words.
  • step (8): a classification model based on a recurrent neural network (RNN) is trained to perform text classification on each medical document.
  • RNN (recurrent neural network)
  • the goal of the classification model is to minimize the cross entropy between the predicted probability distribution and the true probability distribution:

    $$L = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij}$$

    where y_ij is the ground-truth probability that the i-th training sample belongs to the j-th category, and ŷ_ij is the corresponding predicted probability;
  • N is the number of training samples;
  • C is the number of categories, here 2, meaning there are two categories of documents: those that satisfy the condition serve as "required medical documents", denoted by category "1"; those that do not serve as "unneeded medical documents", denoted by category "0";
  • k medical documents are first selected as a training set, and a classification model is obtained through training; the classification result is one of "required medical document" or "unneeded medical document". The obtained classification model classifies all downloaded medical documents: if the classification result is "required medical document", the document is retained; otherwise the document is discarded;
  • the penultimate layer of the RNN text classification model, a softmax layer, outputs a one-dimensional column vector such that each element of the vector is a real number in (0, 1) and the two elements of the vector sum to 1:

    $$y(x_i) = \frac{e^{x_i}}{\sum_{j=0}^{1} e^{x_j}}$$

  • exp is the exponential function e^x;
  • i takes the values 0 and 1;
  • x_i is an input of the softmax layer;
  • y(x_i) represents the output of the softmax layer for that input, i.e., the probability of being classified into a given class.
  • the first element of the column vector is the probability that the document is classified as category "1", and the second element is the probability that the document is classified as category "0".
  • the output layer then uses a max function: if the first element is the larger, the document is predicted to belong to category "1", i.e., the document is a "required medical document"; if the second element is the larger, the document is predicted to belong to category "0", i.e., the document is an "unneeded medical document".
  • the generation model is the generative model of a generative adversarial network
  • the objective function is:

    $$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

  • G is the generator;
  • D is the discriminator;
  • V(D, G) is the name of the objective function;
  • p_data(x) represents the distribution of the real data;
  • D(x) represents the probability that the discriminator judges x to be real;
  • p_z(z) represents the probability distribution obeyed by the data z of the generator;
  • G(z) refers to the probability that the data generated by the generator is z;
  • D(G(z)) represents the probability that the discriminator judges the generated data obeying the probability distribution G(z) to be real;
  • log is the logarithmic function, with the natural constant e as its base;
  • the discriminator tries to maximize the function V(D, G), maximizing its ability to distinguish real from fake.
  • the generator's task is exactly the opposite: it tries to minimize the function V(D, G), minimizing the difference between real data and fake data;
  • the framework adopted by the generative adversarial network is: the generator adopts an LSTM recurrent neural network structure with a memory function, and the discriminator adopts a CNN deep neural network structure.
  • the CNN structure is used to encode sentences; its core contains a convolutional layer and a max-pooling operation.
  • the input is a sentence of length T (padded with spaces if shorter than T, truncated if longer than T), represented as a k*T matrix X whose t-th column x_t represents a word vector. This constitutes the input matrix.
  • a convolution operation involves a convolution kernel $W_c \in \mathbb{R}^{k \times h}$, where h represents the window size in words and k represents the dimensionality of the word vectors:

    $$c = f(X * W_c + b) \in \mathbb{R}^{T-h+1}$$

  • f(·) is a nonlinear activation function similar to the hyperbolic tangent function.
  • b is the bias vector, and * represents the convolution operation.
  • the maximum activation value of a square region (assumed to be 2*2) is obtained by the max-pooling operation; applying this max-pooling over the entire convolutional layer with this square region finally yields the pooling layer.
  • assuming a window size of h and d convolution kernels, there are h*d full connections from the pooling layer to the output layer.
  • a softmax layer is then used to turn the output layer into a one-dimensional vector whose every element lies between 0 and 1; each element of the vector represents the probability that the input comes from the real data distribution. This serves as the basis for judging whether the data is real.
  • the BLEU parameter value is an automatic assessment of the degree of similarity between the source text and the target text, used to measure the quality of the conversion from the source text to the target text.
  • the BLEU parameter is defined as follows:

    $$\mathrm{BLEU} = B_p \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad p_n = \frac{\sum_{n\text{-}gram \in C} Count_{clip}(n\text{-}gram)}{\sum_{n\text{-}gram' \in C'} Count(n\text{-}gram')}$$

  • n is the number of words constituting a word segment; n takes the values 1, 2, 3, 4, representing 1-gram, 2-gram, 3-gram, 4-gram;
  • w_n is a weight value, set to 1/4;
  • C is the set of word segments that appear in the generated text and adopt the "modified n-gram precision" strategy, while C′ is the set of word segments that appear in the generated text but do not adopt the "modified n-gram precision" strategy; Count_clip(n-gram) is the number of word segments that appear in the generated text and adopt the "modified n-gram precision" strategy; Count(n-gram′) is the number of word segments that appear in the generated text but do not adopt it;
  • the set threshold BLEU_MAX in step (16) is set to 0.5.
  • compared with the prior art, the present invention has the following advantage: based on a generative adversarial network model, the present invention randomly generates a specified quantity of medical text, addressing patient privacy concerns and the scarcity of medical text.
  • FIG. 1 is a flowchart of the medical text generation method based on generative adversarial network technology disclosed by the present invention.
  • FIG. 2 is a structural diagram of the generative adversarial network model disclosed by the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a medical text generation method based on generative adversarial network technology, comprising the following steps: downloading multiple medical documents in a given specialty; representing each word in each medical document with a word vector; performing text classification on each medical document and retaining the required medical documents; obtaining the best output sequence labeling for each required medical document; obtaining the index of the keyword set of each required medical document and randomly shuffling the index to obtain new medical documents and their corresponding sequence labels; training a generation model based on a generative adversarial network to generate medical text; outputting the generated medical text; obtaining the BLEU parameters of the generated medical text; and evaluating the generated medical text to finally obtain the target medical text. By shuffling data to generate new medical texts, the present invention addresses patient privacy concerns and the scarcity of medical text.

Description

Medical Text Generation Method Based on Generative Adversarial Network Technology

Technical Field
The present invention relates to the field of data mining of medical texts, and in particular to a medical text generation method based on generative adversarial network technology.
Background
At present, researchers are applying new methods such as artificial intelligence and machine learning to medical text data. While achieving some good results, they have also encountered problems, such as:
(1) Data scarcity. Medical text data, especially data on rare and critical diseases, are severely insufficient. This causes machine learning and similar methods to fail in the training and learning phase, and prevents the information on these diseases from being reflected correctly, faithfully, and comprehensively.
(2) Privacy and security. How to protect privacy has always been a focus of public concern. For example, a former mayor of a city in Indiana, USA, was suddenly found to be infected with a malignant infectious disease. It later emerged that the mayor had visited a certain hospital before the incident; based on the supposedly "non-sensitive, privacy-free" medical data provided by that hospital, combined with other big-data analysis, someone inferred that the mayor suffered from the malignant infectious disease.
When new methods such as artificial intelligence and machine learning are used in healthcare to realize precision medicine and intelligent medicine, a large amount of medical text data is generally needed for training, a decision model is learned, and the model must be tested before it can be applied in clinical practice. However, the scarcity of medical text data and privacy-protection concerns limit the application of these new technologies in medical informatics. How to solve these problems effectively is therefore an urgent issue in this field.
A Generative Adversarial Net (GAN) comprises two parts: a generative model and a discriminative model. The discriminative model, like a classifier, has a decision boundary through which samples are distinguished. For example, an output of 1 means the sample is real (true) and an output of 0 means the sample is fake (false); from a probabilistic perspective, this yields the probability that sample x belongs to category y, i.e., the conditional probability P(y|x). The generative model produces data to fit the entire distribution; from a probabilistic perspective, this is the probability that sample x is produced under the whole distribution, i.e., the joint probability P(x, y).
In the medical text generation method based on generative adversarial network technology, a generative model and a discriminative model are used: the discriminative model judges whether a batch of given input medical texts are "real text"; the task of the generative model is to create, by simulation, as many medical texts as possible that the discriminative model judges to be "real text". In the initialization phase, neither model has been trained; they are trained adversarially together: the generative model produces text to deceive the discriminative model, and the discriminative model judges whether the text is real or fake. The two models keep learning, training, and improving, and finally reach a steady state.
Long Short-Term Memory (LSTM) is a temporal recurrent neural network suited to processing and predicting important events with relatively long intervals and delays in a time series. An LSTM adds a "processor" that judges whether information is useful; the structure in which this processor operates is called a cell. Three gates are placed in a cell: an input gate, a forget gate, and an output gate. The gate mechanism is a method for letting information pass selectively; it comprises a sigmoid neural network layer and a pointwise multiplication operation, where the sigmoid layer outputs a value between 0 and 1 describing how much of each component may pass: 0 means "let nothing through", and 1 means "let everything through". The LSTM network is suited to time-series data and is therefore appropriate for medical text, whose information varies over time.
A Convolutional Neural Network (CNN) is a deep feed-forward artificial neural network that has been applied successfully to image recognition. A CNN usually comprises an input layer, convolutional layers, pooling layers, and an output layer. The mapping between the input layer and a convolutional layer is called feature mapping; the mapping between a convolutional layer and a pooling layer is called a pooling operation, such as max pooling or L2 pooling; and the mapping between a pooling layer and the output layer is generally called a fully-connected operation. CNNs also have many applications in text classification and text modeling; the method of this patent uses a CNN in the discriminator structure to judge whether medical text is "real" or "fake".
Summary of the Invention
The object of the present invention is to provide a medical text generation method based on generative adversarial network technology that generates new medical texts by shuffling data and by simulation, for use by machine learning, data mining, artificial intelligence, and similar methods in training, learning, and testing; it addresses the patient-privacy issues that medical texts may involve and the scarcity of medical text.
A Generative Adversarial Net (GAN) comprises two parts: a generative model and a discriminative model. The discriminative model, like a classifier, has a decision boundary through which samples are distinguished. For example, an output of 1 means the sample is real (true) and an output of 0 means the sample is fake (false); from a probabilistic perspective, this yields the probability that sample x belongs to category y, i.e., the conditional probability P(y|x). The generative model produces data to fit the entire distribution; from a probabilistic perspective, this is the probability that sample x is produced under the whole distribution, i.e., the joint probability P(x, y).
In the medical text generation method based on generative adversarial network technology, a generative model and a discriminative model are used: the discriminative model judges whether a batch of given input medical texts are "real text"; the task of the generative model is to create, by simulation, as many medical texts as possible that the discriminative model judges to be "real text". In the initialization phase, neither model has been trained; they are trained adversarially together: the generative model produces text to deceive the discriminative model, and the discriminative model judges whether the text is "real" or "fake". The two models keep learning, training, and improving, and finally reach a steady state.
Long Short-Term Memory (LSTM) is a temporal recurrent neural network suited to processing and predicting important events with relatively long intervals and delays in a time series. An LSTM adds a "processor" that judges whether information is useful; the structure in which this processor operates is called a cell. Three gates are placed in a cell: an input gate, a forget gate, and an output gate. The gate mechanism is a method for letting information pass selectively; it comprises a sigmoid neural network layer and a pointwise multiplication operation, where the sigmoid layer outputs a value between 0 and 1 describing how much of each component may pass: 0 means "let nothing through", and 1 means "let everything through". The LSTM network is suited to time-series data and is therefore appropriate for medical text, whose information varies over time.
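As an illustration of the gate mechanism just described, the following is a minimal sketch of a single LSTM cell step in Python with NumPy; it is not part of the patent, and the names (lstm_cell_step, the weight dictionaries W, U, b) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    # Sigmoid layer: outputs values in (0, 1), i.e. how much may pass through a gate.
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One step of an LSTM cell with input, forget, and output gates.

    x: current input vector; h_prev, c_prev: previous hidden and cell states.
    W, U, b: dicts of weight matrices and bias vectors for the gates
    'i', 'f', 'o' and the candidate state 'g' (hypothetical names).
    """
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])  # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])  # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])  # output gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])  # candidate cell state
    c = f * c_prev + i * g   # pointwise multiplication: the gates select information
    h = o * np.tanh(c)       # gated output
    return h, c
```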
A Convolutional Neural Network (CNN) is a deep feed-forward artificial neural network that has been applied successfully to image recognition. A CNN usually comprises an input layer, convolutional layers, pooling layers, and an output layer. The mapping between the input layer and a convolutional layer is called feature mapping; the mapping between a convolutional layer and a pooling layer is called a pooling operation, such as max pooling or L2 pooling; and the mapping between a pooling layer and the output layer is generally called a fully-connected operation. CNNs also have many applications in text classification and text modeling; the method of this patent uses a CNN in the discriminator structure to judge whether medical text is "real" or "fake".
To achieve the above object, the present invention provides the following technical solution: a medical text generation method based on generative adversarial network technology, comprising the following steps:
(1) Download multiple medical documents in a given specialty to form a medical document set PD_SET to be preprocessed;
(2) Set the number n of medical documents to be generated;
(3) Initialize the generated medical text set GD_SET to empty;
(4) Initialize the input data set INPUT_SET to empty;
(5) Determine whether the medical document set PD_SET to be preprocessed still has unprocessed documents; if so, randomly read one medical document D from it and go to step (6); if not, go to step (10);
(6) Remove medical document D from the medical document set PD_SET to be preprocessed;
(7) Perform feature extraction (vectorization) on medical document D to obtain a word vector for each word of medical document D;
(8) Use the RNN text classifier to determine whether medical document D is a required medical document; if so, go to step (9); if not, medical document D is an unneeded medical document, go to step (5);
(9) Read the word vector of each word of medical document D to form sentence vectors, add them to the input data set INPUT_SET, and go to step (5);
(10) Read the contents of the input data set INPUT_SET;
(11) Train on the read input data set INPUT_SET using a method based on generative adversarial networks to obtain a medical text generation model MODEL;
(12) Set the number of generated texts m = 0;
(13) Determine whether the number m of generated texts is less than the number n of medical documents to be generated; if so, go to step (14); if not, go to step (18);
(14) Use the medical text generation model MODEL to generate a medical text GD;
(15) Compute the BLEU parameter value GD_BLEU of GD;
(16) Determine whether GD_BLEU is greater than the set threshold BLEU_MAX; if so, the generated medical text GD is invalid text, discard GD and go to step (17); if not, add the generated medical text GD to the generated medical text set GD_SET and go to step (17);
(17) Increase the number m of generated texts by 1 and go to step (13);
(18) Determine whether the generated medical text set GD_SET is empty; if so, output "no text satisfies the conditions"; if not, output the generated medical text set GD_SET.
In the above technical solution, in step (1), the text database PubMed for medicine and the life sciences is accessed, and multiple medical documents in a given specialty are downloaded from the authoritative biomedical literature database MEDLINE;
each medical document is saved in txt format, and each medical document is English text.
In the above technical solution, in step (7), a skip-gram-based language model is trained to obtain word vectors for all words in each medical document.
Given a word, the probability that some other word within the window occurs is:

$$P(u_x \mid v_c) = \frac{e^{(u_x)^T v_c}}{\sum_{j=1}^{K} e^{(u_j)^T v_c}}$$

where Z denotes the similarity (u_x)^T v_c, with (u_x)^T v_c = W′v_c and v_c = W w_c; w_c denotes the one-hot vector of the target word; v_c denotes the word vector of the target word; u_x denotes the word vector of the x-th word in the window other than the target word; (u_x)^T denotes the transpose of u_x; W is the matrix composed of the word vectors of the target word; W′ is the matrix composed of the transposed word vectors of all other words in the window except the target word;
e is the natural constant, approximately 2.71828;
K denotes the number of all words in the window of the target word other than the target word;
j denotes the index of one of the K words.
In the above technical solution, in step (8), a classification model based on a recurrent neural network (RNN) is trained to perform text classification on each medical document.
The goal of the classification model is to minimize the cross entropy between the predicted probability distribution and the true probability distribution:

$$L = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij}$$

where y_ij denotes the ground-truth label (the true value or reference standard), i.e., the probability that the i-th training sample belongs to the j-th category;
ŷ_ij is the predicted probability that the i-th training sample belongs to the j-th category;
N is the number of training samples;
C is the number of categories, here 2, meaning there are two categories of documents: those that satisfy the condition serve as "required medical documents", denoted by category "1"; those that do not serve as "unneeded medical documents", denoted by category "0".
In the above technical solution, k medical documents are first selected as a training set, and a classification model is obtained through training; the classification result is one of "required medical document" or "unneeded medical document". The obtained classification model is used to classify all downloaded medical documents: if the classification result is "required medical document", the document is retained; otherwise the document is discarded.
具体的,上述技术方案中,RNN文本分类模型的倒数第二层softmax层用于输出一个一维列向量,使得该向量的每个元素值是介于(0,1)之间的实数,并且该向量的两个元素值之和为1。
Figure PCTCN2018112285-appb-000005
上述公式中,exp是指数函数e x,i的取值为0和1,x i是softmax层的某一输入,y(x i)表示softmax层的对应该输入的输出,即被分类为某一类的概率。
该列向量的第一个元素是文档被分类为类别“1”概率,第二个元素是文档被分类为类别“0”的概率。输出层再使用一个max函数:如果
Figure PCTCN2018112285-appb-000006
则预测文档属于类别“1”,即文档是“需要的医疗文档”;如果
Figure PCTCN2018112285-appb-000007
则预测文档属于类别“0”,即文档是“不需要的医疗文档”。
In the above technical solution, in step (11), the generation model is the generative model of a generative adversarial network, whose objective function is:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where G is the generator;
D is the discriminator;
V(D, G) is the name of the objective function;
E denotes expectation;
p_data(x) denotes the distribution of the real data;
D(x) denotes the probability that the discriminator judges x to be real;
p_z(z) denotes the probability distribution obeyed by the data z of the generator;
G(z) refers to the probability that the data generated by the generator is z;
D(G(z)) denotes the probability that the discriminator judges the generated data obeying the probability distribution G(z) to be real;
log is the logarithmic function, with the natural constant e as its base;
max_D V(D, G) means that the discriminator tries to maximize the function V(D, G), maximizing its ability to tell real from fake; the generator's task, on the other hand, is exactly the opposite: it tries to minimize the function V(D, G), minimizing the difference between real data and fake data;
The framework adopted by the generative adversarial network is: the generator adopts an LSTM recurrent neural network structure with a memory function, and the discriminator adopts a CNN deep neural network structure.
The CNN structure is used to encode sentences; its core contains a convolutional layer and a max-pooling operation. Suppose the input is a sentence of length T (padded with spaces if the sentence is shorter than T, truncated if it is longer than T), represented as a k*T matrix X whose t-th column x_t represents a word vector. This constitutes the input matrix.
A convolution operation involves a convolution kernel $W_c \in \mathbb{R}^{k \times h}$, where h denotes the window size in words and k denotes the dimensionality of the word vectors:

$$c = f(X * W_c + b) \in \mathbb{R}^{T-h+1}$$

f(·) is a nonlinear activation function similar to the hyperbolic tangent function. b is the bias vector, and * denotes the convolution operation. The max-pooling operation takes the maximum activation value of a square region (assumed to be 2*2), i.e.,

$$\hat{c} = \max_{(i,j) \in \Omega} c_{i,j}$$

where Ω denotes the pooled square region. Applying the above max-pooling over the entire convolutional layer with this square region finally yields the pooling layer. Assuming our window size is h and d convolution kernels are used, there are h*d full connections from the pooling layer to the output layer. A softmax layer is then used to turn every element of the output layer into a one-dimensional vector whose elements lie between 0 and 1; each element of this vector represents the probability that the input comes from the real data distribution. This serves as the basis for judging whether the data is real.
In the above technical solution, in steps (15) and (16), the BLEU parameter value is an automatic assessment of the degree of similarity between the source text and the target text, used to measure the quality of the conversion from the source text to the target text. The BLEU parameter is defined as follows:

$$\mathrm{BLEU} = B_p \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where B_p = 1 if c > r, and B_p = e^(1-r/c) if c <= r; c is the length of the generated text, r is the length of the real text; e is the natural constant, approximately 2.71828; N is 4;
n is the number of words constituting a word segment; n takes the values 1, 2, 3, 4, representing 1-gram, 2-gram, 3-gram, 4-gram;
w_n is a weight value, set to 1/4;

$$p_n = \frac{\sum_{n\text{-}gram \in C} Count_{clip}(n\text{-}gram)}{\sum_{n\text{-}gram' \in C'} Count(n\text{-}gram')}$$

C is the set of word segments that appear in the generated text and adopt the "modified n-gram precision" strategy, and C′ is the set of word segments that appear in the generated text but do not adopt the "modified n-gram precision" strategy; Count_clip(n-gram) is the number of word segments that appear in the generated text and adopt the "modified n-gram precision" strategy; Count(n-gram′) is the number of word segments that appear in the generated text but do not adopt the "modified n-gram precision" strategy.
In the above technical solution, the set threshold BLEU_MAX in step (16) is set to 0.5.
Owing to the application of the above technical solution, the present invention has the following advantage over the prior art: based on a generative adversarial network model, the present invention randomly generates a specified quantity of medical text, addressing patient privacy concerns and the scarcity of medical text.
Brief Description of the Drawings
FIG. 1 is a flowchart of the medical text generation method based on generative adversarial network technology disclosed by the present invention.
FIG. 2 is a structural diagram of the generative adversarial network model disclosed by the present invention.
Detailed Description of the Embodiments
The present invention is further described below with reference to its principles, the accompanying drawings, and the embodiments.
Referring to FIG. 1 and FIG. 2, as illustrated therein, a medical text generation method based on generative adversarial network technology comprises the following steps:
(1) Download multiple medical documents in a given specialty to form a medical document set PD_SET to be preprocessed;
(2) Set the number n of medical documents to be generated;
(3) Initialize the generated medical text set GD_SET to empty;
(4) Initialize the input data set INPUT_SET to empty;
(5) Determine whether the medical document set PD_SET to be preprocessed still has unprocessed documents; if so, randomly read one medical document D from it and go to step (6); if not, go to step (10);
(6) Remove medical document D from the medical document set PD_SET to be preprocessed;
(7) Perform feature extraction (vectorization) on medical document D to obtain a word vector for each word of medical document D;
(8) Use the RNN text classifier to determine whether medical document D is a required medical document; if so, go to step (9); if not, medical document D is an unneeded medical document, go to step (5);
(9) Read the word vector of each word of medical document D to form sentence vectors, add them to the input data set INPUT_SET, and go to step (5);
(10) Read the contents of the input data set INPUT_SET;
(11) Train on the read input data set INPUT_SET using a method based on generative adversarial networks to obtain a medical text generation model MODEL;
(12) Set the number of generated texts m = 0;
(13) Determine whether the number m of generated texts is less than the number n of medical documents to be generated; if so, go to step (14); if not, go to step (18);
(14) Use the medical text generation model MODEL to generate a medical text GD;
(15) Compute the BLEU parameter value GD_BLEU of GD;
(16) Determine whether GD_BLEU is greater than the set threshold BLEU_MAX; if so, the generated medical text GD is invalid text, discard GD and go to step (17); if not, add the generated medical text GD to the generated medical text set GD_SET and go to step (17);
(17) Increase the number m of generated texts by 1 and go to step (13);
(18) Determine whether the generated medical text set GD_SET is empty; if so, output "no text satisfies the conditions"; if not, output the generated medical text set GD_SET.
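To make the control flow of steps (1) to (18) concrete, the following is a minimal Python sketch of the loop structure; the callables (download_documents, vectorize, is_required, train_gan, generate_text, bleu) are hypothetical placeholders for the operations described in the corresponding steps, not an implementation provided by the patent.

```python
import random

def generate_medical_texts(n, download_documents, vectorize, is_required,
                           train_gan, generate_text, bleu, BLEU_MAX=0.5):
    """Control-flow sketch of steps (1)-(18); all callables are supplied by
    the caller and stand for the operations of the corresponding steps."""
    pd_set = download_documents()        # step (1): PubMed/MEDLINE download
    gd_set, input_set = [], []           # steps (3)-(4)
    while pd_set:                        # step (5)
        d = pd_set.pop(random.randrange(len(pd_set)))  # steps (5)-(6): random document D
        vectors = vectorize(d)           # step (7): skip-gram word vectors
        if is_required(d):               # step (8): RNN text classifier
            input_set.append(vectors)    # step (9): sentence vectors into INPUT_SET
    model = train_gan(input_set)         # steps (10)-(11): GAN training yields MODEL
    m = 0                                # step (12)
    while m < n:                         # step (13)
        gd = generate_text(model)        # step (14)
        if bleu(gd) <= BLEU_MAX:         # steps (15)-(16): keep only valid text
            gd_set.append(gd)
        m += 1                           # step (17)
    return gd_set or "no text satisfies the conditions"  # step (18)
```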
In one embodiment, in step (1), the text database PubMed for medicine and the life sciences is accessed, and multiple medical documents in a given specialty are downloaded from the authoritative biomedical literature database MEDLINE;
each medical document is saved in txt format, and each medical document is English text.
The content of a txt file is defined as:
{
Name: Bob
Age: 20
Gender: male
Case: Tonsillitis, mild cough
First treatment: Take anti-inflammatory drugs, drink plenty of water
Second treatment: Do more outdoor exercise to maintain adequate sleep
};
In one embodiment, in step (7), a skip-gram-based language model is trained to obtain word vectors for all words in each medical document.
Given a word, the probability that some other word within the window occurs is:

$$P(u_x \mid v_c) = \frac{e^{(u_x)^T v_c}}{\sum_{j=1}^{K} e^{(u_j)^T v_c}}$$

where Z denotes the similarity (u_x)^T v_c, with (u_x)^T v_c = W′v_c and v_c = W w_c; w_c denotes the one-hot vector of the target word; v_c denotes the word vector of the target word; u_x denotes the word vector of the x-th word in the window other than the target word; (u_x)^T denotes the transpose of u_x; W is the matrix composed of the word vectors of the target word; W′ is the matrix composed of the transposed word vectors of all other words in the window except the target word;
e is the natural constant, approximately 2.71828;
K denotes the number of all words in the window of the target word other than the target word;
j denotes the index of one of the K words.
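A minimal NumPy sketch of the softmax probability above; the array names (U for the stacked context word vectors u_j, v_c for the target word's vector) are assumptions for illustration.

```python
import numpy as np

def skipgram_prob(U, v_c, x):
    """P(u_x | v_c): probability of the x-th context word given the target word.

    U: K x dim matrix whose rows are the word vectors u_j of the K words
    in the window other than the target word; v_c: target word vector.
    """
    z = U @ v_c                       # similarities Z = (u_j)^T v_c for all j
    z = z - z.max()                   # numerical stability; the softmax is unchanged
    p = np.exp(z) / np.exp(z).sum()   # softmax over the K words in the window
    return p[x]
```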
In one embodiment, in step (8), a classification model based on a recurrent neural network (RNN) is trained to perform text classification on each medical document.
The goal of the classification model is to minimize the cross entropy between the predicted probability distribution and the true probability distribution:

$$L = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij}$$

where y_ij denotes the ground-truth label (the true value or reference standard), i.e., the probability that the i-th training sample belongs to the j-th category;
ŷ_ij is the predicted probability that the i-th training sample belongs to the j-th category;
N is the number of training samples;
C is the number of categories, here 2, meaning there are two categories of documents: those that satisfy the condition serve as "required medical documents", denoted by category "1"; those that do not serve as "unneeded medical documents", denoted by category "0".
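The cross-entropy objective above can be computed as follows; a minimal sketch, assuming y_true holds the ground-truth labels and y_pred the predicted probabilities as (N, C) arrays (both names are hypothetical).

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # L = -sum_i sum_j y_ij * log(y_hat_ij); eps guards against log(0).
    return -np.sum(y_true * np.log(y_pred + eps))
```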
In one embodiment, k medical documents are first selected as a training set, and a classification model is obtained through training; the classification result is one of "required medical document" or "unneeded medical document". The obtained classification model is used to classify all downloaded medical documents: if the classification result is "required medical document", the document is retained; otherwise the document is discarded.
Specifically, the penultimate layer of the RNN text classification model, a softmax layer, outputs a one-dimensional column vector such that each element of the vector is a real number in (0, 1) and the two elements of the vector sum to 1:

$$y(x_i) = \frac{e^{x_i}}{\sum_{j=0}^{1} e^{x_j}}$$

In the formula above, exp is the exponential function e^x; i takes the values 0 and 1; x_i is an input of the softmax layer; y(x_i) denotes the output of the softmax layer for that input, i.e., the probability of being classified into a given class.
The first element of the column vector is the probability that the document is classified as category "1", and the second element is the probability that the document is classified as category "0". The output layer then uses a max function: if the first element is the larger, the document is predicted to belong to category "1", i.e., the document is a "required medical document"; if the second element is the larger, the document is predicted to belong to category "0", i.e., the document is an "unneeded medical document".
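A sketch of the softmax output layer and the max decision described above, under the assumption that the two softmax inputs correspond to categories "1" and "0" in that order.

```python
import numpy as np

def classify(x):
    """x: the two inputs (for category '1' and category '0') to the softmax layer.

    Returns '1' ('required medical document') or '0' ('unneeded medical document').
    """
    z = np.exp(x - x.max())
    p = z / z.sum()                      # two elements in (0, 1) that sum to 1
    return '1' if p[0] >= p[1] else '0'  # max function over the column vector
```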
In one embodiment, in step (11), the generation model is the generative model of a generative adversarial network, whose objective function is:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where G is the generator;
D is the discriminator;
V(D, G) is the name of the objective function;
E denotes expectation;
p_data(x) denotes the distribution of the real data;
D(x) denotes the probability that the discriminator judges x to be real;
p_z(z) denotes the probability distribution obeyed by the data z of the generator;
G(z) refers to the probability that the data generated by the generator is z;
D(G(z)) denotes the probability that the discriminator judges the generated data obeying the probability distribution G(z) to be real;
log is the logarithmic function, with the natural constant e as its base;
max_D V(D, G) means that the discriminator tries to maximize the function V(D, G), maximizing its ability to tell real from fake; the generator's task, on the other hand, is exactly the opposite: it tries to minimize the function V(D, G), minimizing the difference between real data and fake data.
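For illustration, the opposing updates implied by the value function V(D, G) can be written as follows in PyTorch; a minimal sketch under the assumption that generator and discriminator are torch.nn.Module instances and that the discriminator outputs D(·) in (0, 1). It is not the patent's implementation, and for discrete text a gradient estimator such as policy gradients would be needed in practice.

```python
import torch

def gan_step(generator, discriminator, real_batch, z, opt_g, opt_d):
    # Discriminator ascends V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))],
    # i.e. descends -V(D, G).
    opt_d.zero_grad()
    d_loss = -(torch.log(discriminator(real_batch)).mean()
               + torch.log(1 - discriminator(generator(z).detach())).mean())
    d_loss.backward()
    opt_d.step()
    # Generator descends V(D, G), i.e. minimizes E[log(1 - D(G(z)))].
    opt_g.zero_grad()
    g_loss = torch.log(1 - discriminator(generator(z))).mean()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```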
The framework adopted by the generative adversarial network is: the generator adopts an LSTM recurrent neural network structure with a memory function, and the discriminator adopts a CNN deep neural network structure.
The CNN structure is used to encode sentences; its core contains a convolutional layer and a max-pooling operation. Suppose the input is a sentence of length T (padded with spaces if the sentence is shorter than T, truncated if it is longer than T), represented as a k*T matrix X whose t-th column x_t represents a word vector. This constitutes the input matrix.
A convolution operation involves a convolution kernel $W_c \in \mathbb{R}^{k \times h}$, where h denotes the window size in words and k denotes the dimensionality of the word vectors:

$$c = f(X * W_c + b) \in \mathbb{R}^{T-h+1}$$

f(·) is a nonlinear activation function similar to the hyperbolic tangent function. b is the bias vector, and * denotes the convolution operation. The max-pooling operation takes the maximum activation value of a square region (assumed to be 2*2), i.e.,

$$\hat{c} = \max_{(i,j) \in \Omega} c_{i,j}$$

where Ω denotes the pooled square region. Applying the above max-pooling over the entire convolutional layer with this square region finally yields the pooling layer. Assuming our window size is h and d convolution kernels are used, there are h*d full connections from the pooling layer to the output layer. A softmax layer is then used to turn every element of the output layer into a one-dimensional vector whose elements lie between 0 and 1; each element of this vector represents the probability that the input comes from the real data distribution. This serves as the basis for judging whether the data is real.
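A minimal PyTorch sketch of the CNN discriminator structure described above: convolution over the k×T input matrix, max pooling, and a softmax output. Layer sizes are illustrative assumptions, and global max pooling over positions stands in for the 2*2 square-region pooling of the text.

```python
import torch
import torch.nn as nn

class CNNDiscriminator(nn.Module):
    """Encodes a sentence matrix X (k x T) and outputs (P(real), P(fake))."""

    def __init__(self, k=128, T=32, h=5, d=64):
        super().__init__()
        # d convolution kernels of size k x h slide over the T positions,
        # giving c in R^(T-h+1) per kernel.
        self.conv = nn.Conv1d(in_channels=k, out_channels=d, kernel_size=h)
        self.act = nn.Tanh()                  # tanh-like nonlinearity f(.)
        self.pool = nn.AdaptiveMaxPool1d(1)   # max pooling over positions
        self.fc = nn.Linear(d, 2)             # fully-connected to the output layer

    def forward(self, X):                     # X: (batch, k, T)
        c = self.act(self.conv(X))            # (batch, d, T-h+1)
        pooled = self.pool(c).squeeze(-1)     # (batch, d)
        return torch.softmax(self.fc(pooled), dim=-1)  # elements in (0, 1)
```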
In one embodiment, in steps (15) and (16), the BLEU parameter value is an automatic assessment of the degree of similarity between the source text and the target text, used to measure the quality of the conversion from the source text to the target text. The BLEU parameter is defined as follows:

$$\mathrm{BLEU} = B_p \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where B_p = 1 if c > r, and B_p = e^(1-r/c) if c <= r; c is the length of the generated text, r is the length of the real text; e is the natural constant, approximately 2.71828; N is 4;
n is the number of words constituting a word segment; n takes the values 1, 2, 3, 4, representing 1-gram, 2-gram, 3-gram, 4-gram;
w_n is a weight value, set to 1/4;

$$p_n = \frac{\sum_{n\text{-}gram \in C} Count_{clip}(n\text{-}gram)}{\sum_{n\text{-}gram' \in C'} Count(n\text{-}gram')}$$

C is the set of word segments that appear in the generated text and adopt the "modified n-gram precision" strategy, and C′ is the set of word segments that appear in the generated text but do not adopt the "modified n-gram precision" strategy; Count_clip(n-gram) is the number of word segments that appear in the generated text and adopt the "modified n-gram precision" strategy; Count(n-gram′) is the number of word segments that appear in the generated text but do not adopt the "modified n-gram precision" strategy.
In one embodiment, the set threshold BLEU_MAX in step (16) is set to 0.5.
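A minimal pure-Python sketch of the BLEU computation and the threshold check of steps (15) and (16); it follows the standard modified n-gram precision with counts clipped against a reference text, which is an assumption about how the counts above are obtained.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(generated, reference, N=4):
    """BLEU = Bp * exp(sum_n w_n * log p_n), with w_n = 1/4 and n = 1..4."""
    gen, ref = generated.split(), reference.split()
    log_sum = 0.0
    for n in range(1, N + 1):
        gen_counts = Counter(ngrams(gen, n))
        ref_counts = Counter(ngrams(ref, n))
        clipped = sum(min(cnt, ref_counts[g]) for g, cnt in gen_counts.items())
        p_n = max(clipped / max(sum(gen_counts.values()), 1), 1e-12)  # avoid log(0)
        log_sum += (1.0 / N) * math.log(p_n)
    c, r = len(gen), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))  # brevity penalty Bp
    return bp * math.exp(log_sum)

# Step (16): the generated text GD is kept only if its value does not exceed
# the threshold, e.g. bleu(gd_text, source_text) <= 0.5.
```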
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

  1. A medical text generation method based on generative adversarial network technology, characterized by comprising the following steps:
    (1) downloading multiple medical documents in a given specialty to form a medical document set PD_SET to be preprocessed;
    (2) setting the number n of medical documents to be generated;
    (3) initializing the generated medical text set GD_SET to empty;
    (4) initializing the input data set INPUT_SET to empty;
    (5) determining whether the medical document set PD_SET to be preprocessed still has unprocessed documents; if so, randomly reading one medical document D from it and proceeding to step (6); if not, proceeding to step (10);
    (6) removing medical document D from the medical document set PD_SET to be preprocessed;
    (7) performing feature extraction (vectorization) on medical document D to obtain a word vector for each word of medical document D;
    (8) using the RNN text classifier to determine whether medical document D is a required medical document; if so, proceeding to step (9); if not, medical document D is an unneeded medical document, proceeding to step (5);
    (9) reading the word vector of each word of medical document D to form sentence vectors, adding them to the input data set INPUT_SET, and proceeding to step (5);
    (10) reading the contents of the input data set INPUT_SET;
    (11) training on the read input data set INPUT_SET using a method based on generative adversarial networks to obtain a medical text generation model MODEL;
    (12) setting the number of generated texts m = 0;
    (13) determining whether the number m of generated texts is less than the number n of medical documents to be generated; if so, proceeding to step (14); if not, proceeding to step (18);
    (14) using the medical text generation model MODEL to generate a medical text GD;
    (15) computing the BLEU parameter value GD_BLEU of GD;
    (16) determining whether GD_BLEU is greater than the set threshold BLEU_MAX; if so, the generated medical text GD is invalid text, discarding GD and proceeding to step (17); if not, adding the generated medical text GD to the generated medical text set GD_SET and proceeding to step (17);
    (17) increasing the number m of generated texts by 1 and proceeding to step (13);
    (18) determining whether the generated medical text set GD_SET is empty; if so, outputting "no text satisfies the conditions"; if not, outputting the generated medical text set GD_SET.
  2. The medical text generation method according to claim 1, characterized in that, in step (1), the text database PubMed for medicine and the life sciences is accessed, and multiple medical documents in a given specialty are downloaded from the authoritative biomedical literature database MEDLINE,
    each medical document being saved in txt format, and each medical document being English text.
  3. The medical text generation method according to claim 1, characterized in that, in step (7), a skip-gram-based language model is trained to obtain word vectors for all words in each medical document,
    and, given a word, the probability that some other word within the window occurs is:

    $$P(u_x \mid v_c) = \frac{e^{(u_x)^T v_c}}{\sum_{j=1}^{K} e^{(u_j)^T v_c}}$$

    where Z denotes the similarity (u_x)^T v_c, with (u_x)^T v_c = W′v_c and v_c = W w_c; w_c denotes the one-hot vector of the target word; v_c denotes the word vector of the target word; u_x denotes the word vector of the x-th word in the window other than the target word; (u_x)^T denotes the transpose of u_x; W is the matrix composed of the word vectors of the target word; W′ is the matrix composed of the transposed word vectors of all other words in the window except the target word;
    e is the natural constant, approximately 2.71828;
    K denotes the number of all words in the window of the target word other than the target word;
    j denotes the index of one of the K words.
  4. The medical text generation method according to claim 1, characterized in that, in step (8), a classification model based on a recurrent neural network (RNN) is trained to perform text classification on each medical document,
    the goal of the classification model being to minimize the cross entropy between the predicted probability distribution and the true probability distribution:

    $$L = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij}$$

    where y_ij denotes the ground-truth label (the true value or reference standard), i.e., the probability that the i-th training sample belongs to the j-th category;
    ŷ_ij is the predicted probability that the i-th training sample belongs to the j-th category;
    N is the number of training samples;
    C is the number of categories, here 2, meaning there are two categories of documents: those that satisfy the condition serve as "required medical documents", denoted by category "1"; those that do not serve as "unneeded medical documents", denoted by category "0".
  5. The medical text generation method according to claim 1, characterized in that k medical documents are first selected as a training set and a classification model is obtained through training, the classification result being one of "required medical document" or "unneeded medical document"; the obtained classification model is used to classify all downloaded medical documents, and if the classification result is "required medical document", the document is retained; otherwise the document is discarded.
  6. The medical text generation method according to claim 1, characterized in that, in step (11), the generation model is the generative model of a generative adversarial network, whose objective function is:

    $$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

    where G is the generator;
    D is the discriminator;
    V(D, G) is the name of the objective function;
    E denotes expectation;
    p_data(x) denotes the distribution of the real data;
    D(x) denotes the probability that the discriminator judges x to be real;
    p_z(z) denotes the probability distribution obeyed by the data z of the generator;
    G(z) refers to the probability distribution of the data z generated by the generator;
    D(G(z)) denotes the probability that the discriminator judges the generated data obeying the probability distribution G(z) to be real;
    log is the logarithmic function, with the natural constant e as its base;
    max_D V(D, G) means that the discriminator tries to maximize the function V(D, G), maximizing its ability to tell real from fake; the generator's task, on the other hand, is exactly the opposite: it tries to minimize the function V(D, G), minimizing the difference between real data and fake data.
  7. The medical text generation method according to claim 1, characterized in that the generator adopts an LSTM recurrent neural network structure with a memory function, and the discriminator adopts a CNN deep neural network structure.
  8. The medical text generation method according to claim 1, characterized in that, in steps (15) and (16), the BLEU parameter value is an automatic assessment of the degree of similarity between the source text and the target text, used to measure the quality of the conversion from the source text to the target text, the BLEU parameter being defined as follows:

    $$\mathrm{BLEU} = B_p \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

    where B_p = 1 if c > r, and B_p = e^(1-r/c) if c <= r; c is the length of the generated text, r is the length of the real text; e is the natural constant, approximately 2.71828; N is 4;
    n is the number of words constituting a word segment; n takes the values 1, 2, 3, 4, representing 1-gram, 2-gram, 3-gram, 4-gram;
    w_n is a weight value, set to 1/4;

    $$p_n = \frac{\sum_{n\text{-}gram \in C} Count_{clip}(n\text{-}gram)}{\sum_{n\text{-}gram' \in C'} Count(n\text{-}gram')}$$

    C is the set of word segments that appear in the generated text and adopt the "modified n-gram precision" strategy, and C′ is the set of word segments that appear in the generated text but do not adopt the "modified n-gram precision" strategy; Count_clip(n-gram) is the number of word segments that appear in the generated text and adopt the "modified n-gram precision" strategy; Count(n-gram′) is the number of word segments that appear in the generated text but do not adopt the "modified n-gram precision" strategy.
  9. The medical text generation method according to claim 1, characterized in that the set threshold BLEU_MAX in step (16) is set to 0.5.
PCT/CN2018/112285 2018-03-20 2018-10-29 Medical text generation method based on generative adversarial network technology WO2019179100A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810227535.0A CN108491497B (zh) 2018-03-20 2018-03-20 Medical text generation method based on generative adversarial network technology
CN201810227535.0 2018-03-20

Publications (1)

Publication Number Publication Date
WO2019179100A1 true WO2019179100A1 (zh) 2019-09-26

Family

ID=63318479

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/112285 WO2019179100A1 (zh) 2018-03-20 2018-10-29 Medical text generation method based on generative adversarial network technology

Country Status (2)

Country Link
CN (1) CN108491497B (zh)
WO (1) WO2019179100A1 (zh)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826337A (zh) * 2019-10-08 2020-02-21 西安建筑科技大学 A short-text semantic training model acquisition method and similarity matching algorithm
CN110956579A (zh) * 2019-11-27 2020-04-03 中山大学 A method for rewriting images from text based on generated semantic segmentation maps
CN111584029A (zh) * 2020-04-30 2020-08-25 天津大学 An EEG adaptive model based on a discriminative adversarial network and its application in rehabilitation
CN111753091A (zh) * 2020-06-30 2020-10-09 北京小米松果电子有限公司 Classification method, classification model training method, apparatus, device, and storage medium
CN112036750A (zh) * 2020-08-31 2020-12-04 平安医疗健康管理股份有限公司 Anomaly identification method, apparatus, device, and storage medium for medical risk control
CN112349370A (zh) * 2020-11-05 2021-02-09 大连理工大学 An electronic medical record corpus construction method based on adversarial networks plus crowdsourcing
CN112420205A (zh) * 2020-12-08 2021-02-26 医惠科技有限公司 Entity recognition model generation method, apparatus, and computer-readable storage medium
CN112434722A (zh) * 2020-10-23 2021-03-02 浙江智慧视频安防创新中心有限公司 Label-smoothing computation method, apparatus, electronic device, and medium based on class similarity
CN112712118A (zh) * 2020-12-29 2021-04-27 银江股份有限公司 A filtering method and system for medical text data
CN112949296A (zh) * 2019-12-10 2021-06-11 医渡云(北京)技术有限公司 Riemannian-space-based word embedding method and apparatus, medium, and device
CN113268991A (zh) * 2021-05-19 2021-08-17 北京邮电大学 A CGAN-model-based user personality privacy protection method
CN113360655A (zh) * 2021-06-25 2021-09-07 中国电子科技集团公司第二十八研究所 A track-point classification and text generation method based on sequence labeling
CN113626601A (zh) * 2021-08-18 2021-11-09 西安理工大学 A cross-domain text classification method
CN114241263A (zh) * 2021-12-17 2022-03-25 电子科技大学 A semi-supervised open-set radar jamming recognition system based on generative adversarial networks
CN115862036A (zh) * 2022-12-14 2023-03-28 北京瑞莱智慧科技有限公司 Information interference model training method, information interference method, related apparatus, and medium
CN115938530A (zh) * 2023-01-09 2023-04-07 人工智能与数字经济广东省实验室(广州) Automatic generation method for intelligent medical imaging diagnostic opinions resistant to backdoor attacks
CN116795972A (zh) * 2023-08-11 2023-09-22 之江实验室 A model training method and apparatus, storage medium, and electronic device
WO2024066041A1 (zh) * 2022-09-27 2024-04-04 深圳先进技术研究院 Automatic electronic letter-of-guarantee generation method and apparatus based on sequence adversarial learning and prior reasoning

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491497B (zh) 2018-03-20 2020-06-02 苏州大学 Medical text generation method based on generative adversarial network technology
CN108897769A (zh) * 2018-05-29 2018-11-27 武汉大学 Method for expanding text classification data sets based on generative adversarial networks
CN109376903B (zh) * 2018-09-10 2021-12-17 浙江工业大学 A PM2.5 concentration prediction method based on a game-theoretic neural network
EP3624021A1 (en) * 2018-09-17 2020-03-18 Robert Bosch GmbH Device and method for training an augmented discriminator
CN109635273B (zh) * 2018-10-25 2023-04-25 平安科技(深圳)有限公司 Text keyword extraction method, apparatus, device, and storage medium
CN109522411B (zh) * 2018-11-12 2022-10-28 南京德磐信息科技有限公司 A neural-network-based writing assistance method
CN109614480B (zh) * 2018-11-26 2020-10-30 武汉大学 A method and apparatus for automatic summarization based on generative adversarial networks
CN109656878B (zh) * 2018-12-12 2020-11-06 中电健康云科技有限公司 Health record data generation method and apparatus
CN109698017B (zh) * 2018-12-12 2020-11-27 中电健康云科技有限公司 Medical record data generation method and apparatus
CN109766683B (zh) * 2019-01-16 2021-10-01 中国科学技术大学 A method for protecting sensor fingerprints of mobile smart devices
CN110162779B (zh) * 2019-04-04 2023-08-04 北京百度网讯科技有限公司 Medical record quality assessment method, apparatus, and device
CN110147535A (zh) * 2019-04-18 2019-08-20 平安科技(深圳)有限公司 Similar text generation method, apparatus, device, and storage medium
US20200342968A1 (en) * 2019-04-24 2020-10-29 GE Precision Healthcare LLC Visualization of medical device event processing
CN110110060A (zh) * 2019-04-24 2019-08-09 北京百度网讯科技有限公司 A data generation method and apparatus
CN109998500A (zh) * 2019-04-30 2019-07-12 陕西师范大学 A pulse signal generation method and system based on generative adversarial networks
CN110176311A (zh) * 2019-05-17 2019-08-27 北京印刷学院 An automatic medical plan recommendation method and system based on adversarial neural networks
CN111008277B (zh) * 2019-10-30 2020-11-03 创意信息技术股份有限公司 An automatic text summarization method
CN110807207B (zh) * 2019-10-30 2021-10-08 腾讯科技(深圳)有限公司 Data processing method, apparatus, electronic device, and storage medium
CN110765491B (zh) * 2019-11-08 2020-07-17 国网浙江省电力有限公司信息通信分公司 A method and system for preserving association relationships in de-sensitized data
CN113032469B (zh) * 2019-12-24 2024-02-20 医渡云(北京)技术有限公司 Text structuring model training and medical text structuring method and apparatus
CN111666588B (zh) * 2020-05-14 2023-06-23 武汉大学 An emotion differential privacy protection method based on generative adversarial networks
CN112287645B (zh) * 2020-11-09 2022-07-26 北京理工大学 A malicious PDF document generation method based on generative adversarial networks
CN113889213A (zh) * 2021-12-06 2022-01-04 武汉大学 Ultrasound endoscopy report generation method and apparatus, computer device, and storage medium
CN117093715B (zh) * 2023-10-18 2023-12-29 湖南财信数字科技有限公司 Lexicon expansion method, system, computer device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512687A (zh) * 2015-12-15 2016-04-20 北京锐安科技有限公司 Method and system for training sentiment classification models and analyzing text sentiment polarity
WO2016084326A1 (ja) * 2014-11-26 2016-06-02 日本電気株式会社 Information processing system, information processing method, and recording medium
CN107330444A (zh) * 2017-05-27 2017-11-07 苏州科技大学 An automatic image-to-text annotation method based on generative adversarial networks
CN107590531A (zh) * 2017-08-14 2018-01-16 华南理工大学 A WGAN method based on text generation
CN107609009A (zh) * 2017-07-26 2018-01-19 北京大学深圳研究院 Text sentiment analysis method and apparatus, storage medium, and computer device
CN108491497A (zh) * 2018-03-20 2018-09-04 苏州大学 Medical text generation method based on generative adversarial network technology

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016084326A1 (ja) * 2014-11-26 2016-06-02 日本電気株式会社 Information processing system, information processing method, and recording medium
CN105512687A (zh) * 2015-12-15 2016-04-20 北京锐安科技有限公司 Method and system for training sentiment classification models and analyzing text sentiment polarity
CN107330444A (zh) * 2017-05-27 2017-11-07 苏州科技大学 An automatic image-to-text annotation method based on generative adversarial networks
CN107609009A (zh) * 2017-07-26 2018-01-19 北京大学深圳研究院 Text sentiment analysis method and apparatus, storage medium, and computer device
CN107590531A (zh) * 2017-08-14 2018-01-16 华南理工大学 A WGAN method based on text generation
CN108491497A (zh) * 2018-03-20 2018-09-04 苏州大学 Medical text generation method based on generative adversarial network technology

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, KUNFENG ET AL.: "Generative Adversarial Networks: The State of the Art and Beyond", ACTA AUTOMATICA SINICA, vol. 43, no. 3, 31 March 2017 (2017-03-31), pages 321 - 332, XP055612268, ISSN: 0254-4156, doi:10.16383/j.aas.2017.y000003 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826337A (zh) * 2019-10-08 2020-02-21 西安建筑科技大学 A short-text semantic training model acquisition method and similarity matching algorithm
CN110956579A (zh) * 2019-11-27 2020-04-03 中山大学 A method for rewriting images from text based on generated semantic segmentation maps
CN110956579B (zh) * 2019-11-27 2023-05-23 中山大学 A method for rewriting images from text based on generated semantic segmentation maps
CN112949296A (zh) * 2019-12-10 2021-06-11 医渡云(北京)技术有限公司 Riemannian-space-based word embedding method and apparatus, medium, and device
CN112949296B (zh) * 2019-12-10 2024-05-31 医渡云(北京)技术有限公司 Riemannian-space-based word embedding method and apparatus, medium, and device
CN111584029A (zh) * 2020-04-30 2020-08-25 天津大学 An EEG adaptive model based on a discriminative adversarial network and its application in rehabilitation
CN111584029B (zh) * 2020-04-30 2023-04-18 天津大学 An EEG adaptive model based on a discriminative adversarial network and its application in rehabilitation
CN111753091A (zh) * 2020-06-30 2020-10-09 北京小米松果电子有限公司 Classification method, classification model training method, apparatus, device, and storage medium
CN112036750A (zh) * 2020-08-31 2020-12-04 平安医疗健康管理股份有限公司 Anomaly identification method, apparatus, device, and storage medium for medical risk control
CN112434722B (zh) * 2020-10-23 2024-03-19 浙江智慧视频安防创新中心有限公司 Label-smoothing computation method, apparatus, electronic device, and medium based on class similarity
CN112434722A (zh) * 2020-10-23 2021-03-02 浙江智慧视频安防创新中心有限公司 Label-smoothing computation method, apparatus, electronic device, and medium based on class similarity
CN112349370B (zh) * 2020-11-05 2023-11-24 大连理工大学 An electronic medical record corpus construction method based on adversarial networks plus crowdsourcing
CN112349370A (zh) * 2020-11-05 2021-02-09 大连理工大学 An electronic medical record corpus construction method based on adversarial networks plus crowdsourcing
CN112420205A (zh) * 2020-12-08 2021-02-26 医惠科技有限公司 Entity recognition model generation method, apparatus, and computer-readable storage medium
CN112712118A (zh) * 2020-12-29 2021-04-27 银江股份有限公司 A filtering method and system for medical text data
CN113268991A (zh) * 2021-05-19 2021-08-17 北京邮电大学 A CGAN-model-based user personality privacy protection method
CN113360655B (zh) * 2021-06-25 2022-10-04 中国电子科技集团公司第二十八研究所 A track-point classification and text generation method based on sequence labeling
CN113360655A (zh) * 2021-06-25 2021-09-07 中国电子科技集团公司第二十八研究所 A track-point classification and text generation method based on sequence labeling
CN113626601A (zh) * 2021-08-18 2021-11-09 西安理工大学 A cross-domain text classification method
CN114241263B (zh) * 2021-12-17 2023-05-02 电子科技大学 A semi-supervised open-set radar jamming recognition system based on generative adversarial networks
CN114241263A (zh) * 2021-12-17 2022-03-25 电子科技大学 A semi-supervised open-set radar jamming recognition system based on generative adversarial networks
WO2024066041A1 (zh) * 2022-09-27 2024-04-04 深圳先进技术研究院 Automatic electronic letter-of-guarantee generation method and apparatus based on sequence adversarial learning and prior reasoning
CN115862036A (zh) * 2022-12-14 2023-03-28 北京瑞莱智慧科技有限公司 Information interference model training method, information interference method, related apparatus, and medium
CN115862036B (zh) * 2022-12-14 2024-02-23 北京瑞莱智慧科技有限公司 Information interference model training method, information interference method, related apparatus, and medium
CN115938530A (zh) * 2023-01-09 2023-04-07 人工智能与数字经济广东省实验室(广州) Automatic generation method for intelligent medical imaging diagnostic opinions resistant to backdoor attacks
CN115938530B (zh) * 2023-01-09 2023-07-07 人工智能与数字经济广东省实验室(广州) Automatic generation method for intelligent medical imaging diagnostic opinions resistant to backdoor attacks
CN116795972A (zh) * 2023-08-11 2023-09-22 之江实验室 A model training method and apparatus, storage medium, and electronic device
CN116795972B (zh) * 2023-08-11 2024-01-09 之江实验室 A model training method and apparatus, storage medium, and electronic device

Also Published As

Publication number Publication date
CN108491497B (zh) 2020-06-02
CN108491497A (zh) 2018-09-04

Similar Documents

Publication Publication Date Title
WO2019179100A1 (zh) Medical text generation method based on generative adversarial network technology
Varma et al. Snuba: Automating weak supervision to label training data
CN110347837B (zh) 一种心血管疾病非计划再住院风险预测方法
Yang et al. Filtering big data from social media–Building an early warning system for adverse drug reactions
US20200311115A1 (en) Method and system for mapping text phrases to a taxonomy
US20190317955A1 (en) Determining missing content in a database
JP2021532499A (ja) 機械学習に基づく医療データ分類方法、装置、コンピュータデバイス及び記憶媒体
Gale et al. Producing radiologist-quality reports for interpretable artificial intelligence
Gale et al. Producing radiologist-quality reports for interpretable deep learning
US11928597B2 (en) Method and system for classifying images using image embedding
US20220179906A1 (en) Classifying documents using a domain-specific natural language processing model
US11663406B2 (en) Methods and systems for automated detection of personal information using neural networks
US20210098134A1 (en) Multi-task learning in pharmacovigilance
Alsharid et al. Captioning ultrasound images automatically
US20230315994A1 (en) Natural Language Processing for Addressing Bias
Yuan et al. Large language models for healthcare data augmentation: An example on patient-trial matching
CN110781666B (zh) 基于生成式对抗网络的自然语言处理文本建模
Ramnarain-Seetohul et al. Similarity measures in automated essay scoring systems: A ten-year review
Moscato et al. Multi-task learning for few-shot biomedical relation extraction
Saad et al. Novel extreme regression-voting classifier to predict death risk in vaccinated people using VAERS data
US11442963B1 (en) Method of and system for ranking subgraphs as potential explanations for graph classification
Brown et al. Detection of behavioral health cases from sensitive police officer narratives
CN113435212A (zh) 一种基于规则嵌入的文本推断方法及装置
Ghosh et al. Evade: exploring vaccine dissenting discourse on twitter
Luik et al. The effectiveness of phrase skip-gram in primary care NLP for the prediction of lung cancer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18910668

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15/02/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18910668

Country of ref document: EP

Kind code of ref document: A1