CN108491382A - A semi-supervised method for semantic disambiguation of biomedical text

Info

Publication number: CN108491382A
Application number: CN201810207213.XA
Authority: CN
Inventors: 李智; 罗曜儒; 李健
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2018-03-14
Filing date: 2018-03-14
Publication date: 2018-09-04

Abstract

The invention is a semantic disambiguation method for polysemous words in biomedical texts. It mainly comprises: using Word2Vec to represent the words in biomedical text as vectors; building on these word vectors, using a bidirectional LSTM model to construct vector representations of the contextual sentences; then using the similarity between sentence vectors in vector space, together with the label transfer method, to pass the labels of already annotated medical data to the most similar unlabeled data according to probability; and finally performing semantic disambiguation of the biomedical text using all of the labeled data. Because biomedical data is highly specialized and rich in terminology, manual processing of medical data is time-consuming, labor-intensive, and error-prone. The invention can greatly reduce the cost of manual labeling and, compared with traditional machine learning methods, can effectively improve the accuracy of semantic disambiguation.

Description

A Semi-supervised Method for Semantic Disambiguation of Biomedical Text

Technical field

The invention belongs to the field of semantic disambiguation in natural language processing and is a semi-supervised method and system for semantic disambiguation of biomedical text. Specifically, it uses the bidirectional long short-term memory model (Bi-LSTM), together with the label transfer method, to disambiguate polysemous words in medical texts.

Background

In recent years, with the explosive growth of digital information, medical electronic data has become increasingly accessible to medical staff. In biomedicine, text data contains a large amount of specialized knowledge and information, and extracting useful information from digitized text is becoming more and more important. Compared with general text, medical text is harder to process because it is highly specialized and difficult to annotate. Understanding the semantics of biomedical text and automatically labeling medical data have therefore become active research topics.

Traditional semantic disambiguation methods for biomedical text include supervised learning methods, unsupervised learning methods, and knowledge-base methods. Supervised learning methods use labeled data to learn a classifier and then use that classifier to predict the underlying meaning of unseen data. They usually need a large amount of labeled data to reach high accuracy, and manual labeling is time-consuming and labor-intensive, so they are not the best choice in biomedical subfields where little data is available. Unsupervised learning methods need no labeled data; samples are grouped only by their underlying similarity. This greatly reduces the manual labeling effort, but the accuracy of these methods still needs improvement, which makes them unsuitable for fields with low error tolerance such as medicine. Knowledge-base methods use existing open-source medical knowledge bases as training samples. Their advantage is highly reliable data; their disadvantage is that knowledge bases scale poorly and are difficult to maintain.

Semantic disambiguation of biomedical text often uses a word vector model to vectorize each word in the text. The semantic information of a word is stored as a vector in a low-dimensional space, and words with similar meanings have similar vector representations. A common word vector technique is the Word2Vec model, which includes the Skip-gram model and the CBOW model. The Skip-gram model uses the target word to predict the words in its neighboring window, while the CBOW model uses the neighboring window words to predict the target word. Similarly, a sentence vector represents the semantic information of a sentence by fusing the word vectors of the words it contains. Common traditional fusion methods are concatenation, averaging, and weighted summation. Concatenation splices the word vectors of the sentence together in word order; averaging takes the mean of all word vectors in the sentence; weighted summation assigns each word a weight according to its importance to the meaning of the sentence and then sums the weighted word vectors. Sentence vectors are often used as features to initialize a language model for downstream natural language processing tasks.
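The three traditional fusion methods described above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the toy two-dimensional word vectors, and the example weights are hypothetical and do not come from the patent.

```python
from typing import List, Sequence

Vector = List[float]


def concatenate(word_vectors: Sequence[Vector]) -> Vector:
    """Splice the word vectors together in sentence order."""
    return [x for vec in word_vectors for x in vec]


def average(word_vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of all word vectors in the sentence."""
    n = len(word_vectors)
    return [sum(column) / n for column in zip(*word_vectors)]


def weighted_sum(word_vectors: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Element-wise sum of word vectors, each scaled by its word's weight."""
    if len(word_vectors) != len(weights):
        raise ValueError("need exactly one weight per word vector")
    return [
        sum(w * x for w, x in zip(weights, column))
        for column in zip(*word_vectors)
    ]


# Toy sentence of three words with 2-dimensional word vectors (made-up values).
sentence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(concatenate(sentence))                      # length grows with sentence length
print(average(sentence))                          # fixed length, word order is lost
print(weighted_sum(sentence, [0.5, 0.25, 0.25]))  # fixed length, weights are assumed
```

Concatenation yields a vector whose length depends on the sentence length, whereas averaging and weighted summation keep a fixed length but discard word order.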

The recurrent neural network (RNN) is a common neural network model for processing text. It can carry information from previous time steps into the task at the current time step, which gives it a degree of memory. In theory, an RNN can handle the long-term dependencies that arise in long sentences. In practice, however, Bengio et al. (1994) studied this problem in depth and found that RNNs fail to learn such dependencies: when related words are far apart, the gradient may explode or vanish, backpropagation fails, and the textual information is not retained effectively. To overcome this shortcoming, an improved RNN model, the long short-term memory model (LSTM), was proposed. It adds three "gate" structures to the internal structure of the RNN: the forget gate determines how much information from the previous time step is kept, the input gate determines how much information from the current time step is kept, and the output gate determines how much information is output at the current time step. Through these gates, the LSTM selectively uses information from the previous and current time steps and effectively avoids the long-term dependency problem of the RNN.

In recent years, semi-supervised learning has been applied successfully to semantic disambiguation tasks, with the bootstrapping algorithm achieving good accuracy. A low-recall classifier is learned from a small set of labeled examples; the classifier then assigns high-confidence labels to unlabeled text, and these newly labeled sentences are used to expand the labeled set. More recently, a label propagation algorithm for word sense disambiguation was proposed and compared with bootstrapping and with a supervised support vector machine (SVM) classifier. Label propagation achieves better performance because it assigns labels by optimizing a global objective, whereas traditional algorithms such as bootstrapping propagate labels based on the local similarity of instances.

Summary of the invention

The invention provides a method and system for semantic disambiguation of biomedical text based on semi-supervised learning and deep learning. To a certain extent, it addresses the weaknesses of traditional disambiguation methods, namely their limited use of global information and the difficulty and high cost of manual labeling, and it improves the accuracy of semantic disambiguation for both biomedical and general text.

The invention consists of two main parts: 1. A bidirectional long short-term memory (Bi-LSTM) model that fuses word vectors into a sentence vector, producing the semantic features of the sentence. 2. A semi-supervised semantic disambiguation model based on the label transfer method, which uses similarity to the labeled data to label unlabeled data automatically and resolve semantic ambiguity at the same time.

The technical solution adopted by the invention comprises the following steps:

(1) Forming sentence vectors with the bidirectional long short-term memory (Bi-LSTM) model to generate the semantic features of the sentence

The Bi-LSTM model consists of an output layer, a backward hidden layer, a forward hidden layer, and an input layer. Six distinct sets of weights are reused at every time step: input layer to the forward and backward hidden layers (w1, w3), hidden layer to hidden layer (w2, w5), and the forward and backward hidden layers to the output layer (w4, w6).

Each hidden layer is an LSTM, which consists of three gates (forget gate, input gate, output gate) and a memory cell.

The word vector of each word is fed to the bidirectional LSTM as input and, together with the output of the previous time step, produces the current output. This process has three stages.

Stage 1: the forget gate layer uses a sigmoid function to selectively filter the information from the previous time step. Written in standard LSTM notation, with $W$ and $b$ denoting the weight matrix and bias of each layer:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

where $h_{t-1}$ is the output at the previous time step, $x_t$ is the current input (that is, the current word vector), and $f_t$ is a value between 0 and 1 used to filter the information learned at the previous time step.

Stage 2: generate the new information to be updated.

First, the input gate layer uses a sigmoid function to decide which values to update:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

Next, a tanh layer generates the new candidate values:

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

The cell state is then refreshed with the candidate values:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

Stage 3: the output of the model.

A sigmoid layer produces an initial output:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

The tanh function then scales the cell state $C_t$, and the two are multiplied to give the output of the model:

$$h_t = o_t * \tanh(C_t)$$
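The three stages above can be traced in code. The following is a minimal sketch of a single LSTM time step in plain Python, assuming the standard LSTM formulation in which each gate is an affine map of the concatenated previous output and current input; the parameter names (`W_f`, `b_f`, and so on), the dictionary layout, and the all-zero toy weights are hypothetical, not taken from the patent.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function, output in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def affine(W: List[Vector], v: Vector, b: Vector) -> Vector:
    """Compute W @ v + b for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One time step: returns the new output h_t and the new cell state c_t."""
    z = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    # Stage 1: the forget gate filters the previous cell state.
    f_t = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]
    # Stage 2: the input gate and tanh candidates produce the new information.
    i_t = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]
    c_hat = [math.tanh(a) for a in affine(params["W_c"], z, params["b_c"])]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    # Stage 3: the output gate scales tanh of the refreshed cell state.
    o_t = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy example: hidden size 2, input size 3, all weights and biases zero (assumed values).
hidden, inputs = 2, 3
zero_matrix = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {name: zero_matrix for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: [0.0] * hidden for name in ("b_f", "b_i", "b_c", "b_o")})
h_t, c_t = lstm_step([0.1, 0.2, 0.3], [0.0, 0.0], [1.0, -1.0], params)
print(h_t, c_t)
```

With all-zero weights every gate evaluates to 0.5 and the candidate values are 0, so the step simply halves the previous cell state; trained weights would make the gates depend on the input word vector.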

(2) Semi-supervised semantic disambiguation model based on the label transfer method

The label transfer method uses the similarity between samples to pass the labels of labeled data to unlabeled data according to probability. First, a graph model is built over all samples, with each sample as a node, and a similarity is computed for each pair of nodes; the similarity formula contains a hyperparameter. Each node then propagates its label to the surrounding nodes with a probability determined by its similarity to them; in the probability formula, n denotes the number of edges.
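The graph-based procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact formulas of the invention, which are not preserved in this text. Similarity is taken as the reciprocal of the Euclidean distance, as claim 4 describes. Transition probabilities are obtained by normalizing each node's similarities so that they sum to 1. Labeled nodes are reset to their known labels after every pass, a common label propagation scheme substituted here. The function names, the `eps` smoothing term, the iteration count, and the toy points are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def similarity(a: Vector, b: Vector, eps: float = 1e-8) -> float:
    """Reciprocal of the Euclidean distance; eps (assumed) avoids division by zero."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (distance + eps)


def transition_matrix(points: Sequence[Vector]) -> List[List[float]]:
    """Row i holds the probabilities with which node i takes labels from each node."""
    n = len(points)
    W = [[0.0 if i == j else similarity(points[i], points[j]) for j in range(n)]
         for i in range(n)]
    return [[w / sum(row) for w in row] for row in W]


def clamp(Y: List[List[float]], labels: Dict[int, int], num_classes: int) -> None:
    """Reset labeled nodes to their known one-hot label distribution."""
    for i, c in labels.items():
        Y[i] = [1.0 if k == c else 0.0 for k in range(num_classes)]


def propagate(points: Sequence[Vector], labels: Dict[int, int],
              num_classes: int, iterations: int = 50) -> List[int]:
    """Return a predicted class index for every node (needs at least two nodes)."""
    n = len(points)
    T = transition_matrix(points)
    Y = [[1.0 / num_classes] * num_classes for _ in range(n)]  # start uniform
    clamp(Y, labels, num_classes)
    for _ in range(iterations):
        Y = [[sum(T[i][j] * Y[j][k] for j in range(n)) for k in range(num_classes)]
             for i in range(n)]
        clamp(Y, labels, num_classes)
    return [max(range(num_classes), key=lambda k: Y[i][k]) for i in range(n)]


# Toy sentence vectors forming two clusters; only nodes 0 and 2 carry labels.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.2, 0.1], [4.9, 5.2]]
predictions = propagate(points, labels={0: 0, 2: 1}, num_classes=2)
print(predictions)
```

In a disambiguation setting, each class index would correspond to one sense of an ambiguous term, and each point to the sentence vector of a sentence containing that term.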

Description of the drawings

Fig. 1 is a schematic diagram of the system of the invention.

Fig. 2 is a diagram of the internal structure of the LSTM used in the invention.

Detailed description

(1) The user inputs biomedical text, and sentence vector features are generated

First, the biomedical text is segmented into words, and the Word2Vec model is used to generate a word vector for each word. The word vectors of each sentence are then fed in order into the bidirectional long short-term memory model, which outputs two sentence vectors. These two vectors are concatenated to form a new sentence vector.

The new sentence vector is then fed into a multilayer perceptron to obtain the final sentence vector.
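The concatenation and multilayer perceptron step can be sketched as below. This is only an illustration: it assumes the two vectors are the final outputs of the forward and backward passes, and it uses a single tanh layer, whereas the number of layers and the activation function are not stated in this text. The names `h_forward`, `h_backward`, `W`, and `b`, and all numeric values, are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mlp_layer(v: Vector, W: Sequence[Vector], b: Vector) -> List[float]:
    """One fully connected layer with tanh activation (the activation is an assumption)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, v)) + bias)
        for row, bias in zip(W, b)
    ]


def sentence_vector(h_forward: Vector, h_backward: Vector,
                    W: Sequence[Vector], b: Vector) -> List[float]:
    """Concatenate the two Bi-LSTM outputs, then project them with the layer."""
    combined = list(h_forward) + list(h_backward)
    return mlp_layer(combined, W, b)


# Toy values: two 2-dimensional Bi-LSTM outputs projected to a 2-dimensional vector.
h_forward = [0.5, -0.5]
h_backward = [0.25, 0.75]
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
b = [0.0, 0.0]
vec = sentence_vector(h_forward, h_backward, W, b)
print(vec)
```

The projection gives every sentence a vector of the same fixed length, which is what the graph construction in step (2) needs in order to compare sentences.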

(2) Using the label transfer method to label unlabeled data automatically and to disambiguate ambiguous words

The sentence vector features obtained in (1) are used as the nodes of a graph, and the similarity between nodes is computed. Following the label transfer method, each unlabeled sample automatically receives the label of the most similar data. For an ambiguous word, the sense that best matches the sentence vector is likewise passed on according to similarity.

(3) Experimental results

The experiments follow steps (1) and (2) and use two internationally used medical text datasets, MSH WSD and NLM WSD. The MSH WSD dataset contains 203 ambiguous medical entities and 37,888 ambiguous sentences in total, of which 37,090 samples are manually labeled. The NLM WSD dataset contains 50 ambiguous entities and 552,153 common sentences, with 100 manually labeled samples for each ambiguous entity. The experiment uses a 20:1 ratio: unlabeled data amounting to one-twentieth of the original labeled data is added at random from other medical corpora, and the semi-supervised model based on the label transfer method proposed by the invention is then evaluated. The experimental results are compared as follows:

Table 1. Experimental results on the MSH WSD dataset.

Table 2. Experimental results on the NLM WSD dataset.

Here, SVM denotes a support vector machine model, LSTM a unidirectional long short-term memory model, and Bi-LSTM a bidirectional long short-term memory model. WE(Con) denotes concatenated word vectors as the sentence semantic feature, WE(Avg) averaged word vectors, and WE(Wsum) a weighted sum of word vectors; Con denotes the sentence semantic feature produced by the model used in the invention; LP denotes the label transfer method proposed by the invention. The experimental results show that, after unlabeled data is added, the proposed model requires no additional manual labeling, which reduces the labeling cost for medical personnel. It also achieves the best accuracy in semantic disambiguation of medical text, confirming that the invention is feasible and effective.

Claims

1. A semi-supervised semantic disambiguation method for biomedical text, characterized by comprising the following steps:

(1) representing the words of the medical text as vectors using the Word2Vec language model;

(2) on the basis of the word vectors, representing the sentences of the medical text as vectors using the bidirectional long short-term memory model Bi-LSTM;

(3) using the similarity between sentence vectors in vector space, automatically labeling the unlabeled data based on the label transfer method and performing semantic disambiguation of polysemous words.

2. The vector representation of the words of the medical text using the Word2Vec language model according to claim 1, characterized in that: the words may include both specialized medical terms and general-text words.

3. The vector representation of medical text sentences using the bidirectional long short-term memory model Bi-LSTM according to claim 1, characterized in that: the Bi-LSTM takes as input the word vector of each word in the sentence and outputs the vector representation of the sentence.

4. The sentence vector space similarity according to claim 1, characterized in that: the geometric distance between sentence vectors is calculated with the Euclidean distance formula, and the similarity of the sentence vectors is calculated as the reciprocal of that distance.

5. The automatic labeling of unlabeled data and semantic disambiguation of ambiguous words based on the label transfer method according to claim 1, characterized in that: the similarity between sentence vectors is used to pass data labels to the unlabeled data according to probability, and semantic disambiguation of the medical text data is performed automatically.