CN107193806A - Method and device for automatic prediction of lexical sememes
- Publication number
- CN107193806A, CN201710429027.6A
- Authority
- CN
- China
- Prior art keywords
- sememe
- vocabulary
- vector
- unknown
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
Embodiments of the present invention disclose a method and device for automatically predicting the sememes of words. The method includes: calculating, from the word vector of each preset word, the vector distance between each unknown-sememe word and each known-sememe word; selecting, according to the vector distances and a distance threshold, at least one target known-sememe word to form the candidate sememe set of each unknown-sememe word; calculating, from the sememe vectors of the target known-sememe words in the candidate sememe set, the score of each sememe for each unknown-sememe word; and obtaining the first sememe vector of each unknown-sememe word according to a score threshold and the sememe scores. Because the candidate sememe set of each unknown-sememe word is determined by vector distance, the sememes in that set are scored, and the first sememe vector is derived from those scores, sememes can be predicted automatically and accurately for unknown-sememe words. This reduces the burden of manual annotation and the possible bias introduced when different people annotate.
Description
Technical field
Embodiments of the present invention relate to the field of computer technology, and in particular to a method and device for automatic prediction of lexical sememes.
Background
Sentences are composed of individual words that express different meanings, and words have both particularities and similarities; HowNet is used to describe these characteristics. HowNet is annotated manually and marks the sememes of most common words. Compared with words, sememes form a smaller set that expresses more basic units of meaning, and different combinations of sememes can represent different words. For example, the sememes of "古董店" (antique shop) include: place, commerce, buy, sell, treasure and past. The definition of an antique shop can be described with these sememes: a commercial place that buys and sells treasures from the past. Sememes have many useful properties. For example, the similarity of two words can be judged from the intersection of their sememes, and sememes can be used to generate better word vectors for further natural language processing tasks.
Although sememes have many useful properties, annotating them is very time-consuming and labor-intensive. HowNet has existed for more than ten years and was initially annotated under the lead of many linguists. With the rapid development of information technology, however, the number of words is growing explosively, and annotating sememes for newly coined words efficiently, quickly and accurately has become a problem that must be solved. A model that constructs sememes automatically, rather than by hand, is urgently needed; it can both ensure that the sememes share the same characteristics and avoid the bias caused by human annotation.
Summary of the invention
In view of the above problems in the prior art, embodiments of the present invention propose a method and device for automatic prediction of lexical sememes.
In a first aspect, an embodiment of the present invention proposes a method for automatic prediction of lexical sememes, including:
calculating, from the word vector of each preset word, the vector distance between each unknown-sememe word and each known-sememe word;
selecting, according to the vector distances and a distance threshold, at least one target known-sememe word to form the candidate sememe set of each unknown-sememe word;
calculating, from the sememe vectors of the target known-sememe words in the candidate sememe set, the score of each sememe for each unknown-sememe word;
obtaining the first sememe vector of each unknown-sememe word according to a score threshold and the score of each sememe;
wherein the preset words include known-sememe words and unknown-sememe words.
Optionally, the method further includes:
obtaining preset sememes, and calculating the word vector of each preset word according to a stochastic gradient descent method and the preset sememes.
Optionally, after the sememe vector of each unknown-sememe word is obtained according to the score threshold and the score of each sememe, the method further includes:
obtaining a sememe-word matrix from the preset sememe vectors and the unknown-sememe word vectors;
calculating the co-occurrence matrix of the sememe-word matrix from the sememe-word matrix;
decomposing the sememe-word matrix and the co-occurrence matrix respectively by the stochastic gradient descent method to obtain a second sememe vector;
calculating a target value from the unknown-sememe word vector and the second sememe vector;
calculating a target sememe vector from the target value and the first sememe vector;
wherein the sememe-word matrix consists of 0s and 1s: 1 indicates that the corresponding word includes the corresponding sememe, and 0 indicates that it does not.
Optionally, decomposing the sememe-word matrix and the co-occurrence matrix respectively by the stochastic gradient descent method to obtain the second sememe vector specifically includes:
decomposing the sememe-word matrix and the co-occurrence matrix respectively according to the stochastic gradient descent method and a loss function to obtain the second sememe vector;
wherein the loss function is:
$$L=\sum_{w\in W,\,s\in S}\bigl(W\cdot(S+S')+b_w+b_s-M_{ws}\bigr)^2+\lambda\sum_{s,t\in S}\bigl(s\cdot t-C_{st}\bigr)^2$$
where W is the unknown-sememe word vector, S and S' are the first preset sememe vector and the second preset sememe vector respectively, λ is a preset coefficient, $M_{ws}$, $C_{st}$, w and s are elements of the sememe-word matrix, the co-occurrence matrix, the unknown-sememe word vector and the first preset sememe vector respectively, $b_w$ is the bias of the unknown-sememe word vector, and $b_s$ is the bias of the first preset sememe vector.
In a second aspect, an embodiment of the present invention further proposes a device for automatic prediction of lexical sememes, including:
a distance calculation module, configured to calculate, from the word vector of each preset word, the vector distance between each unknown-sememe word and each known-sememe word;
a sememe set determination module, configured to select, according to the vector distances and a distance threshold, at least one target known-sememe word to form the candidate sememe set of each unknown-sememe word;
a sememe score calculation module, configured to calculate, from the sememe vectors of the target known-sememe words in the candidate sememe set, the score of each sememe for each unknown-sememe word;
a sememe vector determination module, configured to obtain the first sememe vector of each unknown-sememe word according to a score threshold and the score of each sememe;
wherein the preset words include known-sememe words and unknown-sememe words.
Optionally, the device further includes:
a word vector calculation module, configured to obtain preset sememes and to calculate the word vector of each preset word according to a stochastic gradient descent method and the preset sememes.
Optionally, the device further includes:
a sememe-word matrix acquisition module, configured to obtain a sememe-word matrix from the preset sememe vectors and the unknown-sememe word vectors;
a co-occurrence matrix calculation module, configured to calculate the co-occurrence matrix of the sememe-word matrix from the sememe-word matrix;
a matrix decomposition module, configured to decompose the sememe-word matrix and the co-occurrence matrix respectively by the stochastic gradient descent method to obtain a second sememe vector;
a target value calculation module, configured to calculate a target value from the unknown-sememe word vector and the second sememe vector;
a target sememe vector calculation module, configured to calculate a target sememe vector from the target value and the first sememe vector;
wherein the sememe-word matrix consists of 0s and 1s: 1 indicates that the corresponding word includes the corresponding sememe, and 0 indicates that it does not.
Optionally, the matrix decomposition module is specifically configured to decompose the sememe-word matrix and the co-occurrence matrix respectively according to the stochastic gradient descent method and a loss function to obtain the second sememe vector;
wherein the loss function is:
$$L=\sum_{w\in W,\,s\in S}\bigl(W\cdot(S+S')+b_w+b_s-M_{ws}\bigr)^2+\lambda\sum_{s,t\in S}\bigl(s\cdot t-C_{st}\bigr)^2$$
where W is the unknown-sememe word vector, S and S' are the first preset sememe vector and the second preset sememe vector respectively, λ is a preset coefficient, $M_{ws}$, $C_{st}$, w and s are elements of the sememe-word matrix, the co-occurrence matrix, the unknown-sememe word vector and the first preset sememe vector respectively, $b_w$ is the bias of the unknown-sememe word vector, and $b_s$ is the bias of the first preset sememe vector.
As the above technical solution shows, embodiments of the present invention determine the candidate sememe set of each unknown-sememe word from the vector distance between that word and each known-sememe word, score the sememes in the candidate set, and derive the first sememe vector of each unknown-sememe word from those scores. Sememes can thus be predicted automatically and accurately for unknown-sememe words, which reduces the burden of manual annotation and the possible bias introduced when different people annotate.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for automatic prediction of lexical sememes provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the sememes of the word "古董店" (antique shop) provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the sememes of the word "apple" provided by an embodiment of the present invention;
Fig. 4 is a schematic flowchart of selecting the candidate sememe set provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a device for automatic prediction of lexical sememes provided by an embodiment of the present invention.
Detailed description
Specific embodiments of the present invention are further described below with reference to the drawings. The following embodiments serve only to illustrate the technical solution of the present invention more clearly and do not limit its scope of protection.
Fig. 1 shows a schematic flowchart of the method for automatic prediction of lexical sememes provided by this embodiment, including:
S101. Calculate, from the word vector of each preset word, the vector distance between each unknown-sememe word and each known-sememe word.
The preset words include known-sememe words and unknown-sememe words.
Specifically, the word frequency of each word and the hypernym/hyponym relations between different words are first counted over a large corpus; the co-occurrence matrix is then factorized by stochastic gradient descent to obtain the word vectors.
The co-occurrence matrix carries rich textual information and the relations between words. Reducing its dimensionality by matrix factorization yields low-dimensional word representations that still reflect the relations between words well.
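The factorization step can be illustrated with a minimal sketch. It assumes a plain squared-error objective and a tiny dense co-occurrence matrix; the names `factorize` and `squared_error`, the embedding dimension, the learning rate and the epoch count are illustrative choices, not values given in this document.

```python
import random

def factorize(C, dim=2, lr=0.05, epochs=200, seed=0):
    """Approximate C[i][j] by dot(U[i], V[j]) using stochastic gradient descent."""
    rng = random.Random(seed)
    n = len(C)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n)]
    cells = [(i, j) for i in range(n) for j in range(n)]
    for _ in range(epochs):
        rng.shuffle(cells)  # visit the matrix cells in random order
        for i, j in cells:
            err = sum(a * b for a, b in zip(U[i], V[j])) - C[i][j]
            for d in range(dim):
                u, v = U[i][d], V[j][d]
                U[i][d] -= lr * err * v  # gradient of 0.5 * err**2 w.r.t. U[i][d]
                V[j][d] -= lr * err * u  # gradient of 0.5 * err**2 w.r.t. V[j][d]
    return U, V

def squared_error(C, U, V):
    """Sum of squared reconstruction errors over all cells of C."""
    n = len(C)
    return sum((sum(a * b for a, b in zip(U[i], V[j])) - C[i][j]) ** 2
               for i in range(n) for j in range(n))

# Toy symmetric co-occurrence counts for three words (made-up numbers).
C = [[0, 2, 1], [2, 0, 1], [1, 1, 0]]
U0, V0 = factorize(C, epochs=0)   # untrained starting point
U, V = factorize(C)               # trained factors
print(squared_error(C, U0, V0), squared_error(C, U, V))
```

The rows of `U` play the role of low-dimensional word vectors here; a production system would more likely use a weighted objective and sparse data.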
The content of the sememe vector of each preset word is illustrated in Figs. 2 and 3: Fig. 2 shows the sememes of the Chinese word "古董店" (antique shop), and Fig. 3 shows the sememes of the English word "apple".
S102. According to the vector distances and the distance threshold, select at least one target known-sememe word to form the candidate sememe set of each unknown-sememe word.
For an unknown-sememe word, the candidate sememe set is selected as shown in Fig. 4: the several known-sememe words closest to it in the vector space are found, and their sememes are taken as the candidate sememe set. These sememes are then scored according to the distances between the words.
Specifically, for each new word, its distance to every word with a known sememe vector is calculated first, and the several nearest words are selected. For the selected nearest words, the weight of their sememes with respect to the new word is then calculated.
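The neighbour selection just described can be sketched as follows. This is an illustration only: it assumes cosine similarity as the closeness measure, and the names `nearest_known_words`, `k` and `min_sim` (standing in for the distance threshold) as well as the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_known_words(new_vec, known_vecs, k=2, min_sim=0.0):
    """Return up to k (word, similarity) pairs, nearest first, above min_sim."""
    ranked = sorted(((cosine(new_vec, vec), word) for word, vec in known_vecs.items()),
                    reverse=True)
    return [(word, sim) for sim, word in ranked[:k] if sim >= min_sim]

# Toy 2-dimensional word vectors for words whose sememes are already annotated.
known_vecs = {"bookstore": [1.0, 0.1], "river": [-1.0, 0.2], "market": [0.8, 0.6]}
neighbours = nearest_known_words([1.0, 0.2], known_vecs)
print([word for word, _ in neighbours])
```

The sememes of the returned words then form the candidate sememe set of the new word.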
Assume that the closer a word is to the new word, the more likely its sememes are also sememes of the new word. The score that any sememe obtains for a new word can then be expressed as follows:
$$\Pr(s\mid w)=\sum_{v\in W}\cos(v,w)\,M_{vs}$$
where w denotes the new word, s denotes a sememe, W is the set of all known-sememe words, and $M_{vs}$ indicates whether word v has sememe s (1 if it does, 0 otherwise). The higher the resulting Pr(s|w), the more likely s is a sememe of w.
In practice the calculation is considerably more involved. Because the distances between normalized vectors all lie in (-1, 1), they do not separate different sememes well, so the sememes of closer words are given larger weights. A hyperparameter p is introduced: the contribution of the k-th nearest word is multiplied by $p^k$, which widens the gap between the sememes of different words. Multiplying by this exponentially decaying coefficient also keeps Pr(s|w) within a bounded range, so that it does not diverge.
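A minimal sketch of this rank-weighted scoring is given below. It assumes that 0 < p < 1, that ranks start at k = 1 for the nearest word, and that the similarity values come from a previous neighbour search; the function name `score_sememes` and the sample annotations are hypothetical.

```python
def score_sememes(neighbours, sememes_of, p=0.5):
    """Score candidate sememes from ranked neighbours.

    neighbours: (word, similarity) pairs sorted from nearest to farthest.
    sememes_of: mapping from each known word to its set of sememes.
    """
    scores = {}
    for rank, (word, sim) in enumerate(neighbours, start=1):
        weight = sim * p ** rank  # similarity damped by the decay factor p**k
        for sememe in sememes_of[word]:
            scores[sememe] = scores.get(sememe, 0.0) + weight
    return scores

# Made-up similarities and annotations, for illustration only.
neighbours = [("bookstore", 0.9), ("market", 0.5)]
sememes_of = {"bookstore": {"place", "commerce", "buy"},
              "market": {"place", "sell"}}
scores = score_sememes(neighbours, sememes_of)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

A sememe shared by several near neighbours ("place" here) accumulates weight from each of them, while a sememe contributed only by a distant neighbour stays small.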
S103. Calculate, from the sememe vectors of the target known-sememe words in the candidate sememe set, the score of each sememe for each unknown-sememe word.
S104. Obtain the first sememe vector of each unknown-sememe word according to the score threshold and the score of each sememe.
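Step S104 can be pictured as a simple thresholding of the scores. The sketch below assumes the first sememe vector is a 0/1 vector over a fixed sememe inventory and that a score equal to the threshold counts as a hit; both points, and the name `first_sememe_vector`, are assumptions made for illustration.

```python
def first_sememe_vector(scores, all_sememes, threshold):
    """Return a 0/1 vector with 1 where a sememe's score reaches the threshold."""
    return [1 if scores.get(sememe, 0.0) >= threshold else 0 for sememe in all_sememes]

# Fixed ordering of the sememe inventory (toy subset) and made-up scores.
all_sememes = ["place", "commerce", "buy", "sell", "treasure", "past"]
scores = {"place": 0.575, "commerce": 0.45, "sell": 0.125}
vector = first_sememe_vector(scores, all_sememes, threshold=0.3)
print(vector)
```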
Specifically, a word often shares sememes with words similar to it; for example, "中国" (China) and "美国" (United States) both share sememes such as "专" (proper name) and "国家" (country). Many words, however, also have sememes of their own, which is why the method of this embodiment is proposed: it can learn sememes from similar words and can also learn word-specific sememes.
This embodiment determines the candidate sememe set of each unknown-sememe word from the vector distance between that word and each known-sememe word, scores the sememes in the candidate set, and derives the first sememe vector of each unknown-sememe word from those scores. Sememes can thus be predicted automatically and accurately for unknown-sememe words, which reduces the burden of manual annotation and the possible bias introduced when different people annotate.
Further, on the basis of the above method embodiment, the method further includes:
S100. Obtain preset sememes, and calculate the word vector of each preset word according to the stochastic gradient descent method and the preset sememes.
The preset sememes include 1400 common sememes.
Further, on the basis of the above method embodiment, after the sememe vector of each unknown-sememe word is obtained according to the score threshold and the score of each sememe, the method further includes:
S105. Obtain a sememe-word matrix from the preset sememe vectors and the unknown-sememe word vectors;
wherein the sememe-word matrix consists of 0s and 1s: 1 indicates that the corresponding word includes the corresponding sememe, and 0 indicates that it does not.
S106. Calculate the co-occurrence matrix of the sememe-word matrix from the sememe-word matrix;
The sememe-sememe co-occurrence matrix carries rich information about the relations between sememes. Just as a word co-occurrence matrix can be used to generate word vectors, the sememe co-occurrence matrix can help generate better sememe vectors.
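The two matrices of S105 and S106 can be sketched as below. This document does not state how the co-occurrence values are computed, so the sketch assumes raw counts: the entry for sememes s and t is the number of words annotated with both. The name `build_matrices` and the sample annotations are hypothetical.

```python
def build_matrices(annotations, sememes):
    """Build the 0/1 word-by-sememe matrix M and the sememe co-occurrence counts C."""
    words = sorted(annotations)
    M = [[1 if s in annotations[w] else 0 for s in sememes] for w in words]
    n = len(sememes)
    # C[a][b] counts the words that carry both sememe a and sememe b.
    C = [[sum(row[a] * row[b] for row in M) for b in range(n)] for a in range(n)]
    return words, M, C

sememes = ["place", "commerce", "buy", "sell"]
# Made-up annotations, for illustration only.
annotations = {
    "antique shop": {"place", "commerce", "buy", "sell"},
    "bookstore": {"place", "commerce", "sell"},
    "school": {"place"},
}
words, M, C = build_matrices(annotations, sememes)
print(words)
print(M)
print(C)
```

With this choice the diagonal of `C` simply holds how many words carry each sememe; other weightings (for example a mutual-information score) would be equally plausible.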
S107. Decompose the sememe-word matrix and the co-occurrence matrix respectively by the stochastic gradient descent method to obtain a second sememe vector;
Specifically, the 0/1 matrix of words and sememes is computed first; the sememe-sememe co-occurrence matrix is computed next; finally, both matrices are factorized by stochastic gradient descent to obtain the sememe vectors.
S108. Calculate a target value from the unknown-sememe word vector and the second sememe vector;
S109. Calculate a target sememe vector from the target value and the first sememe vector;
Further, on the basis of the above method embodiment, S107 specifically includes:
decomposing the sememe-word matrix and the co-occurrence matrix respectively according to the stochastic gradient descent method and a loss function to obtain the second sememe vector;
wherein the loss function is:
$$L=\sum_{w\in W,\,s\in S}\bigl(W\cdot(S+S')+b_w+b_s-M_{ws}\bigr)^2+\lambda\sum_{s,t\in S}\bigl(s\cdot t-C_{st}\bigr)^2$$
where W is the unknown-sememe word vector, S and S' are the first preset sememe vector and the second preset sememe vector respectively, λ is a preset coefficient, $M_{ws}$, $C_{st}$, w and s are elements of the sememe-word matrix, the co-occurrence matrix, the unknown-sememe word vector and the first preset sememe vector respectively, $b_w$ is the bias of the unknown-sememe word vector, and $b_s$ is the bias of the first preset sememe vector.
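To make the two terms of the loss concrete, the sketch below evaluates it on toy data. It reads the formula element-wise, which is an interpretation rather than something stated explicitly here: each word w has a vector, each sememe s has two vectors (one from S, one from S'), and the dot products are taken between those individual vectors. The function and argument names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss(W, S, S2, b_w, b_s, M, C, lam):
    """Squared-error loss over the word-sememe matrix M and the co-occurrence matrix C.

    W: word -> vector; S, S2: sememe -> first / second sememe vector;
    b_w, b_s: word / sememe biases; lam: weight of the co-occurrence term.
    """
    fit = 0.0
    for w, w_vec in W.items():
        for s, s_vec in S.items():
            combined = [x + y for x, y in zip(s_vec, S2[s])]
            fit += (dot(w_vec, combined) + b_w[w] + b_s[s] - M[w][s]) ** 2
    cooc = 0.0
    for s, s_vec in S.items():
        for t, t_vec in S.items():
            cooc += (dot(s_vec, t_vec) - C[s][t]) ** 2
    return fit + lam * cooc

# One word, one sememe, chosen so that both terms vanish when M is 1.
W = {"a": [1.0, 0.0]}
S = {"x": [0.5, 0.5]}
S2 = {"x": [0.5, -0.5]}
b_w, b_s = {"a": 0.0}, {"x": 0.0}
C = {"x": {"x": 0.5}}
perfect = loss(W, S, S2, b_w, b_s, {"a": {"x": 1}}, C, lam=2.0)
mismatch = loss(W, S, S2, b_w, b_s, {"a": {"x": 0}}, C, lam=2.0)
print(perfect, mismatch)
```

Stochastic gradient descent would repeatedly pick a (w, s) or (s, t) pair and move the corresponding vectors and biases against the gradient of its squared term.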
L is reduced by gradient descent to obtain a good sememe vector representation. Finally, the following function can be used to compute the likely relation between a new word and a sememe:
$$\Pr(s\mid w)=\sum_{v\in W}\cos(v,w)\,M_{vs}+\lambda\cos(w,s)$$
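The combined score can be sketched as follows, under the assumption that word vectors and sememe vectors live in the same space so that cos(w, s) is meaningful. The name `sememe_score`, the dictionary layout and the toy numbers are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def sememe_score(w_vec, s_vec, known_vecs, has_sememe, lam):
    """Neighbour-based term plus lam times the direct word-sememe similarity."""
    neighbour_term = sum(cosine(v_vec, w_vec) * has_sememe[v]
                         for v, v_vec in known_vecs.items())
    return neighbour_term + lam * cosine(w_vec, s_vec)

known_vecs = {"v1": [1.0, 0.0], "v2": [0.0, 1.0], "v3": [1.0, 0.0]}
has_sememe = {"v1": 1, "v2": 1, "v3": 0}   # M_vs for one fixed sememe s
score = sememe_score([1.0, 0.0], [1.0, 0.0], known_vecs, has_sememe, lam=0.5)
print(score)
```

The first term transfers sememes from similar annotated words; the second lets a sememe score highly even when no neighbour carries it, which matches the stated goal of also learning word-specific sememes.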
At present there is no automatic prediction model for sememes. The existing approach relies on manual annotation, which is time-consuming and labor-intensive, and annotation quality varies from person to person, which considerably affects the accuracy of the sememes. This embodiment can use existing annotated data to predict sememes automatically. Testing with part of the HowNet data as a test set shows that the results of this embodiment coincide with the manual annotation to a large extent, indicating high accuracy. In addition, the model of this embodiment can find suggested sememes that are not marked in HowNet, and these newly found candidate sememes are also largely correct.
Fig. 5 shows a schematic structural diagram of the device for automatic prediction of lexical sememes provided by this embodiment. The device includes a distance calculation module 501, a sememe set determination module 502, a sememe score calculation module 503 and a sememe vector determination module 504, wherein:
the distance calculation module 501 is configured to calculate, from the word vector of each preset word, the vector distance between each unknown-sememe word and each known-sememe word;
the sememe set determination module 502 is configured to select, according to the vector distances and a distance threshold, at least one target known-sememe word to form the candidate sememe set of each unknown-sememe word;
the sememe score calculation module 503 is configured to calculate, from the sememe vectors of the target known-sememe words in the candidate sememe set, the score of each sememe for each unknown-sememe word;
the sememe vector determination module 504 is configured to obtain the first sememe vector of each unknown-sememe word according to a score threshold and the score of each sememe;
wherein the preset words include known-sememe words and unknown-sememe words.
Specifically, the distance calculation module 501 calculates, from the word vector of each preset word, the vector distance between each unknown-sememe word and each known-sememe word; the sememe set determination module 502 selects, according to the vector distances and the distance threshold, at least one target known-sememe word to form the candidate sememe set of each unknown-sememe word; the sememe score calculation module 503 calculates, from the sememe vectors of the target known-sememe words in the candidate sememe set, the score of each sememe for each unknown-sememe word; and the sememe vector determination module 504 obtains the first sememe vector of each unknown-sememe word according to the score threshold and the score of each sememe.
This embodiment determines the candidate sememe set of each unknown-sememe word from the vector distance between that word and each known-sememe word, scores the sememes in the candidate set, and derives the first sememe vector of each unknown-sememe word from those scores. Sememes can thus be predicted automatically and accurately for unknown-sememe words, which reduces the burden of manual annotation and the possible bias introduced when different people annotate.
Further, on the basis of the above device embodiment, the device further includes:
a word vector calculation module, configured to obtain preset sememes and to calculate the word vector of each preset word according to the stochastic gradient descent method and the preset sememes.
Further, on the basis of the above device embodiment, the device further includes:
a sememe-word matrix acquisition module, configured to obtain a sememe-word matrix from the preset sememe vectors and the unknown-sememe word vectors;
a co-occurrence matrix calculation module, configured to calculate the co-occurrence matrix of the sememe-word matrix from the sememe-word matrix;
a matrix decomposition module, configured to decompose the sememe-word matrix and the co-occurrence matrix respectively by the stochastic gradient descent method to obtain a second sememe vector;
a target value calculation module, configured to calculate a target value from the unknown-sememe word vector and the second sememe vector;
a target sememe vector calculation module, configured to calculate a target sememe vector from the target value and the first sememe vector;
wherein the sememe-word matrix consists of 0s and 1s: 1 indicates that the corresponding word includes the corresponding sememe, and 0 indicates that it does not.
Further, on the basis of the above device embodiment, the matrix decomposition module is specifically configured to decompose the sememe-word matrix and the co-occurrence matrix respectively using stochastic gradient descent and a loss function, to obtain the second sememe vector;
wherein the loss function is:
$$L=\sum_{w\in W,\,s\in S}\left(W\cdot(S+S')+b_w+b_s-M_{ws}\right)^2+\lambda\sum_{s,t\in S}\left(s\cdot t-C_{st}\right)^2$$
W is the unknown-sememe word vector; S and S' are the first preset sememe vector and the second preset sememe vector, respectively; λ is a preset coefficient; $M_{ws}$, $C_{st}$, w and s are elements of the sememe-word matrix, the co-occurrence matrix, the unknown-sememe word vector and the first preset sememe vector, respectively; $b_w$ is the bias of the unknown-sememe word vector, and $b_s$ is the bias of the first preset sememe vector.
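A minimal Python sketch of the loss function and one stochastic gradient descent update is given below. It is an illustration, not the patented implementation. It assumes that each summand of the first term takes the dot product of one word vector w with the sum s + s' of the two vectors of one sememe, and that the second term runs over all ordered sememe pairs, including s = t. The function names, the learning rate `lr`, the vector dimension and the toy matrices are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loss(W, S, S2, bw, bs, M, C, lam):
    """Evaluate the two-term squared-error loss over M and C."""
    total = 0.0
    for i, w in enumerate(W):
        for j in range(len(S)):
            combined = [a + b for a, b in zip(S[j], S2[j])]
            err = dot(w, combined) + bw[i] + bs[j] - M[i][j]
            total += err * err
    for j in range(len(S)):
        for k in range(len(S)):
            err = dot(S[j], S[k]) - C[j][k]
            total += lam * err * err
    return total


def sgd_step(W, S, S2, bw, bs, M, C, lam, lr, rng):
    """Update the parameters in place from one sampled entry of M and of C."""
    # First term: one (word, sememe) entry of the sememe-word matrix.
    i, j = rng.randrange(len(W)), rng.randrange(len(S))
    combined = [a + b for a, b in zip(S[j], S2[j])]
    err = dot(W[i], combined) + bw[i] + bs[j] - M[i][j]
    w_old = list(W[i])
    for d in range(len(w_old)):
        W[i][d] -= lr * 2 * err * combined[d]
        S[j][d] -= lr * 2 * err * w_old[d]
        S2[j][d] -= lr * 2 * err * w_old[d]
    bw[i] -= lr * 2 * err
    bs[j] -= lr * 2 * err
    # Second term: one (sememe, sememe) entry of the co-occurrence matrix.
    j, k = rng.randrange(len(S)), rng.randrange(len(S))
    err = dot(S[j], S[k]) - C[j][k]
    sj_old, sk_old = list(S[j]), list(S[k])
    for d in range(len(sj_old)):
        if j == k:
            S[j][d] -= lr * 2 * lam * err * 2 * sj_old[d]
        else:
            S[j][d] -= lr * 2 * lam * err * sk_old[d]
            S[k][d] -= lr * 2 * lam * err * sj_old[d]


# Hypothetical toy data: 3 words, 3 sememes, 4-dimensional vectors.
rng = random.Random(0)
dim = 4
M = [[1, 1, 0], [0, 0, 1], [1, 1, 0]]
C = [[0, 2, 0], [2, 0, 0], [0, 0, 0]]


def random_vectors(count):
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(count)]


W, S, S2 = random_vectors(3), random_vectors(3), random_vectors(3)
bw, bs = [0.0] * 3, [0.0] * 3
initial_loss = loss(W, S, S2, bw, bs, M, C, 0.5)
for _ in range(3000):
    sgd_step(W, S, S2, bw, bs, M, C, lam=0.5, lr=0.02, rng=rng)
final_loss = loss(W, S, S2, bw, bs, M, C, 0.5)
```

Sampling one matrix entry per update keeps each step cheap. After training, a score for word w and sememe s could be read off as the fitted value of the first term, although how the device turns these vectors into the target value is not detailed in this passage.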
The device for automatic prediction of word sememes described in this embodiment can be used to carry out the above method embodiments; its principles and technical effects are similar and are not repeated here.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware. On this understanding, the essence of the above technical solution, or the part of it that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
It should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710429027.6A CN107193806B (en) | 2017-06-08 | 2017-06-08 | A method and device for automatic prediction of lexical sememe |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107193806A (en) | 2017-09-22 |
CN107193806B (en) | 2019-11-22 |
Family
ID=59877677
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710429027.6A Active CN107193806B (en) | 2017-06-08 | 2017-06-08 | A method and device for automatic prediction of lexical sememe |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107193806B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103186647A (en) * | 2011-12-31 | 2013-07-03 | 北京金山软件有限公司 | Method and device for sequencing according to contribution degree |
CN103150388A (en) * | 2013-03-21 | 2013-06-12 | 天脉聚源(北京)传媒科技有限公司 | Method and device for extracting key words |
CN104699819A (en) * | 2015-03-26 | 2015-06-10 | 浪潮集团有限公司 | Sememe classification method and device |
CN106610949A (en) * | 2016-09-29 | 2017-05-03 | 四川用联信息技术有限公司 | Text feature extraction method based on semantic analysis |
Non-Patent Citations (3)
Title |
---|
CHUAN-JIE ET AL.: "Dimensional Sentiment Analysis by Synsets and Sense Definitions", 《2016 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP)》 * |
YAN WANG ET AL.: "Incorporating Linguistic Knowledge for Learning Distributed Word Representations", 《PLOS ONE》 * |
SUN MAOSONG ET AL.: "Vector representation of words and word senses with the help of a manually built knowledge base: taking HowNet as an example", 《Journal of Chinese Information Processing》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108984533A (en) * | 2018-08-03 | 2018-12-11 | 清华大学 | Lexical sememe prediction method and device |
CN109271633A (en) * | 2018-09-17 | 2019-01-25 | 北京神州泰岳软件股份有限公司 | Word vector training method and device for single semantic supervision |
CN109299459A (en) * | 2018-09-17 | 2019-02-01 | 北京神州泰岳软件股份有限公司 | Word vector training method and device for single semantic supervision |
CN109271633B (en) * | 2018-09-17 | 2023-08-18 | 鼎富智能科技有限公司 | Word vector training method and device for single semantic supervision |
CN109299459B (en) * | 2018-09-17 | 2023-08-22 | 北京神州泰岳软件股份有限公司 | Word vector training method and device for single semantic supervision |
CN109446518A (en) * | 2018-10-09 | 2019-03-08 | 清华大学 | Decoding method and decoder for language model |
CN109446518B (en) * | 2018-10-09 | 2020-06-02 | 清华大学 | Decoding method and decoder for language model |
CN109597988A (en) * | 2018-10-31 | 2019-04-09 | 清华大学 | Cross-language lexical sememe prediction method, device and electronic device |
CN109597988B (en) * | 2018-10-31 | 2020-04-28 | 清华大学 | Cross-language lexical sememe prediction method, device and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN107193806B (en) | 2019-11-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||