CN111191024B - Method for calculating sentence semantic vector - Google Patents

Method for calculating sentence semantic vector

Info

Publication number
CN111191024B
CN111191024B
Authority
CN
China
Prior art keywords
word
sentence
vector
calculated
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811348612.4A
Other languages
Chinese (zh)
Other versions
CN111191024A (en)
Inventor
罗立刚
刘辉
张正宽
张天泽
常涛
王玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zero Krypton Technology Tianjin Co ltd
Original Assignee
Zero Krypton Technology Tianjin Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zero Krypton Technology Tianjin Co ltd filed Critical Zero Krypton Technology Tianjin Co ltd
Priority to CN201811348612.4A priority Critical patent/CN111191024B/en
Publication of CN111191024A publication Critical patent/CN111191024A/en
Application granted granted Critical
Publication of CN111191024B publication Critical patent/CN111191024B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for calculating sentence semantic vectors, comprising the following steps: A. segment each sentence sample in the corpus into words to obtain a word set, and train a word-vector generation tool to obtain a word vector for each word, forming a word vector set; B. average the word vectors of the sentence to be calculated, using the word vector set, to obtain its sentence vector; C. from the word set, find the several words most similar to each constituent word of the sentence to be calculated, forming one candidate set per constituent word; D. calculate the distance between each word in the candidate sets and the sentence vector, and use each distance as a weight to multiply the corresponding candidate word vector, obtaining the semantic vector of the sentence to be calculated. By combining the word vectors of the neighbor words of each constituent word in the sentence, the method makes full use of the semantic information of all words and yields a more reasonable representation.

Description

Method for calculating sentence semantic vector
Technical Field
The invention relates to the technical field of text information processing, and in particular to a method for calculating sentence semantic vectors.
Background
The internet has gradually become the information carrier through which people record their lives and work. While it makes information easier to obtain, its use also generates a large amount of text data, and extracting important information from this complex text in a timely and effective way requires artificial intelligence to process natural language. In the field of Natural Language Processing (NLP), sentence semantic computation is a basic form of semantic representation, and a reasonable sentence-level semantic representation provides useful support for downstream applications. Traditional sentence semantics are usually derived by recomputing word vectors: the most common approach is to average the word vectors in a sentence, or to take an intermediate result of neural network training as the sentence vector. However, because the words that make up a sentence are sometimes inaccurate or the expression is incorrect, these existing representations can produce inaccurate results and fail to support downstream applications.
Disclosure of Invention
Therefore, the main purpose of the present invention is to provide a method for calculating sentence semantic vectors that combines the word vectors of the neighbor words of each constituent word in a sentence. The method makes full use of the semantic information of all words, is simple to implement, and yields a more reasonable representation. It can mitigate incorrect sentence semantics caused by wrong constituent words or expressions, thereby providing useful support for downstream applications.
The technical solution adopted by the invention is a method for calculating sentence semantic vectors, comprising the following steps:
A. segment each sentence sample in the corpus into words to obtain a word set, and train a word-vector generation tool to obtain a word vector for each word, forming a word vector set;
B. average the word vectors of the sentence to be calculated, using the word vector set, to obtain the sentence vector of the sentence to be calculated;
C. from the word set, find the several words most similar to each constituent word of the sentence to be calculated, forming one candidate set per constituent word;
D. calculate the distance between each word in the candidate sets and the sentence vector, use each distance as a weight to multiply the corresponding candidate word vector, and so obtain the semantic vector of the sentence to be calculated.
With this method, each sentence sample in an existing corpus is segmented to obtain a word set, a word-vector generation tool produces a vector for each word, and averaging the vectors of a sentence's words yields its sentence vector. However, because a sentence's constituent words sometimes contain errors or its phrasing is not fixed, the method additionally selects, for each constituent word, the several nearest words to form a candidate set. The distance between each candidate word and the sentence vector then serves as a weight that multiplies the candidate's word vector, introducing the vectors of neighboring words into the sentence. The result is a more reasonable sentence-semantic representation with a lower error rate, which mitigates incorrect semantics caused by wrong constituent words or expressions and provides useful support for downstream applications.
Wherein step B comprises the following steps:
extracting the word vector of each constituent word of the sentence to be calculated from the word vector set;
and averaging the extracted word vectors to obtain the sentence vector of the sentence to be calculated.
As described above, the sentence vector is obtained by retrieving and averaging the word vectors of the sentence's constituent words; it is only a basic vector, not the final computed semantic vector.
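As a hedged sketch, the averaging in step B can be written as follows. The three-dimensional vectors and the function name `sentence_vector` are invented for illustration; real vectors would come from a trained word-vector model.

```python
# Step B sketch: the sentence vector is the mean, dimension by dimension,
# of the word vectors of the sentence's constituent words.

def sentence_vector(words, word_vecs):
    """Average the word vectors of `words` (words missing a vector are skipped)."""
    vecs = [word_vecs[w] for w in words if w in word_vecs]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(len(vecs[0]))]

# Toy 3-d vectors, illustrative only (not trained embeddings).
word_vecs = {
    "i":     [0.1, 0.0, 0.2],
    "like":  [0.3, 0.4, 0.1],
    "apple": [0.5, 0.2, 0.9],
}
print(sentence_vector(["i", "like", "apple"], word_vecs))
```

Each output dimension is the mean of that dimension across the three word vectors.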
Wherein step C comprises the following steps:
calculating, through a proximity algorithm, the similarity between each word in the word set and each constituent word of the sentence to be calculated;
selecting the several words with the highest similarity to form candidate sets, the number of candidate sets being equal to the number of constituent words of the sentence to be calculated;
and selecting the word vector of each candidate word from the word vector set to form the candidate sets' word vector set.
As described above, by comparing each constituent word of the sentence to be calculated against the other words in the word set and selecting the several most similar words to form a candidate set, a sentence with multiple constituent words yields a corresponding number of candidate sets.
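A minimal sketch of the candidate-set construction in step C, assuming cosine similarity as the proximity measure. The toy vectors and the helper names `cosine` and `candidate_set` are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def candidate_set(word, word_vecs, k=2):
    """Step C sketch: the k vocabulary words most similar to `word`."""
    scored = [(other, cosine(word_vecs[word], word_vecs[other]))
              for other in word_vecs if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:k]]

# Toy vectors, illustrative only: fruit words point in a similar direction.
word_vecs = {
    "apple":   [0.50, 0.20, 0.90],
    "orange":  [0.50, 0.25, 0.85],
    "banana":  [0.45, 0.20, 0.95],
    "samsung": [0.90, 0.80, 0.10],
}
print(candidate_set("apple", word_vecs))  # fruit words outrank "samsung"
```

One such candidate set is built per constituent word of the sentence.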
Wherein step D comprises the following steps:
calculating the distance between each word in the candidate sets and the sentence vector;
multiplying each distance by the word vector of the corresponding candidate word, and averaging the products to obtain the semantic vector of the sentence to be calculated.
As described above, because each candidate word's similarity to the sentence differs, so does its distance to the sentence vector. Taking each distance as a weight, the candidate word vectors are multiplied and the products averaged, which yields the final semantic vector of the sentence to be calculated and provides better support for downstream applications.
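The weighting-and-averaging of step D can be sketched as follows, assuming cosine similarity to the sentence vector as the weight. The function names and the two-dimensional toy inputs are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_vector(candidates, word_vecs, sent_vec):
    """Step D sketch: weight each candidate word vector by its similarity
    to the sentence vector, then average the weighted vectors."""
    weighted = [[cosine(word_vecs[w], sent_vec) * x for x in word_vecs[w]]
                for w in candidates]
    dim = len(sent_vec)
    return [sum(v[d] for v in weighted) / len(weighted) for d in range(dim)]

# Toy example: "a" is aligned with the sentence vector, "b" is orthogonal.
sent_vec = [1.0, 0.0]
word_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
print(semantic_vector(["a", "b"], word_vecs, sent_vec))
```

Here the orthogonal candidate "b" receives weight 0, so it contributes nothing to the result, while the aligned candidate "a" contributes fully.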
Drawings
FIG. 1 is a flow chart of a method of computing sentence semantic vectors according to the present invention.
Detailed Description
The method for calculating sentence semantic vectors according to the present invention is described in detail below with reference to FIG. 1; it comprises the following steps:
s100: word segmentation is carried out on each sentence sample in the corpus to obtain a word set, a word vector generating tool is adopted for training to obtain a word vector of each word, and the word vector set is formed;
providing a corpus sample set S1, wherein the set S1 comprises a plurality of sentence samples Si, i is a natural number which is greater than or equal to 1, and performing word segmentation on the sentence samples Si to obtain a word segmentation result Wij, and the word segmentation result Wij is used for obtaining a word set S2.
And performing unsupervised word vector training on the word set S2 to obtain a word vector set Wij_vec. The word vector has good semantic characteristics and is a common way of representing word characteristics. The value of each dimension of the word vector represents a feature that has some semantic and grammatical interpretation, so each dimension of the word vector may be referred to as a word feature. In this embodiment, word2vec model, which is a software tool for training Word vectors that is open by Google corporation in 2013, may be used to train Word vectors in a Word set. According to a given corpus, a word is quickly and effectively expressed into a vector form through an optimized training model.
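A hedged sketch of the S100 pipeline up to the word set. Whitespace splitting stands in here for a real Chinese word segmenter (e.g. jieba), and the corpus, `corpus_s1`, and `word_set_s2` names are invented; a tool such as Word2vec would then be trained on the segmented corpus to produce the word vector set Wij_vec.

```python
# S100 sketch: segment each sentence sample in S1 and collect the word set S2.

corpus_s1 = [
    "i like eating apples",
    "i like apple phones",
]
segmented = [sentence.split() for sentence in corpus_s1]        # Wij per sentence
word_set_s2 = sorted({w for sent in segmented for w in sent})   # deduplicated S2
print(word_set_s2)
```

The resulting word set is the vocabulary over which word vectors are trained.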
S200: average the word vectors of the sentence to be calculated, using the word vector set, to obtain the sentence vector of the sentence to be calculated;
First, the sentence to be calculated is segmented into words; the word vectors of its constituent words are then extracted from the word vector set Wij_vec and averaged to obtain the sentence vector of the sentence to be calculated.
Specifically, let W1 be the sentence vector corresponding to the sentence to be calculated: W1 = (1/k) · sum(Wij_vec), where j ranges from 1 to k and k is the number of constituent words in the sentence to be calculated.
S300: find, in the word set, the several words most similar to each constituent word of the sentence to be calculated, form a candidate set for each constituent word, and train word vectors for the words of each candidate set;
Word similarity quantifies the complex relations among words and is a quantitative measure of their degree of semantic similarity. The calculation proceeds as follows:
the similarity between each word in the word set S2 and each constituent word of the sentence to be calculated is computed through a proximity algorithm;
In an embodiment of the invention, either the common cosine-similarity formula or the Euclidean-distance formula may be chosen as the proximity algorithm. Cosine similarity measures the similarity of two texts by the cosine of the angle between their vectors in a vector space; compared with distance measures, it focuses on the difference in direction between the two vectors. Typically, once vector representations of two texts have been obtained through an embedding method, their similarity can be computed with cosine similarity. Euclidean distance is the most common distance measure: it is the true distance between two points in multidimensional space, or the natural length of a vector (its distance from the origin); in two- and three-dimensional space it is the actual distance between two points.
For example, in "I like eating apples" and "I like Apple phones", the word "apple" belongs to the fruit domain in one sentence sample and the mobile-phone domain in the other. Computing neighbor-word similarity resolves this: if "apple" has higher word-vector similarity with fruit words such as "orange" and "banana", it belongs to the fruit domain; if it is more similar to phone brands such as "Samsung" and "Xiaomi", it belongs to the phone domain.
Ranked by similarity, the several most similar words for each constituent word are selected to form the candidate sets TSet, whose number equals the number of constituent words of the sentence to be calculated. The Word2vec model is then used to train the word vectors of each candidate set TSet.
S400: calculate the distance between each word in the candidate sets and the sentence vector, use each distance as a weight to multiply the corresponding candidate word vector, and obtain the semantic vector of the sentence to be calculated;
This step relies on the step-S300 similarity results between each word in the candidate sets TSet and the constituent words of the sentence to be calculated. The distances between each candidate word and the sentence vector W1 are computed; each distance, taken as a weight, multiplies the corresponding candidate word vector in TSet; and averaging the products yields the semantic vector of the sentence to be calculated.
For example, when computing the semantic vector of "I like eating apples", the vectors of "apple"'s neighbor words "orange" and "banana" are introduced into the sentence semantics, each weighted by its distance to the sentence vector. This makes the sentence's semantic-vector representation more reasonable and avoids errors.
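The "I like eating apples" example can be walked through end to end as a toy sketch. All vectors, names, and the choice of cosine similarity as the distance are illustrative assumptions; real vectors would come from a trained model such as Word2vec.

```python
import math

# Toy walk-through of steps S200-S400 for "i like eating apples".
word_vecs = {
    "i":       [0.10, 0.00, 0.20],
    "like":    [0.30, 0.40, 0.10],
    "eating":  [0.20, 0.30, 0.30],
    "apples":  [0.50, 0.20, 0.90],
    "orange":  [0.50, 0.25, 0.85],
    "banana":  [0.45, 0.20, 0.95],
    "samsung": [0.90, 0.80, 0.10],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean(vecs):
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(len(vecs[0]))]

sentence = ["i", "like", "eating", "apples"]
sent_vec = mean([word_vecs[w] for w in sentence])             # S200: sentence vector W1

# S300: the 2 nearest neighbours of "apples" among the remaining vocabulary.
neighbours = sorted((w for w in word_vecs if w not in sentence),
                    key=lambda w: cosine(word_vecs[w], word_vecs["apples"]),
                    reverse=True)[:2]

# S400: similarity to the sentence vector as weight, then average.
weighted = [[cosine(word_vecs[w], sent_vec) * x for x in word_vecs[w]]
            for w in neighbours]
semantic_vec = mean(weighted)
print(neighbours, semantic_vec)
```

With these toy vectors, the fruit words "orange" and "banana" are selected ahead of "samsung", and the final semantic vector blends their weighted vectors, mirroring the example in the text.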
In summary, the invention calculates sentence semantic vectors by combining the word vectors of the neighbor words of each constituent word in the sentence. It makes full use of the semantic information of all words, is simple to implement, yields a more reasonable representation, mitigates incorrect sentence semantics caused by wrong constituent words or expressions, and provides useful support for downstream applications.
The foregoing describes only preferred embodiments of the invention and is not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention falls within its scope of protection.

Claims (2)

1. A method of computing a semantic vector for a sentence, comprising the steps of:
A. segmenting each sentence sample in a corpus into words to obtain a word set, and training a word-vector generation tool to obtain a word vector for each word, forming a word vector set;
B. averaging, via the word vector set, the word vectors of a sentence to be calculated to obtain the sentence vector of the sentence to be calculated;
C. calculating, via a proximity algorithm, the similarity between each word in the word set and each constituent word of the sentence to be calculated; selecting the several words with the highest similarity to form candidate sets, the number of candidate sets being equal to the number of constituent words of the sentence to be calculated; and selecting the word vector of each candidate word from the word vector set to form the candidate sets' word vector set;
D. calculating the distance between each word in the candidate sets and the sentence vector; multiplying each distance by the word vector of the corresponding candidate word; and averaging the products to obtain the semantic vector of the sentence to be calculated.
2. The method according to claim 1, wherein step B comprises:
extracting the word vector of each constituent word of the sentence to be calculated from the word vector set; and
averaging the extracted word vectors to obtain the sentence vector of the sentence to be calculated.
CN201811348612.4A 2018-11-13 2018-11-13 Method for calculating sentence semantic vector Active CN111191024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811348612.4A CN111191024B (en) 2018-11-13 2018-11-13 Method for calculating sentence semantic vector


Publications (2)

Publication Number Publication Date
CN111191024A CN111191024A (en) 2020-05-22
CN111191024B true CN111191024B (en) 2023-06-23

Family

ID=70705086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811348612.4A Active CN111191024B (en) 2018-11-13 2018-11-13 Method for calculating sentence semantic vector

Country Status (1)

Country Link
CN (1) CN111191024B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008186A (en) * 2014-06-11 2014-08-27 北京京东尚科信息技术有限公司 Method and device for determining keywords in target text
CN107679144A (en) * 2017-09-25 2018-02-09 平安科技(深圳)有限公司 News sentence clustering method, device and storage medium based on semantic similarity
CN108197111A (en) * 2018-01-10 2018-06-22 华南理工大学 A kind of text automatic abstracting method based on fusion Semantic Clustering
CN108763477A (en) * 2018-05-29 2018-11-06 厦门快商通信息技术有限公司 A kind of short text classification method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9672206B2 (en) * 2015-06-01 2017-06-06 Information Extraction Systems, Inc. Apparatus, system and method for application-specific and customizable semantic similarity measurement


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sun Zhiyuan; Wang Wei; Ma Di; Mao Wei. A text-similarity calculation method for the mobile-marketing domain. Computer Applications, Supplement S1, full text. *
Huang Jiangping; Ji Donghong. Research on paraphrase recognition based on sentence semantic distance. Journal of Sichuan University (Engineering Science Edition), No. 6, full text. *

Also Published As

Publication number Publication date
CN111191024A (en) 2020-05-22

Similar Documents

Publication Publication Date Title
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN108073568B (en) Keyword extraction method and device
Colombo et al. Automatic text evaluation through the lens of Wasserstein barycenters
WO2017162134A1 (en) Electronic device and method for text processing
TW202009749A (en) Human-machine dialog method, device, electronic apparatus and computer readable medium
CN106855853A (en) Entity relation extraction system based on deep neural network
CN108269125B (en) Comment information quality evaluation method and system and comment information processing method and system
Chen et al. Jointly modeling inter-slot relations by random walk on knowledge graphs for unsupervised spoken language understanding
Yoshino et al. Dialogue state tracking using long short term memory neural networks
CN113326374A (en) Short text emotion classification method and system based on feature enhancement
CN110347833B (en) Classification method for multi-round conversations
CN113821588A (en) Text processing method and device, electronic equipment and storage medium
CN116245139B (en) Training method and device for graph neural network model, event detection method and device
CN113065350A (en) Biomedical text word sense disambiguation method based on attention neural network
CN110019556A (en) A kind of topic news acquisition methods, device and its equipment
CN111191024B (en) Method for calculating sentence semantic vector
Visser et al. Sentiment and intent classification of in-text citations using bert
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
Song et al. Hyperrank: hyperbolic ranking model for unsupervised keyphrase extraction
JP5916016B2 (en) Synonym determination device, synonym learning device, and program
Ling Coronavirus public sentiment analysis with BERT deep learning
Jayawickrama et al. Seeking sinhala sentiment: Predicting facebook reactions of sinhala posts
CN113190681A (en) Fine-grained text classification method based on capsule network mask memory attention
Putra et al. Analyzing sentiments on official online lending platform in Indonesia with a Combination of Naive Bayes and Lexicon Based Method
JP4314271B2 (en) Inter-word relevance calculation device, inter-word relevance calculation method, inter-word relevance calculation program, and recording medium recording the program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant