CN111191024A - Method for calculating sentence semantic vector - Google Patents

Method for calculating sentence semantic vector Download PDF

Info

Publication number
CN111191024A
CN111191024A (application CN201811348612.4A; granted as CN111191024B)
Authority
CN
China
Prior art keywords
word
sentence
vector
calculated
candidate set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811348612.4A
Other languages
Chinese (zh)
Other versions
CN111191024B (en)
Inventor
罗立刚
刘辉
张正宽
张天泽
常涛
王玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zero Krypton Technology Tianjin Co Ltd
Original Assignee
Zero Krypton Technology Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zero Krypton Technology Tianjin Co Ltd filed Critical Zero Krypton Technology Tianjin Co Ltd
Priority to CN201811348612.4A priority Critical patent/CN111191024B/en
Publication of CN111191024A publication Critical patent/CN111191024A/en
Application granted granted Critical
Publication of CN111191024B publication Critical patent/CN111191024B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for calculating sentence semantic vectors, which comprises the following steps: A. performing word segmentation on each sentence sample in a corpus to obtain a word set, and training with a word-vector generation tool to obtain a word vector for each word, forming a word-vector set; B. performing a word-vector mean calculation on the sentence to be calculated using the word-vector set, to obtain the sentence vector of that sentence; C. finding, in the word set, the several words with the highest similarity to each constituent word of the sentence to be calculated, forming one candidate set per constituent word; D. calculating the distance between each word in the candidate sets and the sentence vector, and multiplying each candidate word's vector by that distance, used as a weight, to obtain the semantic vector of the sentence to be calculated. The method calculates sentence semantic vectors by incorporating the word vectors of the neighbors of each constituent word in the sentence; it makes full use of the semantic information of all words and yields a more reasonable expression.

Description

Method for calculating sentence semantic vector
Technical Field
The invention relates to the technical field of text information processing, in particular to a method for calculating semantic vectors of sentences.
Background
The internet has gradually become the carrier of record for people's lives and work. While this makes information easier to obtain, it also generates a large amount of text data, and extracting important information from that complex data in a timely, effective way requires artificial intelligence to process natural language effectively. In the field of Natural Language Processing (NLP), the calculation of sentence semantics is a basic form of semantic expression, and a reasonable way of expressing sentence semantics provides favorable support for downstream applications. Traditional sentence semantic expression is usually a recalculation over word vectors: the most common approach is to average the word vectors of the sentence, or to use an intermediate result produced during neural network training as the sentence vector. However, because the constituent words of a sentence are sometimes used inaccurately, or the mode of expression is incorrect, these existing approaches can yield inaccurate results and fail to provide favorable support for downstream applications.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide a method for calculating a sentence semantic vector that incorporates the word vectors of the neighbors of each constituent word in the sentence. The method makes full use of the semantic information of all words, is simple to implement, and produces a more reasonable expression; it can solve the problem of incorrect sentence semantic expression caused by wrong constituent words or a wrong mode of expression, and provides favorable support for downstream applications.
The technical scheme adopted by the invention is a method for calculating sentence semantic vectors, comprising the following steps:
A. performing word segmentation on each sentence sample in the corpus to obtain a word set, and training by adopting a word vector generation tool to obtain a word vector of each word to form a word vector set;
B. performing word vector mean calculation on the sentence to be calculated through the word vector set to obtain a sentence vector of the sentence to be calculated;
C. finding out a plurality of words with the highest similarity with each constituent word of the sentence to be calculated in the word set, and respectively forming a candidate set;
D. calculating the distance between each word in the candidate sets and the sentence vector, and multiplying each candidate word's vector by that distance, used as a weight, to obtain the semantic vector of the sentence to be calculated.
In this way, word segmentation is performed on each sentence sample in an existing corpus to obtain a word set, a word-vector generation tool produces a word vector for each word, and averaging these vectors over a sentence yields its sentence vector. However, because a sentence's word composition sometimes contains errors, or its mode of expression is not fixed, the several words nearest to each constituent word are selected to form candidate sets; the distance between each candidate word and the sentence vector is then used as a weight and multiplied by that word's vector. Introducing the vectors of neighboring words in this way yields a more reasonable expression of the sentence semantics with a low error rate, solves the problem of incorrect semantic expression caused by wrong constituent words or modes of expression, and provides beneficial support for downstream applications.
Wherein the step B comprises:
extracting word vectors of all the constituent words of the sentence to be calculated from the word vector set;
performing a mean calculation on the extracted word vectors to obtain the sentence vector of the sentence to be calculated.
In this way, the sentence vector of the sentence to be calculated is obtained by averaging the word vectors of its constituent words. This sentence vector is only a base vector, not the final semantic vector to be calculated.
Wherein the step C comprises:
calculating the similarity between each constituent word of the sentence to be calculated and the word set through a proximity algorithm;
selecting the several words with the highest similarity to form a candidate set, the number of candidate sets being the same as the number of constituent words of the sentence to be calculated;
and selecting each word vector in the candidate set from the word vector set to form a word vector set of the candidate set.
In this way, the similarity between each constituent word of the sentence to be calculated and the other words in the word set is compared, and the several most similar words are selected to form a candidate set; a sentence with a given number of constituent words therefore generates the same number of candidate sets.
Wherein the step D comprises:
calculating each distance between each word in the candidate set and the sentence vector;
multiplying each distance by the word vector of the corresponding word in the candidate set, and performing a mean calculation on the products to obtain the semantic vector of the sentence to be calculated.
Therefore, because each word in a candidate set differs in its similarity to the corresponding constituent word, the distance between each candidate word and the sentence vector is used as a weight and multiplied by that word's vector, and the products are averaged to obtain the final semantic vector of the sentence to be calculated, providing better support for downstream applications.
Drawings
FIG. 1 is a flow chart of a method of calculating a sentence semantic vector according to the present invention.
Detailed Description
The following describes in detail the method for calculating a sentence semantic vector according to the present invention with reference to FIG. 1; it specifically includes the following steps:
s100: performing word segmentation on each sentence sample in the corpus to obtain a word set, and training by adopting a word vector generation tool to obtain a word vector of each word to form a word vector set;
setting a corpus sample set S1, wherein the set S1 comprises a plurality of sentence samples Si, i is a natural number greater than or equal to 1, performing word segmentation on the sentence samples Si to obtain word segmentation results Wij, and the word segmentation results Wij obtain a word set S2.
Unsupervised word-vector training is performed on the word set S2 to obtain the word-vector set Wij_vec. Word vectors have good semantic characteristics and are a common way to represent word features: the value of each dimension represents a feature with a certain semantic or grammatical interpretation, so each dimension may be called a word feature. In this embodiment, the word vectors in the word set may be trained with the Word2vec model, a word-vector training tool released by Google in 2013 that, given a corpus, quickly and effectively expresses each word as a vector through an optimized training model.
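Purely as an illustrative sketch (not part of the claimed method, and not a real training run), the word-vector set Wij_vec can be pictured as a mapping from each word in S2 to a fixed-length vector. A real implementation would train Word2vec over the corpus; the deterministic pseudo-vectors below are only a stand-in so the later steps can be demonstrated end to end:

```python
import hashlib

def toy_word_vectors(word_set, dim=4):
    """Stand-in for a trained word-vector set Wij_vec.

    A real system would train Word2vec on the corpus; here each word is
    mapped to a deterministic pseudo-vector in [-1, 1] so that the
    downstream steps of the method can be illustrated.
    """
    vectors = {}
    for word in word_set:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        # Scale the first `dim` bytes (0..255) into [-1, 1].
        vectors[word] = [b / 127.5 - 1.0 for b in digest[:dim]]
    return vectors

S2 = ["I", "like", "eat", "apple", "orange", "banana"]
Wij_vec = toy_word_vectors(S2)
```

With a real toolkit such as gensim, Wij_vec would instead be the trained `wv` lookup of a Word2Vec model; the word-to-vector dictionary shape used here is the same.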
S200: performing word vector mean calculation on the sentence to be calculated through the word vector set to obtain a sentence vector of the sentence to be calculated;
First, the sentence to be calculated is segmented into words, the word vectors of its constituent words are extracted from the word-vector set Wij_vec, and the extracted vectors are averaged to obtain the sentence vector of the sentence to be calculated.
Specifically, if the sentence vector corresponding to the sentence to be calculated is W1, then W1 = (1/k) * sum(Wij_vec), where j runs from 1 to k and k is the number of constituent words in the sentence to be calculated.
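A minimal sketch of this mean calculation, with hypothetical two-dimensional vectors standing in for real embeddings:

```python
def sentence_vector(words, word_vectors):
    """Compute W1 = (1/k) * sum of the constituent words' vectors."""
    vecs = [word_vectors[w] for w in words]
    k = len(vecs)
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / k for d in range(dim)]

# Hypothetical 2-d word vectors for the constituent words of one sentence.
Wij_vec = {"I": [1.0, 0.0], "like": [0.0, 1.0], "apple": [1.0, 1.0]}
W1 = sentence_vector(["I", "like", "apple"], Wij_vec)  # [2/3, 2/3]
```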
S300: finding, in the word set, the several words with the highest similarity to each constituent word of the sentence to be calculated, forming the candidate sets and training a word vector for each word in them;
the word similarity is a quantitative measure of the number quantization of complex relationships among words and is a quantitative measure of the semantic similarity degree among words, and the specific calculation process is as follows:
calculating the similarity between each constituent word of the sentence to be calculated and the words in the word set S2 through a proximity algorithm;
In this embodiment of the invention, the proximity-based similarity can be calculated with the standard cosine similarity formula or the Euclidean distance formula. Cosine similarity uses the cosine of the angle between two vectors in a vector space to measure the similarity between two texts; compared with distance metrics, it pays more attention to the difference in direction between the two vectors. In the usual case, once the two texts have been represented as vectors by an embedding method, cosine similarity can be used to compute their similarity. Euclidean distance, also known as the Euclidean metric, is the most common distance measure; it is the true distance between two points in a multi-dimensional space, or the natural length of a vector (i.e., the distance of the point from the origin). In two and three dimensions, the Euclidean distance is the actual distance between the two points.
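The two formulas can be sketched directly (a generic illustration, not code from the patent):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two points in n-dimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```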
For example, in the sentences "I like eating apples" and "I like the Apple mobile phone", the word "apple" belongs to the fruit domain in the first sentence and the mobile-phone domain in the second. Calculating the similarity of neighboring words resolves this: if "apple" has a higher word-vector similarity to fruit words such as "orange" and "banana", it belongs to the fruit domain; if it has a higher word-vector similarity to mobile-phone words such as "Samsung" and "Xiaomi", it belongs to the mobile-phone domain.
According to the similarity ranking, the several words with the highest similarity are selected to form the candidate sets TSet, the number of which equals the number of constituent words of the sentence to be calculated; that is, for each constituent word of the sentence, the several most similar words are selected to form one candidate set TSet, and the word vectors in the candidate sets TSet are trained with the Word2vec model.
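An illustrative sketch of forming one candidate set TSet for a single constituent word, again with hypothetical toy vectors (a real system would rank over the full word set S2 using trained embeddings):

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def candidate_set(word, word_vectors, top_n=2):
    """Return the top_n words of the word set most similar to `word`."""
    target = word_vectors[word]
    others = [w for w in word_vectors if w != word]
    others.sort(key=lambda w: cosine_similarity(target, word_vectors[w]),
                reverse=True)
    return others[:top_n]

# Hypothetical vectors: "orange" and "banana" lie near "apple"; "phone" does not.
Wij_vec = {
    "apple":  [1.0, 0.1],
    "orange": [0.9, 0.2],
    "banana": [0.8, 0.15],
    "phone":  [0.0, 1.0],
}
TSet = candidate_set("apple", Wij_vec)
```

One such set is built per constituent word, so a k-word sentence produces k candidate sets.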
S400: calculating the distance between each word in the candidate sets and the sentence vector, and multiplying each candidate word's vector by that distance, used as a weight, to obtain the semantic vector of the sentence to be calculated;
In this step, based on the similarities between the words in the candidate sets TSet and the constituent words of the sentence computed in step S300, the semantic vector of the sentence to be calculated is obtained by computing the distance between each word in TSet and the sentence vector W1, multiplying each candidate word's vector by its distance, used as a weight, and performing a mean calculation on the products.
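Read literally, step S400 weights each candidate word's vector by its distance to W1 and averages the products. A minimal sketch under that literal reading, with toy numbers (whether smaller distances should instead receive larger weights is a design choice the patent text leaves open):

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def semantic_vector(candidates, word_vectors, W1):
    """Average of each candidate word's vector scaled by its distance to W1."""
    weighted = []
    for w in candidates:
        d = euclidean_distance(word_vectors[w], W1)
        weighted.append([d * x for x in word_vectors[w]])
    n = len(weighted)
    return [sum(v[i] for v in weighted) / n for i in range(len(W1))]

W1 = [0.5, 0.5]  # hypothetical sentence vector from step S200
Wij_vec = {"orange": [1.0, 0.5], "banana": [0.5, 1.0]}
vec = semantic_vector(["orange", "banana"], Wij_vec, W1)  # [0.375, 0.375]
```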
For example, when calculating the semantic vector of the sentence "I like eating apples", the vectors of the neighboring words "orange" and "banana" of the word "apple" are introduced into the semantics: the distance between each neighboring word and the sentence vector is used as a proportion and multiplied by that word's vector, making the expression of the sentence semantic vector more reasonable and avoiding errors.
In summary, the invention calculates the sentence semantic vector by incorporating the word vectors of the neighbors of each constituent word in the sentence. It makes full use of the semantic information of all words, is simple to implement, and produces a more reasonable expression; it can solve the problem of incorrect sentence semantic expression caused by wrong constituent words or modes of expression, and provides favorable support for downstream applications.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (4)

1. A method of computing a sentence semantic vector, comprising the steps of:
A. performing word segmentation on each sentence sample in the corpus to obtain a word set, and training by adopting a word vector generation tool to obtain a word vector of each word to form a word vector set;
B. performing word vector mean calculation on the sentence to be calculated through the word vector set to obtain a sentence vector of the sentence to be calculated;
C. finding out a plurality of words with the highest similarity with each constituent word of the sentence to be calculated in the word set, and respectively forming a candidate set;
D. calculating the distance between each word in the candidate sets and the sentence vector, and multiplying each candidate word's vector by that distance, used as a weight, to obtain the semantic vector of the sentence to be calculated.
2. The method of claim 1, wherein step B comprises:
extracting word vectors of all the constituent words of the sentence to be calculated from the word vector set;
and carrying out mean value calculation on the extracted word vectors to obtain sentence vectors of the sentences to be calculated.
3. The method of claim 2, wherein step C comprises:
calculating the similarity between each constituent word of the sentence to be calculated and the word set through a proximity algorithm;
selecting the several words with the highest similarity to form a candidate set, the number of candidate sets being the same as the number of constituent words of the sentence to be calculated;
and selecting each word vector in the candidate set from the word vector set to form a word vector set of the candidate set.
4. The method of claim 3, wherein step D comprises:
calculating each distance between each word in the candidate set and the sentence vector;
and correspondingly multiplying each distance by the word vector of each word in the candidate set, and performing mean value calculation on the multiplied results to obtain the semantic vector of the sentence to be calculated.
CN201811348612.4A 2018-11-13 2018-11-13 Method for calculating sentence semantic vector Active CN111191024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811348612.4A CN111191024B (en) 2018-11-13 2018-11-13 Method for calculating sentence semantic vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811348612.4A CN111191024B (en) 2018-11-13 2018-11-13 Method for calculating sentence semantic vector

Publications (2)

Publication Number Publication Date
CN111191024A true CN111191024A (en) 2020-05-22
CN111191024B CN111191024B (en) 2023-06-23

Family

ID=70705086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811348612.4A Active CN111191024B (en) 2018-11-13 2018-11-13 Method for calculating sentence semantic vector

Country Status (1)

Country Link
CN (1) CN111191024B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008186A (en) * 2014-06-11 2014-08-27 北京京东尚科信息技术有限公司 Method and device for determining keywords in target text
US20160350283A1 (en) * 2015-06-01 2016-12-01 Information Extraction Systems, Inc. Apparatus, system and method for application-specific and customizable semantic similarity measurement
CN107679144A (en) * 2017-09-25 2018-02-09 平安科技(深圳)有限公司 News sentence clustering method, device and storage medium based on semantic similarity
CN108197111A (en) * 2018-01-10 2018-06-22 华南理工大学 A kind of text automatic abstracting method based on fusion Semantic Clustering
CN108763477A (en) * 2018-05-29 2018-11-06 厦门快商通信息技术有限公司 A kind of short text classification method and system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sun Zhiyuan; Wang Wei; Ma Di; Mao Wei: "Text similarity calculation method in the field of mobile marketing" *
Huang Jiangping; Ji Donghong: "Research on paraphrase recognition based on sentence semantic distance" *

Also Published As

Publication number Publication date
CN111191024B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN108073568B (en) Keyword extraction method and device
CN107861939B (en) Domain entity disambiguation method fusing word vector and topic model
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
WO2020244073A1 (en) Speech-based user classification method and device, computer apparatus, and storage medium
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
US20150095017A1 (en) System and method for learning word embeddings using neural language models
CN108269125B (en) Comment information quality evaluation method and system and comment information processing method and system
Chen et al. Jointly modeling inter-slot relations by random walk on knowledge graphs for unsupervised spoken language understanding
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
JP4904496B2 (en) Document similarity derivation device and answer support system using the same
US20200278976A1 (en) Method and device for evaluating comment quality, and computer readable storage medium
CN110414004A (en) A kind of method and system that core information extracts
CN113051368B (en) Double-tower model training method, retrieval device and electronic equipment
CN112347241A (en) Abstract extraction method, device, equipment and storage medium
CN110705247A (en) Based on x2-C text similarity calculation method
Yoshino et al. Dialogue state tracking using long short term memory neural networks
CN112818091A (en) Object query method, device, medium and equipment based on keyword extraction
CN113449084A (en) Relationship extraction method based on graph convolution
CN110347833B (en) Classification method for multi-round conversations
CN113326374B (en) Short text emotion classification method and system based on feature enhancement
CN116245139B (en) Training method and device for graph neural network model, event detection method and device
CN111460117A (en) Dialog robot intention corpus generation method, device, medium and electronic equipment
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN111563361A (en) Text label extraction method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant