CN106776559B - Text semantic similarity calculation method and device

Info

Publication number: CN106776559B
Application number: CN201611155781.7A
Authority: CN (China)
Prior art keywords: word, words, bag, text, vector
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN106776559A
Inventor: 赵耕弘
Current Assignee: Neusoft Corp
Original Assignee: Neusoft Corp
Application filed by Neusoft Corp; priority to CN201611155781.7A
Publication of CN106776559A (application published); publication of CN106776559B (application granted)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a text semantic similarity calculation method and device, relating to the technical field of natural language processing and addressing the low accuracy of conventional text similarity calculation methods. The method comprises: merging the words in a first bag of words corresponding to a first text with the words in a second bag of words corresponding to a second text to obtain a dimension bag of words; performing vectorization on the first bag of words and the second bag of words with a semantics-based word-to-vector tool to obtain a first vector and a second vector, whose dimensions correspond one-to-one to the words in the dimension bag of words; and calculating the similarity value of the first vector and the second vector with a vector similarity calculation algorithm to obtain the similarity result of the first text and the second text. The method and device are applied in the process of calculating text similarity.

Description

Text semantic similarity calculation method and device
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text semantic similarity calculation method and device.
Background
In natural language processing, computing the similarity between texts is a basic text-processing operation. Text similarity (or, conversely, the distance between texts) supports tasks such as duplicate detection, hot-topic extraction, and interest discovery. In addition, text similarity serves as a pre-operation for more complex computations over large text collections, such as clustering or classification. For such complex operations, the precision of the text-similarity pre-operation directly affects the final result.
As unstructured data, text is generally treated in computation as an object with effectively unbounded dimensionality, so structured dimension reduction is required before the similarity between texts can be calculated. The currently common dimension-reduction methods rely on word-frequency statistics or on term frequency-inverse document frequency (TFIDF) values. However, these approaches operate on the occurrence probabilities of words: similarity can only be computed along dimensions of identical words, not along dimensions of different but synonymous words. When two texts use different words, similarity must be computed using only the word dimensions they share, and the shared words alone are unlikely to fully reflect the semantic features of the texts, so the finally calculated similarity result usually fails to reflect the semantic similarity between the texts accurately.
Disclosure of Invention
In view of the above problems, the present invention provides a text semantic similarity calculation method and device to address the low accuracy of existing text semantic similarity calculation methods.
In order to solve the above technical problem, in a first aspect, the present invention provides a method for calculating text semantic similarity, where the method includes:
combining words in a first bag of words corresponding to a first text with words in a second bag of words corresponding to a second text to obtain a dimension bag of words, wherein the first text and the second text are the texts whose similarity is to be calculated, the words in the first bag of words are obtained by segmenting the first text, and the words in the second bag of words are obtained by segmenting the second text;
performing vectorization on the first bag of words and the second bag of words with a semantics-based word-to-vector tool to obtain a first vector and a second vector, wherein the dimensions of the first vector and the second vector correspond one-to-one to the words in the dimension bag of words;
and calculating the similarity value of the first vector and the second vector according to a vector similarity calculation algorithm to obtain the similarity result of the first text and the second text.
In a second aspect, the present invention provides an apparatus for calculating semantic similarity of texts, the apparatus comprising:
a merging unit, configured to merge words in a first bag of words corresponding to a first text with words in a second bag of words corresponding to a second text to obtain a dimension bag of words, wherein the first text and the second text are the texts whose similarity is to be calculated, the words in the first bag of words are obtained by segmenting the first text, and the words in the second bag of words are obtained by segmenting the second text;
a vectorization unit, configured to perform vectorization on the first bag of words and the second bag of words with a semantics-based word-to-vector tool to obtain a first vector and a second vector, wherein the dimensions of the first vector and the second vector correspond one-to-one to the words in the dimension bag of words;
and a similarity calculation unit, configured to calculate the similarity value of the first vector and the second vector according to a vector similarity calculation algorithm to obtain the similarity result of the first text and the second text.
With the above technical scheme, during the dimension-reduction processing of the two texts whose similarity is calculated, the dimension words of the vector obtained for each text include all the words of both texts. Similarity calculation therefore no longer has to be restricted to the dimensions of words shared by the two texts, so the semantic features of each text can be fully reflected. In addition, the semantics-based word-to-vector tool provides semantic support during the vectorization of the texts, so the similarity relations between different synonymous words are fully taken into account. As a result, the finally calculated similarity between the texts is more accurate.
The foregoing is only an overview of the technical solutions of the present invention. In order that the technical means of the invention may be understood more clearly, and that the above and other objects, features, and advantages of the invention may become more readily apparent, embodiments of the invention are described below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart illustrating a method for calculating semantic similarity of texts according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating another method for calculating semantic similarity of texts according to an embodiment of the present invention;
FIG. 3 is a block diagram illustrating an apparatus for calculating semantic similarity of texts according to an embodiment of the present invention;
fig. 4 shows a block diagram of another apparatus for calculating semantic similarity of text according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to solve the problem of the low accuracy of existing text semantic similarity calculation methods, an embodiment of the present invention provides a text semantic similarity calculation method. As shown in FIG. 1, the method includes the following steps:
101. Merge the words in a first bag of words corresponding to a first text with the words in a second bag of words corresponding to a second text to obtain a dimension bag of words.
The first text and the second text are the texts whose similarity is to be calculated. The words in the first bag of words are obtained by segmenting the first text and removing stop words, and the words in the second bag of words are obtained by segmenting the second text and removing stop words. It should be noted that the words within the first bag of words do not repeat, and likewise for the second bag of words.
A specific example illustrates how the words of the first bag of words corresponding to the first text and of the second bag of words corresponding to the second text are merged into a dimension bag of words. Assume the first bag of words is A' and the second bag of words is B', with the following words, where w denotes a word:
A'=[wa1,wa2,wa3,wa4,wa5…]
B'=[wb1,wb2,wb3,wb4,wb5…]
Merging the first bag of words A' and the second bag of words B' yields the dimension bag of words C:
C=[wa1,wa2,wa3,wa4,wa5…,wb1,wb2,wb3,wb4,wb5…]
It should be noted that the order of the first-bag and second-bag words within the dimension bag of words is not limited. In addition, to reduce the complexity of subsequent calculation, that is, to reduce the number of dimensions of the first vector and second vector obtained in later steps, when the first bag of words and the second bag of words contain identical words, only one copy of each identical word may be kept in the merged dimension bag of words; that is, the words in the dimension bag of words do not repeat.
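As a minimal sketch of this merge-with-deduplication step (the function and variable names are illustrative, not from the patent), in Python:

```python
def build_dimension_bag(bag_a, bag_b):
    """Merge two bags of words into one dimension bag, keeping each word once.

    The relative order of the words is not significant; words of bag_b that
    already appear in bag_a are skipped, so the result does not repeat.
    """
    dimension_bag = list(bag_a)
    for word in bag_b:
        if word not in dimension_bag:
            dimension_bag.append(word)
    return dimension_bag

# With the toy texts used later in this description:
bag_a = ["I", "love", "eat", "apple"]
bag_b = ["he", "like", "eat", "banana"]
print(build_dimension_bag(bag_a, bag_b))
# ['I', 'love', 'eat', 'apple', 'he', 'like', 'banana']
```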
102. Perform vectorization on the first bag of words and the second bag of words with a semantics-based word-to-vector tool to obtain a first vector and a second vector.
Existing common semantics-based word-to-vector tools include Word2Vec, GloVe, and the like. This embodiment takes Word2Vec as the example; in practice any semantics-based word-to-vector tool may be used. Word2Vec is an open-source, efficient tool for representing words as real-valued vectors: using ideas from deep learning, it is trained to map words into a K-dimensional vector space. In this embodiment, the words in the first bag of words and the second bag of words are converted by Word2Vec into vectors of a preset dimension, similarity is calculated between the words of the first bag of words and the words of the dimension bag of words and between the words of the second bag of words and the words of the dimension bag of words, and the first bag of words and the second bag of words are thereby vectorized. It should be noted that the preset dimension is set freely according to actual requirements, for example 100 or 200; in general, the larger the preset dimension, the more accurately the converted vectors express the semantic features of the words.
It should be noted that when the first bag of words and the second bag of words are vectorized, the similarity between a word of the first or second bag of words and a word of the dimension bag of words is measured as the similarity between the word vectors produced by the semantics-based word-to-vector tool.
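For illustration, word vectors and word-to-word similarities could be obtained with the gensim implementation of Word2Vec roughly as follows. Gensim, the toy corpus, and the parameter values are assumptions of this sketch, not details fixed by the patent (the keyword is vector_size in gensim 4.0 and later; older releases call it size):

```python
from gensim.models import Word2Vec

# Train on a tokenized corpus; in practice a large domain corpus is used.
# vector_size is the "preset dimension" discussed above (e.g., 100 or 200).
sentences = [["I", "love", "eat", "apple"],
             ["he", "like", "eat", "banana"]]  # toy stand-in corpus
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

vec = model.wv["apple"]                       # 100-dimensional real-valued vector
sim = model.wv.similarity("apple", "banana")  # similarity between word vectors
print(vec.shape, sim)
```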
103. Calculate the similarity value of the first vector and the second vector according to a vector similarity calculation algorithm to obtain the similarity result of the first text and the second text.
The first vector and the second vector obtained in step 102 are real-valued vectors of the same dimension, so their similarity value can be calculated with a vector similarity calculation algorithm; the similarity value between the first vector and the second vector is the similarity result of the first text and the second text.
It should be noted that the vector similarity calculation algorithm may be any existing algorithm that computes the similarity between vectors, such as cosine similarity, a relative-entropy algorithm, or a covariance algorithm.
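A minimal sketch of step 103 with cosine similarity (the function name is illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# With the example vectors used later in this description:
print(cosine_similarity([1, 1, 1, 1, 0.7, 0.8, 0.7],
                        [0.7, 0.8, 1, 0.7, 1, 1, 1]))  # ≈ 0.96
```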
In the text semantic similarity calculation method provided by this embodiment of the invention, during the dimension-reduction processing of the two texts whose similarity is calculated, the dimension words of the vector obtained for each text include all the words of both texts. Similarity calculation therefore no longer has to be restricted to the dimensions of words shared by the two texts, so the semantic features of each text can be fully reflected. In addition, the semantics-based word-to-vector tool provides semantic support during the vectorization of the texts, so the similarity relations between different synonymous words are fully taken into account. As a result, the finally calculated similarity between the texts is more accurate.
As a refinement and expansion of the method shown in FIG. 1, this embodiment further provides a text semantic similarity calculation method, as shown in FIG. 2:
201. Determine whether the number of words contained in the first bag of words and in the second bag of words is greater than a preset threshold.
In practical applications, the two texts whose similarity is calculated may be too long. When a text is too long, the dimension bag of words becomes too large, which bloats the final text-vector representation and increases the computational complexity. Therefore, after the first bag of words and the second bag of words corresponding to the first text and the second text are obtained, the number of words in each bag needs to be checked to determine whether the text is too long. The preset threshold can be set freely according to actual requirements.
202. If the number of words contained in the first bag of words and/or the second bag of words is greater than the preset threshold, truncate the words of the first bag of words and/or the second bag of words.
If the number of words contained in the first bag of words and/or the second bag of words is greater than the preset threshold, the corresponding text belongs to the too-long case described in step 201, so the first bag of words and/or the second bag of words must be truncated to bring its word count down to within the preset threshold. The truncation of the first bag of words and/or the second bag of words proceeds as follows; a code sketch is given after the steps.
First, calculate the importance (TFIDF) value of each word in the first bag of words and/or the second bag of words.
The TFIDF value evaluates how important a word is to a document within a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus. TFIDF is the product of two factors, TFIDF = TF × IDF, where TF is the term frequency and IDF is the inverse document frequency. TF is the frequency with which a word occurs in the text. The main idea of IDF is that the fewer the texts containing a term t, the larger its IDF, and the better the term t discriminates between categories. The TFIDF values of the words in the first bag of words and the second bag of words can be calculated from their TF values in the corresponding bag of words and their IDF values in the corresponding corpus.
It should be noted that the corresponding corpus is an IDF library trained in advance; the training samples of the IDF library may be a text collection from a particular domain or a text collection gathered by a text-collection platform (such as Baidu search).
Second, sort the words in the first bag of words and/or the second bag of words in descending order of TFIDF value.
Finally, extract the preset number of top-ranked words from the sorted result.
The larger the TFIDF value, the more important the corresponding word is within the first or second bag of words, so when truncation is required, the words with small TFIDF values are deleted and the words with large TFIDF values are kept. In this embodiment, the preset number of extracted words may be the preset threshold number of words. In practice, the preset threshold and the preset number can be set freely according to actual requirements.
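A sketch of the truncation, assuming TF is computed from the segmented text before deduplication and that a pre-trained IDF table is available (both are assumptions of this sketch; the patent fixes neither detail):

```python
from collections import Counter

def truncate_bag(tokens, idf, keep_n):
    """Keep the keep_n distinct words with the largest TF-IDF values.

    `tokens` is the segmented text before deduplication, so that term
    frequency is meaningful; `idf` maps a word to its pre-trained IDF value,
    with a fallback for unseen words (an assumption of this sketch).
    """
    tf = Counter(tokens)
    total = sum(tf.values())
    tfidf = {w: (tf[w] / total) * idf.get(w, 1.0) for w in tf}
    # Sort in descending order of TF-IDF and keep the top keep_n words.
    return sorted(tfidf, key=tfidf.get, reverse=True)[:keep_n]
```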
In addition, if the number of words contained in the first bag of words and/or the second bag of words is less than or equal to the preset threshold, step 203 is executed.
203. Merge the words in the first bag of words corresponding to the first text with the words in the second bag of words corresponding to the second text to obtain a dimension bag of words.
The implementation of this step is the same as that of step 101 in fig. 1, and is not described here again.
204. Calculate, according to the semantics-based word-to-vector tool, the similarity values between each word in the dimension bag of words and all the words in the first bag of words and in the second bag of words.
Before these similarity values are calculated, the words in the bags must be converted into word vectors with the semantics-based word-to-vector tool; calculating the similarity between two words then becomes calculating the similarity between their word vectors. Any existing inter-vector similarity algorithm may be used for this, such as cosine similarity, a relative-entropy algorithm, or a covariance algorithm.
205. For each word in the dimension bag of words, determine the maximum of its similarity values with all words in the first bag of words as the dimension value of the corresponding dimension of the first vector, thereby obtaining the first vector.
A specific example: assume the words in the dimension bag of words C are W1a, W2a, W3a, W1b, W2b, W3b; the words in the first bag of words A are W1a, W2a, W3a; and the words in the second bag of words B are W1b, W2b, W3b. Calculate the similarity values of W1a with W1a, W2a, and W3a, obtaining L1, L2, and L3. If L1 is the largest of L1, L2, and L3, L1 is taken as the dimension value of the W1a dimension of the first vector. Following the same procedure as for the W1a dimension, the dimension values of the W2a, W3a, W1b, W2b, and W3b dimensions of the first vector are obtained in turn. It should be noted that, per step 203, the dimensions of the first vector correspond one-to-one to the words in the dimension bag of words.
It should be noted that, since the words of the first bag of words are included in the dimension bag of words, the dimension values of the first vector for the dimension words identical to words of the first bag of words are all 1. For the above example, the first vector therefore has the approximate form [1, 1, 1, L1a, L2a, L3a].
206. For each word in the dimension bag of words, determine the maximum of its similarity values with all words in the second bag of words as the dimension value of the corresponding dimension of the second vector, thereby obtaining the second vector.
A specific example with the same bags as above: calculate the similarity values of W1a with W1b, W2b, and W3b, obtaining L3, L4, and L5. If L3 is the largest of L3, L4, and L5, L3 is taken as the dimension value of the W1a dimension of the second vector. Following the same procedure as for the W1a dimension, the dimension values of the W2a, W3a, W1b, W2b, and W3b dimensions of the second vector are obtained in turn. It should be noted that, per step 203, the dimensions of the second vector correspond one-to-one to the words in the dimension bag of words.
It should be noted that, because the words of the second bag of words are included in the dimension bag of words, the maximum similarity value between a dimension word identical to a word of the second bag of words and the words of the second bag of words is 1; that is, those dimension values of the second vector are all 1. For the above example, the second vector therefore has the approximate form [L1b, L2b, L3b, 1, 1, 1].
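Steps 204 to 206 can be sketched as a single helper. Here word_sim stands for any word-to-word similarity derived from the word vectors (for example gensim's model.wv.similarity); the names are illustrative:

```python
def vectorize(bag, dimension_bag, word_sim):
    """Build a text vector over the dimension bag of words.

    Each dimension value is the maximum similarity between the dimension
    word and any word of `bag`; a word that belongs to `bag` itself gets
    the value 1 (its similarity with itself).
    """
    return [max(word_sim(dim_word, w) for w in bag) for dim_word in dimension_bag]
```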
It should be noted that the order of steps 205 and 206 is not limited: they may be executed simultaneously, or either may be executed first.
207. Calculate the similarity value of the first vector and the second vector according to the cosine similarity algorithm.
In this embodiment, the cosine similarity algorithm is applied to calculate the similarity value of the first vector and the second vector obtained in steps 205 and 206; this similarity value is the similarity value of the corresponding first text and second text.
In addition, it should be noted that the algorithm for calculating the similarity value of the first vector and the second vector may also be other vector similarity calculation algorithms, such as a relative entropy calculation algorithm, a covariance calculation algorithm, and the like.
In practical applications, the first text and the second text often differ greatly in length, so the numbers of words in the corresponding first and second bags of words also differ greatly. For the similarity calculation between a long text and a much shorter one, in order to ensure the accuracy of the similarity result, the long text's bag of words must also be truncated before the first and second bags of words are merged. The truncation manner is the same as in step 202: according to the TFIDF values of the words in the bag, the words with small TFIDF values are deleted and the words with large TFIDF values are kept.
It should be noted that whether the first text and the second text differ greatly in length is determined as follows: first, calculate the ratio of the number of words contained in the first bag of words to the number of words contained in the second bag of words, where the number of words contained in the first bag of words is greater than or equal to the number contained in the second bag of words; then compare the ratio with a preset ratio. If the ratio exceeds the preset ratio, the two texts are determined to differ greatly in length, and the first bag of words is truncated so that the ratio falls to within the preset ratio. The preset ratio can be set according to user requirements, for example 1, 1.1, or 1.2.
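This length-ratio check might be sketched as follows, reusing the truncate_bag sketch above; preset_ratio = 1.2 is one of the example settings, and the function name is illustrative:

```python
def balance_bags(bag_a, bag_b, idf, preset_ratio=1.2):
    """Truncate the longer bag until len(longer)/len(shorter) <= preset_ratio."""
    if len(bag_a) < len(bag_b):
        bag_a, bag_b = bag_b, bag_a          # bag_a is now the longer bag
    if len(bag_a) / len(bag_b) > preset_ratio:
        keep_n = int(len(bag_b) * preset_ratio)
        bag_a = truncate_bag(bag_a, idf, keep_n)  # TF-IDF truncation, as above
    return bag_a, bag_b
```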
Finally, this embodiment gives a concrete run of the text semantic similarity calculation. Assume the two texts are as follows:
First text A: "I love eating apples"
Second text B: "He likes eating bananas"
Segmenting A yields the first bag of words A' = ["I", "love", "eat", "apple"]
Segmenting B yields the second bag of words B' = ["he", "like", "eat", "banana"]
The first bag of words and the second bag of words share the word "eat"; to reduce computational complexity, this example keeps only one "eat" when the bags are merged.
The dimension bag of words obtained by merging A' and B' is therefore C = ["I", "love", "eat", "apple", "he", "like", "banana"]
After vectorization is performed according to Word2Vec, the obtained first vector and second vector are respectively:
A'=[1,1,1,1,0.7,0.8,0.7]
B'=[0.7,0.8,1,0.7,1,1,1]
calculating the similarity values of the first vector and the second vector according to a cosine similarity algorithm:
similarity(A', B') = Σᵢ(A'ᵢ × B'ᵢ) / (√(Σᵢ A'ᵢ²) × √(Σᵢ B'ᵢ²)) = 5.4 / (√5.62 × √5.62) ≈ 0.96
if the existing text semantic similarity calculation method based on word frequency is used, the result of calculating the similarity between the first text and the second text in the above example is 0.25. It can be seen that the text semantic similarity calculation in this example is more accurate.
Further, as an implementation of the foregoing embodiments, another embodiment of the present invention provides a text semantic similarity calculation device for implementing the methods described in FIG. 1 and FIG. 2. As shown in FIG. 3, the device includes: a merging unit 31, a vectorization unit 32, and a similarity calculation unit 33.
The merging unit 31 is configured to merge words in a first bag of words corresponding to a first text with words in a second bag of words corresponding to a second text to obtain a dimension bag of words, where the first text and the second text are the texts whose similarity is calculated, the words in the first bag of words are obtained by segmenting the first text, and the words in the second bag of words are obtained by segmenting the second text.
The first text and the second text are the texts whose similarity is to be calculated. The words in the first bag of words are obtained by segmenting the first text and removing stop words, and the words in the second bag of words are obtained by segmenting the second text and removing stop words. It should be noted that the words within the first bag of words do not repeat, and likewise for the second bag of words.
A specific example illustrates how the words of the first bag of words corresponding to the first text and of the second bag of words corresponding to the second text are merged into a dimension bag of words. Assume the first bag of words is A' and the second bag of words is B', with the following words, where w denotes a word:
A'=[wa1,wa2,wa3,wa4,wa5…]
B'=[wb1,wb2,wb3,wb4,wb5…]
Merging the first bag of words A' and the second bag of words B' yields the dimension bag of words C:
C=[wa1,wa2,wa3,wa4,wa5…,wb1,wb2,wb3,wb4,wb5…]
It should be noted that the order of the first-bag and second-bag words within the dimension bag of words is not limited. In addition, to reduce the complexity of subsequent calculation, that is, to reduce the number of dimensions of the first vector and second vector obtained subsequently, when the first bag of words and the second bag of words contain identical words, only one copy of each identical word may be kept in the merged dimension bag of words; that is, the words in the dimension bag of words do not repeat.
The vectorization unit 32 is configured to perform vectorization on the first bag of words and the second bag of words with a semantics-based word-to-vector tool to obtain a first vector and a second vector, where the dimensions of the first vector and the second vector correspond one-to-one to the words in the dimension bag of words.
Existing common semantics-based word-to-vector tools include Word2Vec, GloVe, and the like. This embodiment takes Word2Vec as the example; in practice any semantics-based word-to-vector tool may be used. Word2Vec is an open-source, efficient tool for representing words as real-valued vectors: using ideas from deep learning, it is trained to map words into a K-dimensional vector space. In this embodiment, the words in the first bag of words and the second bag of words are converted by Word2Vec into vectors of a preset dimension, similarity is calculated between the words of the first bag of words and the words of the dimension bag of words and between the words of the second bag of words and the words of the dimension bag of words, and the first bag of words and the second bag of words are thereby vectorized. It should be noted that the preset dimension is set freely according to actual requirements, for example 100 or 200; in general, the larger the preset dimension, the more accurately the converted vectors express the semantic features of the words.
It should be noted that when the first bag of words and the second bag of words are vectorized, the similarity between a word of the first or second bag of words and a word of the dimension bag of words is measured as the similarity between the word vectors produced by the semantics-based word-to-vector tool.
The similarity calculation unit 33 is configured to calculate the similarity value of the first vector and the second vector according to a vector similarity calculation algorithm to obtain the similarity result of the first text and the second text.
The first vector and the second vector obtained by the vectorization unit 32 are real-valued vectors of the same dimension, so their similarity value can be calculated with a vector similarity calculation algorithm; the similarity value between the first vector and the second vector is the similarity result of the first text and the second text.
It should be noted that the vector similarity calculation algorithm may be any existing algorithm that computes the similarity between vectors, such as cosine similarity, a relative-entropy algorithm, or a covariance algorithm.
Specifically, the formula for calculating the similarity value of the first vector and the second vector according to the cosine similarity algorithm is:
similarity(A', B') = Σᵢ(A'ᵢ × B'ᵢ) / (√(Σᵢ A'ᵢ²) × √(Σᵢ B'ᵢ²))
where A'ᵢ denotes the ith dimension value of the first vector A' and B'ᵢ denotes the ith dimension value of the second vector B'.
As shown in fig. 4, the vectorization unit 32 includes:
the first calculating module 321 is configured to calculate similarity values between each word in the dimension word bag and all words in the first word bag and the second word bag according to a semantic-based word transformation vector tool;
before calculating the similarity value of each word in the dimension word bag and all words in the first word bag and the second word bag respectively, the words in the word bags need to be converted into corresponding word vectors by applying a semantic-based word conversion vector tool. Calculating the similarity value between words is converted into calculating the similarity value between word vectors and word vectors. The method for calculating the similarity value between the word vectors may use any one of the existing algorithms that can calculate the similarity between the vectors, such as cosine similarity algorithm, relative entropy calculation algorithm, covariance calculation algorithm, and the like.
A first determining module 322, configured to determine a maximum value of similarity values between each word in the dimension word bag and all words in the first word bag as a dimension value of a corresponding dimension in the first vector, respectively, to obtain a first vector;
given a specific example for explanation, assume that the words in dimension bag C include W1a、W2a、W3a、W1b、W2b、W3bThe words in the first bag A include W1a、W2a、W3aThe words in the second bag B include W1b、W2b、W3bCalculate W1aRespectively with W1a、W2a、W3aThe similarity values of (a) are L1, L2 and L3, and if L1 in L1, L2 and L3 is the largest, the L1 is taken as W1 in the first vectoraDimension value of the dimension according to W1 in the first vectoraThe dimension value corresponding method can respectively obtain W2 in the first vector in turna、W3a、W1b、W2b、W3bThe dimension values of the corresponding dimensions, it should be noted that the vectorization unit 32 can know that the dimensions in the first vector correspond to the words in the dimension word bag one to one.
It should be noted that, because the word in the first word bag is included in the dimension word bag, the dimension values of the corresponding words in the dimension word bag, which are the same as the word in the first word bag, in the first vector are all 1, which corresponds to that the corresponding words in the dimension word bag are all corresponding to the first word bagIn the above example, the resulting first vector has the approximate form [1,1,1, L1 [ ]a,L2a,L3a]。
The second determining module 323 is configured to determine a maximum value of similarity values between each word in the dimension word bag and all words in the second word bag as a dimension value of a corresponding dimension in the second vector, so as to obtain a second vector.
Given a specific example for explanation, assume that the words in dimension bag C include W1a、W2a、W3a、W1b、W2b、W3bThe words in the first bag A include W1a、W2a、W3aThe words in the second bag B include W1b、W2b、W3bCalculate W1aRespectively with W1b、W2b、W3bThe similarity values of (a) are L3, L4 and L5, and if L3 in L3, L4 and L5 is the largest, the L3 is taken as W1 in the second vectoraDimension value of the dimension according to W1 in the second vectoraThe dimension value corresponding to the dimension can be obtained by the method, and the W2 in the second vector can be obtained respectively in sequencea、W3a、W1b、W2b、W3bThe dimension values of the corresponding dimensions, it should be noted that the vectorization unit 32 can know that the dimensions in the second vector correspond to the words in the dimension word bag one to one.
It should be noted that, because the words in the second word bag are included in the dimension word bag, the maximum value of the similarity value between the corresponding words in the dimension word bag that are the same as the words in the second word bag and the words in the second word bag is 1, that is, the dimension values in the second vector are all 1. In the corresponding above example, the resulting second vector has the approximate form [ L1 [ ]b,L2b,L3b,1,1,1]。
As shown in fig. 4, the apparatus further comprises:
a ratio calculating unit 34, configured to calculate, before the words in the first bag of words corresponding to the first text and the words in the second bag of words corresponding to the second text are merged into a dimension bag of words, the ratio of the number of words contained in the first bag of words to the number of words contained in the second bag of words, where the number of words contained in the first bag of words is greater than or equal to the number of words contained in the second bag of words;
a comparison unit 35, configured to compare the ratio with a preset ratio;
the intercepting unit 36 is configured to intercept words from the first word bag if the ratio exceeds a preset ratio, so that the ratio is reduced to be within the preset ratio;
the executing unit 37 is configured to, if the ratio does not exceed the preset ratio, execute merging of words in the first word bag corresponding to the first text and the second word bag corresponding to the second text to obtain a dimension word bag.
As shown in fig. 4, the apparatus further comprises:
the judging unit 38 is configured to judge whether the number of words contained in the first word bag and the second word bag is greater than a preset threshold value before words in the first word bag corresponding to the first text and words in the second word bag corresponding to the second text are combined to obtain a dimension word bag;
in practical application, the length of two texts for similarity calculation may be too long, and when the length of the text is too long, the dimension bag of words is too large, which results in the final expression of text vectors, and increases the complexity of calculation. Therefore, after the first bag of words and the second bag of words corresponding to the first text and the second text are obtained, the number of words in the bag of words needs to be determined, so as to determine whether the text is too long. The setting of the preset threshold value can be freely set according to actual requirements.
The intercepting unit 36 is further configured to intercept words from the first bag of words and/or the second bag of words if the number of words included in the first bag of words and/or the second bag of words is greater than a preset threshold, so that the number of words in the first bag of words and/or the second bag of words is reduced to be within the preset threshold.
In practical applications, a situation that lengths of a first text and a second text are greatly different is also commonly encountered, numbers of words in a corresponding first word bag and a corresponding second word bag are also greatly different, and for similarity calculation between a long text and a short text with a large difference, in order to ensure accuracy of a similarity result, word interception is further required on the long text before merging the first word bag and the second word bag corresponding to the first text and the second text.
As shown in fig. 4, the intercept unit 36 includes:
the second calculating module 361 is used for calculating the importance TFIDF value of each word in the first word bag and/or the second word bag;
the TFIDF value is used to evaluate the importance of a word or words to a set of documents or a corpus, where the importance of a word increases in direct proportion to the number of times it appears in a document, but decreases in inverse proportion to the frequency with which it appears in the corpus. TFIDF is actually: TF and IDF, TF is Term Frequency (Term Frequency) and IDF is Inverse Document Frequency (Inverse Document Frequency). TF represents the frequency of occurrence of a word in the text. The main idea of IDF is: if the text containing a certain entry t is less, the corresponding IDF is larger, and the entry t has good category distinguishing capability. TFIDFs of the words in the first bag of words and the second bag of words may be calculated from TF values of the words in the first bag of words and/or the second bag of words in the corresponding bag of words and IDF values in the corresponding corpus.
It should be noted that the corresponding corpus is an IDF library trained in advance, and for a training sample of the IDF library, a text set in a certain field or a text set acquired by a certain text acquisition platform (such as a hundred-degree search) may be used.
A sorting module 362 for sorting the words in the first word bag and/or the second word bag according to the order of the sizes of the TFIDF values;
and the extracting module 363 is configured to extract a preset number of words according to the sorted result sequence.
The larger the TFIDF value is, the more important the word corresponding to the TFIDF value is in the first word bag or the second word bag, so when word truncation is required, the word with the small TFIDF value needs to be deleted, and the word with the large TFIDF value is reserved.
According to the text semantic similarity calculation device provided by the embodiment of the invention, in the process of performing dimension reduction processing on two texts subjected to similarity calculation, the dimension words in the vector corresponding to the obtained text comprise all words in the two texts, so that the similarity calculation is not required to be performed by only selecting the dimension of the same word in the text, the semantic characteristics of each text can be completely reflected, semantic support is further provided by a semantic-based word conversion vector tool during the vectorization calculation of the text, and the similarity relevance between different synonyms can be fully considered. Therefore, the result of the similarity between texts obtained by final calculation is more accurate.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be appreciated that the relevant features of the method and apparatus described above are referred to one another. In addition, "first", "second", and the like in the above embodiments are for distinguishing the embodiments, and do not represent merits of the embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components of a text semantic similarity calculation device according to embodiments of the invention. The invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the invention may be stored on computer-readable media, or may take the form of one or more signals; such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims (8)

1. A method for calculating text semantic similarity, the method comprising:
combining words in a first bag of words corresponding to a first text with words in a second bag of words corresponding to a second text to obtain a dimension bag of words, wherein the first text and the second text are texts whose similarity is calculated, the words in the first bag of words are words obtained by segmenting the first text, and the words in the second bag of words are words obtained by segmenting the second text;
performing vectorization calculation on the first bag of words and the second bag of words according to a semantics-based word-to-vector tool to obtain a first vector and a second vector, wherein the dimensions of the first vector and the second vector correspond one-to-one to the words in the dimension bag of words;
and calculating the similarity value of the first vector and the second vector according to a vector similarity calculation algorithm to obtain a similarity result of the first text and the second text;
wherein performing vectorization calculation on the first bag of words and the second bag of words according to the semantics-based word-to-vector tool to obtain the first vector and the second vector comprises:
converting the words in the dimension bag of words, the first bag of words, and the second bag of words into corresponding word vectors by applying the semantics-based word-to-vector tool;
calculating, according to the semantics-based word-to-vector tool, similarity values between each word in the dimension bag of words and all words in the first bag of words and the second bag of words;
determining, for each word in the dimension bag of words, the maximum of its similarity values with all words in the first bag of words as the dimension value of the corresponding dimension of the first vector, to obtain the first vector;
and determining, for each word in the dimension bag of words, the maximum of its similarity values with all words in the second bag of words as the dimension value of the corresponding dimension of the second vector, to obtain the second vector.
2. The method of claim 1, wherein before combining the words in the first bag of words corresponding to the first text with the words in the second bag of words corresponding to the second text to obtain the dimension bag of words, the method further comprises:
calculating the ratio of the number of words contained in the first bag of words to the number of words contained in the second bag of words, wherein the number of words contained in the first bag of words is greater than or equal to the number of words contained in the second bag of words;
comparing the ratio with a preset ratio;
if the ratio exceeds the preset ratio, performing word truncation on the first bag of words to reduce the ratio to within the preset ratio;
and if the ratio does not exceed the preset ratio, combining the words in the first bag of words corresponding to the first text with the words in the second bag of words corresponding to the second text to obtain the dimension bag of words.
3. The method of claim 1, wherein before combining the words in the first bag of words corresponding to the first text with the words in the second bag of words corresponding to the second text to obtain the dimension bag of words, the method further comprises:
judging whether the number of words contained in the first bag of words and the second bag of words is greater than a preset threshold;
if the number of words contained in the first bag of words and/or the second bag of words is greater than the preset threshold, performing word truncation on the first bag of words and/or the second bag of words so as to reduce the number of words in the first bag of words and/or the second bag of words to within the preset threshold.
4. The method of claim 2 or 3, wherein performing word truncation on the first bag of words and/or the second bag of words comprises:
calculating an importance (TFIDF) value of each word in the first bag of words and/or the second bag of words;
sorting the words in the first bag of words and/or the second bag of words in descending order of TFIDF value;
and extracting a preset number of top-ranked words from the sorted result.
5. An apparatus for text semantic similarity calculation, the apparatus comprising:
a merging unit, configured to merge words in a first bag of words corresponding to a first text with words in a second bag of words corresponding to a second text to obtain a dimension bag of words, wherein the first text and the second text are texts whose similarity is calculated, the words in the first bag of words are words obtained by segmenting the first text, and the words in the second bag of words are words obtained by segmenting the second text;
a vectorization unit, configured to perform vectorization calculation on the first bag of words and the second bag of words according to a semantics-based word-to-vector tool to obtain a first vector and a second vector, wherein the dimensions of the first vector and the second vector correspond one-to-one to the words in the dimension bag of words;
and a similarity calculation unit, configured to calculate the similarity value of the first vector and the second vector according to a vector similarity calculation algorithm to obtain a similarity result of the first text and the second text;
wherein the vectorization unit comprises:
a first calculation module, configured to convert the words in the dimension bag of words, the first bag of words, and the second bag of words into corresponding word vectors by applying the semantics-based word-to-vector tool, and to calculate, according to the tool, similarity values between each word in the dimension bag of words and all words in the first bag of words and the second bag of words;
a first determining module, configured to determine, for each word in the dimension bag of words, the maximum of its similarity values with all words in the first bag of words as the dimension value of the corresponding dimension of the first vector, to obtain the first vector;
and a second determining module, configured to determine, for each word in the dimension bag of words, the maximum of its similarity values with all words in the second bag of words as the dimension value of the corresponding dimension of the second vector, to obtain the second vector.
6. The apparatus of claim 5, further comprising:
a ratio calculation unit, configured to calculate, before the words in the first word bag corresponding to the first text and the words in the second word bag corresponding to the second text are merged to obtain the dimension word bag, the ratio of the number of words contained in the first word bag to the number of words contained in the second word bag, where the number of words contained in the first word bag is greater than or equal to the number of words contained in the second word bag;
a comparison unit, configured to compare the ratio with a preset ratio;
a truncation unit, configured to perform word truncation on the first word bag if the ratio exceeds the preset ratio, so as to reduce the ratio to the preset ratio;
and an execution unit, configured to merge the words in the first word bag and the words in the second word bag to obtain the dimension word bag if the ratio does not exceed the preset ratio.
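(A hedged sketch of the ratio check of claim 6, reusing the hypothetical truncate_bag from the sketch after claim 4; preset_ratio is an assumed tuning parameter, and the swap merely enforces the claim's convention that the first bag is the larger one.)

def balance_bags(bag1, bag2, documents, preset_ratio):
    # claim 6 assumes the first bag contains at least as many words
    if len(bag1) < len(bag2):
        bag1, bag2 = bag2, bag1
    ratio = len(bag1) / len(bag2)
    if ratio > preset_ratio:
        # truncate the larger bag so the ratio drops to the preset ratio
        target = int(preset_ratio * len(bag2))
        bag1 = truncate_bag(bag1, documents, target)
    return bag1, bag2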
7. The apparatus of claim 5, further comprising:
a judgment unit, configured to judge, before the words in the first word bag corresponding to the first text and the words in the second word bag corresponding to the second text are merged to obtain the dimension word bag, whether the number of words contained in the first word bag and/or the second word bag is greater than a preset threshold;
and a truncation unit, configured to perform word truncation on the first word bag and/or the second word bag if the number of words contained therein is greater than the preset threshold, so as to reduce the number of words in the first word bag and/or the second word bag to within the preset threshold.
8. The apparatus according to claim 6 or 7, wherein the truncation unit comprises:
a second calculation module, configured to calculate an importance (TFIDF) value for each word in the first word bag and/or the second word bag;
a sorting module, configured to sort the words in the first word bag and/or the second word bag in descending order of TFIDF value;
and an extraction module, configured to extract a preset number of words in sequence from the sorted result.
CN201611155781.7A 2016-12-14 2016-12-14 Text semantic similarity calculation method and device Active CN106776559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611155781.7A CN106776559B (en) 2016-12-14 2016-12-14 Text semantic similarity calculation method and device

Publications (2)

Publication Number Publication Date
CN106776559A (en) 2017-05-31
CN106776559B (en) 2020-08-11

Family

ID=58888867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611155781.7A Active CN106776559B (en) 2016-12-14 2016-12-14 Text semantic similarity calculation method and device

Country Status (1)

Country Link
CN (1) CN106776559B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108932647A * 2017-07-24 2018-12-04 上海宏原信息科技有限公司 Method and apparatus for predicting similar articles and training a model therefor
CN110322895B (en) * 2018-03-27 2021-07-09 亿度慧达教育科技(北京)有限公司 Voice evaluation method and computer storage medium
CN111144104B (en) * 2018-11-02 2023-06-20 中国电信股份有限公司 Text similarity determination method, device and computer readable storage medium
CN109992476B (en) * 2019-03-20 2023-08-18 网宿科技股份有限公司 Log analysis method, server and storage medium
CN110516040B (en) * 2019-08-14 2022-08-05 出门问问(武汉)信息科技有限公司 Method, device and computer storage medium for semantic similarity comparison between texts
CN112765976A (en) * 2020-12-30 2021-05-07 北京知因智慧科技有限公司 Text similarity calculation method, device and equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605702A (en) * 2013-11-08 2014-02-26 北京邮电大学 Word similarity based network text classification method
CN104102626A (en) * 2014-07-07 2014-10-15 厦门推特信息科技有限公司 Method for computing semantic similarities among short texts
CN104731771A (en) * 2015-03-27 2015-06-24 大连理工大学 Term vector-based abbreviation ambiguity elimination system and method
CN105824797A (en) * 2015-01-04 2016-08-03 华为技术有限公司 Method, device and system evaluating semantic similarity
CN105955965A (en) * 2016-06-21 2016-09-21 上海智臻智能网络科技股份有限公司 Question information processing method and device
CN106484664A * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Method for calculating similarity between short texts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant