CN106776559B - Text semantic similarity calculation method and device

Info

Publication number: CN106776559B
Application number: CN201611155781.7A
Authority: CN (China)
Prior art keywords: word, words, bag, text, vector
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN106776559A
Inventor: 赵耕弘
Current Assignee: Neusoft Corp
Original Assignee: Neusoft Corp
Application filed by Neusoft Corp; priority to CN201611155781.7A
Publication of CN106776559A (application published); publication of CN106776559B (application granted)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a text semantic similarity calculation method and device, relating to the technical field of natural language processing and addressing the low accuracy of conventional text similarity calculation methods. The method comprises: merging the words in a first bag of words corresponding to a first text with the words in a second bag of words corresponding to a second text to obtain a dimension bag of words; performing vectorization on the first bag of words and the second bag of words with a semantics-based word-to-vector tool to obtain a first vector and a second vector, whose dimensions correspond one-to-one to the words in the dimension bag of words; and calculating the similarity value of the first vector and the second vector with a vector similarity calculation algorithm to obtain the similarity result of the first text and the second text. The method and device are applied in the process of calculating text similarity.

Description

Text semantic similarity calculation method and device
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text semantic similarity calculation method and device.
Background
In natural language processing, computing the similarity between texts is a basic text-processing operation. Text similarity (or, conversely, the distance between texts) supports tasks such as duplicate detection, hot-topic extraction, and interest discovery. In addition, text similarity serves as a pre-operation for more complex computations over large text collections, such as clustering or classification. For such complex operations, the precision of the text-similarity pre-operation directly affects the final result.
As unstructured data, text is generally treated in computation as an object with effectively unbounded dimensionality, so structured dimension reduction is required before the similarity between texts can be calculated. The currently common dimension-reduction methods rely on word-frequency statistics or on term frequency-inverse document frequency (TFIDF) values. However, these approaches operate on the occurrence probabilities of words: similarity can only be computed along dimensions of identical words, not along dimensions of different but synonymous words. When two texts use different words, similarity must be computed using only the word dimensions they share, and the shared words alone are unlikely to fully reflect the semantic features of the texts, so the finally calculated similarity result usually fails to reflect the semantic similarity between the texts accurately.
Disclosure of Invention
In view of the above problems, the present invention provides a text semantic similarity calculation method and device to address the low accuracy of existing text semantic similarity calculation methods.
In order to solve the above technical problem, in a first aspect, the present invention provides a method for calculating text semantic similarity, where the method includes:
combining words in a first bag of words corresponding to a first text with words in a second bag of words corresponding to a second text to obtain a dimension bag of words, wherein the first text and the second text are the texts whose similarity is to be calculated, the words in the first bag of words are obtained by segmenting the first text, and the words in the second bag of words are obtained by segmenting the second text;
performing vectorization on the first bag of words and the second bag of words with a semantics-based word-to-vector tool to obtain a first vector and a second vector, wherein the dimensions of the first vector and the second vector correspond one-to-one to the words in the dimension bag of words;
and calculating the similarity value of the first vector and the second vector according to a vector similarity calculation algorithm to obtain the similarity result of the first text and the second text.
In a second aspect, the present invention provides an apparatus for calculating semantic similarity of texts, the apparatus comprising:
a merging unit, configured to merge words in a first bag of words corresponding to a first text with words in a second bag of words corresponding to a second text to obtain a dimension bag of words, wherein the first text and the second text are the texts whose similarity is to be calculated, the words in the first bag of words are obtained by segmenting the first text, and the words in the second bag of words are obtained by segmenting the second text;
a vectorization unit, configured to perform vectorization on the first bag of words and the second bag of words with a semantics-based word-to-vector tool to obtain a first vector and a second vector, wherein the dimensions of the first vector and the second vector correspond one-to-one to the words in the dimension bag of words;
and a similarity calculation unit, configured to calculate the similarity value of the first vector and the second vector according to a vector similarity calculation algorithm to obtain the similarity result of the first text and the second text.
With the above technical scheme, during the dimension-reduction processing of the two texts whose similarity is calculated, the dimension words of the vector obtained for each text include all the words of both texts. Similarity calculation therefore no longer has to be restricted to the dimensions of words shared by the two texts, so the semantic features of each text can be fully reflected. In addition, the semantics-based word-to-vector tool provides semantic support during the vectorization of the texts, so the similarity relations between different synonymous words are fully taken into account. As a result, the finally calculated similarity between the texts is more accurate.
The foregoing is only an overview of the technical solutions of the present invention. In order that the technical means of the invention may be understood more clearly, and that the above and other objects, features, and advantages of the invention may become more readily apparent, embodiments of the invention are described below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart illustrating a method for calculating semantic similarity of texts according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating another method for calculating semantic similarity of texts according to an embodiment of the present invention;
FIG. 3 is a block diagram illustrating an apparatus for calculating semantic similarity of texts according to an embodiment of the present invention;
fig. 4 shows a block diagram of another apparatus for calculating semantic similarity of text according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to solve the problem of the low accuracy of existing text semantic similarity calculation methods, an embodiment of the present invention provides a text semantic similarity calculation method. As shown in FIG. 1, the method includes the following steps:
101. Merge the words in a first bag of words corresponding to a first text with the words in a second bag of words corresponding to a second text to obtain a dimension bag of words.
The first text and the second text are the texts whose similarity is to be calculated. The words in the first bag of words are obtained by segmenting the first text and removing stop words, and the words in the second bag of words are obtained by segmenting the second text and removing stop words. It should be noted that the words within the first bag of words do not repeat, and likewise for the second bag of words.
A specific example illustrates how the words of the first bag of words corresponding to the first text and of the second bag of words corresponding to the second text are merged into a dimension bag of words. Assume the first bag of words is A' and the second bag of words is B', with the following words, where w denotes a word:
A'=[wa1,wa2,wa3,wa4,wa5…]
B'=[wb1,wb2,wb3,wb4,wb5…]
Merging the first bag of words A' and the second bag of words B' yields the dimension bag of words C:
C=[wa1,wa2,wa3,wa4,wa5…,wb1,wb2,wb3,wb4,wb5…]
It should be noted that the order of the first-bag and second-bag words within the dimension bag of words is not limited. In addition, to reduce the complexity of subsequent calculation, that is, to reduce the number of dimensions of the first vector and second vector obtained in later steps, when the first bag of words and the second bag of words contain identical words, only one copy of each identical word may be kept in the merged dimension bag of words; that is, the words in the dimension bag of words do not repeat.
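As a minimal sketch of this merge-with-deduplication step (the function and variable names are illustrative, not from the patent), in Python:

```python
def build_dimension_bag(bag_a, bag_b):
    """Merge two bags of words into one dimension bag, keeping each word once.

    The relative order of the words is not significant; words of bag_b that
    already appear in bag_a are skipped, so the result does not repeat.
    """
    dimension_bag = list(bag_a)
    for word in bag_b:
        if word not in dimension_bag:
            dimension_bag.append(word)
    return dimension_bag

# With the toy texts used later in this description:
bag_a = ["I", "love", "eat", "apple"]
bag_b = ["he", "like", "eat", "banana"]
print(build_dimension_bag(bag_a, bag_b))
# ['I', 'love', 'eat', 'apple', 'he', 'like', 'banana']
```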
102. Perform vectorization on the first bag of words and the second bag of words with a semantics-based word-to-vector tool to obtain a first vector and a second vector.
Existing common semantics-based word-to-vector tools include Word2Vec, GloVe, and the like. This embodiment takes Word2Vec as the example; in practice any semantics-based word-to-vector tool may be used. Word2Vec is an open-source, efficient tool for representing words as real-valued vectors: using ideas from deep learning, it is trained to map words into a K-dimensional vector space. In this embodiment, the words in the first bag of words and the second bag of words are converted by Word2Vec into vectors of a preset dimension, similarity is calculated between the words of the first bag of words and the words of the dimension bag of words and between the words of the second bag of words and the words of the dimension bag of words, and the first bag of words and the second bag of words are thereby vectorized. It should be noted that the preset dimension is set freely according to actual requirements, for example 100 or 200; in general, the larger the preset dimension, the more accurately the converted vectors express the semantic features of the words.
It should be noted that when the first bag of words and the second bag of words are vectorized, the similarity between a word of the first or second bag of words and a word of the dimension bag of words is measured as the similarity between the word vectors produced by the semantics-based word-to-vector tool.
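For illustration, word vectors and word-to-word similarities could be obtained with the gensim implementation of Word2Vec roughly as follows. Gensim, the toy corpus, and the parameter values are assumptions of this sketch, not details fixed by the patent (the keyword is vector_size in gensim 4.0 and later; older releases call it size):

```python
from gensim.models import Word2Vec

# Train on a tokenized corpus; in practice a large domain corpus is used.
# vector_size is the "preset dimension" discussed above (e.g., 100 or 200).
sentences = [["I", "love", "eat", "apple"],
             ["he", "like", "eat", "banana"]]  # toy stand-in corpus
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

vec = model.wv["apple"]                       # 100-dimensional real-valued vector
sim = model.wv.similarity("apple", "banana")  # similarity between word vectors
print(vec.shape, sim)
```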
103. Calculate the similarity value of the first vector and the second vector according to a vector similarity calculation algorithm to obtain the similarity result of the first text and the second text.
The first vector and the second vector obtained in step 102 are real-valued vectors of the same dimension, so their similarity value can be calculated with a vector similarity calculation algorithm; the similarity value between the first vector and the second vector is the similarity result of the first text and the second text.
It should be noted that the vector similarity calculation algorithm may be any existing algorithm that computes the similarity between vectors, such as cosine similarity, a relative-entropy algorithm, or a covariance algorithm.
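A minimal sketch of step 103 with cosine similarity (the function name is illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# With the example vectors used later in this description:
print(cosine_similarity([1, 1, 1, 1, 0.7, 0.8, 0.7],
                        [0.7, 0.8, 1, 0.7, 1, 1, 1]))  # ≈ 0.96
```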
In the text semantic similarity calculation method provided by this embodiment of the invention, during the dimension-reduction processing of the two texts whose similarity is calculated, the dimension words of the vector obtained for each text include all the words of both texts. Similarity calculation therefore no longer has to be restricted to the dimensions of words shared by the two texts, so the semantic features of each text can be fully reflected. In addition, the semantics-based word-to-vector tool provides semantic support during the vectorization of the texts, so the similarity relations between different synonymous words are fully taken into account. As a result, the finally calculated similarity between the texts is more accurate.
As a refinement and expansion of the method shown in FIG. 1, this embodiment further provides a text semantic similarity calculation method, as shown in FIG. 2:
201. Determine whether the number of words contained in the first bag of words and in the second bag of words is greater than a preset threshold.
In practical applications, the two texts whose similarity is calculated may be too long. When a text is too long, the dimension bag of words becomes too large, which bloats the final text-vector representation and increases the computational complexity. Therefore, after the first bag of words and the second bag of words corresponding to the first text and the second text are obtained, the number of words in each bag needs to be checked to determine whether the text is too long. The preset threshold can be set freely according to actual requirements.
202. If the number of words contained in the first bag of words and/or the second bag of words is greater than the preset threshold, truncate the words of the first bag of words and/or the second bag of words.
If the number of words contained in the first bag of words and/or the second bag of words is greater than the preset threshold, the corresponding text belongs to the too-long case described in step 201, so the first bag of words and/or the second bag of words must be truncated to bring its word count down to within the preset threshold. The truncation of the first bag of words and/or the second bag of words proceeds as follows; a code sketch is given after the steps.
First, calculate the importance (TFIDF) value of each word in the first bag of words and/or the second bag of words.
The TFIDF value evaluates how important a word is to a document within a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus. TFIDF is the product of two factors, TFIDF = TF × IDF, where TF is the term frequency and IDF is the inverse document frequency. TF is the frequency with which a word occurs in the text. The main idea of IDF is that the fewer the texts containing a term t, the larger its IDF, and the better the term t discriminates between categories. The TFIDF values of the words in the first bag of words and the second bag of words can be calculated from their TF values in the corresponding bag of words and their IDF values in the corresponding corpus.
It should be noted that the corresponding corpus is an IDF library trained in advance; the training samples of the IDF library may be a text collection from a particular domain or a text collection gathered by a text-collection platform (such as Baidu search).
Second, sort the words in the first bag of words and/or the second bag of words in descending order of TFIDF value.
Finally, extract the preset number of top-ranked words from the sorted result.
The larger the TFIDF value, the more important the corresponding word is within the first or second bag of words, so when truncation is required, the words with small TFIDF values are deleted and the words with large TFIDF values are kept. In this embodiment, the preset number of extracted words may be the preset threshold number of words. In practice, the preset threshold and the preset number can be set freely according to actual requirements.
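A sketch of the truncation, assuming TF is computed from the segmented text before deduplication and that a pre-trained IDF table is available (both are assumptions of this sketch; the patent fixes neither detail):

```python
from collections import Counter

def truncate_bag(tokens, idf, keep_n):
    """Keep the keep_n distinct words with the largest TF-IDF values.

    `tokens` is the segmented text before deduplication, so that term
    frequency is meaningful; `idf` maps a word to its pre-trained IDF value,
    with a fallback for unseen words (an assumption of this sketch).
    """
    tf = Counter(tokens)
    total = sum(tf.values())
    tfidf = {w: (tf[w] / total) * idf.get(w, 1.0) for w in tf}
    # Sort in descending order of TF-IDF and keep the top keep_n words.
    return sorted(tfidf, key=tfidf.get, reverse=True)[:keep_n]
```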
In addition, if the number of words contained in the first bag of words and/or the second bag of words is less than or equal to the preset threshold, step 203 is executed.
203. Merge the words in the first bag of words corresponding to the first text with the words in the second bag of words corresponding to the second text to obtain a dimension bag of words.
The implementation of this step is the same as that of step 101 in fig. 1, and is not described here again.
204. Calculate, according to the semantics-based word-to-vector tool, the similarity values between each word in the dimension bag of words and all the words in the first bag of words and in the second bag of words.
Before these similarity values are calculated, the words in the bags must be converted into word vectors with the semantics-based word-to-vector tool; calculating the similarity between two words then becomes calculating the similarity between their word vectors. Any existing inter-vector similarity algorithm may be used for this, such as cosine similarity, a relative-entropy algorithm, or a covariance algorithm.
205. For each word in the dimension bag of words, determine the maximum of its similarity values with all words in the first bag of words as the dimension value of the corresponding dimension of the first vector, thereby obtaining the first vector.
A specific example: assume the words in the dimension bag of words C are W1a, W2a, W3a, W1b, W2b, W3b; the words in the first bag of words A are W1a, W2a, W3a; and the words in the second bag of words B are W1b, W2b, W3b. Calculate the similarity values of W1a with W1a, W2a, and W3a, obtaining L1, L2, and L3. If L1 is the largest of L1, L2, and L3, L1 is taken as the dimension value of the W1a dimension of the first vector. Following the same procedure as for the W1a dimension, the dimension values of the W2a, W3a, W1b, W2b, and W3b dimensions of the first vector are obtained in turn. It should be noted that, per step 203, the dimensions of the first vector correspond one-to-one to the words in the dimension bag of words.
It should be noted that, since the words of the first bag of words are included in the dimension bag of words, the dimension values of the first vector for the dimension words identical to words of the first bag of words are all 1. For the above example, the first vector therefore has the approximate form [1, 1, 1, L1a, L2a, L3a].
206. For each word in the dimension bag of words, determine the maximum of its similarity values with all words in the second bag of words as the dimension value of the corresponding dimension of the second vector, thereby obtaining the second vector.
A specific example with the same bags as above: calculate the similarity values of W1a with W1b, W2b, and W3b, obtaining L3, L4, and L5. If L3 is the largest of L3, L4, and L5, L3 is taken as the dimension value of the W1a dimension of the second vector. Following the same procedure as for the W1a dimension, the dimension values of the W2a, W3a, W1b, W2b, and W3b dimensions of the second vector are obtained in turn. It should be noted that, per step 203, the dimensions of the second vector correspond one-to-one to the words in the dimension bag of words.
It should be noted that, because the words of the second bag of words are included in the dimension bag of words, the maximum similarity value between a dimension word identical to a word of the second bag of words and the words of the second bag of words is 1; that is, those dimension values of the second vector are all 1. For the above example, the second vector therefore has the approximate form [L1b, L2b, L3b, 1, 1, 1].
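Steps 204 to 206 can be sketched as a single helper. Here word_sim stands for any word-to-word similarity derived from the word vectors (for example gensim's model.wv.similarity); the names are illustrative:

```python
def vectorize(bag, dimension_bag, word_sim):
    """Build a text vector over the dimension bag of words.

    Each dimension value is the maximum similarity between the dimension
    word and any word of `bag`; a word that belongs to `bag` itself gets
    the value 1 (its similarity with itself).
    """
    return [max(word_sim(dim_word, w) for w in bag) for dim_word in dimension_bag]
```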
It should be noted that the order of steps 205 and 206 is not limited: they may be executed simultaneously, or either may be executed first.
207. Calculate the similarity value of the first vector and the second vector according to the cosine similarity algorithm.
In this embodiment, the cosine similarity algorithm is applied to calculate the similarity value of the first vector and the second vector obtained in steps 205 and 206; this similarity value is the similarity value of the corresponding first text and second text.
In addition, it should be noted that the algorithm for calculating the similarity value of the first vector and the second vector may also be other vector similarity calculation algorithms, such as a relative entropy calculation algorithm, a covariance calculation algorithm, and the like.
In practical applications, the first text and the second text often differ greatly in length, so the numbers of words in the corresponding first and second bags of words also differ greatly. For the similarity calculation between a long text and a much shorter one, in order to ensure the accuracy of the similarity result, the long text's bag of words must also be truncated before the first and second bags of words are merged. The truncation manner is the same as in step 202: according to the TFIDF values of the words in the bag, the words with small TFIDF values are deleted and the words with large TFIDF values are kept.
It should be noted that whether the first text and the second text differ greatly in length is determined as follows: first, calculate the ratio of the number of words contained in the first bag of words to the number of words contained in the second bag of words, where the number of words contained in the first bag of words is greater than or equal to the number contained in the second bag of words; then compare the ratio with a preset ratio. If the ratio exceeds the preset ratio, the two texts are determined to differ greatly in length, and the first bag of words is truncated so that the ratio falls to within the preset ratio. The preset ratio can be set according to user requirements, for example 1, 1.1, or 1.2.
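This length-ratio check might be sketched as follows, reusing the truncate_bag sketch above; preset_ratio = 1.2 is one of the example settings, and the function name is illustrative:

```python
def balance_bags(bag_a, bag_b, idf, preset_ratio=1.2):
    """Truncate the longer bag until len(longer)/len(shorter) <= preset_ratio."""
    if len(bag_a) < len(bag_b):
        bag_a, bag_b = bag_b, bag_a          # bag_a is now the longer bag
    if len(bag_a) / len(bag_b) > preset_ratio:
        keep_n = int(len(bag_b) * preset_ratio)
        bag_a = truncate_bag(bag_a, idf, keep_n)  # TF-IDF truncation, as above
    return bag_a, bag_b
```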
Finally, this embodiment gives a concrete run of the text semantic similarity calculation. Assume the two texts are as follows:
First text A: "I love eating apples"
Second text B: "He likes eating bananas"
Segmenting A yields the first bag of words A' = ["I", "love", "eat", "apple"]
Segmenting B yields the second bag of words B' = ["he", "like", "eat", "banana"]
The first bag of words and the second bag of words share the word "eat"; to reduce computational complexity, this example keeps only one "eat" when the bags are merged.
The dimension bag of words obtained by merging A' and B' is therefore C = ["I", "love", "eat", "apple", "he", "like", "banana"]
After vectorization is performed according to Word2Vec, the obtained first vector and second vector are respectively:
A'=[1,1,1,1,0.7,0.8,0.7]
B'=[0.7,0.8,1,0.7,1,1,1]
calculating the similarity values of the first vector and the second vector according to a cosine similarity algorithm:
similarity(A', B') = Σᵢ(A'ᵢ × B'ᵢ) / (√(Σᵢ A'ᵢ²) × √(Σᵢ B'ᵢ²)) = 5.4 / (√5.62 × √5.62) ≈ 0.96
if the existing text semantic similarity calculation method based on word frequency is used, the result of calculating the similarity between the first text and the second text in the above example is 0.25. It can be seen that the text semantic similarity calculation in this example is more accurate.
Further, as an implementation of the foregoing embodiments, another embodiment of the present invention provides a text semantic similarity calculation device for implementing the methods described in FIG. 1 and FIG. 2. As shown in FIG. 3, the device includes: a merging unit 31, a vectorization unit 32, and a similarity calculation unit 33.
The merging unit 31 is configured to merge words in a first bag of words corresponding to a first text with words in a second bag of words corresponding to a second text to obtain a dimension bag of words, where the first text and the second text are the texts whose similarity is calculated, the words in the first bag of words are obtained by segmenting the first text, and the words in the second bag of words are obtained by segmenting the second text.
The first text and the second text are the texts whose similarity is to be calculated. The words in the first bag of words are obtained by segmenting the first text and removing stop words, and the words in the second bag of words are obtained by segmenting the second text and removing stop words. It should be noted that the words within the first bag of words do not repeat, and likewise for the second bag of words.
A specific example illustrates how the words of the first bag of words corresponding to the first text and of the second bag of words corresponding to the second text are merged into a dimension bag of words. Assume the first bag of words is A' and the second bag of words is B', with the following words, where w denotes a word:
A'=[wa1,wa2,wa3,wa4,wa5…]
B'=[wb1,wb2,wb3,wb4,wb5…]
Merging the first bag of words A' and the second bag of words B' yields the dimension bag of words C:
C=[wa1,wa2,wa3,wa4,wa5…,wb1,wb2,wb3,wb4,wb5…]
It should be noted that the order of the first-bag and second-bag words within the dimension bag of words is not limited. In addition, to reduce the complexity of subsequent calculation, that is, to reduce the number of dimensions of the first vector and second vector obtained subsequently, when the first bag of words and the second bag of words contain identical words, only one copy of each identical word may be kept in the merged dimension bag of words; that is, the words in the dimension bag of words do not repeat.
The vectorization unit 32 is configured to perform vectorization on the first bag of words and the second bag of words with a semantics-based word-to-vector tool to obtain a first vector and a second vector, where the dimensions of the first vector and the second vector correspond one-to-one to the words in the dimension bag of words.
Existing common semantics-based word-to-vector tools include Word2Vec, GloVe, and the like. This embodiment takes Word2Vec as the example; in practice any semantics-based word-to-vector tool may be used. Word2Vec is an open-source, efficient tool for representing words as real-valued vectors: using ideas from deep learning, it is trained to map words into a K-dimensional vector space. In this embodiment, the words in the first bag of words and the second bag of words are converted by Word2Vec into vectors of a preset dimension, similarity is calculated between the words of the first bag of words and the words of the dimension bag of words and between the words of the second bag of words and the words of the dimension bag of words, and the first bag of words and the second bag of words are thereby vectorized. It should be noted that the preset dimension is set freely according to actual requirements, for example 100 or 200; in general, the larger the preset dimension, the more accurately the converted vectors express the semantic features of the words.
It should be noted that when the first bag of words and the second bag of words are vectorized, the similarity between a word of the first or second bag of words and a word of the dimension bag of words is measured as the similarity between the word vectors produced by the semantics-based word-to-vector tool.
The similarity calculation unit 33 is configured to calculate the similarity value of the first vector and the second vector according to a vector similarity calculation algorithm to obtain the similarity result of the first text and the second text.
The first vector and the second vector obtained by the vectorization unit 32 are real-valued vectors of the same dimension, so their similarity value can be calculated with a vector similarity calculation algorithm; the similarity value between the first vector and the second vector is the similarity result of the first text and the second text.
It should be noted that the vector similarity calculation algorithm may be any existing algorithm that computes the similarity between vectors, such as cosine similarity, a relative-entropy algorithm, or a covariance algorithm.
Specifically, the formula for calculating the similarity value of the first vector and the second vector according to the cosine similarity algorithm is:
similarity(A', B') = Σᵢ(A'ᵢ × B'ᵢ) / (√(Σᵢ A'ᵢ²) × √(Σᵢ B'ᵢ²))
where A'ᵢ denotes the ith dimension value of the first vector A' and B'ᵢ denotes the ith dimension value of the second vector B'.
As shown in fig. 4, the vectorization unit 32 includes:
the first calculating module 321 is configured to calculate similarity values between each word in the dimension word bag and all words in the first word bag and the second word bag according to a semantic-based word transformation vector tool;
before calculating the similarity value of each word in the dimension word bag and all words in the first word bag and the second word bag respectively, the words in the word bags need to be converted into corresponding word vectors by applying a semantic-based word conversion vector tool. Calculating the similarity value between words is converted into calculating the similarity value between word vectors and word vectors. The method for calculating the similarity value between the word vectors may use any one of the existing algorithms that can calculate the similarity between the vectors, such as cosine similarity algorithm, relative entropy calculation algorithm, covariance calculation algorithm, and the like.
A first determining module 322, configured to determine a maximum value of similarity values between each word in the dimension word bag and all words in the first word bag as a dimension value of a corresponding dimension in the first vector, respectively, to obtain a first vector;
given a specific example for explanation, assume that the words in dimension bag C include W1a、W2a、W3a、W1b、W2b、W3bThe words in the first bag A include W1a、W2a、W3aThe words in the second bag B include W1b、W2b、W3bCalculate W1aRespectively with W1a、W2a、W3aThe similarity values of (a) are L1, L2 and L3, and if L1 in L1, L2 and L3 is the largest, the L1 is taken as W1 in the first vectoraDimension value of the dimension according to W1 in the first vectoraThe dimension value corresponding method can respectively obtain W2 in the first vector in turna、W3a、W1b、W2b、W3bThe dimension values of the corresponding dimensions, it should be noted that the vectorization unit 32 can know that the dimensions in the first vector correspond to the words in the dimension word bag one to one.
It should be noted that, because the word in the first word bag is included in the dimension word bag, the dimension values of the corresponding words in the dimension word bag, which are the same as the word in the first word bag, in the first vector are all 1, which corresponds to that the corresponding words in the dimension word bag are all corresponding to the first word bagIn the above example, the resulting first vector has the approximate form [1,1,1, L1 [ ]a,L2a,L3a]。
The second determining module 323 is configured to determine a maximum value of similarity values between each word in the dimension word bag and all words in the second word bag as a dimension value of a corresponding dimension in the second vector, so as to obtain a second vector.
Given a specific example for explanation, assume that the words in dimension bag C include W1a、W2a、W3a、W1b、W2b、W3bThe words in the first bag A include W1a、W2a、W3aThe words in the second bag B include W1b、W2b、W3bCalculate W1aRespectively with W1b、W2b、W3bThe similarity values of (a) are L3, L4 and L5, and if L3 in L3, L4 and L5 is the largest, the L3 is taken as W1 in the second vectoraDimension value of the dimension according to W1 in the second vectoraThe dimension value corresponding to the dimension can be obtained by the method, and the W2 in the second vector can be obtained respectively in sequencea、W3a、W1b、W2b、W3bThe dimension values of the corresponding dimensions, it should be noted that the vectorization unit 32 can know that the dimensions in the second vector correspond to the words in the dimension word bag one to one.
It should be noted that, because the words in the second word bag are included in the dimension word bag, the maximum value of the similarity value between the corresponding words in the dimension word bag that are the same as the words in the second word bag and the words in the second word bag is 1, that is, the dimension values in the second vector are all 1. In the corresponding above example, the resulting second vector has the approximate form [ L1 [ ]b,L2b,L3b,1,1,1]。
As shown in fig. 4, the apparatus further comprises:
a ratio calculating unit 34, configured to calculate, before the words in the first bag of words corresponding to the first text and the words in the second bag of words corresponding to the second text are merged into a dimension bag of words, the ratio of the number of words contained in the first bag of words to the number of words contained in the second bag of words, where the number of words contained in the first bag of words is greater than or equal to the number of words contained in the second bag of words;
a comparison unit 35, configured to compare the ratio with a preset ratio;
the intercepting unit 36 is configured to intercept words from the first word bag if the ratio exceeds a preset ratio, so that the ratio is reduced to be within the preset ratio;
the executing unit 37 is configured to, if the ratio does not exceed the preset ratio, execute merging of words in the first word bag corresponding to the first text and the second word bag corresponding to the second text to obtain a dimension word bag.
As shown in fig. 4, the apparatus further comprises:
the judging unit 38 is configured to judge whether the number of words contained in the first word bag and the second word bag is greater than a preset threshold value before words in the first word bag corresponding to the first text and words in the second word bag corresponding to the second text are combined to obtain a dimension word bag;
in practical application, the length of two texts for similarity calculation may be too long, and when the length of the text is too long, the dimension bag of words is too large, which results in the final expression of text vectors, and increases the complexity of calculation. Therefore, after the first bag of words and the second bag of words corresponding to the first text and the second text are obtained, the number of words in the bag of words needs to be determined, so as to determine whether the text is too long. The setting of the preset threshold value can be freely set according to actual requirements.
The intercepting unit 36 is further configured to intercept words from the first bag of words and/or the second bag of words if the number of words included in the first bag of words and/or the second bag of words is greater than a preset threshold, so that the number of words in the first bag of words and/or the second bag of words is reduced to be within the preset threshold.
In practical applications, a situation that lengths of a first text and a second text are greatly different is also commonly encountered, numbers of words in a corresponding first word bag and a corresponding second word bag are also greatly different, and for similarity calculation between a long text and a short text with a large difference, in order to ensure accuracy of a similarity result, word interception is further required on the long text before merging the first word bag and the second word bag corresponding to the first text and the second text.
As shown in fig. 4, the intercept unit 36 includes:
the second calculating module 361 is used for calculating the importance TFIDF value of each word in the first word bag and/or the second word bag;
the TFIDF value is used to evaluate the importance of a word or words to a set of documents or a corpus, where the importance of a word increases in direct proportion to the number of times it appears in a document, but decreases in inverse proportion to the frequency with which it appears in the corpus. TFIDF is actually: TF and IDF, TF is Term Frequency (Term Frequency) and IDF is Inverse Document Frequency (Inverse Document Frequency). TF represents the frequency of occurrence of a word in the text. The main idea of IDF is: if the text containing a certain entry t is less, the corresponding IDF is larger, and the entry t has good category distinguishing capability. TFIDFs of the words in the first bag of words and the second bag of words may be calculated from TF values of the words in the first bag of words and/or the second bag of words in the corresponding bag of words and IDF values in the corresponding corpus.
It should be noted that the corresponding corpus is an IDF library trained in advance, and for a training sample of the IDF library, a text set in a certain field or a text set acquired by a certain text acquisition platform (such as a hundred-degree search) may be used.
A sorting module 362 for sorting the words in the first word bag and/or the second word bag according to the order of the sizes of the TFIDF values;
and the extracting module 363 is configured to extract a preset number of words according to the sorted result sequence.
The larger the TFIDF value is, the more important the word corresponding to the TFIDF value is in the first word bag or the second word bag, so when word truncation is required, the word with the small TFIDF value needs to be deleted, and the word with the large TFIDF value is reserved.
According to the text semantic similarity calculation device provided by the embodiment of the invention, in the process of performing dimension reduction processing on two texts subjected to similarity calculation, the dimension words in the vector corresponding to the obtained text comprise all words in the two texts, so that the similarity calculation is not required to be performed by only selecting the dimension of the same word in the text, the semantic characteristics of each text can be completely reflected, semantic support is further provided by a semantic-based word conversion vector tool during the vectorization calculation of the text, and the similarity relevance between different synonyms can be fully considered. Therefore, the result of the similarity between texts obtained by final calculation is more accurate.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be appreciated that the relevant features of the method and apparatus described above are referred to one another. In addition, "first", "second", and the like in the above embodiments are for distinguishing the embodiments, and do not represent merits of the embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components of a text semantic similarity calculation device according to embodiments of the invention. The invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the invention may be stored on computer-readable media, or may take the form of one or more signals; such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims (8)

1. A method for calculating text semantic similarity, the method comprising:
combining words in a first bag of words corresponding to a first text with words in a second bag of words corresponding to a second text to obtain a dimension bag of words, wherein the first text and the second text are texts whose similarity is calculated, the words in the first bag of words are words obtained by segmenting the first text, and the words in the second bag of words are words obtained by segmenting the second text;
performing vectorization calculation on the first bag of words and the second bag of words according to a semantics-based word-to-vector tool to obtain a first vector and a second vector, wherein the dimensions of the first vector and the second vector correspond one-to-one to the words in the dimension bag of words;
and calculating the similarity value of the first vector and the second vector according to a vector similarity calculation algorithm to obtain a similarity result of the first text and the second text;
wherein performing vectorization calculation on the first bag of words and the second bag of words according to the semantics-based word-to-vector tool to obtain the first vector and the second vector comprises:
converting the words in the dimension bag of words, the first bag of words, and the second bag of words into corresponding word vectors by applying the semantics-based word-to-vector tool;
calculating, according to the semantics-based word-to-vector tool, similarity values between each word in the dimension bag of words and all words in the first bag of words and the second bag of words;
determining, for each word in the dimension bag of words, the maximum of its similarity values with all words in the first bag of words as the dimension value of the corresponding dimension of the first vector, to obtain the first vector;
and determining, for each word in the dimension bag of words, the maximum of its similarity values with all words in the second bag of words as the dimension value of the corresponding dimension of the second vector, to obtain the second vector.
2. The method of claim 1, wherein before combining the words in the first bag of words corresponding to the first text with the words in the second bag of words corresponding to the second text to obtain the dimension bag of words, the method further comprises:
calculating the ratio of the number of words contained in the first bag of words to the number of words contained in the second bag of words, wherein the number of words contained in the first bag of words is greater than or equal to the number of words contained in the second bag of words;
comparing the ratio with a preset ratio;
if the ratio exceeds the preset ratio, performing word truncation on the first bag of words to reduce the ratio to within the preset ratio;
and if the ratio does not exceed the preset ratio, combining the words in the first bag of words corresponding to the first text with the words in the second bag of words corresponding to the second text to obtain the dimension bag of words.
3. The method of claim 1, wherein before combining the words in the first bag of words corresponding to the first text with the words in the second bag of words corresponding to the second text to obtain the dimension bag of words, the method further comprises:
judging whether the number of words contained in the first bag of words and the second bag of words is greater than a preset threshold;
if the number of words contained in the first bag of words and/or the second bag of words is greater than the preset threshold, performing word truncation on the first bag of words and/or the second bag of words so as to reduce the number of words in the first bag of words and/or the second bag of words to within the preset threshold.
4. The method of claim 2 or 3, wherein performing word truncation on the first bag of words and/or the second bag of words comprises:
calculating an importance (TFIDF) value of each word in the first bag of words and/or the second bag of words;
sorting the words in the first bag of words and/or the second bag of words in descending order of TFIDF value;
and extracting a preset number of top-ranked words from the sorted result.
5. An apparatus for text semantic similarity calculation, the apparatus comprising:
a merging unit, configured to merge words in a first bag of words corresponding to a first text with words in a second bag of words corresponding to a second text to obtain a dimension bag of words, wherein the first text and the second text are texts whose similarity is calculated, the words in the first bag of words are words obtained by segmenting the first text, and the words in the second bag of words are words obtained by segmenting the second text;
a vectorization unit, configured to perform vectorization calculation on the first bag of words and the second bag of words according to a semantics-based word-to-vector tool to obtain a first vector and a second vector, wherein the dimensions of the first vector and the second vector correspond one-to-one to the words in the dimension bag of words;
and a similarity calculation unit, configured to calculate the similarity value of the first vector and the second vector according to a vector similarity calculation algorithm to obtain a similarity result of the first text and the second text;
wherein the vectorization unit comprises:
a first calculation module, configured to convert the words in the dimension bag of words, the first bag of words, and the second bag of words into corresponding word vectors by applying the semantics-based word-to-vector tool, and to calculate, according to the tool, similarity values between each word in the dimension bag of words and all words in the first bag of words and the second bag of words;
a first determining module, configured to determine, for each word in the dimension bag of words, the maximum of its similarity values with all words in the first bag of words as the dimension value of the corresponding dimension of the first vector, to obtain the first vector;
and a second determining module, configured to determine, for each word in the dimension bag of words, the maximum of its similarity values with all words in the second bag of words as the dimension value of the corresponding dimension of the second vector, to obtain the second vector.
6. The apparatus of claim 5, further comprising:
a ratio calculation unit, configured to calculate, before the words in the first word bag corresponding to the first text and the words in the second word bag corresponding to the second text are merged to obtain the dimension word bag, the ratio of the number of words contained in the first word bag to the number of words contained in the second word bag, where the number of words contained in the first word bag is greater than or equal to the number of words contained in the second word bag;
a comparison unit, configured to compare the ratio with a preset ratio;
a truncation unit, configured to perform word truncation on the first word bag if the ratio exceeds the preset ratio, so as to reduce the ratio to the preset ratio;
and an execution unit, configured to merge the words in the first word bag and the words in the second word bag to obtain the dimension word bag if the ratio does not exceed the preset ratio.
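(A hedged sketch of the ratio check of claim 6, reusing the hypothetical truncate_bag from the sketch after claim 4; preset_ratio is an assumed tuning parameter, and the swap merely enforces the claim's convention that the first bag is the larger one.)

def balance_bags(bag1, bag2, documents, preset_ratio):
    # claim 6 assumes the first bag contains at least as many words
    if len(bag1) < len(bag2):
        bag1, bag2 = bag2, bag1
    ratio = len(bag1) / len(bag2)
    if ratio > preset_ratio:
        # truncate the larger bag so the ratio drops to the preset ratio
        target = int(preset_ratio * len(bag2))
        bag1 = truncate_bag(bag1, documents, target)
    return bag1, bag2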
7. The apparatus of claim 5, further comprising:
a judgment unit, configured to judge, before the words in the first word bag corresponding to the first text and the words in the second word bag corresponding to the second text are merged to obtain the dimension word bag, whether the number of words contained in the first word bag and/or the second word bag is greater than a preset threshold;
and a truncation unit, configured to perform word truncation on the first word bag and/or the second word bag if the number of words contained therein is greater than the preset threshold, so as to reduce the number of words in the first word bag and/or the second word bag to within the preset threshold.
8. The apparatus according to claim 6 or 7, wherein the truncation unit comprises:
a second calculation module, configured to calculate an importance (TFIDF) value for each word in the first word bag and/or the second word bag;
a sorting module, configured to sort the words in the first word bag and/or the second word bag in descending order of TFIDF value;
and an extraction module, configured to extract a preset number of words in sequence from the sorted result.
CN201611155781.7A 2016-12-14 2016-12-14 Text semantic similarity calculation method and device Active CN106776559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611155781.7A CN106776559B (en) 2016-12-14 2016-12-14 Text semantic similarity calculation method and device

Publications (2)

Publication Number Publication Date
CN106776559A (en) 2017-05-31
CN106776559B (en) 2020-08-11

Family

ID=58888867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611155781.7A Active CN106776559B (en) 2016-12-14 2016-12-14 Text semantic similarity calculation method and device

Country Status (1)

Country Link
CN (1) CN106776559B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108932647A * 2017-07-24 2018-12-04 上海宏原信息科技有限公司 Method and apparatus for predicting similar articles and training a model therefor
CN110322895B (en) * 2018-03-27 2021-07-09 亿度慧达教育科技(北京)有限公司 Voice evaluation method and computer storage medium
CN111144104B (en) * 2018-11-02 2023-06-20 中国电信股份有限公司 Text similarity determination method, device and computer readable storage medium
CN109992476B (en) * 2019-03-20 2023-08-18 网宿科技股份有限公司 Log analysis method, server and storage medium
CN110516040B (en) * 2019-08-14 2022-08-05 出门问问(武汉)信息科技有限公司 Method, device and computer storage medium for semantic similarity comparison between texts
CN112765976A (en) * 2020-12-30 2021-05-07 北京知因智慧科技有限公司 Text similarity calculation method, device and equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605702A (en) * 2013-11-08 2014-02-26 北京邮电大学 Word similarity based network text classification method
CN104102626A (en) * 2014-07-07 2014-10-15 厦门推特信息科技有限公司 Method for computing semantic similarities among short texts
CN104731771A (en) * 2015-03-27 2015-06-24 大连理工大学 Term vector-based abbreviation ambiguity elimination system and method
CN105824797A (en) * 2015-01-04 2016-08-03 华为技术有限公司 Method, device and system evaluating semantic similarity
CN105955965A (en) * 2016-06-21 2016-09-21 上海智臻智能网络科技股份有限公司 Question information processing method and device
CN106484664A * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Method for calculating similarity between short texts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant