CN112257410A - Similarity calculation method for unbalanced text - Google Patents

Similarity calculation method for unbalanced text

Info

Publication number
CN112257410A
CN112257410A (application CN202011107977.5A)
Authority
CN
China
Prior art keywords
text
similarity
word
vector
unbalanced
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011107977.5A
Other languages
Chinese (zh)
Inventor
谢乾
马甲林
蒋圣
戴晶
周国栋
汪涛
吴大超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Haoxiang Basic Software Research Institute Co ltd
Nanjing Keti Software Technology Co ltd
Jiangsu Zhuoyi Information Technology Co ltd
Huaiyin Institute of Technology
Original Assignee
Nanjing Haoxiang Basic Software Research Institute Co ltd
Nanjing Keti Software Technology Co ltd
Jiangsu Zhuoyi Information Technology Co ltd
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Haoxiang Basic Software Research Institute Co ltd, Nanjing Keti Software Technology Co ltd, Jiangsu Zhuoyi Information Technology Co ltd, Huaiyin Institute of Technology filed Critical Nanjing Haoxiang Basic Software Research Institute Co ltd
Priority to CN202011107977.5A priority Critical patent/CN112257410A/en
Publication of CN112257410A publication Critical patent/CN112257410A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a similarity calculation method for unbalanced text, comprising the following steps: inputting a corpus and preprocessing it; pre-training word vectors on the corpus with a word2vec model; storing the word vector results obtained by the pre-training; inputting a short text T1 and a longer text T2 whose similarity is to be calculated; extracting keywords from text T1 and text T2 using TF-IDF; expanding all keywords of text T1 with semantically related words, based on the word vector results, until its length equals that of text T2; and computing the similarity of text T1 and text T2. The disclosed method improves the accuracy of similarity calculation for unbalanced text.

Description

Similarity calculation method for unbalanced text
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a similarity calculation method for unbalanced texts.
Background
Text similarity calculation is one of the core steps of text analysis and is used in numerous text processing tasks such as text classification, information retrieval, automatic question answering, and sentiment analysis. The commonly used text similarity calculation methods mainly include Euclidean distance, cosine distance, KL distance (Kullback-Leibler divergence), and deep learning-based methods. These methods achieve high accuracy when calculating the similarity of balanced texts (texts with a small length difference), but poor accuracy for unbalanced texts (texts with a large length difference). Yet many current applications require computing the similarity of unbalanced text, for example: in a search engine, retrieving a target page from a search term; in paper retrieval, matching paper content by title or abstract; in automatic question answering, finding answers from question sentences. Because short text carries less information, traditional methods perform poorly and with low accuracy when calculating its similarity with long text.
Disclosure of Invention
The invention solves the following technical problem: because short text carries less information, conventional methods perform poorly and with low accuracy when calculating its similarity with long text.
The technical scheme adopted by the invention to solve the above technical problem is as follows:
a similarity calculation method for unbalanced text comprises the following steps:
s1: inputting a corpus and preprocessing;
s2: pre-training word vectors for the corpus;
s3: saving the word vector result obtained by pre-training in the step S2;
s4: inputting a short text T1 and a longer text T2 whose similarity is to be calculated;
s5: extracting keywords from text T1 and text T2;
s6: expanding all keywords of text T1 with semantically related words, based on the word vector results, until its length equals that of text T2;
s7: computing the similarity of text T1 and text T2.
Further, in step S1, before the word vectors are pre-trained on the corpus, the jieba word segmentation toolkit of python is used to perform word segmentation and stop-word removal on all texts in the corpus.
Further, in step S2, word vectors are pre-trained on the corpus using the word2vec model.
Further, in step S5, keywords are extracted from text T1 and text T2 using TF-IDF, with the following specific steps:
s51: performing word segmentation on text T1 and text T2;
s52: removing stop words from text T1 and text T2;
s53: computing the TF-IDF value of every word in text T1 and text T2, and selecting the words whose value exceeds a threshold μ as text keywords, where TF-IDF is calculated as:
TF-IDF = TF * IDF
in the formula: TF = (frequency of occurrence of a word in the text) / (total number of words in the text),
IDF = log(total number of texts in the corpus / (number of texts containing the word + 1)).
Further, in step S6, all keywords of text T1 are expanded with semantically related words, based on the word vector results, until its length equals that of text T2, with the following specific steps:
s61: traversing text T1 and, for each keyword wi in text T1, calculating its semantic distance to every word in the word vector results obtained in step S3, and selecting the Ni words closest to wi as its semantically related words;
s62: outputting the expanded text T1' of the shorter text T1.
Further, the semantic distance is calculated by cosine similarity, as follows:
Sim(wi, wj) = ( Σ_{n=1..k} Wi,n × Wj,n ) / ( √(Σ_{n=1..k} Wi,n²) × √(Σ_{n=1..k} Wj,n²) )
In the above formula, Sim() denotes similarity calculation;
Sim(wi, wj) is the semantic distance between keywords wi and wj;
Wi and Wj are the word vectors of keywords wi and wj respectively;
k is the length of the word vectors;
Wi,n and Wj,n are the nth components of the word vectors of wi and wj respectively.
Further, the semantically related words of keyword wi of text T1 are the Ni words closest to wi, the parameter Ni being determined by the following equation:
[formula image not recoverable: Ni is computed from TF-IDF(wi) together with |T1| and |T2|, so that keywords with higher TF-IDF values receive more related words and the expanded text reaches the length of text T2]
In the above formula, TF-IDF(wi) is the TF-IDF value of word wi calculated in step S5;
|T1| and |T2| are the numbers of keywords of text T1 and text T2 respectively.
Further, in step S7, the similarity of text T1 and text T2 is calculated with the following specific steps:
s71: computing the text vector T1-1' of text T1';
s72: computing the text vector T2-2 of text T2;
s73: calculating the similarity of text vectors T1-1' and T2-2 with the cosine formula;
s74: outputting the similarity of texts T1 and T2.
Further, in step S71, the text vector T1-1' of text T1' is obtained by:
T1-1' = (1/NT) × Σ_{n=1..NT} Wn
In the above formula, NT is the number of keywords of text T1' and text T2,
and Wn is the word vector of word wn in the pre-training model results obtained in step S3.
In step S72, the text vector T2-2 of text T2 is obtained by:
T2-2 = (1/NT) × Σ_{m=1..NT} Wm
In the above formula, NT is the number of keywords of text T1' and text T2;
Wm is the word vector of word wm in the pre-training model results obtained in step S3.
Further, in step S73, the similarity of text vectors T1-1' and T2-2 is calculated with the cosine formula:
Sim(T1-1', T2-2) = ( Σ_{h=1..k} T'1-1,h × T2-2,h ) / ( √(Σ_{h=1..k} (T'1-1,h)²) × √(Σ_{h=1..k} (T2-2,h)²) )
where T'1-1,h and T2-2,h are the hth components of text vectors T1-1' and T2-2 respectively, and k is the length of the text vectors.
In step S74, Sim(T1, T2) = Sim(T1-1', T2-2), and the similarity Sim(T1, T2) of texts T1 and T2 is output.
Beneficial effects: compared with the prior art, the invention has the following advantages:
In the similarity calculation method for unbalanced text disclosed by the invention, a word vector model based on deep learning is pre-trained, and the short text whose similarity is to be calculated is given a reasonable semantic expansion, so that the two texts of different lengths reach a balanced state; this solves the problem of poor accuracy when computing the similarity of a short text and a long text of unbalanced lengths. The currently used text similarity calculation methods (Euclidean distance, cosine distance, KL distance, and deep learning-based methods) cannot by themselves solve the poor accuracy of similarity results for texts of unbalanced length. The disclosed method greatly improves the accuracy of similarity calculation results for unbalanced text.
Detailed Description
The present invention will be further illustrated by the following specific examples, which are carried out on the premise of the technical scheme of the invention. It should be understood that these examples only illustrate the invention and are not intended to limit its scope.
The similarity calculation method of the unbalanced text specifically comprises the following steps:
step S1: inputting a corpus and preprocessing;
the corpus adopted by the embodiment of the invention is 100 thousands of published academic papers in academic journals of information technology class. Each paper includes a topic (as short text) and a summary (as corresponding long text) that constitute unbalanced text. Before pre-training word vectors of the corpus, a jieba word segmentation toolkit of python is adopted to perform word segmentation and stop word processing on all texts in the corpus.
Step S2: pre-training word vectors on the corpus with the word2vec model.
Specifically, the word2vec model in python's open-source gensim package is used to pre-train the word vectors on the corpus.
Step S3: storing the word vector results obtained by the pre-training in step S2 in a disk file.
Step S4: inputting a short text T1 and a longer text T2 whose similarity is to be calculated.
Step S5: extracting keywords from text T1 and text T2 using TF-IDF (term frequency-inverse document frequency), with the following specific steps:
Step S51: performing word segmentation on text T1 and text T2;
Step S52: removing stop words from text T1 and text T2;
Step S53: computing the TF-IDF value of every word in text T1 and text T2, and selecting the words whose value exceeds the threshold μ as text keywords. The calculation formula of TF-IDF is:
TF-IDF = TF * IDF
In the above formula, TF = (frequency of occurrence of a word in the text) / (total number of words in the text), and IDF = log(total number of texts in the corpus / (number of texts containing the word + 1)). The threshold μ is a parameter determined in practice through manual experience or experiments; in this embodiment, μ = 0.4.
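The keyword extraction of step S5 can be sketched directly from the TF-IDF formulas above; the tokenized corpus and the threshold value used below are illustrative.

```python
# Sketch of the step S5 keyword extraction, implemented from the formulas
# TF = count / len(text) and IDF = log(N / (df + 1)) given in the description.
import math

def tf_idf_scores(tokens, corpus):
    """TF-IDF of each distinct word in `tokens` against a tokenized corpus."""
    n_docs = len(corpus)
    scores = {}
    for w in set(tokens):
        tf = tokens.count(w) / len(tokens)        # TF from the formula above
        df = sum(1 for doc in corpus if w in doc)  # number of texts containing w
        idf = math.log(n_docs / (df + 1))          # IDF from the formula above
        scores[w] = tf * idf
    return scores

def extract_keywords(tokens, corpus, mu):
    """Keep the words whose TF-IDF value exceeds the threshold mu (step S53)."""
    return [w for w, s in tf_idf_scores(tokens, corpus).items() if s > mu]
```

Note that with the `df + 1` denominator, a word occurring in every document gets a negative IDF, so it can never pass a positive threshold μ.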
Step S6: expanding all keywords of text T1 with semantically related words, based on the word vector results, until its length equals that of text T2. The specific steps are as follows:
Step S61: traversing text T1 and, for each keyword wi in text T1, calculating its semantic distance to every word in the word vector results obtained in step S3, and selecting the Ni words closest to wi as its semantically related words.
The semantic distance is calculated by cosine similarity, with the formula:
Sim(wi, wj) = ( Σ_{n=1..k} Wi,n × Wj,n ) / ( √(Σ_{n=1..k} Wi,n²) × √(Σ_{n=1..k} Wj,n²) )
In the above formula, Sim() denotes similarity calculation;
Sim(wi, wj) is the semantic distance between keywords wi and wj;
k is the length of the word vectors;
Wi,n and Wj,n are the nth components of the word vectors of keywords wi and wj respectively.
Further, the semantically related words of keyword wi of text T1 are the Ni words closest to wi, the parameter Ni being determined by the following equation:
[formula image not recoverable: Ni is computed from TF-IDF(wi) together with |T1| and |T2|, so that keywords with higher TF-IDF values receive more related words and the expanded text reaches the length of text T2]
In the above formula, TF-IDF(wi) is the TF-IDF value of word wi calculated in step S5;
|T1| and |T2| are the numbers of keywords of text T1 and text T2 respectively.
Step S62: outputting the expanded text T1' of the shorter text T1.
Step S7: computing the similarity of text T1 and text T2, with the following specific steps:
Step S71: the text vector T1-1' of text T1' is obtained by the following formula:
T1-1' = (1/NT) × Σ_{n=1..NT} Wn
In the above formula, NT is the number of keywords of text T1' and text T2;
Wn is the word vector of word wn in the pre-training model results obtained in step S3.
Step S72: the text vector T2-2 of text T2 is obtained by the following formula:
T2-2 = (1/NT) × Σ_{m=1..NT} Wm
In the above formula, NT is the number of keywords of text T1' and text T2;
Wm is the word vector of word wm in the pre-training model results obtained in step S3.
Step S73: the similarity of text vectors T1-1' and T2-2 is calculated with the cosine formula:
Sim(T1-1', T2-2) = ( Σ_{h=1..k} T'1-1,h × T2-2,h ) / ( √(Σ_{h=1..k} (T'1-1,h)²) × √(Σ_{h=1..k} (T2-2,h)²) )
where T'1-1,h and T2-2,h are the hth components of text vectors T1-1' and T2-2 respectively, and k is the length of the text vectors.
Step S74: Sim(T1, T2) = Sim(T1-1', T2-2); the similarity Sim(T1, T2) of texts T1 and T2 is output.
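Step S7 can be sketched as follows, taking the text vector to be the average of its keywords' word vectors (an assumption consistent with the symbols NT, Wn, and Wm in the description) and comparing the two text vectors with the cosine formula. The vector table is an illustrative stand-in for the pre-trained results.

```python
# Sketch of step S7: average the keyword vectors of each text (steps S71/S72)
# and take the cosine of the two averages (step S73). Data is illustrative.
import math

def text_vector(keywords, vectors):
    """Average the word vectors of the given keywords (steps S71/S72)."""
    k = len(vectors[keywords[0]])
    return [sum(vectors[w][h] for w in keywords) / len(keywords) for h in range(k)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (step S73)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

VECTORS = {
    "car":  [1.0, 0.0],
    "auto": [0.9, 0.1],
    "road": [0.7, 0.3],
}

t1_vec = text_vector(["car", "auto"], VECTORS)   # expanded short text T1'
t2_vec = text_vector(["auto", "road"], VECTORS)  # long text T2
sim = cosine(t1_vec, t2_vec)                     # Sim(T1, T2), close to 1 here
```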
To further illustrate the effect of the proposed method, a separate set of 1,000 published academic papers was used as verification data; each paper comprises a title (as the short text) and an abstract (as the corresponding long text), forming an unbalanced text pair.
Using the proposed similarity calculation method, the similarity of each of the 1,000 titles to each of the 1,000 abstracts was calculated, and for every title the abstract with the maximum similarity value was taken as the method's final result.
Evaluation criterion: a result is correct if the abstract that receives the maximum similarity value is the title's true abstract.
Comparison method: the method proposed in the invention is compared with the cosine similarity calculation method.
Verification result: the accuracy of the cosine similarity calculation method is 64.6%, while the accuracy of the proposed method reaches 80.2%. The similarity calculation method for unbalanced text proposed by the invention therefore greatly improves the accuracy of similarity calculation results for unbalanced text.
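The verification protocol can be sketched as an accuracy loop. The word-overlap similarity and the two-paper data set below are illustrative stand-ins for the patent's method and corpus.

```python
# Sketch of the verification protocol: each title is matched against every
# abstract, and a match counts as correct when the title's own abstract gets
# the highest similarity value.
def accuracy(titles, abstracts, similarity):
    """Fraction of titles whose own abstract maximizes the similarity."""
    correct = 0
    for i, title in enumerate(titles):
        scores = [similarity(title, abstract) for abstract in abstracts]
        if scores.index(max(scores)) == i:
            correct += 1
    return correct / len(titles)

def overlap(a, b):
    """Toy similarity: number of words shared by the two texts."""
    return len(set(a.split()) & set(b.split()))

titles = ["text similarity", "image retrieval"]
abstracts = ["unbalanced text similarity method", "image retrieval with features"]
acc = accuracy(titles, abstracts, overlap)  # 1.0 on this toy data
```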
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the invention, and such modifications and improvements shall also fall within the protection scope of the invention.

Claims (10)

1. A similarity calculation method for unbalanced text, comprising the steps of:
s1: inputting a corpus and preprocessing;
s2: pre-training word vectors for the corpus;
s3: saving the word vector result obtained by pre-training in the step S2;
s4: inputting a short text T1 and a longer text T2 whose similarity is to be calculated;
s5: extracting keywords from text T1 and text T2;
s6: expanding all keywords of text T1 with semantically related words, based on the word vector results, until its length equals that of text T2;
s7: computing the similarity of text T1 and text T2.
2. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S1, before pre-training word vectors in the corpus, word segmentation and stop word processing are performed on all texts in the corpus using the jieba word segmentation toolkit of python.
3. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S2, word vectors are pre-trained to the corpus using the word2vec model.
4. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S5, keywords are extracted from text T1 and text T2 using TF-IDF, with the following specific steps:
s51: performing word segmentation on text T1 and text T2;
s52: removing stop words from text T1 and text T2;
s53: computing the TF-IDF value of every word in text T1 and text T2, and selecting the words whose value exceeds a threshold μ as text keywords, where TF-IDF is calculated as:
TF-IDF = TF * IDF
in the formula: TF = (frequency of occurrence of a word in the text) / (total number of words in the text),
IDF = log(total number of texts in the corpus / (number of texts containing the word + 1)).
5. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S6, all keywords of text T1 are expanded with semantically related words, based on the word vector results, until its length equals that of text T2, with the following specific steps:
s61: traversing text T1 and, for each keyword wi in text T1, calculating its semantic distance to every word in the word vector results obtained in step S3, and selecting the Ni words closest to wi as its semantically related words;
s62: outputting the expanded text T1' of the shorter text T1.
6. The method of calculating the similarity of unbalanced text according to claim 5, wherein: the semantic distance is calculated by cosine similarity, as follows:
Sim(wi, wj) = ( Σ_{n=1..k} Wi,n × Wj,n ) / ( √(Σ_{n=1..k} Wi,n²) × √(Σ_{n=1..k} Wj,n²) )
In the above formula, Sim() denotes similarity calculation;
Sim(wi, wj) is the semantic distance between keywords wi and wj;
Wi and Wj are the word vectors of keywords wi and wj respectively;
k is the length of the word vectors;
Wi,n and Wj,n are the nth components of the word vectors of wi and wj respectively.
7. The method of calculating the similarity of unbalanced text according to claim 6, wherein: the semantically related words of keyword wi of text T1 are the Ni words closest to wi, the parameter Ni being determined by the following equation:
[formula image not recoverable: Ni is computed from TF-IDF(wi) together with |T1| and |T2|, so that keywords with higher TF-IDF values receive more related words and the expanded text reaches the length of text T2]
In the above formula, TF-IDF(wi) is the TF-IDF value of word wi calculated in step S5;
|T1| and |T2| are the numbers of keywords of text T1 and text T2 respectively.
8. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S7, the similarity of text T1 and text T2 is calculated with the following specific steps:
s71: computing the text vector T1-1' of text T1';
s72: computing the text vector T2-2 of text T2;
s73: calculating the similarity of text vectors T1-1' and T2-2 with the cosine formula;
s74: outputting the similarity of texts T1 and T2.
9. The method of calculating the similarity of unbalanced text according to claim 8, wherein: in step S71, the text vector T1-1' of text T1' is obtained by:
T1-1' = (1/NT) × Σ_{n=1..NT} Wn
In the above formula, NT is the number of keywords of text T1' and text T2,
and Wn is the word vector of word wn in the pre-training model results obtained in step S3.
In step S72, the text vector T2-2 of text T2 is obtained by:
T2-2 = (1/NT) × Σ_{m=1..NT} Wm
In the above formula, NT is the number of keywords of text T1' and text T2;
Wm is the word vector of word wm in the pre-training model results obtained in step S3.
10. The method of calculating the similarity of unbalanced text according to claim 8, wherein: in step S73, the similarity of text vectors T1-1' and T2-2 is calculated with the cosine formula:
Sim(T1-1', T2-2) = ( Σ_{h=1..k} T'1-1,h × T2-2,h ) / ( √(Σ_{h=1..k} (T'1-1,h)²) × √(Σ_{h=1..k} (T2-2,h)²) )
where T'1-1,h and T2-2,h are the hth components of text vectors T1-1' and T2-2 respectively, and k is the length of the text vectors;
in step S74, Sim(T1, T2) = Sim(T1-1', T2-2), and the similarity Sim(T1, T2) of texts T1 and T2 is output.
CN202011107977.5A 2020-10-15 2020-10-15 Similarity calculation method for unbalanced text Pending CN112257410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011107977.5A CN112257410A (en) 2020-10-15 2020-10-15 Similarity calculation method for unbalanced text


Publications (1)

Publication Number Publication Date
CN112257410A true CN112257410A (en) 2021-01-22

Family

ID=74244380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011107977.5A Pending CN112257410A (en) 2020-10-15 2020-10-15 Similarity calculation method for unbalanced text

Country Status (1)

Country Link
CN (1) CN112257410A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486662A (en) * 2021-07-19 2021-10-08 上汽通用五菱汽车股份有限公司 Text processing method, system and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text
CN107122451A (en) * 2017-04-26 2017-09-01 北京科技大学 A kind of legal documents case by grader method for auto constructing
CN108280206A (en) * 2018-01-30 2018-07-13 尹忠博 A kind of short text classification method based on semantically enhancement
CN108681557A (en) * 2018-04-08 2018-10-19 中国科学院信息工程研究所 Based on the short text motif discovery method and system indicated from expansion with similar two-way constraint
CN110889443A (en) * 2019-11-21 2020-03-17 成都数联铭品科技有限公司 Unsupervised text classification system and unsupervised text classification method


Similar Documents

Publication Publication Date Title
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
El-Beltagy et al. Combining lexical features and a supervised learning approach for Arabic sentiment analysis
WO2019228203A1 (en) Short text classification method and system
CN111177365A (en) Unsupervised automatic abstract extraction method based on graph model
US20190303375A1 (en) Relevant passage retrieval system
CN106202153A (en) The spelling error correction method of a kind of ES search engine and system
CN109002473A (en) A kind of sentiment analysis method based on term vector and part of speech
CN108920599B (en) Question-answering system answer accurate positioning and extraction method based on knowledge ontology base
US20110213763A1 (en) Web content mining of pair-based data
WO2021253873A1 (en) Method and apparatus for retrieving similar document
CN110879834A (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN110705247A (en) Based on x2-C text similarity calculation method
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN111859950A (en) Method for automatically generating lecture notes
Lin et al. Enhanced BERT-based ranking models for spoken document retrieval
CN112257410A (en) Similarity calculation method for unbalanced text
CN111159405B (en) Irony detection method based on background knowledge
Xue et al. DPAEG: a dependency parse-based adversarial examples generation method for intelligent Q&A robots
Ye et al. A sentiment based non-factoid question-answering framework
CN114416914B (en) Processing method based on picture question and answer
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN113505196B (en) Text retrieval method and device based on parts of speech, electronic equipment and storage medium
Zheng et al. A novel hierarchical convolutional neural network for question answering over paragraphs
Ronghui et al. Application of Improved Convolutional Neural Network in Text Classification.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination