CN112257410A - Similarity calculation method for unbalanced text - Google Patents
- Publication number
- CN112257410A (application CN202011107977.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- similarity
- word
- vector
- unbalanced
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention discloses a similarity calculation method for unbalanced text, comprising the following steps: inputting a corpus and preprocessing it; pre-training word vectors on the corpus using a word2vec model; storing the word vector results obtained by the pre-training; inputting a short text T1 and a longer text T2 whose similarity is to be calculated; extracting keywords from text T1 and text T2 using TF-IDF; expanding all keywords of text T1 with semantically related words, based on the word vector results, until its length equals that of text T2; and computing the similarity of text T1 and text T2. The disclosed similarity calculation method improves the accuracy of similarity calculation for unbalanced text.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a similarity calculation method for unbalanced texts.
Background
Text similarity calculation is one of the core steps of text analysis and is used in numerous text processing tasks such as text classification, information retrieval, automatic question answering, and sentiment analysis. The commonly used text similarity calculation methods mainly include the Euclidean distance, the cosine distance, the KL divergence (Kullback-Leibler divergence), and other deep learning-based methods. These methods achieve high accuracy when calculating the similarity of balanced texts (small difference in text length) but poor accuracy for unbalanced texts (large difference in text length). Yet many current applications of information technology require computing the similarity of unbalanced text, for example: in a search engine, retrieving a target page from search terms; in paper retrieval, matching paper content by title or abstract; in automatic question answering, finding answers from question sentences. Because short text carries less information, conventional methods have poor effect and low calculation accuracy when calculating its similarity to long text.
Disclosure of Invention
The technical problem solved by the invention: because short text carries less information, conventional methods have poor effect and low calculation accuracy when calculating its similarity to long text.
The technical scheme: in order to solve the above technical problem, the invention adopts the following technical scheme:
A similarity calculation method for unbalanced text comprises the following steps:
S1: input a corpus and preprocess it;
S2: pre-train word vectors on the corpus;
S3: save the word vector results obtained by pre-training in step S2;
S4: input the short text T1 and the longer text T2 whose similarity is to be calculated;
S5: extract keywords from text T1 and text T2;
S6: expand all keywords of text T1 with semantically related words, based on the word vector results, until its length equals that of text T2;
S7: compute the similarity of text T1 and text T2.
Further, in step S1, before pre-training word vectors on the corpus, word segmentation and stop-word removal are performed on all texts in the corpus using the jieba word segmentation toolkit for Python.
Further, in step S2, word vectors are pre-trained on the corpus using the word2vec model.
Further, in step S5, keywords are extracted from text T1 and text T2 using TF-IDF, with the following specific steps:
S51: perform word segmentation on text T1 and text T2;
S52: remove stop words from text T1 and text T2;
S53: compute the TF-IDF value of every word in text T1 and text T2, and select the words whose value exceeds a threshold μ as text keywords, where TF-IDF is calculated as:
TF-IDF = TF × IDF
where TF = (frequency of occurrence of the word in the text) / (total number of words in the text),
and IDF = log( (total number of texts in the corpus) / (number of texts containing the word + 1) ).
Further, in step S6, all keywords of text T1 are expanded with semantically related words, based on the word vector results, until its length equals that of text T2, with the following specific steps:
S61: traverse text T1; for each keyword wi in text T1, compute its semantic distance to other words using the word vector results obtained in step S3, and select the Ni words closest to wi as the semantically related words of wi;
S62: output the expanded text T1' of the shorter text T1.
Further, the semantic distance is calculated by cosine similarity, as follows:

Sim(wi, wj) = ( Σ_{n=1..k} Wi,n × Wj,n ) / ( sqrt( Σ_{n=1..k} (Wi,n)² ) × sqrt( Σ_{n=1..k} (Wj,n)² ) )

where Sim() denotes similarity calculation; Sim(wi, wj) is the semantic distance between keywords wi and wj; Wi and Wj are the word vectors of wi and wj respectively; k is the length of the word vectors; and Wi,n and Wj,n are the nth components of the word vectors of wi and wj respectively.
Further, the semantically related words of keyword wi of text T1 are the Ni words closest to wi, the parameter Ni being determined by the following equation:

Ni = ( TF-IDF(wi) / Σ_{wj ∈ T1} TF-IDF(wj) ) × ( |T2| − |T1| )

where TF-IDF(wi) is the TF-IDF value of word wi calculated in step S5, and |T1| and |T2| are the numbers of keywords of text T1 and text T2 respectively.
Further, in step S7, the similarity of text T1 and text T2 is calculated with the following specific steps:
S71: compute the text vector T1-1' of text T1';
S72: compute the text vector T2-2 of text T2;
S73: calculate the similarity of the text vectors T1-1' and T2-2 using the cosine formula;
S74: output the similarity of text T1 and text T2.
Further, in step S71, the text vector T1-1' of text T1' is obtained by:

T1-1' = (1/NT) × Σ_{n=1..NT} Wn

where NT is the number of keywords of text T1' and of text T2, and Wn is the word vector of word wn in the pre-trained model results obtained in step S3.
In step S72, the text vector T2-2 of text T2 is obtained by:

T2-2 = (1/NT) × Σ_{m=1..NT} Wm

where NT is the number of keywords of text T1' and of text T2, and Wm is the word vector of word wm in the pre-trained model results obtained in step S3.
Further, in step S73, the similarity of the text vectors T1-1' and T2-2 is calculated using the cosine formula:

Sim(T1-1', T2-2) = ( Σ_{h=1..k} T1-1',h × T2-2,h ) / ( sqrt( Σ_{h=1..k} (T1-1',h)² ) × sqrt( Σ_{h=1..k} (T2-2,h)² ) )

where T1-1',h and T2-2,h are the hth components of text vectors T1-1' and T2-2 respectively.
In step S74, Sim(T1, T2) = Sim(T1-1', T2-2) is output as the similarity of texts T1 and T2.
Beneficial effects: compared with the prior art, the invention has the following advantages:
In the disclosed similarity calculation method for unbalanced text, a word vector model based on deep learning is pre-trained and the shorter of the two texts to be compared is given a reasonable semantic expansion, so that the two texts of different lengths reach a balanced state. This solves the problem of poor accuracy when calculating the similarity between short and long texts of unbalanced length. The commonly used text similarity calculation methods (the Euclidean distance, the cosine distance, the KL divergence, and other deep learning-based methods) cannot by themselves overcome this problem. The disclosed similarity calculation method can greatly improve the accuracy of similarity calculation results for unbalanced text.
Detailed Description
The present invention will be further illustrated by the following specific examples, which are carried out on the premise of the technical scheme of the present invention. It should be understood that these examples are only for illustrating the present invention and are not intended to limit its scope.
The similarity calculation method of the unbalanced text specifically comprises the following steps:
step S1: inputting a corpus and preprocessing;
The corpus adopted in this embodiment of the invention consists of 1,000,000 academic papers published in information technology journals. Each paper includes a title (as the short text) and an abstract (as the corresponding long text), which together constitute an unbalanced text pair. Before pre-training word vectors on the corpus, word segmentation and stop-word removal are performed on all texts in the corpus using the jieba word segmentation toolkit for Python.
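The segment-then-filter preprocessing can be sketched as follows. This is a minimal stand-in: the embodiment uses the jieba toolkit for Chinese segmentation, while the whitespace tokenizer and the stop-word list here are illustrative assumptions.

```python
# Sketch of the preprocessing in step S1 (hypothetical stand-in: the patent
# uses jieba for Chinese word segmentation; a plain whitespace tokenizer
# illustrates the same segment-then-filter pipeline on English-like text).
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}  # illustrative stop-word list

def preprocess(text: str) -> list[str]:
    """Tokenize a text and drop stop words."""
    tokens = text.lower().split()  # for Chinese this would be jieba.lcut(text)
    return [t for t in tokens if t not in STOP_WORDS]

corpus = [
    "A similarity calculation method for unbalanced text",
    "The method improves the accuracy of text similarity",
]
segmented = [preprocess(doc) for doc in corpus]
print(segmented[0])  # ['similarity', 'calculation', 'method', 'for', 'unbalanced', 'text']
```

Each document is reduced to a list of content-bearing tokens, which is the input expected by the word2vec pre-training in step S2.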
Step S2: pre-training word vectors to the corpus by adopting a word2vec model;
Specifically, the word2vec model in the Python open-source gensim package is used to pre-train word vectors on the corpus.
Step S3: the word vector results obtained by the pre-training in step S2 are saved to a disk file.
Step S4: input the short text T1 and the longer text T2 whose similarity is to be calculated.
Step S5: extract keywords from text T1 and text T2 using TF-IDF (term frequency-inverse document frequency), with the following specific steps:
Step S51: perform word segmentation on text T1 and text T2;
Step S52: remove stop words from text T1 and text T2;
Step S53: compute the TF-IDF value of every word in text T1 and text T2, and select the words whose value exceeds the threshold μ as text keywords. TF-IDF is calculated as:
TF-IDF = TF × IDF
where TF = (frequency of occurrence of the word in the text) / (total number of words in the text),
and IDF = log( (total number of texts in the corpus) / (number of texts containing the word + 1) ).
The threshold μ is a parameter determined in practice through manual experience or experiments; in this embodiment, μ = 0.4.
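Step S53 can be sketched directly from the TF-IDF formulas given above; the toy corpus and the threshold value here are illustrative (the embodiment uses μ = 0.4 on a real corpus).

```python
# Minimal sketch of the TF-IDF keyword selection in step S53, using the
# formulas above (note the +1 in the IDF denominator). Corpus and threshold
# are illustrative.
import math

def tf_idf(word: str, doc: list[str], corpus: list[list[str]]) -> float:
    tf = doc.count(word) / len(doc)                # term frequency in this doc
    df = sum(1 for d in corpus if word in d)       # documents containing the word
    idf = math.log(len(corpus) / (df + 1))         # smoothed inverse doc frequency
    return tf * idf

corpus = [
    ["similarity", "unbalanced", "text"],
    ["text", "classification", "method"],
    ["image", "retrieval", "method"],
]
doc = corpus[0]
mu = 0.05  # illustrative threshold (the embodiment uses mu = 0.4)
keywords = [w for w in doc if tf_idf(w, doc, corpus) > mu]
print(keywords)  # ['similarity', 'unbalanced']
```

The word "text" scores zero here because it appears in two of the three documents, so log(3 / (2 + 1)) = 0; only the rarer, more discriminative words survive the threshold.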
Step S6: expand all keywords of text T1 with semantically related words, based on the word vector results, until its length equals that of text T2, with the following specific steps:
Step S61: traverse text T1; for each keyword wi in text T1, compute its semantic distance to other words using the word vector results obtained in step S3, and select the Ni words closest to wi as the semantically related words of wi.
The semantic distance is calculated by cosine similarity, with the following formula:

Sim(wi, wj) = ( Σ_{n=1..k} Wi,n × Wj,n ) / ( sqrt( Σ_{n=1..k} (Wi,n)² ) × sqrt( Σ_{n=1..k} (Wj,n)² ) )

where Sim() denotes similarity calculation; Sim(wi, wj) is the semantic distance between keywords wi and wj; k is the length of the word vectors; and Wi,n and Wj,n are the nth components of the word vectors of wi and wj respectively.
Further, the semantically related words of keyword wi of text T1 are the Ni words closest to wi, the parameter Ni being determined by the following equation:

Ni = ( TF-IDF(wi) / Σ_{wj ∈ T1} TF-IDF(wj) ) × ( |T2| − |T1| )

where TF-IDF(wi) is the TF-IDF value of word wi calculated in step S5, and |T1| and |T2| are the numbers of keywords of text T1 and text T2 respectively.
Step S62: output the expanded text T1' of the shorter text T1.
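The expansion of steps S61 and S62 can be sketched as follows; the toy 2-dimensional word vectors and the fixed per-keyword quota (standing in for the parameter Ni) are illustrative assumptions.

```python
# Sketch of steps S61-S62: each keyword of the short text pulls in its
# nearest neighbours by cosine similarity over toy word vectors. The vectors
# and the fixed quota n_i (a stand-in for Ni) are illustrative.
import math

VECTORS = {  # toy 2-d word vectors standing in for the word2vec results
    "car":   [1.0, 0.1],
    "auto":  [0.9, 0.2],
    "truck": [0.8, 0.3],
    "fish":  [0.0, 1.0],
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def expand(keywords, n_i):
    """Append the n_i nearest neighbours of each keyword (steps S61/S62)."""
    expanded = list(keywords)
    for w in keywords:
        others = [x for x in VECTORS if x not in expanded]
        others.sort(key=lambda x: cos(VECTORS[w], VECTORS[x]), reverse=True)
        expanded += others[:n_i]
    return expanded

print(expand(["car"], 2))  # ['car', 'auto', 'truck']
```

The expanded keyword list T1' is what step S7 compares against the long text T2, so the two sides now carry a comparable amount of lexical information.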
Step S7: compute the similarity of text T1 and text T2, with the following specific steps:
Step S71: the text vector T1-1' of text T1' is obtained by the following formula:

T1-1' = (1/NT) × Σ_{n=1..NT} Wn

where NT is the number of keywords of text T1' and of text T2, and Wn is the word vector of word wn in the pre-trained model results obtained in step S3.
Step S72: the text vector T2-2 of text T2 is obtained by the following formula:

T2-2 = (1/NT) × Σ_{m=1..NT} Wm

where NT is the number of keywords of text T1' and of text T2, and Wm is the word vector of word wm in the pre-trained model results obtained in step S3.
Step S73: the similarity of the text vectors T1-1' and T2-2 is calculated using the cosine formula:

Sim(T1-1', T2-2) = ( Σ_{h=1..k} T1-1',h × T2-2,h ) / ( sqrt( Σ_{h=1..k} (T1-1',h)² ) × sqrt( Σ_{h=1..k} (T2-2,h)² ) )

where T1-1',h and T2-2,h are the hth components of text vectors T1-1' and T2-2 respectively.
Step S74: Sim(T1, T2) = Sim(T1-1', T2-2); the similarity Sim(T1, T2) of texts T1 and T2 is output.
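Steps S71 to S74 can be sketched as follows, reading the text-vector formulas as a mean of word vectors; the toy vectors are illustrative assumptions.

```python
# Sketch of step S7: average the word vectors of each (expanded) text into a
# text vector, then compare the two text vectors with the cosine formula.
# The toy word vectors are illustrative.
import math

def text_vector(words, vectors):
    """Mean of the word vectors (one reading of the formulas in S71/S72)."""
    dim = len(next(iter(vectors.values())))
    v = [0.0] * dim
    for w in words:
        for h in range(dim):
            v[h] += vectors[w][h]
    return [x / len(words) for x in v]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

vectors = {"car": [1.0, 0.0], "auto": [0.8, 0.6], "road": [0.0, 1.0]}
t1_expanded = ["car", "auto"]   # expanded short text T1'
t2 = ["auto", "road"]           # keywords of the long text T2
sim = cosine(text_vector(t1_expanded, vectors), text_vector(t2, vectors))
print(round(sim, 3))  # 0.707
```

Because cosine similarity is scale-invariant, summing instead of averaging the word vectors in `text_vector` would give the same final score.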
To further illustrate the effect of the method provided by the invention, another 1000 published academic papers are used for verification; each paper includes a title (as the short text) and an abstract (as the corresponding long text), forming an unbalanced text pair.
Using the similarity calculation method provided by the invention, the similarity between each of the 1000 paper titles and each of the 1000 abstracts is calculated, and for every title the abstract with the maximum similarity value is selected as the final result.
Evaluation criterion: a result is correct if the abstract with the maximum similarity value is the one that actually belongs to the title.
Comparison method: the method presented in the invention is compared with the cosine similarity calculation method.
Verification result: the accuracy of the cosine similarity calculation method is 64.6%, while the accuracy of the method provided by the invention reaches 80.2%. The similarity calculation method for unbalanced text provided by the invention therefore greatly improves the accuracy of similarity calculation results for unbalanced text.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the invention, and such modifications and improvements should also be regarded as falling within the protection scope of the invention.
Claims (10)
1. A similarity calculation method for unbalanced text, comprising the following steps:
S1: input a corpus and preprocess it;
S2: pre-train word vectors on the corpus;
S3: save the word vector results obtained by pre-training in step S2;
S4: input the short text T1 and the longer text T2 whose similarity is to be calculated;
S5: extract keywords from text T1 and text T2;
S6: expand all keywords of text T1 with semantically related words, based on the word vector results, until its length equals that of text T2;
S7: compute the similarity of text T1 and text T2.
2. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S1, before pre-training word vectors in the corpus, word segmentation and stop word processing are performed on all texts in the corpus using the jieba word segmentation toolkit of python.
3. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S2, word vectors are pre-trained on the corpus using the word2vec model.
4. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S5, keywords are extracted from text T1 and text T2 using TF-IDF, with the following specific steps:
S51: perform word segmentation on text T1 and text T2;
S52: remove stop words from text T1 and text T2;
S53: compute the TF-IDF value of every word in text T1 and text T2, and select the words whose value exceeds a threshold μ as text keywords, where TF-IDF is calculated as:
TF-IDF = TF × IDF
where TF = (frequency of occurrence of the word in the text) / (total number of words in the text),
and IDF = log( (total number of texts in the corpus) / (number of texts containing the word + 1) ).
5. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S6, all keywords of text T1 are expanded with semantically related words, based on the word vector results, until its length equals that of text T2, with the following specific steps:
S61: traverse text T1; for each keyword wi in text T1, compute its semantic distance to other words using the word vector results obtained in step S3, and select the Ni words closest to wi as the semantically related words of wi;
S62: output the expanded text T1' of the shorter text T1.
6. The method of calculating the similarity of unbalanced text according to claim 5, wherein: the semantic distance is calculated by cosine similarity:

Sim(wi, wj) = ( Σ_{n=1..k} Wi,n × Wj,n ) / ( sqrt( Σ_{n=1..k} (Wi,n)² ) × sqrt( Σ_{n=1..k} (Wj,n)² ) )

where Sim() denotes similarity calculation; Sim(wi, wj) is the semantic distance between keywords wi and wj; Wi and Wj are the word vectors of wi and wj respectively; k is the length of the word vectors; and Wi,n and Wj,n are the nth components of the word vectors of wi and wj respectively.
7. The method of calculating the similarity of unbalanced text according to claim 6, wherein: the semantically related words of keyword wi of text T1 are the Ni words closest to wi, the parameter Ni being determined by the following equation:

Ni = ( TF-IDF(wi) / Σ_{wj ∈ T1} TF-IDF(wj) ) × ( |T2| − |T1| )

where TF-IDF(wi) is the TF-IDF value of word wi calculated in step S5, and |T1| and |T2| are the numbers of keywords of text T1 and text T2 respectively.
8. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S7, the similarity of text T1 and text T2 is calculated with the following specific steps:
S71: compute the text vector T1-1' of text T1';
S72: compute the text vector T2-2 of text T2;
S73: calculate the similarity of the text vectors T1-1' and T2-2 using the cosine formula;
S74: output the similarity of text T1 and text T2.
9. The method of calculating the similarity of unbalanced text according to claim 8, wherein: in step S71, the text vector T1-1' of text T1' is obtained by:

T1-1' = (1/NT) × Σ_{n=1..NT} Wn

where NT is the number of keywords of text T1' and of text T2, and Wn is the word vector of word wn in the pre-trained model results obtained in step S3.
In step S72, the text vector T2-2 of text T2 is obtained by:

T2-2 = (1/NT) × Σ_{m=1..NT} Wm

where NT is the number of keywords of text T1' and of text T2, and Wm is the word vector of word wm in the pre-trained model results obtained in step S3.
10. The method of calculating the similarity of unbalanced text according to claim 8, wherein: in step S73, the similarity of the text vectors T1-1' and T2-2 is calculated using the cosine formula:

Sim(T1-1', T2-2) = ( Σ_{h=1..k} T1-1',h × T2-2,h ) / ( sqrt( Σ_{h=1..k} (T1-1',h)² ) × sqrt( Σ_{h=1..k} (T2-2,h)² ) )

where T1-1',h and T2-2,h are the hth components of text vectors T1-1' and T2-2 respectively.
In step S74, Sim(T1, T2) = Sim(T1-1', T2-2) is output as the similarity of texts T1 and T2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011107977.5A CN112257410A (en) | 2020-10-15 | 2020-10-15 | Similarity calculation method for unbalanced text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112257410A true CN112257410A (en) | 2021-01-22 |
Family
ID=74244380
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011107977.5A Pending CN112257410A (en) | 2020-10-15 | 2020-10-15 | Similarity calculation method for unbalanced text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112257410A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113486662A (en) * | 2021-07-19 | 2021-10-08 | 上汽通用五菱汽车股份有限公司 | Text processing method, system and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106484664A (en) * | 2016-10-21 | 2017-03-08 | 竹间智能科技(上海)有限公司 | Similarity calculating method between a kind of short text |
CN107122451A (en) * | 2017-04-26 | 2017-09-01 | 北京科技大学 | A kind of legal documents case by grader method for auto constructing |
CN108280206A (en) * | 2018-01-30 | 2018-07-13 | 尹忠博 | A kind of short text classification method based on semantically enhancement |
CN108681557A (en) * | 2018-04-08 | 2018-10-19 | 中国科学院信息工程研究所 | Based on the short text motif discovery method and system indicated from expansion with similar two-way constraint |
CN110889443A (en) * | 2019-11-21 | 2020-03-17 | 成都数联铭品科技有限公司 | Unsupervised text classification system and unsupervised text classification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||