CN112257410A - Similarity calculation method for unbalanced text - Google Patents

Similarity calculation method for unbalanced text

Info

Publication number
CN112257410A
CN112257410A (application CN202011107977.5A)
Authority
CN
China
Prior art keywords
text
similarity
word
vector
unbalanced
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011107977.5A
Other languages
Chinese (zh)
Inventor
谢乾
马甲林
蒋圣
戴晶
周国栋
汪涛
吴大超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Haoxiang Basic Software Research Institute Co ltd
Nanjing Keti Software Technology Co ltd
Jiangsu Zhuoyi Information Technology Co ltd
Huaiyin Institute of Technology
Original Assignee
Nanjing Haoxiang Basic Software Research Institute Co ltd
Nanjing Keti Software Technology Co ltd
Jiangsu Zhuoyi Information Technology Co ltd
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Haoxiang Basic Software Research Institute Co ltd, Nanjing Keti Software Technology Co ltd, Jiangsu Zhuoyi Information Technology Co ltd, Huaiyin Institute of Technology filed Critical Nanjing Haoxiang Basic Software Research Institute Co ltd
Priority to CN202011107977.5A priority Critical patent/CN112257410A/en
Publication of CN112257410A publication Critical patent/CN112257410A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a similarity calculation method for unbalanced text, comprising the following steps: inputting a corpus and preprocessing it; pre-training word vectors on the corpus with a word2vec model; storing the word vector results obtained by the pre-training; inputting a short text T1 and a longer text T2 whose similarity is to be calculated; extracting keywords from text T1 and text T2 using TF-IDF; expanding all keywords of text T1 with semantically related words, based on the word vector results, until its length equals that of text T2; and computing the similarity of text T1 and text T2. The disclosed method improves the accuracy of similarity calculation for unbalanced text.

Description

Similarity calculation method for unbalanced text
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a similarity calculation method for unbalanced texts.
Background
Text similarity calculation is one of the core steps of text analysis and is used in numerous text processing tasks such as text classification, information retrieval, automatic question answering, and sentiment analysis. The commonly used text similarity calculation methods mainly include Euclidean distance, cosine distance, KL distance (Kullback-Leibler divergence), and deep learning-based methods. These methods achieve high accuracy when calculating the similarity of balanced texts (texts with a small length difference), but poor accuracy for unbalanced texts (texts with a large length difference). Yet many current applications require computing the similarity of unbalanced text, for example: in a search engine, retrieving a target page from a search term; in paper retrieval, matching paper content by title or abstract; in automatic question answering, finding answers from question sentences. Because short text carries less information, traditional methods perform poorly and with low accuracy when calculating its similarity with long text.
Disclosure of Invention
The invention solves the following technical problem: because short text carries less information, conventional methods perform poorly and with low accuracy when calculating its similarity with long text.
The technical scheme adopted by the invention to solve the above technical problem is as follows:
a similarity calculation method for unbalanced text comprises the following steps:
s1: inputting a corpus and preprocessing;
s2: pre-training word vectors for the corpus;
s3: saving the word vector result obtained by pre-training in the step S2;
s4: inputting a short text T1 and a longer text T2 whose similarity is to be calculated;
s5: extracting keywords from text T1 and text T2;
s6: expanding all keywords of text T1 with semantically related words, based on the word vector results, until its length equals that of text T2;
s7: computing the similarity of text T1 and text T2.
Further, in step S1, before the word vectors are pre-trained on the corpus, the jieba word segmentation toolkit of python is used to perform word segmentation and stop-word removal on all texts in the corpus.
Further, in step S2, word vectors are pre-trained on the corpus using the word2vec model.
Further, in step S5, keywords are extracted from text T1 and text T2 using TF-IDF, with the following specific steps:
s51: performing word segmentation on text T1 and text T2;
s52: removing stop words from text T1 and text T2;
s53: computing the TF-IDF value of every word in text T1 and text T2, and selecting the words whose value exceeds a threshold μ as text keywords, where TF-IDF is calculated as:
TF-IDF = TF * IDF
in the formula: TF = (frequency of occurrence of a word in the text) / (total number of words in the text),
IDF = log(total number of texts in the corpus / (number of texts containing the word + 1)).
Further, in step S6, all keywords of text T1 are expanded with semantically related words, based on the word vector results, until its length equals that of text T2, with the following specific steps:
s61: traversing text T1 and, for each keyword wi in text T1, calculating its semantic distance to every word in the word vector results obtained in step S3, and selecting the Ni words closest to wi as its semantically related words;
s62: outputting the expanded text T1' of the shorter text T1.
Further, the semantic distance is calculated by cosine similarity, as follows:
Sim(wi, wj) = ( Σ_{n=1..k} Wi,n × Wj,n ) / ( √(Σ_{n=1..k} Wi,n²) × √(Σ_{n=1..k} Wj,n²) )
In the above formula, Sim() denotes similarity calculation;
Sim(wi, wj) is the semantic distance between keywords wi and wj;
Wi and Wj are the word vectors of keywords wi and wj respectively;
k is the length of the word vectors;
Wi,n and Wj,n are the nth components of the word vectors of wi and wj respectively.
Further, the semantically related words of keyword wi of text T1 are the Ni words closest to wi, the parameter Ni being determined by the following equation:
[formula image not recoverable: Ni is computed from TF-IDF(wi) together with |T1| and |T2|, so that keywords with higher TF-IDF values receive more related words and the expanded text reaches the length of text T2]
In the above formula, TF-IDF(wi) is the TF-IDF value of word wi calculated in step S5;
|T1| and |T2| are the numbers of keywords of text T1 and text T2 respectively.
Further, in step S7, the similarity of text T1 and text T2 is calculated with the following specific steps:
s71: computing the text vector T1-1' of text T1';
s72: computing the text vector T2-2 of text T2;
s73: calculating the similarity of text vectors T1-1' and T2-2 with the cosine formula;
s74: outputting the similarity of texts T1 and T2.
Further, in step S71, the text vector T1-1' of text T1' is obtained by:
T1-1' = (1/NT) × Σ_{n=1..NT} Wn
In the above formula, NT is the number of keywords of text T1' and text T2,
and Wn is the word vector of word wn in the pre-training model results obtained in step S3.
In step S72, the text vector T2-2 of text T2 is obtained by:
T2-2 = (1/NT) × Σ_{m=1..NT} Wm
In the above formula, NT is the number of keywords of text T1' and text T2;
Wm is the word vector of word wm in the pre-training model results obtained in step S3.
Further, in step S73, the similarity of text vectors T1-1' and T2-2 is calculated with the cosine formula:
Sim(T1-1', T2-2) = ( Σ_{h=1..k} T'1-1,h × T2-2,h ) / ( √(Σ_{h=1..k} (T'1-1,h)²) × √(Σ_{h=1..k} (T2-2,h)²) )
where T'1-1,h and T2-2,h are the hth components of text vectors T1-1' and T2-2 respectively, and k is the length of the text vectors.
In step S74, Sim(T1, T2) = Sim(T1-1', T2-2), and the similarity Sim(T1, T2) of texts T1 and T2 is output.
Beneficial effects: compared with the prior art, the invention has the following advantages:
In the similarity calculation method for unbalanced text disclosed by the invention, a word vector model based on deep learning is pre-trained, and the short text whose similarity is to be calculated is given a reasonable semantic expansion, so that the two texts of different lengths reach a balanced state; this solves the problem of poor accuracy when computing the similarity of a short text and a long text of unbalanced lengths. The currently used text similarity calculation methods (Euclidean distance, cosine distance, KL distance, and deep learning-based methods) cannot by themselves solve the poor accuracy of similarity results for texts of unbalanced length. The disclosed method greatly improves the accuracy of similarity calculation results for unbalanced text.
Detailed Description
The present invention will be further illustrated by the following specific examples, which are carried out on the premise of the technical scheme of the invention. It should be understood that these examples only illustrate the invention and are not intended to limit its scope.
The similarity calculation method of the unbalanced text specifically comprises the following steps:
step S1: inputting a corpus and preprocessing;
the corpus adopted by the embodiment of the invention is 100 thousands of published academic papers in academic journals of information technology class. Each paper includes a topic (as short text) and a summary (as corresponding long text) that constitute unbalanced text. Before pre-training word vectors of the corpus, a jieba word segmentation toolkit of python is adopted to perform word segmentation and stop word processing on all texts in the corpus.
Step S2: pre-training word vectors on the corpus with the word2vec model.
Specifically, the word2vec model in python's open-source gensim package is used to pre-train the word vectors on the corpus.
Step S3: storing the word vector results obtained by the pre-training in step S2 in a disk file.
Step S4: inputting a short text T1 and a longer text T2 whose similarity is to be calculated.
Step S5: extracting keywords from text T1 and text T2 using TF-IDF (term frequency-inverse document frequency), with the following specific steps:
Step S51: performing word segmentation on text T1 and text T2;
Step S52: removing stop words from text T1 and text T2;
Step S53: computing the TF-IDF value of every word in text T1 and text T2, and selecting the words whose value exceeds the threshold μ as text keywords. The calculation formula of TF-IDF is:
TF-IDF = TF * IDF
In the above formula, TF = (frequency of occurrence of a word in the text) / (total number of words in the text), and IDF = log(total number of texts in the corpus / (number of texts containing the word + 1)). The threshold μ is a parameter determined in practice through manual experience or experiments; in this embodiment, μ = 0.4.
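The keyword extraction of step S5 can be sketched directly from the TF-IDF formulas above; the tokenized corpus and the threshold value used below are illustrative.

```python
# Sketch of the step S5 keyword extraction, implemented from the formulas
# TF = count / len(text) and IDF = log(N / (df + 1)) given in the description.
import math

def tf_idf_scores(tokens, corpus):
    """TF-IDF of each distinct word in `tokens` against a tokenized corpus."""
    n_docs = len(corpus)
    scores = {}
    for w in set(tokens):
        tf = tokens.count(w) / len(tokens)        # TF from the formula above
        df = sum(1 for doc in corpus if w in doc)  # number of texts containing w
        idf = math.log(n_docs / (df + 1))          # IDF from the formula above
        scores[w] = tf * idf
    return scores

def extract_keywords(tokens, corpus, mu):
    """Keep the words whose TF-IDF value exceeds the threshold mu (step S53)."""
    return [w for w, s in tf_idf_scores(tokens, corpus).items() if s > mu]
```

Note that with the `df + 1` denominator, a word occurring in every document gets a negative IDF, so it can never pass a positive threshold μ.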
Step S6: expanding all keywords of text T1 with semantically related words, based on the word vector results, until its length equals that of text T2. The specific steps are as follows:
Step S61: traversing text T1 and, for each keyword wi in text T1, calculating its semantic distance to every word in the word vector results obtained in step S3, and selecting the Ni words closest to wi as its semantically related words.
The semantic distance is calculated by cosine similarity, with the formula:
Sim(wi, wj) = ( Σ_{n=1..k} Wi,n × Wj,n ) / ( √(Σ_{n=1..k} Wi,n²) × √(Σ_{n=1..k} Wj,n²) )
In the above formula, Sim() denotes similarity calculation;
Sim(wi, wj) is the semantic distance between keywords wi and wj;
k is the length of the word vectors;
Wi,n and Wj,n are the nth components of the word vectors of keywords wi and wj respectively.
Further, the semantically related words of keyword wi of text T1 are the Ni words closest to wi, the parameter Ni being determined by the following equation:
[formula image not recoverable: Ni is computed from TF-IDF(wi) together with |T1| and |T2|, so that keywords with higher TF-IDF values receive more related words and the expanded text reaches the length of text T2]
In the above formula, TF-IDF(wi) is the TF-IDF value of word wi calculated in step S5;
|T1| and |T2| are the numbers of keywords of text T1 and text T2 respectively.
Step S62: outputting the expanded text T1' of the shorter text T1.
Step S7: computing the similarity of text T1 and text T2, with the following specific steps:
Step S71: the text vector T1-1' of text T1' is obtained by the following formula:
T1-1' = (1/NT) × Σ_{n=1..NT} Wn
In the above formula, NT is the number of keywords of text T1' and text T2;
Wn is the word vector of word wn in the pre-training model results obtained in step S3.
Step S72: the text vector T2-2 of text T2 is obtained by the following formula:
T2-2 = (1/NT) × Σ_{m=1..NT} Wm
In the above formula, NT is the number of keywords of text T1' and text T2;
Wm is the word vector of word wm in the pre-training model results obtained in step S3.
Step S73: the similarity of text vectors T1-1' and T2-2 is calculated with the cosine formula:
Sim(T1-1', T2-2) = ( Σ_{h=1..k} T'1-1,h × T2-2,h ) / ( √(Σ_{h=1..k} (T'1-1,h)²) × √(Σ_{h=1..k} (T2-2,h)²) )
where T'1-1,h and T2-2,h are the hth components of text vectors T1-1' and T2-2 respectively, and k is the length of the text vectors.
Step S74: Sim(T1, T2) = Sim(T1-1', T2-2); the similarity Sim(T1, T2) of texts T1 and T2 is output.
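Step S7 can be sketched as follows, taking the text vector to be the average of its keywords' word vectors (an assumption consistent with the symbols NT, Wn, and Wm in the description) and comparing the two text vectors with the cosine formula. The vector table is an illustrative stand-in for the pre-trained results.

```python
# Sketch of step S7: average the keyword vectors of each text (steps S71/S72)
# and take the cosine of the two averages (step S73). Data is illustrative.
import math

def text_vector(keywords, vectors):
    """Average the word vectors of the given keywords (steps S71/S72)."""
    k = len(vectors[keywords[0]])
    return [sum(vectors[w][h] for w in keywords) / len(keywords) for h in range(k)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (step S73)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

VECTORS = {
    "car":  [1.0, 0.0],
    "auto": [0.9, 0.1],
    "road": [0.7, 0.3],
}

t1_vec = text_vector(["car", "auto"], VECTORS)   # expanded short text T1'
t2_vec = text_vector(["auto", "road"], VECTORS)  # long text T2
sim = cosine(t1_vec, t2_vec)                     # Sim(T1, T2), close to 1 here
```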
To further illustrate the effect of the proposed method, a separate set of 1,000 published academic papers was used as verification data; each paper comprises a title (as the short text) and an abstract (as the corresponding long text), forming an unbalanced text pair.
Using the proposed similarity calculation method, the similarity of each of the 1,000 titles to each of the 1,000 abstracts was calculated, and for every title the abstract with the maximum similarity value was taken as the method's final result.
Evaluation criterion: a result is correct if the abstract that receives the maximum similarity value is the title's true abstract.
Comparison method: the method proposed in the invention is compared with the cosine similarity calculation method.
Verification result: the accuracy of the cosine similarity calculation method is 64.6%, while the accuracy of the proposed method reaches 80.2%. The similarity calculation method for unbalanced text proposed by the invention therefore greatly improves the accuracy of similarity calculation results for unbalanced text.
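The verification protocol can be sketched as an accuracy loop. The word-overlap similarity and the two-paper data set below are illustrative stand-ins for the patent's method and corpus.

```python
# Sketch of the verification protocol: each title is matched against every
# abstract, and a match counts as correct when the title's own abstract gets
# the highest similarity value.
def accuracy(titles, abstracts, similarity):
    """Fraction of titles whose own abstract maximizes the similarity."""
    correct = 0
    for i, title in enumerate(titles):
        scores = [similarity(title, abstract) for abstract in abstracts]
        if scores.index(max(scores)) == i:
            correct += 1
    return correct / len(titles)

def overlap(a, b):
    """Toy similarity: number of words shared by the two texts."""
    return len(set(a.split()) & set(b.split()))

titles = ["text similarity", "image retrieval"]
abstracts = ["unbalanced text similarity method", "image retrieval with features"]
acc = accuracy(titles, abstracts, overlap)  # 1.0 on this toy data
```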
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the invention, and such modifications and improvements shall also fall within the protection scope of the invention.

Claims (10)

1. A similarity calculation method for unbalanced text, comprising the steps of:
s1: inputting a corpus and preprocessing;
s2: pre-training word vectors for the corpus;
s3: saving the word vector result obtained by pre-training in the step S2;
s4: inputting a short text T1 and a longer text T2 whose similarity is to be calculated;
s5: extracting keywords from text T1 and text T2;
s6: expanding all keywords of text T1 with semantically related words, based on the word vector results, until its length equals that of text T2;
s7: computing the similarity of text T1 and text T2.
2. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S1, before pre-training word vectors in the corpus, word segmentation and stop word processing are performed on all texts in the corpus using the jieba word segmentation toolkit of python.
3. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S2, word vectors are pre-trained to the corpus using the word2vec model.
4. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S5, keywords are extracted from text T1 and text T2 using TF-IDF, with the following specific steps:
s51: performing word segmentation on text T1 and text T2;
s52: removing stop words from text T1 and text T2;
s53: computing the TF-IDF value of every word in text T1 and text T2, and selecting the words whose value exceeds a threshold μ as text keywords, where TF-IDF is calculated as:
TF-IDF = TF * IDF
in the formula: TF = (frequency of occurrence of a word in the text) / (total number of words in the text),
IDF = log(total number of texts in the corpus / (number of texts containing the word + 1)).
5. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S6, all keywords of text T1 are expanded with semantically related words, based on the word vector results, until its length equals that of text T2, with the following specific steps:
s61: traversing text T1 and, for each keyword wi in text T1, calculating its semantic distance to every word in the word vector results obtained in step S3, and selecting the Ni words closest to wi as its semantically related words;
s62: outputting the expanded text T1' of the shorter text T1.
6. The method of calculating the similarity of unbalanced text according to claim 5, wherein: the semantic distance is calculated by cosine similarity, as follows:
Sim(wi, wj) = ( Σ_{n=1..k} Wi,n × Wj,n ) / ( √(Σ_{n=1..k} Wi,n²) × √(Σ_{n=1..k} Wj,n²) )
In the above formula, Sim() denotes similarity calculation;
Sim(wi, wj) is the semantic distance between keywords wi and wj;
Wi and Wj are the word vectors of keywords wi and wj respectively;
k is the length of the word vectors;
Wi,n and Wj,n are the nth components of the word vectors of wi and wj respectively.
7. The method of calculating the similarity of unbalanced text according to claim 6, wherein: the semantically related words of keyword wi of text T1 are the Ni words closest to wi, the parameter Ni being determined by the following equation:
[formula image not recoverable: Ni is computed from TF-IDF(wi) together with |T1| and |T2|, so that keywords with higher TF-IDF values receive more related words and the expanded text reaches the length of text T2]
In the above formula, TF-IDF(wi) is the TF-IDF value of word wi calculated in step S5;
|T1| and |T2| are the numbers of keywords of text T1 and text T2 respectively.
8. The method of calculating the similarity of unbalanced text according to claim 1, wherein: in step S7, the similarity of text T1 and text T2 is calculated with the following specific steps:
s71: computing the text vector T1-1' of text T1';
s72: computing the text vector T2-2 of text T2;
s73: calculating the similarity of text vectors T1-1' and T2-2 with the cosine formula;
s74: outputting the similarity of texts T1 and T2.
9. The method of calculating the similarity of unbalanced text according to claim 8, wherein: in step S71, the text vector T1-1' of text T1' is obtained by:
T1-1' = (1/NT) × Σ_{n=1..NT} Wn
In the above formula, NT is the number of keywords of text T1' and text T2,
and Wn is the word vector of word wn in the pre-training model results obtained in step S3.
In step S72, the text vector T2-2 of text T2 is obtained by:
T2-2 = (1/NT) × Σ_{m=1..NT} Wm
In the above formula, NT is the number of keywords of text T1' and text T2;
Wm is the word vector of word wm in the pre-training model results obtained in step S3.
10. The method of calculating the similarity of unbalanced text according to claim 8, wherein: in step S73, the similarity of text vectors T1-1' and T2-2 is calculated with the cosine formula:
Sim(T1-1', T2-2) = ( Σ_{h=1..k} T'1-1,h × T2-2,h ) / ( √(Σ_{h=1..k} (T'1-1,h)²) × √(Σ_{h=1..k} (T2-2,h)²) )
where T'1-1,h and T2-2,h are the hth components of text vectors T1-1' and T2-2 respectively, and k is the length of the text vectors;
in step S74, Sim(T1, T2) = Sim(T1-1', T2-2), and the similarity Sim(T1, T2) of texts T1 and T2 is output.
CN202011107977.5A 2020-10-15 2020-10-15 Similarity calculation method for unbalanced text Pending CN112257410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011107977.5A CN112257410A (en) 2020-10-15 2020-10-15 Similarity calculation method for unbalanced text


Publications (1)

Publication Number Publication Date
CN112257410A true CN112257410A (en) 2021-01-22

Family

ID=74244380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011107977.5A Pending CN112257410A (en) 2020-10-15 2020-10-15 Similarity calculation method for unbalanced text

Country Status (1)

Country Link
CN (1) CN112257410A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486662A (en) * 2021-07-19 2021-10-08 上汽通用五菱汽车股份有限公司 Text processing method, system and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text
CN107122451A (en) * 2017-04-26 2017-09-01 北京科技大学 A kind of legal documents case by grader method for auto constructing
CN108280206A (en) * 2018-01-30 2018-07-13 尹忠博 A kind of short text classification method based on semantically enhancement
CN108681557A (en) * 2018-04-08 2018-10-19 中国科学院信息工程研究所 Based on the short text motif discovery method and system indicated from expansion with similar two-way constraint
CN110889443A (en) * 2019-11-21 2020-03-17 成都数联铭品科技有限公司 Unsupervised text classification system and unsupervised text classification method


Similar Documents

Publication Publication Date Title
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
El-Beltagy et al. Combining lexical features and a supervised learning approach for Arabic sentiment analysis
WO2019228203A1 (en) Short text classification method and system
CN111177365A (en) Unsupervised automatic abstract extraction method based on graph model
US20190303375A1 (en) Relevant passage retrieval system
CN106202153A (en) The spelling error correction method of a kind of ES search engine and system
CN109002473A (en) A kind of sentiment analysis method based on term vector and part of speech
CN108920599B (en) Question-answering system answer accurate positioning and extraction method based on knowledge ontology base
US20110213763A1 (en) Web content mining of pair-based data
WO2021253873A1 (en) Method and apparatus for retrieving similar document
CN110879834A (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN110705247A (en) Based on x2-C text similarity calculation method
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN111859950A (en) Method for automatically generating lecture notes
Lin et al. Enhanced BERT-based ranking models for spoken document retrieval
CN112257410A (en) Similarity calculation method for unbalanced text
CN111159405B (en) Irony detection method based on background knowledge
Xue et al. DPAEG: a dependency parse-based adversarial examples generation method for intelligent Q&A robots
Ye et al. A sentiment based non-factoid question-answering framework
CN114416914B (en) Processing method based on picture question and answer
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN113505196B (en) Text retrieval method and device based on parts of speech, electronic equipment and storage medium
Zheng et al. A novel hierarchical convolutional neural network for question answering over paragraphs
Ronghui et al. Application of Improved Convolutional Neural Network in Text Classification.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination