CN110825877A - Semantic similarity analysis method based on text clustering - Google Patents

Semantic similarity analysis method based on text clustering

Info

Publication number
CN110825877A
CN110825877A (application CN201911100265.8A)
Authority
CN
China
Prior art keywords: text, word, semantic, words, clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911100265.8A
Other languages
Chinese (zh)
Inventor
唐昱润
宫法明
马玉辉
司朋举
李昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China
Priority to CN201911100265.8A
Publication of CN110825877A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Abstract

The invention discloses a semantic similarity analysis method based on text clustering, which comprises the following steps: taking unprocessed text data as input, performing word frequency statistics on the preprocessed text, adding the word frequency statistics to text clustering as prior knowledge and providing a posterior judgment criterion, and further using the word frequency statistics in a classifier for a second round of unsupervised clustering, so as to improve the accuracy and timeliness of the text clustering results; performing synonym disambiguation on the processed text, generating a semantic vector fusing context features after semantic role labeling, processing the text sequences with two LSTMs having identical structures and parameters, and adding the product and the variance of the results to amplify the common points and differences of the texts, from which the final similarity analysis result is calculated. The method can be applied to practical text similarity analysis scenarios in many different fields and handles different types of text data well.

Description

Semantic similarity analysis method based on text clustering
Technical Field
The invention belongs to the field of natural language processing, and relates to a semantic similarity analysis method based on text clustering.
Background
Text clustering and semantic similarity detection have long been important research topics in natural language processing. They can automatically and accurately determine text categories, extract semantics, and compare similarity in text data, and are important for processing and applying such data. In recent years, the maturing of the field has been accompanied by rapid growth in the number of reports and scientific achievements, making summarization increasingly important, and similarity analysis methods based on text clustering have received growing attention. Such methods can be divided into two stages, text clustering and text similarity analysis. Current methods generally focus on word frequency information while ignoring the semantic information of keywords and the data structure and context information of the text, yet the semantic and context information of many keywords in a text is helpful for clustering-based similarity analysis.
The problem of efficiently analyzing and detecting text similarity within a specific field has largely been solved, but applying such methods to similarity analysis over a large multi-field text library makes it difficult to obtain similarity quickly and accurately; problems such as high-dimensional feature word vectors, sparse data, omission of low-frequency words, and lack of semantic information remain, and professional terms and synonyms that are ambiguous across fields also affect the similarity analysis results. Although semantic similarity analysis methods based on deep learning can reduce these effects and improve accuracy, their detection time is too long; how to analyze text similarity quickly and efficiently across a wide range of fields has therefore become an urgent problem.
Disclosure of Invention
In order to overcome the defects, the invention provides a semantic similarity analysis method based on text clustering, which comprises the following specific steps:
S1, for an input unprocessed text T, performing data preprocessing by stop-word removal, encoding conversion and Chinese word segmentation, and converting the data into a computable form;
S2, training text word vectors for the segmented words with the Skip-gram and Softmax models so as to calculate the similarity between words;
S3, calculating term frequency and inverse document frequency with the TF-IDF algorithm to obtain TF-IDF values, and extracting the keywords of the detected text;
S4, adding the word frequency statistics as prior knowledge to text clustering, providing a posterior judgment criterion, and preliminarily classifying the texts in the sample library;
S5, adding the extracted keywords as prior knowledge to a classifier, clustering the text data in the sample library on this basis, and refining the previously obtained text categories;
S6, performing morphological analysis, synonym disambiguation and semantic role labeling on the preprocessed text to be detected to generate a semantic vector fusing context features;
S7, inputting the semantic vectors into two LSTMs with identical structures and parameters to process the text sequences, and adding the product and variance of the results to amplify the common points and differences of the texts;
S8, outputting the final result of the text similarity analysis.
For step S3, the word vector Skip-gram model used in the present invention is a Huffman tree constructed based on Hierarchical Softmax; it can predict the occurrence probability of the preceding and following words from large-scale unlabeled text data according to the currently input word, that is, the surrounding words can be predicted from the occurrence probability of the current word. According to the co-occurrence principle of words within a window, the co-occurrence probability between words is calculated based on window sliding, so that the word vector generated for each feature word contains certain text structure information and semantic information. The structure and calculation of the Skip-gram model are as follows:
the Skip-gram model comprises an input layer, a projection layer and an output layer, wherein the input layer is a current characteristic word, and a word vector W of the wordt∈Rm(ii) a Output ofThe purpose of the projection layer is to maximize the value of the objective function L for the probability of a word appearing in the context window of the feature word, given a set of word sequences W1,W2,…,WNThen, then
L = (1/N) Σ_{j=1}^{N} Σ_{-c≤i≤c, i≠0} log P(W_{j+i} | W_j)    (1)
In formula (1), N is the length of the word sequence; c represents the context length of the current feature word, usually taken as 5-10 with good effect; and P(W_{j+1} | W_j) is the probability that, given the occurrence of the current word W_j, its context feature word W_{j+1} occurs. All word vectors obtained through Skip-gram model training form a word vector matrix X ∈ R^{m×n}; with X_i ∈ R^m representing the word vector of feature word i in the m-dimensional space, the similarity between feature words can be measured by the distance between the corresponding word vectors. The Euclidean distance between two vectors is shown in the following formula:
d(W_i, W_j) = ‖x_i − x_j‖_2    (2)
In formula (2), d(W_i, W_j) represents the semantic distance between features i and j, and x_i and x_j are the word vectors corresponding to the feature words W_i and W_j. The smaller the value of d(W_i, W_j), the smaller the semantic distance between the two feature words and the more similar their semantics.
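The following is a minimal sketch of this step, assuming the gensim library as one possible implementation (the disclosure does not name a toolkit): a Skip-gram model with hierarchical Softmax is trained on a toy corpus, and the semantic distance of formula (2) is computed from the resulting word vectors. The corpus, vector size and window width are illustrative assumptions only.

import numpy as np
from gensim.models import Word2Vec

# Toy corpus of segmented texts (each document is a list of words).
corpus = [["text", "clustering", "similarity"],
          ["semantic", "similarity", "analysis"],
          ["text", "semantic", "vector"]]

# sg=1 selects Skip-gram; hs=1 with negative=0 selects hierarchical Softmax.
model = Word2Vec(corpus, vector_size=100, window=5, sg=1, hs=1,
                 negative=0, min_count=1)

def semantic_distance(w_i, w_j):
    # Formula (2): d(W_i, W_j) = ||x_i - x_j||_2
    return np.linalg.norm(model.wv[w_i] - model.wv[w_j])

print(semantic_distance("text", "semantic"))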
For step S5, a conventional K-means algorithm is used to perform the text clustering process. Let the set of texts to be clustered be D = {D_1, D_2, D_3, …, D_n}. The specific flow of the text clustering algorithm based on word vectors and feature semantic distances is as follows:
(1) traversing the input text set, and for each text D_i performing word segmentation, stop-word removal and other operations with a word segmentation tool to obtain the text feature word set S = {S1, S2, S3, …, Sm};
(2) training the large-scale corpus obtained after word segmentation with a Skip-gram model to obtain the word vector of each feature word;
(3) counting and calculating the word frequency, position and word distance information of the feature words of each text;
(4) randomly selecting k texts as initial clustering centroids, calculating the distance between each text and the k clustering centroids using formula (3), and selecting the centroid with the shortest distance as the centroid of the text, where the calculation is as follows:
l(W_i, W_j) = α·tf_ij × idf_i + β·k + γ·g_ij × d(W_i, W_j)    (3)
In formula (3), l(W_i, W_j) represents the multi-feature semantic distance between feature words i and j, d(W_i, W_j) represents the semantic distance between features i and j, tf_ij is the word frequency information, idf_i is the information entropy of the feature word, g_ij represents the weight of the distance from the first to the last occurrence of feature word i in text j, and α, β and γ are the weights of the three different feature types, with α + β + γ = 1;
(5) sequentially calculating the distances from each text sharing a centroid to the other centroids, and selecting the centroid with the shortest distance as the new centroid of the text;
(6) executing steps (4) and (5) in a loop until the centroids no longer change, finally obtaining the clustering result.
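A simplified sketch of the clustering loop in steps (4)-(6) follows. For brevity each text is represented here by a single vector (for example the mean of its feature word vectors), and plain Euclidean distance stands in for the full multi-feature distance of formula (3); substituting formula (3), with chosen weights α, β and γ, into the distance computation would complete the algorithm as described.

import numpy as np

def cluster_texts(doc_vecs, k, max_iters=100, seed=0):
    # Step (4): randomly select k texts as the initial centroids.
    rng = np.random.default_rng(seed)
    centroids = doc_vecs[rng.choice(len(doc_vecs), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps (4)-(5): assign every text to its nearest centroid.
        dists = np.linalg.norm(doc_vecs[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned texts.
        new_centroids = np.array([
            doc_vecs[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # Step (6): stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels

docs = np.random.default_rng(1).normal(size=(10, 8))   # ten toy text vectors
print(cluster_texts(docs, k=3))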
For step S7, the invention uses two LSTM neural networks with identical structures and parameters to process the text sequences. The model consists of a preprocessing layer, an input layer, a training layer and an output layer. Let the input be {x, x'} and the output be y, where l_i is an intermediate hidden layer, W_i is the weight matrix of each layer, and b_i is the bias of each layer, as follows:
l_1 = [x; x'],  l_i = ReLU(W_i·l_{i−1} + b_i), i = 2, 3, 4,  y = Softmax(W_5·l_4 + b_5)    (4)
In formula (4), the activation function of training layers l_2-l_4 is ReLU and the activation function of the output layer y is Softmax; the similarity is divided into 6 classes, so K = 6, and the following are derived:
ReLU(z) = max(0, z)    (5)
Softmax(z_k) = e^{z_k} / Σ_{i=1}^{K} e^{z_i}, k = 1, …, K    (6)
the model adopts a double-sequence structure, each sequence is trained based on LSTM, the model is described in detail as follows, t1 and t2 of texts needing to be trained are input into an ① input layer, the texts are converted into word sequences s1 and s2 through word segmentation, the word sequences are mapped with word vectors in pre-training, the word sequences are converted into word vector sequences v1 and v2, ② the word vector sequences are input into an LSTM model for training to obtain two text vectors which are r1 and r2 respectively, product operation is carried out on r1 and r2 to obtain p, variance operation is carried out on r1 and r2 to obtain q, and finally the four results of r1, r2, p and q are connected together, a ③ output layer puts the connecting vectors into a full-connection layer for calculation, semantic similarity values are obtained finally, compared with the previous model, the model not only comprises a double-sequence processing structure, but also carries out further processing on training results of the text vectors, namely the results of the two sequences are multiplied by the full-connection layer for calculation, the product operation result is obtained, the semantic similarity values are obtained, the two sequences are amplified, and the variance of the two sequences is improved, and the variance of the two sequences is amplified, so that the variance of the.
The semantic similarity analysis method based on text clustering solves the problems, present in the prior art, of large text similarity analysis errors and poor real-time performance in scenarios where multiple fields are interrelated, and has the following advantages:
(1) the method can be applied to a variety of practical scenarios, realizes text clustering and semantic similarity analysis, and forms a general framework for text similarity comparison tasks in concrete application scenarios;
(2) the method makes full use of the semantic information and context information of keywords, improves the text clustering and semantic similarity analysis method, simplifies the subsequent classification network, and can adapt to the input of multiple types of text data;
(3) in the actual scientific research environment where multiple fields intersect, the method improves the accuracy and speed of text clustering and semantic similarity analysis by adopting a semantics-based keyword extraction method combined with semantic role labeling and context information.
Drawings
FIG. 1 is a flow chart of a semantic similarity analysis method based on text clustering according to the present invention.
FIG. 2 is a schematic structural diagram of a semantic similarity analysis model according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
A semantic similarity analysis method based on text clustering is shown in FIG. 1, a flow chart of the method, and comprises the following steps:
s1, preprocessing data by text drying, stop word removing, code conversion and Chinese word segmentation for an input unprocessed text sequence. The original data come from the scientific research results declared by marine oil extraction plants over the years, the scientific research results are divided into 4 categories, and the newly declared scientific research result documents are processed in real time by taking the actual scientific research results in work as a sample library.
S2, word vector training: the text word vectors of the segmented words in the text data to be analyzed are trained with the Skip-gram and Softmax models to calculate the similarity between words; the occurrence probability of the preceding and following words is predicted from large-scale unlabeled text data according to the currently input word, that is, the surrounding words are predicted from the occurrence probability of the current word. According to the co-occurrence principle of words within a window, the co-occurrence probability between words is calculated based on window sliding, so that the word vector generated for each feature word contains certain text structure information and semantic information.
S3, word frequency statistics: the word frequency statistics method adopted by the invention is TF-IDF, a common weighting technique used in information retrieval and text mining.
TF-IDF is a statistical method for evaluating the importance of a word to a document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in a document, but decreases in inverse proportion to its frequency in the corpus. If a word has a high frequency TF in one article and rarely appears in other articles, the word or phrase is considered to have good category distinguishing ability and is suitable for classification;
TF (term frequency) is the number of times a word appears in an article; it is usually normalized (typically the word count divided by the total number of words in the article) to prevent bias towards long documents, as shown in formula (7):
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (7)
where n_{i,j} is the number of occurrences of word t_i in document d_j.
the IDF (inverse Document frequency) is an inverse text frequency index, and if the number of documents containing the keywords is less, the keywords are proved to have good category distinguishing capability. The IDF for a keyword can be obtained by dividing the total number of articles by the number of articles containing the keyword, and then taking the logarithm of the result.
idf_i = log( |D| / |{j : t_i ∈ d_j}| )    (8)
In formula (8), |D| is the total number of documents in the corpus, and |{j : t_i ∈ d_j}| denotes the number of documents containing the word t_i. A high word frequency within a particular document combined with a low document frequency of that word across the whole collection yields a high TF-IDF weight. TF-IDF is therefore used to filter out common words in the scientific research results and retain the important words.
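As a worked illustration of formulas (7) and (8), the following minimal sketch computes TF-IDF weights from scratch for a toy corpus of segmented documents; the example documents are illustrative assumptions.

import math
from collections import Counter

def tf_idf(corpus):
    # corpus: list of documents, each a list of segmented words.
    n_docs = len(corpus)
    df = Counter()                       # document frequency of each word
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)                 # normalization of formula (7)
        weights.append({w: (c / total) * math.log(n_docs / df[w])  # tf x idf, formula (8)
                        for w, c in counts.items()})
    return weights

docs = [["text", "clustering", "method"],
        ["text", "similarity"],
        ["clustering", "similarity", "analysis"]]
print(tf_idf(docs)[0])                   # TF-IDF weights of the first document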
S4, text classification: the word frequency statistics are added to text clustering as prior knowledge, a posterior judgment criterion is provided, and the texts in the sample library are preliminarily classified. The word frequency information counted in S3 is added to a classifier as prior knowledge to preliminarily classify the documents to be compared in the sample library, and the classification results are judged against the posterior judgment criterion so as to improve the accuracy of text classification as much as possible.
S5, text clustering: the method adopted for text clustering is a text clustering algorithm based on word vectors and feature semantic distances. The scientific research results declared by the offshore oil production plant over the years serve as the text set to be clustered; the text set is traversed, and each text D_i is processed with a word segmentation tool (word segmentation, stop-word removal and other operations) to obtain the text feature word set. The large-scale corpus obtained after word segmentation is trained with the Skip-gram model to obtain the word vector of each feature word, and the word frequency, position and word distance information of the feature words is then counted and calculated for each text. k texts are randomly selected as initial clustering centroids, the distance between each text and the k clustering centroids is calculated, and the centroid with the shortest distance is selected as the centroid of the text. The distances from each text sharing a centroid to the other centroids are calculated in turn, the centroid with the shortest distance is selected as the new centroid of the text, and centroid selection and shortest-distance calculation are repeated in a loop until the centroids no longer change.
S6, semantic vector generation: the semantic similarity of the text can be studied from multiple textual features, including the influence of stop words on Chinese word segmentation, morphological analysis, semantic similarity calculation, synonym disambiguation, fusion of context features into the semantic vector, and neural-network-based text structure prediction, so that the similarity of text data can be analyzed more accurately. The process can be roughly divided into four steps: predicate labeling, preprocessing, semantic role labeling and semantic role classification. Predicate labeling identifies verb predicates in sentences and assigns word senses to them. The preprocessing stage mainly prunes the dependency tree, deleting the relation nodes on the dependency tree that are least likely to bear a role of the predicate, thereby eliminating unnecessary structural information and effectively reducing the number of instances input to the classifier. Key features that determine the performance of the semantic role labeling system are then extracted and fused with the context features to generate the semantic vector.
S7, semantic similarity analysis: the semantic vectors are input into two LSTMs with identical structures and parameters to process the text sequences, the product and variance of the results are added, and the common points and differences of the texts are amplified; the model structure is shown in FIG. 2. The vector sequences are input into the LSTM model for training to obtain two text vectors. To improve the sensitivity and accuracy of the model, the product operation amplifies the identical parts of the two sequences and reduces the opposite parts, improving sensitivity, while the variance reflects the difference between the two sequences, improving accuracy. The product operation is performed on the two text vectors, then the variance operation; finally the two text vectors, the product result and the variance result are concatenated and output to a fully connected layer for calculation, finally obtaining the semantic similarity value.
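A toy numeric illustration of why this merge amplifies the common points and differences (the values are arbitrary): in the product, dimensions where the two text vectors agree stay positive while opposed dimensions turn negative, and in the squared-difference ("variance") term only the differing dimensions are nonzero.

import numpy as np

r1 = np.array([0.9,  0.8, -0.7])
r2 = np.array([0.9, -0.8, -0.7])
p = r1 * r2           # [ 0.81, -0.64,  0.49]: agreement stays positive, opposition goes negative
q = (r1 - r2) ** 2    # [ 0.  ,  2.56,  0.  ]: only the differing dimension is nonzero
features = np.concatenate([r1, r2, p, q])   # the vector handed to the fully connected layer
print(p, q)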
S8, output: the categories of the sample library to be compared, namely the text clustering result, and the semantic similarity analysis result of the text to be analyzed are output.
In summary, the semantic similarity analysis method based on text clustering of the present invention performs fast semantic similarity analysis on text data in actual scenarios; it can be applied to a variety of practical scenarios, performs semantic similarity analysis well on the sample library formed for each scenario, and is applicable to many fields.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (4)

1. A semantic similarity analysis method based on text clustering is characterized by comprising the following specific steps:
S1, for an input unprocessed text T, performing data preprocessing by stop-word removal, encoding conversion and Chinese word segmentation, and converting the data into a computable form;
S2, training text word vectors for the segmented words with the Skip-gram and Softmax models so as to calculate the similarity between words;
S3, calculating term frequency and inverse document frequency with the TF-IDF algorithm to obtain TF-IDF values, and extracting the keywords of the detected text;
S4, adding the word frequency statistics as prior knowledge to text clustering, providing a posterior judgment criterion, and preliminarily classifying the texts in the sample library;
S5, adding the extracted keywords as prior knowledge to a classifier, clustering the text data in the sample library on this basis, and refining the previously obtained text categories;
S6, performing morphological analysis, synonym disambiguation and semantic role labeling on the preprocessed text to be detected to generate a semantic vector fusing context features;
S7, inputting the semantic vectors into two LSTMs with identical structures and parameters to process the text sequences, and adding the product and variance of the results to amplify the common points and differences of the texts;
S8, outputting the final result of the text similarity analysis.
2. The semantic similarity analysis method based on text clustering according to claim 1, characterized in that, for step S3, the word vector Skip-gram model used by the invention is a Huffman tree constructed based on Hierarchical Softmax, which can predict the occurrence probability of the preceding and following words from large-scale unlabeled text data according to the currently input word, that is, the surrounding words can be predicted from the occurrence probability of the current word; according to the co-occurrence principle of words within a window, the co-occurrence probability between words is calculated based on window sliding, so that the word vector generated for each feature word contains certain text structure information and semantic information, wherein the structure and calculation of the Skip-gram model are as follows:
the Skip-gram model comprises an input layer, a projection layer and an output layer, wherein the input layer is the current feature word with word vector W_t ∈ R^m, the output is the probability that a word appears in the context window of the feature word, and the purpose of the projection layer is to maximize the value of the objective function L; given a word sequence W_1, W_2, …, W_N, then
L = (1/N) Σ_{j=1}^{N} Σ_{-c≤i≤c, i≠0} log P(W_{j+i} | W_j)    (1)
in formula (1), N is the length of the word sequence, c represents the context length of the current feature word (generally 5-10, with good effect), and P(W_{j+1} | W_j) is the probability that, given the occurrence of the current word W_j, its context feature word W_{j+1} occurs; all word vectors obtained through Skip-gram model training form a word vector matrix X ∈ R^{m×n}, and with X_i ∈ R^m representing the word vector of feature word i in the m-dimensional space, the similarity between feature words can be measured by the distance between the corresponding word vectors; the Euclidean distance between two vectors is shown in formula (2):
d(W_i, W_j) = ‖x_i − x_j‖_2    (2)
in formula (2), d(W_i, W_j) represents the semantic distance between features i and j, x_i and x_j are the word vectors corresponding to the feature words W_i and W_j, and the smaller the value of d(W_i, W_j), the smaller the semantic distance between the two feature words and the more similar their semantics.
3. The semantic similarity analysis method according to claim 1, characterized in that for step S5, a conventional K-means algorithm is used to perform the text clustering process; let the set of texts to be clustered be D = {D_1, D_2, D_3, …, D_n}; the specific flow of the text clustering algorithm based on word vectors and feature semantic distances is as follows:
(1) traversing the input text set, and for each text D_i performing word segmentation, stop-word removal and other operations with a word segmentation tool to obtain the text feature word set S = {S1, S2, S3, …, Sm};
(2) training the large-scale corpus obtained after word segmentation with a Skip-gram model to obtain the word vector of each feature word;
(3) counting and calculating the word frequency, position and word distance information of the feature words of each text;
(4) randomly selecting k texts as initial clustering centroids, calculating the distance between each text and the k clustering centroids using formula (3), and selecting the centroid with the shortest distance as the centroid of the text, where the calculation is as follows:
l(W_i, W_j) = α·tf_ij × idf_i + β·k + γ·g_ij × d(W_i, W_j)    (3)
in formula (3), l(W_i, W_j) represents the multi-feature semantic distance between the feature words i and j, d(W_i, W_j) represents the semantic distance between features i and j, tf_ij is the word frequency information, idf_i is the information entropy of the feature word, g_ij represents the weight of the distance from the first to the last occurrence of feature word i in text j, and α, β and γ are the weights of the three different feature types, with α + β + γ = 1;
(5) sequentially calculating the distances from each text sharing a centroid to the other centroids, and selecting the centroid with the shortest distance as the new centroid of the text;
(6) executing steps (4) and (5) in a loop until the centroids no longer change, finally obtaining the clustering result.
4. The semantic similarity analysis method based on text clustering according to claim 1, characterized in that for step S7, the invention uses two LSTM neural networks with identical structures and parameters to process the text sequences; the model consists of a preprocessing layer, an input layer, a training layer and an output layer; let the input be {x, x'} and the output be y, where l_i is an intermediate hidden layer, W_i is the weight matrix of each layer, and b_i is the bias of each layer, as follows:
l_1 = [x; x'],  l_i = ReLU(W_i·l_{i−1} + b_i), i = 2, 3, 4,  y = Softmax(W_5·l_4 + b_5)    (4)
in formula (4), the activation function of training layers l_2-l_4 is ReLU and the activation function of the output layer y is Softmax; the similarity is divided into 6 classes, so K = 6, and the following are derived:
ReLU(z) = max(0, z)    (5)
Softmax(z_k) = e^{z_k} / Σ_{i=1}^{K} e^{z_i}, k = 1, …, K    (6)
the model adopts a dual-sequence structure and each sequence is trained based on an LSTM; in detail: (1) the texts t1 and t2 to be trained are fed to the input layer, converted into word sequences s1 and s2 by word segmentation, mapped to the pre-trained word vectors, and thereby converted into word vector sequences v1 and v2; (2) the word vector sequences are input into the LSTM model for training to obtain two text vectors r1 and r2; a product operation is first performed on r1 and r2 to obtain p, a variance operation is then performed on r1 and r2 to obtain q, and finally the four results r1, r2, p and q are concatenated; (3) the output layer feeds the concatenated vector into a fully connected layer for calculation, finally obtaining the semantic similarity value; compared with earlier models, this model not only contains a dual-sequence processing structure but also further processes the training results of the text vectors: the product operation amplifies the identical parts of the two sequences and reduces the opposite parts, improving the sensitivity of the model, while the variance reflects the difference between the two sequences, improving the accuracy of the model.
CN201911100265.8A 2019-11-12 2019-11-12 Semantic similarity analysis method based on text clustering Pending CN110825877A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911100265.8A CN110825877A (en) 2019-11-12 2019-11-12 Semantic similarity analysis method based on text clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911100265.8A CN110825877A (en) 2019-11-12 2019-11-12 Semantic similarity analysis method based on text clustering

Publications (1)

Publication Number Publication Date
CN110825877A true CN110825877A (en) 2020-02-21

Family

ID=69554249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911100265.8A Pending CN110825877A (en) 2019-11-12 2019-11-12 Semantic similarity analysis method based on text clustering

Country Status (1)

Country Link
CN (1) CN110825877A (en)


Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488725A (en) * 2020-03-15 2020-08-04 复旦大学 Machine intelligent auxiliary root-pricking theoretical coding optimization method
CN111488725B (en) * 2020-03-15 2023-04-07 复旦大学 Machine intelligent auxiliary root-pricking theoretical coding optimization method
CN111898365A (en) * 2020-04-03 2020-11-06 北京沃东天骏信息技术有限公司 Method and device for detecting text
CN111611809A (en) * 2020-05-26 2020-09-01 西藏大学 Chinese sentence similarity calculation method based on neural network
CN111611809B (en) * 2020-05-26 2023-04-18 西藏大学 Chinese sentence similarity calculation method based on neural network
CN113807073B (en) * 2020-06-16 2023-11-14 中国电信股份有限公司 Text content anomaly detection method, device and storage medium
CN113807073A (en) * 2020-06-16 2021-12-17 中国电信股份有限公司 Text content abnormity detection method, device and storage medium
CN111680131A (en) * 2020-06-22 2020-09-18 平安银行股份有限公司 Document clustering method and system based on semantics and computer equipment
CN111680131B (en) * 2020-06-22 2022-08-12 平安银行股份有限公司 Document clustering method and system based on semantics and computer equipment
CN112000802A (en) * 2020-07-24 2020-11-27 南京航空航天大学 Software defect positioning method based on similarity integration
CN112069307A (en) * 2020-08-25 2020-12-11 中国人民大学 Legal law citation information extraction system
CN112256874B (en) * 2020-10-21 2023-08-08 平安科技(深圳)有限公司 Model training method, text classification method, device, computer equipment and medium
CN112256874A (en) * 2020-10-21 2021-01-22 平安科技(深圳)有限公司 Model training method, text classification method, device, computer equipment and medium
CN112347758A (en) * 2020-11-06 2021-02-09 中国平安人寿保险股份有限公司 Text abstract generation method and device, terminal equipment and storage medium
CN113011555A (en) * 2021-02-09 2021-06-22 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN113011555B (en) * 2021-02-09 2023-01-31 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN113111645B (en) * 2021-04-28 2024-02-06 东南大学 Media text similarity detection method
CN113111645A (en) * 2021-04-28 2021-07-13 东南大学 Media text similarity detection method
CN113342928A (en) * 2021-05-07 2021-09-03 上海大学 Method and system for extracting process information from steel material patent text based on improved TextRank algorithm
CN113257060A (en) * 2021-05-13 2021-08-13 张予立 Question answering solving method, device, equipment and storage medium
CN113407717B (en) * 2021-05-28 2022-12-20 数库(上海)科技有限公司 Method, device, equipment and storage medium for eliminating ambiguity of industrial words in news
CN113407717A (en) * 2021-05-28 2021-09-17 数库(上海)科技有限公司 Method, device, equipment and storage medium for eliminating ambiguity of industry words in news
CN113591474B (en) * 2021-07-21 2024-04-05 西北工业大学 Repeated data detection method of Loc2vec model based on weighted fusion
CN113591474A (en) * 2021-07-21 2021-11-02 西北工业大学 Repeated data detection method based on weighted fusion Loc2vec model
CN113535927A (en) * 2021-07-30 2021-10-22 杭州网易智企科技有限公司 Method, medium, device and computing equipment for acquiring similar texts
CN113656548A (en) * 2021-08-18 2021-11-16 福州大学 Text classification model interpretation method and system based on data envelope analysis
CN113656548B (en) * 2021-08-18 2023-08-04 福州大学 Text classification model interpretation method and system based on data envelope analysis
CN115168600A (en) * 2022-06-23 2022-10-11 广州大学 Value chain knowledge discovery method under personalized customization
CN115168600B (en) * 2022-06-23 2023-07-11 广州大学 Value chain knowledge discovery method under personalized customization
CN114969348B (en) * 2022-07-27 2023-10-27 杭州电子科技大学 Electronic file hierarchical classification method and system based on inversion adjustment knowledge base
CN114969348A (en) * 2022-07-27 2022-08-30 杭州电子科技大学 Electronic file classification method and system based on inversion regulation knowledge base
CN115186778A (en) * 2022-09-13 2022-10-14 福建省特种设备检验研究院 Text analysis-based hidden danger identification method and terminal for pressure-bearing special equipment
CN116796754A (en) * 2023-04-20 2023-09-22 浙江浙里信征信有限公司 Visual analysis method and system based on time-varying context semantic sequence pair comparison
CN116166321B (en) * 2023-04-26 2023-06-27 浙江鹏信信息科技股份有限公司 Code clone detection method, system and computer readable storage medium
CN116166321A (en) * 2023-04-26 2023-05-26 浙江鹏信信息科技股份有限公司 Code clone detection method, system and computer readable storage medium
CN117591769A (en) * 2023-12-22 2024-02-23 云尖(北京)软件有限公司 Webpage tamper-proof method and system
CN117591769B (en) * 2023-12-22 2024-04-16 云尖(北京)软件有限公司 Webpage tamper-proof method and system
CN117592562A (en) * 2024-01-18 2024-02-23 卓世未来(天津)科技有限公司 Knowledge base automatic construction method based on natural language processing
CN117592562B (en) * 2024-01-18 2024-04-09 卓世未来(天津)科技有限公司 Knowledge base automatic construction method based on natural language processing
CN117648409A (en) * 2024-01-30 2024-03-05 北京点聚信息技术有限公司 OCR-based format file anti-counterfeiting recognition method
CN117648409B (en) * 2024-01-30 2024-04-05 北京点聚信息技术有限公司 OCR-based format file anti-counterfeiting recognition method

Similar Documents

Publication Publication Date Title
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN109190117B (en) Short text semantic similarity calculation method based on word vector
CN108804421B (en) Text similarity analysis method and device, electronic equipment and computer storage medium
CN109670014B (en) Paper author name disambiguation method based on rule matching and machine learning
CN112256939B (en) Text entity relation extraction method for chemical field
CN107844533A (en) A kind of intelligent Answer System and analysis method
WO2020063071A1 (en) Sentence vector calculation method based on chi-square test, and text classification method and system
CN111930933A (en) Detection case processing method and device based on artificial intelligence
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN111429184A (en) User portrait extraction method based on text information
Bhutada et al. Semantic latent dirichlet allocation for automatic topic extraction
CN114997288A (en) Design resource association method
TW202111569A (en) Text classification method with high scalability and multi-tag and apparatus thereof also providing a method and a device for constructing topic classification templates
CN116304020A (en) Industrial text entity extraction method based on semantic source analysis and span characteristics
CN106815209B (en) Uygur agricultural technical term identification method
Tianxiong et al. Identifying chinese event factuality with convolutional neural networks
CN113139061B (en) Case feature extraction method based on word vector clustering
CN114511027B (en) Method for extracting English remote data through big data network
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
CN113516202A (en) Webpage accurate classification method for CBL feature extraction and denoising
Thilagavathi et al. Document clustering in forensic investigation by hybrid approach
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division
CN113590738A (en) Method for detecting network sensitive information based on content and emotion

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200221