CN110825877A - Semantic similarity analysis method based on text clustering - Google Patents

Semantic similarity analysis method based on text clustering

Info

Publication number
CN110825877A
CN110825877A (application CN201911100265.8A)
Authority
CN
China
Prior art keywords: text, word, semantic, words, clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911100265.8A
Other languages
Chinese (zh)
Inventor
唐昱润
宫法明
马玉辉
司朋举
李昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China
Priority to CN201911100265.8A
Publication of CN110825877A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Abstract

The invention discloses a semantic similarity analysis method based on text clustering, which comprises the following steps: taking unprocessed text data as input, performing word frequency statistics on the preprocessed text, adding the word frequency statistics to text clustering as prior knowledge and providing a posterior judgment criterion, and further using the word frequency statistics in a classifier for a second round of unsupervised clustering, so as to improve the accuracy and timeliness of the text clustering results; performing synonym disambiguation on the processed text, generating a semantic vector fusing context features after semantic role labeling, processing the text sequences with two LSTMs having identical structures and parameters, and adding the product and the variance of the results to amplify the common points and differences of the texts, from which the final similarity analysis result is calculated. The method can be applied to practical text similarity analysis scenarios in many different fields and handles different types of text data well.

Description

Semantic similarity analysis method based on text clustering
Technical Field
The invention belongs to the field of natural language processing, and relates to a semantic similarity analysis method based on text clustering.
Background
Text clustering and semantic similarity detection have long been important research topics in natural language processing. They can automatically and accurately determine text categories, extract semantics, and compare similarity in text data, and are important for processing and applying such data. In recent years, the maturing of the field has been accompanied by rapid growth in the number of reports and scientific achievements, making summarization increasingly important, and similarity analysis methods based on text clustering have received growing attention. Such methods can be divided into two stages, text clustering and text similarity analysis. Current methods generally focus on word frequency information while ignoring the semantic information of keywords and the data structure and context information of the text, yet the semantic and context information of many keywords in a text is helpful for clustering-based similarity analysis.
The problem of efficiently analyzing and detecting text similarity within a specific field has largely been solved, but applying such methods to similarity analysis over a large multi-field text library makes it difficult to obtain similarity quickly and accurately; problems such as high-dimensional feature word vectors, sparse data, omission of low-frequency words, and lack of semantic information remain, and professional terms and synonyms that are ambiguous across fields also affect the similarity analysis results. Although semantic similarity analysis methods based on deep learning can reduce these effects and improve accuracy, their detection time is too long; how to analyze text similarity quickly and efficiently across a wide range of fields has therefore become an urgent problem.
Disclosure of Invention
In order to overcome the defects, the invention provides a semantic similarity analysis method based on text clustering, which comprises the following specific steps:
S1, for an input unprocessed text T, performing data preprocessing by stop-word removal, encoding conversion and Chinese word segmentation, and converting the data into a computable form;
S2, training text word vectors for the segmented words with the Skip-gram and Softmax models so as to calculate the similarity between words;
S3, calculating term frequency and inverse document frequency with the TF-IDF algorithm to obtain TF-IDF values, and extracting the keywords of the detected text;
S4, adding the word frequency statistics as prior knowledge to text clustering, providing a posterior judgment criterion, and preliminarily classifying the texts in the sample library;
S5, adding the extracted keywords as prior knowledge to a classifier, clustering the text data in the sample library on this basis, and refining the previously obtained text categories;
S6, performing morphological analysis, synonym disambiguation and semantic role labeling on the preprocessed text to be detected to generate a semantic vector fusing context features;
S7, inputting the semantic vectors into two LSTMs with identical structures and parameters to process the text sequences, and adding the product and variance of the results to amplify the common points and differences of the texts;
S8, outputting the final result of the text similarity analysis.
For step S3, the word vector Skip-gram model used in the present invention is a Huffman tree constructed based on Hierarchical Softmax; it can predict the occurrence probability of the preceding and following words from large-scale unlabeled text data according to the currently input word, that is, the surrounding words can be predicted from the occurrence probability of the current word. According to the co-occurrence principle of words within a window, the co-occurrence probability between words is calculated based on window sliding, so that the word vector generated for each feature word contains certain text structure information and semantic information. The structure and calculation of the Skip-gram model are as follows:
the Skip-gram model comprises an input layer, a projection layer and an output layer, wherein the input layer is a current characteristic word, and a word vector W of the wordt∈Rm(ii) a Output ofThe purpose of the projection layer is to maximize the value of the objective function L for the probability of a word appearing in the context window of the feature word, given a set of word sequences W1,W2,…,WNThen, then
L = (1/N) Σ_{j=1}^{N} Σ_{-c≤i≤c, i≠0} log P(W_{j+i} | W_j)    (1)
In formula (1), N is the length of the word sequence; c represents the context length of the current feature word, usually taken as 5-10 with good effect; and P(W_{j+1} | W_j) is the probability that, given the occurrence of the current word W_j, its context feature word W_{j+1} occurs. All word vectors obtained through Skip-gram model training form a word vector matrix X ∈ R^{m×n}; with X_i ∈ R^m representing the word vector of feature word i in the m-dimensional space, the similarity between feature words can be measured by the distance between the corresponding word vectors. The Euclidean distance between two vectors is shown in the following formula:
d(W_i, W_j) = ‖x_i − x_j‖_2    (2)
In formula (2), d(W_i, W_j) represents the semantic distance between features i and j, and x_i and x_j are the word vectors corresponding to the feature words W_i and W_j. The smaller the value of d(W_i, W_j), the smaller the semantic distance between the two feature words and the more similar their semantics.
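The following is a minimal sketch of this step, assuming the gensim library as one possible implementation (the disclosure does not name a toolkit): a Skip-gram model with hierarchical Softmax is trained on a toy corpus, and the semantic distance of formula (2) is computed from the resulting word vectors. The corpus, vector size and window width are illustrative assumptions only.

import numpy as np
from gensim.models import Word2Vec

# Toy corpus of segmented texts (each document is a list of words).
corpus = [["text", "clustering", "similarity"],
          ["semantic", "similarity", "analysis"],
          ["text", "semantic", "vector"]]

# sg=1 selects Skip-gram; hs=1 with negative=0 selects hierarchical Softmax.
model = Word2Vec(corpus, vector_size=100, window=5, sg=1, hs=1,
                 negative=0, min_count=1)

def semantic_distance(w_i, w_j):
    # Formula (2): d(W_i, W_j) = ||x_i - x_j||_2
    return np.linalg.norm(model.wv[w_i] - model.wv[w_j])

print(semantic_distance("text", "semantic"))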
For step S5, a conventional K-means algorithm is used to perform the text clustering process. Let the set of texts to be clustered be D = {D_1, D_2, D_3, …, D_n}. The specific flow of the text clustering algorithm based on word vectors and feature semantic distances is as follows:
(1) traversing the input text set, and for each text D_i performing word segmentation, stop-word removal and other operations with a word segmentation tool to obtain the text feature word set S = {S1, S2, S3, …, Sm};
(2) training the large-scale corpus obtained after word segmentation with a Skip-gram model to obtain the word vector of each feature word;
(3) counting and calculating the word frequency, position and word distance information of the feature words of each text;
(4) randomly selecting k texts as initial clustering centroids, calculating the distance between each text and the k clustering centroids using formula (3), and selecting the centroid with the shortest distance as the centroid of the text, where the calculation is as follows:
l(W_i, W_j) = α·tf_ij × idf_i + β·k + γ·g_ij × d(W_i, W_j)    (3)
In formula (3), l(W_i, W_j) represents the multi-feature semantic distance between feature words i and j, d(W_i, W_j) represents the semantic distance between features i and j, tf_ij is the word frequency information, idf_i is the information entropy of the feature word, g_ij represents the weight of the distance from the first to the last occurrence of feature word i in text j, and α, β and γ are the weights of the three different feature types, with α + β + γ = 1;
(5) sequentially calculating the distances from each text sharing a centroid to the other centroids, and selecting the centroid with the shortest distance as the new centroid of the text;
(6) executing steps (4) and (5) in a loop until the centroids no longer change, finally obtaining the clustering result.
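A simplified sketch of the clustering loop in steps (4)-(6) follows. For brevity each text is represented here by a single vector (for example the mean of its feature word vectors), and plain Euclidean distance stands in for the full multi-feature distance of formula (3); substituting formula (3), with chosen weights α, β and γ, into the distance computation would complete the algorithm as described.

import numpy as np

def cluster_texts(doc_vecs, k, max_iters=100, seed=0):
    # Step (4): randomly select k texts as the initial centroids.
    rng = np.random.default_rng(seed)
    centroids = doc_vecs[rng.choice(len(doc_vecs), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps (4)-(5): assign every text to its nearest centroid.
        dists = np.linalg.norm(doc_vecs[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned texts.
        new_centroids = np.array([
            doc_vecs[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # Step (6): stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels

docs = np.random.default_rng(1).normal(size=(10, 8))   # ten toy text vectors
print(cluster_texts(docs, k=3))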
For step S7, the invention uses two LSTM neural networks with identical structures and parameters to process the text sequences. The model consists of a preprocessing layer, an input layer, a training layer and an output layer. Let the input be {x, x'} and the output be y, where l_i is an intermediate hidden layer, W_i is the weight matrix of each layer, and b_i is the bias of each layer, as follows:
l_1 = [x; x'],  l_i = ReLU(W_i·l_{i−1} + b_i), i = 2, 3, 4,  y = Softmax(W_5·l_4 + b_5)    (4)
In formula (4), the activation function of training layers l_2-l_4 is ReLU and the activation function of the output layer y is Softmax; the similarity is divided into 6 classes, so K = 6, and the following are derived:
ReLU(z) = max(0, z)    (5)
Softmax(z_k) = e^{z_k} / Σ_{i=1}^{K} e^{z_i}, k = 1, …, K    (6)
the model adopts a double-sequence structure, each sequence is trained based on LSTM, the model is described in detail as follows, t1 and t2 of texts needing to be trained are input into an ① input layer, the texts are converted into word sequences s1 and s2 through word segmentation, the word sequences are mapped with word vectors in pre-training, the word sequences are converted into word vector sequences v1 and v2, ② the word vector sequences are input into an LSTM model for training to obtain two text vectors which are r1 and r2 respectively, product operation is carried out on r1 and r2 to obtain p, variance operation is carried out on r1 and r2 to obtain q, and finally the four results of r1, r2, p and q are connected together, a ③ output layer puts the connecting vectors into a full-connection layer for calculation, semantic similarity values are obtained finally, compared with the previous model, the model not only comprises a double-sequence processing structure, but also carries out further processing on training results of the text vectors, namely the results of the two sequences are multiplied by the full-connection layer for calculation, the product operation result is obtained, the semantic similarity values are obtained, the two sequences are amplified, and the variance of the two sequences is improved, and the variance of the two sequences is amplified, so that the variance of the.
The semantic similarity analysis method based on text clustering solves the problems, present in the prior art, of large text similarity analysis errors and poor real-time performance in scenarios where multiple fields are interrelated, and has the following advantages:
(1) the method can be applied to a variety of practical scenarios, realizes text clustering and semantic similarity analysis, and forms a general framework for text similarity comparison tasks in concrete application scenarios;
(2) the method makes full use of the semantic information and context information of keywords, improves the text clustering and semantic similarity analysis method, simplifies the subsequent classification network, and can adapt to the input of multiple types of text data;
(3) in the actual scientific research environment where multiple fields intersect, the method improves the accuracy and speed of text clustering and semantic similarity analysis by adopting a semantics-based keyword extraction method combined with semantic role labeling and context information.
Drawings
FIG. 1 is a flow chart of a semantic similarity analysis method based on text clustering according to the present invention.
FIG. 2 is a schematic structural diagram of a semantic similarity analysis model according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
A semantic similarity analysis method based on text clustering is shown in FIG. 1, a flow chart of the method, and comprises the following steps:
s1, preprocessing data by text drying, stop word removing, code conversion and Chinese word segmentation for an input unprocessed text sequence. The original data come from the scientific research results declared by marine oil extraction plants over the years, the scientific research results are divided into 4 categories, and the newly declared scientific research result documents are processed in real time by taking the actual scientific research results in work as a sample library.
S2, word vector training: the text word vectors of the segmented words in the text data to be analyzed are trained with the Skip-gram and Softmax models to calculate the similarity between words; the occurrence probability of the preceding and following words is predicted from large-scale unlabeled text data according to the currently input word, that is, the surrounding words are predicted from the occurrence probability of the current word. According to the co-occurrence principle of words within a window, the co-occurrence probability between words is calculated based on window sliding, so that the word vector generated for each feature word contains certain text structure information and semantic information.
S3, word frequency statistics: the word frequency statistics method adopted by the invention is TF-IDF, a common weighting technique used in information retrieval and text mining.
TF-IDF is a statistical method for evaluating the importance of a word to a document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in a document, but decreases in inverse proportion to its frequency in the corpus. If a word has a high frequency TF in one article and rarely appears in other articles, the word or phrase is considered to have good category distinguishing ability and is suitable for classification;
TF (term frequency) is the number of times a word appears in an article; it is usually normalized (typically the word count divided by the total number of words in the article) to prevent bias towards long documents, as shown in formula (7):
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (7)
where n_{i,j} is the number of occurrences of word t_i in document d_j.
the IDF (inverse Document frequency) is an inverse text frequency index, and if the number of documents containing the keywords is less, the keywords are proved to have good category distinguishing capability. The IDF for a keyword can be obtained by dividing the total number of articles by the number of articles containing the keyword, and then taking the logarithm of the result.
idf_i = log( |D| / |{j : t_i ∈ d_j}| )    (8)
In formula (8), |D| is the total number of documents in the corpus, and |{j : t_i ∈ d_j}| denotes the number of documents containing the word t_i. A high word frequency within a particular document combined with a low document frequency of that word across the whole collection yields a high TF-IDF weight. TF-IDF is therefore used to filter out common words in the scientific research results and retain the important words.
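As a worked illustration of formulas (7) and (8), the following minimal sketch computes TF-IDF weights from scratch for a toy corpus of segmented documents; the example documents are illustrative assumptions.

import math
from collections import Counter

def tf_idf(corpus):
    # corpus: list of documents, each a list of segmented words.
    n_docs = len(corpus)
    df = Counter()                       # document frequency of each word
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)                 # normalization of formula (7)
        weights.append({w: (c / total) * math.log(n_docs / df[w])  # tf x idf, formula (8)
                        for w, c in counts.items()})
    return weights

docs = [["text", "clustering", "method"],
        ["text", "similarity"],
        ["clustering", "similarity", "analysis"]]
print(tf_idf(docs)[0])                   # TF-IDF weights of the first document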
S4, text classification: the word frequency statistics are added to text clustering as prior knowledge, a posterior judgment criterion is provided, and the texts in the sample library are preliminarily classified. The word frequency information counted in S3 is added to a classifier as prior knowledge to preliminarily classify the documents to be compared in the sample library, and the classification results are judged against the posterior judgment criterion so as to improve the accuracy of text classification as much as possible.
S5, text clustering: the method adopted for text clustering is a text clustering algorithm based on word vectors and feature semantic distances. The scientific research results declared by the offshore oil production plant over the years serve as the text set to be clustered; the text set is traversed, and each text D_i is processed with a word segmentation tool (word segmentation, stop-word removal and other operations) to obtain the text feature word set. The large-scale corpus obtained after word segmentation is trained with the Skip-gram model to obtain the word vector of each feature word, and the word frequency, position and word distance information of the feature words is then counted and calculated for each text. k texts are randomly selected as initial clustering centroids, the distance between each text and the k clustering centroids is calculated, and the centroid with the shortest distance is selected as the centroid of the text. The distances from each text sharing a centroid to the other centroids are calculated in turn, the centroid with the shortest distance is selected as the new centroid of the text, and centroid selection and shortest-distance calculation are repeated in a loop until the centroids no longer change.
S6, semantic vector generation: the semantic similarity of the text can be studied from multiple textual features, including the influence of stop words on Chinese word segmentation, morphological analysis, semantic similarity calculation, synonym disambiguation, fusion of context features into the semantic vector, and neural-network-based text structure prediction, so that the similarity of text data can be analyzed more accurately. The process can be roughly divided into four steps: predicate labeling, preprocessing, semantic role labeling and semantic role classification. Predicate labeling identifies verb predicates in sentences and assigns word senses to them. The preprocessing stage mainly prunes the dependency tree, deleting the relation nodes on the dependency tree that are least likely to bear a role of the predicate, thereby eliminating unnecessary structural information and effectively reducing the number of instances input to the classifier. Key features that determine the performance of the semantic role labeling system are then extracted and fused with the context features to generate the semantic vector.
S7, semantic similarity analysis: the semantic vectors are input into two LSTMs with identical structures and parameters to process the text sequences, the product and variance of the results are added, and the common points and differences of the texts are amplified; the model structure is shown in FIG. 2. The vector sequences are input into the LSTM model for training to obtain two text vectors. To improve the sensitivity and accuracy of the model, the product operation amplifies the identical parts of the two sequences and reduces the opposite parts, improving sensitivity, while the variance reflects the difference between the two sequences, improving accuracy. The product operation is performed on the two text vectors, then the variance operation; finally the two text vectors, the product result and the variance result are concatenated and output to a fully connected layer for calculation, finally obtaining the semantic similarity value.
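A toy numeric illustration of why this merge amplifies the common points and differences (the values are arbitrary): in the product, dimensions where the two text vectors agree stay positive while opposed dimensions turn negative, and in the squared-difference ("variance") term only the differing dimensions are nonzero.

import numpy as np

r1 = np.array([0.9,  0.8, -0.7])
r2 = np.array([0.9, -0.8, -0.7])
p = r1 * r2           # [ 0.81, -0.64,  0.49]: agreement stays positive, opposition goes negative
q = (r1 - r2) ** 2    # [ 0.  ,  2.56,  0.  ]: only the differing dimension is nonzero
features = np.concatenate([r1, r2, p, q])   # the vector handed to the fully connected layer
print(p, q)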
S8, output: the categories of the sample library to be compared, namely the text clustering result, and the semantic similarity analysis result of the text to be analyzed are output.
In summary, the semantic similarity analysis method based on text clustering of the present invention performs fast semantic similarity analysis on text data in actual scenarios; it can be applied to a variety of practical scenarios, performs semantic similarity analysis well on the sample library formed for each scenario, and is applicable to many fields.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (4)

1. A semantic similarity analysis method based on text clustering is characterized by comprising the following specific steps:
S1, for an input unprocessed text T, performing data preprocessing by stop-word removal, encoding conversion and Chinese word segmentation, and converting the data into a computable form;
S2, training text word vectors for the segmented words with the Skip-gram and Softmax models so as to calculate the similarity between words;
S3, calculating term frequency and inverse document frequency with the TF-IDF algorithm to obtain TF-IDF values, and extracting the keywords of the detected text;
S4, adding the word frequency statistics as prior knowledge to text clustering, providing a posterior judgment criterion, and preliminarily classifying the texts in the sample library;
S5, adding the extracted keywords as prior knowledge to a classifier, clustering the text data in the sample library on this basis, and refining the previously obtained text categories;
S6, performing morphological analysis, synonym disambiguation and semantic role labeling on the preprocessed text to be detected to generate a semantic vector fusing context features;
S7, inputting the semantic vectors into two LSTMs with identical structures and parameters to process the text sequences, and adding the product and variance of the results to amplify the common points and differences of the texts;
S8, outputting the final result of the text similarity analysis.
2. The semantic similarity analysis method based on text clustering according to claim 1, characterized in that, for step S3, the word vector Skip-gram model used by the invention is a Huffman tree constructed based on Hierarchical Softmax, which can predict the occurrence probability of the preceding and following words from large-scale unlabeled text data according to the currently input word, that is, the surrounding words can be predicted from the occurrence probability of the current word; according to the co-occurrence principle of words within a window, the co-occurrence probability between words is calculated based on window sliding, so that the word vector generated for each feature word contains certain text structure information and semantic information, wherein the structure and calculation of the Skip-gram model are as follows:
the Skip-gram model comprises an input layer, a projection layer and an output layer, wherein the input layer is the current feature word with word vector W_t ∈ R^m, the output is the probability that a word appears in the context window of the feature word, and the purpose of the projection layer is to maximize the value of the objective function L; given a word sequence W_1, W_2, …, W_N, then
L = (1/N) Σ_{j=1}^{N} Σ_{-c≤i≤c, i≠0} log P(W_{j+i} | W_j)    (1)
in formula (1), N is the length of the word sequence, c represents the context length of the current feature word (generally 5-10, with good effect), and P(W_{j+1} | W_j) is the probability that, given the occurrence of the current word W_j, its context feature word W_{j+1} occurs; all word vectors obtained through Skip-gram model training form a word vector matrix X ∈ R^{m×n}, and with X_i ∈ R^m representing the word vector of feature word i in the m-dimensional space, the similarity between feature words can be measured by the distance between the corresponding word vectors; the Euclidean distance between two vectors is shown in formula (2):
d(W_i, W_j) = ‖x_i − x_j‖_2    (2)
in formula (2), d(W_i, W_j) represents the semantic distance between features i and j, x_i and x_j are the word vectors corresponding to the feature words W_i and W_j, and the smaller the value of d(W_i, W_j), the smaller the semantic distance between the two feature words and the more similar their semantics.
3. The semantic similarity analysis method according to claim 1, characterized in that for step S5, a conventional K-means algorithm is used to perform the text clustering process; let the set of texts to be clustered be D = {D_1, D_2, D_3, …, D_n}; the specific flow of the text clustering algorithm based on word vectors and feature semantic distances is as follows:
(1) traversing the input text set, and for each text D_i performing word segmentation, stop-word removal and other operations with a word segmentation tool to obtain the text feature word set S = {S1, S2, S3, …, Sm};
(2) training the large-scale corpus obtained after word segmentation with a Skip-gram model to obtain the word vector of each feature word;
(3) counting and calculating the word frequency, position and word distance information of the feature words of each text;
(4) randomly selecting k texts as initial clustering centroids, calculating the distance between each text and the k clustering centroids using formula (3), and selecting the centroid with the shortest distance as the centroid of the text, where the calculation is as follows:
l(W_i, W_j) = α·tf_ij × idf_i + β·k + γ·g_ij × d(W_i, W_j)    (3)
in formula (3), l(W_i, W_j) represents the multi-feature semantic distance between the feature words i and j, d(W_i, W_j) represents the semantic distance between features i and j, tf_ij is the word frequency information, idf_i is the information entropy of the feature word, g_ij represents the weight of the distance from the first to the last occurrence of feature word i in text j, and α, β and γ are the weights of the three different feature types, with α + β + γ = 1;
(5) sequentially calculating the distances from each text sharing a centroid to the other centroids, and selecting the centroid with the shortest distance as the new centroid of the text;
(6) executing steps (4) and (5) in a loop until the centroids no longer change, finally obtaining the clustering result.
4. The semantic similarity analysis method based on text clustering according to claim 1, characterized in that for step S7, the invention uses two LSTM neural networks with identical structures and parameters to process the text sequences; the model consists of a preprocessing layer, an input layer, a training layer and an output layer; let the input be {x, x'} and the output be y, where l_i is an intermediate hidden layer, W_i is the weight matrix of each layer, and b_i is the bias of each layer, as follows:
l_1 = [x; x'],  l_i = ReLU(W_i·l_{i−1} + b_i), i = 2, 3, 4,  y = Softmax(W_5·l_4 + b_5)    (4)
in formula (4), the activation function of training layers l_2-l_4 is ReLU and the activation function of the output layer y is Softmax; the similarity is divided into 6 classes, so K = 6, and the following are derived:
ReLU(z) = max(0, z)    (5)
Softmax(z_k) = e^{z_k} / Σ_{i=1}^{K} e^{z_i}, k = 1, …, K    (6)
the model adopts a dual-sequence structure and each sequence is trained based on an LSTM; in detail: (1) the texts t1 and t2 to be trained are fed to the input layer, converted into word sequences s1 and s2 by word segmentation, mapped to the pre-trained word vectors, and thereby converted into word vector sequences v1 and v2; (2) the word vector sequences are input into the LSTM model for training to obtain two text vectors r1 and r2; a product operation is first performed on r1 and r2 to obtain p, a variance operation is then performed on r1 and r2 to obtain q, and finally the four results r1, r2, p and q are concatenated; (3) the output layer feeds the concatenated vector into a fully connected layer for calculation, finally obtaining the semantic similarity value; compared with earlier models, this model not only contains a dual-sequence processing structure but also further processes the training results of the text vectors: the product operation amplifies the identical parts of the two sequences and reduces the opposite parts, improving the sensitivity of the model, while the variance reflects the difference between the two sequences, improving the accuracy of the model.
CN201911100265.8A 2019-11-12 2019-11-12 Semantic similarity analysis method based on text clustering Pending CN110825877A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911100265.8A CN110825877A (en) 2019-11-12 2019-11-12 Semantic similarity analysis method based on text clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911100265.8A CN110825877A (en) 2019-11-12 2019-11-12 Semantic similarity analysis method based on text clustering

Publications (1)

Publication Number Publication Date
CN110825877A true CN110825877A (en) 2020-02-21

Family

ID=69554249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911100265.8A Pending CN110825877A (en) 2019-11-12 2019-11-12 Semantic similarity analysis method based on text clustering

Country Status (1)

Country Link
CN (1) CN110825877A (en)


Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488725A (en) * 2020-03-15 2020-08-04 复旦大学 Machine intelligent auxiliary root-pricking theoretical coding optimization method
CN111488725B (en) * 2020-03-15 2023-04-07 复旦大学 Machine intelligent auxiliary root-pricking theoretical coding optimization method
CN111898365A (en) * 2020-04-03 2020-11-06 北京沃东天骏信息技术有限公司 Method and device for detecting text
CN111611809A (en) * 2020-05-26 2020-09-01 西藏大学 Chinese sentence similarity calculation method based on neural network
CN111611809B (en) * 2020-05-26 2023-04-18 西藏大学 Chinese sentence similarity calculation method based on neural network
CN113807073B (en) * 2020-06-16 2023-11-14 中国电信股份有限公司 Text content anomaly detection method, device and storage medium
CN113807073A (en) * 2020-06-16 2021-12-17 中国电信股份有限公司 Text content abnormity detection method, device and storage medium
CN111680131A (en) * 2020-06-22 2020-09-18 平安银行股份有限公司 Document clustering method and system based on semantics and computer equipment
CN111680131B (en) * 2020-06-22 2022-08-12 平安银行股份有限公司 Document clustering method and system based on semantics and computer equipment
CN112000802A (en) * 2020-07-24 2020-11-27 南京航空航天大学 Software defect positioning method based on similarity integration
CN112069307A (en) * 2020-08-25 2020-12-11 中国人民大学 Legal law citation information extraction system
CN112256874B (en) * 2020-10-21 2023-08-08 平安科技(深圳)有限公司 Model training method, text classification method, device, computer equipment and medium
CN112256874A (en) * 2020-10-21 2021-01-22 平安科技(深圳)有限公司 Model training method, text classification method, device, computer equipment and medium
CN112347758A (en) * 2020-11-06 2021-02-09 中国平安人寿保险股份有限公司 Text abstract generation method and device, terminal equipment and storage medium
CN113011555A (en) * 2021-02-09 2021-06-22 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN113011555B (en) * 2021-02-09 2023-01-31 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN113111645B (en) * 2021-04-28 2024-02-06 东南大学 Media text similarity detection method
CN113111645A (en) * 2021-04-28 2021-07-13 东南大学 Media text similarity detection method
CN113342928A (en) * 2021-05-07 2021-09-03 上海大学 Method and system for extracting process information from steel material patent text based on improved TextRank algorithm
CN113257060A (en) * 2021-05-13 2021-08-13 张予立 Question answering solving method, device, equipment and storage medium
CN113407717B (en) * 2021-05-28 2022-12-20 数库(上海)科技有限公司 Method, device, equipment and storage medium for eliminating ambiguity of industrial words in news
CN113407717A (en) * 2021-05-28 2021-09-17 数库(上海)科技有限公司 Method, device, equipment and storage medium for eliminating ambiguity of industry words in news
CN113591474B (en) * 2021-07-21 2024-04-05 西北工业大学 Repeated data detection method of Loc2vec model based on weighted fusion
CN113591474A (en) * 2021-07-21 2021-11-02 西北工业大学 Repeated data detection method based on weighted fusion Loc2vec model
CN113535927A (en) * 2021-07-30 2021-10-22 杭州网易智企科技有限公司 Method, medium, device and computing equipment for acquiring similar texts
CN113656548A (en) * 2021-08-18 2021-11-16 福州大学 Text classification model interpretation method and system based on data envelope analysis
CN113656548B (en) * 2021-08-18 2023-08-04 福州大学 Text classification model interpretation method and system based on data envelope analysis
CN115168600A (en) * 2022-06-23 2022-10-11 广州大学 Value chain knowledge discovery method under personalized customization
CN115168600B (en) * 2022-06-23 2023-07-11 广州大学 Value chain knowledge discovery method under personalized customization
CN114969348B (en) * 2022-07-27 2023-10-27 杭州电子科技大学 Electronic file hierarchical classification method and system based on inversion adjustment knowledge base
CN114969348A (en) * 2022-07-27 2022-08-30 杭州电子科技大学 Electronic file classification method and system based on inversion regulation knowledge base
CN115186778A (en) * 2022-09-13 2022-10-14 福建省特种设备检验研究院 Text analysis-based hidden danger identification method and terminal for pressure-bearing special equipment
CN116796754A (en) * 2023-04-20 2023-09-22 浙江浙里信征信有限公司 Visual analysis method and system based on time-varying context semantic sequence pair comparison
CN116166321B (en) * 2023-04-26 2023-06-27 浙江鹏信信息科技股份有限公司 Code clone detection method, system and computer readable storage medium
CN116166321A (en) * 2023-04-26 2023-05-26 浙江鹏信信息科技股份有限公司 Code clone detection method, system and computer readable storage medium
CN117591769A (en) * 2023-12-22 2024-02-23 云尖(北京)软件有限公司 Webpage tamper-proof method and system
CN117591769B (en) * 2023-12-22 2024-04-16 云尖(北京)软件有限公司 Webpage tamper-proof method and system
CN117592562A (en) * 2024-01-18 2024-02-23 卓世未来(天津)科技有限公司 Knowledge base automatic construction method based on natural language processing
CN117592562B (en) * 2024-01-18 2024-04-09 卓世未来(天津)科技有限公司 Knowledge base automatic construction method based on natural language processing
CN117648409A (en) * 2024-01-30 2024-03-05 北京点聚信息技术有限公司 OCR-based format file anti-counterfeiting recognition method
CN117648409B (en) * 2024-01-30 2024-04-05 北京点聚信息技术有限公司 OCR-based format file anti-counterfeiting recognition method

Similar Documents

Publication Publication Date Title
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN109190117B (en) Short text semantic similarity calculation method based on word vector
CN108804421B (en) Text similarity analysis method and device, electronic equipment and computer storage medium
CN109670014B (en) Paper author name disambiguation method based on rule matching and machine learning
CN112256939B (en) Text entity relation extraction method for chemical field
CN107844533A (en) A kind of intelligent Answer System and analysis method
WO2020063071A1 (en) Sentence vector calculation method based on chi-square test, and text classification method and system
CN111930933A (en) Detection case processing method and device based on artificial intelligence
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN111429184A (en) User portrait extraction method based on text information
Bhutada et al. Semantic latent dirichlet allocation for automatic topic extraction
CN114997288A (en) Design resource association method
TW202111569A (en) Text classification method with high scalability and multi-tag and apparatus thereof also providing a method and a device for constructing topic classification templates
CN116304020A (en) Industrial text entity extraction method based on semantic source analysis and span characteristics
CN106815209B (en) Uygur agricultural technical term identification method
Tianxiong et al. Identifying chinese event factuality with convolutional neural networks
CN113139061B (en) Case feature extraction method based on word vector clustering
CN114511027B (en) Method for extracting English remote data through big data network
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
CN113516202A (en) Webpage accurate classification method for CBL feature extraction and denoising
Thilagavathi et al. Document clustering in forensic investigation by hybrid approach
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division
CN113590738A (en) Method for detecting network sensitive information based on content and emotion

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200221