CN110825877A - Semantic similarity analysis method based on text clustering - Google Patents
- Publication number
- CN110825877A CN110825877A CN201911100265.8A CN201911100265A CN110825877A CN 110825877 A CN110825877 A CN 110825877A CN 201911100265 A CN201911100265 A CN 201911100265A CN 110825877 A CN110825877 A CN 110825877A
- Authority
- CN
- China
- Prior art keywords
- text
- word
- semantic
- words
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
Abstract
The invention discloses a semantic similarity analysis method based on text clustering, comprising the following steps: unprocessed text data is taken as input, word frequency statistics are computed on the preprocessed text, the word frequency statistics are added to text clustering as prior knowledge and a posterior judgment criterion is provided, and the statistics are also fed to a classifier so that unsupervised clustering can be performed once more on this basis, improving the accuracy and timeliness of the text clustering results; the processed text then undergoes synonym disambiguation and semantic role labeling to generate a semantic vector fusing context features, two LSTMs with identical structures and parameters process the text sequences, and the product and variance of their results are added to amplify the common points and differences of the texts, from which the final similarity analysis result is calculated. The method can be applied to practical text-similarity-analysis scenarios in many different fields and handles different types of text data well.
Description
Technical Field
The invention belongs to the field of natural language processing, and relates to a semantic similarity analysis method based on text clustering.
Background
Text clustering and semantic similarity detection have long been important research topics in natural language processing: they automatically and accurately determine text categories, extract semantics, and compare similarity within text data, and are important for processing and applying such data. In recent years, progress in the field has been accompanied by a rapid increase in the number of reports and scientific publications, making summarization increasingly important, and similarity analysis methods based on text clustering have drawn growing attention. Such methods can be divided into two stages, text clustering and text similarity analysis. Current methods mostly focus on word frequency information and ignore the semantic information of keywords as well as the data structure and context information of the text, even though the semantic and context information of many keywords in a text is helpful for clustering-based similarity analysis.
Efficient similarity analysis and detection within a single specific field is largely a solved problem, but applying such methods to similarity analysis over a large, multi-field text library makes it hard to obtain similarity quickly and accurately: feature word vectors are high-dimensional, data is sparse, low-frequency words are missed, and semantic information is lacking, while professional terms and synonyms that become ambiguous across fields also distort the similarity results. Deep-learning-based semantic similarity analysis can reduce these effects and improve accuracy, but its detection time is too long; how to analyze text similarity quickly and efficiently across a wide range of fields has therefore become an urgent problem.
Disclosure of Invention
In order to overcome these defects, the invention provides a semantic similarity analysis method based on text clustering, with the following specific steps:
S1, for an input unprocessed text T, perform data preprocessing by removing stop words, code conversion and Chinese word segmentation, converting the data into a computable form;
S2, train text word vectors for the segmented words with Skip-gram and Softmax models to calculate the similarity between words;
S3, calculate term frequency and inverse document frequency with the TF-IDF algorithm, obtain the TF-IDF value, and extract the keywords of the detected text;
S4, add the word frequency statistics to text clustering as prior knowledge, provide a posterior judgment criterion, and perform a preliminary classification of the texts in the sample library;
S5, add the extracted keywords to a classifier as prior knowledge, cluster the text data in the sample library on this basis, and refine the previously obtained text categories;
S6, perform morphological analysis, synonym disambiguation and semantic role labeling on the preprocessed text to be detected, generating a semantic vector that fuses context features;
S7, input the semantic vectors into two LSTMs with identical structures and parameters to process the text sequences, add the product and variance of the results, and amplify the common points and differences of the texts;
S8, output the final result of the text similarity analysis.
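As an illustration only, the S1 preprocessing (stop-word removal, normalization and segmentation) might be sketched as follows; the tiny stop-word list and the character-level fallback for Chinese are assumptions made for the example, not the tooling the invention prescribes (a production pipeline would use a dedicated Chinese word segmenter and a full stop-word lexicon):

```python
import re

# Illustrative stop-word list; a real pipeline loads a full lexicon.
STOP_WORDS = {"的", "了", "the", "of", "is"}

def preprocess(text):
    """S1 sketch: lowercase, strip punctuation/codes, segment, drop stop words.
    CJK text is split per character here only as a crude stand-in for a
    real Chinese word segmenter."""
    tokens = re.findall(r"[\u4e00-\u9fff]|[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The clustering of 文本 is fast."))
```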
For step S3, the word vector Skip-gram model used in the invention is built on a Huffman tree constructed for Hierarchical Softmax; it predicts, from large-scale unlabeled text data, the probability of the surrounding words appearing given the currently input word. According to the co-occurrence principle of words within a window, the co-occurrence probability between words is calculated by window sliding, so that the word vector generated for each feature word contains a certain amount of text structure and semantic information. The structure and calculation of the Skip-gram model are as follows:
the Skip-gram model comprises an input layer, a projection layer and an output layer, wherein the input layer is a current characteristic word, and a word vector W of the wordt∈Rm(ii) a Output ofThe purpose of the projection layer is to maximize the value of the objective function L for the probability of a word appearing in the context window of the feature word, given a set of word sequences W1,W2,…,WNThen, then
In formula (1), N is the length of the word sequence, c is the context window length of the current feature word (values of 5–10 generally work well), and P(W_{j+1} | W_j) is the probability that the context feature word W_{j+1} appears given that the current word W_j has appeared. All word vectors obtained by Skip-gram training form a word vector matrix X ∈ R^(m×n); with X_i ∈ R^m denoting the word vector of feature word i in the m-dimensional space, the similarity between feature words can be measured by the distance between the corresponding word vectors. The Euclidean distance between two vectors is given by:
d(W_i, W_j) = ‖x_i − x_j‖_2  (2)
In formula (2), d(W_i, W_j) is the semantic distance between feature words i and j, and x_i and x_j are the word vectors corresponding to W_i and W_j. The smaller the value of d(W_i, W_j), the smaller the semantic distance between the two feature words and the more similar their semantics.
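Formula (2) is straightforward to compute; the sketch below uses toy 3-dimensional vectors standing in for Skip-gram-trained word vectors (real vectors would have hundreds of dimensions, and the example words and values are invented):

```python
import math

def semantic_distance(xi, xj):
    """Formula (2): Euclidean distance between two feature-word vectors;
    a smaller distance means more similar semantics."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

# Toy stand-ins for Skip-gram word vectors.
vectors = {"ocean": [1.0, 0.0, 2.0], "sea": [1.0, 1.0, 2.0], "drill": [5.0, 4.0, 0.0]}

print(semantic_distance(vectors["ocean"], vectors["sea"]))    # 1.0
print(semantic_distance(vectors["ocean"], vectors["drill"]))  # 6.0
```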
For step S5, a conventional K-means algorithm performs the text clustering; the texts to be clustered form the set D = {D_1, D_2, D_3, …, D_n}. The specific flow of the text clustering algorithm based on word vectors and feature semantic distances is as follows:
(1) traverse the input text set and, for each text D_i, use a segmentation tool to perform word segmentation, stop-word removal and related operations, obtaining the text feature word set S = {S1, S2, S3, …, Sm};
(2) training a large-scale corpus obtained after word segmentation by using a Skip-gram model to obtain a word vector of each feature word;
(3) counting and calculating the word frequency, position and word distance information of the characteristic words of each text;
(4) randomly select k texts as initial clustering centroids, calculate the distance between each text and the k centroids using formula (3), and assign each text to the centroid with the shortest distance; the calculation is as follows:
l(W_i, W_j) = α·tf_ij × idf_i + β·k + γ·g_ij × d(W_i, W_j)  (3)
In formula (3), l(W_i, W_j) is the multi-feature semantic distance between feature words i and j, d(W_i, W_j) is the semantic distance between features i and j, tf_ij is the word frequency information, idf_i is the information entropy of the feature word, and g_ij is the weight of the span from the first to the last occurrence of feature word i in text j; α, β and γ are the weights of the three feature types, with α + β + γ = 1;
(5) sequentially calculating the distances from the text with the same centroid to other centroids, and selecting the centroid with the shortest distance as a new centroid of the text;
(6) repeat steps (4) and (5) until the centroids no longer change, yielding the final clustering result.
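The loop in steps (4)–(6) can be sketched as below. For brevity each text is reduced to a single scalar feature and the distance is a plain absolute difference standing in for the multi-feature distance of formula (3); a centroid is re-chosen as the member text minimizing total distance to its cluster, since the centroids here are texts rather than mean vectors. All names and toy values are illustrative.

```python
import random

def cluster_texts(texts, k, distance, iters=100):
    """Steps (4)-(6): assign each text to its nearest centroid text,
    re-pick centroids, repeat until assignments stop changing."""
    random.seed(0)                      # deterministic toy initialization
    centroids = random.sample(texts, k)
    assignment = None
    for _ in range(iters):
        new = [min(range(k), key=lambda c: distance(t, centroids[c])) for t in texts]
        if new == assignment:           # step (6): assignments no longer change
            return assignment
        assignment = new
        for c in range(k):              # step (5): member minimizing total distance
            members = [t for t, a in zip(texts, assignment) if a == c]
            if members:
                centroids[c] = min(members, key=lambda m: sum(distance(m, o) for o in members))
    return assignment

feats = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]   # two obvious groups
print(cluster_texts(feats, 2, distance=lambda a, b: abs(a - b)))
```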
For step S7, the invention uses two LSTM neural networks with identical structures and parameters to process the text sequences. The model consists of a preprocessing layer, an input layer, a training layer and an output layer. Let the input be {x, x′} and the output be y, with l_i an intermediate hidden layer, W_i the weight matrix of each layer and b_i the bias of each layer, as follows:
in equation (4), the activation function of training layers l2-l4 is ReLU, the activation function of output layer y is Softmax, the similarity is classified into 6 classes, so K equals 6, and the following is deduced:
the model adopts a double-sequence structure, each sequence is trained based on LSTM, the model is described in detail as follows, t1 and t2 of texts needing to be trained are input into an ① input layer, the texts are converted into word sequences s1 and s2 through word segmentation, the word sequences are mapped with word vectors in pre-training, the word sequences are converted into word vector sequences v1 and v2, ② the word vector sequences are input into an LSTM model for training to obtain two text vectors which are r1 and r2 respectively, product operation is carried out on r1 and r2 to obtain p, variance operation is carried out on r1 and r2 to obtain q, and finally the four results of r1, r2, p and q are connected together, a ③ output layer puts the connecting vectors into a full-connection layer for calculation, semantic similarity values are obtained finally, compared with the previous model, the model not only comprises a double-sequence processing structure, but also carries out further processing on training results of the text vectors, namely the results of the two sequences are multiplied by the full-connection layer for calculation, the product operation result is obtained, the semantic similarity values are obtained, the two sequences are amplified, and the variance of the two sequences is improved, and the variance of the two sequences is amplified, so that the variance of the.
The semantic similarity analysis method based on text clustering solves the prior-art problems of large text-similarity analysis error and poor real-time performance in scenarios where multiple fields are interrelated, and has the following advantages:
(1) the method can be applied to various practical scenes, realizes the method for text clustering and semantic similarity analysis, and forms a general framework of a text similarity comparison task in a specific practical application scene;
(2) the method can fully utilize the semantic information and the context information of the keywords, improves the text clustering and semantic similarity analysis method, simplifies the subsequent classification network, and can adapt to the input of multi-type text data;
(3) in an actual scientific research environment with multiple intersecting fields, the method improves the accuracy and speed of text clustering and semantic similarity analysis by adopting semantic keyword extraction combined with semantic role labeling and context information.
Drawings
FIG. 1 is a flow chart of a semantic similarity analysis method based on text clustering according to the present invention.
FIG. 2 is a schematic structural diagram of a semantic similarity analysis model according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
a semantic similarity analysis method based on text clustering is shown in FIG. 1, which is a flow chart of the semantic similarity analysis method based on text clustering, and the method comprises the following steps:
s1, preprocessing data by text drying, stop word removing, code conversion and Chinese word segmentation for an input unprocessed text sequence. The original data come from the scientific research results declared by marine oil extraction plants over the years, the scientific research results are divided into 4 categories, and the newly declared scientific research result documents are processed in real time by taking the actual scientific research results in work as a sample library.
S2, word vector training: train text word vectors for the segmented words in the text data to be analyzed using Skip-gram and Softmax models to calculate the similarity between words, predicting the occurrence probability of surrounding words from large-scale unlabeled text data given the currently input word. According to the co-occurrence principle of words within a window, the co-occurrence probability between words is calculated by window sliding, so that the word vector generated for each feature word contains a certain amount of text structure and semantic information.
S3, word frequency statistics: the word frequency statistics method adopted by the invention is TF-IDF, a common weighting technique in information retrieval and text mining.
TF-IDF is a statistical method for evaluating the importance of a word to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency in the corpus. If a word has a high frequency TF in one article and rarely appears in other articles, it is considered to have good category-distinguishing ability and to be suitable for classification;
TF (term frequency) is the number of times a word appears in an article, usually normalized (typically the word count divided by the total number of words in the article) to prevent a bias toward long documents, as shown in equation (7):

tf_ij = n_ij / Σ_k n_kj  (7)

where n_ij is the number of occurrences of word t_i in document d_j and the denominator is the total number of words in d_j.
the IDF (inverse Document frequency) is an inverse text frequency index, and if the number of documents containing the keywords is less, the keywords are proved to have good category distinguishing capability. The IDF for a keyword can be obtained by dividing the total number of articles by the number of articles containing the keyword, and then taking the logarithm of the result.
In equation (8), |D| is the total number of documents in the corpus and |{j : t_i ∈ d_j}| is the number of documents containing the word t_i. A high word frequency within a particular document combined with a low document frequency across the collection yields a high TF-IDF weight. TF-IDF is therefore used to filter out common words in the scientific research results and retain the important ones.
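Under the definitions of equations (7) and (8), the TF-IDF weight of step S3 can be sketched with the standard library; the toy corpus below is invented for illustration, and the functions assume the queried term occurs in at least one document:

```python
import math

def tf(term, doc):
    """Equation (7): term count normalized by document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Equation (8): log of total documents over documents containing the term."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

corpus = [["oil", "well", "drilling"], ["oil", "price"], ["semantic", "clustering"]]

# "drilling" is rarer across the corpus than "oil", so it gets a higher weight.
print(tf("oil", corpus[0]) * idf("oil", corpus))
print(tf("drilling", corpus[0]) * idf("drilling", corpus))
```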
S4, text classification: add the word frequency statistics to text clustering as prior knowledge and provide a posterior judgment criterion to perform a preliminary classification of the texts in the sample library. The word frequency information counted in S3 is added to a classifier as prior knowledge to preliminarily classify the documents to be compared, and the classification results are judged against the posterior criterion to make the classification as accurate as possible.
S5, text clustering: the invention clusters text with the algorithm based on word vectors and feature semantic distances. The scientific research results declared over the years by offshore oil production plants form the text set to be clustered; the set is traversed, and each text D_i is segmented and stripped of stop words with a segmentation tool to obtain its text feature word set. The large-scale corpus obtained after segmentation is trained with the Skip-gram model to obtain the word vector of each feature word, and the word frequency, position and word-distance information of the feature words is then counted for each text. Next, n texts are randomly selected as initial clustering centroids, the distance between each text and the n centroids is calculated, and the centroid with the shortest distance is selected as the text's centroid. The distances from texts sharing a centroid to the other centroids are then calculated in turn, the centroid with the shortest distance is selected as the text's new centroid, and the centroid selection and shortest-distance calculation are repeated until the centroids no longer change.
S6, semantic vector generation: the semantic similarity of text can be studied from multiple textual features, including the influence of stop words on Chinese word segmentation, morphological analysis, semantic similarity calculation, synonym disambiguation, fusing context features into the semantic vector, and adding neural-network-based text structure prediction, so that the similarity of text data can be analyzed more accurately. The process can be roughly divided into four steps: predicate marking, preprocessing, semantic role labeling and semantic role classification. Predicate marking identifies verb predicates in sentences and assigns word senses to them. The preprocessing stage mainly prunes the dependency tree, deleting the relation nodes least likely to bear a semantic role with respect to the predicate, thereby eliminating unnecessary structural information and effectively reducing the number of instances input to the classifier. Key features that determine the performance of the semantic role labeling system are then extracted and fused with the context features to generate the semantic vector.
S7, semantic similarity analysis: the semantic vectors are input into two LSTMs with identical structures and parameters to process the text sequences, the product and variance of the results are added, and the common points and differences of the texts are amplified; the model structure is shown in FIG. 2. The vector sequences are input into the LSTM model for training to obtain two text vectors. To improve the model's sensitivity and accuracy, the product operation amplifies the identical parts of the two sequences and reduces the opposite parts (improving sensitivity), while the variance reflects the difference between the two sequences (improving accuracy); the product operation is performed on the two text vectors, then the variance operation, and finally the two text vectors, the product result and the variance result are concatenated and output to a fully-connected layer for calculation, finally producing the semantic similarity value.
S8, output the categories of the sample library to be compared, i.e. the text clustering result, together with the semantic similarity analysis result of the text to be analyzed.
In summary, the semantic similarity analysis method based on text clustering of the present invention performs fast semantic similarity analysis on text data in actual scenarios; it can be applied to a variety of practical scenes, performs semantic similarity analysis well against the sample library formed for each scene, and is applicable to many fields.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.
Claims (4)
1. A semantic similarity analysis method based on text clustering, characterized by comprising the following specific steps:
S1, for an input unprocessed text T, perform data preprocessing by removing stop words, code conversion and Chinese word segmentation, converting the data into a computable form;
S2, train text word vectors for the segmented words with Skip-gram and Softmax models to calculate the similarity between words;
S3, calculate term frequency and inverse document frequency with the TF-IDF algorithm, obtain the TF-IDF value, and extract the keywords of the detected text;
S4, add the word frequency statistics to text clustering as prior knowledge, provide a posterior judgment criterion, and perform a preliminary classification of the texts in the sample library;
S5, add the extracted keywords to a classifier as prior knowledge, cluster the text data in the sample library on this basis, and refine the previously obtained text categories;
S6, perform morphological analysis, synonym disambiguation and semantic role labeling on the preprocessed text to be detected, generating a semantic vector that fuses context features;
S7, input the semantic vectors into two LSTMs with identical structures and parameters to process the text sequences, add the product and variance of the results, and amplify the common points and differences of the texts;
S8, output the final result of the text similarity analysis.
2. The semantic similarity analysis method based on text clustering according to claim 1, characterized in that for step S3, the word vector Skip-gram model used is a Huffman tree constructed based on Hierarchical Softmax, which predicts, from large-scale unlabeled text data, the occurrence probability of the surrounding words given the currently input word; according to the co-occurrence principle of words within a window, the co-occurrence probability between words is calculated by window sliding, so that the word vector generated for each feature word contains a certain amount of text structure and semantic information. The structure and calculation of the Skip-gram model are as follows:
the Skip-gram model comprises an input layer, a projection layer and an output layer, wherein the input layer is a current characteristic word, and a word vector W of the wordt∈RmThe output is the characteristic wordProbability of occurrence of a word in a context window, the purpose of the projection layer being to maximize the value of an objective function L, given a set of word sequences W1,W2,…,WNThen, then
In formula (1), N is the length of the word sequence, c is the context window length of the current feature word (values of 5–10 generally work well), and P(W_{j+1} | W_j) is the probability that the context feature word W_{j+1} appears given that the current word W_j has appeared; all word vectors obtained through Skip-gram training form a word vector matrix X ∈ R^(m×n), with X_i ∈ R^m denoting the word vector of feature word i in the m-dimensional space. The similarity between feature words can be measured by the distance between the corresponding word vectors; the Euclidean distance between two vectors is shown in formula (2):
d(W_i, W_j) = ‖x_i − x_j‖_2  (2)
In formula (2), d(W_i, W_j) is the semantic distance between feature words i and j, and x_i and x_j are the word vectors corresponding to W_i and W_j. The smaller the value of d(W_i, W_j), the smaller the semantic distance between the two feature words and the more similar their semantics.
3. The semantic similarity analysis method according to claim 1, characterized in that for step S5, a conventional K-means algorithm performs the text clustering, the texts to be clustered forming the set D = {D_1, D_2, D_3, …, D_n}; the specific flow of the text clustering algorithm based on word vectors and feature semantic distances is as follows:
(1) traverse the input text set and, for each text D_i, use a segmentation tool to perform word segmentation, stop-word removal and related operations, obtaining the text feature word set S = {S1, S2, S3, …, Sm};
(2) training a large-scale corpus obtained after word segmentation by using a Skip-gram model to obtain a word vector of each feature word;
(3) counting and calculating the word frequency, position and word distance information of the characteristic words of each text;
(4) randomly select k texts as initial clustering centroids, calculate the distance between each text and the k centroids using formula (3), and assign each text to the centroid with the shortest distance; the calculation is as follows:
l(W_i, W_j) = α·tf_ij × idf_i + β·k + γ·g_ij × d(W_i, W_j)  (3)
In formula (3), l(W_i, W_j) is the multi-feature semantic distance between feature words i and j, d(W_i, W_j) is the semantic distance between features i and j, tf_ij is the word frequency information, idf_i is the information entropy of the feature word, and g_ij is the weight of the span from the first to the last occurrence of feature word i in text j; α, β and γ are the weights of the three feature types, with α + β + γ = 1;
(5) sequentially calculating the distances from the text with the same centroid to other centroids, and selecting the centroid with the shortest distance as a new centroid of the text;
(6) repeat steps (4) and (5) until the centroids no longer change, yielding the final clustering result.
4. The semantic similarity analysis method based on text clustering according to claim 1, characterized in that for step S7, two LSTM neural networks with identical structures and parameters process the text sequences; the model consists of a preprocessing layer, an input layer, a training layer and an output layer. Let the input be {x, x′} and the output be y, with l_i an intermediate hidden layer, W_i the weight matrix of each layer and b_i the bias of each layer, as follows:
in equation (4), the activation function of training layers l2-l4 is ReLU, the activation function of output layer y is Softmax, the similarity is classified into 6 classes, so K equals 6, and the following is deduced:
the model adopts a double-sequence structure, each sequence is trained based on LSTM, the model is described in detail as t1 and t2 needing to be trained are input into an ① input layer, the text is converted into word sequences s1 and s2 through word segmentation, the word sequences are mapped with word vectors in pre-training, the word sequences are converted into word vector sequences v1 and v2, ② the word vector sequences are input into an LSTM model for training, two text vectors are obtained, the two text vectors are r1 and r2 respectively, product operation is firstly carried out on r1 and r2 to obtain p, variance operation is then carried out on r1 and r2 to obtain q, and finally the four results of r1, r2, p and q are connected together, a ③ output layer puts the connecting vector into a full-connection layer for calculation, finally semantic similarity values are obtained, compared with the previous model, the model not only comprises a double-sequence processing structure, but also carries out further processing on the training results of the text vectors, namely the training results of the two sequences are connected together for calculation, the product operation is obtained, the semantics are obtained, the two sequences are the same, the product operation is obtained, the sensitivity of the two sequences is improved, and the difference of the.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911100265.8A CN110825877A (en) | 2019-11-12 | 2019-11-12 | Semantic similarity analysis method based on text clustering |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110825877A true CN110825877A (en) | 2020-02-21 |
Family
ID=69554249
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911100265.8A Pending CN110825877A (en) | 2019-11-12 | 2019-11-12 | Semantic similarity analysis method based on text clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110825877A (en) |
2019-11-12: CN201911100265.8A patent application filed (CN110825877A/en); status: Pending
Cited By (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111488725A (en) * | 2020-03-15 | 2020-08-04 | 复旦大学 | Machine intelligent auxiliary root-pricking theoretical coding optimization method |
CN111488725B (en) * | 2020-03-15 | 2023-04-07 | 复旦大学 | Machine intelligent auxiliary root-pricking theoretical coding optimization method |
CN111898365A (en) * | 2020-04-03 | 2020-11-06 | 北京沃东天骏信息技术有限公司 | Method and device for detecting text |
CN111611809A (en) * | 2020-05-26 | 2020-09-01 | 西藏大学 | Chinese sentence similarity calculation method based on neural network |
CN111611809B (en) * | 2020-05-26 | 2023-04-18 | 西藏大学 | Chinese sentence similarity calculation method based on neural network |
CN113807073B (en) * | 2020-06-16 | 2023-11-14 | 中国电信股份有限公司 | Text content anomaly detection method, device and storage medium |
CN113807073A (en) * | 2020-06-16 | 2021-12-17 | 中国电信股份有限公司 | Text content abnormity detection method, device and storage medium |
CN111680131A (en) * | 2020-06-22 | 2020-09-18 | 平安银行股份有限公司 | Document clustering method and system based on semantics and computer equipment |
CN111680131B (en) * | 2020-06-22 | 2022-08-12 | 平安银行股份有限公司 | Document clustering method and system based on semantics and computer equipment |
CN112000802A (en) * | 2020-07-24 | 2020-11-27 | 南京航空航天大学 | Software defect positioning method based on similarity integration |
CN112069307A (en) * | 2020-08-25 | 2020-12-11 | 中国人民大学 | Legal law citation information extraction system |
CN112256874B (en) * | 2020-10-21 | 2023-08-08 | 平安科技(深圳)有限公司 | Model training method, text classification method, device, computer equipment and medium |
CN112256874A (en) * | 2020-10-21 | 2021-01-22 | 平安科技(深圳)有限公司 | Model training method, text classification method, device, computer equipment and medium |
CN112347758A (en) * | 2020-11-06 | 2021-02-09 | 中国平安人寿保险股份有限公司 | Text abstract generation method and device, terminal equipment and storage medium |
CN113011555A (en) * | 2021-02-09 | 2021-06-22 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment and storage medium |
CN113011555B (en) * | 2021-02-09 | 2023-01-31 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment and storage medium |
CN113111645B (en) * | 2021-04-28 | 2024-02-06 | 东南大学 | Media text similarity detection method |
CN113111645A (en) * | 2021-04-28 | 2021-07-13 | 东南大学 | Media text similarity detection method |
CN113342928A (en) * | 2021-05-07 | 2021-09-03 | 上海大学 | Method and system for extracting process information from steel material patent text based on improved TextRank algorithm |
CN113257060A (en) * | 2021-05-13 | 2021-08-13 | 张予立 | Question answering solving method, device, equipment and storage medium |
CN113407717B (en) * | 2021-05-28 | 2022-12-20 | 数库(上海)科技有限公司 | Method, device, equipment and storage medium for eliminating ambiguity of industrial words in news |
CN113407717A (en) * | 2021-05-28 | 2021-09-17 | 数库(上海)科技有限公司 | Method, device, equipment and storage medium for eliminating ambiguity of industry words in news |
CN113591474B (en) * | 2021-07-21 | 2024-04-05 | 西北工业大学 | Repeated data detection method of Loc2vec model based on weighted fusion |
CN113591474A (en) * | 2021-07-21 | 2021-11-02 | 西北工业大学 | Repeated data detection method based on weighted fusion Loc2vec model |
CN113535927A (en) * | 2021-07-30 | 2021-10-22 | 杭州网易智企科技有限公司 | Method, medium, device and computing equipment for acquiring similar texts |
CN113656548A (en) * | 2021-08-18 | 2021-11-16 | 福州大学 | Text classification model interpretation method and system based on data envelope analysis |
CN113656548B (en) * | 2021-08-18 | 2023-08-04 | 福州大学 | Text classification model interpretation method and system based on data envelope analysis |
CN115168600A (en) * | 2022-06-23 | 2022-10-11 | 广州大学 | Value chain knowledge discovery method under personalized customization |
CN115168600B (en) * | 2022-06-23 | 2023-07-11 | 广州大学 | Value chain knowledge discovery method under personalized customization |
CN114969348B (en) * | 2022-07-27 | 2023-10-27 | 杭州电子科技大学 | Electronic file hierarchical classification method and system based on inversion adjustment knowledge base |
CN114969348A (en) * | 2022-07-27 | 2022-08-30 | 杭州电子科技大学 | Electronic file classification method and system based on inversion regulation knowledge base |
CN115186778A (en) * | 2022-09-13 | 2022-10-14 | 福建省特种设备检验研究院 | Text analysis-based hidden danger identification method and terminal for pressure-bearing special equipment |
CN116796754A (en) * | 2023-04-20 | 2023-09-22 | 浙江浙里信征信有限公司 | Visual analysis method and system based on time-varying context semantic sequence pair comparison |
CN116166321B (en) * | 2023-04-26 | 2023-06-27 | 浙江鹏信信息科技股份有限公司 | Code clone detection method, system and computer readable storage medium |
CN116166321A (en) * | 2023-04-26 | 2023-05-26 | 浙江鹏信信息科技股份有限公司 | Code clone detection method, system and computer readable storage medium |
CN117591769A (en) * | 2023-12-22 | 2024-02-23 | 云尖(北京)软件有限公司 | Webpage tamper-proof method and system |
CN117591769B (en) * | 2023-12-22 | 2024-04-16 | 云尖(北京)软件有限公司 | Webpage tamper-proof method and system |
CN117592562A (en) * | 2024-01-18 | 2024-02-23 | 卓世未来(天津)科技有限公司 | Knowledge base automatic construction method based on natural language processing |
CN117592562B (en) * | 2024-01-18 | 2024-04-09 | 卓世未来(天津)科技有限公司 | Knowledge base automatic construction method based on natural language processing |
CN117648409A (en) * | 2024-01-30 | 2024-03-05 | 北京点聚信息技术有限公司 | OCR-based format file anti-counterfeiting recognition method |
CN117648409B (en) * | 2024-01-30 | 2024-04-05 | 北京点聚信息技术有限公司 | OCR-based format file anti-counterfeiting recognition method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110825877A (en) | Semantic similarity analysis method based on text clustering | |
CN106649260B (en) | Product characteristic structure tree construction method based on comment text mining | |
CN109190117B (en) | Short text semantic similarity calculation method based on word vector | |
CN108804421B (en) | Text similarity analysis method and device, electronic equipment and computer storage medium | |
CN109670014B (en) | Paper author name disambiguation method based on rule matching and machine learning | |
CN112256939B (en) | Text entity relation extraction method for chemical field | |
CN107844533A (en) | A kind of intelligent Answer System and analysis method | |
WO2020063071A1 (en) | Sentence vector calculation method based on chi-square test, and text classification method and system | |
CN111930933A (en) | Detection case processing method and device based on artificial intelligence | |
CN115203421A (en) | Method, device and equipment for generating label of long text and storage medium | |
CN111429184A (en) | User portrait extraction method based on text information | |
Bhutada et al. | Semantic latent dirichlet allocation for automatic topic extraction | |
CN114997288A (en) | Design resource association method | |
TW202111569A (en) | Text classification method with high scalability and multi-tag and apparatus thereof also providing a method and a device for constructing topic classification templates | |
CN116304020A (en) | Industrial text entity extraction method based on semantic source analysis and span characteristics | |
CN106815209B (en) | Uygur agricultural technical term identification method | |
Tianxiong et al. | Identifying chinese event factuality with convolutional neural networks | |
CN113139061B (en) | Case feature extraction method based on word vector clustering | |
CN114511027B (en) | Method for extracting English remote data through big data network | |
CN114265935A (en) | Science and technology project establishment management auxiliary decision-making method and system based on text mining | |
CN111061939B (en) | Scientific research academic news keyword matching recommendation method based on deep learning | |
CN113516202A (en) | Webpage accurate classification method for CBL feature extraction and denoising | |
Thilagavathi et al. | Document clustering in forensic investigation by hybrid approach | |
CN114595324A (en) | Method, device, terminal and non-transitory storage medium for power grid service data domain division | |
CN113590738A (en) | Method for detecting network sensitive information based on content and emotion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 2020-02-21