CN110597986A - Text clustering system and method based on fine tuning characteristics - Google Patents

Text clustering system and method based on fine tuning characteristics

Info

Publication number
CN110597986A
Authority
CN
China
Prior art keywords
cluster
sentence
clustering
distance
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910757370.2A
Other languages
Chinese (zh)
Inventor
汪鹏 (Wang Peng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Weier Network Technology Co Ltd
Original Assignee
Hangzhou Weier Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Weier Network Technology Co Ltd filed Critical Hangzhou Weier Network Technology Co Ltd
Priority to CN201910757370.2A
Publication of CN110597986A

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Abstract

A text clustering system and method based on fine-tuning features, relating to the technical field of information clustering. The system comprises: a chat corpus, storing the chat corpus data; a downstream task module, constructing a downstream task and generating sentence vectors; a hierarchical clustering module, clustering the sentence vectors into clusters; an anomaly detection module, detecting the reliability of each cluster; and a question generation module, generating the most appropriate question for each cluster. The method fine-tunes a pre-trained word vector model through a downstream task, so that the generated vectors better suit the clustering task; improves the reliability of each cluster using anomaly detection; extracts highly reliable clusters by scoring each cluster; and finally generates a question name for each cluster, improving the reliability of the cluster names.

Description

Text clustering system and method based on fine tuning characteristics
Technical Field
The invention relates to the technical field of information clustering, in particular to a text clustering system and method based on fine tuning characteristics.
Background
With the development and application of network technology and the explosive growth of information resources, research on text mining, information filtering, and information search has unprecedented prospects, and clustering techniques are becoming the core of text information mining. Text clustering is an important technique in text mining for discovering the data distribution and its implicit patterns. Clustering divides data with similarities into groups so that the elements in each cluster share common characteristics, usually defined in terms of a distance metric. However, conventional clustering is sensitive to initial values and outliers, and its generalization performance is weak.
To address these defects, and drawing on the rapid development of pre-trained word vector techniques such as BERT, GPT, and ELMo in the NLP field, the invention generates vectors better suited to clustering through a downstream subtask, which ensures the generalization ability of the clustering; on this basis, an anomaly detection module is added to further improve the clustering effect.
Disclosure of Invention
The invention aims to provide a text clustering system and method based on fine tuning characteristics.
The technical scheme of the invention is as follows: a system for text clustering based on fine-tuning features, comprising:
a chat corpus: storing the chat corpus data;
a downstream task module: constructing a downstream task and generating sentence vectors;
a hierarchical clustering module: clustering the sentence vectors into clusters;
an anomaly detection module: detecting the reliability of each cluster;
a question generation module: generating the most appropriate question for each cluster.
The above text clustering system based on fine-tuning features further includes a text preprocessing module, used for preprocessing and normalizing similar structured information in the chat corpus, including pictures, links, cards, and Taobao pass-codes.
A text clustering method based on fine-tuning features comprises the following steps:
S1, constructing a downstream task, wherein the downstream task is used for predicting whether 2 sentences in the chat corpus are similar;
S2, text preprocessing, namely normalizing similar structured information in the chat corpus, including pictures, links, cards, and Taobao pass-codes, into a single cluster;
S3, representing the sentences in the chat corpus as vectors;
S4, clustering, namely clustering the sentence vectors into clusters;
S5, question generation, namely generating the most suitable question for each cluster.
In the foregoing text clustering method based on fine-tuning features, the downstream task in step S1 uses a triplet loss function. The input samples are 3 sentences: a middle sample called the anchor, a left sample as a positive sample, and a right sample as a negative sample. The training objective is to maximize the distance between the anchor and the negative sample while minimizing the distance between the anchor and the positive sample.
In the foregoing text clustering method based on fine-tuning features, the step S4 includes:
hierarchical clustering: first, the Euclidean distance between sentences is calculated and a distance threshold is set; two sentences whose distance is less than the threshold are merged into a cluster;
the distance between each sentence in the cluster and a third sentence is then calculated, and the minimum is taken as the distance between the cluster and the third sentence; if this distance is smaller than the threshold, the third sentence is merged into the cluster;
a fourth sentence, a fifth sentence, and so on are processed in the same way, until all sentences within the threshold distance have been placed into clusters, completing the hierarchical clustering;
anomaly detection: the LOF (local outlier factor) algorithm is used, which evaluates the relative density of each sentence with respect to its neighboring data points. The reachability distance is calculated as
reach_dist_k(p, o) = max{ k-distance(o), d(p, o) }
where k-distance(o) is the distance from o to its kth nearest neighbor and d(p, o) is the distance between p and o. The local reachability density of p is the inverse of its average reachability distance to its k nearest neighbors N_k(p):
lrd_k(p) = |N_k(p)| / Σ_{o ∈ N_k(p)} reach_dist_k(p, o)
and the LOF score of p is the average ratio of the local reachability density of p's neighbors to that of p:
LOF_k(p) = ( Σ_{o ∈ N_k(p)} lrd_k(o) / lrd_k(p) ) / |N_k(p)|
An LOF score is calculated for each intra-cluster question: if the LOF score of a question p is around 1, its local density is about the same as its neighbors'; if the LOF score is less than 1, p lies in a relatively dense region and is unlikely to be an outlier; if the LOF score is far greater than 1, p is farther away than the other points and is likely an abnormal point;
score generation: each generated cluster is scored; the score measures the clustering quality, and high-quality clusters are retained;
the score is calculated using the root mean square standard deviation:
RMSSTD = sqrt( Σ_i Σ_{x ∈ Ci} ||x - ci||² / Σ_i P(ni - 1) )
where Ci denotes the ith cluster, ci is the center of that cluster, x ∈ Ci denotes a sample point belonging to the ith cluster, ni is the number of samples in the ith cluster, and P is the vector dimension of the sample points.
In the foregoing text clustering method based on fine-tuning features, the step S5 uses the PageRank algorithm:
each sentence in the cluster is regarded as a node; if two sentences are similar, an undirected weighted edge is considered to exist between the two nodes, with the similarity as its weight;
the score of each sentence is then calculated as
S(Vi) = (1 - d) + d · Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|
where S(Vi) is the importance (PR value) of sentence i, d is a damping coefficient, generally set to 0.85, In(Vi) is the set of sentences with links pointing to sentence i, Out(Vj) is the set of sentences that the links in sentence j point to, and |Out(Vj)| is the number of elements in that set. Finally, the sentence with the largest PR value is taken as the question.
Compared with the prior art, the invention fine-tunes the pre-trained word vector model through a downstream task, so that the generated vectors better suit the clustering task; improves the reliability of each cluster using anomaly detection; extracts highly reliable clusters by scoring each cluster; and finally generates a question name for each cluster, improving the reliability of the cluster names.
Detailed Description
The present invention is further illustrated by the following examples, which are not to be construed as limiting the invention.
Example: a text clustering system based on fine-tuning features, comprising:
a chat corpus: storing the chat corpus data;
a downstream task module: constructing a downstream task and generating sentence vectors;
a text preprocessing module: used for preprocessing and normalizing similar structured information in the chat corpus, including pictures, links, cards, and Taobao pass-codes;
a hierarchical clustering module: clustering the sentence vectors into clusters;
an anomaly detection module: detecting the reliability of each cluster;
a question generation module: generating the most appropriate question for each cluster.
Specifically, in this method of text clustering based on pre-trained word vectors fine-tuned by a downstream task with anomaly detection, the original pre-trained model is trained on a large-scale Chinese corpus. Although such a model has strong text representation capability, to make it better suit our task, training data needs to be constructed and the model fine-tuned on the constructed data set. This has 2 advantages: 1. the model gets to see the specialized terms that may occur in the chat corpus, reducing the possibility of semantic deviation (for example, '翡冷翠' (Feilengcui) and 'Florence' refer to the same place, but if the model has never seen the former term, the vectors generated for the two will be unrelated); 2. fine-tuning lets the vectors generated by the model better express the similarity between sentences. The fine-tuning objective, making the distances between similar sentences small and the distances between dissimilar sentences large, matches the clustering logic: similar sentences are gathered into one cluster, and different sentences are separated.
The method comprises the following steps:
s1, constructing a downstream task, wherein the downstream task is used for predicting whether 2 sentences in the chat corpus are similar or not;
the downstream task adopts a triple Loss function, input samples are 3 sentences, the intermediate samples are called acher, a left sample is a positive sample, a right sample is a negative sample, and the training optimization aims at maximizing the distance between the acher sample and the negative sample and minimizing the distance between the acher sample and the positive sample.
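A minimal sketch of such a triplet-loss objective, written in PyTorch (the patent does not name a framework; the margin value and the use of Euclidean distance as the metric are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Euclidean distance from the anchor to the positive and negative samples
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # Hinge loss: push the anchor-negative distance above the
    # anchor-positive distance by at least `margin` (margin is an assumption)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```

Here `anchor`, `positive`, and `negative` would be batches of sentence vectors produced by the pre-trained encoder for the middle, left, and right samples of each triple.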
S2, text preprocessing, namely normalizing similar structured information in the chat corpus, including pictures, links, cards, and Taobao pass-codes, into a single cluster; this implicitly duplicated data can then be removed, which on the one hand improves the quality of the clustering and on the other hand speeds it up, since roughly 50% of the data can be removed by deduplication.
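One possible shape for this normalization step is sketched below; the placeholder tokens and the regular expressions for pictures, links, cards, and pass-codes are assumptions, since the patent does not specify the message formats:

```python
import re

# Illustrative patterns; the real picture/link/card/pass-code formats
# in the chat corpus are not specified in the patent.
PATTERNS = [
    (re.compile(r"https?://\S+"), "<LINK>"),
    (re.compile(r"\[图片\]|\[image\]", re.IGNORECASE), "<PICTURE>"),
    (re.compile(r"\[卡片\]|\[card\]", re.IGNORECASE), "<CARD>"),
    (re.compile(r"[¥￥$][A-Za-z0-9]{8,}[¥￥$]"), "<TAO_CODE>"),  # pass-code-like strings
]

def normalize(sentence: str) -> str:
    """Map pictures, links, cards and pass-codes onto shared placeholder tokens."""
    for pattern, token in PATTERNS:
        sentence = pattern.sub(token, sentence)
    return sentence.strip()

def deduplicate(sentences):
    """Drop sentences that become identical after normalization."""
    seen, kept = set(), []
    for s in sentences:
        key = normalize(s)
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept
```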
S3, representing the sentences in the chat corpus as vectors;
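A sketch of how the sentence vectors might be extracted from the fine-tuned encoder by mean pooling over token states; the checkpoint name and the pooling strategy are assumptions, since the patent only states that sentences are represented as vectors by the fine-tuned model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "bert-base-chinese" is a placeholder; in practice this would be the
# checkpoint fine-tuned with the triplet-loss downstream task above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def sentence_vectors(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)    # (batch, tokens, 1)
    # Mean-pool over the non-padding tokens to get one vector per sentence
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```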
S4, clustering, namely clustering the sentence vectors into clusters;
the clustering comprises the following steps:
hierarchical clustering: first, the Euclidean distance between sentences is calculated and a distance threshold is set; two sentences whose distance is less than the threshold are merged into a cluster;
the cluster now contains two sentences, and the distance between the cluster and a third sentence is calculated as follows: the distance between each sentence in the cluster and the third sentence is calculated (two distances), and the minimum is taken as the distance between the cluster and the third sentence; if this distance is smaller than the threshold, the third sentence is merged into the cluster;
a fourth sentence, a fifth sentence, and so on are processed in the same way, until all sentences within the threshold distance have been placed into clusters, completing the hierarchical clustering;
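A compact sketch of this threshold-based, single-linkage merging, processing sentences one at a time; the greedy insertion order and the tie-breaking between candidate clusters are assumptions the patent leaves open:

```python
import numpy as np

def hierarchical_threshold_clustering(vectors, threshold):
    """Each sentence joins the cluster whose nearest member (minimum
    Euclidean distance) is below `threshold`; otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, v in enumerate(vectors):
        best, best_dist = None, threshold
        for c in clusters:
            # Single linkage: cluster-to-sentence distance is the minimum
            # distance from the sentence to any member of the cluster.
            d = min(np.linalg.norm(v - vectors[j]) for j in c)
            if d < best_dist:
                best, best_dist = c, d
        if best is not None:
            best.append(i)
        else:
            clusters.append([i])
    return clusters
```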
anomaly detection: the LOF (local outlier factor) algorithm is used, which evaluates the relative density of each sentence with respect to its neighboring data points. The reachability distance is calculated as
reach_dist_k(p, o) = max{ k-distance(o), d(p, o) }
where k-distance(o) is the distance from o to its kth nearest neighbor and d(p, o) is the distance between p and o. The local reachability density of p is the inverse of its average reachability distance to its k nearest neighbors N_k(p):
lrd_k(p) = |N_k(p)| / Σ_{o ∈ N_k(p)} reach_dist_k(p, o)
and the LOF score of p is the average ratio of the local reachability density of p's neighbors to that of p:
LOF_k(p) = ( Σ_{o ∈ N_k(p)} lrd_k(o) / lrd_k(p) ) / |N_k(p)|
An LOF score is calculated within each cluster: if the LOF score of a data point p is around 1, its local density is about the same as its neighbors'; if the LOF score is less than 1, p lies in a relatively dense region and is unlikely to be an outlier; if the LOF score is far greater than 1, p is farther away than the other points and is likely an abnormal point;
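In practice this LOF computation is available off the shelf; the sketch below uses scikit-learn's LocalOutlierFactor, where the neighbor count k and the cut-off of 1.5 for "far greater than 1" are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def filter_outliers(cluster_vectors, k=5, cutoff=1.5):
    """Remove intra-cluster points whose LOF score is far above 1."""
    X = np.asarray(cluster_vectors)
    if len(X) < 3:                    # too few points for a density estimate
        return X
    lof = LocalOutlierFactor(n_neighbors=min(k, len(X) - 1))
    lof.fit(X)
    scores = -lof.negative_outlier_factor_   # LOF scores; values near 1 are inliers
    return X[scores < cutoff]
```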
score generation: each generated cluster is scored; the score measures the clustering quality, and high-quality clusters are retained;
the score is calculated using the root mean square standard deviation:
RMSSTD = sqrt( Σ_i Σ_{x ∈ Ci} ||x - ci||² / Σ_i P(ni - 1) )
where Ci denotes the ith cluster, ci is the center of that cluster, x ∈ Ci denotes a sample point belonging to the ith cluster, ni is the number of samples in the ith cluster, and P is the vector dimension of the sample points.
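A direct translation of this score into code for a single cluster (how the retention threshold is then applied to the scores is an assumption left open by the patent):

```python
import numpy as np

def rmsstd(cluster):
    """Root-mean-square standard deviation of one cluster; lower means tighter.
    `cluster` is an (n_i, P) array of the cluster's sentence vectors, n_i > 1."""
    n, P = cluster.shape
    center = cluster.mean(axis=0)
    return np.sqrt(((cluster - center) ** 2).sum() / (P * (n - 1)))
```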
S5, question generation, namely generating the most suitable question for each cluster.
The step S5 uses the PageRank algorithm:
each sentence in the cluster is regarded as a node; if two sentences are similar, an undirected weighted edge is considered to exist between the two nodes, with the similarity as its weight;
the score of each sentence is then calculated as
S(Vi) = (1 - d) + d · Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|
where S(Vi) is the importance (PR value) of sentence i, d is a damping coefficient, generally set to 0.85, In(Vi) is the set of sentences with links pointing to sentence i, Out(Vj) is the set of sentences that the links in sentence j point to, and |Out(Vj)| is the number of elements in that set. Finally, the sentence with the largest PR value is taken as the question.
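A sketch of the PR iteration over the cluster's similarity graph. Because the edges here carry similarity weights, the sketch uses the weighted (TextRank-style) update, which replaces |Out(Vj)| with the total outgoing edge weight of Vj; the iteration count and convergence tolerance are assumptions:

```python
import numpy as np

def best_question(similarity, d=0.85, iters=100, tol=1e-6):
    """Return the index of the sentence with the highest PR value.
    `similarity` is an (n, n) symmetric matrix of pairwise sentence
    similarities, i.e. the undirected weighted edges of the cluster graph."""
    n = similarity.shape[0]
    W = similarity.astype(float).copy()
    np.fill_diagonal(W, 0.0)               # no self-loops
    out_weight = W.sum(axis=0)             # total edge weight leaving each node
    out_weight[out_weight == 0] = 1.0      # guard isolated nodes
    S = np.ones(n)
    for _ in range(iters):
        S_new = (1 - d) + d * W.dot(S / out_weight)
        done = np.abs(S_new - S).max() < tol
        S = S_new
        if done:
            break
    return int(S.argmax())                 # this sentence becomes the question
```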

Claims (6)

1. A text clustering system based on fine-tuning features, characterized by comprising:
a chat corpus: storing the chat corpus data;
a downstream task module: constructing a downstream task and generating sentence vectors;
a hierarchical clustering module: clustering the sentence vectors into clusters;
an anomaly detection module: detecting the reliability of each cluster;
a question generation module: generating the most appropriate question for each cluster.
2. The text clustering system based on fine-tuning features of claim 1, characterized in that: the system further comprises a text preprocessing module, used for preprocessing and normalizing similar structured information in the chat corpus, including pictures, links, cards, and Taobao pass-codes.
3. The text clustering method based on fine-tuning features according to claim 1, characterized by comprising the following steps:
S1, constructing a downstream task, wherein the downstream task is used for predicting whether 2 sentences in the chat corpus are similar;
S2, text preprocessing, namely normalizing similar structured information in the chat corpus, including pictures, links, cards, and Taobao pass-codes, into a single cluster;
S3, representing the sentences in the chat corpus as vectors;
S4, clustering, namely clustering the sentence vectors into clusters;
S5, question generation, namely generating the most suitable question for each cluster.
4. The text clustering method based on fine-tuning features of claim 3, characterized in that: in the step S1, the downstream task uses a triplet loss function; the input samples are 3 sentences, the middle sample is called the anchor, the left sample is a positive sample, and the right sample is a negative sample; the training objective is to maximize the distance between the anchor and the negative sample while minimizing the distance between the anchor and the positive sample.
5. The text clustering method based on fine-tuning features of claim 4, characterized in that: the step S4 includes:
hierarchical clustering: first, the Euclidean distance between sentences is calculated and a distance threshold is set; two sentences whose distance is less than the threshold are merged into a cluster;
the distance between each sentence in the cluster and a third sentence is calculated, and the minimum is taken as the distance between the cluster and the third sentence; if this distance is smaller than the threshold, the third sentence is merged into the cluster;
a fourth sentence, a fifth sentence, and so on are processed in the same way, until all sentences within the threshold distance have been placed into clusters, completing the hierarchical clustering;
anomaly detection: the LOF algorithm is used, which evaluates the relative density of each sentence with respect to its neighboring data points; the reachability distance is calculated as
reach_dist_k(p, o) = max{ k-distance(o), d(p, o) }
where k-distance(o) is the distance from o to its kth nearest neighbor and d(p, o) is the distance between p and o;
an LOF score is calculated for each intra-cluster question: if the LOF score of a question p is around 1, its local density is about the same as its neighbors'; if the LOF score is less than 1, p lies in a relatively dense region and is unlikely to be an outlier; if the LOF score is far greater than 1, p is farther away than the other points and is likely an abnormal point;
score generation: each generated cluster is scored; the score measures the clustering quality, and high-quality clusters are retained;
the score is calculated using the root mean square standard deviation:
RMSSTD = sqrt( Σ_i Σ_{x ∈ Ci} ||x - ci||² / Σ_i P(ni - 1) )
where Ci denotes the ith cluster, ci is the center of that cluster, x ∈ Ci denotes a sample point belonging to the ith cluster, ni is the number of samples in the ith cluster, and P is the vector dimension of the sample points.
6. The text clustering method based on fine-tuning features of claim 3, characterized in that: the step S5 uses the PageRank algorithm:
each sentence in the cluster is regarded as a node; if two sentences are similar, an undirected weighted edge is considered to exist between the two nodes, with the similarity as its weight;
the score of each sentence is calculated as
S(Vi) = (1 - d) + d · Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|
where S(Vi) is the importance (PR value) of sentence i, d is a damping coefficient, generally set to 0.85, In(Vi) is the set of sentences with links pointing to sentence i, Out(Vj) is the set of sentences that the links in sentence j point to, and |Out(Vj)| is the number of elements in that set; finally, the sentence with the largest PR value is taken as the question.
CN201910757370.2A 2019-08-16 2019-08-16 Text clustering system and method based on fine tuning characteristics Pending CN110597986A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910757370.2A CN110597986A (en) 2019-08-16 2019-08-16 Text clustering system and method based on fine tuning characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910757370.2A CN110597986A (en) 2019-08-16 2019-08-16 Text clustering system and method based on fine tuning characteristics

Publications (1)

Publication Number Publication Date
CN110597986A 2019-12-20

Family

ID=68854591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910757370.2A Pending CN110597986A (en) 2019-08-16 2019-08-16 Text clustering system and method based on fine tuning characteristics

Country Status (1)

Country Link
CN (1) CN110597986A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178415A (en) * 2019-12-21 2020-05-19 厦门快商通科技股份有限公司 Method and system for hierarchical clustering of intention data based on BERT
CN111368081A (en) * 2020-03-03 2020-07-03 支付宝(杭州)信息技术有限公司 Method and system for determining selected text content
CN111814448A (en) * 2020-07-03 2020-10-23 苏州思必驰信息科技有限公司 Method and device for quantizing pre-training language model
CN113538075A (en) * 2020-04-14 2021-10-22 阿里巴巴集团控股有限公司 Data processing method, model training method, device and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133238A (en) * 2016-02-29 2017-09-05 阿里巴巴集团控股有限公司 A kind of text message clustering method and text message clustering system
CN107967255A (en) * 2017-11-08 2018-04-27 北京广利核系统工程有限公司 A kind of method and system for judging text similarity

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133238A (en) * 2016-02-29 2017-09-05 阿里巴巴集团控股有限公司 A kind of text message clustering method and text message clustering system
CN107967255A (en) * 2017-11-08 2018-04-27 北京广利核系统工程有限公司 A kind of method and system for judging text similarity

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
万家强 (Wan Jiaqiang), "Research on Outlier Detection and Clustering Based on Connectivity" (基于连通性的离群检测与聚类研究), China Doctoral Dissertations Full-text Database, Information Science and Technology, ISSN 1674-022X *
邵洪雨 (Shao Hongyu), "Research on Short Text Clustering and Description of Clustering Results" (短文本聚类及聚类结果描述方法研究), China Master's Theses Full-text Database, Information Science and Technology, ISSN 1674-0246 *
陈伶红 (Chen Linghong), "Research and Design of a Personalized Recommendation System for University Libraries" (高校图书馆个性化推荐系统的研究与设计), China Master's Theses Full-text Database, Information Science and Technology, ISSN 1674-0246 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178415A (en) * 2019-12-21 2020-05-19 厦门快商通科技股份有限公司 Method and system for hierarchical clustering of intention data based on BERT
CN111368081A (en) * 2020-03-03 2020-07-03 支付宝(杭州)信息技术有限公司 Method and system for determining selected text content
CN113538075A (en) * 2020-04-14 2021-10-22 阿里巴巴集团控股有限公司 Data processing method, model training method, device and equipment
CN111814448A (en) * 2020-07-03 2020-10-23 苏州思必驰信息科技有限公司 Method and device for quantizing pre-training language model
CN111814448B (en) * 2020-07-03 2024-01-16 思必驰科技股份有限公司 Pre-training language model quantization method and device

Similar Documents

Publication Publication Date Title
Abbas et al. Multinomial Naive Bayes classification model for sentiment analysis
CN106383877B (en) Social media online short text clustering and topic detection method
CN110597986A (en) Text clustering system and method based on fine tuning characteristics
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN110321925B (en) Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
CN106372061B (en) Short text similarity calculation method based on semantics
CN109815336B (en) Text aggregation method and system
CN107832306A (en) A kind of similar entities method for digging based on Doc2vec
Mohammed et al. Glove word embedding and DBSCAN algorithms for semantic document clustering
CN106557777B (en) One kind being based on the improved Kmeans document clustering method of SimHash
CN110442726B (en) Social media short text online clustering method based on entity constraint
CN107357895B (en) Text representation processing method based on bag-of-words model
CN112883722B (en) Distributed text summarization method based on cloud data center
CN110851733A (en) Community discovery and emotion interpretation method based on network topology and document content
Gad et al. Incremental clustering algorithm based on phrase-semantic similarity histogram
CN114048310A (en) Dynamic intelligence event timeline extraction method based on LDA theme AP clustering
Basha et al. An improved similarity matching based clustering framework for short and sentence level text
Zhao et al. Improving continual relation extraction by distinguishing analogous semantics
CN110597982A (en) Short text topic clustering algorithm based on word co-occurrence network
CN115688768A (en) Medical text professional classification method based on confrontation data enhancement
CN115098690A (en) Multi-data document classification method and system based on cluster analysis
CN111737461B (en) Text processing method and device, electronic equipment and computer readable storage medium
Bakr et al. Efficient incremental phrase-based document clustering
Nguyen et al. Text summarization on large-scale Vietnamese datasets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191220)