CN110597986A - Text clustering system and method based on fine tuning characteristics - Google Patents
- Publication number
- CN110597986A (application CN201910757370.2A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- sentence
- clustering
- distance
- sentences
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Abstract
A text clustering system and method based on fine-tuned features, relating to the technical field of information clustering. The system comprises: a chat corpus module, storing the chat corpus; a downstream task module, constructing a downstream task and generating sentence vectors; a hierarchical clustering module, grouping the sentence vectors into clusters; an anomaly detection module, checking the reliability of each cluster; and a question generation module, generating the most suitable question for each cluster. In this method, the pre-trained word vector model is fine-tuned through a downstream task, so the generated vectors are better suited to the clustering task; anomaly detection improves the reliability of each cluster; scoring each cluster allows clusters of high reliability to be retained; finally, a question is generated for each cluster, improving the reliability of the cluster names.
Description
Technical Field
The invention relates to the technical field of information clustering, in particular to a text clustering system and method based on fine tuning characteristics.
Background
With the development and application of network technology and the explosive growth of information resources, research on text mining, information filtering, and information searching has unprecedented prospects, and clustering techniques are becoming the core of text information mining. Text clustering is an important text mining technique used to discover the data distribution and its implicit patterns. Clustering divides similar data into groups so that the elements of each cluster share common characteristics, usually in terms of a defined distance metric. However, conventional clustering is sensitive to initial values and outliers, and its generalization performance is weak.
To address these defects, and with the rapid development of pre-trained word vector techniques such as BERT, GPT, and ELMo in the NLP field, the invention generates vectors better suited to clustering through a downstream subtask, ensuring the generalization ability of clustering; on this basis an anomaly detection module is added, further improving the clustering effect.
Disclosure of Invention
The invention aims to provide a text clustering system and method based on fine tuning characteristics.
The technical scheme of the invention is as follows: a system for text clustering based on fine-tuned features, comprising:
a chat corpus module: storing the chat corpus;
a downstream task module: constructing a downstream task and generating sentence vectors;
a hierarchical clustering module: clustering the sentence vectors into clusters;
an anomaly detection module: detecting the reliability of each cluster;
a question generation module: generating the most suitable question for each cluster.
The above text clustering system based on fine-tuned features further includes a text preprocessing module, used to normalize template-like messages in the chat corpus, including pictures, links, cards, and Taobao password codes.
A text clustering method based on fine tuning features comprises the following steps:
S1, constructing a downstream task, the downstream task being used to predict whether 2 sentences in the chat corpus are similar;
S2, text preprocessing, in which template-like messages in the chat corpus, including pictures, links, cards, and Taobao password codes, are normalized and merged into a single cluster;
S3, representing the sentences in the chat corpus as vectors;
S4, clustering, in which the sentence vectors are clustered into clusters;
S5, question generation, which generates the most suitable question for each cluster.
In the foregoing text clustering method based on fine-tuned features, the downstream task in step S1 uses a Triplet Loss function. Each input sample is 3 sentences: a middle sample called the anchor, a left sample as the positive sample, and a right sample as the negative sample. Training optimizes toward maximizing the distance between the anchor and the negative sample while minimizing the distance between the anchor and the positive sample.
In the foregoing text clustering method based on fine-tuned features, the clustering step S4 comprises:
hierarchical clustering: first, calculate the Euclidean distance between sentences and set a distance threshold; merge two sentences whose distance is below the threshold into one cluster;
calculate the distance between each sentence in the cluster and a third sentence and take the minimum as the distance between the cluster and the third sentence; if that distance is below the threshold, merge the third sentence into the cluster;
continue in the same way with the fourth sentence, the fifth sentence, and so on, until every sentence within the threshold distance has been merged into the cluster; hierarchical clustering is then complete;
anomaly detection: use the LOF (local outlier factor) algorithm, which computes the density of each sentence relative to its neighboring data points; the reachability distance is calculated as follows:
reach_dist_k(p, o) = max{k-distance(o), d(p, o)}
where k-distance(o) is the distance from o to its k-th nearest point, and d(p, o) is the distance between p and o; the local reachability density lrd_k(p) is the inverse of the average reachability distance from p to its k nearest neighbors, and the LOF score LOF_k(p) is the average lrd of p's neighbors divided by lrd_k(p);
compute the LOF score of each in-cluster question: if the LOF score of question p is around 1, its local density is about the same as its neighbors'; if the LOF score is below 1, p lies in a relatively dense region and is unlikely to be an outlier; if the LOF score is far greater than 1, p is sparser than its neighbors and is likely an outlier;
score generation: score each generated cluster; the score measures clustering quality, and high-quality clusters are retained;
the score is calculated using the root-mean-square standard deviation (RMSSTD):
RMSSTD = sqrt( Σ_i Σ_{x ∈ Ci} ||x − c_i||² / (P · Σ_i (n_i − 1)) )
where Ci denotes the i-th cluster, c_i is the center of that cluster, x ∈ Ci is a sample point belonging to the i-th cluster, n_i is the number of samples in the i-th cluster, and P is the dimension of the sample vectors.
In the foregoing text clustering method based on fine-tuned features, the question generation step S5 uses the PageRank algorithm:
each sentence in the cluster is treated as a node; if two sentences are similar, an undirected weighted edge is placed between the two nodes, with the similarity as the weight;
the score of each sentence is calculated as:
S(Vi) = (1 − d) + d · Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|
where S(Vi) is the importance (PR value) of sentence i; d is a damping coefficient, usually set to 0.85; In(Vi) is the set of sentences with links pointing to sentence i; Out(Vj) is the set of sentences that the links in sentence j point to; and |Out(Vj)| is the number of elements in that set. Finally, the sentence with the largest PR value is taken as the question.
Compared with the prior art, the invention fine-tunes the pre-trained word vector model through a downstream task, so the generated vectors are better suited to the clustering task; anomaly detection improves the reliability of each cluster; scoring each cluster allows clusters of high reliability to be retained; and finally a question is generated for each cluster, improving the reliability of the cluster names.
Detailed Description
The present invention is further illustrated by the following examples, which are not to be construed as limiting the invention.
Example: a text clustering system based on fine-tuned features, comprising:
a chat corpus module: storing the chat corpus;
a downstream task module: constructing a downstream task and generating sentence vectors;
a text preprocessing module: normalizing template-like messages in the chat corpus, including pictures, links, cards, and Taobao password codes;
a hierarchical clustering module: clustering the sentence vectors into clusters;
an anomaly detection module: detecting the reliability of each cluster;
a question generation module: generating the most suitable question for each cluster.
Specifically, the text clustering method based on downstream-task fine-tuned pre-trained word vectors and anomaly detection is as follows. The original pre-trained model is trained on a large-scale Chinese corpus. Although the model has strong text representation ability, to make it better suited to our task, data must be constructed and the model fine-tuned on the constructed data set. This has 2 advantages: 1. the model sees the specialized terms that may appear in the chat corpus, reducing the chance of semantic deviation (for example, 翡冷翠, an old Chinese transliteration of Florence written with the character for "jade", names a place, but if the model has never seen the term, the vectors it generates will not reflect that); 2. fine-tuning lets the vectors generated by the model better express similarity between sentences: the optimization goal is that similar sentences are close and dissimilar sentences are far apart, which matches the clustering logic of grouping similar sentences into one cluster and separating different sentences.
The method comprises the following steps:
S1, constructing a downstream task, the downstream task being used to predict whether 2 sentences in the chat corpus are similar;
The downstream task uses a Triplet Loss function. Each input sample is 3 sentences: the middle sample is called the anchor, the left sample is the positive sample, and the right sample is the negative sample. Training optimizes toward maximizing the distance between the anchor and the negative sample while minimizing the distance between the anchor and the positive sample.
S2, text preprocessing, in which template-like messages in the chat corpus, including pictures, links, cards, and Taobao password codes, are normalized and merged into a single cluster. Removing this implicitly duplicated data both improves the quality of the clustering and speeds up the clustering process, since roughly 50% of the data can be discarded after deduplication.
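A minimal sketch of the normalize-then-deduplicate idea described above; the placeholder tokens and regex rules are illustrative assumptions, not the patent's actual rules:

```python
import re

# Hypothetical normalization rules: each pattern collapses one kind of
# template-like message (links, image markup, card codes) into a single
# placeholder token so that implicit duplicates share one normalized form.
RULES = [
    (re.compile(r"https?://\S+"), "<LINK>"),
    (re.compile(r"\[image\]|\[img\]"), "<IMAGE>"),
    (re.compile(r"\[card:[^\]]*\]"), "<CARD>"),
]

def normalize(sentence: str) -> str:
    for pattern, token in RULES:
        sentence = pattern.sub(token, sentence)
    return sentence.strip()

def deduplicate(corpus):
    """Keep one representative message per normalized form."""
    seen, kept = set(), []
    for s in corpus:
        key = normalize(s)
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept

msgs = ["see https://a.example/x", "see https://b.example/y", "hello"]
print(deduplicate(msgs))  # the two link messages collapse to one
```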
S3, representing sentences in the chat corpus by vectors;
s4, clustering, namely clustering sentence vectors into clusters;
The clustering step S4 comprises the following sub-steps:
hierarchical clustering: first, calculate the Euclidean distance between sentences and set a distance threshold; merge two sentences whose distance is below the threshold into one cluster;
the cluster now contains two sentences, and the distance between the cluster and a third sentence is calculated as follows: compute the distance from each sentence in the cluster to the third sentence, and take the minimum as the cluster-to-sentence distance; if that distance is below the threshold, merge the third sentence into the cluster;
continue in the same way with the fourth sentence, the fifth sentence, and so on, until every sentence within the threshold distance has been merged into the cluster; hierarchical clustering is then complete.
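The threshold-based merging steps above can be sketched as follows; this is an illustrative single-linkage variant, not the patent's exact implementation:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def threshold_clustering(vectors, threshold):
    """Single-linkage style grouping: each new point joins the cluster
    whose minimum member distance is below `threshold`; otherwise it
    starts a new cluster of its own."""
    clusters = []
    for v in vectors:
        best = None
        for c in clusters:
            # Cluster-to-sentence distance: minimum over cluster members.
            d = min(euclidean(v, m) for m in c)
            if d < threshold and (best is None or d < best[0]):
                best = (d, c)
        if best:
            best[1].append(v)
        else:
            clusters.append([v])
    return clusters

points = [[0, 0], [0.1, 0], [5, 5], [5.1, 5]]
print(len(threshold_clustering(points, 1.0)))  # → 2
```

The distance threshold plays the role of the cut height in ordinary agglomerative hierarchical clustering.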
anomaly detection: use the LOF (local outlier factor) algorithm, which computes the density of each sentence relative to its neighboring data points; the reachability distance is calculated as follows:
reach_dist_k(p, o) = max{k-distance(o), d(p, o)}
where k-distance(o) is the distance from o to its k-th nearest point, and d(p, o) is the distance between p and o; the local reachability density lrd_k(p) is the inverse of the average reachability distance from p to its k nearest neighbors, and the LOF score LOF_k(p) is the average lrd of p's neighbors divided by lrd_k(p);
compute the LOF score within each cluster: if the LOF score of data point p is around 1, its local density is about the same as its neighbors'; if the LOF score is below 1, p lies in a relatively dense region and is unlikely to be an outlier; if the LOF score is far greater than 1, p is sparser than its neighbors and is likely an outlier.
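An illustrative implementation of the LOF computation, following the standard reach_dist / lrd / LOF definitions named above (a sketch, not the patent's code):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def lof_scores(points, k=2):
    """Local Outlier Factor:
    reach_dist_k(p, o) = max{k-distance(o), d(p, o)},
    lrd_k(p) = 1 / mean reach_dist from p to its k nearest neighbors,
    LOF_k(p) = mean(lrd of p's neighbors) / lrd_k(p)."""
    n = len(points)
    dist = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]

    def knn(i):  # indices of the k nearest neighbors of point i
        return sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]

    kdist = [dist[i][knn(i)[-1]] for i in range(n)]  # k-distance of each point

    def lrd(i):
        nbrs = knn(i)
        reach = [max(kdist[j], dist[i][j]) for j in nbrs]  # reach_dist_k(i, j)
        return len(nbrs) / sum(reach)

    return [sum(lrd(j) for j in knn(i)) / (k * lrd(i)) for i in range(n)]

# A tight unit square plus one far-away point: the outlier's LOF is >> 1.
pts = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
scores = lof_scores(pts, k=2)
print(scores[-1] > 2.0)  # → True
```

Points whose LOF is far above 1 would be dropped from the cluster before scoring.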
score generation: score each generated cluster; the score measures clustering quality, and high-quality clusters are retained;
the score is calculated using the root-mean-square standard deviation (RMSSTD):
RMSSTD = sqrt( Σ_i Σ_{x ∈ Ci} ||x − c_i||² / (P · Σ_i (n_i − 1)) )
where Ci denotes the i-th cluster, c_i is the center of that cluster, x ∈ Ci is a sample point belonging to the i-th cluster, n_i is the number of samples in the i-th cluster, and P is the dimension of the sample vectors.
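The RMSSTD score can be sketched as follows, directly under the definitions above (a plain-Python illustration, not the patent's code):

```python
import math

def rmsstd(clusters):
    """Root-mean-square standard deviation over all clusters:
    sqrt( sum_i sum_{x in Ci} ||x - c_i||^2 / (P * sum_i (n_i - 1)) ),
    where c_i is the centroid of cluster i, n_i its size, P the dimension."""
    P = len(clusters[0][0])
    num, den = 0.0, 0
    for cluster in clusters:
        n = len(cluster)
        centroid = [sum(x[d] for x in cluster) / n for d in range(P)]
        num += sum(sum((x[d] - centroid[d]) ** 2 for d in range(P))
                   for x in cluster)
        den += n - 1
    return math.sqrt(num / (P * den))

# Tighter clusters yield a lower (better) RMSSTD.
tight = [[[0, 0], [0, 0.1]], [[5, 5], [5, 5.1]]]
loose = [[[0, 0], [0, 3]], [[5, 5], [5, 8]]]
print(rmsstd(tight) < rmsstd(loose))  # → True
```

A low RMSSTD marks a compact, high-quality cluster; clusters above a chosen cutoff would be discarded.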
S5, question generation, which generates the most suitable question for each cluster.
The question generation step S5 uses the PageRank algorithm:
each sentence in the cluster is treated as a node; if two sentences are similar, an undirected weighted edge is placed between the two nodes, with the similarity as the weight;
the score of each sentence is calculated as:
S(Vi) = (1 − d) + d · Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|
where S(Vi) is the importance (PR value) of sentence i; d is a damping coefficient, usually set to 0.85; In(Vi) is the set of sentences with links pointing to sentence i; Out(Vj) is the set of sentences that the links in sentence j point to; and |Out(Vj)| is the number of elements in that set. Finally, the sentence with the largest PR value is taken as the question.
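A sketch of this question-selection step: PageRank over the sentence-similarity graph using the |Out(Vj)| form of the formula, where an edge exists wherever similarity is positive; the 3-sentence similarity matrix is a hypothetical example:

```python
def textrank(similarity, d=0.85, iters=50):
    """PageRank over a sentence graph:
    S(Vi) = (1 - d) + d * sum_{Vj in In(Vi)} S(Vj) / |Out(Vj)|.
    Edges are placed where similarity > 0; the graph is undirected, so
    In and Out coincide. The highest-scoring sentence becomes the question."""
    n = len(similarity)
    out_deg = [sum(1 for j in range(n) if j != i and similarity[i][j] > 0)
               for i in range(n)]
    scores = [1.0] * n
    for _ in range(iters):  # fixed-point iteration with damping
        scores = [(1 - d) + d * sum(scores[j] / out_deg[j]
                                    for j in range(n)
                                    if j != i and similarity[j][i] > 0)
                  for i in range(n)]
    return scores

# Hypothetical 3-sentence cluster: sentence 0 is similar to both others.
sim = [[0.0, 0.9, 0.8],
       [0.9, 0.0, 0.0],
       [0.8, 0.0, 0.0]]
scores = textrank(sim)
print(scores.index(max(scores)))  # → 0, the most central sentence
```

The patent's edges carry similarity weights; a weighted variant would divide by the sum of Vj's edge weights instead of |Out(Vj)|.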
Claims (6)
1. A text clustering system based on fine-tuned features, characterized by comprising:
a chat corpus module: storing the chat corpus;
a downstream task module: constructing a downstream task and generating sentence vectors;
a hierarchical clustering module: clustering the sentence vectors into clusters;
an anomaly detection module: detecting the reliability of each cluster;
a question generation module: generating the most suitable question for each cluster.
2. The text clustering system based on fine-tuned features according to claim 1, characterized in that the system further comprises a text preprocessing module, used to normalize template-like messages in the chat corpus, including pictures, links, cards, and Taobao password codes.
3. A text clustering method based on fine-tuned features using the system of claim 1, characterized by comprising the following steps:
S1, constructing a downstream task, the downstream task being used to predict whether 2 sentences in the chat corpus are similar;
S2, text preprocessing, in which template-like messages in the chat corpus, including pictures, links, cards, and Taobao password codes, are normalized and merged into a single cluster;
S3, representing the sentences in the chat corpus as vectors;
S4, clustering, in which the sentence vectors are clustered into clusters;
S5, question generation, which generates the most suitable question for each cluster.
4. The text clustering method based on fine-tuned features according to claim 3, characterized in that the downstream task in step S1 uses a Triplet Loss function; each input sample is 3 sentences, where the middle sample is called the anchor, the left sample is the positive sample, and the right sample is the negative sample; training optimizes toward maximizing the distance between the anchor and the negative sample while minimizing the distance between the anchor and the positive sample.
5. The text clustering method based on fine-tuned features according to claim 4, characterized in that the clustering step S4 comprises:
hierarchical clustering: first, calculate the Euclidean distance between sentences and set a distance threshold; merge two sentences whose distance is below the threshold into one cluster;
calculate the distance between each sentence in the cluster and a third sentence and take the minimum as the distance between the cluster and the third sentence; if that distance is below the threshold, merge the third sentence into the cluster;
continue in the same way with the fourth sentence, the fifth sentence, and so on, until every sentence within the threshold distance has been merged into the cluster; hierarchical clustering is then complete;
anomaly detection: use the LOF (local outlier factor) algorithm, which computes the density of each sentence relative to its neighboring data points; the reachability distance is calculated as follows:
reach_dist_k(p, o) = max{k-distance(o), d(p, o)}
where k-distance(o) is the distance from o to its k-th nearest point, and d(p, o) is the distance between p and o; the local reachability density lrd_k(p) is the inverse of the average reachability distance from p to its k nearest neighbors, and the LOF score LOF_k(p) is the average lrd of p's neighbors divided by lrd_k(p);
compute the LOF score of each in-cluster question: if the LOF score of question p is around 1, its local density is about the same as its neighbors'; if the LOF score is below 1, p lies in a relatively dense region and is unlikely to be an outlier; if the LOF score is far greater than 1, p is sparser than its neighbors and is likely an outlier;
score generation: score each generated cluster; the score measures clustering quality, and high-quality clusters are retained;
the score is calculated using the root-mean-square standard deviation (RMSSTD):
RMSSTD = sqrt( Σ_i Σ_{x ∈ Ci} ||x − c_i||² / (P · Σ_i (n_i − 1)) )
where Ci denotes the i-th cluster, c_i is the center of that cluster, x ∈ Ci is a sample point belonging to the i-th cluster, n_i is the number of samples in the i-th cluster, and P is the dimension of the sample vectors.
6. The text clustering method based on fine-tuned features according to claim 3, characterized in that the question generation step S5 uses the PageRank algorithm:
each sentence in the cluster is treated as a node; if two sentences are similar, an undirected weighted edge is placed between the two nodes, with the similarity as the weight;
the score of each sentence is calculated as:
S(Vi) = (1 − d) + d · Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|
where S(Vi) is the importance (PR value) of sentence i; d is a damping coefficient, usually set to 0.85; In(Vi) is the set of sentences with links pointing to sentence i; Out(Vj) is the set of sentences that the links in sentence j point to; and |Out(Vj)| is the number of elements in that set. Finally, the sentence with the largest PR value is taken as the question.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910757370.2A CN110597986A (en) | 2019-08-16 | 2019-08-16 | Text clustering system and method based on fine tuning characteristics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110597986A true CN110597986A (en) | 2019-12-20 |
Family
ID=68854591
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910757370.2A Pending CN110597986A (en) | 2019-08-16 | 2019-08-16 | Text clustering system and method based on fine tuning characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110597986A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111178415A (en) * | 2019-12-21 | 2020-05-19 | 厦门快商通科技股份有限公司 | Method and system for hierarchical clustering of intention data based on BERT |
CN111368081A (en) * | 2020-03-03 | 2020-07-03 | 支付宝(杭州)信息技术有限公司 | Method and system for determining selected text content |
CN111814448A (en) * | 2020-07-03 | 2020-10-23 | 苏州思必驰信息科技有限公司 | Method and device for quantizing pre-training language model |
CN113538075A (en) * | 2020-04-14 | 2021-10-22 | 阿里巴巴集团控股有限公司 | Data processing method, model training method, device and equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107133238A (en) * | 2016-02-29 | 2017-09-05 | 阿里巴巴集团控股有限公司 | A kind of text message clustering method and text message clustering system |
CN107967255A (en) * | 2017-11-08 | 2018-04-27 | 北京广利核系统工程有限公司 | A kind of method and system for judging text similarity |
Non-Patent Citations (3)
Title |
---|
Wan Jiaqiang, "Research on Outlier Detection and Clustering Based on Connectivity", China Doctoral Dissertations Full-text Database, Information Science and Technology, ISSN 1674-022X * |
Shao Hongyu, "Research on Short-Text Clustering and Cluster-Result Description Methods", China Master's Theses Full-text Database, Information Science and Technology, ISSN 1674-0246 * |
Chen Linghong, "Research and Design of a Personalized Recommendation System for University Libraries", China Master's Theses Full-text Database, Information Science and Technology, ISSN 1674-0246 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20191220 |