CN110597986A - Text clustering system and method based on fine tuning characteristics - Google Patents

Text clustering system and method based on fine tuning characteristics

Info

Publication number
CN110597986A
Authority
CN
China
Prior art keywords
cluster
sentence
clustering
distance
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910757370.2A
Other languages
Chinese (zh)
Inventor
汪鹏 (Wang Peng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Weier Network Technology Co Ltd
Original Assignee
Hangzhou Weier Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Weier Network Technology Co Ltd filed Critical Hangzhou Weier Network Technology Co Ltd
Priority to CN201910757370.2A
Publication of CN110597986A

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Abstract

A text clustering system and method based on fine-tuning features, relating to the technical field of information clustering. The system comprises: a chat corpus, storing the chat corpus data; a downstream task module, constructing a downstream task and generating sentence vectors; a hierarchical clustering module, clustering the sentence vectors into clusters; an anomaly detection module, detecting the reliability of each cluster; and a question generation module, generating the most appropriate question for each cluster. The method fine-tunes a pre-trained word vector model through a downstream task, so that the generated vectors better suit the clustering task; improves the reliability of each cluster using anomaly detection; extracts highly reliable clusters by scoring each cluster; and finally generates a question name for each cluster, improving the reliability of the cluster names.

Description

Text clustering system and method based on fine tuning characteristics
Technical Field
The invention relates to the technical field of information clustering, in particular to a text clustering system and method based on fine tuning characteristics.
Background
With the development and application of network technology and the explosive growth of information resources, research on text mining, information filtering, and information search has unprecedented prospects, and clustering techniques are becoming the core of text information mining. Text clustering is an important technique in text mining for discovering the data distribution and its implicit patterns. Clustering divides data with similarities into groups so that the elements in each cluster share common characteristics, usually defined in terms of a distance metric. However, conventional clustering is sensitive to initial values and outliers, and its generalization performance is weak.
To address these defects, and drawing on the rapid development of pre-trained word vector techniques such as BERT, GPT, and ELMo in the NLP field, the invention generates vectors better suited to clustering through a downstream subtask, which ensures the generalization ability of the clustering; on this basis, an anomaly detection module is added to further improve the clustering effect.
Disclosure of Invention
The invention aims to provide a text clustering system and method based on fine tuning characteristics.
The technical scheme of the invention is as follows: a system for text clustering based on fine-tuning features, comprising:
a chat corpus: storing the chat corpus data;
a downstream task module: constructing a downstream task and generating sentence vectors;
a hierarchical clustering module: clustering the sentence vectors into clusters;
an anomaly detection module: detecting the reliability of each cluster;
a question generation module: generating the most appropriate question for each cluster.
The above text clustering system based on fine-tuning features further includes a text preprocessing module, used for preprocessing and normalizing similar structured information in the chat corpus, including pictures, links, cards, and Taobao pass-codes.
A text clustering method based on fine-tuning features comprises the following steps:
S1, constructing a downstream task, wherein the downstream task is used for predicting whether 2 sentences in the chat corpus are similar;
S2, text preprocessing, namely normalizing similar structured information in the chat corpus, including pictures, links, cards, and Taobao pass-codes, into a single cluster;
S3, representing the sentences in the chat corpus as vectors;
S4, clustering, namely clustering the sentence vectors into clusters;
S5, question generation, namely generating the most suitable question for each cluster.
In the foregoing text clustering method based on fine-tuning features, the downstream task in step S1 uses a triplet loss function. The input samples are 3 sentences: a middle sample called the anchor, a left sample as a positive sample, and a right sample as a negative sample. The training objective is to maximize the distance between the anchor and the negative sample while minimizing the distance between the anchor and the positive sample.
In the foregoing text clustering method based on fine-tuning features, the step S4 includes:
hierarchical clustering: first, the Euclidean distance between sentences is calculated and a distance threshold is set; two sentences whose distance is less than the threshold are merged into a cluster;
the distance between each sentence in the cluster and a third sentence is then calculated, and the minimum is taken as the distance between the cluster and the third sentence; if this distance is smaller than the threshold, the third sentence is merged into the cluster;
a fourth sentence, a fifth sentence, and so on are processed in the same way, until all sentences within the threshold distance have been placed into clusters, completing the hierarchical clustering;
anomaly detection: the LOF (local outlier factor) algorithm is used, which evaluates the relative density of each sentence with respect to its neighboring data points. The reachability distance is calculated as
reach_dist_k(p, o) = max{ k-distance(o), d(p, o) }
where k-distance(o) is the distance from o to its kth nearest neighbor and d(p, o) is the distance between p and o. The local reachability density of p is the inverse of its average reachability distance to its k nearest neighbors N_k(p):
lrd_k(p) = |N_k(p)| / Σ_{o ∈ N_k(p)} reach_dist_k(p, o)
and the LOF score of p is the average ratio of the local reachability density of p's neighbors to that of p:
LOF_k(p) = ( Σ_{o ∈ N_k(p)} lrd_k(o) / lrd_k(p) ) / |N_k(p)|
An LOF score is calculated for each intra-cluster question: if the LOF score of a question p is around 1, its local density is about the same as its neighbors'; if the LOF score is less than 1, p lies in a relatively dense region and is unlikely to be an outlier; if the LOF score is far greater than 1, p is farther away than the other points and is likely an abnormal point;
score generation: each generated cluster is scored; the score measures the clustering quality, and high-quality clusters are retained;
the score is calculated using the root mean square standard deviation:
RMSSTD = sqrt( Σ_i Σ_{x ∈ Ci} ||x - ci||² / Σ_i P(ni - 1) )
where Ci denotes the ith cluster, ci is the center of that cluster, x ∈ Ci denotes a sample point belonging to the ith cluster, ni is the number of samples in the ith cluster, and P is the vector dimension of the sample points.
In the foregoing text clustering method based on fine-tuning features, the step S5 uses the PageRank algorithm:
each sentence in the cluster is regarded as a node; if two sentences are similar, an undirected weighted edge is considered to exist between the two nodes, with the similarity as its weight;
the score of each sentence is then calculated as
S(Vi) = (1 - d) + d · Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|
where S(Vi) is the importance (PR value) of sentence i, d is a damping coefficient, generally set to 0.85, In(Vi) is the set of sentences with links pointing to sentence i, Out(Vj) is the set of sentences that the links in sentence j point to, and |Out(Vj)| is the number of elements in that set. Finally, the sentence with the largest PR value is taken as the question.
Compared with the prior art, the invention fine-tunes the pre-trained word vector model through a downstream task, so that the generated vectors better suit the clustering task; improves the reliability of each cluster using anomaly detection; extracts highly reliable clusters by scoring each cluster; and finally generates a question name for each cluster, improving the reliability of the cluster names.
Detailed Description
The present invention is further illustrated by the following examples, which are not to be construed as limiting the invention.
Example: a text clustering system based on fine-tuning features, comprising:
a chat corpus: storing the chat corpus data;
a downstream task module: constructing a downstream task and generating sentence vectors;
a text preprocessing module: used for preprocessing and normalizing similar structured information in the chat corpus, including pictures, links, cards, and Taobao pass-codes;
a hierarchical clustering module: clustering the sentence vectors into clusters;
an anomaly detection module: detecting the reliability of each cluster;
a question generation module: generating the most appropriate question for each cluster.
Specifically, in this method of text clustering based on pre-trained word vectors fine-tuned by a downstream task with anomaly detection, the original pre-trained model is trained on a large-scale Chinese corpus. Although such a model has strong text representation capability, to make it better suit our task, training data needs to be constructed and the model fine-tuned on the constructed data set. This has 2 advantages: 1. the model gets to see the specialized terms that may occur in the chat corpus, reducing the possibility of semantic deviation (for example, '翡冷翠' (Feilengcui) and 'Florence' refer to the same place, but if the model has never seen the former term, the vectors generated for the two will be unrelated); 2. fine-tuning lets the vectors generated by the model better express the similarity between sentences. The fine-tuning objective, making the distances between similar sentences small and the distances between dissimilar sentences large, matches the clustering logic: similar sentences are gathered into one cluster, and different sentences are separated.
The method comprises the following steps:
s1, constructing a downstream task, wherein the downstream task is used for predicting whether 2 sentences in the chat corpus are similar or not;
the downstream task adopts a triple Loss function, input samples are 3 sentences, the intermediate samples are called acher, a left sample is a positive sample, a right sample is a negative sample, and the training optimization aims at maximizing the distance between the acher sample and the negative sample and minimizing the distance between the acher sample and the positive sample.
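A minimal sketch of such a triplet-loss objective, written in PyTorch (the patent does not name a framework; the margin value and the use of Euclidean distance as the metric are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Euclidean distance from the anchor to the positive and negative samples
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # Hinge loss: push the anchor-negative distance above the
    # anchor-positive distance by at least `margin` (margin is an assumption)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```

Here `anchor`, `positive`, and `negative` would be batches of sentence vectors produced by the pre-trained encoder for the middle, left, and right samples of each triple.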
S2, text preprocessing, namely normalizing similar structured information in the chat corpus, including pictures, links, cards, and Taobao pass-codes, into a single cluster; this implicitly duplicated data can then be removed, which on the one hand improves the quality of the clustering and on the other hand speeds it up, since roughly 50% of the data can be removed by deduplication.
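One possible shape for this normalization step is sketched below; the placeholder tokens and the regular expressions for pictures, links, cards, and pass-codes are assumptions, since the patent does not specify the message formats:

```python
import re

# Illustrative patterns; the real picture/link/card/pass-code formats
# in the chat corpus are not specified in the patent.
PATTERNS = [
    (re.compile(r"https?://\S+"), "<LINK>"),
    (re.compile(r"\[图片\]|\[image\]", re.IGNORECASE), "<PICTURE>"),
    (re.compile(r"\[卡片\]|\[card\]", re.IGNORECASE), "<CARD>"),
    (re.compile(r"[¥￥$][A-Za-z0-9]{8,}[¥￥$]"), "<TAO_CODE>"),  # pass-code-like strings
]

def normalize(sentence: str) -> str:
    """Map pictures, links, cards and pass-codes onto shared placeholder tokens."""
    for pattern, token in PATTERNS:
        sentence = pattern.sub(token, sentence)
    return sentence.strip()

def deduplicate(sentences):
    """Drop sentences that become identical after normalization."""
    seen, kept = set(), []
    for s in sentences:
        key = normalize(s)
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept
```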
S3, representing the sentences in the chat corpus as vectors;
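A sketch of how the sentence vectors might be extracted from the fine-tuned encoder by mean pooling over token states; the checkpoint name and the pooling strategy are assumptions, since the patent only states that sentences are represented as vectors by the fine-tuned model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "bert-base-chinese" is a placeholder; in practice this would be the
# checkpoint fine-tuned with the triplet-loss downstream task above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def sentence_vectors(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)    # (batch, tokens, 1)
    # Mean-pool over the non-padding tokens to get one vector per sentence
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```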
S4, clustering, namely clustering the sentence vectors into clusters;
the clustering comprises the following steps:
hierarchical clustering: first, the Euclidean distance between sentences is calculated and a distance threshold is set; two sentences whose distance is less than the threshold are merged into a cluster;
the cluster now contains two sentences, and the distance between the cluster and a third sentence is calculated as follows: the distance between each sentence in the cluster and the third sentence is calculated (two distances), and the minimum is taken as the distance between the cluster and the third sentence; if this distance is smaller than the threshold, the third sentence is merged into the cluster;
a fourth sentence, a fifth sentence, and so on are processed in the same way, until all sentences within the threshold distance have been placed into clusters, completing the hierarchical clustering;
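A compact sketch of this threshold-based, single-linkage merging, processing sentences one at a time; the greedy insertion order and the tie-breaking between candidate clusters are assumptions the patent leaves open:

```python
import numpy as np

def hierarchical_threshold_clustering(vectors, threshold):
    """Each sentence joins the cluster whose nearest member (minimum
    Euclidean distance) is below `threshold`; otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, v in enumerate(vectors):
        best, best_dist = None, threshold
        for c in clusters:
            # Single linkage: cluster-to-sentence distance is the minimum
            # distance from the sentence to any member of the cluster.
            d = min(np.linalg.norm(v - vectors[j]) for j in c)
            if d < best_dist:
                best, best_dist = c, d
        if best is not None:
            best.append(i)
        else:
            clusters.append([i])
    return clusters
```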
anomaly detection: the LOF (local outlier factor) algorithm is used, which evaluates the relative density of each sentence with respect to its neighboring data points. The reachability distance is calculated as
reach_dist_k(p, o) = max{ k-distance(o), d(p, o) }
where k-distance(o) is the distance from o to its kth nearest neighbor and d(p, o) is the distance between p and o. The local reachability density of p is the inverse of its average reachability distance to its k nearest neighbors N_k(p):
lrd_k(p) = |N_k(p)| / Σ_{o ∈ N_k(p)} reach_dist_k(p, o)
and the LOF score of p is the average ratio of the local reachability density of p's neighbors to that of p:
LOF_k(p) = ( Σ_{o ∈ N_k(p)} lrd_k(o) / lrd_k(p) ) / |N_k(p)|
An LOF score is calculated within each cluster: if the LOF score of a data point p is around 1, its local density is about the same as its neighbors'; if the LOF score is less than 1, p lies in a relatively dense region and is unlikely to be an outlier; if the LOF score is far greater than 1, p is farther away than the other points and is likely an abnormal point;
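In practice this LOF computation is available off the shelf; the sketch below uses scikit-learn's LocalOutlierFactor, where the neighbor count k and the cut-off of 1.5 for "far greater than 1" are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def filter_outliers(cluster_vectors, k=5, cutoff=1.5):
    """Remove intra-cluster points whose LOF score is far above 1."""
    X = np.asarray(cluster_vectors)
    if len(X) < 3:                    # too few points for a density estimate
        return X
    lof = LocalOutlierFactor(n_neighbors=min(k, len(X) - 1))
    lof.fit(X)
    scores = -lof.negative_outlier_factor_   # LOF scores; values near 1 are inliers
    return X[scores < cutoff]
```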
score generation: each generated cluster is scored; the score measures the clustering quality, and high-quality clusters are retained;
the score is calculated using the root mean square standard deviation:
RMSSTD = sqrt( Σ_i Σ_{x ∈ Ci} ||x - ci||² / Σ_i P(ni - 1) )
where Ci denotes the ith cluster, ci is the center of that cluster, x ∈ Ci denotes a sample point belonging to the ith cluster, ni is the number of samples in the ith cluster, and P is the vector dimension of the sample points.
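A direct translation of this score into code for a single cluster (how the retention threshold is then applied to the scores is an assumption left open by the patent):

```python
import numpy as np

def rmsstd(cluster):
    """Root-mean-square standard deviation of one cluster; lower means tighter.
    `cluster` is an (n_i, P) array of the cluster's sentence vectors, n_i > 1."""
    n, P = cluster.shape
    center = cluster.mean(axis=0)
    return np.sqrt(((cluster - center) ** 2).sum() / (P * (n - 1)))
```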
S5, question generation, namely generating the most suitable question for each cluster.
The step S5 uses the PageRank algorithm:
each sentence in the cluster is regarded as a node; if two sentences are similar, an undirected weighted edge is considered to exist between the two nodes, with the similarity as its weight;
the score of each sentence is then calculated as
S(Vi) = (1 - d) + d · Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|
where S(Vi) is the importance (PR value) of sentence i, d is a damping coefficient, generally set to 0.85, In(Vi) is the set of sentences with links pointing to sentence i, Out(Vj) is the set of sentences that the links in sentence j point to, and |Out(Vj)| is the number of elements in that set. Finally, the sentence with the largest PR value is taken as the question.
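A sketch of the PR iteration over the cluster's similarity graph. Because the edges here carry similarity weights, the sketch uses the weighted (TextRank-style) update, which replaces |Out(Vj)| with the total outgoing edge weight of Vj; the iteration count and convergence tolerance are assumptions:

```python
import numpy as np

def best_question(similarity, d=0.85, iters=100, tol=1e-6):
    """Return the index of the sentence with the highest PR value.
    `similarity` is an (n, n) symmetric matrix of pairwise sentence
    similarities, i.e. the undirected weighted edges of the cluster graph."""
    n = similarity.shape[0]
    W = similarity.astype(float).copy()
    np.fill_diagonal(W, 0.0)               # no self-loops
    out_weight = W.sum(axis=0)             # total edge weight leaving each node
    out_weight[out_weight == 0] = 1.0      # guard isolated nodes
    S = np.ones(n)
    for _ in range(iters):
        S_new = (1 - d) + d * W.dot(S / out_weight)
        done = np.abs(S_new - S).max() < tol
        S = S_new
        if done:
            break
    return int(S.argmax())                 # this sentence becomes the question
```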

Claims (6)

1. A text clustering system based on fine-tuning features, characterized by comprising:
a chat corpus: storing the chat corpus data;
a downstream task module: constructing a downstream task and generating sentence vectors;
a hierarchical clustering module: clustering the sentence vectors into clusters;
an anomaly detection module: detecting the reliability of each cluster;
a question generation module: generating the most appropriate question for each cluster.
2. The text clustering system based on fine-tuning features of claim 1, characterized in that: the system further comprises a text preprocessing module, used for preprocessing and normalizing similar structured information in the chat corpus, including pictures, links, cards, and Taobao pass-codes.
3. The text clustering method based on fine-tuning features according to claim 1, characterized by comprising the following steps:
S1, constructing a downstream task, wherein the downstream task is used for predicting whether 2 sentences in the chat corpus are similar;
S2, text preprocessing, namely normalizing similar structured information in the chat corpus, including pictures, links, cards, and Taobao pass-codes, into a single cluster;
S3, representing the sentences in the chat corpus as vectors;
S4, clustering, namely clustering the sentence vectors into clusters;
S5, question generation, namely generating the most suitable question for each cluster.
4. The text clustering method based on fine-tuning features of claim 3, characterized in that: in the step S1, the downstream task uses a triplet loss function; the input samples are 3 sentences, the middle sample is called the anchor, the left sample is a positive sample, and the right sample is a negative sample; the training objective is to maximize the distance between the anchor and the negative sample while minimizing the distance between the anchor and the positive sample.
5. The text clustering method based on fine-tuning features of claim 4, characterized in that: the step S4 includes:
hierarchical clustering: first, the Euclidean distance between sentences is calculated and a distance threshold is set; two sentences whose distance is less than the threshold are merged into a cluster;
the distance between each sentence in the cluster and a third sentence is calculated, and the minimum is taken as the distance between the cluster and the third sentence; if this distance is smaller than the threshold, the third sentence is merged into the cluster;
a fourth sentence, a fifth sentence, and so on are processed in the same way, until all sentences within the threshold distance have been placed into clusters, completing the hierarchical clustering;
anomaly detection: the LOF algorithm is used, which evaluates the relative density of each sentence with respect to its neighboring data points; the reachability distance is calculated as
reach_dist_k(p, o) = max{ k-distance(o), d(p, o) }
where k-distance(o) is the distance from o to its kth nearest neighbor and d(p, o) is the distance between p and o;
an LOF score is calculated for each intra-cluster question: if the LOF score of a question p is around 1, its local density is about the same as its neighbors'; if the LOF score is less than 1, p lies in a relatively dense region and is unlikely to be an outlier; if the LOF score is far greater than 1, p is farther away than the other points and is likely an abnormal point;
score generation: each generated cluster is scored; the score measures the clustering quality, and high-quality clusters are retained;
the score is calculated using the root mean square standard deviation:
RMSSTD = sqrt( Σ_i Σ_{x ∈ Ci} ||x - ci||² / Σ_i P(ni - 1) )
where Ci denotes the ith cluster, ci is the center of that cluster, x ∈ Ci denotes a sample point belonging to the ith cluster, ni is the number of samples in the ith cluster, and P is the vector dimension of the sample points.
6. The text clustering method based on fine-tuning features of claim 3, characterized in that: the step S5 uses the PageRank algorithm:
each sentence in the cluster is regarded as a node; if two sentences are similar, an undirected weighted edge is considered to exist between the two nodes, with the similarity as its weight;
the score of each sentence is calculated as
S(Vi) = (1 - d) + d · Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|
where S(Vi) is the importance (PR value) of sentence i, d is a damping coefficient, generally set to 0.85, In(Vi) is the set of sentences with links pointing to sentence i, Out(Vj) is the set of sentences that the links in sentence j point to, and |Out(Vj)| is the number of elements in that set; finally, the sentence with the largest PR value is taken as the question.
CN201910757370.2A 2019-08-16 2019-08-16 Text clustering system and method based on fine tuning characteristics Pending CN110597986A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910757370.2A CN110597986A (en) 2019-08-16 2019-08-16 Text clustering system and method based on fine tuning characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910757370.2A CN110597986A (en) 2019-08-16 2019-08-16 Text clustering system and method based on fine tuning characteristics

Publications (1)

Publication Number Publication Date
CN110597986A 2019-12-20

Family

ID=68854591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910757370.2A Pending CN110597986A (en) 2019-08-16 2019-08-16 Text clustering system and method based on fine tuning characteristics

Country Status (1)

Country Link
CN (1) CN110597986A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178415A (en) * 2019-12-21 2020-05-19 厦门快商通科技股份有限公司 Method and system for hierarchical clustering of intention data based on BERT
CN111368081A (en) * 2020-03-03 2020-07-03 支付宝(杭州)信息技术有限公司 Method and system for determining selected text content
CN111814448A (en) * 2020-07-03 2020-10-23 苏州思必驰信息科技有限公司 Method and device for quantizing pre-training language model
CN113538075A (en) * 2020-04-14 2021-10-22 阿里巴巴集团控股有限公司 Data processing method, model training method, device and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133238A (en) * 2016-02-29 2017-09-05 阿里巴巴集团控股有限公司 A kind of text message clustering method and text message clustering system
CN107967255A (en) * 2017-11-08 2018-04-27 北京广利核系统工程有限公司 A kind of method and system for judging text similarity

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133238A (en) * 2016-02-29 2017-09-05 阿里巴巴集团控股有限公司 A kind of text message clustering method and text message clustering system
CN107967255A (en) * 2017-11-08 2018-04-27 北京广利核系统工程有限公司 A kind of method and system for judging text similarity

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
万家强 (Wan Jiaqiang), "Research on Outlier Detection and Clustering Based on Connectivity" (基于连通性的离群检测与聚类研究), China Doctoral Dissertations Full-text Database, Information Science and Technology, ISSN 1674-022X *
邵洪雨 (Shao Hongyu), "Research on Short Text Clustering and Description of Clustering Results" (短文本聚类及聚类结果描述方法研究), China Master's Theses Full-text Database, Information Science and Technology, ISSN 1674-0246 *
陈伶红 (Chen Linghong), "Research and Design of a Personalized Recommendation System for University Libraries" (高校图书馆个性化推荐系统的研究与设计), China Master's Theses Full-text Database, Information Science and Technology, ISSN 1674-0246 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178415A (en) * 2019-12-21 2020-05-19 厦门快商通科技股份有限公司 Method and system for hierarchical clustering of intention data based on BERT
CN111368081A (en) * 2020-03-03 2020-07-03 支付宝(杭州)信息技术有限公司 Method and system for determining selected text content
CN113538075A (en) * 2020-04-14 2021-10-22 阿里巴巴集团控股有限公司 Data processing method, model training method, device and equipment
CN111814448A (en) * 2020-07-03 2020-10-23 苏州思必驰信息科技有限公司 Method and device for quantizing pre-training language model
CN111814448B (en) * 2020-07-03 2024-01-16 思必驰科技股份有限公司 Pre-training language model quantization method and device

Similar Documents

Publication Publication Date Title
Abbas et al. Multinomial Naive Bayes classification model for sentiment analysis
CN106383877B (en) Social media online short text clustering and topic detection method
CN110597986A (en) Text clustering system and method based on fine tuning characteristics
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN110321925B (en) Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
CN106372061B (en) Short text similarity calculation method based on semantics
CN109815336B (en) Text aggregation method and system
CN107832306A (en) A kind of similar entities method for digging based on Doc2vec
Mohammed et al. Glove word embedding and DBSCAN algorithms for semantic document clustering
CN106557777B (en) One kind being based on the improved Kmeans document clustering method of SimHash
CN110442726B (en) Social media short text online clustering method based on entity constraint
CN107357895B (en) Text representation processing method based on bag-of-words model
CN112883722B (en) Distributed text summarization method based on cloud data center
CN110851733A (en) Community discovery and emotion interpretation method based on network topology and document content
Gad et al. Incremental clustering algorithm based on phrase-semantic similarity histogram
CN114048310A (en) Dynamic intelligence event timeline extraction method based on LDA theme AP clustering
Basha et al. An improved similarity matching based clustering framework for short and sentence level text
Zhao et al. Improving continual relation extraction by distinguishing analogous semantics
CN110597982A (en) Short text topic clustering algorithm based on word co-occurrence network
CN115688768A (en) Medical text professional classification method based on confrontation data enhancement
CN115098690A (en) Multi-data document classification method and system based on cluster analysis
CN111737461B (en) Text processing method and device, electronic equipment and computer readable storage medium
Bakr et al. Efficient incremental phrase-based document clustering
Nguyen et al. Text summarization on large-scale Vietnamese datasets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191220)