CN116361470A - Text clustering cleaning and merging method based on topic description - Google Patents

Info

Publication number: CN116361470A
Application number: CN202310347961.9A
Authority: CN (China)
Prior art keywords: topic, text, similarity, keywords, description
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN116361470B
Inventors: 王磊, 郭鸿飞, 王俊艳, 徐才, 王柯淇, 蔡昌艳, 蒋永余, 王璋盛, 曹家, 罗引
Current and original assignees: Xinhua Fusion Media Technology Development Beijing Co ltd; Beijing Zhongke Wenge Technology Co ltd
Application filed by Xinhua Fusion Media Technology Development Beijing Co ltd and Beijing Zhongke Wenge Technology Co ltd
Priority to CN202310347961.9A, granted as CN116361470B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/358: Browsing; Visualisation therefor
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a text clustering cleaning and merging method based on topic description. Texts are first clustered into a number of clusters, each of which corresponds to one topic. The clusters are then cleaned and merged using three indicators: the similarity between the topic vector and the text vectors within the topic, the similarity between the topic description and the topic description generated for each text, and the number of keywords shared by a text and its topic. The output is the final set of clusters together with a description of each topic, yielding more accurate clustering results.

Description

Text clustering cleaning and merging method based on topic description
Technical Field
The invention relates to the field of natural language processing, in particular to a text clustering cleaning and merging method based on topic description.
Background
Clustering texts by topic has very important applications in the field of text processing. However, text covers a very wide range of subjects, and the number of texts produced every day is very large, which poses challenges for the practical application of text clustering. Existing approaches mainly cluster texts with algorithms such as k-means, but the clustered results are not further optimized. k-means-based algorithms generally require the number of clusters to be set in advance: k is often chosen arbitrarily, or determined with methods such as the silhouette coefficient or the elbow method, but these methods cannot guarantee an accurate cluster count, and texts may still be assigned to the wrong cluster. Single-pass methods do not require the number of clusters to be set, but texts can still be clustered incorrectly, and existing methods do not clean and merge the clustered results.
Disclosure of Invention
Aiming at the technical problems, the invention adopts the following technical scheme:
the embodiment of the invention provides a text clustering cleaning and merging method based on topic description, which comprises the following steps:
s100, obtaining feature vectors and keywords of each text in the texts to be clustered, wherein each text comprises h keywords;
s200, clustering texts to be clustered by using a set clustering algorithm based on the obtained feature vectors to obtain a plurality of topics;
s300, acquiring any text in any topic in the current topic, and keywords, feature vectors and topic description feature vectors of any topic;
S400, performing the p-th cleaning process on the current topics based on the number of keywords shared between a text and a topic, the similarity between the feature vectors of the text and the topic, and the similarity between the topic description feature vectors of the text and the topic, to obtain n(p) processed topics; wherein any topic a of the n(p) topics satisfies: g_p(a,q) ≥ D1_p, SF_p(a,q) ≥ D2_p and ST_p(a,q) ≥ D3_p; wherein g_p(a,q) is the number of keywords shared by topic a and the q-th text T_aq in topic a, SF_p(a,q) is the similarity between the feature vector of topic a and the feature vector of T_aq, and ST_p(a,q) is the similarity between the topic description feature vector of topic a and the topic description feature vector of T_aq; D1_p, D2_p and D3_p are the first, second and third set thresholds corresponding to the p-th cleaning process; a takes values from 1 to n(p), p takes values from 1 to C0, and C0 is a preset number of passes; q takes values from 1 to f(a), f(a) being the number of texts in topic a;
s500, setting p=p+1, if p is less than or equal to C0, executing S300; otherwise, H topics after the cleaning treatment are obtained, and S600 is executed;
S600, based on the H topics, obtaining a topic list S sorted in order of decreasing text count, and obtaining the keywords, feature vector, topic description and topic description feature vector of any topic u in S, u taking values from 1 to H;
S700, merging S based on the number of keywords shared between topics, the similarity between topic feature vectors and the similarity between topic description feature vectors, to obtain a target topic list;
S800, outputting the topic description and corresponding texts of every topic in the target topic list, wherein the texts of each topic are sorted by release time.
The invention has at least the following beneficial effects:
according to the text clustering cleaning and merging method based on topic description, firstly, texts are clustered to obtain a plurality of clustering results, each clustering result is equivalent to one topic, and then, based on three indexes of the text similarity of a topic vector and the text vector in the topic, the text similarity of the topic description and the topic description generated by each text, and the number of keywords of the text and the topic, the clustering results are cleaned and merged, and finally, the clustering results and the description of each topic are obtained, so that the clustering results are more accurate.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person skilled in the art may derive other drawings from them without inventive effort.
Fig. 1 is a flowchart of a text cluster cleaning and merging method based on topic description according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The embodiments described are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
Fig. 1 is a flowchart of a text cluster cleaning and merging method based on topic description according to an embodiment of the present invention.
The embodiment of the invention provides a text clustering cleaning and merging method based on topic description, as shown in fig. 1, the method can comprise the following steps:
s100, obtaining feature vectors and keywords of each text in the texts to be clustered, wherein each text comprises h keywords.
In the embodiment of the invention, the text can be news text or other types of text. Text may be crawled from various websites or channels.
In the embodiment of the invention, the keywords of each text can be obtained by an existing keyword extraction method. In an exemplary embodiment, the keywords of each text and the weight of each keyword are obtained through the TextRank algorithm. The number of keywords h may be set based on actual needs; in one example, h = 7.
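The TextRank step above can be sketched as follows. This is a minimal, self-contained illustration over an already-tokenized word list; the window size, damping factor, iteration count and example tokens are illustrative assumptions, not values from the patent (a production system for Chinese text would typically use a library implementation such as jieba's TextRank):

```python
from collections import defaultdict

def textrank_keywords(words, window=2, h=7, d=0.85, iters=50):
    """Minimal TextRank: build a co-occurrence graph over a token list,
    run PageRank-style iterations, return the top-h (word, weight) pairs."""
    # Undirected co-occurrence graph within a sliding window.
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    # PageRank-style score propagation.
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[v] / len(graph[v])
                                      for v in graph[w])
                 for w in graph}
    ranked = sorted(score.items(), key=lambda kv: -kv[1])
    return ranked[:h]
```

Each text's title or body would be tokenized first; the returned (keyword, weight) pairs feed the later keyword-merging step.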
In the embodiment of the invention, the feature vector of each text can be obtained through the existing feature extraction method. In one exemplary embodiment, the feature vector of each text may be obtained through a SimBERT model, and in particular, the title of the text may be input into the SimBERT model to obtain the feature vector of each text.
Since news crawled from various websites or channels may be duplicated, in the embodiment of the present invention the texts to be clustered are the texts obtained after de-duplication. The texts can be de-duplicated by either of the following two methods.
Method one: de-duplication based on text similarity and rules. Specifically, for each text, the similarity between the text and every other crawled text is computed in turn; if the similarity exceeds a set similarity threshold and the proportion of identical words in the titles of the two texts exceeds a set proportion threshold, the texts are considered duplicates, and the text with the latest release time is kept in the library of texts to be clustered. In the embodiment of the present invention, the similarity between texts a and b can be obtained by the following formula:
sim(a, b) = (T_a · T_b) / (||T_a|| ||T_b||)

wherein T_a and T_b are the feature vectors of texts a and b respectively, and ||T_a|| and ||T_b|| are the norms of T_a and T_b.
In the embodiment of the invention, the threshold value of the similarity may be set higher, for example, 0.95, and the ratio threshold value of the same word in the header may be set to 0.5 or 0.6.
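Method one can be sketched as below. The cosine formula matches the one above; the dictionary keys `vec`, `title` and `time`, the whitespace title tokenization, and the overlap denominator (shorter title) are illustrative assumptions, while the 0.95 similarity and 0.5 overlap thresholds come from the text:

```python
import math

def cosine(a, b):
    """Cosine similarity T_a·T_b / (||T_a|| ||T_b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def title_overlap(t1, t2):
    """Fraction of shared title words, relative to the shorter title."""
    w1, w2 = set(t1.split()), set(t2.split())
    return len(w1 & w2) / min(len(w1), len(w2))

def dedup(texts, sim_thr=0.95, overlap_thr=0.5):
    """texts: list of dicts with 'vec', 'title', 'time' (hypothetical keys).
    Among duplicates, keep the one with the latest release time."""
    kept = []
    for t in texts:
        dup = None
        for i, k in enumerate(kept):
            if (cosine(t["vec"], k["vec"]) >= sim_thr
                    and title_overlap(t["title"], k["title"]) >= overlap_thr):
                dup = i
                break
        if dup is None:
            kept.append(t)
        elif t["time"] > kept[dup]["time"]:
            kept[dup] = t     # later release time replaces the duplicate
    return kept
```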
Method two: news de-duplication based on DBSCAN clustering. DBSCAN is an unsupervised machine learning clustering algorithm. It does not require the number of clusters to be specified, is robust to outliers, and works well on data clusters of arbitrary shape and size. It requires two parameters:
1) Epsilon: the maximum radius of the neighborhood. Two data points belong to the same neighborhood if the distance between them is at most Epsilon; it is the distance metric DBSCAN uses to decide whether two points are similar enough to belong to the same cluster. The invention can set a small Epsilon so that only very similar texts (which lie close together) are grouped; within each resulting cluster, the text with the latest release time is kept and the others are discarded. In one exemplary embodiment, Epsilon may be set to 0.06.
2) Minimum points (minPts): a point whose neighborhood of radius Epsilon contains at least minPts points is treated as part of a cluster. The invention can set minPts to 1 or 2 so as to retain more noise points and outliers, pruning only very similar texts.
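For illustration, here is a tiny self-contained DBSCAN; in practice one would use a library implementation such as scikit-learn's `DBSCAN`. The euclidean distance and the toy points are assumptions (the patent does not state the metric; over normalized text vectors a cosine-based distance would behave similarly). With minPts = 1 every point is a core point, so clusters are simply Epsilon-connected components:

```python
import math

def dbscan(points, eps=0.06, min_pts=1):
    """Tiny DBSCAN (euclidean distance). Returns one cluster label per
    point; label -1 marks noise."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    n = len(points)
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1            # noise point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = [t for t in range(n) if dist(points[j], points[t]) <= eps]
            if len(nbrs) >= min_pts:  # core point: keep expanding
                queue.extend(t for t in nbrs if labels[t] is None)
        cluster += 1
    return labels
```

Within each resulting cluster, only the text with the latest release time would be kept, as described above.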
Those skilled in the art will appreciate that one or both of the above methods can be selected according to actual needs to de-duplicate the texts and obtain the texts to be clustered.
S200, clustering the texts to be clustered by using a set clustering algorithm based on the obtained feature vectors to obtain a plurality of topics.
In the embodiment of the invention, the text to be clustered can be clustered by using the existing clustering algorithm. In one exemplary embodiment, the set clustering algorithm may be a DBSCAN algorithm.
S300, acquiring any text in any topic among the current topics, and the keywords, feature vector, topic description and topic description feature vector of any topic.
In the embodiment of the invention, the keywords and feature vector of any text were already obtained in S100 and can be reused directly. The topic description of any text is obtained by the following steps:
s301, inputting the title of any text into a set topic description generation model to obtain a corresponding topic description. In the embodiment of the invention, the set topic description generation model can be a pre-training generation model such as T5 or BART. The training step of the topic description generation model comprises the following steps:
(1) Constructing a dataset
For all clustered topics, a number of topics are manually selected for labeling the data set. For each selected topic, several texts are randomly chosen and a topic description is written manually for each; each news item corresponds to one topic description, and news items describing the same topic should be given topic descriptions that are as similar as possible. The title of each news item is used as the input of the generation model, and the manually written topic description as the ground truth, to construct the training data set.
In the embodiment of the invention, topic description can be regarded as topic or subject information corresponding to the current text.
(2) Model training
The title of each text in the training data set is fed into the pre-trained generation model to obtain a prediction; the loss between the prediction and the manually labeled ground truth is computed to train the model parameters, yielding the trained topic description generation model.
After the trained topic description generation model is obtained, inputting the title of any text in any topic into the trained topic description generation model to obtain the corresponding topic description.
S302, inputting topic description of any text into a set topic description feature generation model to obtain a corresponding topic description feature vector.
In the embodiment of the invention, the set topic description feature generation model may be a SimBERT model. And inputting topic description of any text into the SimBERT model to obtain a corresponding topic description feature vector.
In the embodiment of the invention, topic description feature vectors of any topic are obtained through the following steps:
S303, based on the topic descriptions corresponding to all texts in any topic, taking the most frequently generated topic description as the topic description of the topic.
Since the topic descriptions of different texts may be identical, the topic descriptions of all texts in a topic are merged, the generation frequency of each distinct description is counted, and the description with the highest frequency is taken as the topic's description. For example, if a certain topic description is generated by 3 texts, its generation frequency is 3.
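S303 is a simple majority vote, which can be sketched with `collections.Counter` (the example descriptions are invented for illustration):

```python
from collections import Counter

def topic_description(text_descriptions):
    """Pick the most frequently generated description among the texts of
    a topic as the topic's description (step S303)."""
    return Counter(text_descriptions).most_common(1)[0][0]
```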
S304, inputting topic description of the topic into a set topic description feature generation model to obtain a corresponding topic description feature vector.
Specifically, the topic description of the topic is input to a SimBERT model, and a corresponding topic description feature vector is obtained.
Further, in the embodiment of the present invention, the keyword of any topic is obtained by the following steps:
s305, combining the same keywords in the keywords of all texts in any topic, and recalculating the weight to obtain the combined keywords.
S306, sorting the combined keywords according to the order of the weights from large to small, and acquiring the first h keywords in the sorted keywords as keywords of any topic.
Specifically, if a keyword appears in only one text, its weight is its weight in that text. If a keyword appears in multiple texts, its weight is the sum of its weights in those texts; for example, if keyword a appears in 3 texts, its weight is b1 + b2 + b3, where b1 to b3 are the weights of keyword a in the 3 texts respectively.
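Steps S305 and S306 can be sketched as follows; the per-text input format (lists of (keyword, weight) pairs) is an assumed interface:

```python
from collections import defaultdict

def topic_keywords(per_text_keywords, h=7):
    """Merge the (keyword, weight) lists of all texts in a topic:
    identical keywords are merged and their weights summed (S305),
    then the top-h keywords by total weight are kept (S306)."""
    weights = defaultdict(float)
    for kw_list in per_text_keywords:
        for kw, w in kw_list:
            weights[kw] += w
    ranked = sorted(weights.items(), key=lambda kv: -kv[1])
    return ranked[:h]
```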
Further, in the embodiment of the present invention, the feature vector of any topic is the average of the feature vectors of all texts in the topic, i.e. the feature vector of topic i is

F_i = (1/f(i)) · Σ_{j=1}^{f(i)} h_ij

wherein h_ij is the feature vector of the j-th text T_ij in topic i, and f(i) is the number of texts in topic i.
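The topic feature vector is thus an element-wise mean, which a short function makes concrete (plain lists of floats stand in for the SimBERT vectors):

```python
def topic_vector(text_vectors):
    """Topic feature vector = element-wise mean of the feature vectors
    of all texts in the topic."""
    n = len(text_vectors)
    dim = len(text_vectors[0])
    return [sum(v[d] for v in text_vectors) / n for d in range(dim)]
```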
S400, performing the p-th cleaning process on the current topics based on the number of keywords shared between a text and a topic, the similarity between the feature vectors of the text and the topic, and the similarity between the topic description feature vectors of the text and the topic, to obtain n(p) processed topics; wherein any topic a of the n(p) topics satisfies: g_p(a,q) ≥ D1_p, SF_p(a,q) ≥ D2_p and ST_p(a,q) ≥ D3_p; wherein g_p(a,q) is the number of keywords shared by topic a and the q-th text T_aq in topic a, SF_p(a,q) is the similarity between the feature vector of topic a and the feature vector of T_aq, and ST_p(a,q) is the similarity between the topic description feature vector of topic a and the topic description feature vector of T_aq; D1_p, D2_p and D3_p are the first, second and third set thresholds corresponding to the p-th cleaning process; a takes values from 1 to n(p), p takes values from 1 to C0, and C0 is a preset number of passes; q takes values from 1 to f(a), f(a) being the number of texts in topic a.
In the embodiment of the invention, the similarity between the feature vectors and the similarity between the topic description feature vectors can be obtained through the existing similarity algorithm, such as cosine similarity and the like.
In the embodiment of the invention, the first set threshold value to the third set threshold value corresponding to each cleaning process may be the same or different, and may be set based on actual needs. The first to third set thresholds may be set based on actual needs, in an exemplary embodiment, the first set threshold may be selected from 2 and 3, the second set threshold may be selected from 0.65, 0.7 and 0.8, and the third set threshold may be selected from 0.7, 0.8 and 0.85, and a desired combination value may be selected according to actual needs.
In the embodiment of the invention, C0 may be set based on actual needs; preferably C0 ≤ 3, and more preferably C0 = 2.
S500, setting p = p+1; if p ≤ C0, execute S300; otherwise, the H topics after the cleaning process are obtained and S600 is executed. Obviously, H = n(C0).
S600, based on the H topics, obtaining a topic list S sorted in order of decreasing text count, and obtaining the keywords, feature vector, topic description and topic description feature vector of any topic u in S, u taking values from 1 to H. That is, in S, a topic earlier in the list contains more texts than a later one.
The keyword, the feature vector, the topic description, and the topic description feature vector corresponding to any topic u can be obtained with reference to S300.
And S700, combining the S based on the same number of keywords among topics, the similarity among feature vectors of the topics and the similarity among topic description feature vectors of the topics to obtain a target topic list.
S800, outputting topic descriptions and corresponding texts of all topics in the target topic list.
In S800, the output text is text that is ordered by the release time, for example, ordered by the order of the release time from early to late. Topic descriptions for each topic in the target topic list can be obtained based on S303.
Further, in an embodiment of the present invention, S400 may specifically include:
S410, for the j-th text T_ij in topic i among the current topics of the p-th cleaning process, obtain g_p(i,j), SF_p(i,j) and ST_p(i,j) respectively; if g_p(i,j) ≥ D1_p, SF_p(i,j) ≥ D2_p and ST_p(i,j) ≥ D3_p, keep T_ij in topic i and execute S440; otherwise, execute S420; wherein g_p(i,j) is the number of keywords shared by T_ij and topic i at the p-th cleaning process, SF_p(i,j) is the similarity between the feature vector of T_ij and the topic feature vector of topic i, and ST_p(i,j) is the similarity between the topic description feature vector of T_ij and that of topic i; i takes values from 1 to k, k being the number of current topics; j takes values from 1 to f(i), f(i) being the number of texts in topic i.
S420, obtain g_p(ij,s), SF_p(ij,s) and ST_p(ij,s); if g_p(ij,s) ≥ D1_p, SF_p(ij,s) ≥ D2_p and ST_p(ij,s) ≥ D3_p, add T_ij to topic s, delete it from its original topic, and execute S440; otherwise, execute S430; wherein topic s is the s-th of the k-1 current topics other than topic i, g_p(ij,s) is the number of keywords shared by T_ij and topic s at the p-th cleaning process, SF_p(ij,s) is the similarity between the feature vector of T_ij and the topic feature vector of topic s, and ST_p(ij,s) is the similarity between the topic description feature vector of T_ij and that of topic s; s takes values from 1 to k-1.
S430, set s = s+1; if s ≤ k-1, execute S420; otherwise, create a new topic for T_ij, add T_ij to this new topic and delete it from its original topic, set k = k+1 (i.e., when a topic is newly added, the number of current topics increases by 1), and execute S440.
S440, setting j=j+1, if j is less than or equal to f (i), executing S410; otherwise, i=i+1 is set, and if i is less than or equal to k, S410 is executed, otherwise S500 is executed.
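One cleaning pass (S410–S440) can be sketched as below. This is a simplification under stated assumptions: topics are plain lists of texts, and the three scores g_p, SF_p and ST_p are supplied as caller-provided functions `same_kw`, `sim_vec` and `sim_desc` (hypothetical interfaces), so that the real-time recomputation of topic keywords and vectors is hidden inside those callables; tie-breaking and visiting order may differ from the patent's exact procedure:

```python
def clean_pass(topics, same_kw, sim_vec, sim_desc, d1, d2, d3):
    """One simplified cleaning pass: a text stays in its topic if all three
    thresholds hold; otherwise it moves to the first other topic that
    satisfies them, or starts a new topic of its own."""
    def fits(text, topic):
        return (same_kw(text, topic) >= d1
                and sim_vec(text, topic) >= d2
                and sim_desc(text, topic) >= d3)

    result = [list(t) for t in topics]
    i = 0
    while i < len(result):            # newly created topics are visited too
        for text in list(result[i]):
            if fits(text, result[i]):
                continue              # text stays in its current topic
            result[i].remove(text)
            for j, other in enumerate(result):
                if j != i and other and fits(text, other):
                    other.append(text)
                    break
            else:
                result.append([text])  # no topic fits: create a new one
        i += 1
    return [t for t in result if t]    # drop topics emptied by moves
```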
Those skilled in the art will appreciate that, since the set of texts in each topic may change during processing, the keywords, feature vectors, topic descriptions and topic description feature vectors of the current topics are updated in real time.
Further, in another embodiment of the present invention, S420 is replaced with:
S421, obtain the topic description similarity set ST_ij = {ST_ij(1), ST_ij(2), …, ST_ij(s), …, ST_ij(k-1)}, wherein ST_ij(s) is the similarity between the topic description feature vector of T_ij and that of the s-th of the k-1 current topics other than topic i; s takes values from 1 to k-1.
S422, sort ST_ij in descending order to obtain a sorted similarity set, and take the first m similarities of the sorted set to form the comparison similarity set STC_ij. m may be set based on actual needs, e.g., m = 5.
S423, obtain g_p(ij,w) and SF_p(ij,w); if g_p(ij,w) ≥ D1_p and SF_p(ij,w) ≥ D2_p, add topic w to the candidate topic set of T_ij and execute S431; otherwise execute S431 directly; wherein topic w is the w-th of the m topics corresponding to STC_ij, g_p(ij,w) is the number of keywords shared by T_ij and topic w at the p-th cleaning process, and SF_p(ij,w) is the similarity between the feature vector of T_ij and the feature vector of topic w; w takes values from 1 to m.
Further, S430 is replaced with:
s431, w=w+1 is set, if w is less than or equal to m, S423 is executed, otherwise S432 is executed.
S432, if the candidate topic set of T_ij is Null, create a new topic for T_ij, add T_ij to it and delete it from its original topic, set k = k+1, and execute S440; if the candidate topic set of T_ij contains one similarity, add T_ij to the corresponding topic and delete it from its original topic, and execute S440; if the candidate topic set of T_ij contains multiple similarities, add T_ij to the topic with the maximum similarity in the candidate set and delete it from its original topic, and execute S440.
Further, in the embodiment of the present invention, S700 may specifically include:
S710, obtain g(u,v), S1_uv and S2_uv; if g(u,v) ≥ D4, S1_uv ≥ D5 and S2_uv ≥ D6, merge topic u into topic v and execute S730; otherwise execute S720; wherein topic v is the v-th topic in the current merged topic list, g(u,v) is the number of keywords shared by topic u and topic v, S1_uv is the similarity between the topic feature vectors of topics u and v, and S2_uv is the similarity between the topic description feature vectors of topics u and v; v takes values from 1 to n, n being the number of topics in the current merged topic list; D4 is a fourth set threshold, D5 a fifth set threshold and D6 a sixth set threshold; the merged topic list is initially Null.
In the embodiment of the present invention, D4 to D6 may be set to be the same as the first to third set thresholds, respectively.
S720, setting v=v+1, if v is less than or equal to n, executing S710, otherwise, adding the topic u as a new topic into the current combined topic list; s730 is performed.
S730, setting u=u+1, if u is equal to or less than H, executing S710, otherwise executing S740.
S740, acquiring the number of texts in any topic in the current topic list, and deleting the topic from the current topic list if the number of texts in the topic is less than a set number threshold; a target topic list is obtained and S800 is performed.
In the embodiment of the present invention, the set number threshold may be 3. The topic description of the topic obtained by merging topics u and v is the topic description of topic v, and its feature vector is the average of the feature vectors of topics u and v.
Those skilled in the art will appreciate that when u = 1, since the current merged topic list is Null (empty), topic 1 is added to the merged topic list directly.
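The merging stage (S710–S740) can be sketched as below, again with the three scores supplied as caller-provided functions (hypothetical interfaces) and topics represented as plain lists of texts; the input list is assumed to already be sorted by decreasing text count as produced in S600:

```python
def merge_topics(topic_list, same_kw, sim_vec, sim_desc,
                 d4, d5, d6, min_texts=3):
    """Simplified merging: walk the topics, merge each one into the first
    compatible topic of the merged list (all three thresholds hold),
    otherwise append it as a new entry; finally drop topics with fewer
    than min_texts texts (S740)."""
    merged = []                      # the merged topic list, initially empty
    for topic in topic_list:
        for m in merged:
            if (same_kw(topic, m) >= d4
                    and sim_vec(topic, m) >= d5
                    and sim_desc(topic, m) >= d6):
                m.extend(topic)      # merge topic u into topic v
                break
        else:
            merged.append(list(topic))
    return [m for m in merged if len(m) >= min_texts]
```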
Further, S800 also includes: obtaining the keywords, feature vector, topic description and topic description feature vector of each topic in the target topic list, so as to update them for each topic.
According to the text clustering cleaning and merging method based on topic description, texts are first clustered into a number of clusters, each of which corresponds to one topic. The clusters are then cleaned and optimized using three indicators: the similarity between the topic vector and the text vectors within the topic, the similarity between the topic description and the topic description generated for each text, and the number of keywords shared by a text and its topic. The output is the final set of clusters together with a description of each topic, so that the clustering results are more accurate.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium, which may be disposed in an electronic device to store at least one instruction or at least one program; the at least one instruction or program is loaded and executed by a processor to implement the methods provided by the embodiments described above.
Embodiments of the present invention also provide an electronic device comprising a processor and the aforementioned non-transitory computer-readable storage medium.
Embodiments of the invention also provide a computer program product comprising program code which, when the program product is run on an electronic device, causes the electronic device to carry out the steps of the methods according to the various exemplary embodiments of the invention described in this specification.
While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the present disclosure is defined by the appended claims.

Claims (10)

1. A text clustering cleaning and merging method based on topic description is characterized by comprising the following steps:
S100, obtaining the feature vector and keywords of each text in the texts to be clustered, wherein each text comprises h keywords;
S200, clustering the texts to be clustered by using a set clustering algorithm based on the obtained feature vectors, to obtain a plurality of topics;
S300, acquiring any text in any topic among the current topics, and the keywords, feature vector and topic description feature vector of any topic;
S400, carrying out the p-th cleaning treatment on the current topics based on the number of keywords shared between a text and a topic, the similarity between the feature vectors of the text and the topic, and the similarity between the topic description feature vectors of the text and the topic, to obtain n(p) topics after treatment; wherein any topic a of the n(p) topics satisfies the following condition: g_p(a,q) ≥ D1_p, SF^p_aq ≥ D2_p, and ST^p_aq ≥ D3_p; wherein g_p(a,q) is the number of keywords shared by topic a and the q-th text T_aq in topic a, SF^p_aq is the similarity between the feature vector of topic a and the feature vector of T_aq, and ST^p_aq is the similarity between the topic description feature vector of topic a and the topic description feature vector of T_aq; D1_p is the first set threshold corresponding to the p-th cleaning treatment, D2_p is the second set threshold corresponding to the p-th cleaning treatment, and D3_p is the third set threshold corresponding to the p-th cleaning treatment; a takes values from 1 to n(p), p takes values from 1 to C0, and C0 is a preset number of times; q takes values from 1 to f(a), where f(a) is the number of texts in topic a;
S500, setting p=p+1; if p ≤ C0, executing S300; otherwise, obtaining H topics after the cleaning treatment, and executing S600;
S600, obtaining a topic list S by sorting the H topics in decreasing order of text quantity, and acquiring the keywords, feature vector, topic description and topic description feature vector corresponding to any topic u in S, wherein u takes values from 1 to H;
S700, merging S based on the number of keywords shared between topics, the similarity between the feature vectors of the topics and the similarity between the topic description feature vectors of the topics, to obtain a target topic list;
S800, outputting the topic description and corresponding texts of each topic in the target topic list, wherein the texts corresponding to each topic are sorted by text release time.
2. The method according to claim 1, wherein S400 specifically comprises:
S410, for the j-th text T_ij in topic i among the current topics corresponding to the p-th cleaning treatment, obtaining g_p(i,j), SF^p_ij and ST^p_ij respectively; if g_p(i,j) ≥ D1_p, SF^p_ij ≥ D2_p and ST^p_ij ≥ D3_p, keeping T_ij in topic i and executing S440; otherwise, executing S420; wherein g_p(i,j) is the number of keywords shared by T_ij and topic i at the p-th cleaning treatment, SF^p_ij is the similarity between the feature vector of T_ij and the topic feature vector of topic i at the p-th cleaning treatment, and ST^p_ij is the similarity between the topic description feature vector of T_ij and the topic description feature vector of topic i at the p-th cleaning treatment; i takes values from 1 to k, where k is the number of current topics; j takes values from 1 to f(i), where f(i) is the number of texts in topic i;
S420, obtaining g_p(ij,s), SF^ps_ij and ST^ps_ij; if g_p(ij,s) ≥ D1_p, SF^ps_ij ≥ D2_p and ST^ps_ij ≥ D3_p, adding T_ij to topic s, deleting it from its original topic, and executing S440; otherwise, executing S430; wherein topic s is the s-th topic of the k-1 current topics other than topic i, g_p(ij,s) is the number of keywords shared by T_ij and topic s at the p-th cleaning treatment, SF^ps_ij is the similarity between the feature vector of T_ij and the topic feature vector of topic s at the p-th cleaning treatment, and ST^ps_ij is the similarity between the topic description feature vector of T_ij and the topic description feature vector of topic s at the p-th cleaning treatment; s takes values from 1 to k-1;
S430, setting s=s+1; if s ≤ k-1, executing S420; otherwise, creating a new topic for T_ij, adding T_ij to the new topic and deleting it from its original topic, setting k=k+1, and executing S440;
S440, setting j=j+1; if j ≤ f(i), executing S410; otherwise, setting i=i+1, and if i ≤ k, executing S410, otherwise executing S500.
3. The method according to claim 1, wherein S700 specifically comprises:
S710, obtaining g(u,v), S1_uv and S2_uv; if g(u,v) ≥ D4, S1_uv ≥ D5 and S2_uv ≥ D6, merging topic u and topic v, and executing S730; otherwise, executing S720; wherein topic v is the v-th topic in the current merged topic list, g(u,v) is the number of keywords shared by topic u and topic v, S1_uv is the similarity between the topic feature vector of topic u and the topic feature vector of topic v, and S2_uv is the similarity between the topic description feature vector of topic u and the topic description feature vector of topic v; v takes values from 1 to n, where n is the number of topics in the current merged topic list; D4 is a fourth set threshold, D5 is a fifth set threshold, and D6 is a sixth set threshold; the merged topic list is initially Null;
S720, setting v=v+1; if v ≤ n, executing S710; otherwise, adding topic u to the current merged topic list as a new topic, setting n=n+1, and executing S730;
S730, setting u=u+1; if u ≤ H, executing S710; otherwise, executing S740;
S740, acquiring the number of texts in each topic in the current topic list, and deleting a topic from the current topic list if its number of texts is less than a set number threshold; obtaining the target topic list and executing S800.
4. The method of claim 2, wherein S420 is replaced with:
S421, obtaining the topic description similarity set ST_ij = {ST^1_ij, ST^2_ij, …, ST^s_ij, …, ST^(k-1)_ij}, wherein ST^s_ij is the similarity between the topic description feature vector of T_ij and the topic description feature vector corresponding to the s-th of the k-1 current topics other than topic i; s takes values from 1 to k-1;
S422, sorting ST_ij in descending order to obtain a sorted similarity set, and taking the first m similarities in the sorted set to form the comparison similarity set STC_ij;
S423, obtaining g_p(ij,w) and SF^pw_ij; if g_p(ij,w) ≥ D1_p and SF^pw_ij ≥ D2_p, adding topic w to the candidate topic set corresponding to T_ij and executing S431; otherwise, executing S431 directly; wherein topic w is the w-th topic of the m topics corresponding to STC_ij, g_p(ij,w) is the number of keywords shared by T_ij and topic w at the p-th cleaning treatment, and SF^pw_ij is the similarity between the feature vector corresponding to T_ij and the feature vector corresponding to topic w at the p-th cleaning treatment; w takes values from 1 to m;
and S430 is replaced with:
S431, setting w=w+1; if w ≤ m, executing S423; otherwise, executing S432;
S432, if the candidate topic set corresponding to T_ij is Null, creating a new topic for T_ij, adding T_ij to the new topic and deleting it from its original topic, setting k=k+1, and executing S440; if the candidate topic set corresponding to T_ij contains one similarity, adding T_ij to the topic corresponding to that similarity, deleting it from its original topic, and executing S440; if the candidate topic set corresponding to T_ij contains a plurality of similarities, adding T_ij to the topic corresponding to the maximum similarity in the candidate topic set, deleting it from its original topic, and executing S440.
5. The method of claim 1, wherein the topic description feature vector of any text is obtained by:
S301, inputting the title of the text into a set topic description generation model to obtain the corresponding topic description;
S302, inputting the topic description of the text into a set topic description feature generation model to obtain the corresponding topic description feature vector.
6. The method of claim 5, wherein the topic description feature vector for any topic is obtained by:
S303, based on the topic descriptions corresponding to all texts in the topic, taking the most frequently generated of all the topic descriptions as the topic description of the topic;
S304, inputting the topic description of the topic into a set topic description feature generation model to obtain the topic description feature vector of the topic.
7. The method of claim 1, wherein the keywords of any one topic are obtained by:
S305, merging identical keywords among the keywords of all texts in the topic and recalculating their weights to obtain the merged keywords;
S306, sorting the merged keywords in descending order of weight, and taking the first h sorted keywords as the keywords of the topic.
8. The method of claim 1, wherein the feature vector for any topic is an average of feature vectors for all text in the topic.
9. The method of claim 1, wherein the keywords for each text are obtained by a TextRank algorithm.
10. The method of claim 1, wherein the feature vector for each text is obtained by a SimBERT model.
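The merge pass of claim 3 above can be sketched as follows. This is a hypothetical illustration under assumed details: the `cosine` similarity measure, the helper name `merge_topics`, and the toy topics and thresholds D4–D6 are introduced for demonstration only, and the patented method's threshold values and vector models are not reproduced here.

```python
# Sketch of the claim-3 merge pass (assumed details): walk topics in
# decreasing order of text count; fold each topic into the first merged
# topic that shares >= D4 keywords and passes both vector-similarity
# thresholds D5 and D6; otherwise append it as a new merged topic.
import math

def cosine(a, b):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def merge_topics(topics, d4, d5, d6):
    """topics: list already sorted by decreasing text count; each topic is
    a dict with 'keywords', 'vec', 'desc_vec', and 'texts'."""
    merged = []  # merged topic list, initially empty (Null in the claim)
    for u in topics:
        for v in merged:
            same_kw = len(set(u["keywords"]) & set(v["keywords"]))
            if (same_kw >= d4
                    and cosine(u["vec"], v["vec"]) >= d5
                    and cosine(u["desc_vec"], v["desc_vec"]) >= d6):
                v["texts"].extend(u["texts"])  # fold topic u into topic v
                break
        else:
            merged.append(u)  # no match: topic u becomes a new merged topic
    return merged

# Toy topics: two near-duplicate earthquake topics and one sports topic.
topics = [
    {"keywords": {"quake", "relief"}, "vec": [1.0, 0.0],
     "desc_vec": [1.0, 0.0], "texts": ["t1", "t2"]},
    {"keywords": {"quake", "relief", "aid"}, "vec": [0.95, 0.05],
     "desc_vec": [0.9, 0.1], "texts": ["t3"]},
    {"keywords": {"league", "final"}, "vec": [0.0, 1.0],
     "desc_vec": [0.0, 1.0], "texts": ["t4"]},
]
result = merge_topics(topics, d4=2, d5=0.9, d6=0.9)
print(len(result))  # 2: the earthquake topics merge, the sports topic stays
```

The claim's final step S740 (dropping merged topics whose text count falls below a set threshold) would be a simple filter over `result` and is omitted here.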
CN202310347961.9A 2023-04-03 2023-04-03 Text clustering cleaning and merging method based on topic description Active CN116361470B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310347961.9A CN116361470B (en) 2023-04-03 2023-04-03 Text clustering cleaning and merging method based on topic description


Publications (2)

Publication Number Publication Date
CN116361470A true CN116361470A (en) 2023-06-30
CN116361470B CN116361470B (en) 2024-05-14

Family

ID=86937686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310347961.9A Active CN116361470B (en) 2023-04-03 2023-04-03 Text clustering cleaning and merging method based on topic description

Country Status (1)

Country Link
CN (1) CN116361470B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3118751A1 (en) * 2015-07-13 2017-01-18 Pixalione Method of extracting keywords, device and corresponding computer program
KR101828995B1 (en) * 2017-05-08 2018-02-14 한국과학기술정보연구원 Method and Apparatus for clustering keywords
CN109710728A (en) * 2018-11-26 2019-05-03 西南电子技术研究所(中国电子科技集团公司第十研究所) News topic automatic discovering method
CN110134787A (en) * 2019-05-15 2019-08-16 北京信息科技大学 A kind of news topic detection method
CN111460153A (en) * 2020-03-27 2020-07-28 深圳价值在线信息科技股份有限公司 Hot topic extraction method and device, terminal device and storage medium
CN113157857A (en) * 2021-03-13 2021-07-23 中国科学院新疆理化技术研究所 Hot topic detection method, device and equipment for news
CN113722483A (en) * 2021-08-31 2021-11-30 平安银行股份有限公司 Topic classification method, device, equipment and storage medium
CN114492429A (en) * 2022-01-12 2022-05-13 平安科技(深圳)有限公司 Text theme generation method, device and equipment and storage medium
CN114579739A (en) * 2022-01-12 2022-06-03 中国电子科技集团公司第十研究所 Topic detection and tracking method for text data stream
CN115269846A (en) * 2022-08-02 2022-11-01 网易(杭州)网络有限公司 Text processing method and device, electronic equipment and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
卢天旭: "涉案新闻的话题发现与话题摘要方法研究", 中国优秀硕士学位论文全文数据库 信息科技辑, 15 March 2023 (2023-03-15) *
谭真: "基于 MapReduce 的热点话题发现及演化分析方法研究", 中国优秀硕士学位论文全文数据库 信息科技辑, 15 March 2017 (2017-03-15) *
陈龙: "新闻热点话题发现及演化分析研究与应用", 中国优秀硕士学位论文全文数据库 信息科技辑, 15 July 2017 (2017-07-15) *

Also Published As

Publication number Publication date
CN116361470B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN109299263B (en) Text classification method and electronic equipment
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
US10135723B2 (en) System and method for supervised network clustering
CN111694941B (en) Reply information determining method and device, storage medium and electronic equipment
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN111966810A (en) Question-answer pair ordering method for question-answer system
Bounabi et al. A comparison of text classification methods using different stemming techniques
Korobkin et al. A multi-stage algorithm for text documents filtering based on physical knowledge
JP2008204374A (en) Cluster generating device and program
CN116361470B (en) Text clustering cleaning and merging method based on topic description
CN116910599A (en) Data clustering method, system, electronic equipment and storage medium
CN115345158A (en) New word discovery method, device, equipment and storage medium based on unsupervised learning
CN116049414B (en) Topic description-based text clustering method, electronic equipment and storage medium
WO2014118976A1 (en) Learning method, information conversion device, and learning program
JP2006285419A (en) Information processor, processing method and program
JP2004326465A (en) Learning device for document classification, and document classification method and document classification device using it
Butler-Yeoman et al. Particle Swarm Optimisation for Feature Selection: A Size-Controlled Approach.
CN116361468B (en) Event context generation method, electronic equipment and storage medium
Sari et al. Combining the active learning algorithm based on the silhouette coefficient with pckmeans algorithm
Mittal et al. A review of some Bayesian Belief Network structure learning algorithms
Iwata et al. Unsupervised Object Matching for Relational Data
CN116361469A (en) Topic generation method based on pre-training model
JP2001265788A (en) Method and device for sorting document and recording medium storing document sorting program
US20210264264A1 (en) Learning device, learning method, learning program, evaluation device, evaluation method, and evaluation program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant