CN116361470A - Text clustering cleaning and merging method based on topic description - Google Patents

Info

Publication number: CN116361470A
Application number: CN202310347961.9A
Authority: CN (China)
Prior art keywords: topic, text, similarity, keywords, description
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN116361470B
Inventors: 王磊, 郭鸿飞, 王俊艳, 徐才, 王柯淇, 蔡昌艳, 蒋永余, 王璋盛, 曹家, 罗引
Current and original assignees: Xinhua Fusion Media Technology Development Beijing Co ltd; Beijing Zhongke Wenge Technology Co ltd
Application filed by Xinhua Fusion Media Technology Development Beijing Co ltd and Beijing Zhongke Wenge Technology Co ltd
Priority to CN202310347961.9A, granted as CN116361470B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/358: Browsing; Visualisation therefor
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a text clustering cleaning and merging method based on topic description. Texts are first clustered into a number of clusters, each of which corresponds to one topic. The clusters are then cleaned and merged using three indicators: the similarity between the topic vector and the text vectors within the topic, the similarity between the topic description and the topic description generated for each text, and the number of keywords shared by a text and its topic. The output is the final set of clusters together with a description of each topic, yielding more accurate clustering results.

Description

Text clustering cleaning and merging method based on topic description
Technical Field
The invention relates to the field of natural language processing, in particular to a text clustering cleaning and merging method based on topic description.
Background
Clustering texts by topic has very important applications in the field of text processing. However, text covers a very wide range of subjects, and the number of texts produced every day is very large, which poses challenges for the practical application of text clustering. Existing approaches mainly cluster texts with algorithms such as k-means, but the clustered results are not further optimized. k-means-based algorithms generally require the number of clusters to be set in advance: k is often chosen arbitrarily, or determined with methods such as the silhouette coefficient or the elbow method, but these methods cannot guarantee an accurate cluster count, and texts may still be assigned to the wrong cluster. Single-pass methods do not require the number of clusters to be set, but texts can still be clustered incorrectly, and existing methods do not clean and merge the clustered results.
Disclosure of Invention
Aiming at the technical problems, the invention adopts the following technical scheme:
the embodiment of the invention provides a text clustering cleaning and merging method based on topic description, which comprises the following steps:
s100, obtaining feature vectors and keywords of each text in the texts to be clustered, wherein each text comprises h keywords;
s200, clustering texts to be clustered by using a set clustering algorithm based on the obtained feature vectors to obtain a plurality of topics;
s300, acquiring any text in any topic in the current topic, and keywords, feature vectors and topic description feature vectors of any topic;
S400, performing the p-th cleaning process on the current topics based on the number of keywords shared between a text and a topic, the similarity between the feature vectors of the text and the topic, and the similarity between the topic description feature vectors of the text and the topic, to obtain n(p) processed topics; wherein any topic a of the n(p) topics satisfies: g_p(a,q) ≥ D1_p, SF_p(a,q) ≥ D2_p and ST_p(a,q) ≥ D3_p; wherein g_p(a,q) is the number of keywords shared by topic a and the q-th text T_aq in topic a, SF_p(a,q) is the similarity between the feature vector of topic a and the feature vector of T_aq, and ST_p(a,q) is the similarity between the topic description feature vector of topic a and the topic description feature vector of T_aq; D1_p, D2_p and D3_p are the first, second and third set thresholds corresponding to the p-th cleaning process; a takes values from 1 to n(p), p takes values from 1 to C0, and C0 is a preset number of passes; q takes values from 1 to f(a), f(a) being the number of texts in topic a;
s500, setting p=p+1, if p is less than or equal to C0, executing S300; otherwise, H topics after the cleaning treatment are obtained, and S600 is executed;
S600, based on the H topics, obtaining a topic list S sorted in order of decreasing text count, and obtaining the keywords, feature vector, topic description and topic description feature vector of any topic u in S, u taking values from 1 to H;
S700, merging S based on the number of keywords shared between topics, the similarity between topic feature vectors and the similarity between topic description feature vectors, to obtain a target topic list;
S800, outputting the topic description and corresponding texts of every topic in the target topic list, wherein the texts of each topic are sorted by release time.
The invention has at least the following beneficial effects:
according to the text clustering cleaning and merging method based on topic description, firstly, texts are clustered to obtain a plurality of clustering results, each clustering result is equivalent to one topic, and then, based on three indexes of the text similarity of a topic vector and the text vector in the topic, the text similarity of the topic description and the topic description generated by each text, and the number of keywords of the text and the topic, the clustering results are cleaned and merged, and finally, the clustering results and the description of each topic are obtained, so that the clustering results are more accurate.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person skilled in the art may derive other drawings from them without inventive effort.
Fig. 1 is a flowchart of a text cluster cleaning and merging method based on topic description according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The embodiments described are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
Fig. 1 is a flowchart of a text cluster cleaning and merging method based on topic description according to an embodiment of the present invention.
The embodiment of the invention provides a text clustering cleaning and merging method based on topic description, as shown in fig. 1, the method can comprise the following steps:
s100, obtaining feature vectors and keywords of each text in the texts to be clustered, wherein each text comprises h keywords.
In the embodiment of the invention, the text can be news text or other types of text. Text may be crawled from various websites or channels.
In the embodiment of the invention, the keywords of each text can be obtained by an existing keyword extraction method. In an exemplary embodiment, the keywords of each text and the weight of each keyword are obtained through the TextRank algorithm. The number of keywords h may be set based on actual needs; in one example, h = 7.
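The TextRank step above can be sketched as follows. This is a minimal, self-contained illustration over an already-tokenized word list; the window size, damping factor, iteration count and example tokens are illustrative assumptions, not values from the patent (a production system for Chinese text would typically use a library implementation such as jieba's TextRank):

```python
from collections import defaultdict

def textrank_keywords(words, window=2, h=7, d=0.85, iters=50):
    """Minimal TextRank: build a co-occurrence graph over a token list,
    run PageRank-style iterations, return the top-h (word, weight) pairs."""
    # Undirected co-occurrence graph within a sliding window.
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    # PageRank-style score propagation.
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[v] / len(graph[v])
                                      for v in graph[w])
                 for w in graph}
    ranked = sorted(score.items(), key=lambda kv: -kv[1])
    return ranked[:h]
```

Each text's title or body would be tokenized first; the returned (keyword, weight) pairs feed the later keyword-merging step.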
In the embodiment of the invention, the feature vector of each text can be obtained through the existing feature extraction method. In one exemplary embodiment, the feature vector of each text may be obtained through a SimBERT model, and in particular, the title of the text may be input into the SimBERT model to obtain the feature vector of each text.
Since news crawled from various websites or channels may be duplicated, in the embodiment of the present invention the texts to be clustered are the texts obtained after de-duplication. The texts can be de-duplicated by either of the following two methods.
Method one: de-duplication based on text similarity and rules. Specifically, for each text, the similarity between the text and every other crawled text is computed in turn; if the similarity exceeds a set similarity threshold and the proportion of identical words in the titles of the two texts exceeds a set proportion threshold, the texts are considered duplicates, and the text with the latest release time is kept in the library of texts to be clustered. In the embodiment of the present invention, the similarity between texts a and b can be obtained by the following formula:
sim(a, b) = (T_a · T_b) / (||T_a|| ||T_b||)

wherein T_a and T_b are the feature vectors of texts a and b respectively, and ||T_a|| and ||T_b|| are the norms of T_a and T_b.
In the embodiment of the invention, the threshold value of the similarity may be set higher, for example, 0.95, and the ratio threshold value of the same word in the header may be set to 0.5 or 0.6.
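Method one can be sketched as below. The cosine formula matches the one above; the dictionary keys `vec`, `title` and `time`, the whitespace title tokenization, and the overlap denominator (shorter title) are illustrative assumptions, while the 0.95 similarity and 0.5 overlap thresholds come from the text:

```python
import math

def cosine(a, b):
    """Cosine similarity T_a·T_b / (||T_a|| ||T_b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def title_overlap(t1, t2):
    """Fraction of shared title words, relative to the shorter title."""
    w1, w2 = set(t1.split()), set(t2.split())
    return len(w1 & w2) / min(len(w1), len(w2))

def dedup(texts, sim_thr=0.95, overlap_thr=0.5):
    """texts: list of dicts with 'vec', 'title', 'time' (hypothetical keys).
    Among duplicates, keep the one with the latest release time."""
    kept = []
    for t in texts:
        dup = None
        for i, k in enumerate(kept):
            if (cosine(t["vec"], k["vec"]) >= sim_thr
                    and title_overlap(t["title"], k["title"]) >= overlap_thr):
                dup = i
                break
        if dup is None:
            kept.append(t)
        elif t["time"] > kept[dup]["time"]:
            kept[dup] = t     # later release time replaces the duplicate
    return kept
```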
Method two: news de-duplication based on DBSCAN clustering. DBSCAN is an unsupervised machine learning clustering algorithm. It does not require the number of clusters to be specified, is robust to outliers, and works well on data clusters of arbitrary shape and size. It requires two parameters:
1) Epsilon: the maximum radius of the neighborhood. Two data points belong to the same neighborhood if the distance between them is at most Epsilon; it is the distance metric DBSCAN uses to decide whether two points are similar enough to belong to the same cluster. The invention can set a small Epsilon so that only very similar texts (which lie close together) are grouped; within each resulting cluster, the text with the latest release time is kept and the others are discarded. In one exemplary embodiment, Epsilon may be set to 0.06.
2) Minimum points (minPts): a point whose neighborhood of radius Epsilon contains at least minPts points is treated as part of a cluster. The invention can set minPts to 1 or 2 so as to retain more noise points and outliers, pruning only very similar texts.
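For illustration, here is a tiny self-contained DBSCAN; in practice one would use a library implementation such as scikit-learn's `DBSCAN`. The euclidean distance and the toy points are assumptions (the patent does not state the metric; over normalized text vectors a cosine-based distance would behave similarly). With minPts = 1 every point is a core point, so clusters are simply Epsilon-connected components:

```python
import math

def dbscan(points, eps=0.06, min_pts=1):
    """Tiny DBSCAN (euclidean distance). Returns one cluster label per
    point; label -1 marks noise."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    n = len(points)
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1            # noise point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = [t for t in range(n) if dist(points[j], points[t]) <= eps]
            if len(nbrs) >= min_pts:  # core point: keep expanding
                queue.extend(t for t in nbrs if labels[t] is None)
        cluster += 1
    return labels
```

Within each resulting cluster, only the text with the latest release time would be kept, as described above.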
Those skilled in the art will appreciate that one or both of the above methods can be selected according to actual needs to de-duplicate the texts and obtain the texts to be clustered.
S200, clustering the texts to be clustered by using a set clustering algorithm based on the obtained feature vectors to obtain a plurality of topics.
In the embodiment of the invention, the text to be clustered can be clustered by using the existing clustering algorithm. In one exemplary embodiment, the set clustering algorithm may be a DBSCAN algorithm.
S300, acquiring any text in any topic among the current topics, and the keywords, feature vector, topic description and topic description feature vector of any topic.
In the embodiment of the invention, the keywords and feature vector of any text were already obtained in S100 and can be reused directly. The topic description of any text is obtained by the following steps:
s301, inputting the title of any text into a set topic description generation model to obtain a corresponding topic description. In the embodiment of the invention, the set topic description generation model can be a pre-training generation model such as T5 or BART. The training step of the topic description generation model comprises the following steps:
(1) Constructing a dataset
For all clustered topics, a number of topics are manually selected for labeling the data set. For each selected topic, several texts are randomly chosen and a topic description is written manually for each; each news item corresponds to one topic description, and news items describing the same topic should be given topic descriptions that are as similar as possible. The title of each news item is used as the input of the generation model, and the manually written topic description as the ground truth, to construct the training data set.
In the embodiment of the invention, topic description can be regarded as topic or subject information corresponding to the current text.
(2) Model training
The title of each text in the training data set is fed into the pre-trained generation model to obtain a prediction; the loss between the prediction and the manually labeled ground truth is computed to train the model parameters, yielding the trained topic description generation model.
After the trained topic description generation model is obtained, inputting the title of any text in any topic into the trained topic description generation model to obtain the corresponding topic description.
S302, inputting topic description of any text into a set topic description feature generation model to obtain a corresponding topic description feature vector.
In the embodiment of the invention, the set topic description feature generation model may be a SimBERT model. And inputting topic description of any text into the SimBERT model to obtain a corresponding topic description feature vector.
In the embodiment of the invention, topic description feature vectors of any topic are obtained through the following steps:
S303, based on the topic descriptions corresponding to all texts in any topic, taking the most frequently generated topic description as the topic description of the topic.
Since the topic descriptions of different texts may be identical, the topic descriptions of all texts in a topic are merged, the generation frequency of each distinct description is counted, and the description with the highest frequency is taken as the topic's description. For example, if a certain topic description is generated by 3 texts, its generation frequency is 3.
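S303 is a simple majority vote, which can be sketched with `collections.Counter` (the example descriptions are invented for illustration):

```python
from collections import Counter

def topic_description(text_descriptions):
    """Pick the most frequently generated description among the texts of
    a topic as the topic's description (step S303)."""
    return Counter(text_descriptions).most_common(1)[0][0]
```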
S304, inputting topic description of the topic into a set topic description feature generation model to obtain a corresponding topic description feature vector.
Specifically, the topic description of the topic is input to a SimBERT model, and a corresponding topic description feature vector is obtained.
Further, in the embodiment of the present invention, the keyword of any topic is obtained by the following steps:
s305, combining the same keywords in the keywords of all texts in any topic, and recalculating the weight to obtain the combined keywords.
S306, sorting the combined keywords according to the order of the weights from large to small, and acquiring the first h keywords in the sorted keywords as keywords of any topic.
Specifically, if a keyword appears in only one text, its weight is its weight in that text. If a keyword appears in multiple texts, its weight is the sum of its weights in those texts; for example, if keyword a appears in 3 texts, its weight is b1 + b2 + b3, where b1 to b3 are the weights of keyword a in the 3 texts respectively.
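Steps S305 and S306 can be sketched as follows; the per-text input format (lists of (keyword, weight) pairs) is an assumed interface:

```python
from collections import defaultdict

def topic_keywords(per_text_keywords, h=7):
    """Merge the (keyword, weight) lists of all texts in a topic:
    identical keywords are merged and their weights summed (S305),
    then the top-h keywords by total weight are kept (S306)."""
    weights = defaultdict(float)
    for kw_list in per_text_keywords:
        for kw, w in kw_list:
            weights[kw] += w
    ranked = sorted(weights.items(), key=lambda kv: -kv[1])
    return ranked[:h]
```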
Further, in the embodiment of the present invention, the feature vector of any topic is the average of the feature vectors of all texts in the topic, i.e. the feature vector of topic i is

F_i = (1/f(i)) · Σ_{j=1}^{f(i)} h_ij

wherein h_ij is the feature vector of the j-th text T_ij in topic i, and f(i) is the number of texts in topic i.
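The topic feature vector is thus an element-wise mean, which a short function makes concrete (plain lists of floats stand in for the SimBERT vectors):

```python
def topic_vector(text_vectors):
    """Topic feature vector = element-wise mean of the feature vectors
    of all texts in the topic."""
    n = len(text_vectors)
    dim = len(text_vectors[0])
    return [sum(v[d] for v in text_vectors) / n for d in range(dim)]
```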
S400, performing the p-th cleaning process on the current topics based on the number of keywords shared between a text and a topic, the similarity between the feature vectors of the text and the topic, and the similarity between the topic description feature vectors of the text and the topic, to obtain n(p) processed topics; wherein any topic a of the n(p) topics satisfies: g_p(a,q) ≥ D1_p, SF_p(a,q) ≥ D2_p and ST_p(a,q) ≥ D3_p; wherein g_p(a,q) is the number of keywords shared by topic a and the q-th text T_aq in topic a, SF_p(a,q) is the similarity between the feature vector of topic a and the feature vector of T_aq, and ST_p(a,q) is the similarity between the topic description feature vector of topic a and the topic description feature vector of T_aq; D1_p, D2_p and D3_p are the first, second and third set thresholds corresponding to the p-th cleaning process; a takes values from 1 to n(p), p takes values from 1 to C0, and C0 is a preset number of passes; q takes values from 1 to f(a), f(a) being the number of texts in topic a.
In the embodiment of the invention, the similarity between the feature vectors and the similarity between the topic description feature vectors can be obtained through the existing similarity algorithm, such as cosine similarity and the like.
In the embodiment of the invention, the first set threshold value to the third set threshold value corresponding to each cleaning process may be the same or different, and may be set based on actual needs. The first to third set thresholds may be set based on actual needs, in an exemplary embodiment, the first set threshold may be selected from 2 and 3, the second set threshold may be selected from 0.65, 0.7 and 0.8, and the third set threshold may be selected from 0.7, 0.8 and 0.85, and a desired combination value may be selected according to actual needs.
In the embodiment of the invention, C0 may be set based on actual needs; preferably C0 ≤ 3, and more preferably C0 = 2.
S500, setting p = p+1; if p ≤ C0, execute S300; otherwise, the H topics after the cleaning process are obtained and S600 is executed. Obviously, H = n(C0).
S600, based on the H topics, obtaining a topic list S sorted in order of decreasing text count, and obtaining the keywords, feature vector, topic description and topic description feature vector of any topic u in S, u taking values from 1 to H. That is, in S, a topic earlier in the list contains more texts than a later one.
The keyword, the feature vector, the topic description, and the topic description feature vector corresponding to any topic u can be obtained with reference to S300.
And S700, combining the S based on the same number of keywords among topics, the similarity among feature vectors of the topics and the similarity among topic description feature vectors of the topics to obtain a target topic list.
S800, outputting topic descriptions and corresponding texts of all topics in the target topic list.
In S800, the output text is text that is ordered by the release time, for example, ordered by the order of the release time from early to late. Topic descriptions for each topic in the target topic list can be obtained based on S303.
Further, in an embodiment of the present invention, S400 may specifically include:
S410, for the j-th text T_ij in topic i among the current topics of the p-th cleaning process, obtain g_p(i,j), SF_p(i,j) and ST_p(i,j) respectively; if g_p(i,j) ≥ D1_p, SF_p(i,j) ≥ D2_p and ST_p(i,j) ≥ D3_p, keep T_ij in topic i and execute S440; otherwise, execute S420; wherein g_p(i,j) is the number of keywords shared by T_ij and topic i at the p-th cleaning process, SF_p(i,j) is the similarity between the feature vector of T_ij and the topic feature vector of topic i, and ST_p(i,j) is the similarity between the topic description feature vector of T_ij and that of topic i; i takes values from 1 to k, k being the number of current topics; j takes values from 1 to f(i), f(i) being the number of texts in topic i.
S420, obtain g_p(ij,s), SF_p(ij,s) and ST_p(ij,s); if g_p(ij,s) ≥ D1_p, SF_p(ij,s) ≥ D2_p and ST_p(ij,s) ≥ D3_p, add T_ij to topic s, delete it from its original topic, and execute S440; otherwise, execute S430; wherein topic s is the s-th of the k-1 current topics other than topic i, g_p(ij,s) is the number of keywords shared by T_ij and topic s at the p-th cleaning process, SF_p(ij,s) is the similarity between the feature vector of T_ij and the topic feature vector of topic s, and ST_p(ij,s) is the similarity between the topic description feature vector of T_ij and that of topic s; s takes values from 1 to k-1.
S430, set s = s+1; if s ≤ k-1, execute S420; otherwise, create a new topic for T_ij, add T_ij to this new topic and delete it from its original topic, set k = k+1 (i.e., when a topic is newly added, the number of current topics increases by 1), and execute S440.
S440, setting j=j+1, if j is less than or equal to f (i), executing S410; otherwise, i=i+1 is set, and if i is less than or equal to k, S410 is executed, otherwise S500 is executed.
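One cleaning pass (S410–S440) can be sketched as below. This is a simplification under stated assumptions: topics are plain lists of texts, and the three scores g_p, SF_p and ST_p are supplied as caller-provided functions `same_kw`, `sim_vec` and `sim_desc` (hypothetical interfaces), so that the real-time recomputation of topic keywords and vectors is hidden inside those callables; tie-breaking and visiting order may differ from the patent's exact procedure:

```python
def clean_pass(topics, same_kw, sim_vec, sim_desc, d1, d2, d3):
    """One simplified cleaning pass: a text stays in its topic if all three
    thresholds hold; otherwise it moves to the first other topic that
    satisfies them, or starts a new topic of its own."""
    def fits(text, topic):
        return (same_kw(text, topic) >= d1
                and sim_vec(text, topic) >= d2
                and sim_desc(text, topic) >= d3)

    result = [list(t) for t in topics]
    i = 0
    while i < len(result):            # newly created topics are visited too
        for text in list(result[i]):
            if fits(text, result[i]):
                continue              # text stays in its current topic
            result[i].remove(text)
            for j, other in enumerate(result):
                if j != i and other and fits(text, other):
                    other.append(text)
                    break
            else:
                result.append([text])  # no topic fits: create a new one
        i += 1
    return [t for t in result if t]    # drop topics emptied by moves
```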
Those skilled in the art will appreciate that, since the set of texts in each topic may change during processing, the keywords, feature vectors, topic descriptions and topic description feature vectors of the current topics are updated in real time.
Further, in another embodiment of the present invention, S420 is replaced with:
S421, obtain the topic description similarity set ST_ij = {ST_ij(1), ST_ij(2), …, ST_ij(s), …, ST_ij(k-1)}, wherein ST_ij(s) is the similarity between the topic description feature vector of T_ij and that of the s-th of the k-1 current topics other than topic i; s takes values from 1 to k-1.
S422, sort ST_ij in descending order to obtain a sorted similarity set, and take the first m similarities of the sorted set to form the comparison similarity set STC_ij. m may be set based on actual needs, e.g., m = 5.
S423, obtain g_p(ij,w) and SF_p(ij,w); if g_p(ij,w) ≥ D1_p and SF_p(ij,w) ≥ D2_p, add topic w to the candidate topic set of T_ij and execute S431; otherwise execute S431 directly; wherein topic w is the w-th of the m topics corresponding to STC_ij, g_p(ij,w) is the number of keywords shared by T_ij and topic w at the p-th cleaning process, and SF_p(ij,w) is the similarity between the feature vector of T_ij and the feature vector of topic w; w takes values from 1 to m.
Further, S430 is replaced with:
s431, w=w+1 is set, if w is less than or equal to m, S423 is executed, otherwise S432 is executed.
S432, if the candidate topic set of T_ij is Null, create a new topic for T_ij, add T_ij to it and delete it from its original topic, set k = k+1, and execute S440; if the candidate topic set of T_ij contains one similarity, add T_ij to the corresponding topic and delete it from its original topic, and execute S440; if the candidate topic set of T_ij contains multiple similarities, add T_ij to the topic with the maximum similarity in the candidate set and delete it from its original topic, and execute S440.
Further, in the embodiment of the present invention, S700 may specifically include:
S710, obtain g(u,v), S1_uv and S2_uv; if g(u,v) ≥ D4, S1_uv ≥ D5 and S2_uv ≥ D6, merge topic u into topic v and execute S730; otherwise execute S720; wherein topic v is the v-th topic in the current merged topic list, g(u,v) is the number of keywords shared by topic u and topic v, S1_uv is the similarity between the topic feature vectors of topics u and v, and S2_uv is the similarity between the topic description feature vectors of topics u and v; v takes values from 1 to n, n being the number of topics in the current merged topic list; D4 is a fourth set threshold, D5 a fifth set threshold and D6 a sixth set threshold; the merged topic list is initially Null.
In the embodiment of the present invention, D4 to D6 may be set to be the same as the first to third set thresholds, respectively.
S720, setting v=v+1, if v is less than or equal to n, executing S710, otherwise, adding the topic u as a new topic into the current combined topic list; s730 is performed.
S730, setting u=u+1, if u is equal to or less than H, executing S710, otherwise executing S740.
S740, acquiring the number of texts in any topic in the current topic list, and deleting the topic from the current topic list if the number of texts in the topic is less than a set number threshold; a target topic list is obtained and S800 is performed.
In the embodiment of the present invention, the set number threshold may be 3. The topic description of the topic obtained by merging topics u and v is the topic description of topic v, and its feature vector is the average of the feature vectors of topics u and v.
Those skilled in the art will appreciate that when u = 1, since the current merged topic list is Null (empty), topic 1 is added to the merged topic list directly.
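The merging stage (S710–S740) can be sketched as below, again with the three scores supplied as caller-provided functions (hypothetical interfaces) and topics represented as plain lists of texts; the input list is assumed to already be sorted by decreasing text count as produced in S600:

```python
def merge_topics(topic_list, same_kw, sim_vec, sim_desc,
                 d4, d5, d6, min_texts=3):
    """Simplified merging: walk the topics, merge each one into the first
    compatible topic of the merged list (all three thresholds hold),
    otherwise append it as a new entry; finally drop topics with fewer
    than min_texts texts (S740)."""
    merged = []                      # the merged topic list, initially empty
    for topic in topic_list:
        for m in merged:
            if (same_kw(topic, m) >= d4
                    and sim_vec(topic, m) >= d5
                    and sim_desc(topic, m) >= d6):
                m.extend(topic)      # merge topic u into topic v
                break
        else:
            merged.append(list(topic))
    return [m for m in merged if len(m) >= min_texts]
```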
Further, S800 also includes: obtaining the keywords, feature vector, topic description and topic description feature vector of each topic in the target topic list, so as to update them for each topic.
According to the text clustering cleaning and merging method based on topic description, texts are first clustered into a number of clusters, each of which corresponds to one topic. The clusters are then cleaned and optimized using three indicators: the similarity between the topic vector and the text vectors within the topic, the similarity between the topic description and the topic description generated for each text, and the number of keywords shared by a text and its topic. The output is the final set of clusters together with a description of each topic, so that the clustering results are more accurate.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium, which may be disposed in an electronic device to store at least one instruction or at least one program; the at least one instruction or program is loaded and executed by a processor to implement the methods provided by the embodiments described above.
Embodiments of the present invention also provide an electronic device comprising a processor and the aforementioned non-transitory computer-readable storage medium.
Embodiments of the invention also provide a computer program product comprising program code which, when the program product is run on an electronic device, causes the electronic device to carry out the steps of the methods according to the various exemplary embodiments of the invention described in this specification.
While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the present disclosure is defined by the appended claims.

Claims (10)

1. A text clustering cleaning and merging method based on topic description is characterized by comprising the following steps:
S100, obtaining the feature vector and keywords of each text in the texts to be clustered, wherein each text comprises h keywords;
S200, clustering the texts to be clustered by using a set clustering algorithm based on the obtained feature vectors, to obtain a plurality of topics;
S300, acquiring any text in any topic among the current topics, and the keywords, feature vector and topic description feature vector of any topic;
S400, carrying out the p-th cleaning treatment on the current topics based on the number of keywords shared between a text and a topic, the similarity between the feature vectors of the text and the topic, and the similarity between the topic description feature vectors of the text and the topic, to obtain n(p) topics after treatment; wherein any topic a of the n(p) topics satisfies the following condition: g_p(a,q) ≥ D1_p, SF^p_aq ≥ D2_p, and ST^p_aq ≥ D3_p; wherein g_p(a,q) is the number of keywords shared by topic a and the q-th text T_aq in topic a, SF^p_aq is the similarity between the feature vector of topic a and the feature vector of T_aq, and ST^p_aq is the similarity between the topic description feature vector of topic a and the topic description feature vector of T_aq; D1_p is the first set threshold corresponding to the p-th cleaning treatment, D2_p is the second set threshold corresponding to the p-th cleaning treatment, and D3_p is the third set threshold corresponding to the p-th cleaning treatment; a takes values from 1 to n(p), p takes values from 1 to C0, and C0 is a preset number of times; q takes values from 1 to f(a), where f(a) is the number of texts in topic a;
S500, setting p=p+1; if p ≤ C0, executing S300; otherwise, obtaining H topics after the cleaning treatment, and executing S600;
S600, obtaining a topic list S by sorting the H topics in decreasing order of text quantity, and acquiring the keywords, feature vector, topic description and topic description feature vector corresponding to any topic u in S, wherein u takes values from 1 to H;
S700, merging S based on the number of keywords shared between topics, the similarity between the feature vectors of the topics and the similarity between the topic description feature vectors of the topics, to obtain a target topic list;
S800, outputting the topic description and corresponding texts of each topic in the target topic list, wherein the texts corresponding to each topic are sorted by text release time.
2. The method according to claim 1, wherein S400 specifically comprises:
S410, for the j-th text T_ij in topic i among the current topics corresponding to the p-th cleaning treatment, obtaining g_p(i,j), SF^p_ij and ST^p_ij respectively; if g_p(i,j) ≥ D1_p, SF^p_ij ≥ D2_p and ST^p_ij ≥ D3_p, keeping T_ij in topic i and executing S440; otherwise, executing S420; wherein g_p(i,j) is the number of keywords shared by T_ij and topic i at the p-th cleaning treatment, SF^p_ij is the similarity between the feature vector of T_ij and the topic feature vector of topic i at the p-th cleaning treatment, and ST^p_ij is the similarity between the topic description feature vector of T_ij and the topic description feature vector of topic i at the p-th cleaning treatment; i takes values from 1 to k, where k is the number of current topics; j takes values from 1 to f(i), where f(i) is the number of texts in topic i;
S420, obtaining g_p(ij,s), SF^ps_ij and ST^ps_ij; if g_p(ij,s) ≥ D1_p, SF^ps_ij ≥ D2_p and ST^ps_ij ≥ D3_p, adding T_ij to topic s, deleting it from its original topic, and executing S440; otherwise, executing S430; wherein topic s is the s-th topic of the k-1 current topics other than topic i, g_p(ij,s) is the number of keywords shared by T_ij and topic s at the p-th cleaning treatment, SF^ps_ij is the similarity between the feature vector of T_ij and the topic feature vector of topic s at the p-th cleaning treatment, and ST^ps_ij is the similarity between the topic description feature vector of T_ij and the topic description feature vector of topic s at the p-th cleaning treatment; s takes values from 1 to k-1;
S430, setting s=s+1; if s ≤ k-1, executing S420; otherwise, creating a new topic for T_ij, adding T_ij to the new topic and deleting it from its original topic, setting k=k+1, and executing S440;
S440, setting j=j+1; if j ≤ f(i), executing S410; otherwise, setting i=i+1, and if i ≤ k, executing S410, otherwise executing S500.
3. The method according to claim 1, wherein S700 specifically comprises:
S710, obtaining g(u,v), S1_uv and S2_uv; if g(u,v) ≥ D4, S1_uv ≥ D5 and S2_uv ≥ D6, merging topic u and topic v, and executing S730; otherwise, executing S720; wherein topic v is the v-th topic in the current merged topic list, g(u,v) is the number of keywords shared by topic u and topic v, S1_uv is the similarity between the topic feature vector of topic u and the topic feature vector of topic v, and S2_uv is the similarity between the topic description feature vector of topic u and the topic description feature vector of topic v; v takes values from 1 to n, where n is the number of topics in the current merged topic list; D4 is a fourth set threshold, D5 is a fifth set threshold, and D6 is a sixth set threshold; the merged topic list is initially Null;
S720, setting v=v+1; if v ≤ n, executing S710; otherwise, adding topic u to the current merged topic list as a new topic, setting n=n+1, and executing S730;
S730, setting u=u+1; if u ≤ H, executing S710; otherwise, executing S740;
S740, acquiring the number of texts in each topic in the current topic list, and deleting a topic from the current topic list if its number of texts is less than a set number threshold; obtaining the target topic list and executing S800.
4. The method of claim 2, wherein S420 is replaced with:
S421, obtaining the topic description similarity set ST_ij = {ST^1_ij, ST^2_ij, …, ST^s_ij, …, ST^(k-1)_ij}, wherein ST^s_ij is the similarity between the topic description feature vector of T_ij and the topic description feature vector corresponding to the s-th of the k-1 current topics other than topic i; s takes values from 1 to k-1;
S422, sorting ST_ij in descending order to obtain a sorted similarity set, and taking the first m similarities in the sorted set to form the comparison similarity set STC_ij;
S423, obtaining g_p(ij,w) and SF^pw_ij; if g_p(ij,w) ≥ D1_p and SF^pw_ij ≥ D2_p, adding topic w to the candidate topic set corresponding to T_ij and executing S431; otherwise, executing S431 directly; wherein topic w is the w-th topic of the m topics corresponding to STC_ij, g_p(ij,w) is the number of keywords shared by T_ij and topic w at the p-th cleaning treatment, and SF^pw_ij is the similarity between the feature vector corresponding to T_ij and the feature vector corresponding to topic w at the p-th cleaning treatment; w takes values from 1 to m;
and S430 is replaced with:
S431, setting w=w+1; if w ≤ m, executing S423; otherwise, executing S432;
S432, if the candidate topic set corresponding to T_ij is Null, creating a new topic for T_ij, adding T_ij to the new topic and deleting it from its original topic, setting k=k+1, and executing S440; if the candidate topic set corresponding to T_ij contains one similarity, adding T_ij to the topic corresponding to that similarity, deleting it from its original topic, and executing S440; if the candidate topic set corresponding to T_ij contains a plurality of similarities, adding T_ij to the topic corresponding to the maximum similarity in the candidate topic set, deleting it from its original topic, and executing S440.
5. The method of claim 1, wherein the topic description feature vector of any text is obtained by:
S301, inputting the title of the text into a set topic description generation model to obtain the corresponding topic description;
S302, inputting the topic description of the text into a set topic description feature generation model to obtain the corresponding topic description feature vector.
6. The method of claim 5, wherein the topic description feature vector for any topic is obtained by:
S303, based on the topic descriptions corresponding to all texts in the topic, taking the most frequently generated of all the topic descriptions as the topic description of the topic;
S304, inputting the topic description of the topic into a set topic description feature generation model to obtain the topic description feature vector of the topic.
7. The method of claim 1, wherein the keywords of any one topic are obtained by:
S305, merging identical keywords among the keywords of all texts in the topic and recalculating their weights to obtain the merged keywords;
S306, sorting the merged keywords in descending order of weight, and taking the first h sorted keywords as the keywords of the topic.
8. The method of claim 1, wherein the feature vector for any topic is an average of feature vectors for all text in the topic.
9. The method of claim 1, wherein the keywords for each text are obtained by a TextRank algorithm.
10. The method of claim 1, wherein the feature vector for each text is obtained by a SimBERT model.
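The merge pass of claim 3 above can be sketched as follows. This is a hypothetical illustration under assumed details: the `cosine` similarity measure, the helper name `merge_topics`, and the toy topics and thresholds D4–D6 are introduced for demonstration only, and the patented method's threshold values and vector models are not reproduced here.

```python
# Sketch of the claim-3 merge pass (assumed details): walk topics in
# decreasing order of text count; fold each topic into the first merged
# topic that shares >= D4 keywords and passes both vector-similarity
# thresholds D5 and D6; otherwise append it as a new merged topic.
import math

def cosine(a, b):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def merge_topics(topics, d4, d5, d6):
    """topics: list already sorted by decreasing text count; each topic is
    a dict with 'keywords', 'vec', 'desc_vec', and 'texts'."""
    merged = []  # merged topic list, initially empty (Null in the claim)
    for u in topics:
        for v in merged:
            same_kw = len(set(u["keywords"]) & set(v["keywords"]))
            if (same_kw >= d4
                    and cosine(u["vec"], v["vec"]) >= d5
                    and cosine(u["desc_vec"], v["desc_vec"]) >= d6):
                v["texts"].extend(u["texts"])  # fold topic u into topic v
                break
        else:
            merged.append(u)  # no match: topic u becomes a new merged topic
    return merged

# Toy topics: two near-duplicate earthquake topics and one sports topic.
topics = [
    {"keywords": {"quake", "relief"}, "vec": [1.0, 0.0],
     "desc_vec": [1.0, 0.0], "texts": ["t1", "t2"]},
    {"keywords": {"quake", "relief", "aid"}, "vec": [0.95, 0.05],
     "desc_vec": [0.9, 0.1], "texts": ["t3"]},
    {"keywords": {"league", "final"}, "vec": [0.0, 1.0],
     "desc_vec": [0.0, 1.0], "texts": ["t4"]},
]
result = merge_topics(topics, d4=2, d5=0.9, d6=0.9)
print(len(result))  # 2: the earthquake topics merge, the sports topic stays
```

The claim's final step S740 (dropping merged topics whose text count falls below a set threshold) would be a simple filter over `result` and is omitted here.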
CN202310347961.9A 2023-04-03 2023-04-03 Text clustering cleaning and merging method based on topic description Active CN116361470B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310347961.9A CN116361470B (en) 2023-04-03 2023-04-03 Text clustering cleaning and merging method based on topic description


Publications (2)

Publication Number Publication Date
CN116361470A true CN116361470A (en) 2023-06-30
CN116361470B CN116361470B (en) 2024-05-14

Family

ID=86937686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310347961.9A Active CN116361470B (en) 2023-04-03 2023-04-03 Text clustering cleaning and merging method based on topic description

Country Status (1)

Country Link
CN (1) CN116361470B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3118751A1 (en) * 2015-07-13 2017-01-18 Pixalione Method of extracting keywords, device and corresponding computer program
KR101828995B1 (en) * 2017-05-08 2018-02-14 한국과학기술정보연구원 Method and Apparatus for clustering keywords
CN109710728A (en) * 2018-11-26 2019-05-03 西南电子技术研究所(中国电子科技集团公司第十研究所) News topic automatic discovering method
CN110134787A (en) * 2019-05-15 2019-08-16 北京信息科技大学 A kind of news topic detection method
CN111460153A (en) * 2020-03-27 2020-07-28 深圳价值在线信息科技股份有限公司 Hot topic extraction method and device, terminal device and storage medium
CN113157857A (en) * 2021-03-13 2021-07-23 中国科学院新疆理化技术研究所 Hot topic detection method, device and equipment for news
CN113722483A (en) * 2021-08-31 2021-11-30 平安银行股份有限公司 Topic classification method, device, equipment and storage medium
CN114492429A (en) * 2022-01-12 2022-05-13 平安科技(深圳)有限公司 Text theme generation method, device and equipment and storage medium
CN114579739A (en) * 2022-01-12 2022-06-03 中国电子科技集团公司第十研究所 Topic detection and tracking method for text data stream
CN115269846A (en) * 2022-08-02 2022-11-01 网易(杭州)网络有限公司 Text processing method and device, electronic equipment and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
卢天旭: "涉案新闻的话题发现与话题摘要方法研究", 中国优秀硕士学位论文全文数据库 信息科技辑, 15 March 2023 (2023-03-15) *
谭真: "基于 MapReduce 的热点话题发现及演化分析方法研究", 中国优秀硕士学位论文全文数据库 信息科技辑, 15 March 2017 (2017-03-15) *
陈龙: "新闻热点话题发现及演化分析研究与应用", 中国优秀硕士学位论文全文数据库 信息科技辑, 15 July 2017 (2017-07-15) *

Also Published As

Publication number Publication date
CN116361470B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN109299263B (en) Text classification method and electronic equipment
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
US10135723B2 (en) System and method for supervised network clustering
CN111694941B (en) Reply information determining method and device, storage medium and electronic equipment
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN111966810A (en) Question-answer pair ordering method for question-answer system
Bounabi et al. A comparison of text classification methods using different stemming techniques
Korobkin et al. A multi-stage algorithm for text documents filtering based on physical knowledge
JP2008204374A (en) Cluster generating device and program
CN116361470B (en) Text clustering cleaning and merging method based on topic description
CN116910599A (en) Data clustering method, system, electronic equipment and storage medium
CN115345158A (en) New word discovery method, device, equipment and storage medium based on unsupervised learning
CN116049414B (en) Topic description-based text clustering method, electronic equipment and storage medium
WO2014118976A1 (en) Learning method, information conversion device, and learning program
JP2006285419A (en) Information processor, processing method and program
JP2004326465A (en) Learning device for document classification, and document classification method and document classification device using it
Butler-Yeoman et al. Particle Swarm Optimisation for Feature Selection: A Size-Controlled Approach.
CN116361468B (en) Event context generation method, electronic equipment and storage medium
Sari et al. Combining the active learning algorithm based on the silhouette coefficient with pckmeans algorithm
Mittal et al. A review of some Bayesian Belief Network structure learning algorithms
Iwata et al. Unsupervised Object Matching for Relational Data
CN116361469A (en) Topic generation method based on pre-training model
JP2001265788A (en) Method and device for sorting document and recording medium storing document sorting program
US20210264264A1 (en) Learning device, learning method, learning program, evaluation device, evaluation method, and evaluation program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant