CN115146692A - Data clustering method and device, electronic equipment and readable storage medium - Google Patents

Info

Publication number
CN115146692A
CN115146692A
Authority
CN
China
Prior art keywords
cluster
clustering
dimensional feature
text data
data
Prior art date
Legal status
Pending
Application number
CN202110352960.4A
Other languages
Chinese (zh)
Inventor
郭峰
杨宇轩
沈矗
Current Assignee
Qianxin Technology Group Co Ltd
Secworld Information Technology Beijing Co Ltd
Original Assignee
Qianxin Technology Group Co Ltd
Secworld Information Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Qianxin Technology Group Co Ltd, Secworld Information Technology Beijing Co Ltd filed Critical Qianxin Technology Group Co Ltd
Priority to CN202110352960.4A
Publication of CN115146692A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The application provides a data clustering method and device, an electronic device, and a readable storage medium, relating to the technical field of data mining. After text data is subjected to primary clustering, the method replaces the multi-dimensional semantic features in each high-dimensional feature vector with a one-dimensional cluster label derived from the primary clustering result, thereby reducing the dimensionality of the data. The dimension-reduced data is then clustered again. This effectively reduces the amount of computation during the secondary clustering and improves clustering efficiency, while the secondary clustering itself preserves clustering quality, so the method balances clustering efficiency and clustering precision.

Description

Data clustering method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of data mining technologies, and in particular to a data clustering method and device, an electronic device, and a readable storage medium.
Background
Clustering is the process of dividing a set of physical or abstract objects into classes composed of similar objects; it is widely applied in fields such as image analysis and text retrieval.
Current clustering approaches operate on the high-dimensional vectors corresponding to the original data. Because these vectors have many dimensions, the amount of computation during clustering is large, so clustering efficiency is low and the approach is unsuitable for clustering scenarios with large data volumes.
Disclosure of Invention
An embodiment of the present application aims to provide a data clustering method, an apparatus, an electronic device, and a readable storage medium, so as to solve the problem of low clustering efficiency in the prior art.
In a first aspect, an embodiment of the present application provides a data clustering method. The method includes: acquiring a high-dimensional feature vector corresponding to text data to be clustered, the high-dimensional feature vector comprising semantic features of multiple dimensions; performing primary clustering on the text data according to the semantic features, and determining a cluster label corresponding to the high-dimensional feature vector according to the primary clustering result; replacing the multi-dimensional semantic features in the high-dimensional feature vector with the one-dimensional cluster label to form a low-dimensional feature vector; and clustering the low-dimensional feature vector again to obtain a final clustering result.
In this implementation, after the text data is subjected to primary clustering, the multi-dimensional semantic features in the high-dimensional feature vector are replaced with a one-dimensional cluster label according to the primary clustering result, reducing the dimensionality of the data. The dimension-reduced data is then clustered again, which effectively reduces the amount of computation during the second clustering and improves clustering efficiency, while the second clustering preserves clustering quality. This implementation therefore balances clustering efficiency and clustering precision.
Optionally, performing primary clustering on the text data according to the semantic features and determining the cluster label corresponding to the high-dimensional feature vector according to the primary clustering result includes:
calculating a first similarity between the high-dimensional feature vector and each potential cluster in a cluster set formed by clustering historical text data, where a potential cluster is a cluster in the set whose number of texts is greater than a preset number;
determining a target potential cluster, namely the potential cluster whose first similarity with the high-dimensional feature vector is the largest and greater than a first preset similarity; and
clustering the text data into the target potential cluster, and determining the cluster label of the target potential cluster as the cluster label corresponding to the high-dimensional feature vector.
In this implementation, the similarity is first calculated between the text data and the potential clusters only, rather than with all clusters, which reduces the amount of computation during the primary clustering and improves its efficiency.
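The potential-cluster matching described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: cosine similarity, the cluster bookkeeping structure, and the threshold values are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def assign_to_potential(vec, clusters, min_texts=3, threshold=0.8):
    """Return the label of the best-matching potential cluster, or None.

    clusters: {label: {"center": [...], "count": int}}
    A cluster counts as "potential" when its text count exceeds min_texts
    (the patent's "preset number"); threshold plays the role of the
    "first preset similarity".
    """
    best_label, best_sim = None, threshold
    for label, c in clusters.items():
        if c["count"] <= min_texts:      # skip isolated clusters entirely
            continue
        sim = cosine(vec, c["center"])
        if sim > best_sim:               # keep the largest similarity above threshold
            best_label, best_sim = label, sim
    return best_label
```

Because isolated clusters are skipped, the number of similarity computations per incoming text is bounded by the number of potential clusters, which is the efficiency gain the passage describes.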
Optionally, performing primary clustering on the text data according to the semantic features and determining the cluster label corresponding to the high-dimensional feature vector according to the primary clustering result may also include:
calculating a first similarity between the high-dimensional feature vector and each potential cluster in a cluster set formed by clustering historical text data, where a potential cluster is a cluster in the set whose number of texts is greater than a preset number;
if there is no potential cluster whose first similarity is greater than a first preset similarity, calculating a second similarity between the high-dimensional feature vector and each isolated cluster in the cluster set, where an isolated cluster is a cluster in the set whose number of texts is less than or equal to the preset number;
determining a target isolated cluster, namely the isolated cluster whose second similarity with the high-dimensional feature vector is the largest and greater than a second preset similarity; and
clustering the text data into the target isolated cluster, and determining the cluster label of the target isolated cluster as the cluster label corresponding to the high-dimensional feature vector.
In this implementation, the similarity between the text data and the isolated clusters is calculated only when the text data is not similar to any potential cluster, which prevents the text data from being clustered into a wrong potential cluster and preserves the clustering effect.
Optionally, after calculating the second similarity between the high-dimensional feature vector and the isolated clusters in the cluster set, the method further includes:
if there is no isolated cluster whose second similarity is greater than the second preset similarity, taking the text data as a new cluster and determining the cluster label of the new cluster as the cluster label corresponding to the high-dimensional feature vector, thereby completing the primary clustering of the text data.
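The full single-pass assignment chain described in these optional steps — try potential clusters first, fall back to isolated clusters, and finally open a new cluster — can be sketched as below. The data structure, similarity measure, and thresholds are illustrative assumptions, not the patent's specification.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def single_pass_assign(vec, clusters, min_texts=3,
                       sim_potential=0.8, sim_isolated=0.6):
    """Assign one high-dimensional vector to a cluster and return its label.

    clusters: {int label: {"center": [...], "count": int}}, mutated in place.
    Potential clusters (count > min_texts) are checked first; isolated
    clusters only if no potential cluster is similar enough; a new cluster
    is opened when nothing matches.
    """
    def best(candidates, threshold):
        label, sim = None, threshold
        for lbl, c in candidates:
            s = cosine(vec, c["center"])
            if s > sim:
                label, sim = lbl, s
        return label

    potential = [(l, c) for l, c in clusters.items() if c["count"] > min_texts]
    isolated = [(l, c) for l, c in clusters.items() if c["count"] <= min_texts]

    label = best(potential, sim_potential)
    if label is None:                    # no similar potential cluster
        label = best(isolated, sim_isolated)
    if label is None:                    # open a new cluster for this text
        label = max(clusters, default=0) + 1
        clusters[label] = {"center": list(vec), "count": 0}
    clusters[label]["count"] += 1
    return label
```

A cluster whose count grows past `min_texts` automatically becomes a potential cluster on later calls, which mirrors the promotion of isolated clusters described below.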
Optionally, after clustering the text data into the target isolated cluster, the method further includes:
if the number of texts in the target isolated cluster is greater than the preset number, determining the target isolated cluster as a potential cluster, so that when subsequent text data is clustered, this cluster participates in the similarity calculation first, avoiding the extra computation of also comparing against isolated clusters.
Optionally, after performing primary clustering on the text data according to the semantic features to obtain a primary clustering result, the method further includes:
updating the clusters in the primary clustering result;
wherein the update process includes at least one of: updating the cluster centers of clusters whose data volume is greater than a first number; merging clusters whose mutual similarity is greater than a preset similarity and recalculating the cluster center of the merged cluster; and deleting clusters whose data volume is smaller than a second number.
In this implementation, updating each cluster optimizes the primary clustering result, which can improve the precision of the primary clustering or reduce the amount of computation in the subsequent secondary clustering.
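A sketch of such an update pass, under the assumption that each cluster tracks a center vector and a text count; the merge rule (weighted mean of the two centers) and the thresholds are illustrative choices, not taken from the patent.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def update_clusters(clusters, merge_sim=0.9, min_size=2):
    """Merge near-duplicate clusters and drop clusters that stayed too small.

    clusters: {label: {"center": [...], "count": int}}, mutated and returned.
    """
    labels = sorted(clusters)
    absorbed = set()
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if a in absorbed or b in absorbed:
                continue
            if cosine(clusters[a]["center"], clusters[b]["center"]) > merge_sim:
                ca, cb = clusters[a], clusters[b]
                total = ca["count"] + cb["count"]
                # recompute the merged cluster center as a count-weighted mean
                ca["center"] = [(x * ca["count"] + y * cb["count"]) / total
                                for x, y in zip(ca["center"], cb["center"])]
                ca["count"] = total
                absorbed.add(b)
    for b in absorbed:
        del clusters[b]
    # delete clusters whose data volume stayed below the second number
    for lbl in [l for l, c in clusters.items() if c["count"] < min_size]:
        del clusters[lbl]
    return clusters
```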
Optionally, clustering the low-dimensional feature vectors again to obtain the final clustering result includes:
acquiring the text data in each cluster of the primary clustering result; and
performing density clustering on the low-dimensional feature vectors corresponding to the text data in each cluster to obtain the final clustering result.
In this implementation, when the data volume is small, performing density clustering on all the data of each cluster can effectively improve clustering precision and achieve a better clustering effect.
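For illustration, a toy density-clustering pass in the style of DBSCAN over the low-dimensional vectors might look as follows; `eps` and `min_pts` are assumed parameters, and a production system would normally use an existing density-clustering library rather than this sketch.

```python
def density_cluster(points, eps=1.0, min_pts=2):
    """Toy DBSCAN-style clustering over low-dimensional feature vectors.

    points: list of equal-length numeric vectors.
    Returns one label per point; -1 marks noise points.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points))
                     if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1               # noise (may be claimed later as border)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:                     # expand the dense region
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # border point: relabel, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = [k for k in range(len(points))
                  if dist(points[j], points[k]) <= eps]
            if len(nb) >= min_pts:       # core point: keep expanding
                queue.extend(k for k in nb if labels[k] is None)
    return labels
```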
Optionally, clustering the low-dimensional feature vectors again to obtain the final clustering result includes:
acquiring the low-dimensional feature vector corresponding to the cluster center of each cluster in the primary clustering result; and
performing density clustering on the low-dimensional feature vectors corresponding to the cluster centers to obtain the final clustering result.
In this implementation, when the data volume is large, performing density clustering only on the cluster centers improves clustering efficiency.
In a second aspect, an embodiment of the present application provides a data clustering device, where the device includes:
the high-dimensional vector acquisition module, configured to acquire a high-dimensional feature vector corresponding to the text data to be clustered, the high-dimensional feature vector comprising semantic features of multiple dimensions;
the primary clustering module, configured to perform primary clustering on the text data according to the semantic features and determine the cluster label corresponding to the high-dimensional feature vector according to the primary clustering result;
the data dimension reduction module, configured to replace the multi-dimensional semantic features in the high-dimensional feature vector with the one-dimensional cluster label to form a low-dimensional feature vector; and
the secondary clustering module, configured to cluster the low-dimensional feature vectors again to obtain a final clustering result.
Optionally, the primary clustering module is configured to calculate a first similarity between the high-dimensional feature vector and a potential cluster in a cluster set formed by clustering historical text data, where the potential cluster is a cluster in the cluster set in which the number of texts is greater than a preset number; determining a target potential cluster, wherein the first similarity between the target potential cluster and the high-dimensional feature vector is maximum and is greater than a first preset similarity; clustering the text data into the target potential cluster, and determining a cluster-like label of the target potential cluster as a cluster-like label corresponding to the high-dimensional feature vector.
Optionally, the primary clustering module is configured to calculate a first similarity between the high-dimensional feature vector and a potential cluster in a cluster set formed by clustering historical text data, where the potential cluster is a cluster in which the number of texts in the cluster set is greater than a preset number; if no potential cluster with the first similarity larger than a first preset similarity exists, calculating a second similarity between the high-dimensional feature vector and an isolated cluster in the cluster set, wherein the isolated cluster is a cluster in which the number of texts in the cluster set is smaller than or equal to the preset number; determining a target isolated cluster, wherein the second similarity between the target isolated cluster and the high-dimensional feature vector is the largest and is greater than a second preset similarity; clustering the text data into the target isolated cluster, and determining a cluster label of the target isolated cluster as a cluster label corresponding to the high-dimensional feature vector.
Optionally, the primary clustering module is configured to, if there is no isolated cluster with the second similarity greater than the second preset similarity, take the text data as a new cluster, and determine a cluster-like label of the new cluster as a cluster-like label corresponding to the high-dimensional feature vector.
Optionally, the primary clustering module is configured to determine the target isolated cluster as a potential cluster if the number of texts of the target isolated cluster is greater than the preset number.
Optionally, the primary clustering module is further configured to update a plurality of clusters in the primary clustering result;
wherein the update process includes at least one of: updating cluster centers of the clusters with the data volume larger than the first number; merging the clusters with the similarity larger than the preset similarity, and recalculating the cluster center of the merged clusters; and deleting the cluster class with the data volume smaller than the second number.
Optionally, the secondary clustering module is configured to obtain text data in each cluster in the primary clustering result; and carrying out density clustering on the low-dimensional characteristic vectors corresponding to the text data in each cluster to obtain a final clustering result.
Optionally, the secondary clustering module is configured to obtain a low-dimensional feature vector corresponding to a cluster center of each cluster in the primary clustering result; and carrying out density clustering on the low-dimensional characteristic vectors corresponding to the cluster centers of all the clusters to obtain a final clustering result.
In a third aspect, an embodiment of the present application provides a data clustering method, which is applied to a data clustering platform, where the data clustering platform includes an application layer, a data mining layer, a computing layer, a feature representation layer, a preprocessing layer, and a data layer; the method comprises the following steps:
receiving, through the application layer, a clustering task sent by an upper-layer application, the clustering task indicating the text data to be clustered;
acquiring text data to be clustered from a database through the data layer according to the clustering task;
preprocessing the text data through the preprocessing layer to obtain processed text data;
vectorizing the processed text data through the feature representation layer to obtain high-dimensional feature vectors corresponding to the text data;
clustering the text data via their high-dimensional feature vectors through the computing layer, using the data clustering method described above, to obtain a final clustering result; and
and performing subject term extraction and/or key sentence extraction on each cluster in the final clustering result through the data mining layer, and outputting the subject terms and/or key sentences corresponding to each cluster through the application layer.
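The layered flow above can be sketched as a simple function pipeline. Every layer implementation here is a hypothetical stand-in; the sketch only shows how a clustering task travels application layer → data layer → preprocessing layer → feature representation layer → computing layer → data mining layer.

```python
def run_pipeline(task, layers):
    """Drive a clustering task through the platform's layers.

    task: opaque description of what to cluster (from the application layer).
    layers: dict of callables standing in for the platform layers.
    Returns whatever the data mining layer produces (e.g. topic terms).
    """
    data = layers["data"](task)          # data layer: fetch text to cluster
    data = layers["preprocess"](data)    # preprocessing layer: clean the text
    vecs = layers["feature"](data)       # feature layer: vectorize to high-dim vectors
    clusters = layers["compute"](vecs)   # computing layer: two-stage clustering
    return layers["mine"](clusters)      # mining layer: extract topics/key sentences
```

A caller would wire in real implementations per layer; with trivial stand-ins the task flows end to end, which is all this sketch is meant to show.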
In a fourth aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the steps in the method as provided in the first aspect are executed.
In a fifth aspect, embodiments of the present application provide a readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, performs the steps in the method as provided in the first aspect.
Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic structural diagram of an electronic device for performing a data clustering method according to an embodiment of the present application;
fig. 2 is a flowchart of a data clustering method provided in an embodiment of the present application;
fig. 3 is a schematic diagram of a primary clustering process and optimization of clusters according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of data clustering provided in an embodiment of the present application;
fig. 5 is a schematic diagram of a data clustering platform provided in an embodiment of the present application;
fig. 6 is a schematic diagram illustrating a process-based clustering of data according to an embodiment of the present application;
fig. 7 is a block diagram of a structure of a data clustering device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The embodiment of the application provides a data clustering method. After text data is subjected to primary clustering, the method replaces the multi-dimensional semantic features in the high-dimensional feature vector with a one-dimensional cluster label according to the primary clustering result, realizing the dimension reduction of the data, and then clusters the dimension-reduced data again. This effectively reduces the amount of computation during the second clustering while the second clustering preserves the clustering effect, so this implementation balances clustering efficiency and clustering precision.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device for executing a data clustering method according to an embodiment of the present application. The electronic device may include: at least one processor 110 (e.g., a CPU), at least one communication interface 120, at least one memory 130, and at least one communication bus 140. The communication bus 140 is used to realize direct communication between these components, and the communication interface 120 is used for signaling or data communication with other node devices. The memory 130 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory), and may optionally be at least one storage device located remotely from the processor. The memory 130 stores computer-readable instructions which, when executed by the processor 110, cause the electronic device to execute the method shown in fig. 2. For example, the memory 130 may store text data, and the processor 110 may vectorize the text data into high-dimensional feature vectors, reduce their dimensionality using the primary clustering result, and then perform secondary clustering to cluster the text data.
The electronic device may be a terminal device or a server or other device with certain data processing capability.
It will be appreciated that the configuration shown in fig. 1 is merely illustrative and that the electronic device may also include more or fewer components than shown in fig. 1 or have a different configuration than shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
Referring to fig. 2, fig. 2 is a flowchart of a data clustering method according to an embodiment of the present application, where the method includes the following steps:
step S110: and acquiring a high-dimensional feature vector corresponding to the text data to be clustered, wherein the high-dimensional feature vector comprises semantic features of multiple dimensions.
In some embodiments, the text data to be clustered may be text stored in the electronic device, or text data the electronic device receives from outside; in the latter case, text data received in real time can be clustered as it arrives. The text data may be, for example, news text, user data, image data, device data, or the like, or may be text data obtained by converting voice data.
If the electronic device directly acquires text data, it can vectorize the text data into high-dimensional feature vectors. The vectorization may use a word-vector conversion method such as the Word2Vector or doc2vec algorithm, or a multilingual BERT model. When a multilingual BERT model is used, the resulting high-dimensional feature vector contains rich semantic information: the model's input is the original word vector of each word in the text data (which can itself be obtained with a Word2Vector algorithm), and its output is a vector representation of each word fused with full-text semantic information. Because this full-text semantic information is rich and text data can be described along multiple dimensions, a high-dimensional vector representation is generally adopted, which is why the embodiments of the present application refer to it as a high-dimensional feature vector.
Of course, the electronic device may instead directly obtain the high-dimensional feature vector corresponding to the text data, in which case an external device performs the vector conversion. Since text data can be described along multiple dimensions, the converted vector is high-dimensional so that it describes the text data accurately. The high-dimensional feature vector includes semantic features of multiple dimensions; for example, it may have 512 dimensions, of which 100 represent the semantic features of the text data while the remaining dimensions represent other features, such as the positions of individual words, the time the text data was generated, and its source.
In addition, the text data may be a single piece of text data or multiple pieces of text data, each piece corresponding to one high-dimensional feature vector.
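As a stand-in for the word-vector models mentioned above (Word2Vector, doc2vec, BERT), a toy bag-of-words vectorizer illustrates the shape of this step — text in, fixed-length feature vector out. This is purely illustrative: real systems produce dense high-dimensional semantic vectors rather than sparse counts.

```python
def toy_embed(text, vocab):
    """Hypothetical stand-in for a semantic embedding model:
    a bag-of-words count vector over a fixed vocabulary.
    Each output dimension counts one vocabulary word in the text."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]
```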
Step S120: and performing primary clustering on the text data according to the semantic features, and determining cluster-like labels corresponding to the high-dimensional feature vectors according to primary clustering results.
In the primary clustering, clustering may be performed according to the semantic features, for example by grouping text data with similar semantic features into one class. When the text data consists of news texts, which are generated in real time, each piece of text data is clustered as it is obtained: the clustering process places the text data into the clusters already formed from historical text data, and may run in real time or at intervals.
For example, with real-time clustering, the first piece of text data obtained initially forms a cluster by itself; for each subsequently arriving piece of text data, the similarity between it and the existing clusters is calculated based on the semantic features, realizing primary clustering. If instead a batch of existing text data is clustered, the pairwise similarities between pieces of text data can be calculated from the semantic features (for example, the 100-dimensional semantic features), and text data whose similarity exceeds a preset similarity is grouped into one class, yielding a set of clusters — the primary clustering result.
Each cluster is given a cluster label; for example, three clusters may carry the labels 1, 2, and 3. The semantic features represented by different cluster labels can be stored in the electronic device. For instance, the cluster with label 1 may gather text data related to a certain piece of current-affairs news, so the semantic feature (which may also be called the topic) corresponding to label 1 is that news item; the cluster with label 2 may gather text data related to a certain celebrity topic, so the semantic feature corresponding to label 2 is that celebrity topic. The cluster label corresponding to each high-dimensional feature vector can therefore be determined from the primary clustering result: if text data 1 is clustered into the cluster with label 1, the cluster label of its high-dimensional feature vector is 1; if text data 2 is clustered into the cluster with label 2, its cluster label is 2.
Step S130: and replacing semantic features of multiple dimensions in the high-dimensional feature vector with a class cluster label of one dimension to form a low-dimensional feature vector.
Clustering directly on high-dimensional feature vectors occupies a large amount of resources and involves heavy computation, and real-time stream clustering (clustering of data streams, as in the scenario of clustering news texts) cannot be implemented for large data volumes. The high-dimensional feature vectors therefore need to be reduced in dimension first to cut the computation in subsequent clustering. In the embodiment of the present application, the multi-dimensional semantic features in the high-dimensional feature vector are replaced with a one-dimensional cluster label to form a low-dimensional feature vector: if 100 dimensions of the high-dimensional feature vector describe semantic features, those 100 dimensions are replaced with the single cluster label, such as label 1 or 2 above. The semantic features of the original high-dimensional feature vector are thus compressed into one simple feature value, i.e. represented by the cluster label.
In the embodiment of the present application, the low-dimensional feature vector may be three-dimensional, expressed as <time, business attribute, cluster label>; these three dimensions constitute the low-dimensional feature vector and can be viewed as the spatial features of the text data. The other two dimensions can, of course, be set flexibly according to actual requirements. The time dimension is kept in the low-dimensional feature vector because the text data considered in this application is news text, which is strongly time-correlated. The business attribute may relate to the clustering task: for example, if text data in different languages is to be clustered into one class, the business attribute may be the language of the text data; if sensitive data is to be grouped into one class, the business attribute may be whether the text data is sensitive.
That is, the primary clustering result preliminarily groups text data with similar semantic features into classes, and the high-dimensional feature vector is converted into a low-dimensional feature vector according to this result, realizing the dimension reduction of the data. When clustering is performed again subsequently, the low-dimensional feature vectors are clustered, which reduces the amount of computation and improves clustering efficiency.
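The replacement step can be sketched in a few lines. The layout assumed here (the first 100 dimensions semantic, the remaining dimensions time and business attribute) follows the examples in the description but is still an illustrative assumption, not a fixed format.

```python
def reduce_dimension(high_dim_vec, cluster_label, semantic_dims=100):
    """Replace the first `semantic_dims` semantic components of a
    high-dimensional feature vector with the one-dimensional cluster
    label from the primary clustering, keeping the remaining
    non-semantic features (e.g. time, business attribute).

    With a 102-dimensional input this yields the three-dimensional
    <cluster label, time, business attribute> style vector; the exact
    component order here is an assumption for illustration."""
    return [cluster_label] + list(high_dim_vec[semantic_dims:])
```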
Step S140: and clustering the low-dimensional feature vectors again to obtain a final clustering result.
After the low-dimensional feature vectors are obtained, they may be clustered again. A primary clustering result is obtained after primary clustering, and each text data item in the primary clustering result corresponds to one low-dimensional feature vector. Therefore, in secondary clustering, the similarity of two text data items is calculated using the low-dimensional feature vectors, text data with high similarity can be clustered into one class, and batch clustering of the text data is realized (batch clustering refers to clustering text data of a large data size), obtaining a better clustering result.
In the implementation process, after the text data is subjected to primary clustering, semantic features of multiple dimensions in the high-dimensional feature vector are replaced by a class cluster label of one dimension according to a primary clustering result, dimension reduction of the data is achieved, and the data subjected to dimension reduction is clustered again, so that the calculated amount can be effectively reduced in the process of clustering again, the clustering efficiency is improved, the clustering effect can be ensured through clustering again, and the implementation mode can give consideration to both the clustering efficiency and the clustering precision.
In some embodiments, the stream clustering may be a single-pass clustering algorithm, and the batch clustering may be a density clustering algorithm. The clustering effect of batch clustering is good relative to that of stream clustering, but if the text data were only batch-clustered, repeated computation would be required, the computation is complex and requires a large amount of memory, and it is difficult to obtain a result with a batch clustering algorithm in the case of a large amount of data. For example, for continuously growing text data, several days of batch clustering computation may be needed to obtain a clustering result, occupying a large amount of hardware resources and time, and a general server cannot support such a large amount of computation. Stream clustering, in contrast, can be realized with limited memory and limited processing time, but its clustering effect is not as good as that of batch clustering; moreover, the text data faced by both kinds of clustering are high-dimensional vectors, so the amount of computation is large and clustering efficiency is low. Therefore, the two clustering algorithms are combined: the stream clustering algorithm is used to achieve dimensionality reduction of the data, and the batch clustering algorithm is then used to achieve clustering, so that both clustering efficiency and clustering effect can be taken into account.
The following is a detailed description of the two processes of primary clustering and secondary clustering.
In some embodiments, in order to implement real-time clustering on the text data of news and improve the clustering efficiency, in the embodiment of the present application, a single-pass clustering algorithm is used as a framework during primary clustering, and a cluster structure of a potential cluster and an isolated cluster is introduced to implement primary clustering.
The single-pass clustering algorithm is a non-hierarchical clustering algorithm whose clustering process is iterative. It is highly efficient and suitable for processing text data of large data volume. It is also sensitive to the temporal order of the data: a different data order can yield a different final clustering result. It is therefore well suited to clustering news text data and can better meet the application requirements of clustering news texts.
The specific implementation mode is as follows: calculating first similarity of the high-dimensional feature vector and potential clusters in a cluster set formed by clustering historical text data, wherein the potential clusters are clusters with the number of text data larger than a preset number in the cluster set; determining a target potential cluster, wherein the first similarity between the target potential cluster and the high-dimensional feature vector is maximum and is greater than a first preset similarity; clustering the text data into a target potential cluster, and determining a class cluster label of the target potential cluster as a class cluster label corresponding to the high-dimensional feature vector.
The idea of the existing single-pass clustering algorithm is mainly as follows: if no class cluster exists yet, the text data currently to be clustered becomes the first class cluster. If a class cluster set formed by clustering already exists, the similarity between the text data to be clustered and all class clusters is calculated; the class cluster with the maximum similarity, provided that similarity is greater than a preset threshold, is selected and the text data is merged into that class cluster; otherwise the text data is used to generate a new class cluster. This process is repeated until all text data to be clustered have been processed.
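The classic single-pass flow just described can be sketched as follows. This is a minimal, illustrative implementation (cosine similarity, a running mean as the cluster center), not the improved potential/isolated-cluster variant of this application; the function and field names are assumptions.

```python
def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return num / (na * nb) if na and nb else 0.0

def single_pass(vectors, threshold):
    """Each incoming vector joins the most similar existing class cluster if that
    similarity exceeds the threshold; otherwise it starts a new cluster.
    Returns clusters as dicts of member indices plus a mean-vector center."""
    clusters = []
    for i, v in enumerate(vectors):
        best, best_sim = None, -1.0
        for c in clusters:
            s = cosine(v, c["center"])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim > threshold:
            best["members"].append(i)
            n = len(best["members"])
            # incremental update of the cluster center (mean of member vectors)
            best["center"] = [(cv * (n - 1) + vv) / n
                              for cv, vv in zip(best["center"], v)]
        else:
            clusters.append({"members": [i], "center": list(v)})
    return clusters
```

Note that the result depends on the order in which vectors arrive, which is exactly the time-order sensitivity of the single-pass algorithm mentioned above.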
In the embodiment of the application, the existing single-pass clustering algorithm is improved, so that the calculated amount of the single-pass clustering algorithm is reduced, and the clustering efficiency is improved. The implementation mode is that the class clusters are divided into potential clusters and isolated clusters, the potential clusters are the class clusters with the text number larger than the preset number in the class cluster set, and the isolated clusters are the class clusters with the text number smaller than or equal to the preset number in the class cluster set.
Because class clusters with few texts may be caused by data noise or clustering deviation, the similarity between the text data and the potential clusters can be calculated first, so that similarity need not be computed against all class clusters, reducing some of the computation and improving clustering efficiency.
For example, suppose 50 pieces of text data need to be clustered. Following the idea of the single-pass clustering algorithm, the first piece of text data is taken as a class cluster, so the class cluster set contains one class cluster and the historical text data contains the first piece of text data. When the second piece of text data is clustered, its similarity with that class cluster is calculated; if the second piece is merged into the class cluster, the historical text data then contains two pieces of text data, and subsequent text data can be clustered sequentially in the same manner. It should be noted that because the number of texts in each class cluster is small at first, the class clusters in the set may all be isolated clusters, so during the initial period text data may first be clustered into isolated clusters. After a certain period, if the number of texts in an isolated cluster exceeds the preset number, that cluster is converted into a potential cluster; subsequent text data is then first compared with the potential clusters, and similarity with the isolated clusters is no longer calculated first, so a certain amount of computation can be saved.
Calculating the first similarity refers to calculating the similarity between the high-dimensional feature vector corresponding to the text data and the cluster center of a potential cluster; the vector corresponding to the cluster center may be the average vector of the high-dimensional feature vectors of all text data in the cluster. The first similarity between two vectors may be calculated via the cosine distance and the like, or via the Kullback-Leibler divergence, the Jaccard distance, the Hellinger distance, and the like. If multiple potential clusters exist, the first similarity between the text data and each potential cluster is calculated, and the largest first similarity is determined. For example, if the first similarity between potential cluster 1 and the text data is the largest and is greater than the first preset similarity, potential cluster 1 is called the target potential cluster; the text data is clustered into potential cluster 1, and the class cluster label of potential cluster 1 is determined as the class cluster label corresponding to the high-dimensional feature vector of the text data, to facilitate the subsequent conversion into the low-dimensional feature vector.
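A sketch of this matching step follows, under the assumption that each class cluster record carries its label, text count, and center vector; the field names and the cosine choice are illustrative assumptions, not the application's fixed implementation.

```python
def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return num / (na * nb) if na and nb else 0.0

def find_target_potential_cluster(vec, clusters, preset_number, first_preset_similarity):
    """Only clusters whose text count exceeds the preset number (potential
    clusters) take part in the first-similarity computation; the best match
    must also exceed the first preset similarity.  Returns the class cluster
    label, or None (in which case the isolated clusters are tried next)."""
    best_label, best_sim = None, -1.0
    for c in clusters:
        if c["count"] <= preset_number:   # isolated cluster: skipped in this pass
            continue
        s = cosine(vec, c["center"])
        if s > best_sim:
            best_label, best_sim = c["label"], s
    if best_label is not None and best_sim > first_preset_similarity:
        return best_label
    return None
```

A `None` return corresponds to the fallback described below, where the second similarity against the isolated clusters is computed.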
In the implementation process, the similarity between the text data and the potential clusters is calculated instead of calculating the similarity between the text data and all the clusters, so that the calculation amount can be reduced in the primary clustering process, and the clustering efficiency of the primary clustering is improved.
In some embodiments, after the similarity between the text data and the potential clusters is calculated, if there is no potential cluster whose first similarity is greater than the first preset similarity, a second similarity between the high-dimensional feature vector and the isolated clusters in the class cluster set is calculated, where an isolated cluster is a class cluster whose text count is less than or equal to the preset number. A target isolated cluster is determined, whose second similarity with the high-dimensional feature vector is the largest and is greater than the second preset similarity; the text data is clustered into the target isolated cluster, and the class cluster label of the target isolated cluster is determined as the class cluster label corresponding to the high-dimensional feature vector.
The second similarity is calculated in a manner similar to the first similarity, that is, the high-dimensional feature vector corresponding to the text data is compared with the cluster center of an isolated cluster. If multiple isolated clusters exist, the second similarity between the text data and each isolated cluster is calculated, and the largest second similarity is determined. For example, if the second similarity between isolated cluster 3 and the text data is the largest and is greater than the second preset similarity, isolated cluster 3 is determined as the target isolated cluster; the text data is clustered into isolated cluster 3, and the class cluster label of isolated cluster 3 is determined as the class cluster label corresponding to the high-dimensional feature vector of the text data, to facilitate the subsequent conversion into the low-dimensional feature vector.
The first preset similarity and the second preset similarity in the two modes can be flexibly set according to actual requirements, and the first preset similarity and the second preset similarity can be different or the same, which is not particularly limited in the embodiment of the application.
The preset number may also be flexibly set according to actual requirements, for example, 2 or 3, which is not particularly limited in the embodiment of the present application.
In the implementation process, when the text data is not similar to the potential clusters, the similarity between the text data and the isolated clusters is calculated, so that the text data can be prevented from being clustered into wrong potential clusters, and the clustering effect is ensured.
In some embodiments, after the second similarity between the text data and the isolated cluster is obtained through calculation, if there is no isolated cluster with the second similarity larger than a second preset similarity, the text data is used as a new cluster, and the cluster label of the new cluster is determined as the cluster label corresponding to the high-dimensional feature vector.
That is to say, when no class cluster in the current class cluster set is sufficiently similar to the text data, the text data is used as the text of a new cluster; that is, a new cluster is constructed, and the class cluster label of the new cluster can be allocated according to a certain rule. For example, in the process of forming class clusters, a data structure similar to an index is constructed to record the class cluster label corresponding to each class cluster, so that the label of a new cluster can be generated from the labels of existing class clusters: if the existing labels are numbers and the last class cluster label is 5, the generated label of the new cluster can be 6. The text data is thus clustered into the new cluster, and the class cluster label of the new cluster is the class cluster label corresponding to the high-dimensional feature vector of the text data.
To facilitate management of each class cluster, the electronic device may record related information of each class cluster, such as the class cluster label, the number of text data items in the cluster, the cluster center, the mark indicating a potential cluster or an isolated cluster, the failure time of the class cluster (i.e., the time at which the class cluster is deleted), and the like.
The electronic device can scan the text count of each class cluster in real time and mark each cluster as a potential cluster or an isolated cluster accordingly. When similarity is calculated, which clusters in the set are potential clusters and which are isolated clusters can be known by reading the information recorded by the electronic device, so that the potential clusters or isolated clusters can be found quickly.
In some embodiments, as clustering proceeds, the number of texts in an isolated cluster may gradually increase. The electronic device may scan the text counts of the isolated clusters in real time, so that once the number of texts in an isolated cluster is greater than the preset number, the isolated cluster is determined to be a potential cluster. For example, after text data is clustered into the target isolated cluster, if the number of texts in the target isolated cluster is greater than the preset number, the target isolated cluster is determined to be a potential cluster.
In a specific implementation, when a certain isolated cluster is determined to be a potential cluster, the electronic device may update the recorded information of that cluster; for example, the isolated-cluster mark is changed to a potential-cluster mark. If the initially recorded mark of the cluster is 0, indicating an isolated cluster, the mark is changed to 1 after the conversion, indicating a potential cluster. When subsequent text data is clustered, the cluster is then included among the potential clusters, and its similarity with the text data is calculated first.
In the above implementation, determining isolated clusters as potential clusters means that when subsequent text data is clustered, the potential clusters participate in the similarity calculation first, avoiding the extra computation of calculating similarity against the isolated clusters.
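The promotion rule above can be sketched as follows, assuming each cluster record carries a text count and the 0/1 mark described earlier; the field and function names are hypothetical.

```python
ISOLATED, POTENTIAL = 0, 1

def cluster_text_into_isolated(cluster, preset_number):
    """After a text is clustered into a target isolated cluster, flip the
    recorded mark from isolated (0) to potential (1) once the text count
    exceeds the preset number, so that later texts are compared against
    this cluster in the potential-cluster pass first."""
    cluster["count"] += 1
    if cluster["mark"] == ISOLATED and cluster["count"] > preset_number:
        cluster["mark"] = POTENTIAL
    return cluster
```

Because the mark is stored with the cluster record, the similarity routine can filter on it directly instead of re-counting texts on every pass.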
In some embodiments, to achieve a better clustering result, after the primary clustering result is obtained, update processing may further be performed on the class clusters in the primary clustering result, where the update processing includes at least one of: updating the cluster centers of class clusters whose data volume is greater than a first number; merging at least two class clusters whose similarity is greater than a preset similarity (which may be the same as or different from the first or second preset similarity) and recalculating the cluster center of the merged cluster; and deleting class clusters whose data volume is smaller than a second number.
For example, the electronic device may perform the update processing of each class cluster generated in the primary clustering through a cluster-center optimization thread. Regarding updating the cluster center: because the cluster center is the average vector of the high-dimensional feature vectors of the text data in the cluster, the cluster center may shift as the amount of data in the cluster increases (the focus of a topic may change continuously over time, and to reflect this change in the texts of the cluster, the cluster center needs to be updated). To obtain more accurate similarities and to mine new topics subsequently, the cluster center therefore needs to be updated. The update method is to recompute the average vector from the high-dimensional feature vectors of the text data currently in the cluster, that is, to calculate a new cluster center, and then replace the original cluster center with the new one.
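The update itself is just a re-averaging of the member vectors; a minimal sketch (function name assumed):

```python
def update_cluster_center(member_vectors):
    """Recompute the cluster center as the average of the high-dimensional
    feature vectors of the texts currently in the cluster; the result
    replaces the original center."""
    n = len(member_vectors)
    dim = len(member_vectors[0])
    return [sum(v[d] for v in member_vectors) / n for d in range(dim)]
```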
The electronic device may update the cluster center whenever the data volume of a class cluster increases, so that the center is updated in time; alternatively, it may update the cluster center at regular intervals, or whenever a certain amount of data has been added. Understandably, the trigger condition for updating the cluster center can be set flexibly according to actual requirements.
Of course, because clustering is essentially an approximate calculation, some text data with low topic relevance is inevitably introduced into a cluster, and frequently updating the cluster center may therefore cause a large cluster-center offset.
Although the primary clustering divides the data into multiple class clusters, its precision is not very high, and some class clusters may still be highly similar to each other after clustering; such highly similar clusters can be merged to reduce the computation when clustering subsequent text data. One approach is to calculate the pairwise similarity of the class clusters and merge all clusters whose similarity is greater than the preset similarity: for example, if the similarity of cluster 1 and cluster 2 is greater than the preset similarity, and the similarity of cluster 2 and cluster 3 is also greater than the preset similarity, the three clusters are merged into one. Alternatively, two clusters whose similarity is greater than the preset similarity may be merged first, after which similarities with other clusters are recalculated to decide whether further merging is needed: pairwise similarities are computed and clusters are merged two at a time (if one cluster is similar to several others, one is chosen arbitrarily for merging); after the first round of merging, the similarities between the new clusters are calculated and a second round proceeds in the same way, until every pairwise similarity is less than or equal to the preset similarity. Calculating the similarity here means calculating the similarity of the cluster centers of two clusters; since the center of a merged cluster changes, it must be recalculated, and the recalculated center is used as the cluster center of the merged cluster.
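One round of this pairwise merging can be sketched as follows. This is a simplified, illustrative version in which each cluster stores its member vectors directly; a real implementation would also merge the recorded cluster metadata (labels, counts, marks).

```python
def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return num / (na * nb) if na and nb else 0.0

def merge_round(clusters, preset_similarity):
    """Merge every cluster whose center similarity with an earlier, unmerged
    cluster exceeds the preset similarity into that cluster, then recompute
    the merged center from all member vectors.  Repeating this round by round
    until no pair exceeds the threshold gives the iterative variant above."""
    merged, used = [], set()
    for i, a in enumerate(clusters):
        if i in used:
            continue
        members = list(a["members"])
        for j in range(i + 1, len(clusters)):
            if j not in used and cosine(a["center"], clusters[j]["center"]) > preset_similarity:
                members += clusters[j]["members"]
                used.add(j)
        dim = len(a["center"])
        center = [sum(v[d] for v in members) / len(members) for d in range(dim)]
        merged.append({"members": members, "center": center})
    return merged
```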
In addition, since some class clusters may be formed from noise data points or clustering errors and contain little data, class clusters whose data volume is smaller than the second number can be deleted, reducing the computation when clustering subsequent text data.
The first number and the second number may be set flexibly according to actual requirements; there is no fixed relationship between them, and the first number may be greater than, smaller than, or in some cases equal to the second number.
In the implementation process, the initial clustering is implemented on the text data, and the initial clustering result can be optimized by updating each cluster, so that the initial clustering precision can be improved or the calculation amount in the subsequent secondary clustering process can be reduced, and the process schematic of the initial clustering and the optimization can be shown in fig. 3.
While the above method performs the primary clustering on the text data, in other embodiments the algorithm used for primary clustering may also be a different one, such as a principal component analysis algorithm or a hierarchical clustering algorithm; alternatively, the text data may first be reduced in dimension using principal component analysis or hierarchical clustering and then primarily clustered using the above method, continuing the dimensionality reduction, which can effectively reduce the computation in the subsequent secondary clustering.
The process of re-clustering is described below.
In some embodiments, the amount of data obtained after the initial clustering may be relatively small, and all data may be subjected to density clustering in order to improve clustering accuracy. The specific implementation mode is as follows: and acquiring text data of each cluster in the primary clustering result, and then performing density clustering on the low-dimensional feature vectors corresponding to the text data in each cluster to obtain a final clustering result.
A typical density clustering algorithm is Density-Based Spatial Clustering of Applications with Noise (DBSCAN). It uses a "neighborhood" parameter to describe how closely the samples are distributed, divides regions of sufficient density into clusters, and can find clusters of arbitrary shape in the presence of noise. Its basic idea is to derive the maximal set of density-connected samples from the density-reachability relation. Each such set has one or more core objects: if there is only one core object, the other non-core objects in the cluster all lie in its neighborhood; if there are multiple core objects, the neighborhood of any core object must contain another core object (otherwise density reachability would not hold). The core objects, together with all samples contained in their neighborhoods, form a cluster.
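A minimal sketch of this idea follows: an O(n²) toy implementation for illustration only (a real system would use an indexed library implementation); the parameter names mirror DBSCAN's neighborhood radius and minimum point count.

```python
def dbscan(points, eps, min_pts, dist):
    """Toy DBSCAN: grow a cluster from each unvisited core point via the
    density-reachability relation; points reachable from no core point are
    labelled noise (-1).  Returns one cluster label per point."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1              # provisionally noise; may become a border point
            continue
        cluster += 1                    # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = [j for j in nbrs if j != i]
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster     # noise reachable from a core point: border point
                continue
            if labels[q] is not None:
                continue
            labels[q] = cluster
            nq = neighbors(q)
            if len(nq) >= min_pts:      # q is itself a core point: expand further
                seeds.extend(nq)
    return labels
```

In the secondary clustering described below, the points would be the low-dimensional feature vectors, which is what keeps each distance computation cheap.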
In this implementation, density clustering is performed on the text data in all the class clusters. If class cluster 1 contains 100 data items and class cluster 2 contains 200, these 300 items can be merged and density-clustered together; that is, the objects of density clustering are the individual data items. Since the vectors corresponding to the 300 data items are low-dimensional, the amount of computation in density clustering is effectively reduced while the clustering effect is guaranteed. The text data is thus re-clustered on the basis of the low-dimensional feature vectors: primary clustering reduces the dimensionality of the text data, and secondary clustering realizes the clustering, so that both clustering efficiency and clustering precision can be taken into account.
Alternatively, in some other embodiments, density clustering may be performed within each class cluster of the primary clustering result. For example, if cluster 1 contains 100 data items, density clustering is performed on those 100 items, and if cluster 2 contains 200 items, on those 200 items; no density clustering is performed across clusters, so similarity calculations between the text data in one cluster and the text data in other clusters are not needed, saving a certain amount of computation. In this way, the class clusters formed by primary clustering can be subdivided into smaller clusters, the text data under a large topic can be subdivided into small topics, and the clustering becomes more accurate, improving clustering precision.
It will be appreciated that the specific clustering process of density clustering may refer to existing correlation implementations and will not be described in detail herein.
In the implementation process, under the condition of small data quantity, the clustering precision can be effectively improved by carrying out density clustering on all data of each cluster, and a better clustering effect is realized.
In other embodiments, the amount of data obtained after primary clustering may be relatively large. Although reducing the high-dimensional feature vectors to low-dimensional ones saves some computation, clustering all the data again would still be computationally heavy, so density clustering may instead be performed only on the class clusters. The specific implementation is: acquiring the low-dimensional feature vectors corresponding to the cluster centers of all class clusters in the primary clustering result, and performing density clustering on these low-dimensional feature vectors to obtain the final clustering result.
In this way, the objects of density clustering are the class clusters. If many class clusters are formed after primary clustering, density clustering can be performed on them; that is, the distance between the cluster centers of the class clusters is calculated and whether each pair of clusters can be merged is determined in turn, as shown in fig. 4 (where some class clusters may also be deleted). Since the objects involved in merging are class clusters, clustering efficiency can be effectively improved, realizing rapid clustering of the data.
The embodiment of the present application further provides a data clustering platform, which may be understood as a software platform, running in the electronic device, and capable of being used to run the data clustering method, where the data clustering platform is shown in fig. 5 and includes an application layer, a data mining layer, a computing layer, a feature representation layer, a preprocessing layer, and a data layer.
The platform can acquire streaming data, knowledge data, and the like through the data layer. The data enters the preprocessing layer for feature selection, filtering, and other processing; the preprocessed data enters the feature representation layer for feature representation of the data and then enters the computing layer for clustering. The clustering result enters the data mining layer for data mining work such as topic word extraction and key sentence extraction, and the application layer provides stream clustering or batch clustering services.
The function of each layer is described separately below.
Application layer: receives clustering tasks sent by an upper-layer application, for example a task instructing the platform to cluster specified text data, or to perform stream clustering (i.e., the single-pass clustering algorithm) or batch clustering (i.e., the density clustering algorithm) on the data. The application layer can also perform functions such as task management and process distribution.
The application layer includes a master-slave backup module, which performs master-slave backup of the clustering result and realizes data interaction with the data mining layer, the computing layer, and the data layer. It can also perform task-queue management, i.e., managing the execution-state data of the tasks in the queue, and the like.
The application layer also has a timed-task function, used for regularly reclaiming leaked memory, cleaning up zombie processes left by abnormal exits, cleaning the system's logs and cache data on the hard disk, and regularly monitoring or restarting daemon processes, among other things.
The application layer also has a multi-process management function: according to the task allocation results of task scheduling, subsequent computation can be performed in multiple processes, for example clustering different text data in different processes, as shown in fig. 6, and the process states and reclaimed resources can be managed in real time.
Data mining layer: has a keyword extraction function, responsible for extracting keywords of each cluster in the final clustering result; a key sentence extraction function, responsible for extracting key sentences of each cluster in the final clustering result; and a text deduplication function, responsible for deduplicating similar texts in each cluster of the final clustering result, facilitating subsequent data mining, data retrieval, and the like.
Computing layer: used for executing the data clustering method, realizing stream clustering and/or batch clustering of the data. During batch clustering, the clustering computation process can cluster the data separately according to the requirements of the data and the service: for example, after primary clustering, batch clustering can be performed in real time, or after a period of time, with stream clustering and batch clustering running in different processes.
Feature representation layer: used for vectorizing the text data, e.g., converting the text data into high-dimensional feature vectors.
Preprocessing layer: used for preprocessing the text data, including language classification, word segmentation, filtering out interfering text, feature selection, format conversion, and the like.
Data layer: used for receiving data from outside, saving data, and supporting data queries; it can use Kafka and a pre-constructed knowledge base for data storage. The externally received data may include data streams, data sets, and the like. A data set is suitable for cases where the data has a time sequence and the set is small; a data stream is suitable for a real-time monitoring system, where the data is time-stamped and changes rapidly, arrives continuously, and is huge in volume.
It can be understood that in practical application, the functions of each layer may be flexibly divided, and the functions of each layer may be flexibly increased or decreased according to the situation, and the division of each layer of the data clustering platform is not limited to the above-described layers, and some layers may also be flexibly increased or decreased according to the needs.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a data clustering apparatus 200 according to an embodiment of the present application, where the apparatus 200 may be a module, a program segment, or code on an electronic device. It should be understood that the apparatus 200 corresponds to the method embodiment of fig. 2 above and can perform the steps involved in that method embodiment; for the specific functions of the apparatus 200, reference may be made to the foregoing description, which is not repeated in detail here to avoid redundancy.
Optionally, the apparatus 200 comprises:
a high-dimensional vector obtaining module 210, configured to obtain a high-dimensional feature vector corresponding to text data to be clustered, where the high-dimensional feature vector includes semantic features of multiple dimensions;
the primary clustering module 220 is configured to perform primary clustering on the text data according to the semantic features, and determine a cluster label corresponding to the high-dimensional feature vector according to a primary clustering result;
a data dimension reduction module 230, configured to replace the semantic features of multiple dimensions in the high-dimensional feature vector with a one-dimensional cluster label, so as to form a low-dimensional feature vector;
and a secondary clustering module 240, configured to perform clustering again on the low-dimensional feature vectors to obtain a final clustering result.
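The dimension-reduction step performed by module 230 can be sketched in a few lines: every multi-dimensional semantic vector is swapped for the single-dimension cluster label assigned by primary clustering, so the secondary clustering operates on one-dimensional points rather than high-dimensional ones. This is a sketch under stated assumptions, not the claimed implementation:

```python
def reduce_dimension(high_dim_vectors, cluster_labels):
    """Replace each D-dimensional semantic vector with the one-dimensional
    cluster label obtained from the primary clustering result."""
    assert len(high_dim_vectors) == len(cluster_labels)
    return [[float(label)] for label in cluster_labels]

high = [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1], [0.9, 0.0, 0.3]]
labels = [0, 0, 1]       # from primary clustering
low = reduce_dimension(high, labels)   # [[0.0], [0.0], [1.0]]
```

The point of the substitution is cost: distance computations in the secondary clustering drop from O(D) per pair to O(1), which is where the claimed efficiency gain comes from.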
Optionally, the primary clustering module 220 is configured to calculate a first similarity between the high-dimensional feature vector and a potential cluster in a cluster-like set formed by clustering historical text data, where the potential cluster is a cluster in which the number of texts in the cluster-like set is greater than a preset number; determining a target potential cluster, wherein the first similarity of the target potential cluster and the high-dimensional feature vector is the largest and is greater than a first preset similarity; clustering the text data into the target potential cluster, and determining a cluster-like label of the target potential cluster as a cluster-like label corresponding to the high-dimensional feature vector.
Optionally, the primary clustering module 220 is configured to calculate a first similarity between the high-dimensional feature vector and a potential cluster in a cluster set formed by clustering historical text data, where the potential cluster is a cluster in which the number of texts in the cluster set is greater than a preset number; if no potential cluster with the first similarity larger than a first preset similarity exists, calculating a second similarity between the high-dimensional feature vector and an isolated cluster in the cluster set, wherein the isolated cluster is a cluster in which the number of texts in the cluster set is smaller than or equal to the preset number; determining a target isolated cluster, wherein the second similarity between the target isolated cluster and the high-dimensional feature vector is the largest and is greater than a second preset similarity; clustering the text data into the target isolated cluster, and determining a cluster label of the target isolated cluster as a cluster label corresponding to the high-dimensional feature vector.
Optionally, the primary clustering module 220 is configured to, if there is no isolated cluster with the second similarity being greater than the second preset similarity, take the text data as a new cluster, and determine a cluster-like label of the new cluster as a cluster-like label corresponding to the high-dimensional feature vector.
Optionally, the primary clustering module 220 is configured to determine the target isolated cluster as a potential cluster if the number of texts of the target isolated cluster is greater than the preset number.
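Taken together, the assignment logic of the primary clustering module 220 described above (try the most similar potential cluster, fall back to isolated clusters, otherwise open a new cluster) might be sketched as follows. Cosine similarity and all thresholds are illustrative assumptions, and the promotion of an isolated cluster to a potential cluster falls out of recomputing the count-based split on each call:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def assign(vector, clusters, min_texts=3, sim_potential=0.8, sim_isolated=0.6):
    """Single-pass assignment of one high-dimensional vector.

    clusters: {label: {"center": vec, "count": n}}.  Thresholds stand in for
    the patent's 'first/second preset similarity' and 'preset number'.
    Center updating is deferred to the separate update step.
    """
    potential = {l: c for l, c in clusters.items() if c["count"] > min_texts}
    isolated = {l: c for l, c in clusters.items() if c["count"] <= min_texts}

    # 1. Try the most similar potential cluster first.
    best = max(potential, key=lambda l: cosine(vector, potential[l]["center"]), default=None)
    if best is not None and cosine(vector, potential[best]["center"]) > sim_potential:
        clusters[best]["count"] += 1
        return best
    # 2. Fall back to the most similar isolated cluster.
    best = max(isolated, key=lambda l: cosine(vector, isolated[l]["center"]), default=None)
    if best is not None and cosine(vector, isolated[best]["center"]) > sim_isolated:
        clusters[best]["count"] += 1
        return best
    # 3. Otherwise the text starts a new cluster of its own.
    label = max(clusters, default=-1) + 1
    clusters[label] = {"center": list(vector), "count": 1}
    return label
```

Because the potential/isolated split is recomputed from the current counts, an isolated cluster whose count climbs past `min_texts` is treated as a potential cluster on the next call, matching the promotion described above.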
Optionally, the primary clustering module 220 is further configured to update a plurality of clusters in the primary clustering result;
wherein the update process includes at least one of: updating the cluster centers of clusters whose data volume is greater than a first number; merging clusters whose similarity is greater than a preset similarity and recalculating the cluster center of the merged cluster; and deleting clusters whose data volume is smaller than a second number.
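A minimal sketch of this update process, assuming clusters are kept as lists of member vectors and using illustrative stand-ins for the "first number", "preset similarity", and "second number":

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def update(clusters, first_n=10, merge_sim=0.95, second_n=2):
    """clusters: {label: list of member vectors}.  Applies the three
    maintenance steps named above; all thresholds are hypothetical."""
    # 1. Recompute the centers of clusters larger than first_n.
    centers = {l: centroid(v) for l, v in clusters.items() if len(v) > first_n}
    # 2. Merge pairs of those clusters whose centers are near-identical.
    labels = sorted(centers)
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if a in clusters and b in clusters and cosine(centers[a], centers[b]) > merge_sim:
                clusters[a].extend(clusters.pop(b))
    # 3. Drop clusters that stayed smaller than second_n.
    for l in [l for l, v in clusters.items() if len(v) < second_n]:
        del clusters[l]
    return clusters
```

Running the update periodically keeps the cluster set compact, which bounds the similarity computations of later assignments.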
Optionally, the secondary clustering module 240 is configured to obtain the text data in each cluster in the primary clustering result, and to perform density clustering on the low-dimensional feature vectors corresponding to the text data in each cluster to obtain the final clustering result.
Optionally, the secondary clustering module 240 is configured to obtain the low-dimensional feature vector corresponding to the cluster center of each cluster in the primary clustering result, and to perform density clustering on the low-dimensional feature vectors corresponding to the cluster centers of all the clusters to obtain the final clustering result.
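For the variant that density-clusters the one-dimensional feature values, a DBSCAN-style grouping on sorted values suffices as a sketch; `eps` and `min_pts` are illustrative parameters, not values from the patent:

```python
from collections import Counter

def density_cluster_1d(values, eps=1.0, min_pts=2):
    """DBSCAN-style grouping of one-dimensional feature values: a gap wider
    than eps between consecutive sorted values starts a new group, and groups
    with fewer than min_pts members are demoted to noise (-1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [-1] * len(values)
    current, prev = -1, None
    for idx in order:
        v = values[idx]
        if prev is None or v - prev > eps:
            current += 1            # a gap wider than eps starts a new group
        labels[idx] = current
        prev = v
    sizes = Counter(labels)
    return [l if sizes[l] >= min_pts else -1 for l in labels]

centers = [0.0, 0.5, 1.0, 10.0, 10.2, 50.0]   # e.g. 1-D values for cluster centers
print(density_cluster_1d(centers))             # [0, 0, 0, 1, 1, -1]
```

A full multi-dimensional DBSCAN (as in scikit-learn) would serve equally; the 1-D form just makes the density idea visible.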
It should be noted that, for the convenience and simplicity of description, the specific working process of the above-described device may refer to the corresponding process in the foregoing method embodiment, and the description is not repeated here.
Embodiments of the present application provide a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the method processes performed by an electronic device in the method embodiment shown in fig. 2.
The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above method embodiments, for example, comprising: acquiring a high-dimensional feature vector corresponding to text data to be clustered, wherein the high-dimensional feature vector includes semantic features of multiple dimensions; performing primary clustering on the text data according to the semantic features, and determining a cluster label corresponding to the high-dimensional feature vector according to the primary clustering result; replacing the semantic features of multiple dimensions in the high-dimensional feature vector with a one-dimensional cluster label to form a low-dimensional feature vector; and clustering the low-dimensional feature vectors again to obtain a final clustering result.
In summary, the embodiments of the present application provide a data clustering method and apparatus, an electronic device, and a readable storage medium. After the text data is primarily clustered, the semantic features of multiple dimensions in the high-dimensional feature vector are replaced with a one-dimensional cluster label according to the primary clustering result, thereby realizing dimension reduction of the data; the reduced-dimension data is then clustered again. This effectively reduces the amount of computation in the re-clustering process and improves clustering efficiency, while the re-clustering ensures the clustering effect, so this implementation balances clustering efficiency against clustering precision.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (11)

1. A method of clustering data, the method comprising:
acquiring a high-dimensional feature vector corresponding to text data to be clustered, wherein the high-dimensional feature vector comprises semantic features of multiple dimensions;
performing primary clustering on the text data according to semantic features, and determining cluster-like labels corresponding to the high-dimensional feature vectors according to primary clustering results;
replacing the semantic features of multiple dimensions in the high-dimensional feature vector with a one-dimensional cluster label to form a low-dimensional feature vector;
and clustering the low-dimensional feature vectors again to obtain a final clustering result.
2. The method of claim 1, wherein performing primary clustering on the text data according to the semantic features and determining the cluster label corresponding to the high-dimensional feature vector according to the primary clustering result comprises:
calculating first similarity of the high-dimensional feature vector and potential clusters in a cluster-like set formed by historical text data clustering, wherein the potential clusters are clusters in which the number of texts in the cluster-like set is larger than a preset number;
determining a target potential cluster, wherein the first similarity between the target potential cluster and the high-dimensional feature vector is maximum and is greater than a first preset similarity;
clustering the text data into the target potential cluster, and determining a cluster-like label of the target potential cluster as a cluster-like label corresponding to the high-dimensional feature vector.
3. The method of claim 1, wherein performing primary clustering on the text data according to the semantic features and determining the cluster label corresponding to the high-dimensional feature vector according to the primary clustering result comprises:
calculating first similarity of the high-dimensional feature vector and potential clusters in a cluster-like set formed by historical text data clustering, wherein the potential clusters are clusters in which the number of texts in the cluster-like set is larger than a preset number;
if no potential cluster with the first similarity larger than a first preset similarity exists, calculating a second similarity between the high-dimensional feature vector and an isolated cluster in the cluster set, wherein the isolated cluster is a cluster in which the number of texts in the cluster set is smaller than or equal to the preset number;
determining a target isolated cluster, wherein the second similarity between the target isolated cluster and the high-dimensional feature vector is the largest and is greater than a second preset similarity;
clustering the text data into the target isolated cluster, and determining a cluster label of the target isolated cluster as a cluster label corresponding to the high-dimensional feature vector.
4. The method of claim 3, wherein after calculating the second similarity between the high-dimensional feature vector and the isolated cluster in the cluster-like set, further comprising:
if no isolated cluster with the second similarity larger than the second preset similarity exists, the text data is used as a new cluster, and the cluster-like label of the new cluster is determined as the cluster-like label corresponding to the high-dimensional feature vector.
5. The method of claim 3, wherein after clustering the text data into the target isolated clusters, further comprising:
and if the text quantity of the target isolated cluster is larger than the preset quantity, determining the target isolated cluster as a potential cluster.
6. The method of claim 1, wherein, after primary clustering is performed on the text data according to the semantic features and a primary clustering result is obtained, the method further comprises:
updating a plurality of clusters in the primary clustering result;
wherein the update process includes at least one of: updating the cluster centers of clusters whose data volume is greater than a first number; merging at least two clusters whose similarity is greater than a preset similarity and recalculating the cluster center of the merged cluster; and deleting clusters whose data volume is smaller than a second number.
7. The method according to claim 1, wherein the clustering the low-dimensional feature vectors again to obtain a final clustering result comprises:
acquiring text data in each cluster in the primary clustering result;
and performing density clustering on the low-dimensional feature vectors corresponding to the text data in each cluster to obtain a final clustering result.
8. The method according to claim 1, wherein clustering the low-dimensional feature vectors again to obtain a final clustering result comprises:
acquiring low-dimensional feature vectors corresponding to cluster centers of all clusters in the primary clustering result;
and performing density clustering on the low-dimensional feature vectors corresponding to the cluster centers of all the clusters to obtain a final clustering result.
9. An apparatus for clustering data, the apparatus comprising:
the high-dimensional vector acquisition module is used for acquiring a high-dimensional feature vector corresponding to the text data to be clustered, wherein the high-dimensional feature vector comprises semantic features with multiple dimensions;
the primary clustering module is used for carrying out primary clustering on the text data according to semantic features and determining cluster-like labels corresponding to the high-dimensional feature vectors according to primary clustering results;
the data dimension reduction module is used for replacing the semantic features of multiple dimensions in the high-dimensional feature vector with a one-dimensional cluster label to form a low-dimensional feature vector;
and the secondary clustering module is used for clustering the low-dimensional characteristic vectors again to obtain a final clustering result.
10. An electronic device comprising a processor and a memory, the memory storing computer readable instructions that, when executed by the processor, perform the method of any of claims 1-8.
11. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-8.
CN202110352960.4A 2021-03-31 2021-03-31 Data clustering method and device, electronic equipment and readable storage medium Pending CN115146692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110352960.4A CN115146692A (en) 2021-03-31 2021-03-31 Data clustering method and device, electronic equipment and readable storage medium


Publications (1)

Publication Number Publication Date
CN115146692A true CN115146692A (en) 2022-10-04

Family

ID=83404896




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 332, 3 / F, Building 102, 28 xinjiekouwei street, Xicheng District, Beijing 100088

Applicant after: Qianxin Technology Group Co.,Ltd.

Applicant after: Qianxin Wangshen information technology (Beijing) Co.,Ltd.

Address before: Room 332, 3 / F, Building 102, 28 xinjiekouwei street, Xicheng District, Beijing 100088

Applicant before: Qianxin Technology Group Co.,Ltd.

Applicant before: LEGENDSEC INFORMATION TECHNOLOGY (BEIJING) Inc.
