CN117633561A - Text clustering method, system, electronic equipment and medium - Google Patents

Text clustering method, system, electronic equipment and medium

Info

Publication number
CN117633561A
CN117633561A
Authority
CN
China
Prior art keywords
text
clustered
clustering
data points
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410095706.4A
Other languages
Chinese (zh)
Inventor
王本强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mido Technology Co ltd
Original Assignee
Shanghai Mido Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Mido Technology Co ltd filed Critical Shanghai Mido Technology Co ltd
Priority to CN202410095706.4A
Publication of CN117633561A
Pending legal-status Critical Current

Links

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a text clustering method, a text clustering system, electronic equipment and a medium, wherein the method comprises the following steps: performing contrastive learning on the text to be clustered to extract text data points to be clustered; constructing low-dimensional data points based on the text data points to be clustered; performing hierarchical density aggregation on the low-dimensional data points based on Euclidean distance to obtain data clusters; and acquiring a clustering result of the text to be clustered based on the density of the data clusters. The method characterizes text with contrastive learning, enabling semantic understanding of complex sentences, and automatically determines the number of topics in the text data, yielding topics of high quality and high consistency. It also handles long text effectively, accelerates the clustering model without sacrificing effectiveness, and can be applied to large-scale text clustering through parallel GPU computation, so the processing efficiency is high.

Description

Text clustering method, system, electronic equipment and medium
Technical Field
The application belongs to the technical field of text processing, and relates to a text clustering method, a system, electronic equipment and a medium.
Background
Text clustering is an unsupervised machine learning method based mainly on the clustering assumption: documents of the same class are more similar to each other, while documents of different classes are less similar. It requires neither a training process nor manual labeling of document categories in advance, so it offers flexibility and a high degree of automation, and it is widely applied in information retrieval, document management, social media analysis, public opinion analysis, recommendation systems and other fields. It helps organize and understand large-scale text data and discover patterns and insights hidden in the data.
At present, mainstream text clustering follows two main technical routes: the LDA-based generative probability model and K-Means-based distance clustering. LDA text clustering is a machine learning method for analyzing text data; it helps reveal latent structures and topics in the text and provides useful information for further analysis and mining. However, LDA requires the number of topics to be specified in advance, and it treats text as a bag-of-words model, ignoring word order; bag-of-words methods are weak at characterizing sentences, so LDA may be unsuitable for sentence-level text clustering tasks. K-Means text clustering is an unsupervised machine learning method that can help discover hidden structures and topics in text data and benefits applications such as text classification and information retrieval. However, K-Means also requires the number of topics to be specified in advance, and it is sensitive to the choice of initial cluster centers and to outliers, so the algorithm must be run multiple times and the best result selected, and the clustering results are unstable.
Disclosure of Invention
The application provides a text clustering method, a text clustering system, electronic equipment and a medium, to solve the technical problems in the prior art that the number of topics must be specified in advance and that clustering efficiency is low.
In a first aspect, the present application provides a text clustering method, the method including: acquiring a text to be clustered; performing contrastive learning based on the text to be clustered to extract text data points to be clustered; constructing low-dimensional data points based on the text data points to be clustered; performing hierarchical density aggregation on the low-dimensional data points based on Euclidean distance to obtain data clusters; and acquiring a clustering result of the text to be clustered based on the density of the data clusters.
In one implementation manner of the first aspect, performing contrastive learning based on the text to be clustered to extract the text data points to be clustered includes: obtaining a trained contrastive learning model; and acquiring sentence vectors of the text to be clustered based on the contrastive learning model to serve as text data points to be clustered.
In one implementation manner of the first aspect, obtaining the trained contrastive learning model includes: acquiring first feature vectors and second feature vectors of all sentences in the training text based on a bert model; constructing a positive and negative example data set based on all the first feature vectors and the second feature vectors; and optimizing the contrastive loss based on the positive and negative example data set to obtain a trained contrastive learning model.
In one implementation of the first aspect, constructing low-dimensional data points based on the text data points to be clustered includes: calculating Euclidean distances among the text data points to be clustered to construct a distance matrix; constructing a neighbor graph of the text data points to be clustered based on the distance matrix; fuzzifying the neighbor graph; and minimizing the difference of the text data points to be clustered between a high-dimensional space and a low-dimensional space to construct the low-dimensional data points.
In one implementation of the first aspect, performing hierarchical density aggregation based on the low-dimensional data points to obtain data clusters includes: calculating the core distances between the low-dimensional data points and their neighbor points to construct a minimum spanning tree, the core distance being a Euclidean distance; and running a preset hierarchical clustering algorithm based on the minimum spanning tree to obtain a clustering feature tree.
In an implementation manner of the first aspect, obtaining the clustering result of the text to be clustered based on the density of the data clusters includes: screening out the data clusters whose density is greater than a preset value to serve as the clustering result.
In one implementation manner of the first aspect, acquiring the text to be clustered includes performing data cleaning on the text to be clustered.
In a second aspect, the present application provides a text clustering system, comprising: the acquisition module is used for acquiring texts to be clustered; the contrastive learning module is used for carrying out contrastive learning based on the text to be clustered so as to extract text data points to be clustered; the construction module is used for constructing low-dimensional data points based on the text data points to be clustered; the hierarchical aggregation module is used for performing hierarchical density aggregation on the low-dimensional data points based on Euclidean distance to acquire data clusters; and the clustering module is used for acquiring a clustering result of the text to be clustered based on the density of the data clusters.
In a third aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the text clustering method of the first aspect of the present application.
In a fourth aspect, the present application provides an electronic device, comprising: a memory configured to store a computer program; and a processor communicatively coupled to the memory, the processor configured to invoke the computer program to perform the text clustering method of the first aspect of the present application.
The text clustering method, system, electronic equipment and medium have the following beneficial effects:
first, the present application does not need to specify the number of topics in advance, and can automatically determine the number of topics in text data.
Secondly, the application uses the contextual understanding and language characterization capabilities of the contrastive learning model to capture deep semantic structures in text data, so the generated topics are generally of high quality and high consistency.
Third, the present application does not depend on bag-of-words models or word frequency information and can process long text data more effectively.
Finally, the processing efficiency is high: the clustering model is accelerated without sacrificing effectiveness, and the method can be applied to large-scale text clustering through parallel GPU computation.
Drawings
Fig. 1 is a schematic flow chart of a text clustering method according to an embodiment of the present application.
Fig. 2 is a schematic flow chart of a text clustering method according to an embodiment of the present application.
Fig. 3 is a schematic flow chart of a text clustering method according to an embodiment of the present application.
FIG. 4 is a schematic diagram of a simcse model according to an embodiment of the present application.
Fig. 5 is a schematic flow chart of a text clustering method according to an embodiment of the present application.
Fig. 6 is a schematic flow chart of a text clustering method according to an embodiment of the present application.
Fig. 7 is a schematic diagram of a text clustering method according to an embodiment of the present application.
Fig. 8 is a schematic diagram of a text clustering system according to an embodiment of the present application.
Fig. 9 is a schematic diagram of an architecture of an electronic device according to an embodiment of the disclosure.
Detailed Description
Other advantages and effects of the present application will become apparent to those skilled in the art from the present disclosure, when the following description of the embodiments is taken in conjunction with the accompanying drawings. The present application may be embodied or carried out in other specific embodiments, and the details of the present application may be modified or changed from various points of view and applications without departing from the spirit of the present application. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the application. The drawings show only the components related to the application, rather than the number, shape and size of components in an actual implementation; in practice the form, number and proportion of the components may vary arbitrarily, and the component layout may be more complex.
The embodiments of the application provide a text clustering method, a system, electronic equipment and a medium, which can automatically determine the number of topics in text data and generate topics of high quality and high consistency. Meanwhile, the method and the device do not depend on bag-of-words models or word frequency information, can process long text data more effectively, accelerate the clustering model without sacrificing effectiveness, and can be applied to large-scale text clustering through parallel GPU computation.
Referring to fig. 1, the text clustering method provided in one embodiment of the present application includes the following steps S1 to S5:
s1: and obtaining the text to be clustered.
Specifically, obtaining the text to be clustered includes cleaning data of the text to be clustered.
Specifically, given a text to be clustered, duplicate text and outlier text are removed from the text to be clustered based on the python or ETL tool. And then, removing useless character strings, such as @ and expressions, on the basis of the regular expression and the rule for the removed text.
It should be noted that, the abnormal data includes blank text, messy code text, overlong text or too short text.
The ETL, which is referred to as Extract-Transform-Load, refers to a process of extracting (Extract), converting (Transform), and loading (Load) a large amount of original data into a target storage data warehouse.
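Purely as an illustration of this cleaning step (not part of the claimed method), a minimal Python sketch might look as follows; the column name, length thresholds, and regular expressions are assumptions for demonstration and would be adapted to the actual corpus.

```python
import re
import pandas as pd

def clean_corpus(texts, min_len=5, max_len=2000):
    """Remove duplicate/abnormal texts and strip useless strings (illustrative sketch)."""
    df = pd.DataFrame({"text": texts})
    df = df.drop_duplicates(subset="text")              # remove duplicate texts
    df["text"] = df["text"].fillna("").str.strip()
    # strip @-mentions and a rough emoji range (illustrative regexes, not the application's rules)
    df["text"] = df["text"].apply(lambda t: re.sub(r"@\S+", "", t))
    df["text"] = df["text"].apply(lambda t: re.sub(r"[\U0001F300-\U0001FAFF]", "", t))
    # drop blank, overly short or overly long texts
    lengths = df["text"].str.len()
    df = df[(lengths >= min_len) & (lengths <= max_len)]
    return df["text"].tolist()

cleaned = clean_corpus(["hello @user", "hello @user", ""])  # toy example
```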
S2: and performing contrast learning based on the text to be clustered to extract text data points to be clustered.
Specifically, as shown in fig. 2, step S2 includes steps S21 and S22.
S21: and obtaining a trained contrast learning model.
Specifically, step S2 uses the context understanding and language characterization capability of the comparison learning model to capture deep semantic structures in text data, so as to realize semantic understanding of complex sentences.
In one embodiment, a simcse model is selected against the learning model. simcse is a method for learning sentence embedding using contrast learning, which improves the quality of sentence representations by minimizing the embedding distance of similar sentences, thereby achieving good performance in various text-related tasks.
Specifically, as shown in fig. 3, step S21 includes steps S211 to S213.
S211: and acquiring first characteristic vectors and second characteristic vectors corresponding to all sentences in the training text based on the bert model.
Specifically, the same sentence in the training text is input to the bert model twice, and a dropout method is utilized to randomly ignore a part of neurons in the bert model network, so that two different feature vectors, namely a first feature vector and a second feature vector, are obtained.
It should be noted that dropout is a method for training deep neural networks. It significantly reduces overfitting by omitting half of the feature detectors (setting half of the hidden node values to 0) in each training batch. This reduces the co-adaptation between feature detectors (hidden nodes), achieving a degree of regularization.
It should be noted that the bert model is a pre-trained language characterization model. Unlike earlier approaches that pre-train with a conventional unidirectional language model or a shallow concatenation of two unidirectional language models, it uses a masked language model (MLM) to enable deep bidirectional language characterization. The model has the following main advantages:
1) The bidirectional Transformer is pre-trained with MLM to generate deep bidirectional language representations.
2) After pre-training, only one extra output layer needs to be added for fine-tuning to obtain state-of-the-art performance on a variety of downstream tasks, without task-specific structural modifications to BERT.
The structure of earlier pre-trained models was limited by unidirectional language models (left-to-right or right-to-left), which restricts their representation capability because only unidirectional context information can be captured. BERT is pre-trained with MLM and builds the entire model from deep bidirectional Transformers (a unidirectional Transformer is commonly called a Transformer decoder, in which each token attends only to the tokens on its left; a bidirectional Transformer is a Transformer encoder, in which each token attends to all tokens), ultimately generating deep bidirectional language representations that fuse left and right context information.
S212: and constructing a positive and negative example data set based on all the first feature vectors and the second feature vectors.
Specifically, in the same training batch, the two outputs (the first feature vector and the second feature vector) of the same sentence in the model are taken as positive examples, and the outputs of other sentences are taken as negative examples, so that a positive and negative example data set is constructed.
S213: and optimizing the comparison loss based on the positive and negative example data sets to obtain a trained comparison learning model.
Specifically, the comparison learning model is trained based on the positive and negative example data sets, the comparison loss is optimized, the similarity between positive examples is increased, the similarity between negative examples is reduced, and the distance between the current sample and the uncorrelated sample is increased.
Specifically, the contrastive loss for the i-th sentence is calculated as follows:
loss_i = -log( exp(sim(h_i, h_i+) / τ) / Σ_{j=1}^{N} exp(sim(h_i, h_j+) / τ) )
where h_i and h_i+ denote the two sentence vectors of the same sentence obtained under different dropout masks, sim denotes the cosine similarity of two vectors, N is the number of sentences in the batch, and τ is a temperature constant used to control how much attention the model pays to difficult samples.
Steps S211 to S213 are repeated to obtain a trained contrastive learning model. In one embodiment, the principle of the simcse model is shown in FIG. 4.
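Purely for illustration, the following minimal Python sketch mirrors steps S211 to S213 in the way the open-source SimCSE recipe is commonly implemented; the backbone name, [CLS] pooling, temperature value, and toy batch are assumptions, not the application's actual configuration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tau = 0.05  # temperature constant τ (assumed value)
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed backbone
model = AutoModel.from_pretrained("bert-base-chinese")
model.train()  # keep dropout active so the two passes over the same sentence differ
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def encode(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0]  # [CLS] pooling (one common choice)

sentences = ["今天天气很好", "文本聚类方法", "明天会下雨"]  # toy training batch
h1 = encode(sentences)   # first feature vectors (first dropout mask)
h2 = encode(sentences)   # second feature vectors (second dropout mask)

# S212/S213: diagonal pairs are positives, other sentences in the batch are negatives
sim = F.cosine_similarity(h1.unsqueeze(1), h2.unsqueeze(0), dim=-1) / tau
labels = torch.arange(sim.size(0))
loss = F.cross_entropy(sim, labels)  # contrastive (InfoNCE-style) loss

loss.backward()
optimizer.step()
optimizer.zero_grad()
```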
S22: and acquiring sentence vectors of the text to be clustered based on the comparison learning model to serve as text data points to be clustered.
Specifically, the text to be clustered obtained in the step S1 is input into a contrast learning model, and sentence vectors of each sentence in the text are obtained to serve as data points of subsequent clustering.
The step S2 utilizes the context understanding and language characterization capability of the contrast learning model to capture the deep semantic structure in the text data so as to characterize the sentence vector, and the step S2 adopts the contrast loss and training method specially aiming at sentence vector characterization to strengthen the sentence vector characterization effect and solve the problem of insufficient traditional sentence vector characterization effect, so that the generated theme is generally high in quality and consistency.
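Continuing the illustrative sketch above (and reusing its assumed model and encode names), step S22 would switch the trained model to evaluation mode and take each sentence vector as a clustering data point:

```python
import torch

model.eval()  # disable dropout so the sentence vectors are deterministic
with torch.no_grad():
    texts_to_cluster = ["待聚类文本一", "待聚类文本二"]  # cleaned texts from step S1
    data_points = encode(texts_to_cluster).cpu().numpy()  # one vector per sentence
print(data_points.shape)  # (num_sentences, hidden_dim)
```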
S3: and constructing a low-dimensional data point based on the text data points to be clustered.
In one embodiment, a vector dimension reduction model UMAP is selected to reduce the dimension of the data points. One of the advantages of the UMAP model over other dimension reduction techniques is that it can better capture global and local relationships while preserving the data structure, and thus perform well in clustering tasks. Secondly, the UMAP model also has parameter adjustability, and can adjust more than ten parameters, so that the UMAP model is adapted to complex and diverse scenes according to the characteristics of data, and the optimal dimension reduction effect is obtained. In addition, the UMAP model can be accelerated by the GPU, so that the UMAP model can be applied to data dimension reduction of millions or more.
Specifically, as shown in fig. 5, step S3 includes steps S31 to S34.
S31: and calculating Euclidean distances among the text data points to be clustered to construct a distance matrix.
In particular, euclidean distance, also known as euclidean distance, is the most common distance measure, which is the absolute distance between two points in a multidimensional space.
S32: and constructing a neighbor graph of the text data points to be clustered based on the distance matrix.
Specifically, a neighbor map is constructed for each data point based on the distance matrix. The neighbor graph represents the similarity between each data point, where each node represents one data point and the edges represent the distance between the data points less than a certain threshold.
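As an illustrative sketch of steps S31 and S32 only (the threshold value and the use of scikit-learn are assumptions), the distance matrix and a thresholded neighbor graph could be built as follows:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def build_neighbor_graph(points, threshold=1.0):
    """points: (n, d) array of text data points; returns the distance matrix and adjacency."""
    dist = pairwise_distances(points, metric="euclidean")              # S31: distance matrix
    adjacency = (dist < threshold) & ~np.eye(len(points), dtype=bool)  # S32: edges below threshold
    return dist, adjacency

points = np.random.rand(5, 8)  # toy stand-in for sentence vectors
dist, adj = build_neighbor_graph(points, threshold=0.8)
```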
S33: and blurring processing is carried out on the neighbor graphs.
Specifically, the edge weights in the neighbor graphs are blurred. In one embodiment, by blurring, the UMAP model can capture diversity among data points, not just simple neighbor relationships.
S34: the difference between the text data points to be clustered between a high dimensional space and a low dimensional space is minimized to construct the low dimensional data points.
Specifically, the topological structure difference between points in the high-dimensional space and points in the low-dimensional space is minimized, so that a corresponding representation of the data is found in the low-dimensional space, and each data point is re-represented with new low-dimensional space data to construct a low-dimensional data point.
Step S3 uses UMAP for dimension reduction and acceleration; it can efficiently map high-dimensional data to a low-dimensional space, so that clustering can be performed on large-scale data, and it can be applied to clustering of large-scale text through parallel GPU computation.
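A minimal usage sketch of the dimension reduction in step S3 is given below; it relies on the open-source umap-learn package, and the parameter values are illustrative rather than the application's settings. For large corpora, a GPU-accelerated implementation such as cuML's UMAP could be substituted.

```python
import numpy as np
import umap  # pip install umap-learn

points = np.random.rand(1000, 768).astype(np.float32)  # toy stand-in for sentence vectors

reducer = umap.UMAP(
    n_neighbors=15,    # size of the local neighborhood used for the neighbor graph
    n_components=5,    # dimensionality of the target low-dimensional space
    min_dist=0.0,      # how tightly points may be packed after reduction
    metric="euclidean",
)
low_dim_points = reducer.fit_transform(points)  # shape: (1000, 5)
```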
S4: and carrying out hierarchical density aggregation on the low-dimensional data points based on the Euclidean distance to acquire a data cluster.
In one embodiment, a clustering model HDBSCAN is selected to cluster data points. The HDBSCAN model has the ability to automatically identify clusters of different densities and mark noise points, making it excellent in processing data sets with complex structures and varying densities. Because the HDBSCAN model is adopted for clustering, samples with short distances can be automatically classified according to the distance of Euclidean distances among data points, so that the samples with long distances are separated, the number of topics in text data can be automatically determined while the clustering effect is enhanced, and the clustering can be performed without pre-specifying the number of topics.
Specifically, as shown in fig. 6, step S4 includes steps S41 and S42.
S41: calculating the core distance between the low-dimensional data point and the neighbor point to construct a minimum spanning tree; the core distance is a Euclidean distance.
Specifically, a positive integer K is defined, the euclidean distance between each low-dimensional data point and K neighbors thereof is calculated as a core distance, and a minimum spanning tree is constructed through Prim algorithm based on the core distance.
It should be noted that the minimum spanning tree is a tree that connects all data points, where the weight of each edge is a function of distance; the tree whose total edge weight is smallest is the minimum spanning tree. It helps capture the density relationship between data points.
It should be noted that the Prim algorithm is a greedy algorithm, which operates on nodes.
Specifically, the Prim algorithm first creates an edge set for storing the result, a node set for recording which nodes have been visited, and a minimum heap of edges. Starting from an initial node, the node is added to the node set and its incident edges are pushed onto the heap. The smallest edge is then popped from the heap; if its target node has not been visited, the edge is added to the spanning tree, the target node is marked as visited, and the edges incident to that node are pushed onto the minimum heap. These operations are repeated until all nodes have been traversed, which yields the minimum spanning tree.
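Purely as an illustration of the Prim procedure described above (operating here on a plain Euclidean distance matrix rather than HDBSCAN's core-distance-based weights), a compact sketch is:

```python
import heapq
import numpy as np

def prim_mst(dist):
    """dist: (n, n) symmetric distance matrix; returns MST edges as (u, v, weight)."""
    n = len(dist)
    visited = [False] * n
    edges, heap = [], []
    visited[0] = True
    for v in range(1, n):
        heapq.heappush(heap, (dist[0][v], 0, v))   # edges leaving the start node
    while heap and len(edges) < n - 1:
        w, u, v = heapq.heappop(heap)              # smallest edge currently in the heap
        if visited[v]:
            continue                               # target node already in the tree
        visited[v] = True
        edges.append((u, v, w))
        for nxt in range(n):
            if not visited[nxt]:
                heapq.heappush(heap, (dist[v][nxt], v, nxt))
    return edges

pts = np.random.rand(6, 2)
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
print(prim_mst(dist))
```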
S42: and running a preset hierarchical clustering algorithm based on the minimum spanning tree to acquire the data cluster.
Specifically, a preset hierarchical clustering algorithm is run on the minimum spanning tree to divide the data points into different clusters. In one embodiment, during hierarchical clustering, edges in the minimum spanning tree are ordered according to weights, and then the edges are connected in sequence to form different subtrees.
In one embodiment, the implementation results of steps S41 to S42 are schematically shown in fig. 7.
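Steps S41 and S42 correspond to what the open-source hdbscan package performs internally; a minimal usage sketch (the parameter values are assumptions, not the application's settings) is:

```python
import numpy as np
import hdbscan  # pip install hdbscan

low_dim_points = np.random.rand(1000, 5)  # toy stand-in for the UMAP output

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=10,   # smallest group of points treated as a cluster
    metric="euclidean",    # core distances are Euclidean, as in step S41
)
labels = clusterer.fit_predict(low_dim_points)  # label -1 marks noise points
print("number of clusters:", labels.max() + 1)
```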
S5: and acquiring a clustering result of the text to be clustered based on the density of the data clusters.
Specifically, the data clusters with the density larger than a preset value are screened to be used as the clustering result.
Specifically, the preset value is a predefined positive integer.
In some embodiments, a positive integer M is defined, and subtrees with a density greater than M are screened out as the final clustering result.
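As a small illustrative continuation of the hdbscan sketch above (it reuses that sketch's labels array; the value of M and the use of cluster size as the density measure are assumptions), the screening of step S5 could look like:

```python
from collections import Counter

M = 20  # assumed preset positive integer
counts = Counter(label for label in labels if label != -1)          # cluster sizes, ignoring noise
kept_clusters = {c for c, size in counts.items() if size > M}       # clusters denser than M
final_result = [(idx, label) for idx, label in enumerate(labels) if label in kept_clusters]
print("clusters kept as the clustering result:", sorted(kept_clusters))
```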
In this method and device, the text is characterized with contrastive learning, which achieves semantic understanding of complex sentences. Meanwhile, the data points are reduced in dimension with acceleration, so high-dimensional data can be efficiently mapped to a low-dimensional space and clustering can be performed on large-scale data. In addition, the method and device can automatically identify clusters of different densities without specifying the number of clusters in advance, and are suitable for processing data with complex density structures.
The protection scope of the text clustering method described in the embodiments of the present application is not limited to the execution sequence of the steps listed in the embodiments, and all the schemes implemented by adding or removing steps and replacing steps according to the principles of the present application in the prior art are included in the protection scope of the present application.
The embodiment of the application also provides a text clustering system, which can realize the text clustering method, but the implementation device of the text clustering method includes but is not limited to the structure of the text clustering system listed in the embodiment, and all structural variations and substitutions of the prior art according to the principles of the application are included in the protection scope of the application.
As shown in fig. 8, the text clustering system provided in this embodiment includes an acquisition module 10, a contrastive learning module 20, a construction module 30, a hierarchical aggregation module 40, and a clustering module 50.
The acquisition module 10 is used for acquiring texts to be clustered; the contrastive learning module 20 is configured to perform contrastive learning based on the text to be clustered to extract text data points to be clustered; the construction module 30 is configured to construct low-dimensional data points based on the text data points to be clustered; the hierarchical aggregation module 40 is configured to perform hierarchical density aggregation on the low-dimensional data points based on the Euclidean distance to obtain data clusters; and the clustering module 50 is configured to obtain a clustering result of the text to be clustered based on the density of the data clusters.
It should be noted that the operation manner of each module may refer to the above, and will not be described herein.
Specifically, the text clustering system further comprises a cleaning module; the cleaning module is used for cleaning data of the text to be clustered.
The application also provides electronic equipment. As shown in fig. 9, the present embodiment provides an electronic apparatus 90, the electronic apparatus 90 including: a memory 901 configured to store a computer program; and a processor 902 communicatively coupled to the memory 901 and configured to invoke the computer program to perform the method of text clustering.
The memory 901 includes: a ROM (Read-Only Memory), a RAM (Random Access Memory), a magnetic disk, a USB flash drive, a memory card, an optical disk, or the like.
The processor 902 is connected to the memory 901, and is configured to execute a computer program stored in the memory 901, so that the electronic device performs the text clustering method described above.
Preferably, the processor 902 may be a general-purpose processor, including a central processing unit (Central Processing Unit, abbreviated as CPU), a network processor (Network Processor, abbreviated as NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field Programmable Gate Array, FPGA for short) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, or methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules/units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or units may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules or units, which may be in electrical, mechanical or other forms.
The modules/units illustrated as separate components may or may not be physically separate, and components shown as modules/units may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules/units may be selected according to actual needs to achieve the purposes of the embodiments of the present application. For example, functional modules/units in various embodiments of the present application may be integrated into one processing module, or each module/unit may exist alone physically, or two or more modules/units may be integrated into one module/unit.
Those of ordinary skill would further appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Embodiments of the present application also provide a computer-readable storage medium. Those of ordinary skill in the art will appreciate that all or part of the steps in a method implementing the above embodiments may be implemented by a program to instruct a processor, where the program may be stored in a computer readable storage medium, where the storage medium is a non-transitory (non-transitory) medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof. The storage media may be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
Embodiments of the present application may also provide a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, or data center to another website, computer, or data center by a wired (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.).
The computer program product is executed by a computer, which performs the method according to the preceding method embodiment. The computer program product may be a software installation package, which may be downloaded and executed on a computer in case the aforementioned method is required.
Each of the processes or structures corresponding to the drawings is described with its own emphasis; for the parts of a process or structure that are not described in detail, reference may be made to the related descriptions of other processes or structures.
The foregoing embodiments are merely illustrative of the principles of the present application and their effectiveness, and are not intended to limit the application. Modifications and variations may be made to the above-described embodiments by those of ordinary skill in the art without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications and variations which may be accomplished by persons skilled in the art without departing from the spirit and technical spirit of the disclosure be covered by the claims of this application.

Claims (10)

1. A text clustering method, comprising:
acquiring a text to be clustered;
performing contrastive learning based on the text to be clustered to extract text data points to be clustered;
constructing low-dimensional data points based on the text data points to be clustered;
performing hierarchical density aggregation on the low-dimensional data points based on Euclidean distance to obtain a data cluster;
and acquiring a clustering result of the text to be clustered based on the density of the data clusters.
2. The text clustering method of claim 1, wherein performing contrastive learning based on the text to be clustered to extract the text data points to be clustered comprises:
obtaining a trained contrastive learning model;
and acquiring sentence vectors of the text to be clustered based on the contrastive learning model to serve as text data points to be clustered.
3. The text clustering method of claim 2, wherein obtaining the trained contrastive learning model comprises:
acquiring first feature vectors and second feature vectors corresponding to all sentences in the training text based on a bert model;
constructing a positive and negative example data set based on all the first feature vectors and the second feature vectors;
and optimizing the contrastive loss based on the positive and negative example data set to obtain a trained contrastive learning model.
4. The text clustering method of claim 1, wherein constructing low-dimensional data points based on the text data points to be clustered comprises:
calculating Euclidean distances among the text data points to be clustered to construct a distance matrix;
constructing a neighbor graph of the text data points to be clustered based on the distance matrix;
fuzzifying the neighbor graph;
the difference between the text data points to be clustered between a high dimensional space and a low dimensional space is minimized to construct the low dimensional data points.
5. The text clustering method of claim 1, wherein performing hierarchical density clustering on the low-dimensional data points based on euclidean distance to obtain data clusters comprises:
calculating the core distance between the low-dimensional data point and the neighbor point to construct a minimum spanning tree; the core distance is a Euclidean distance;
and running a preset hierarchical clustering algorithm based on the minimum spanning tree to acquire the data cluster.
6. The text clustering method of claim 1, wherein obtaining the clustering result of the text to be clustered based on the density of the data clusters comprises:
and screening the data clusters with the density larger than a preset value to serve as the clustering result.
7. The text clustering method of claim 1, wherein obtaining text to be clustered further comprises data cleaning the text to be clustered.
8. A text clustering system, comprising:
the acquisition module is used for acquiring texts to be clustered;
the contrastive learning module is used for carrying out contrastive learning based on the text to be clustered so as to extract text data points to be clustered;
the construction module is used for constructing low-dimensional data points based on the text data points to be clustered;
the hierarchical aggregation module is used for performing hierarchical density aggregation on the low-dimensional data points based on Euclidean distance to acquire a data cluster;
and the clustering module is used for acquiring a clustering result of the text to be clustered based on the density of the data clusters.
9. A computer-readable storage medium having stored thereon a computer program, characterized by: the computer program, when executed by a processor, implements the text clustering method of any one of claims 1 to 7.
10. An electronic device, the electronic device comprising:
a memory storing a computer program;
a processor, communicatively connected to the memory, configured to invoke the computer program to perform the text clustering method of any one of claims 1 to 7.
CN202410095706.4A 2024-01-24 2024-01-24 Text clustering method, system, electronic equipment and medium Pending CN117633561A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410095706.4A CN117633561A (en) 2024-01-24 2024-01-24 Text clustering method, system, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410095706.4A CN117633561A (en) 2024-01-24 2024-01-24 Text clustering method, system, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN117633561A true CN117633561A (en) 2024-03-01

Family

ID=90034214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410095706.4A Pending CN117633561A (en) 2024-01-24 2024-01-24 Text clustering method, system, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN117633561A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353028A (en) * 2020-02-20 2020-06-30 支付宝(杭州)信息技术有限公司 Method and device for determining customer service call cluster
CN113918712A (en) * 2021-09-02 2022-01-11 阿里巴巴达摩院(杭州)科技有限公司 Data processing method and device
CN115470344A (en) * 2022-08-24 2022-12-13 西南财经大学 Video barrage and comment theme fusion method based on text clustering
CN115526236A (en) * 2022-09-01 2022-12-27 浙江大学 Text network graph classification method based on multi-modal comparative learning
CN116186259A (en) * 2023-01-06 2023-05-30 上海销氪信息科技有限公司 Session cue scoring method, device, equipment and storage medium
CN117113982A (en) * 2023-03-31 2023-11-24 河海大学 Big data topic analysis method based on embedded model
CN117435685A (en) * 2023-08-29 2024-01-23 中国工商银行股份有限公司 Document retrieval method, document retrieval device, computer equipment, storage medium and product

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHONGTAO CHEN: "ClusTop: An unsupervised and integrated text clustering and topic extraction framework", ARXIV:2301.00818, 3 January 2023 (2023-01-03), pages 3 *

Similar Documents

Publication Publication Date Title
Liu et al. Tcgl: Temporal contrastive graph for self-supervised video representation learning
Deng et al. Two-stream deep hashing with class-specific centers for supervised image search
Song et al. Unified binary generative adversarial network for image retrieval and compression
Zhu et al. Robust joint graph sparse coding for unsupervised spectral feature selection
Isola et al. Learning visual groups from co-occurrences in space and time
CN110059181B (en) Short text label method, system and device for large-scale classification system
Gattupalli et al. Weakly supervised deep image hashing through tag embeddings
CN111667022A (en) User data processing method and device, computer equipment and storage medium
CN107357895B (en) Text representation processing method based on bag-of-words model
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
Zhang et al. Hashgan: Attention-aware deep adversarial hashing for cross modal retrieval
CN111125469A (en) User clustering method and device for social network and computer equipment
Atashgahi et al. Quick and robust feature selection: the strength of energy-efficient sparse training for autoencoders
CN112906873A (en) Graph neural network training method and device, electronic equipment and storage medium
Furht et al. Deep learning techniques in big data analytics
CN112632984A (en) Graph model mobile application classification method based on description text word frequency
CN115795065A (en) Multimedia data cross-modal retrieval method and system based on weighted hash code
Zhao et al. Learning relevance restricted Boltzmann machine for unstructured group activity and event understanding
EP4285281A1 (en) Annotation-efficient image anomaly detection
Kazemi et al. FEM-DBSCAN: AN efficient density-based clustering approach
Fonseca et al. Research trends and applications of data augmentation algorithms
CN117633561A (en) Text clustering method, system, electronic equipment and medium
Weng et al. Random VLAD based deep hashing for efficient image retrieval
Ghashami et al. Binary coding in stream
Schmitt et al. Outlier detection on semantic space for sentiment analysis with convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination