CN114168729A - Text clustering system, method, device, equipment and medium - Google Patents

Text clustering system, method, device, equipment and medium Download PDF

Info

Publication number
CN114168729A
Authority
CN
China
Prior art keywords
clustering
texts
text
similarity
clustering result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010947082.6A
Other languages
Chinese (zh)
Inventor
段新宇
秦善夫
卢栋才
王喆锋
怀宝兴
袁晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd filed Critical Huawei Cloud Computing Technologies Co Ltd
Priority to CN202010947082.6A priority Critical patent/CN114168729A/en
Priority to PCT/CN2021/117691 priority patent/WO2022053018A1/en
Publication of CN114168729A publication Critical patent/CN114168729A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application provides a text clustering system that includes a clustering device and an interaction device. The interaction device can present an initial clustering result obtained from the clustering device and, in response to an adjustment operation on a first part of the initial clustering result, obtain a first clustering result; the clustering device further updates a second part of the initial clustering result to a second clustering result according to the adjustment operation on the first part. The adjusted clustering result therefore meets the user's expectation, and the user adjusts the clustering result directly instead of analyzing clustering errors and adjusting the model parameters of the clustering algorithm, which shortens the time needed to optimize the clustering result and improves the efficiency of the whole text clustering process. In addition, the present application also provides a text clustering method, apparatus, device, and medium.

Description

Text clustering system, method, device, equipment and medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a text clustering system, method, apparatus, device, and computer-readable storage medium.
Background
With the development of information technology, the internet has accumulated a large amount of text data. Text clustering technology organizes, summarizes, and navigates text information by gathering texts with high semantic similarity into clusters, so that useful information can be mined from massive text data.
In the text clustering process, an interactive clustering mode can be adopted to improve the accuracy of text clustering. Specifically, after the clustering algorithm produces a clustering result, the user can identify clustering errors in the result and adjust the model parameters of the clustering algorithm based on those errors, so that the clustering algorithm executes the text clustering process again with the adjusted model. Through multiple rounds of parameter adjustment by the user, the accuracy of the clustering result output by the clustering algorithm can eventually meet the user's requirements.
However, optimizing the clustering result by having the user adjust the model parameters of the clustering algorithm makes the whole text clustering process time-consuming and the text clustering efficiency low.
Disclosure of Invention
The present application provides a text clustering system based on a collaborative framework, which improves the efficiency of text clustering by automatically adjusting the part of the clustering result that is not adjusted by the user. Corresponding methods, apparatuses, devices, storage media, and computer program products are also provided.
In a first aspect, the present application provides a text clustering system, which includes a clustering device and an interaction device. The clustering device is configured to cluster a plurality of texts to obtain an initial clustering result. The interaction device can present the initial clustering result obtained from the clustering device and, in response to a user's adjustment operation on a first part of the initial clustering result, obtain a first clustering result. Correspondingly, the clustering device can update a second part of the initial clustering result to a second clustering result according to the adjustment operation on the first part, thereby optimizing the initial clustering result. In the process of correcting the clustering result, the user adjusts part of the clustering result, and the clustering device automatically adjusts the remaining clustering result according to the user's adjustment operation. The adjusted clustering result thus meets the user's expectation, and the user adjusts the clustering result directly rather than analyzing clustering errors and adjusting the model parameters of the clustering algorithm, which shortens the time needed to optimize the clustering result and improves the efficiency of the whole text clustering process. Meanwhile, compared with optimizing the clustering result by adjusting model parameters, adjusting the clustering result directly lowers the technical skill required of the user, and the optimization effect generally better matches the user's expectation.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the clustering device may further be configured to record intermediate information involved in the process of obtaining the initial clustering result through clustering, and update the second part of the initial clustering result to the second clustering result according to the intermediate information and the adjustment operation. In this way, when the clustering device automatically adjusts the second part of the initial clustering result, it does not need to recalculate all information, such as the similarity between texts, but can instead reuse the intermediate information calculated while producing the initial clustering result, which reduces the amount of computation required for text clustering and effectively improves text clustering efficiency.
With reference to the first implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the intermediate information may include any one or more of information of similarity between words in multiple texts, similarity between texts, weight value of a word, definition of a word attribute, and the like. In practical applications, the intermediate information may further include other information, such as a preprocessed text, word order of a word in the text, and the like, and the recorded intermediate information is not limited in this application.
With reference to the first aspect to the second implementation manner of the first aspect, in a third possible implementation manner of the first aspect, the adjustment operation may include any one or more of an operation of defining an attribute of a word in the plurality of texts, an operation of defining an association between words, an operation of defining an association between texts, an operation of defining a cluster category, a noise labeling operation, and an operation of labeling a feature of a cluster category. In particular, when the interaction device supports various adjustment operations on the initial clustering result, the richness of the adjustment operations is increased and the user experience is improved.
With reference to the first aspect to the third implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, when clustering the plurality of texts, the clustering device may specifically calculate the similarities between different texts in the plurality of texts, calculate the similarities between the different texts and the cluster categories according to the similarities between the different texts, determine the initial clustering result based on the similarities between the different texts and the cluster categories, and finally determine the texts and keywords that characterize each cluster category. In this way, the plurality of texts can be clustered and an initial clustering result obtained.
With reference to the first aspect to the fourth implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, the plurality of texts acquired by the clustering device include standard texts and texts to be clustered, where the standard texts have already been clustered and the texts to be clustered have not. In this case, the clustering device may cluster the texts to be clustered according to the standard texts; for example, it may calculate the similarity between each text to be clustered and the standard texts, and determine, according to that similarity, whether the text to be clustered and a standard text belong to the same class.
With reference to the first aspect to the fourth implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, when clustering the plurality of texts, the clustering device may first preprocess the plurality of texts, where the preprocessing includes any one or more of word segmentation, error correction, denoising, stop word removal, and part-of-speech detection. The clustering device then clusters the plurality of preprocessed texts to obtain the initial clustering result. In general, clustering based on preprocessed texts improves the accuracy and/or efficiency of the clustering result. For example, error correction corrects erroneous expressions (wrong words or sentences) in the text, so the accuracy of a clustering result obtained from corrected text can be higher than that obtained from text containing errors. As another example, stop word removal and/or denoising effectively reduces the data volume of the text, so clustering the smaller texts improves efficiency, and removing stop words or noise does not reduce clustering accuracy.
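For illustration only, the following sketch summarizes the collaborative workflow of the first aspect: the device produces an initial result, the user adjusts a first part, and the device updates the second part while reusing cached intermediate information. All names in the sketch (ClusteringDevice, interactive_clustering, and so on) are assumptions made for this example and are not interfaces defined by this application; Python is used purely as a convenient notation.

    # Minimal sketch of the collaborative flow described in the first aspect.
    # All names below (ClusteringDevice, interactive_clustering, ...) are
    # illustrative assumptions, not interfaces defined by this application.

    class ClusteringDevice:
        def __init__(self):
            self.intermediate = {}            # cached similarities, word weights, ...

        def cluster(self, texts):
            # compute and cache pairwise similarities, then group the texts
            initial_result = {"category_1": list(texts)}   # placeholder grouping
            return initial_result

        def update(self, initial_result, adjustment_ops):
            # re-cluster only the part the user did not touch (the "second part"),
            # reusing self.intermediate instead of recomputing everything
            return initial_result


    def interactive_clustering(texts, get_user_adjustments):
        device = ClusteringDevice()
        result = device.cluster(texts)         # initial clustering result
        ops = get_user_adjustments(result)     # user adjusts a first part via the UI
        return device.update(result, ops)      # device updates the second part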
In a second aspect, the present application provides a text clustering method, which can be applied to a clustering device, and specifically includes the following steps: clustering a plurality of texts to obtain an initial clustering result; sending the initial clustering result to an interaction device;
and updating a second part in the initial clustering result into a second clustering result according to the adjustment operation aiming at the first part in the initial clustering result and sent by the interaction device.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the method further includes: recording intermediate information involved in the process of obtaining the initial clustering result through clustering. Updating the second part of the initial clustering result to the second clustering result according to the adjustment operation sent by the interaction device for the first part of the initial clustering result includes: updating the second part of the initial clustering result to the second clustering result according to the intermediate information and the adjustment operation.
With reference to the first implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the intermediate information includes any one or more of information of similarity between words in the plurality of texts, similarity between texts, weight value of a word, definition of a word attribute, and the like.
With reference to the second aspect to the second implementation manner of the second aspect, in a third possible implementation manner of the second aspect, the adjustment operation includes any one or more of an operation of defining an attribute of a word in the plurality of texts, an operation of defining an association between words, an operation of defining an association between texts, an operation of defining a cluster category, a noise labeling operation, and an operation of labeling a feature of a cluster category.
With reference to the second aspect to the third implementation manner of the second aspect, in a fourth possible implementation manner of the second aspect, the clustering the plurality of texts to obtain an initial clustering result includes: calculating the similarities between different texts in the plurality of texts; calculating the similarities between the different texts and the cluster categories according to the similarities between the different texts, and determining the initial clustering result based on the similarities between the different texts and the cluster categories; and determining the texts and keywords that characterize each cluster category.
With reference to the second aspect to the fourth implementation manner of the second aspect, in a fifth possible implementation manner of the second aspect, the plurality of texts include standard texts and texts to be clustered, and the standard texts are clustered; then, the clustering the plurality of texts to obtain an initial clustering result includes: and clustering the texts to be clustered according to the standard texts.
With reference to the second aspect to the fifth implementation manner of the second aspect, in a sixth possible implementation manner of the second aspect, the clustering the plurality of texts to obtain an initial clustering result includes: preprocessing the plurality of texts, wherein the preprocessing comprises any one or more of word segmentation, error correction, denoising, stop word removal and part of speech detection; and clustering the plurality of preprocessed texts to obtain the initial clustering result.
Since the text clustering method in the second aspect corresponds to the functions of the clustering device in the first aspect, specific implementations and technical effects of various possible embodiments in the second aspect and the second aspect may refer to the related descriptions of the corresponding embodiments in the first aspect, and are not described herein again.
In a third aspect, the present application provides a clustering apparatus, including: the clustering module is used for clustering a plurality of texts to obtain an initial clustering result; the communication module is used for sending the initial clustering result to the interaction device;
the clustering module is further configured to update a second part of the initial clustering results to a second clustering result according to the adjustment operation sent by the interaction device for the first part of the initial clustering results.
With reference to the third aspect, in a first possible implementation manner of the third aspect, the apparatus further includes: the storage module is used for recording intermediate information related in the process of obtaining the initial clustering result through clustering; the clustering module is specifically configured to update the second part of the initial clustering result to the second clustering result according to the intermediate information and the adjustment operation.
With reference to the first implementation manner of the third aspect, in a second possible implementation manner of the third aspect, the intermediate information includes any one or more of information of similarity between words in the plurality of texts, similarity between texts, weight value of a word, definition of a word attribute, and the like.
With reference to the third aspect to the second implementation manner of the third aspect, in a third possible implementation manner of the third aspect, the adjustment operation includes any one or more of an operation of defining an attribute of a word in the plurality of texts, an operation of defining an association between words, an operation of defining an association between texts, an operation of defining a cluster category, a noise labeling operation, and an operation of labeling a feature of a cluster category.
With reference to the third aspect to the third implementation manner of the third aspect, in a fourth possible implementation manner of the third aspect, the clustering module is specifically configured to: calculate the similarities between different texts in the plurality of texts; calculate the similarities between the different texts and the cluster categories according to the similarities between the different texts, and determine the initial clustering result based on the similarities between the different texts and the cluster categories; and determine the texts and keywords that characterize each cluster category.
With reference to the third aspect to the fourth implementation manner of the third aspect, in a fifth possible implementation manner of the third aspect, the plurality of texts include standard texts and texts to be clustered, where the standard texts have been clustered; the clustering module is specifically configured to cluster the texts to be clustered according to the standard texts.
With reference to the third aspect to the fifth implementation manner of the third aspect, in a sixth possible implementation manner of the third aspect, the apparatus further includes: a preprocessing module configured to preprocess the plurality of texts, where the preprocessing includes any one or more of word segmentation, error correction, denoising, stop word removal, and part-of-speech detection; the clustering module is specifically configured to cluster the plurality of preprocessed texts to obtain the initial clustering result.
Since the text clustering device in the third aspect corresponds to the functions of the clustering device in the first aspect, specific implementations and technical effects of various possible embodiments in the third aspect and the third aspect may refer to the related descriptions of the corresponding embodiments in the first aspect, and are not described herein again.
In a fourth aspect, the present application provides a computer system comprising at least one computer, each comprising a processor and a memory; the processor of the at least one computer is configured to execute instructions stored in the memory of the at least one computer to perform the method of the second aspect or any implementation manner of the second aspect.
In a fifth aspect, the present application provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the method of any of the above-described second aspect or implementation manner of the second aspect.
In a sixth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of the implementations of the second aspect or the second aspect described above.
Further implementations of the present application can be provided by combining the implementations provided by the above aspects.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some of the embodiments described in the present application, and other drawings can be derived from them by those skilled in the art.
FIG. 1 is a schematic diagram of a text clustering process;
fig. 2 is a structural diagram of a text clustering system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an exemplary interactive interface for presenting initial clustering results in an embodiment of the present application;
fig. 4 is a schematic structural diagram of a clustering apparatus in an embodiment of the present application;
FIG. 5 is a schematic flow chart illustrating a text clustering method according to an embodiment of the present application;
FIG. 6 is a block diagram of a computer system according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of another computing system in the embodiment of the present application.
Detailed Description
In practical applications, a clustering process as shown in fig. 1 may be used to cluster a plurality of texts. The user initializes the model parameters of the clustering device and triggers the clustering device to run. A clustering algorithm in the clustering device then clusters the texts based on the initialized model parameters to obtain a corresponding clustering result. In general, the clustering result obtained with the initialized model parameters may fall short of the user's expectation, so the clustering result is presented to the user. The user analyzes the presented clustering result, identifies clustering errors such as mismatches between texts and cluster categories, and adjusts the model parameters in the clustering device based on the identified errors. The clustering device then re-clusters the plurality of texts based on the user-adjusted model parameters, and the new result is presented to the user again. If the re-clustered result still does not meet the user's expectation, the user continues to adjust the model parameters of the clustering device until the final clustering result meets the expectation, for example until its accuracy satisfies the user's requirements.
However, this text clustering method generally requires that the user be able to identify clustering errors from the clustering result and adjust the model parameters of the clustering device to more appropriate values according to those errors, which places high demands on the user's technical skill. In addition, in practical applications, even after the user adjusts the model parameters according to the clustering errors, the re-clustered result may still not meet the user's expectation, so the user has to adjust the model parameters repeatedly in a trial-and-error manner, and each round of parameter adjustment takes a long time. As a result, the total time needed to obtain a clustering result that meets the user's expectation from a plurality of texts is long, and the text clustering efficiency is low.
Based on this, an embodiment of the present application provides a text clustering system that includes at least a clustering device and an interaction device. The clustering device clusters texts to obtain an initial clustering result; the interaction device then presents the initial clustering result and, in response to an adjustment operation on a first part of the initial clustering result, obtains a first clustering result; the clustering device further updates a second part of the initial clustering result to a second clustering result according to the adjustment operation, thereby optimizing the initial clustering result. In the process of correcting the clustering result, the user adjusts part of the clustering result, and the clustering device automatically adjusts the remaining clustering result according to the user's adjustment operation. The adjusted clustering result thus meets the user's expectation, and the user adjusts the clustering result directly rather than analyzing clustering errors and adjusting the model parameters of the clustering algorithm, which shortens the time needed to optimize the clustering result and improves the efficiency of the whole text clustering process. Meanwhile, compared with optimizing the clustering result by adjusting model parameters, adjusting the clustering result directly lowers the technical skill required of the user, and the optimization effect generally better matches the user's expectation.
Embodiments of the present application are described below with reference to the accompanying drawings.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished.
First, referring to the structural diagram of the text clustering system shown in fig. 2, the text clustering system includes a clustering device 201 and an interaction device 202. The computer on which the interaction device 202 is deployed may be a desktop computer, a notebook computer, a smart phone, or the like; the computer on which the clustering device 201 is deployed may be a terminal device such as a desktop computer, a notebook computer, or a smart phone, or a server such as a cloud server. Fig. 2 illustrates an example in which the clustering device is deployed in a cloud server. The clustering device 201 and the interaction device 202 may be deployed on the same computer or on different computers.
When performing text clustering, a user may input a plurality of texts to the clustering means 201, such as inputting a plurality of texts to the clustering means 201 through the interaction means 202. The text input by the user may be, for example, N customer service work order documents shown in fig. 2, which are respectively customer service work order document _1 to customer service work order document _ N (N is a positive integer greater than 1), or a document serving as customer service corpus or other text, such as a question text provided by the user in a human-computer interaction scene and/or an answer text for a user question.
In practice, the computer on which the interactive device 202 is located may present an interactive interface to the user, and after the user inputs a plurality of texts into the clustering device 201, information of the plurality of texts, such as the identifications of the N customer service order documents shown in fig. 2, may be included on the interactive interface so that the user can view which texts have been input. Then, the user can click a button of "start clustering" on the interactive interface, and the interactive device 202 triggers the clustering device 201 to perform text clustering according to the click operation of the user on the button.
A clustering algorithm may be configured in the clustering device 201, and model parameters in the clustering algorithm may be initialized. The clustering device 201 clusters the plurality of texts based on the clustering algorithm and the initialized model parameters to obtain corresponding clustering results, which will be referred to as initial clustering results hereinafter for convenience of description.
Accordingly, the interaction device 202 may obtain the initial clustering result generated by the clustering device 201 and present it to the user on the interactive interface. As shown in FIG. 3, m cluster categories and the document identifiers belonging to each cluster category may be presented on the interactive interface; for example, the documents belonging to category 1 include document 1-1 through document 1-x, the documents belonging to category 2 include document 2-1 through document 2-y, and so on, and the documents belonging to category m include document m-1 through document m-z. Further, any one or more of a central text, a central sentence, and keywords representing the semantics of each cluster category can be presented on the interactive interface.
Generally, the text clustering result obtained by the clustering device 201 based on the initialized model parameters may not meet the user's expectation, for example, some texts may not match their cluster categories. In this embodiment, the user can adjust the initial clustering result presented by the interaction device 202 so that the adjusted clustering result meets the expectation. In practical applications, because the number of texts participating in clustering is large, the user may adjust only a part of the initial clustering result. Accordingly, the interaction device 202 adjusts that part of the clustering result into the first clustering result according to the user's adjustment operations, and the clustering device 201 adjusts the clustering results not touched by the user according to those adjustment operations, specifically updating the second part of the initial clustering result to the second clustering result. For example, assume that 100 of the texts to be clustered contain a noun A, and some texts under category 1 in the initial clustering result contain the noun A. The user can mark that the noun A should not participate in the text clustering process, and the clustering device 201 can then automatically re-cluster the remaining 99 texts, with the noun A contained in those 99 texts excluded from the re-clustering, thereby adjusting the clustering result.
Since the user directly adjusts the initial clustering result, rather than adjusting the model parameters of the clustering algorithm in the clustering device 201, the adjusted clustering result generally better meets the expectations of the user. Meanwhile, the user does not need to modify the parameters of the clustering algorithm model according to the clustering error analysis of the initial clustering result, so that the technical level requirement on the user can be reduced, the time consumption for optimizing the initial clustering result can be shortened, and the text clustering efficiency is improved.
In some possible embodiments, the adjustment operation of the user supported by the interaction device 202 for the clustering result may specifically be any one or more of a definition operation for an attribute of a word in a plurality of texts, an association definition operation between words, an association definition operation between texts, a cluster category definition operation, a noise labeling operation, and a labeling operation for a feature of a cluster category. In practical applications, the adjustment operation may further include other operations for clustering results, which is not limited in this embodiment.
The attribute of the word may be, for example, a part of speech, a domain of the word, a weight (e.g., a percentage of the word in a predetermined corpus or a value determined according to the percentage), and the like. Then, the defining operation for the word attribute may specifically be an operation of adding, deleting, setting, modifying, etc. for the word attribute.
The association between words may be, for example, semantic similarity between words (e.g., synonyms/antonyms). The operation of defining the association between words may be, for example, labeling whether two words are synonyms or antonyms.
The association between texts may be, for example, semantic similarity between texts. The operation of defining the association between texts may be, for example, labeling whether the semantics of two texts are the same or different, or labeling how similar the semantics of the texts are (e.g., with a numerical value).
The defining operation of the cluster category may be, for example, an operation of merging a plurality of categories into one category, splitting one category into a plurality of categories, creating a new category, or the like.
The noise labeling operation may be, for example, an operation of marking some of the texts input by the user as invalid, or an operation of marking some of the categories in the initial clustering result as invalid. After a text is marked as invalid, it no longer participates in the text clustering process; after a category is marked as invalid, it is no longer included in the categories of the initial clustering result.
The labeling operation of the cluster category feature may be, for example, a labeling operation of information such as a central sentence and a keyword for representing the cluster category feature. In general, the semantics of each text in a cluster category are the same as or associated with the semantics of the central text, the central sentence, and the keyword in the cluster category.
It should be noted that the various examples of the adjusting operation supported by the interacting device 202 are only used for explanation, and are not used to limit the specific implementation of the adjusting operation. In practical applications, the adjustment operations supported by the interaction apparatus 202 may include any other operations on the clustering result besides the above operations.
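As an illustration of how such adjustment operations might be carried from the interaction device 202 to the clustering device 201, the following sketch shows one possible structured encoding. The class name AdjustmentOp, the kind strings, and the example payloads are assumptions made for this example and are not defined by this application.

    from dataclasses import dataclass, field

    @dataclass
    class AdjustmentOp:
        """Illustrative encoding of one user adjustment (assumed, not from this application)."""
        kind: str          # e.g. "word_attribute", "word_relation", "text_relation",
                           # "cluster_definition", "noise_label", "cluster_feature"
        payload: dict = field(default_factory=dict)

    # Examples corresponding to the operations described above:
    mark_synonyms   = AdjustmentOp("word_relation",  {"words": ("crash", "breakdown"), "relation": "synonym"})
    exclude_word    = AdjustmentOp("word_attribute", {"word": "noun_A", "participate": False})
    mark_noise_text = AdjustmentOp("noise_label",    {"text_id": "document 1-3", "valid": False})
    merge_clusters  = AdjustmentOp("cluster_definition", {"merge": ["category 1", "category 3"]})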
Next, a clustering device according to the text clustering system will be described in detail.
Referring to the schematic structural diagram of the clustering device 201 shown in fig. 4, the clustering device 201 includes a communication module 400, a preprocessing module 401, a clustering module 402, and a storage module 403.
A communication module 400, configured to receive the plurality of texts sent by the interaction device, where the plurality of texts may be provided to the interaction device 202 by the user. In practical applications, the user may also provide the plurality of texts directly to the clustering device 201 instead of having the interaction device 202 forward them.
The preprocessing module 401 may be used to preprocess the plurality of texts, for example by performing any one or more of word segmentation, error correction (e.g., correcting erroneous words in the texts), denoising (e.g., removing meaningless letters, symbols, and other characters), stop word removal, and detecting the part of speech of each word. Stop words include function words and other words that carry little content and usually do little to indicate the semantics of the text, such as "a", "these", and "the".
In practical applications, after the preprocessing module 401 performs the above preprocessing on the plurality of texts, the data volume of the texts is reduced to a certain extent, so that clustering the preprocessed texts requires less computation and improves clustering efficiency. For example, if one of the texts is "basketball is generally a multi-player competitive sport", then after word segmentation and stop word removal the words in the text may include "basketball", "multi-player", "competitive", and "sport", and the amount of data from this text that participates in clustering is reduced to 8 characters.
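A minimal sketch of such preprocessing is shown below. The tokenizer and the tiny stop-word list are assumptions made for illustration; a real deployment would use a full stop-word list and a proper word-segmentation and part-of-speech tool rather than this simplified approach.

    import re

    # Tiny illustrative stop-word list; a real deployment would use a full list
    # and a proper word-segmentation / part-of-speech tool.
    STOP_WORDS = {"a", "an", "the", "is", "these", "generally"}

    def preprocess(text):
        """Lower-case, tokenize, and drop stop words (simplified sketch)."""
        tokens = re.findall(r"[a-z0-9\-]+", text.lower())
        return [t for t in tokens if t not in STOP_WORDS]

    print(preprocess("Basketball is generally a multi-player competitive sport"))
    # -> ['basketball', 'multi-player', 'competitive', 'sport']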
The preprocessing module 401 may further provide the relevant information of the text to the clustering module 402 after completing the preprocessing of the plurality of texts. For example, the related information of the text may include information of each word contained in the text, a part of speech of each word, and a word order of the word in the text.
The clustering module 402 may cluster the plurality of texts according to the obtained related information of the plurality of texts. In an exemplary implementation, the clustering module 402 may include a text similarity calculation unit 4021, a text clustering unit 4022, and a cluster category characterization unit 4023.
The text similarity calculation unit 4021 may be configured to calculate the similarity between any two texts. In a specific implementation, the text similarity calculation unit 4021 may select any two texts, text A and text B, and divide each of them into a plurality of sentences. Then, the text similarity calculation unit 4021 may compute the similarity of each sentence in text A with each sentence in text B. Taking the calculation for sentence a in text A and sentence b in text B as an example:
when the sentence a and the sentence b simultaneously include the verb and the noun, the text similarity calculation unit 4021 may calculate the similarity between the verb, the adverb, the noun, and the adjective in the sentence a and the sentence b, and determine the word with the similarity greater than the first threshold in the sentence a and the sentence b, where the similarity between the words may be calculated by the similarity between word vectors of the words, and of course, may be calculated in other manners. Meanwhile, the text similarity calculation unit 4021 may calculate the similarity between the sentence vectors of the sentence a and the sentence vector of the sentence b, and may determine that the sentence a is similar to the sentence b if the similarity between the sentence vectors of the two sentences is greater than the second threshold, and if the weight value corresponding to the word with the similarity greater than the first threshold in the two sentences is also greater than the third threshold. Otherwise, if there is no word with the similarity greater than the first threshold in the sentence a and the sentence b, or the weight values corresponding to the words with the similarity greater than the first threshold in the two sentences are not greater than the third threshold, or the similarity between the sentence vectors of the two sentences is less than the second threshold, the text similarity calculation unit 4021 may determine that the two sentences are not similar. The weight value of each word may be, for example, a weight value of the word in a preset corpus, and the text similarity calculation unit 4021 may determine a weight value corresponding to any one of the words in the sentence a and the sentence b by table lookup.
When sentence a and sentence b do not both contain a noun and a verb, but both contain only nouns (or only verbs), and possibly words of other parts of speech as well, the text similarity calculation unit 4021 may calculate the similarities between the nouns and adjectives (or between the verbs and adverbs) in sentence a and sentence b, and determine the words whose similarity is greater than the first threshold. Meanwhile, the text similarity calculation unit 4021 may calculate the similarity between the sentence vectors of sentence a and sentence b. If the similarity between the two sentence vectors is greater than the second threshold, and the weight values corresponding to the words whose similarity is greater than the first threshold are also greater than the third threshold, it may be determined that sentence a is similar to sentence b; otherwise, it may be determined that sentence a is not similar to sentence b.
When neither the sentence a nor the sentence b contains a noun and a verb, the text similarity calculation unit 4021 may calculate the similarity between the sentence vector of the sentence a and the sentence vector of the sentence b, and if the similarity of the sentence vectors of the two sentences is greater than a second threshold, the text similarity calculation unit 4021 may determine that the sentence a is similar to the sentence b, and if the similarity of the sentence vectors of the two sentences is not greater than the second threshold, the text similarity calculation unit 4021 may determine that the sentence a is not similar to the sentence b.
Based on the above process, it can be determined whether any two sentences in text A and text B are similar, so that the similar sentences in the two texts are obtained. The text similarity calculation unit 4021 may then calculate the proportion of similar sentences in text A and the proportion of similar sentences in text B. When both proportions reach a fourth threshold, the text similarity calculation unit 4021 determines that text A is similar to text B; when the proportion in either text does not reach the fourth threshold, it determines that text A is not similar to text B.
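For illustration, the following sketch applies the decision rules described above. The thresholds T1 to T4 stand for the first to fourth thresholds; their values, the part-of-speech tags, and the embedding and weight lookup functions are assumptions supplied for the example rather than values used by this application.

    import numpy as np

    # T1..T4 stand for the first..fourth thresholds above; the values are assumed.
    T1, T2, T3, T4 = 0.8, 0.7, 0.5, 0.6

    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def sentences_similar(sent_a, sent_b, word_vec, sent_vec, word_weight):
        """sent_a / sent_b are lists of (word, pos) pairs; word_vec, sent_vec and
        word_weight are caller-supplied lookups (e.g. pretrained embeddings)."""
        if cos(sent_vec(sent_a), sent_vec(sent_b)) <= T2:
            return False
        # assumed tags: n = noun, v = verb, a = adjective, d = adverb
        content_a = [w for w, p in sent_a if p in ("n", "v", "a", "d")]
        content_b = [w for w, p in sent_b if p in ("n", "v", "a", "d")]
        if not content_a or not content_b:
            return True            # no content words: the sentence-vector test decides
        # otherwise also require a sufficiently weighted word pair above T1
        return any(cos(word_vec(wa), word_vec(wb)) > T1
                   and min(word_weight(wa), word_weight(wb)) > T3
                   for wa in content_a for wb in content_b)

    def texts_similar(sents_a, sents_b, **lookups):
        """Texts are similar when the share of similar sentences reaches T4 in both."""
        similar_a, similar_b = set(), set()
        for i, sa in enumerate(sents_a):
            for j, sb in enumerate(sents_b):
                if sentences_similar(sa, sb, **lookups):
                    similar_a.add(i)
                    similar_b.add(j)
        return (len(similar_a) / len(sents_a) >= T4
                and len(similar_b) / len(sents_b) >= T4)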
Of course, the above specific implementation manner for determining whether two texts are similar is only an example, and in practical applications, it may also be determined whether two texts are similar in other manners, and in the process of determining whether two texts are similar, the threshold value used may be set by itself, and the specific implementation manner of the process is not limited in this embodiment.
In this way, the text similarity calculation unit 4021 can determine whether any two texts in the plurality of texts are similar and the similarity between any two texts by traversal calculation. Then, the text similarity calculation unit 4021 may transfer the obtained result to the text clustering unit 4022.
The text clustering unit 4022 may perform clustering according to the similarities between the plurality of texts. Specifically, it determines the similarity between a text to be clustered and each text in a clustered text set, and then determines the texts in that set whose similarity with the text to be clustered is greater than a fifth threshold. When the proportion of such texts in the clustered text set is greater than a first ratio threshold, it may be determined that the text to be clustered belongs to the cluster category of that clustered text set, and the text to be clustered is added to the set. When the proportion is smaller than the first ratio threshold, it is determined that the text to be clustered does not belong to that cluster category, and the unit continues with the next clustered text set to determine whether the text to be clustered belongs to its cluster category. If the text clustering unit 4022 determines that the text to be clustered does not belong to any existing cluster category, a new cluster category may be created based on the text to be clustered, and the text is assigned to that new category.
When text clustering is started, if there is no clustered text set currently, the text clustering unit 4022 may create a clustered text set with any text, and determine whether a text to be clustered belongs to a cluster category to which the clustered text set belongs based on the above process, if so, add the text to be clustered to the clustered text set, and if not, may create a new clustered text set based on the text to be clustered, where the new clustered text set corresponds to the new cluster category. In this way, the text clustering unit 4022 may divide each text into corresponding clustered text sets, where the number of the clustered text sets is the number of the cluster categories.
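A compact sketch of this assignment rule follows. The similarity function stands in for the pairwise text similarities already computed by unit 4021, and T5 and R1 stand for the fifth threshold and the first ratio threshold with assumed values.

    # T5 and R1 stand for the fifth threshold and the first ratio threshold; assumed values.
    T5, R1 = 0.7, 0.5

    def assign_texts(text_ids, similarity):
        clusters = []                                   # each cluster: list of text ids
        for t in text_ids:
            placed = False
            for members in clusters:
                close = sum(1 for m in members if similarity(t, m) > T5)
                if close / len(members) > R1:           # enough members are close to t
                    members.append(t)
                    placed = True
                    break
            if not placed:
                clusters.append([t])                    # create a new cluster category
        return clusters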
In some embodiments of practical applications, the plurality of texts in the clustering device 201 may include both the standard texts and the texts to be clustered. Wherein, the text to be clustered is not clustered; and the standard texts are already clustered, and can be divided into a plurality of different clustered text sets according to different clustering categories. In this way, the text clustering unit 4022 in the clustering device 201 may cluster the texts to be clustered according to the clustering condition of the standard texts, for example, may divide each text to be clustered into different clustered text sets in the corresponding standard texts. If the text to be clustered does not belong to all the existing clustering categories, a new clustering category can be created based on the text to be clustered, and the text to be clustered belongs to the new clustering category.
Further, for each cluster category, the cluster category characterizing unit 4023 in the clustering apparatus 201 may determine any one or more of a central text, a central sentence, and a keyword for the cluster category, and the semantics of the determined central text, central sentence, and keyword may characterize the cluster category.
Specifically, when determining the central text of each cluster category, the cluster category characterization unit 4023 may use the similarities between different texts calculated by the text similarity calculation unit 4021 to compute, for each text in the clustered text set of that category, the total (or average) similarity between that text and the other texts in the set, rank the texts by this total (or average) similarity, and select the text with the largest (or a relatively large) value as the central text of the cluster category.
When determining the central sentence of each cluster category, the cluster category characterization unit 4023 may use the similarities between different sentences calculated by the text similarity calculation unit 4021 to compute, for each sentence in the clustered text set of that category, the total (or average) similarity between that sentence and the other sentences in the set, rank the sentences by this value, and select the sentence with the largest (or a relatively large) value as the central sentence of the cluster category.
When determining the keywords of each cluster category, for a given part of speech, the cluster category characterization unit 4023 may determine the set of words with that part of speech in the texts of the category, such as the set of verbs and the set of nouns, and determine the weight value of each word in the set by table lookup. The cluster category characterization unit 4023 may then rank the words in the set by weight value and select one or more words with the largest (or relatively large) values as keywords of the cluster category, thereby determining keywords of different parts of speech for the category.
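The following sketch illustrates these characterization steps for the central text and the keywords; determining the central sentence is analogous, applied to sentences instead of texts. The similarity and weight lookups are assumed to be supplied by the caller, for example from the cached results of unit 4021.

    from collections import defaultdict

    def center_text(members, similarity):
        """Pick the member with the largest total similarity to the rest of the cluster."""
        return max(members,
                   key=lambda t: sum(similarity(t, o) for o in members if o != t))

    def keywords_by_pos(words_with_pos, weight, top_k=3):
        """words_with_pos: iterable of (word, pos) pairs; weight: assumed weight lookup."""
        by_pos = defaultdict(set)
        for w, p in words_with_pos:
            by_pos[p].add(w)
        return {p: sorted(ws, key=weight, reverse=True)[:top_k]
                for p, ws in by_pos.items()}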
Of course, the process of clustering the plurality of texts by the clustering module 402 is only used as an exemplary illustration, and is not limited to the implementation of the text clustering in this embodiment, and in practical application, the clustering module 402 may also use other possible text clustering processes to cluster the plurality of texts.
Further, the clustering module 402 may also record the intermediate information involved in the text clustering process, specifically by sending it to the storage module 403 in the clustering device 201 for storage. For example, the intermediate information may be any one or more of the similarity between different words, the similarity between different texts, the weight values of words, and the definitions of word attributes calculated by the text similarity calculation unit 4021. In practical applications, the recorded intermediate information may also include other information, such as the similar sentences between different texts or identifiers of those sentences (e.g., the sentences may be numbered sequentially within each text). The storage module 403 may include an index unit 4031 and a storage unit 4032: the storage module 403 stores the intermediate information in the storage unit 4032 and establishes a query index for it in the index unit 4031, where the index may include the identifier of the intermediate information and its storage address in the storage unit.
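As an illustration of the index unit / storage unit split, the sketch below keeps an in-memory index from an identifier to a slot position. The class name, key layout, and example values are assumptions made for this example, not the module's actual design.

    class IntermediateStore:
        """Minimal stand-in for the storage module 403 (index unit + storage unit)."""

        def __init__(self):
            self._index = {}     # identifier -> slot position      (index unit 4031)
            self._slots = []     # stored values                    (storage unit 4032)

        def put(self, key, value):
            if key in self._index:
                self._slots[self._index[key]] = value    # update in place on adjustment
            else:
                self._index[key] = len(self._slots)
                self._slots.append(value)

        def get(self, key, default=None):
            pos = self._index.get(key)
            return default if pos is None else self._slots[pos]

    store = IntermediateStore()
    store.put(("sim", "doc_1", "doc_2"), 0.83)           # cached text similarity
    store.put(("weight", "basketball"), 0.41)            # cached word weight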
Based on the above process, the clustering device 201 can cluster a plurality of texts and obtain an initial clustering result, where the initial clustering result includes a clustering category, which can be represented by any one or more of the above-mentioned central text, central sentence, and keyword, and meanwhile, the initial clustering result also includes texts belonging to the clustering category. Further, the initial clustering result may further include intermediate information in the clustering process, such as attributes of words included in the text under each clustering category, and information about similarity between texts. The clustering means 201 may then transmit the obtained initial clustering result to the interaction means 202 through the communication module 400, so that the initial clustering result is presented to the user by the interaction means 202.
Furthermore, the interaction device 202 may support various interactive operations with the user, such as the operations described above for defining word attributes in the plurality of texts, defining associations between words, defining associations between texts, defining cluster categories, labeling noise, and labeling cluster category features. When the user adjusts the initial clustering result presented by the interaction device 202, because the number of texts involved is large, the user may adjust only part of the clustering result, and the interaction device 202 updates the part adjusted by the user to the first clustering result according to the user's operations. Meanwhile, the interaction device 202 may transmit the user's adjustment operations for that part of the clustering result to the clustering device 201, specifically to the communication module 400 in the clustering device 201, and the communication module 400 then passes the adjustment operations to the clustering module 402.
The clustering means 201 updates the related information of the word, the related information of the text, the related information of the clustering category, and the like according to the adjustment operation performed by the user, and performs corresponding adjustment on the rest of the texts based on the updated information, so as to update the second part of the initial clustering result to the second clustering result.
For example, when the adjustment operation performed by the user adjusts the attribute of a word, such as changing its part of speech from verb to noun, then when determining the keywords of a cluster category the word is removed from the verb set and added to the noun set, and the keywords of different parts of speech for the cluster category are re-determined based on the updated verb and noun sets; for the specific way of determining keywords, refer to the description of the foregoing process.
As another example, when the adjustment operation performed by the user defines an association between words, such as directly labeling two words as synonyms (or antonyms, or unrelated), the clustering device 201 (specifically, the text similarity calculation unit 4021) may set the similarity between the two words to an arbitrary value greater than the first threshold (or to an arbitrary value smaller than the first threshold when the words are antonyms), and perform text clustering again based on the updated similarity between the words.
As another example, when the adjustment operation performed by the user defines an association between texts, such as directly labeling two texts as having the same semantics, or migrating a text P from one cluster category to another, the clustering device 201 may perform text clustering again based on that definition, for example migrating other texts with the same semantics as text P in another cluster category to the cluster category in which text P is now located.
Based on the adjustment operations performed by the user, the clustering device 201 may update the intermediate information stored in the storage module 403, such as the attributes of words, the similarity between words, and the similarity between texts. Specifically, it may first query, through the index unit 4031, the storage location of the information to be updated in the storage unit 4032, and update the value at that location. In this way, when clustering is performed again, the clustering device 201 can reuse the intermediate information stored in the storage module 403 without recalculation. For example, during re-clustering, the clustering device 201 may read the similarity between two texts directly from the storage module 403 instead of recomputing it through the calculation process described above. This effectively reduces the computation required for re-clustering and improves re-clustering efficiency, thereby improving the responsiveness of clustering result optimization.
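The sketch below shows how one such adjustment (labeling two words as synonyms or antonyms) could be folded back into the cached intermediate information, and how re-clustering can reuse cached text similarities while recomputing only missing entries. The store interface, the threshold T1, and the key layout follow the earlier sketches and are assumptions, not this application's actual implementation.

    def apply_word_relation(store, word_a, word_b, relation, t1=0.8):
        """Record a user-defined word relation in the cached intermediate information."""
        key = ("word_sim", *sorted((word_a, word_b)))
        if relation == "synonym":
            store.put(key, t1 + 0.1)     # force the similarity above the first threshold
        elif relation == "antonym":
            store.put(key, 0.0)          # force the similarity below the first threshold

    def recluster_with_cache(text_ids, store, recompute, cluster_fn):
        """cluster_fn is any routine taking (text_ids, similarity), such as the
        assignment sketch shown earlier; cached pair similarities are reused and
        only missing (i.e. invalidated and removed) entries are recomputed."""
        def similarity(i, j):
            key = ("sim", *sorted((i, j)))
            cached = store.get(key)
            if cached is None:
                cached = recompute(i, j)
                store.put(key, cached)
            return cached
        return cluster_fn(text_ids, similarity)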
After the clustering device 201 adjusts the initial clustering result through the above process, it may transmit the second clustering result to the interaction device 202, and the interaction device 202 may present the first clustering result (adjusted by the user) and the second clustering result (adjusted by the clustering device 201) to the user, so that the user can check whether the adjusted clustering result meets the user's expectation. In practical applications, if the user adjusts the adjusted clustering result again, the clustering device 201 may automatically adjust the part of the clustering result not adjusted by the user based on a process similar to the above, and deliver it to the interaction device 202 for display again, until the finally obtained clustering result meets the user's expectation.
Based on the text clustering system described above, the embodiment of the present application also provides a text clustering method. The following introduces the text clustering method from the perspective of the interaction between the devices.
Referring to the flowchart of the text clustering method shown in fig. 5, the method may be applied to the text clustering system shown in fig. 2, and the method may specifically include:
S501: the interaction device 202 receives a plurality of texts provided by the user.
In this embodiment, the texts provided by the user may be, for example, work order documents that need to be clustered before being distributed to departments, customer service work order documents used as a customer service corpus, or question texts raised by users and/or answer texts to user questions in a human-computer dialogue scenario.
The texts received by the interaction device 202 may include standard texts and texts to be clustered, where the standard texts have already been clustered and the texts to be clustered have not. Of course, the texts received by the interaction device 202 may also all be texts to be clustered.
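A minimal sketch of clustering according to standard texts is given below, assuming a similarity function is already available; the nearest-standard-text assignment rule and the 0.6 fallback threshold are illustrative assumptions rather than the embodiment's exact procedure.

```python
def assign_to_standard(texts_to_cluster, standard_clusters, similarity, threshold=0.6):
    """Attach each text to be clustered to the category of its most similar standard text."""
    result = {name: list(members) for name, members in standard_clusters.items()}
    result["unassigned"] = []
    for text in texts_to_cluster:
        best_name, best_sim = None, 0.0
        for name, members in standard_clusters.items():
            for std in members:
                sim = similarity(text, std)
                if sim > best_sim:
                    best_name, best_sim = name, sim
        if best_name is not None and best_sim >= threshold:
            result[best_name].append(text)
        else:
            result["unassigned"].append(text)   # too far from every standard text
    return result
```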
It should be noted that this embodiment is described by taking the case where the user inputs the plurality of texts into the interaction device 202 as an example; in other possible embodiments, the user may directly input the plurality of texts into the clustering device 201, which is not limited in this embodiment.
S502: the interaction device 202 passes the plurality of texts to the communication module 400 in the clustering device 201.
S503: the preprocessing module 401 in the clustering device 201 preprocesses the plurality of texts transmitted by the communication module 400, and transmits the related information of the preprocessed texts to the clustering module 402.
In this embodiment, the preprocessing performed on the plurality of texts may be any one or more of word segmentation, error correction, denoising, stop word removal, and detection of the part of speech of each word. For specific implementation, reference may be made to the description of the related parts, which is not repeated here.
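A toy version of such a preprocessing chain is sketched below; the whitespace segmentation, the regular-expression denoising, and the stop-word list are assumptions standing in for the embodiment's real word segmentation and part-of-speech detection.

```python
import re

STOP_WORDS = {"the", "a", "of", "to"}       # illustrative stop-word list

def preprocess(text):
    """Denoise, segment into words, and remove stop words."""
    text = re.sub(r"[^\w\s]", " ", text.lower())        # denoising / punctuation removal stand-in
    words = text.split()                                 # word segmentation (assumed)
    return [w for w in words if w not in STOP_WORDS]     # stop-word removal

print(preprocess("Unable to reset the password!"))       # ['unable', 'reset', 'password']
```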
S504: the clustering module 402 in the clustering device 201 clusters the plurality of texts according to the related information of the preprocessed texts to obtain an initial clustering result, and transfers the intermediate information involved in the clustering process to the storage module 403 for storage.
The related information of a text may include the words contained in the text, the part of speech of each word, and the order of the words in the text.
In this embodiment, when the clustering module 402 clusters the plurality of texts, specifically, the text similarity calculation unit 4021 calculates the similarity between different texts, and the text clustering unit 4022 groups texts with higher similarity into the same category according to the similarity between different texts, obtaining a plurality of clustered text sets respectively corresponding to different cluster categories. Meanwhile, the cluster category characterization unit 4023 determines any one or more of a central text, a central sentence, and keywords for each cluster category. The intermediate information involved in the clustering process, such as the similarity between texts, may be recorded in the storage module 403, specifically in the storage unit 4032, with an index established in the index unit 4031.
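The following sketch shows one possible shape of this step; the character-overlap similarity, the greedy single-pass grouping, the 0.6 threshold, and the choice of the first member as the central text are all illustrative assumptions, not the embodiment's algorithm.

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    """Character-overlap ratio as a stand-in for the similarity computed by unit 4021."""
    return SequenceMatcher(None, a, b).ratio()

def cluster_texts(texts, threshold=0.6):
    """Greedy grouping: a text joins the first cluster whose central text it resembles closely
    enough, otherwise it starts a new cluster; pairwise similarities are kept as intermediate info."""
    clusters, similarities = [], {}
    for text in texts:
        placed = False
        for c in clusters:
            sim = text_similarity(text, c["center"])
            similarities[(text, c["center"])] = sim
            if sim >= threshold:
                c["members"].append(text)
                placed = True
                break
        if not placed:
            clusters.append({"center": text, "members": [text]})
    return clusters, similarities

clusters, sims = cluster_texts(["cannot log in", "cannot log in to my account", "reset password"])
```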
For specific implementation of the clustering process of the multiple texts and the storage of the intermediate information by the storage module 403, reference may be made to the foregoing description, which is not repeated herein.
S505: the clustering device 201 transmits the initial clustering result to the interaction device 202 through the communication module 400.
S506: the interaction device 202 presents the initial clustering result to the user.
S507: the interaction device 202 responds to the user's adjustment operation for the initial clustering result, updates the portion of the clustering result adjusted by the user to the first clustering result, and transmits the adjustment operation to the communication module 400 in the clustering device 201.
S508: the clustering device 201 updates the other part of the initial clustering result to the second clustering result according to the adjustment operation, and transmits the second clustering result to the interaction device 202 through the communication module 400.
The clustering device 201 updates the intermediate information stored in the storage module 403 according to the user's adjustment operation on the partial clustering result, and updates the part of the clustering result not adjusted by the user based on the updated intermediate information to obtain the second clustering result. The specific implementation by which the clustering device 201 obtains the second clustering result based on the adjustment operation can be found in the description of the related parts, and is not repeated here.
S509: the interaction device 202 presents the updated first clustering result and the second clustering result.
The interaction device 202 and the clustering device 201 according to the embodiment of the present application may correspondingly perform the methods described in the embodiments of the present application, and the above and other operations and/or functions of the modules in the interaction device 202 and the clustering device 201 are respectively used to implement the corresponding flows of the method in fig. 5, which are not repeated here for brevity.
FIG. 6 provides a computer system. The computer system 600 shown in fig. 6 comprises a computer, which can be specifically used to implement the functions of the clustering device 201 in the embodiment shown in fig. 4.
Computer system 600 includes a bus 601, a processor 602, a communication interface 603, and a memory 604. The processor 602, the memory 604, and the communication interface 603 communicate over the bus 601. The bus 601 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 6, but this does not represent only one bus or one type of bus. The communication interface 603 is used for communicating with the outside, such as receiving the plurality of texts sent by the interaction device 202 and transmitting the initial clustering result to the interaction device 202.
The processor 602 may be a Central Processing Unit (CPU). The memory 604 may include a volatile memory (volatile memory), such as a Random Access Memory (RAM). The memory 604 may also include a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory, an HDD, or an SSD.
The memory 604 has stored therein executable code that the processor 602 executes to perform the text clustering method described above.
Specifically, when the embodiment shown in fig. 4 is implemented and the modules described in the embodiment of fig. 4 are implemented by software, the software or program code required for executing the functions of the preprocessing module 401, the clustering module 402, and the storage module 403 in fig. 4 is stored in the memory 604, the communication module 400 is implemented by the communication interface 603, and the processor 602 is configured to execute the instructions in the memory 604 to perform the text clustering method applied to the clustering device 201. In other embodiments, the memory 604 may also be used for storing data, and the functions of the storage module 403 may be implemented by the memory 604.
It should be noted that the computer system 600 shown in fig. 6 is illustrated as including one computer; in other possible embodiments, the computer system may also include a plurality of computers, and the multiple different computers in the computer system cooperate with each other to jointly execute the text clustering method. In this case, the preprocessing module 401, the clustering module 402, and the storage module 403 may be located on a plurality of different computers. For ease of understanding, it is assumed below that the preprocessing module 401 and the clustering module 402 are located in the same computer, and the storage module 403 is located in another computer.
Referring to FIG. 7, FIG. 7 provides another computer system. The computer system 700 shown in fig. 7 includes two computers, i.e., a computer 710 and a computer 720, which cooperate with each other to implement the functions of the clustering device 201 in the embodiment shown in fig. 4.
Computer 710 includes a bus 711, a processor 712, a communication interface 713, and a memory 714. The processor 712, the memory 714, and the communication interface 713 communicate over the bus 711. Computer 720 includes a bus 721, a processor 722, a communication interface 723, and a memory 724. The processor 722, the memory 724, and the communication interface 723 communicate over the bus 721. The bus 711 and the bus 721 may each be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown for each computer in FIG. 7, but this does not represent only one bus or one type of bus. The communication interface 713 is used for communicating with the outside, such as receiving the plurality of texts sent by the interaction device 202 and transmitting the initial clustering result to the interaction device 202, and the communication interface 723 is used for implementing the interaction between the computer 710 and the computer 720.
The processors 712 and 722 may be CPUs. The memory 714 and the memory 724 may include volatile memory, such as RAM, and may also include non-volatile memory, such as ROM, flash memory, HDD, or SSD.
The memory 714 and the memory 724 store executable codes, and the processor 712 and the processor 722 respectively execute the executable codes in the memories to perform the text clustering method.
Specifically, when the embodiment shown in fig. 4 is implemented and the modules described in the embodiment of fig. 4 are implemented by software, the software or program code required for executing the functions of the preprocessing module 401 and the clustering module 402 in fig. 4 is stored in the memory 714, the software or program code required for executing the functions of the storage module 403 in fig. 4 is stored in the memory 724, the communication module 400 is implemented by the communication interface 713, the processor 712 is configured to execute the instructions in the memory 714, and the processor 722 is configured to execute the instructions in the memory 724, so that the two computers cooperate to execute the text clustering method applied to the clustering device 201.
Of course, in other possible embodiments, when the computer system includes a plurality of different computers, the preprocessing module 401 and the clustering module 402 may also be located in different computers, which is not limited in the present application.
An embodiment of the present application further provides a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to execute the above text clustering method applied to the clustering device 201.
Embodiments of the present application further provide a computer program product that, when executed by a computer, causes the computer to execute any one of the foregoing text clustering methods. The computer program product may be a software installation package, which may be downloaded and executed on a computer when any of the foregoing text clustering methods needs to be used.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, and the like. Generally, functions performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function may vary, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software implementation is preferable in more cases. Based on such understanding, the technical solutions of the present application may be substantially embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the methods according to the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a training device or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.

Claims (23)

1. A text clustering system, the system comprising:
clustering device, interaction device;
the clustering device is used for clustering a plurality of texts to obtain an initial clustering result;
the interaction device is used for presenting the initial clustering result obtained from the clustering device and responding to the adjustment operation for the first part in the initial clustering result to obtain a first clustering result;
and the clustering device is further used for updating the second part in the initial clustering result into a second clustering result according to the adjustment operation.
2. The system according to claim 1, wherein the clustering device is further configured to record intermediate information involved in the process of obtaining the initial clustering result by clustering, and update a second part of the initial clustering result to the second clustering result according to the intermediate information and the adjustment operation.
3. The system according to claim 2, wherein the intermediate information includes any one or more of similarity between words in the plurality of texts, similarity between texts, weight value of a word, and definition of a word attribute.
4. The system according to any one of claims 1 to 3, wherein the adjustment operation comprises any one or more of a defining operation for an attribute of a word in the plurality of texts, a defining operation for an association between words, a defining operation for an association between texts, a defining operation for a cluster category, a noise labeling operation, and a labeling operation for a feature of a cluster category.
5. The system according to any one of claims 1 to 4, wherein the clustering means is specifically configured to:
calculating the similarity between different texts in the plurality of texts;
calculating the similarity between different texts in the plurality of texts and the clustering categories according to the similarity between the different texts, and determining the initial clustering result based on the similarity between the different texts and the clustering categories;
and determining texts and keywords for characterizing the features of the cluster categories.
6. The system according to any one of claims 1 to 5, wherein the plurality of texts include standard texts and texts to be clustered, and the standard texts are clustered;
and the clustering device is specifically used for clustering the texts to be clustered according to the standard texts.
7. The system according to any one of claims 1 to 6, wherein the clustering device is specifically configured to perform preprocessing on the plurality of texts, where the preprocessing includes any one or more of word segmentation, error correction, denoising, stop word removal, and part-of-speech detection, and cluster the plurality of preprocessed texts to obtain the initial clustering result.
8. A text clustering method is applied to a clustering device, and the method comprises the following steps:
clustering a plurality of texts to obtain an initial clustering result;
sending the initial clustering result to an interaction device;
and updating a second part in the initial clustering result into a second clustering result according to the adjustment operation aiming at the first part in the initial clustering result and sent by the interaction device.
9. The method of claim 8, further comprising:
recording intermediate information involved in the process of obtaining the initial clustering result through clustering;
updating a second part of the initial clustering results into a second clustering result according to the adjustment operation sent by the interaction device for the first part of the initial clustering results, including:
and updating the second part in the initial clustering result into the second clustering result according to the intermediate information and the adjustment operation.
10. The method according to claim 9, wherein the intermediate information includes any one or more of similarity between words in the plurality of texts, similarity between texts, weight value of a word, and definition of a word attribute.
11. The method according to any one of claims 8 to 10, wherein the adjustment operation comprises any one or more of a defining operation for an attribute of a word in the plurality of texts, a defining operation for an association between words, a defining operation for an association between texts, a defining operation for a cluster category, a noise labeling operation, and a labeling operation for a feature of a cluster category.
12. The method according to any one of claims 8 to 11, wherein the clustering the plurality of texts to obtain an initial clustering result comprises:
calculating the similarity between different texts in the plurality of texts;
calculating the similarity between different texts in the plurality of texts and the clustering categories according to the similarity between the different texts, and determining the initial clustering result based on the similarity between the different texts and the clustering categories;
and determining texts and keywords for characterizing the features of the cluster categories.
13. The method according to any one of claims 8 to 12, wherein the plurality of texts include standard texts and texts to be clustered, and the standard texts are clustered;
then, the clustering the plurality of texts to obtain an initial clustering result includes:
and clustering the texts to be clustered according to the standard texts.
14. The method according to any one of claims 8 to 13, wherein clustering the plurality of texts to obtain an initial clustering result comprises:
preprocessing the plurality of texts, wherein the preprocessing comprises any one or more of word segmentation, error correction, denoising, stop word removal and part of speech detection;
and clustering the plurality of preprocessed texts to obtain the initial clustering result.
15. A clustering apparatus, characterized in that the clustering apparatus comprises:
the clustering module is used for clustering a plurality of texts to obtain an initial clustering result;
the communication module is used for sending the initial clustering result to the interaction device;
the clustering module is further configured to update a second part of the initial clustering results to a second clustering result according to the adjustment operation sent by the interaction device for the first part of the initial clustering results.
16. The apparatus of claim 15, further comprising:
the storage module is used for recording intermediate information related in the process of obtaining the initial clustering result through clustering;
the clustering module is specifically configured to update the second part of the initial clustering result to the second clustering result according to the intermediate information and the adjustment operation.
17. The apparatus according to claim 16, wherein the intermediate information includes any one or more of similarity between words in the plurality of texts, similarity between texts, weight value of a word, and definition of a word attribute.
18. The apparatus according to any one of claims 15 to 17, wherein the adjustment operation comprises any one or more of a defining operation for an attribute of a word in the plurality of texts, a defining operation for an association between words, a defining operation for an association between texts, a defining operation for a cluster category, a noise labeling operation, and a labeling operation for a feature of a cluster category.
19. The apparatus according to any one of claims 15 to 18, wherein the clustering module is specifically configured to:
calculating the similarity between different texts in the plurality of texts;
calculating the similarity between different texts in the plurality of texts and the clustering categories according to the similarity between the different texts, and determining the initial clustering result based on the similarity between the different texts and the clustering categories;
and determining texts and keywords for characterizing the features of the cluster categories.
20. The apparatus according to any one of claims 15 to 19, wherein the plurality of texts includes standard texts and texts to be clustered, and the standard texts are clustered;
the clustering module is specifically used for clustering the texts to be clustered according to the standard texts.
21. The apparatus of any one of claims 15 to 20, further comprising:
the preprocessing module is used for preprocessing the plurality of texts, and the preprocessing comprises any one or more of word segmentation, error correction, denoising, stop word removal and part of speech detection;
the clustering module is specifically configured to cluster the plurality of preprocessed texts to obtain the initial clustering result.
22. A computer system, comprising at least one computer, the at least one computer comprising a processor and a memory;
the processor of the at least one computer is configured to execute instructions stored in the memory of the at least one computer to perform the method of any of claims 8 to 14.
23. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any of claims 8 to 14.