WO2022053018A1 - A text clustering system, method, apparatus, device, and medium - Google Patents


Info

Publication number
WO2022053018A1
Authority
WO
WIPO (PCT)
Prior art keywords
clustering
texts
text
similarity
clustering result
Application number
PCT/CN2021/117691
Other languages
English (en)
French (fr)
Inventor
段新宇
秦善夫
卢栋才
王喆锋
怀宝兴
袁晶
Original Assignee
华为云计算技术有限公司 (Huawei Cloud Computing Technologies Co., Ltd.)
Application filed by 华为云计算技术有限公司 (Huawei Cloud Computing Technologies Co., Ltd.)
Publication of WO2022053018A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 — Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 — Clustering; Classification

Definitions

  • the present application relates to the technical field of data processing, and in particular, to a text clustering system, method, apparatus, device, and computer-readable storage medium.
  • By effectively organizing, summarizing, and navigating text information, text clustering technology aggregates texts with relatively high semantic similarity into clusters, so that useful information can be mined from massive text data.
  • In some approaches, interactive clustering is used to improve the accuracy of text clustering.
  • Specifically, the user catches the clustering errors present in the clustering result and adjusts the model parameters of the clustering algorithm based on the captured errors, so that the algorithm re-executes the text clustering process with the adjusted parameters. In this way, after the user adjusts the model parameters several times, the accuracy of the clustering result output by the clustering algorithm can finally meet the user's requirements.
  • However, because the clustering result output by the clustering algorithm is optimized only through the user's adjustment of its model parameters, the entire text clustering process is time-consuming and text clustering efficiency is low.
  • the present application provides a text clustering system based on a collaborative architecture, which improves the efficiency of text clustering by automatically adjusting the clustering results that are not adjusted by the user.
  • the present application also provides corresponding methods, apparatuses, devices, storage media, and computer program products.
  • the present application provides a text clustering system, which includes a clustering device and an interaction device.
  • the clustering device is used for clustering multiple texts to obtain an initial clustering result
  • The interaction device can present the initial clustering result obtained from the clustering device and, in response to the user's adjustment operation on a first part of the initial clustering result, obtain a first clustering result.
  • The clustering device may then update a second part of the initial clustering result to a second clustering result according to the adjustment operation on the first part, thereby optimizing the initial clustering result.
  • In this way, the user only needs to adjust part of the clustering results, and the clustering device automatically adjusts the remaining clustering results according to the user's adjustment operation. This not only ensures that the adjusted clustering result meets the user's expectations, but also lets the user adjust the clustering result directly, without adjusting the model parameters of the clustering algorithm based on clustering-error analysis, which shortens the time needed to optimize the clustering result and improves the efficiency of the overall text clustering process. Compared with optimizing the clustering result by adjusting model parameters, directly adjusting the clustering result not only lowers the technical requirements on the user, but also usually yields an optimization result that better matches the user's expectations.
  • In some possible implementations, the clustering apparatus may also be used to record intermediate information involved in the process of clustering to obtain the initial clustering result, and to update the second part of the initial clustering result to the second clustering result based on the intermediate information and the adjustment operation.
  • In this way, when the clustering device automatically adjusts the second part of the clustering results in the initial clustering result, it does not need to recompute all the information, such as the similarity between texts; instead, it can reuse the intermediate information computed earlier while obtaining the initial clustering result, which not only reduces the amount of computation required to re-cluster the texts but also effectively improves text clustering efficiency.
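  • The reuse of intermediate information described above can be sketched as a simple cache. The class name (SimilarityCache) and the Jaccard word-overlap measure below are illustrative assumptions, not the patent's actual implementation:

```python
# Minimal sketch: cache intermediate similarities so that re-clustering after
# an adjustment does not recompute them. Names and the Jaccard measure are
# illustrative assumptions, not the patent's implementation.

class SimilarityCache:
    def __init__(self):
        self._pair_sim = {}                # (i, j) -> similarity, computed once

    def similarity(self, i, j, texts):
        key = (min(i, j), max(i, j))
        if key not in self._pair_sim:      # compute only on a cache miss
            a, b = set(texts[i].split()), set(texts[j].split())
            self._pair_sim[key] = len(a & b) / len(a | b)
        return self._pair_sim[key]

texts = ["deep learning model", "deep learning network", "stock market news"]
cache = SimilarityCache()
first = cache.similarity(0, 1, texts)      # computed and recorded
again = cache.similarity(0, 1, texts)      # reused after a later adjustment
assert first == again == 0.5
```

  • When only part of the clustering result changes, every cached pair similarity that does not involve an adjusted word or text can be served from the cache instead of being recomputed.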
  • In some possible implementations, the intermediate information may include any one or more of: the similarity between words in the multiple texts, the similarity between texts, the weight values of words, and the definitions of word attributes.
  • Of course, the intermediate information may also include other information, such as the preprocessed texts and the word order of words within each text; the intermediate information recorded is not limited in this application.
  • In some possible implementations, the above adjustment operation may include any one or more of: an operation defining word attributes in the multiple texts, an operation defining associations between words, an operation defining associations between texts, an operation defining cluster categories, an operation labeling noise, and an operation labeling the features of a cluster category.
  • Because the interaction device can support various adjustment operations by the user on the initial clustering result, the richness of the adjustment operations is increased and the user experience is improved.
  • In some possible implementations, when clustering the multiple texts, the clustering apparatus specifically calculates the similarity between different texts among the multiple texts, then calculates, according to the similarity between the different texts, the similarity between the texts and the cluster categories, and determines the initial clustering result based on the similarity between the texts and the cluster categories.
  • The clustering device also calculates the texts and keywords used to characterize the features of each cluster category. In this way, the multiple texts can be clustered to obtain the initial clustering result.
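  • A minimal sketch of the three steps just described, pairwise text similarity, text-to-cluster similarity, and keyword extraction per cluster, might look as follows. The Jaccard measure, the greedy assignment, and the threshold value are all illustrative stand-ins, not the patent's actual algorithm:

```python
from collections import Counter

# Illustrative sketch of: (1) pairwise text similarity, (2) text-to-cluster
# similarity, (3) keywords characterizing each cluster. Jaccard overlap and
# the greedy threshold assignment are stand-ins, not the patent's algorithm.

def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def cluster(texts, threshold=0.3):
    clusters = []                          # each cluster is a list of texts
    for t in texts:
        best, best_sim = None, 0.0
        for c in clusters:
            # text-to-cluster similarity: mean similarity to the members
            sim = sum(jaccard(t, m) for m in c) / len(c)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best.append(t)
        else:
            clusters.append([t])           # start a new cluster category
    return clusters

def keywords(cluster_texts, k=3):
    # keywords characterizing a cluster: its most frequent words
    return [w for w, _ in Counter(" ".join(cluster_texts).split()).most_common(k)]
```

  • For instance, `cluster(["reset my password", "password reset help", "cancel my order"])` groups the two password texts into one category and leaves the order text in its own category.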
  • In some possible implementations, the multiple texts acquired by the clustering device include standard texts and texts to be clustered, where the standard texts have already been clustered and the texts to be clustered have not.
  • In this case, the clustering device may cluster the texts to be clustered according to the standard texts. For example, the clustering device may calculate the similarity between each text to be clustered and the standard texts, and determine, according to that similarity, whether the text to be clustered and a standard text are clustered into one category.
  • In some possible implementations, the clustering device may first preprocess the multiple texts, where the preprocessing includes any one or more of word segmentation, error correction, denoising, stop word removal, and part-of-speech detection. The clustering device then clusters the preprocessed texts to obtain the initial clustering result. Generally, clustering preprocessed texts can correspondingly improve the accuracy and/or efficiency of the clustering result. For example, when the texts are error-corrected, the wrong expressions (wrong words or sentences) in them are fixed, so the accuracy of the resulting clustering can be higher.
  • Stop word removal and/or denoising can effectively reduce the data volume of the texts, so clustering the texts with a smaller amount of data improves clustering efficiency, while the accuracy of the clustering result is generally not reduced.
  • In a second aspect, the present application provides a text clustering method, which can be applied to a clustering device and specifically includes the following steps: clustering a plurality of texts to obtain an initial clustering result; sending the initial clustering result to an interaction device; and, according to an adjustment operation, sent by the interaction device, on a first part of the initial clustering result, updating a second part of the initial clustering result to a second clustering result.
  • In some implementations, the method further includes: recording intermediate information involved in the process of clustering to obtain the initial clustering result.
  • In that case, updating the second part of the initial clustering result to the second clustering result according to the adjustment operation, sent by the interaction device, on the first part of the initial clustering result includes: updating the second part of the initial clustering result to the second clustering result according to the intermediate information and the adjustment operation.
  • In some implementations, the intermediate information includes any one or more of: the similarity between words in the multiple texts, the similarity between texts, the weight values of words, and the definitions of word attributes.
  • In some implementations, the adjustment operation includes any one or more of: a word-attribute definition operation on the plurality of texts, an association definition operation between words, an association definition operation between texts, a cluster-category definition operation, a noise annotation operation, and a cluster-category feature annotation operation.
  • In some implementations, the clustering of the multiple texts to obtain an initial clustering result includes: calculating the similarity between different texts among the multiple texts; calculating, according to the similarity between the different texts, the similarity between the texts and the cluster categories, and determining the initial clustering result based on the similarity between the texts and the cluster categories; and calculating the texts and keywords used to characterize the features of the cluster categories.
  • In some implementations, the multiple texts include standard texts and texts to be clustered, where the standard texts have already been clustered; the clustering of the multiple texts to obtain an initial clustering result then includes: clustering the texts to be clustered according to the standard texts.
  • In some implementations, the clustering of the multiple texts to obtain an initial clustering result includes: preprocessing the plurality of texts, where the preprocessing includes any one or more of word segmentation, error correction, denoising, stop word removal, and part-of-speech detection; and clustering the preprocessed texts to obtain the initial clustering result.
  • In a third aspect, the present application provides a clustering device, which includes: a clustering module for clustering a plurality of texts to obtain an initial clustering result; and a communication module for sending the initial clustering result to the interaction device.
  • the clustering module is further configured to update the second part in the initial clustering result to the second clustering result according to the adjustment operation sent by the interactive device for the first part in the initial clustering result .
  • the apparatus further includes: a storage module configured to record intermediate information involved in the process of obtaining the initial clustering result by clustering; Then, the clustering module is specifically configured to update the second part of the initial clustering result to the second clustering result according to the intermediate information and the adjustment operation.
  • In some implementations, the intermediate information includes any one or more of: the similarity between words in the plurality of texts, the similarity between texts, the weight values of words, and the definitions of word attributes.
  • In some implementations, the adjustment operation includes any one or more of: a word-attribute definition operation on the plurality of texts, an association definition operation between words, an association definition operation between texts, a cluster-category definition operation, a noise annotation operation, and a cluster-category feature annotation operation.
  • In some implementations, the clustering module is specifically configured to: calculate the similarity between different texts among the plurality of texts; calculate, according to the similarity between the different texts, the similarity between the texts and the cluster categories, and determine the initial clustering result based on the similarity between the texts and the cluster categories; and calculate the texts and keywords used to characterize the features of the cluster categories.
  • In some implementations, the multiple texts include standard texts and texts to be clustered, and the standard texts have already been clustered; the clustering module is specifically configured to cluster the texts to be clustered according to the standard texts.
  • In some implementations, the apparatus further includes a preprocessing module for preprocessing the plurality of texts, where the preprocessing includes any one or more of word segmentation, error correction, denoising, stop word removal, and part-of-speech detection; the clustering module is specifically configured to cluster the preprocessed texts to obtain the initial clustering result.
  • Since the clustering apparatus of the third aspect corresponds in function to the clustering device of the first aspect, for the specific implementations of the third aspect and its various possible implementations, and for their technical effects, reference may be made to the relevant descriptions of the corresponding implementations in the first aspect, which are not repeated here.
  • The present application provides a computer system, the computer system including at least one computer, and the at least one computer including a processor and a memory; the processor of the at least one computer is configured to execute instructions stored in the memory of the at least one computer.
  • The present application provides a computer-readable storage medium, where the computer-readable storage medium stores instructions that, when executed on a computer, cause the computer to perform the method described in the second aspect or any one of its implementations.
  • the present application provides a computer program product comprising instructions, which, when run on a computer, cause the computer to execute the method described in the second aspect or any one of the implementations of the second aspect.
  • On the basis of the implementations provided in the above aspects, the present application may further combine them to provide more implementations.
  • FIG. 1 is a schematic diagram of a text clustering process;
  • FIG. 2 is a structural diagram of a text clustering system provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an exemplary interactive interface for presenting an initial clustering result in an embodiment of the present application
  • FIG. 4 is a schematic structural diagram of a clustering apparatus in an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a text clustering method in an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a computer system in an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of another computing system in an embodiment of the present application.
  • the clustering process shown in Figure 1 can be used to cluster multiple texts.
  • the user can initialize the model parameters of the clustering device and trigger the clustering device to run.
  • the clustering algorithm in the clustering device starts to cluster a plurality of texts based on the initialized model parameters, and obtains corresponding clustering results.
  • the clustering result obtained by the clustering apparatus based on the initialized model parameters may be difficult to meet the user's expectation, and therefore, the clustering result can be presented to the user.
  • Then, the user can analyze the presented clustering results to catch the clustering errors present in them, such as a mismatch between a text and its cluster category, and adjust the model parameters of the clustering device accordingly.
  • The clustering device can then re-cluster the plurality of texts based on the model parameters adjusted by the user, and the re-clustered result can be presented to the user again. If the result obtained by re-clustering still does not meet the user's expectation, the user can continue to adjust the model parameters of the clustering device until the final clustering result meets the user's expectation, for example, until the accuracy of the clustering result meets the user's requirements.
  • However, this text clustering method usually requires the user to be able to catch the clustering errors and to further adjust the model parameters of the clustering device to more appropriate values according to those errors, which places relatively high requirements on the user's technical level.
  • Moreover, the clustering result obtained by re-clustering may still fail to meet the user's expectations if the user adjusts the model parameters incorrectly, and each round of parameter adjustment takes a long time; as a result, obtaining a clustering result that meets the user's expectations from the multiple texts takes a long time, and text clustering efficiency is low.
  • In view of this, an embodiment of the present application provides a text clustering system.
  • The text clustering system may include at least a clustering device and an interaction device. The clustering device can cluster the texts to obtain an initial clustering result; the interaction device then presents the initial clustering result and obtains a first clustering result in response to an adjustment operation on a first part of the initial clustering result; and the clustering device updates a second part of the initial clustering result to a second clustering result according to the adjustment operation, thereby optimizing the initial clustering result.
  • In this way, the user only needs to adjust part of the clustering results, and the clustering device automatically adjusts the remaining clustering results according to the user's adjustment operation. This not only ensures that the adjusted clustering result meets the user's expectations, but also lets the user adjust the clustering result directly, without adjusting the model parameters of the clustering algorithm based on clustering-error analysis, which shortens the time needed to optimize the clustering result and improves the efficiency of the overall text clustering process. Compared with optimizing the clustering result by adjusting model parameters, directly adjusting the clustering result not only lowers the technical requirements on the user, but also usually yields an optimization result that better matches the user's expectations.
  • the text clustering system includes a clustering device 201 and an interaction device 202 .
  • the computer on which the interaction device 202 is deployed may be a desktop computer, a notebook computer, a smart phone, etc.
  • the computer on which the clustering device 201 is deployed may be a terminal device such as a desktop computer, a notebook computer, a smart phone, or a server, such as a cloud server etc.
  • the clustering device is deployed on a cloud server as an example.
  • the clustering device 201 and the interaction device 202 may be deployed on the same computer, or, of course, may be deployed on different computers.
  • the user may input a plurality of texts to the clustering device 201 , for example, inputting a plurality of texts to the clustering device 201 through the interaction device 202 .
  • the text input by the user may be, for example, the N customer service work order documents input by the user as shown in FIG. 2 , which are respectively customer service work order document_1 to customer service work order document_N (N is a positive integer greater than 1), It can also be a document or other text as customer service corpus, such as text for questions raised by users in a human-computer interaction scenario and/or text for answers to user questions.
  • the computer where the interactive device 202 is located can present an interactive interface to the user, and after the user inputs multiple texts into the clustering device 201, the interactive interface can include the information of the multiple texts, as shown in FIG. 2 .
  • the user can click the button of "start clustering" on the interactive interface, and the interactive device 202 triggers the clustering device 201 to perform text clustering according to the user's click operation on the button.
  • the clustering device 201 may be configured with a clustering algorithm, and the model parameters in the clustering algorithm may be initialized.
  • the clustering device 201 clusters a plurality of texts based on the clustering algorithm and the initialized model parameters to obtain corresponding clustering results, which are hereinafter referred to as initial clustering results for ease of description.
  • the interaction device 202 may acquire the initial clustering result generated by the clustering device 201, and present the initial clustering result on the interactive interface to the user.
  • m clustering categories and document identifiers belonging to each clustering category can be displayed on the interactive interface.
  • For example, the documents belonging to category 1 include document 1-1 to document 1-x, the documents belonging to category 2 include document 2-1 to document 2-y, ..., and the documents belonging to category m include document m-1 to document m-z.
  • any one or more of central text, central sentences and keywords used to represent the semantics of each cluster category may also be presented on the interactive interface.
  • the text clustering result obtained by the clustering device 201 based on the initialized model parameters may not meet the user's expectation, for example, the text does not match the clustering category.
  • the user can adjust the clustering results in the initial clustering results presented by the interaction device 202, and the adjusted clustering results can meet the user's expectation.
  • Specifically, the interaction device 202 adjusts the first part of the clustering result to the first clustering result according to the user's adjustment operation, and the clustering device 201 updates the second part of the initial clustering result to the second clustering result according to that adjustment operation.
  • For example, the clustering device 201 can automatically re-cluster the remaining 99 texts, and when the 99 texts are re-clustered, the noun A they contain likewise does not participate in the text clustering process, thereby realizing the adjustment of the clustering result.
  • the adjusted clustering result is usually more in line with the user's expectation.
  • In this process, users do not need to modify the model parameters of the clustering algorithm according to a clustering-error analysis of the initial clustering result, which not only reduces the technical requirements on users but also shortens the time required to optimize the initial clustering result and improves text clustering efficiency.
  • In this embodiment, the user's adjustment operations on the clustering result supported by the interaction device 202 may specifically include any one or more of: a word-attribute definition operation on the multiple texts, an association definition operation between words, an association definition operation between texts, a cluster-category definition operation, a noise annotation operation, and a cluster-category feature annotation operation.
  • the adjustment operation may further include other operations for clustering results, which are not limited in this embodiment.
  • the attribute of the word may be, for example, the word's part of speech, the domain it belongs to, and the weight (for example, it may be the proportion of the word in the preset corpus or a value determined according to the proportion) and the like.
  • the definition operation for the word attribute may specifically be operations such as adding, deleting, setting, and modifying the word attribute.
  • the relevance between words can be, for example, the semantic similarity between words (such as synonyms/antonyms) and the like.
  • The operation defining the association between words may be, for example, marking whether words are synonyms or antonyms, and so on.
  • the relevance between texts can be, for example, the semantic similarity between texts.
  • The operation defining the association between texts may be, for example, marking whether the semantics of the texts are the same or different, or marking the semantic similarity between texts (which may be represented by a numerical value, for example).
  • the definition operation of the clustering category may be operations such as merging multiple categories into one category, splitting one category into multiple categories, and creating a new category.
  • The noise labeling operation may be an operation of marking some of the texts input by the user as invalid, or an operation of marking some categories in the initial clustering result as invalid.
  • After a text is marked as invalid, it may no longer participate in the text clustering process; after a category is marked as invalid, the categories included in the initial clustering result may no longer include it.
  • the labeling operation of the cluster category feature may be a labeling operation for information such as central sentences, keywords, etc., which are used to characterize the cluster category feature.
  • the semantics of each text under a cluster category are the same as or related to the semantics of the central text, central sentence and keywords of the cluster category.
  • the various examples of the adjustment operations supported by the interaction apparatus 202 above are only used for explanation, and are not used to limit the specific implementation of the adjustment operations.
  • the adjustment operations supported by the interaction apparatus 202 may include, in addition to the above-mentioned operations, other arbitrary operations on the clustering result.
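  • One possible way to encode the adjustment operations listed above as data passed from the interaction device to the clustering device is sketched below; the type names and fields are hypothetical, since the patent does not prescribe a concrete format:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

# Hypothetical encoding of the adjustment operations the interaction device
# could send to the clustering device; names and fields are assumptions.

class OpType(Enum):
    DEFINE_WORD_ATTRIBUTE = auto()   # add/delete/set/modify a word attribute
    RELATE_WORDS = auto()            # e.g. mark two words as synonyms/antonyms
    RELATE_TEXTS = auto()            # e.g. mark two texts as same-semantics
    DEFINE_CATEGORY = auto()         # merge, split, or create a category
    MARK_NOISE = auto()              # mark a text or category as invalid
    ANNOTATE_FEATURE = auto()        # label a central sentence or keyword

@dataclass
class AdjustmentOp:
    op: OpType
    targets: tuple                   # word(s), text id(s), or category id(s)
    payload: dict = field(default_factory=dict)

# e.g. mark text 12 as noise so it no longer takes part in clustering
noise_op = AdjustmentOp(OpType.MARK_NOISE, targets=(12,))
```

  • A structured record like this would let the clustering device replay the user's adjustments against its recorded intermediate information when updating the remaining clustering results.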
  • the clustering apparatus 201 includes a communication module 400 , a preprocessing module 401 , a clustering module 402 and a storage module 403 .
  • The communication module 400 is configured to receive the multiple texts sent by the interaction device 202, where the texts can be provided to the interaction device 202 by the user. In practical applications, the texts may also be provided by the user directly to the clustering device 201 rather than forwarded by the interaction device 202.
  • The preprocessing module 401 can be used to preprocess the multiple texts; for example, it can perform any one or more of the following on them: word segmentation, error correction (such as correcting wrong words in the text), denoising (such as removing meaningless letters, symbols, etc.), stop word removal, and part-of-speech detection for each word.
  • Stop words may include function words that carry little content and are usually difficult to relate to the semantics of a text, such as "a", "these", and "the".
  • After preprocessing, the data volume of the texts can be reduced to a certain extent, so that when the preprocessed texts are clustered, the amount of computation is reduced and clustering efficiency is improved. For example, if one of the texts is "basketball is generally a multi-player competitive sport", then after preprocessing such as word segmentation and stop word removal, the words kept from the text may include "basketball", "multi-player", "competitive", and "sport", so the amount of data this text contributes to clustering is reduced to just these words (eight characters in the original Chinese).
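  • The preprocessing step illustrated above can be sketched as follows; the tokenizer and stop word list are toy stand-ins for the preprocessing module's actual components:

```python
import re

# Toy preprocessing pass: crude word segmentation plus stop word removal.
# The stop word list and tokenizer are illustrative stand-ins.

STOP_WORDS = {"a", "an", "the", "is", "are", "generally", "of", "these"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())   # word segmentation
    return [t for t in tokens if t not in STOP_WORDS]

words = preprocess("Basketball is generally a multi-player competitive sport")
# only the content-bearing words remain to take part in clustering
```

  • Only the filtered word list then needs to participate in similarity calculation and clustering.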
  • the related information of the texts can also be provided to the clustering module 402 .
  • The relevant information of a text may include information such as each word contained in the text, the part of speech of each word, and the word order of the words within the text.
  • the clustering module 402 can cluster the multiple texts according to the acquired related information of the multiple texts.
  • the clustering module 402 may include a text similarity calculation unit 4021 , a text clustering unit 4022 and a cluster category characterizing unit 4033 .
  • the text similarity calculation unit 4021 can be used to calculate the similarity between any two texts. During specific implementation, the text similarity calculation unit 4021 may select any two texts, namely text A and text B, and divide the two texts into multiple sentences. Then, the text similarity calculation unit 4021 may perform similarity calculation between each sentence in text A and each sentence in text B, respectively. Take the calculation of statement a in text A and statement b in text B as an example:
  • Specifically, the text similarity calculation unit 4021 can calculate the similarity between the verbs, adverbs, nouns, and adjectives in sentence a and those in sentence b, and determine the words whose similarity is greater than a first threshold. The similarity between words may be calculated from the similarity between the words' word vectors, or, of course, in other ways. At the same time, the text similarity calculation unit 4021 can calculate the similarity between the sentence vector of sentence a and the sentence vector of sentence b.
  • the text similarity calculation unit 4021 can determine that the two sentences are not similar.
  • The weight value of each word can be, for example, the word's weight in a preset corpus, and the text similarity calculation unit 4021 can determine the weight value of any word in sentence a and sentence b by looking up a table.
  • When sentence a and sentence b do not both contain nouns and verbs at the same time, for example, when both contain only nouns (or only verbs), though words of other parts of speech may of course also be included, the text similarity calculation unit 4021 can calculate the similarity between the nouns and adjectives (or between the verbs and adverbs) in sentence a and sentence b to determine the words whose similarity is greater than the first threshold.
  • the text similarity calculation unit 4021 can calculate the similarity between the sentence vector of sentence a and the sentence vector of sentence b.
  • the text similarity calculation unit 4021 can calculate the similarity between the sentence vector of sentence a and the sentence vector of sentence b, and if the sentence vectors of the two sentences have the same degree of similarity If the similarity is greater than the second threshold, the text similarity calculation unit 4021 may determine that the sentence a is similar to the sentence b, and if the similarity of the sentence vectors of the two sentences is not greater than the second threshold, the text similarity calculation unit 4021 may Determine that statement a is not similar to statement b.
  • For the similar sentences determined in this way, the text similarity calculation unit 4021 can respectively calculate the proportion of similar sentences in text A and the proportion of similar sentences in text B. When the proportion of similar sentences in each text reaches a fourth threshold, the text similarity calculation unit 4021 determines that text A is similar to text B; when the proportion of similar sentences in one of the texts does not reach the fourth threshold, the text similarity calculation unit 4021 determines that text A is not similar to text B.
  • The above specific implementation for determining whether two texts are similar is only an example. In practical applications, other methods may also be used, and each threshold involved in the determination can be set as needed; this embodiment does not limit the specific implementation of the process.
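The thresholded sentence-level and text-level comparison described above can be sketched as follows. This is a minimal illustration only: it assumes sentence vectors are already computed and uses cosine similarity, and the function and parameter names (`sentence_similar`, `texts_similar`, the threshold values) are illustrative, not taken from the patent.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sentence_similar(vec_a, vec_b, second_threshold=0.8):
    # Two sentences are treated as similar when their sentence-vector
    # similarity exceeds the second threshold.
    return cosine(vec_a, vec_b) > second_threshold

def texts_similar(sent_vecs_a, sent_vecs_b,
                  second_threshold=0.8, fourth_threshold=0.5):
    # Mark each sentence that has at least one similar counterpart in the
    # other text, then compare the proportion of similar sentences in each
    # text against the fourth threshold.
    sim_a = [any(sentence_similar(a, b, second_threshold) for b in sent_vecs_b)
             for a in sent_vecs_a]
    sim_b = [any(sentence_similar(b, a, second_threshold) for a in sent_vecs_a)
             for b in sent_vecs_b]
    prop_a = sum(sim_a) / len(sim_a)
    prop_b = sum(sim_b) / len(sim_b)
    return prop_a >= fourth_threshold and prop_b >= fourth_threshold
```

In practice the word-level comparison (verbs, nouns, and their weights) would be folded in as well; the sketch keeps only the sentence-vector path for brevity.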
  • the text similarity calculation unit 4021 can determine whether any two texts in the plurality of texts are similar and the similarity between any two texts through traversal calculation. Then, the text similarity calculation unit 4021 can pass the obtained result to the text clustering unit 4022 .
  • The text clustering unit 4022 can perform clustering according to the similarity between the texts. Specifically, it can determine the similarity between a text to be clustered and each text in a clustered text set, and identify the texts in that set whose similarity with the text to be clustered is greater than a fifth threshold. When the proportion of such texts in the clustered text set is greater than a first proportion threshold, it can be determined that the text to be clustered belongs to the category of the clustered text set, and the text to be clustered is added to that set.
  • When the proportion of texts whose similarity is greater than the fifth threshold in the clustered text set is less than the first proportion threshold, it can be determined that the text to be clustered does not belong to the clustering category of that clustered text set, and the unit can continue to identify the texts in the next clustered text set whose similarity with the text to be clustered is greater than the fifth threshold, so as to determine whether the text to be clustered belongs to the clustering category of that next set.
  • When the text clustering unit 4022 determines that the text to be clustered does not belong to any existing clustering category, it can create a new clustering category based on the text to be clustered, and the text to be clustered belongs to that new clustering category.
  • When the text clustering unit 4022 starts to perform text clustering, if there is currently no clustered text set, it can first create a clustered text set from any one text, and determine based on the above process whether each text to be clustered belongs to the clustering category of that set. If it does, the text to be clustered is added to the clustered text set; if not, a new clustered text set can be created based on the text to be clustered, corresponding to a new clustering category. In this way, the text clustering unit 4022 can divide each text into a corresponding clustered text set, and the number of clustered text sets is the number of clustering categories.
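The incremental assignment described above can be sketched as below. It assumes a pairwise `similarity` callable is available (for example, the values produced by unit 4021); the function and parameter names are illustrative assumptions, not names from the patent.

```python
def cluster_texts(texts, similarity, fifth_threshold=0.7, proportion_threshold=0.5):
    # Incremental clustering: a text joins the first clustered set in which
    # a sufficient proportion of members are similar to it; otherwise it
    # seeds a new set, i.e. a new clustering category.
    clusters = []  # each cluster is a list of texts
    for text in texts:
        placed = False
        for cluster in clusters:
            similar = sum(1 for other in cluster
                          if similarity(text, other) > fifth_threshold)
            if similar / len(cluster) > proportion_threshold:
                cluster.append(text)
                placed = True
                break
        if not placed:
            clusters.append([text])  # new clustering category
    return clusters
```

The number of lists returned equals the number of clustering categories, matching the text above.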
  • The multiple texts in the clustering device 201 may include both standard texts and texts to be clustered. The texts to be clustered have not yet been clustered, while the standard texts have already been clustered and can be divided into multiple different clustered text sets according to their clustering categories.
  • the text clustering unit 4022 in the clustering device 201 can cluster the texts to be clustered according to the clustering situation of the standard texts.
  • Based on the clustered text sets of the standard texts, if a text to be clustered does not belong to any existing clustering category, a new clustering category may be created based on that text, and the text to be clustered belongs to the new clustering category.
  • the clustering category representation unit 4023 in the clustering device 201 can also determine any one or more of the central text, central sentence and keywords for the clustering category.
  • the semantics of the determined central text, central sentence, and keywords can represent the cluster category.
  • For example, the clustering category representation unit 4023 may determine, according to the similarity between different texts calculated by the text similarity calculation unit 4021, the sum (or average) of the similarity between each text in the clustered text set corresponding to a clustering category and the other texts in that set, sort the texts by this sum (or average), and select the text with a larger or the largest similarity sum (or average) as the central text of the clustering category.
  • Similarly, the clustering category representation unit 4023 may determine, according to the similarity between different sentences calculated by the text similarity calculation unit 4021, the sum (or average) of the similarity between each sentence in the clustered text set corresponding to the clustering category and the other sentences in that set, sort the sentences by this sum (or average), and select the sentence with a larger or the largest similarity sum (or average) as the central sentence of the clustering category.
  • For keywords, the clustering category representation unit 4023 can determine the set of words of each part of speech in the texts corresponding to the clustering category, such as the set of words whose part of speech is verb and the set of words whose part of speech is noun, and determine the weight value of each word in the sets by looking up a table or the like. Then, the clustering category representation unit 4023 can sort the words in each set by weight value and select the word or words with larger or the largest weights as keywords of the clustering category; in this way, keywords of different parts of speech corresponding to the clustering category are determined.
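The central-item and keyword selection just described can be sketched as below; `central_item`, `keywords`, and the `weight` lookup are illustrative names under the assumption that pairwise similarities and word weights are already available.

```python
def central_item(items, similarity):
    # Pick the item whose total similarity to all other items in the
    # cluster is largest; used for both the central text and the central
    # sentence of a clustering category.
    def total_sim(item):
        return sum(similarity(item, other) for other in items if other is not item)
    return max(items, key=total_sim)

def keywords(word_sets, weight, top_k=2):
    # For each part-of-speech word set, keep the top_k words with the
    # largest weight values (e.g. looked up from a preset corpus table).
    return {pos: sorted(words, key=weight, reverse=True)[:top_k]
            for pos, words in word_sets.items()}
```

Sorting by the similarity sum rather than the average gives the same ranking here because every item in a set is compared against the same number of neighbors.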
  • The above process by which the clustering module 402 completes the clustering of multiple texts is only an exemplary description, and is not intended to limit the implementation of text clustering in this embodiment to the above examples. In practical applications, the clustering module 402 may also use other possible text clustering processes to complete the clustering of multiple texts.
  • In addition, the clustering module 402 may record the intermediate information involved in the text clustering process; specifically, the intermediate information may be sent to the storage module 403 for storage.
  • The intermediate information can be, for example, any one or more of the similarity between different words, the similarity between different texts, the weight values of words, and the definitions of word attributes calculated by the text similarity calculation unit 4021.
  • the recorded intermediate information may also include more other information, such as similar sentences between different texts or identifications of similar sentences (for example, the sentences may be serially numbered in the text, etc.).
  • The storage module 403 may include an indexing unit 4031 and a storage unit 4032. The storage module 403 may use the storage unit 4032 to store the intermediate information and establish a query index for that information in the indexing unit 4031; the index may include the identifier of each piece of intermediate information and its storage address in the storage unit.
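A minimal sketch of this identifier-to-address indexing follows; the class and method names are illustrative assumptions, and real storage would of course be persistent rather than an in-memory list.

```python
class IntermediateStore:
    # Sketch of the storage module: a storage unit holding intermediate
    # values (e.g. pairwise text similarities) plus an index mapping an
    # identifier to the value's location in the storage unit.
    def __init__(self):
        self._storage = []   # storage unit 4032: values by slot
        self._index = {}     # indexing unit 4031: identifier -> slot

    def put(self, identifier, value):
        if identifier in self._index:
            self._storage[self._index[identifier]] = value  # update in place
        else:
            self._index[identifier] = len(self._storage)
            self._storage.append(value)

    def get(self, identifier):
        return self._storage[self._index[identifier]]
```

Updating a value in place through the index is what later allows re-clustering to overwrite stale similarities without rebuilding the store.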
  • Based on the above, the clustering device 201 can realize the clustering of multiple texts and obtain an initial clustering result, where the initial clustering result includes the clustering categories, which can be characterized by the above-mentioned central texts, central sentences, and keywords, as well as the texts belonging to each clustering category. Further, the initial clustering result may also include intermediate information from the clustering process, such as the attributes of words included in the texts under each clustering category and the similarity between texts. Then, the clustering device 201 can transmit the obtained initial clustering result to the interaction device 202 through the communication module 400, so that the interaction device 202 can present the initial clustering result to the user.
  • The interaction device 202 can also support a variety of interactive operations with the user, such as the above-mentioned operations of defining word attributes in multiple texts, defining correlations between words, defining correlations between texts, defining clustering categories, labeling noise, and labeling clustering category features.
  • When the user adjusts the initial clustering result presented by the interaction device 202, since the number of texts involved in the initial clustering result is large, the user may adjust only part of the clustering result, and the interaction device 202 updates the part of the clustering result adjusted by the user to a first clustering result that meets the user's expectations.
  • The interaction device 202 can also transmit the user's adjustment operation on that part of the clustering result to the clustering device 201, specifically to the communication module 400 in the clustering device 201, which then passes the adjustment operation to the clustering module 402.
  • The clustering device 201 updates word-related information, text-related information, clustering-category-related information, and the like according to the adjustment operation performed by the user, and adjusts the remaining texts accordingly based on the updated information, so as to update the second part of the initial clustering result to the second clustering result.
  • For example, when the adjustment operation performed by the user is to adjust the attributes of a word, such as redefining its part of speech from verb to noun, the clustering device 201 can, when determining the keywords of the clustering category, delete the word from the verb set and add it to the noun set, so that the keywords of different parts of speech of the clustering category are re-determined based on the updated verb set and noun set.
  • When the adjustment operation defines two words as synonyms, the clustering device 201 can set the similarity between the two words to an arbitrary value greater than the first threshold (or, when they are defined as antonyms of each other, to an arbitrary value less than the first threshold), and re-cluster the texts based on the updated similarity between words.
  • Similarly, the clustering device 201 can perform text clustering again based on the definition operation for the correlation between texts, such as migrating other texts with the same semantics as a text P from other clustering categories into the clustering category where text P is located.
  • During re-clustering, the clustering device 201 may update the intermediate information stored in the storage module 403, such as the attributes of words, the similarity between words, and the similarity between texts. Specifically, it queries the indexing unit 4031 for the storage location of the information to be updated in the storage unit 4032, and updates the value at that location accordingly.
  • In this way, when the similarity between two texts is unaffected by the adjustment, the clustering device 201 can directly read it from the storage module 403 without recomputing it through the above calculation process. This not only effectively reduces the amount of computation required for re-clustering, but also improves the efficiency of re-clustering, thereby improving the real-time performance of optimizing clustering results.
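The reuse of recorded intermediate information during re-clustering can be sketched as below; the cache key and the `compute` hook are illustrative assumptions, standing in for the storage module and the similarity calculation of unit 4021.

```python
def cached_similarity(cache, text_a, text_b, compute):
    # Reuse a previously computed pairwise similarity from the recorded
    # intermediate information; compute and record it only when absent,
    # so that re-clustering avoids redundant similarity calculations.
    key = (min(text_a, text_b), max(text_a, text_b))  # order-independent key
    if key not in cache:
        cache[key] = compute(text_a, text_b)
    return cache[key]
```

When an adjustment operation invalidates a pair (for example, a redefined word changes their similarity), the corresponding cache entry would simply be overwritten before re-clustering.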
  • After obtaining the second clustering result, the clustering device 201 can transmit it to the interaction device 202, and the interaction device 202 can present the first clustering result (adjusted by the user) and the second clustering result (adjusted by the clustering device 201) to the user, so that the user can check whether the adjusted clustering results meet the user's expectations.
  • If they do not, the user can continue to adjust part of the clustering results, and the clustering device 201 can automatically adjust the clustering results not adjusted by the user based on the above similar process and submit them to the interaction device 202 for presentation again, until the final clustering result meets the user's expectations.
  • Based on the text clustering system introduced above, the embodiments of the present application also provide a text clustering method, which will next be introduced from the perspective of the interaction of the various devices. For the method, reference may be made to the text clustering system shown in FIG. 2; the method may specifically include:
  • the interaction device 202 receives a plurality of texts provided by the user.
  • The texts provided by the user may be, for example, work order documents that need to be clustered and distributed to departments, customer service work order documents used as a customer service corpus, or texts of questions raised by users and/or texts of answers to those questions in a human-machine dialogue scenario.
  • the text received by the interaction device 202 may include standard text and text to be clustered. Among them, the standard text has been clustered, and the text to be clustered has not yet been clustered. Of course, all the texts received by the interaction device 202 may also be texts to be clustered.
  • In this embodiment, the user inputting multiple texts to the interaction device 202 is used as an example for illustration; in other embodiments, the user may also directly input multiple texts to the clustering device 201. This embodiment does not limit this.
  • the interaction device 202 transmits the plurality of texts to the communication module 400 in the clustering device 201 .
  • the preprocessing module 401 in the clustering device 201 preprocesses the plurality of texts transmitted by the communication module 400 , and transmits relevant information of the preprocessed texts to the clustering module 402 .
  • the preprocessing of multiple texts may be any one or more of word segmentation, error correction, denoising, stop word removal, and part-of-speech detection of each word for the multiple texts.
  • The clustering module 402 in the clustering device 201 performs clustering on the multiple texts according to the relevant information of the preprocessed texts to obtain an initial clustering result, and transmits the intermediate information involved in the clustering process to the storage module 403 for storage.
  • the relevant information of the text may include information such as each word contained in the text, the part of speech of each word, and the word order of the words in the text in the text.
  • During clustering, the text similarity calculation unit 4021 may calculate the similarity between different texts, and then the text clustering unit 4022 clusters texts with high similarity into one category according to the similarity between the texts, obtaining a plurality of different clustered text sets corresponding respectively to different clustering categories.
  • the cluster category representation unit 4023 determines any one or more of the central text, the central sentence and the keyword for the cluster category.
  • the intermediate information involved in the clustering process, such as the similarity between texts can be recorded in the storage module 403 , specifically in the storage unit 4032 in the storage module 403 , and an index is established in the indexing unit 4031 .
  • the specific clustering process of the clustering module 402 for a plurality of texts and the specific implementation of the storage module 403 for storing the intermediate information can be referred to the above-mentioned descriptions in the relevant places, and will not be repeated here.
  • the clustering device 201 transmits the initial clustering result to the interaction device 202 through the communication module 400 .
  • In response to the user's adjustment operation on the initial clustering result, the interaction device 202 updates the part of the clustering result adjusted by the user to the first clustering result, and transmits the adjustment operation to the communication module 400 in the clustering device 201.
  • the clustering device 201 updates other partial clustering results in the initial clustering results to the second clustering results according to the adjustment operation, and transmits the second clustering results to the interaction device 202 through the communication module 400 .
  • Specifically, the clustering device 201 updates the intermediate information stored in the storage module 403 according to the user's adjustment operation on part of the clustering results, and updates the clustering results not adjusted by the user based on the updated intermediate information to obtain the second clustering result.
  • the specific implementation of the second clustering result obtained by the clustering apparatus 201 based on the adjustment operation can be referred to the descriptions in the above-mentioned relevant places, and details are not repeated here.
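The overall interactive loop spanning the steps above can be sketched as follows. All four callables are illustrative placeholders for the clustering device and interaction device roles, not names from the patent.

```python
def interactive_clustering(texts, cluster, present_and_collect, apply_adjustment):
    # Cluster, present the result, let the user adjust part of it, then
    # automatically update the remaining part, repeating until the user
    # accepts the result (signaled here by a None adjustment).
    result = cluster(texts)                      # initial clustering result
    while True:
        adjustment = present_and_collect(result) # user reviews the result
        if adjustment is None:                   # result meets expectations
            return result
        result = apply_adjustment(result, adjustment)  # update second part
```

The key property, as the text emphasizes, is that the user edits the result directly; model parameters never appear in the loop.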
  • It should be noted that the interaction device 202 and the clustering device 201 may correspondingly execute the methods described in the embodiments of the present application, and the above and other operations and/or functions of the modules in the interaction device 202 and the clustering device 201 are respectively used to implement the corresponding flows of the method in FIG. 5; for brevity, details are not repeated here.
  • Figure 6 provides a computer system.
  • the computer system 600 shown in FIG. 6 includes a computer, and the computer can specifically be used to implement the functions of the clustering apparatus 201 in the embodiment shown in FIG. 4 above.
  • Computer system 600 includes bus 601 , processor 602 , communication interface 603 and memory 604 . Communication between the processor 602 , the memory 604 and the communication interface 603 is through the bus 601 .
  • the bus 601 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus or the like.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is used in FIG. 6, but it does not mean that there is only one bus or one type of bus.
  • The communication interface 603 is used for external communication, such as receiving multiple texts sent by the interaction device 202 and transmitting the initial clustering result to the interaction device 202.
  • the processor 602 may be a central processing unit (central processing unit, CPU).
  • Memory 604 may include volatile memory, such as random access memory (RAM).
  • the memory 604 may also include non-volatile memory, such as read-only memory (ROM), flash memory, HDD, or SSD.
  • Executable code is stored in the memory 604, and the processor 602 executes the executable code to perform the aforementioned text clustering method.
  • When each module described in the embodiment of FIG. 4 is implemented by software, the software or program code required by the preprocessing module 401, the clustering module 402, and the storage module 403 is stored in the memory 604, and the function of the communication module 400 is realized through the communication interface 603. The processor 602 is used to execute the instructions in the memory 604 to perform the text clustering method applied to the clustering device 201. The memory 604 may also be used to store data; for example, the function of the storage module 403 may be implemented through the memory 604.
  • the computer system 600 shown in FIG. 6 is exemplified by including one computer, and in other possible embodiments, the computer system may also include multiple computers, and multiple different computers in the computer system The computers cooperate with each other to jointly implement the above text clustering method.
  • the above-mentioned preprocessing module 401, clustering module 402 and storage module 403 may be located on multiple different computers.
  • In the following, the case where the preprocessing module 401 and the clustering module 402 are located in one computer while the storage module 403 is located in another computer is taken as an example for illustrative description.
  • the computer system 700 shown in FIG. 7 includes two computers, namely a computer 710 and a computer 720, which cooperate with each other to implement the functions of the clustering apparatus 201 in the embodiment shown in FIG. 4 above.
  • the computer 710 includes a bus 711 , a processor 712 , a communication interface 713 and a memory 714 .
  • the processor 712 , the memory 714 and the communication interface 713 communicate through the bus 711 .
  • Computer 720 includes bus 721 , processor 722 , communication interface 723 and memory 724 . Communication between the processor 722 , the memory 724 and the communication interface 723 is through the bus 721 .
  • the bus 711 and the bus 721 may be a PCI bus, an EISA bus, or the like.
  • The buses can be divided into address bus, data bus, control bus, and so on. For ease of representation, only one thick line is used for each computer in FIG. 7, but this does not mean that there is only one bus or one type of bus.
  • The communication interface 713 is used to communicate with the outside, such as receiving multiple texts sent by the interaction device 202 and transmitting the initial clustering result to the interaction device 202.
  • the processor 712 and the processor 722 may be CPUs.
  • Memory 714 and memory 724 may include volatile memory, such as RAM.
  • Memory 714 may also include non-volatile memory such as ROM, flash memory, HDD or SSD.
  • Executable code is stored in the memory 714 and the memory 724, and the processor 712 and the processor 722 respectively execute the executable code in the corresponding memory to execute the aforementioned text clustering method.
  • When each module described in the embodiment of FIG. 4 is implemented by software, the software or program code required by the preprocessing module 401 and the clustering module 402 is stored in the memory 714, and the software or program code required by the storage module 403 in FIG. 4 is stored in the memory 724. The processor 712 is configured to execute the instructions in the memory 714, and the processor 722 is configured to execute the instructions in the memory 724; the two cooperate to execute the text clustering method applied to the clustering device 201.
  • Of course, in other embodiments, the preprocessing module 401 and the clustering module 402 may also be located in different computers, etc., which is not limited in this application.
  • Embodiments of the present application further provide a computer-readable storage medium, including instructions, which, when executed on a computer, cause the computer to execute the above text clustering method applied to the clustering apparatus 201 .
  • An embodiment of the present application further provides a computer program product, when the computer program product is executed by a computer, the computer executes any one of the foregoing text clustering methods.
  • the computer program product can be a software installation package, and when any of the aforementioned text clustering methods needs to be used, the computer program product can be downloaded and executed on a computer.
  • The device embodiments described above are only schematic; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • The computer-readable storage medium may be any available medium that a computer can store, or a data storage device, such as a training device or a data center, that integrates one or more available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), and the like.


Abstract

A text clustering system, method, apparatus, device, and medium, the system comprising a clustering apparatus (201) and an interaction apparatus (202). The clustering apparatus (201) is configured to cluster a plurality of texts to obtain an initial clustering result, and the interaction apparatus (202) can present the initial clustering result obtained from the clustering apparatus and, in response to an adjustment operation on a first part of the initial clustering result, obtain a first clustering result; the clustering apparatus further updates a second part of the initial clustering result to a second clustering result according to the adjustment operation on the first part. In this way, not only does the adjusted clustering result meet the user's expectations, but the user also adjusts the clustering result directly, without analyzing clustering errors to determine how to tune the model parameters of the clustering algorithm; this shortens the time needed to optimize the clustering result and thus improves the efficiency of the entire text clustering process.

Description

Text clustering system, method, apparatus, device, and medium
Technical Field
The present application relates to the technical field of data processing, and in particular, to a text clustering system, method, apparatus, device, and computer-readable storage medium.
Background
With the development of information technology, a large amount of text data has accumulated on the Internet. Text clustering technology organizes, summarizes, and navigates text information so as to aggregate texts with high semantic similarity into clusters, allowing useful information to be mined from massive text data.
In the text clustering process, interactive clustering can be used to improve the accuracy of text clustering. Specifically, after the clustering algorithm produces a clustering result, the user can identify the clustering errors in that result and adjust the model parameters of the clustering algorithm based on the identified errors, so that the algorithm re-executes the text clustering process based on the adjusted model. In this way, after multiple rounds of parameter adjustment by the user, the accuracy of the clustering result output by the algorithm can eventually meet the user's requirements.
However, optimizing the clustering result output by the clustering algorithm through user adjustment of the model parameters makes the entire text clustering process time-consuming, and the efficiency of text clustering is low.
Summary
The present application provides a text clustering system based on a collaborative architecture, which improves the efficiency of text clustering by automatically adjusting the clustering results that the user has not adjusted. The present application further provides a corresponding method, apparatus, device, storage medium, and computer program product.
In a first aspect, the present application provides a text clustering system including a clustering apparatus and an interaction apparatus. The clustering apparatus clusters a plurality of texts to obtain an initial clustering result; the interaction apparatus presents the initial clustering result obtained from the clustering apparatus and, in response to the user's adjustment operation on a first part of the initial clustering result, obtains a first clustering result. Correspondingly, the clustering apparatus can update a second part of the initial clustering result to a second clustering result according to the adjustment operation on the first part, thereby optimizing the initial clustering result. In the process of correcting the clustering result, the user adjusts only part of the result, and the clustering apparatus automatically adjusts the remainder according to the user's adjustment operations. This not only ensures that the adjusted result meets the user's expectations but also lets the user adjust the clustering result directly, without analyzing clustering errors to determine how to tune the model parameters of the clustering algorithm, which shortens the time needed to optimize the result and thus improves the efficiency of the entire clustering process. Moreover, compared with optimizing the result by tuning model parameters, direct adjustment lowers the technical skill required of the user, and the optimization effect usually matches the user's expectations more closely.
With reference to the first aspect, in a first possible implementation of the first aspect, the clustering apparatus is further configured to record intermediate information involved in obtaining the initial clustering result, and to update the second part of the initial clustering result to the second clustering result according to the intermediate information and the adjustment operation. In this way, when automatically adjusting the second part of the initial clustering result, the clustering apparatus does not need to recompute all information, such as the similarity between texts, but can reuse the intermediate information computed during the initial clustering, which not only reduces the computation required for re-clustering but also effectively improves clustering efficiency.
With reference to the first implementation of the first aspect, in a second possible implementation of the first aspect, the intermediate information may include any one or more of the similarity between words in the plurality of texts, the similarity between texts, the weight values of words, and the definitions of word attributes. In practical applications, the intermediate information may further include other information, such as the preprocessed texts and the word order of words within a text; the recorded intermediate information is not limited in the present application.
With reference to the first aspect to the second implementation of the first aspect, in a third possible implementation of the first aspect, the adjustment operation may include any one or more of: defining word attributes in the plurality of texts, defining correlations between words, defining correlations between texts, defining clustering categories, labeling noise, and labeling clustering category features. In particular, when the interaction apparatus supports multiple kinds of adjustment operations on the initial clustering result, the richness of the adjustment operations and the user experience are improved.
With reference to the first aspect to the third implementation of the first aspect, in a fourth possible implementation of the first aspect, when clustering the plurality of texts, the clustering apparatus may first calculate the similarity between different texts, then calculate, based on those similarities, the similarity between each text and the clustering categories, determine the initial clustering result based on the similarity between the texts and the clustering categories, and finally calculate the texts and keywords used to characterize the clustering categories. In this way, the plurality of texts can be clustered to obtain the initial clustering result.
With reference to the first aspect to the fourth implementation of the first aspect, in a fifth possible implementation of the first aspect, the plurality of texts obtained by the clustering apparatus include standard texts, which have already been clustered, and texts to be clustered, which have not. When clustering the plurality of texts, the clustering apparatus may cluster the texts to be clustered according to the standard texts; for example, it may calculate the similarity between each text to be clustered and the standard texts and, based on that similarity, determine whether the text to be clustered belongs to the same category as the standard texts.
With reference to the first aspect to the fourth implementation of the first aspect, in a sixth possible implementation of the first aspect, when clustering the plurality of texts, the clustering apparatus may first preprocess them, where the preprocessing includes any one or more of word segmentation, error correction, denoising, stop-word removal, and part-of-speech detection, and then cluster the preprocessed texts to obtain the initial clustering result. In general, clustering based on preprocessed texts correspondingly improves the accuracy and/or efficiency of clustering. For example, after error correction, erroneous expressions (wrong words or sentences) in a text are corrected, so the clustering result obtained from corrected texts can be more accurate than one obtained from texts containing errors. As another example, after stop-word removal and/or denoising, the data volume of the texts is effectively reduced, so clustering over less data improves efficiency, while the accuracy of clustering is usually not degraded by stop-word removal and/or denoising.
In a second aspect, the present application provides a text clustering method, which may be applied to a clustering apparatus and specifically includes the following steps: clustering a plurality of texts to obtain an initial clustering result; sending the initial clustering result to an interaction apparatus; and updating a second part of the initial clustering result to a second clustering result according to an adjustment operation, sent by the interaction apparatus, on a first part of the initial clustering result.
With reference to the second aspect, in a first possible implementation of the second aspect, the method further includes: recording intermediate information involved in obtaining the initial clustering result; the updating of the second part of the initial clustering result to the second clustering result according to the adjustment operation sent by the interaction apparatus then includes: updating the second part of the initial clustering result to the second clustering result according to the intermediate information and the adjustment operation.
With reference to the first implementation of the second aspect, in a second possible implementation of the second aspect, the intermediate information includes any one or more of the similarity between words in the plurality of texts, the similarity between texts, the weight values of words, and the definitions of word attributes.
With reference to the second aspect to the second implementation of the second aspect, in a third possible implementation of the second aspect, the adjustment operation includes any one or more of: defining word attributes in the plurality of texts, defining correlations between words, defining correlations between texts, defining clustering categories, labeling noise, and labeling clustering category features.
With reference to the second aspect to the third implementation of the second aspect, in a fourth possible implementation of the second aspect, clustering the plurality of texts to obtain the initial clustering result includes: calculating the similarity between different texts; calculating, based on those similarities, the similarity between the texts and the clustering categories, and determining the initial clustering result based on the similarity between the texts and the clustering categories; and calculating the texts and keywords used to characterize the clustering categories.
With reference to the second aspect to the fourth implementation of the second aspect, in a fifth possible implementation of the second aspect, the plurality of texts include standard texts, which have already been clustered, and texts to be clustered; clustering the plurality of texts to obtain the initial clustering result then includes: clustering the texts to be clustered according to the standard texts.
With reference to the second aspect to the fifth implementation of the second aspect, in a sixth possible implementation of the second aspect, clustering the plurality of texts to obtain the initial clustering result includes: preprocessing the plurality of texts, where the preprocessing includes any one or more of word segmentation, error correction, denoising, stop-word removal, and part-of-speech detection; and clustering the preprocessed texts to obtain the initial clustering result.
Since the text clustering method of the second aspect corresponds to the functions of the clustering apparatus in the first aspect, for the specific implementations and technical effects of the second aspect and its possible implementations, reference may be made to the corresponding descriptions of the first aspect, which are not repeated here.
第三方面,本申请提供了一种聚类装置,该聚类装置包括:聚类模块,用于对多个文本进行聚类,得到初始聚类结果;通信模块,用于向交互装置发送所述初始聚类结果;
所述聚类模块,还用于根据所述交互装置发送的针对于所述初始聚类结果中第一部分的调整操作,将所述初始聚类结果中的第二部分更新为第二聚类结果。
结合第三方面,在第三方面的第一种可能的实施方式中,所述装置还包括:存储 模块,用于对聚类得到所述初始聚类结果的过程中涉及的中间信息进行记录;则,所述聚类模块,具体用于根据所述中间信息以及所述调整操作将所述初始聚类结果中的第二部分更新为所述第二聚类结果。
结合第三方面的第一种实施方式,在第三方面的第二种可能的实施方式中,所述中间信息包括所述多个文本中单词之间的相似度、文本之间的相似度、单词的权重值、以及单词属性的定义等信息中的任意一种或多种。
结合第三方面至第一方面的第二种实施方式,在第三方面的第三种可能的实施方式中,所述调整操作,包括所述多个文本中单词属性的定义操作、单词之间关联性定义操作、文本之间关联性定义操作、聚类类目定义操作、噪音标注操作以及聚类类目特征的标注操作中的任意一种或多种。
结合第三方面至第一方面的第三种实施方式,在第三方面的第四种可能的实施方式中,所述聚类模块,具体用于:计算所述多个文本中不同文本之间的相似度;根据所述不同文本之间的相似度,计算所述多个文本中不同文本与聚类类目之间的相似度,并基于所述不同文本与聚类类目之间的相似度确定所述初始聚类结果;计算用于表征聚类类目特征的文本与关键词。
结合第三方面至第一方面的第四种实施方式,在第三方面的第五种可能的实施方式中,所述多个文本中包括标准文本以及待聚类文本,所述标准文本已完成聚类;所述聚类模块,具体用于根据所述标准文本对所述待聚类文本进行聚类。
结合第三方面至第三方面的第五种实施方式,在第三方面的第六种可能的实施方式中,所述装置还包括:预处理模块,用于对所述多个文本进行预处理,所述预处理包括分词、错误纠正、去噪、去除停用词以及词性检测中的任意一种或多种;所述聚类模块,具体用于对经过预处理的多个文本进行聚类,得到所述初始聚类结果。
由于第三方面的文本聚类装置,对应于第一方面中聚类装置所具有的功能,因此,第三方面以及第三方面中各种可能实施方式的具体实现及其所具有的技术效果,可以参见第一方面中相应实施方式的相关描述,在此不做赘述。
第四方面,本申请提供一种计算机系统,所述计算机系统包括至少一个计算机,所述至少一个计算机包括处理器和存储器;所述至少一个计算机的处理器用于执行所述至少一个计算机的存储器中存储的指令,执行上述第二方面或第二方面的任一种实现方式所述的方法。
第五方面,本申请提供一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行上述第二方面或第二方面的任一种实现方式所述的方法。
第六方面,本申请提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第二方面或第二方面的任一种实现方式所述的方法。
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请中记载的一些实施例,对于本领域普通技术人员来讲,还可以根据这些附图获得其它的附图。
图1为一种文本聚类过程的示意图;
图2为本申请实施例提供的一种文本聚类系统的结构图;
图3为本申请实施例中一示例性呈现初始聚类结果的交互界面示意图;
图4为本申请实施例中一种聚类装置的结构示意图;
图5为本申请实施例中一种文本聚类方法的流程示意图;
图6为本申请实施例中一种计算机系统的结构示意图;
图7为本申请实施例中另一种计算机系统的结构示意图。
具体实施方式
实际应用中,可以采用如图1所示的聚类过程对多个文本进行聚类。其中,用户可以对聚类装置的模型参数进行初始化,并触发该聚类装置运行。聚类装置中的聚类算法基于初始化的模型参数开始对多个文本进行聚类,得到相应的聚类结果。通常情况下,聚类装置基于初始化的模型参数所得到的聚类结果可能难以达到用户的预期,因此,该聚类结果可以呈现给用户。而用户可以对所呈现的聚类结果进行分析,捕捉该聚类结果中存在的聚类错误,如文本与聚类类目不匹配等,并基于所确定的聚类错误对聚类装置中的模型参数进行调整。这样,聚类装置可以基于用户调整的模型参数对多个文本进行重新聚类,并且重新聚类得到的聚类结果可以再次呈现给用户。如果重新聚类得到的聚类结果仍然不符合用户的预期,则用户可以继续对聚类装置的模型参数进行调整,直至最终得到的聚类结果符合用户的预期,比如,聚类结果的准确率能够达到用户要求等。
但是,这种文本聚类方式通常要求用户能够根据聚类结果捕捉到聚类错误,并能够根据聚类错误来进一步将聚类装置的模型参数调整为更合适的值,这对于用户的技术水平要求较高。并且,实际应用中,用户根据聚类错误对模型参数进行调整后,重新聚类所得到的聚类结果也很可能仍然不符合用户预期,因此,用户需要通过反复试错的方式,多次根据聚类错误调整模型参数,而每次调整模型参数均需要耗费较长时间,这使得基于多个文本得到符合用户预期的聚类结果的总耗时较长,文本聚类的效率较低。
基于此,本申请实施例提供了一种文本聚类系统,该文本聚类系统至少可以包括聚类装置和交互装置,其中,该聚类装置可以对文本进行聚类,得到初始聚类结果,然后由交互装置将该初始聚类结果进行呈现,并响应针对该初始聚类结果中第一部分的调整操作,得到第一聚类结果,而聚类装置还可以根据该调整操作,将初始聚类结果中的第二部分更新为第二聚类结果,以实现对初始聚类结果的优化。由于在修正聚类结果的过程中,用户可以对部分聚类结果进行调整,并由聚类装置根据用户的调整操作,对剩余的聚类结果进行自动调整,这不仅使得调整后的聚类结果符合用户的预期,而且,用户是直接对聚类结果进行调整,无需根据聚类错误分析如何调整聚类算法的模型参数,以此可以缩短优化聚类结果的耗时,从而可以提高整个文本聚类过程的效率。同时,相比于用户通过调整模型参数的方式来优化聚类结果,用户直接对聚类结果进行调整,不仅可以降低对于用户的技术水平要求,而且,聚类结果的优化效果通常也更符合用户的预期。
下面结合附图,对本申请的实施例进行描述。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。
首先,参见图2所示的文本聚类系统的结构图,该文本聚类系统包括聚类装置201、交互装置202。其中,部署交互装置202的计算机,可以是台式机、笔记本电脑、智能手机等,部署聚类装置201的计算机,可以是台式机、笔记本电脑、智能手机等终端设备,也可以是服务器,如云服务器等,图2中以聚类装置部署于云服务器为例。聚类装置201与交互装置202可以是部署于同一计算机上,当然,也可以是部署于不同计算机上。
在进行文本聚类时,用户可以将多个文本输入至聚类装置201,如通过交互装置202将多个文本输入至聚类装置201等。其中,用户输入的文本,例如可以是图2所示的用户输入的N个客服工单文档,分别为客服工单文档_1至客服工单文档_N(N为大于1的正整数),也可以是作为客服语料的文档或者其它文本,如针对于人机交互场景中用户所提出的问题文本和/或针对于用户问题的答案文本等。
实际应用中,交互装置202所在计算机可以向用户呈现交互界面,并且,在用户将多个文本输入至聚类装置201后,该交互界面上可以包括该多个文本的信息,如图2所示的N个客服工单文档的标识,以便用户查看已输入哪些文本。然后,用户可以在该交互界面上点击“开始聚类”的按钮,而交互装置202根据用户针对该按钮的点击操作,触发聚类装置201进行文本聚类。
该聚类装置201中可以配置有聚类算法,并且,该聚类算法中的模型参数可以被初始化。聚类装置201基于该聚类算法以及完成初始化的模型参数,对多个文本进行聚类,得到相应的聚类结果,为便于描述,以下将其称之为初始聚类结果。
相应的,交互装置202可以从聚类装置201中获取其生成的初始聚类结果,并将该初始聚类结果在交互界面上呈现给用户。如图3所示,交互界面上可以呈现m个聚类类目以及属于各个聚类类目下的文档标识,如属于类目1的文档包括文档1-1至文档1-x,属于类目2的文档包括文档2-1至文档2-y,……,属于类目m的文档包括文档m-1至文档m-z。进一步的,交互界面上还可以呈现用于表征每个聚类类目语义的中心文本、中心句以及关键词中的任意一种或多种。
通常情况下,聚类装置201基于初始化的模型参数所得到的文本聚类结果,可能不符合用户的预期,比如,文本与聚类类目不符等。本实施例中,用户可以对交互装置202所呈现的初始聚类结果进行调整,调整所得的聚类结果即能符合用户的预期。实际应用中,由于参与聚类的文本数量较多,因此,用户可以仅对初始聚类结果中的部分聚类结果进行调整,相应的,交互装置202可以根据用户针对于该部分聚类结果的调整操作将该部分聚类结果调整为第一聚类结果,而聚类装置201可以根据针对部分聚类结果的调整操作,对用户未调整的其它聚类结果进行调整,具体可以是将该初始聚类结果中的第二部分更新为第二聚类结果。比如,假设待聚类的多个文本中存在100个文本包含名词A,并且,初始聚类结果中的类目1下的部分文本包含该名词A,用户可以标记该名词A不参与文本聚类过程,则聚类装置201可以自动对其余的99个文本进行重新聚类,并且,该99个文本在重新聚类时,其包含的名词A同样不参与文本聚类过程,以此实现对聚类结果的调整。
由于用户是直接对初始聚类结果进行调整,而并非是对聚类装置201中聚类算法的模型参数进行调整,因此,调整得到的聚类结果通常更能符合用户的预期。同时,用户也无需根据初始聚类结果的聚类错误分析如何对聚类算法模型参数进行修改,不仅可以降低对于用户的技术水平要求,而且,可以缩短优化初始聚类结果所需的耗时,提高文本聚类效率。
在一些可能的实施方式中,交互装置202所支持的用户针对于聚类结果的调整操作,具体可以是包括对于多个文本中单词属性的定义操作、单词之间的关联性定义操作、文本之间关联性定义操作、聚类类目定义操作、噪音标注操作以及聚类类目特征的标注操作中的任意一种或多种。实际应用中,该调整操作,还可以包括其它针对于聚类结果的操作,本实施例对此并不进行限定。
其中,单词的属性,例如可以是单词的词性、所属领域以及权重(如可以是该单词在预设语料库中的占比或者根据该占比所确定的值)等。则,针对于单词属性的定义操作,具体可以是针对于该单词属性的添加、删除、设置、修改等操作。
单词之间的关联性,例如可以是单词之间的语义相似度(如近义词/反义词)等。针对于单词之间关联性的定义操作,例如可以是对单词之间是否为近义词或者反义词进行标注等。
文本之间的关联性,例如可以是文本之间的语义相似度等。针对于文本之间关联性的定义操作,例如可以是标注文本之间的语义是否相同或者不相同,或者标注表征文本之间语义相近程度(如可以用数值表征)。
聚类类目的定义操作,例如可以是将多个类目合并为一个类目、将一个类目拆分为多个类目、新建类目等操作。
噪音标注操作,例如可以是对用户输入的多个文本中的部分文本进行无效标记的操作,或者是对初始聚类结果中的部分类目进行无效标记的操作。其中,文本被标记无效后,该文本可以不参与文本聚类过程;类目被标记无效后,初始聚类结果所包含的类目中可以不包含被标记为无效的类目。
聚类类目特征的标注操作,例如可以是用于表征聚类类目特征的中心句、关键词等信息的标注操作。通常情况下,聚类类目下各个文本的语义,均与该聚类类目的中心文本、中心句以及关键词的语义相同或者存在关联。
值得注意的是,上述交互装置202所支持的调整操作的各种示例,仅用于进行解释说明,并不用于对调整操作的具体实现进行限定。实际应用中,交互装置202所支持的调整操作,除了可以包括上述操作以外,还可以包括其它对聚类结果的任意操作。
接下来,对上述文本聚类系统涉及的聚类装置进行详细说明。
参见图4所示的聚类装置201的结构示意图,该聚类装置201包括通信模块400、预处理模块401、聚类模块402以及存储模块403。
通信模块400,用于接收交互装置202发送的多个文本,该多个文本可以由用户提供给交互装置202,再由交互装置202转发给聚类装置201。实际应用中,该多个文本也可以由用户直接提供给聚类装置201。
预处理模块401可以用于对该多个文本进行预处理,例如可以是对多个文本进行分词、错误纠正(如纠正文本中出现错误的词语等)、去噪(如去除无意义的字母、符号等字符)、去除停用词以及检测每个单词的词性中的任意一种或多种。其中,停用词,可以包括内容指示含义较低的功能词等词汇,通常难以指示文本的语义,如“一个”、“这些”、“的”等难以指示文本语义的词汇。
实际应用中,预处理模块401对该多个文本进行上述预处理后,可以在一定程度上减少文本的数据量,从而再对经过预处理之后的文本进行聚类时,可以减少计算量,提高聚类效率。比如,假设其中一个文本为“篮球一般是多人竞技运动”,则对该文本进行分词、去除停用词等预处理后,文本中的单词可以包括“篮球”、“多人”、“竞技”、“运动”,该文本中参与聚类的数据量可以减少至8个字符。
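作为对上述预处理过程的一个直观参考,下面给出一个极简的Python示意(假设分词结果已以列表形式给出,STOP_WORDS停用词表、preprocess函数名均为示例性假设,并非本申请限定的实现):

```python
# 示意:对已分词的文本去除停用词,减少参与聚类的数据量
# STOP_WORDS 仅为演示用的极小停用词表,实际系统可使用完整词表
STOP_WORDS = {"一般", "是", "的", "这些", "一个"}

def preprocess(tokens):
    """tokens 为某文本的分词结果,返回去除停用词后参与聚类的单词列表。"""
    return [w for w in tokens if w not in STOP_WORDS]

# 对应文中“篮球一般是多人竞技运动”的例子
tokens = ["篮球", "一般", "是", "多人", "竞技", "运动"]
print(preprocess(tokens))  # ['篮球', '多人', '竞技', '运动']
```

可以看到,经过去停用词后,该文本参与聚类的数据量由11个字符减少至8个字符,与文中的示例一致。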
预处理模块401在完成对多个文本的预处理后,还可以将该文本的相关信息提供给聚类模块402。示例性的,文本的相关信息,可以包括文本中所包含的各个单词、每个单词的词性、以及文本中的单词在该文本中的词序等信息。
聚类模块402可以根据获取到的多个文本的相关信息,对多个文本进行聚类。在一种示例性的具体实现方式中,聚类模块402可以包括有文本相似度计算单元4021、文本聚类单元4022以及聚类类目表征单元4023。
其中,文本相似度计算单元4021,可以用于计算任意两个文本之间的相似度。具体实现时,文本相似度计算单元4021可以选取任意两个文本,分别为文本A以及文本B,并将这两个文本划分成多个语句。然后,文本相似度计算单元4021可以将文本A中的每个语句,分别与文本B中的每个语句进行相似度计算。以计算文本A中的语句a与文本B中的语句b为例:
当语句a与语句b同时包含动词与名词时,文本相似度计算单元4021可以计算出语句a与语句b中的动词、副词、名词、形容词之间的相似度,确定出语句a与语句b中相似度大于第一阈值的单词,其中,单词之间的相似度可以是通过单词的词向量之间的相似度进行计算,当然,也可以是采用其它方式进行计算。同时,文本相似度计算单元4021可以计算语句a的句向量与语句b的句向量之间的相似度,如果两个语句的句向量的相似度大于第二阈值,并且,两个语句中相似度大于第一阈值的单词,其对应的权重值也大于第三阈值,则可以确定语句a与语句b相似。否则,若语句a与语句b中没有相似度大于第一阈值的单词,或者这两个语句中相似度大于第一阈值的单词所对应的权重值均不大于第三阈值,或者这两个语句的句向量之间的相似度小于第二阈值,则文本相似度计算单元4021均可以确定这两个语句不相似。其中,每个单词的权重值例如可以是该单词在预设语料库中的权重值,文本相似度计算单元4021可以通过查表确定语句a以及语句b中任意一个单词所对应的权重值。
当语句a与语句b不同时包含名词和动词时,如两个语句同时只包含名词(或动词),当然,还可以同时包含其它词性的单词,文本相似度计算单元4021可以计算出语句a与语句b中的名词、形容词之间的相似度(或者动词、副词之间的相似度),确定出语句a与语句b中相似度大于第一阈值的单词。同时,文本相似度计算单元4021可以计算语句a的句向量与语句b的句向量之间的相似度,如果两个语句的句向量的相似度大于第二阈值,并且,两个语句中相似度大于第一阈值的单词,其对应的权重值也大于第三阈值,则可以确定语句a与语句b相似;否则,可以确定语句a与语句b不相似。
当语句a与语句b均不包含名词和动词时,文本相似度计算单元4021可以计算语句a的句向量与语句b的句向量之间的相似度,并且,如果这两个语句的句向量的相似度大于第二阈值,则文本相似度计算单元4021可以确定语句a与语句b相似,而若这两个语句的句向量的相似度不大于第二阈值时,则文本相似度计算单元4021可以确定语句a与语句b不相似。
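上述几种情形下的语句相似判定逻辑,可以用如下Python代码进行示意(其中cosine、sentences_similar等函数名以及t1、t2、t3的取值均为示例性假设,分别对应文中的第一、第二、第三阈值;词向量相似度在此直接以输入给出,并非本申请限定的实现):

```python
import math

def cosine(u, v):
    """计算两个向量的余弦相似度,用于示意句向量之间的相似度计算。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sentences_similar(vec_a, vec_b, word_pairs, weight, t1=0.8, t2=0.7, t3=0.1):
    """vec_a/vec_b 为两语句的句向量;word_pairs 为两语句间可比较词性的
    单词对相似度列表 [(词a, 词b, 相似度)],为空表示两语句均不含名词和动词;
    weight 为单词在预设语料库中的权重值查询表。阈值取值仅为示例。"""
    if cosine(vec_a, vec_b) <= t2:
        return False          # 句向量相似度不大于第二阈值,判定不相似
    if not word_pairs:
        return True           # 均不含名词和动词:仅依据句向量判定
    # 需存在相似度大于第一阈值、且两侧单词权重均大于第三阈值的单词对
    return any(s > t1 and weight.get(wa, 0) > t3 and weight.get(wb, 0) > t3
               for wa, wb, s in word_pairs)
```

该示意将三个分支合并为一个函数:句向量相似度不足时直接判不相似,其余情况再视单词对的相似度与权重决定结果。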
基于上述过程,可以确定出文本A与文本B中任意两个语句之间是否相似,由此可以得到这两个文本中的相似语句。然后,文本相似度计算单元4021可以分别计算出该相似语句在文本A中的占比,以及该相似语句在文本B中的占比,当相似语句在文本A中的占比以及在文本B中的占比均达到第四阈值时,文本相似度计算单元4021确定文本A与文本B相似,而当存在相似语句在其中一个文本中的占比未达到第四阈值时,文本相似度计算单元4021确定文本A与文本B不相似。
当然,上述确定两个文本之间是否相似的具体实现方式仅作为一种示例,实际应用中,也可以是采用其它方式确定两个文本之间是否相似,并且,在确定文本之间是否相似的过程中,所采用的阈值可以自行设定,本实施例对该过程的具体实现方式并不进行限定。
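在语句级判定的基础上,文本级相似判定(以相似语句在各自文本中的占比为依据)可以示意如下(sent_sim判定函数由调用方提供,t4对应文中的第四阈值,取值仅为示例,并非本申请限定的实现):

```python
def texts_similar(sents_a, sents_b, sent_sim, t4=0.5):
    """sents_a/sents_b 为两个文本划分出的语句列表,
    sent_sim(a, b) 返回两语句是否相似(布尔值)。"""
    # 分别收集文本A、文本B中与对方存在相似语句的语句
    sim_a = {a for a in sents_a for b in sents_b if sent_sim(a, b)}
    sim_b = {b for b in sents_b for a in sents_a if sent_sim(a, b)}
    # 相似语句在两个文本中的占比均需达到第四阈值,才判定两文本相似
    return (len(sim_a) / len(sents_a) >= t4
            and len(sim_b) / len(sents_b) >= t4)
```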
如此,文本相似度计算单元4021通过遍历计算可以确定出多个文本中任意两个文本之间是否相似以及任意两个文本之间的相似度。然后,文本相似度计算单元4021可以将所得到的结果传递给文本聚类单元4022。
文本聚类单元4022可以根据多个文本中各个文本之间的相似度进行聚类,具体可以是确定每个待聚类的文本与已聚类文本集合中各个文本的相似度,并进一步确定出已聚类文本集合中与该待聚类文本之间的相似度大于第五阈值的文本,当相似度大于第五阈值的文本在该已聚类文本集合中的占比大于第一比例阈值,则可以确定该待聚类文本属于该已聚类文本集合所属的类目,并将该待聚类文本添加至该已聚类文本集合中。而当相似度大于第五阈值的文本在该已聚类文本集合中的占比不大于第一比例阈值,则可以确定该待聚类文本不属于该已聚类文本集合所属的聚类类目,并可以继续确定该待聚类文本与下一已聚类文本集合中相似度大于第五阈值的文本,以便于继续确定该待聚类文本是否属于下一已聚类文本集合所属的聚类类目。文本聚类单元4022若确定该待聚类文本不属于已有的所有聚类类目,则可以基于该待聚类文本创建新的聚类类目,而该待聚类文本则属于该新的聚类类目。
文本聚类单元4022在开始进行文本聚类时,若当前没有已聚类文本集合,则可以先以任意一个文本创建已聚类文本集合,并基于上述过程确定待聚类文本是否属于该已聚类文本集合所属聚类类目,若属于,则将待聚类文本添加至该已聚类文本集合中,而若不属于,则可以基于该待聚类文本创建新的已聚类文本集合,该新的已聚类文本集合对应于新的聚类类目。如此,文本聚类单元4022可以将各个文本划分至相应的已聚类文本集合中,而已聚类文本集合的数量即为聚类类目的数量。
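上述逐文本划分已聚类文本集合的流程,可以用如下Python代码进行示意(cluster_texts函数名、t5与ratio的取值均为示例性假设,分别对应文中的第五阈值与第一比例阈值;text_sim相似度计算由调用方提供,并非本申请限定的实现):

```python
def cluster_texts(texts, text_sim, t5=0.75, ratio=0.5):
    """texts 为文本标识列表;text_sim(a, b) 返回两文本之间的相似度。
    返回值为若干已聚类文本集合,每个集合对应一个聚类类目。"""
    clusters = []                       # 已聚类文本集合的列表
    for t in texts:
        placed = False
        for c in clusters:
            # 统计集合中与待聚类文本相似度大于第五阈值的文本数量
            hits = sum(1 for other in c if text_sim(t, other) > t5)
            if hits / len(c) > ratio:   # 占比大于第一比例阈值,归入该类目
                c.append(t)
                placed = True
                break
        if not placed:                  # 不属于任何已有类目,创建新类目
            clusters.append([t])
    return clusters
```

当没有已聚类文本集合时,第一个文本即创建首个集合,与文中描述的起始情形一致。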
实际应用的一些实施方式中,聚类装置201中的多个文本,可以同时包括标准文本以及待聚类文本。其中,待聚类文本尚未完成聚类;而标准文本已经完成聚类,并且可以根据聚类类目的不同,划分为多个不同的已聚类文本集合。这样,聚类装置201中的文本聚类单元4022,可以根据标准文本的聚类情况对待聚类文本进行聚类,如可以是将每个待聚类文本划分至相应的标准文本中的不同已聚类文本集合。其中,若待聚类文本不属于已有的所有聚类类目,则可以基于该待聚类文本创建新的聚类类目,而该待聚类文本则属于该新的聚类类目。
进一步的,针对于每个聚类类目,还可以由聚类装置201中的聚类类目表征单元4023为该聚类类目确定中心文本、中心句以及关键词中的任意一种或多种,所确定出的中心文本、中心句、关键词的语义可以表征该聚类类目。
具体的,在确定每个聚类类目的中心文本时,聚类类目表征单元4023可以根据文本相似度计算单元4021所计算出的不同文本之间的相似度,确定出该聚类类目对应的已聚类文本集合中每个文本与该已聚类文本集合中的其它文本之间的相似度总和(或平均值),并对每个文本对应的相似度总和(或平均值)进行排序,从中选取相似度总和(或平均值)较大或者最大的文本作为该聚类类目的中心文本。
在确定每个聚类类目的中心句时,聚类类目表征单元4023可以根据文本相似度计算单元4021所计算出的不同语句之间的相似度,确定该聚类类目对应的已聚类文本集合中的每个语句与该已聚类文本集合中的其它语句的相似度总和(或平均值),并对每个语句对应的相似度总和(或平均值)进行排序,从中选取相似度总和(或平均值)较大或者最大的语句作为该聚类类目的中心句。
在确定每个聚类类目的关键词时,针对于一类词性,聚类类目表征单元4023可以确定该聚类类目对应的文本中具有该词性的单词集合,如词性为动词的单词集合、词性为名词的单词集合等,并通过查表等方式确定单词集合中每个单词的权重值。然后,聚类类目表征单元4023可以对该单词集合中不同的单词的权重值进行排序,并从中选取权重值较大或者最大的一个或者多个单词作为该聚类类目的关键词,如此,可以确定出该聚类类目对应的不同词性的关键词。
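上述确定中心文本与关键词的方式,可以用如下Python代码进行示意(center_text、keywords_by_pos等函数名与top_k参数均为示例性假设,相似度与权重均由调用方提供,并非本申请限定的实现):

```python
def center_text(cluster, sim):
    """选取与集合内其它文本相似度总和最大的文本作为中心文本。
    cluster 为某聚类类目对应的已聚类文本集合,sim(a, b) 返回相似度。"""
    return max(cluster,
               key=lambda t: sum(sim(t, o) for o in cluster if o != t))

def keywords_by_pos(words_with_pos, weight, top_k=2):
    """words_with_pos 为 [(单词, 词性)] 列表,weight 为单词权重值查询表。
    按词性分组后,选取各组中权重值最大的 top_k 个单词作为关键词。"""
    groups = {}
    for w, pos in words_with_pos:
        groups.setdefault(pos, []).append(w)
    return {pos: sorted(set(ws), key=lambda w: weight.get(w, 0), reverse=True)[:top_k]
            for pos, ws in groups.items()}
```

中心句的确定与中心文本类似,只需将集合元素由文本换成语句即可。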
当然,上述聚类模块402对多个文本完成聚类的过程,仅作为一种示例性说明,并不用于限定本实施例的文本聚类实现局限于上述示例,实际应用中,聚类模块402也可以是采用其它可能的文本聚类过程完成对多个文本聚类。
进一步的,聚类模块402还可以将文本聚类过程中所涉及的中间信息进行记录,具体可以是将该中间信息发送至聚类装置中的存储模块403进行存储。示例性的,该中间信息,例如可以是上述文本相似度计算单元4021所计算出的不同单词之间的相似度、不同文本之间的相似度、单词的权重值以及单词属性的定义等信息中的任意一种或多种。实际应用中,所记录的中间信息还可以包括更多其它的信息,如不同文本之间的相似语句或者相似语句的标识(如可以是该语句在文本中的顺序编号等)。其中,存储模块403可以包括索引单元4031以及存储单元4032,存储模块403可以利用存储单元4032存储该中间信息,并在索引单元4031中建立该信息的查询索引,该索引可以包括中间信息的标识以及该中间信息在存储单元中的存储地址。
基于上述过程,聚类装置201可以实现对多个文本的聚类,并得到初始聚类结果,该初始聚类结果包括聚类类目,其可以用上述中心文本、中心句以及关键词中的任意一种或多种进行表征,同时,初始聚类结果还包括属于该聚类类目的文本。进一步的,该初始聚类结果还可以包括聚类过程中的中间信息,如每个聚类类目下的文本所包含单词的属性等、文本之间相似度等信息。然后,聚类装置201可以将所得到的初始聚类结果通过通信模块400传输给交互装置202,以便由交互装置202将该初始聚类结果呈现给用户。
并且,交互装置202还可以支持多种与用户之间的交互操作,比如,上述针对于多个文本中单词属性的定义操作、单词之间关联性定义操作、文本之间关联性定义操作、聚类类目定义操作、噪音标注操作以及聚类类目特征的标注操作等。当用户对于交互装置202所呈现的初始聚类结果进行调整时,由于初始聚类结果所涉及的文本数量较多,因此,用户可以仅调整其中一部分聚类结果,而交互装置202根据用户的操作,将用户调整的该部分聚类结果更新为符合用户预期的第一聚类结果。同时,交互装置202还可以将该用户针对于该部分聚类结果的调整操作,传输给聚类装置201,具体可以是传输给聚类装置201中的通信模块400,再由通信模块400将该调整操作传递给聚类模块402。
聚类装置201根据用户执行的调整操作,更新单词的相关信息、文本的相关信息以及聚类类目的相关信息等,并基于所更新的信息对其余文本进行相应的调整,以便于将初始聚类结果中的第二部分更新为第二聚类结果。
例如,当用户执行的调整操作为调整单词的属性,如将单词的词性由动词定义为名词等,则在确定聚类类目的关键词时,从动词集合中删除该单词,并从名词集合中添加该单词,从而基于该更新后的动词集合以及名词集合中的单词重新确定聚类类目的不同词性的关键词,其确定关键词的具体实现方式可以参见前述过程相关描述。
又比如,当用户执行的调整操作,为单词之间的关联性定义操作,如直接定义两个单词之间互为近义词(或者互为反义词,或者无关联)等,此时,聚类装置201(具体可以是文本相似度计算单元4021)可以将这两个单词之间的相似度设置为大于第一阈值的任意值(互为反义词时,设置为小于第一阈值的任意值),并基于更新后的单词之间的相似度来重新进行文本聚类。
再比如,当用户执行的调整操作,为文本之间的关联性定义操作,如直接定义两个文本之间具有相同语义,或者将一个聚类类目中的文本P迁移至其它类目中等,则聚类装置201可以基于针对于文本之间关联性的定义操作重新进行文本聚类,如将其它聚类类目中与该文本P具有相同语义的其它文本迁移至该文本P所在聚类类目中。
基于用户执行的调整操作,聚类装置201可以对存储模块403中所保存的中间信息进行更新,如更新单词的属性、单词之间的相似度、文本之间的相似度等,具体可以是先利用索引单元4031查询出所要更新的信息在存储单元4032中的存储位置,并对该存储位置处的值进行相应的更新。这样,聚类装置201在重新进行聚类时,可以复用存储模块403中所保存的中间信息,而无需重新计算。比如,在重新聚类过程中,聚类装置201可以直接从存储模块403中读取到两个文本之间的相似度,而可以不用再通过上述计算过程计算出文本之间的相似度。如此,不仅可以有效减少重新聚类所需的计算量,而且,也可以提高重新聚类的效率,从而可以提高优化聚类结果的实时性。
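“索引单元记录存储位置、存储单元原地更新、重新聚类时复用”的机制,可以用如下Python代码进行示意(SimilarityStore类名、键的形式以及相似度取值均为示例性假设,并非本申请限定的实现):

```python
class SimilarityStore:
    """中间信息存储的极简示意:存储单元顺序存放数值,
    索引单元记录“中间信息标识 -> 存储地址(列表下标)”的映射。"""

    def __init__(self):
        self._data = []    # 存储单元
        self._index = {}   # 索引单元

    def put(self, key, value):
        if key in self._index:
            # 调整操作触发的原地更新:先查索引得到存储位置,再更新该位置的值
            self._data[self._index[key]] = value
        else:
            self._index[key] = len(self._data)
            self._data.append(value)

    def get(self, key):
        # 重新聚类时直接读取已保存的相似度,无需重新计算
        return self._data[self._index[key]]

# 示意:用户将两个单词定义为近义词时,可将其相似度直接更新为大于第一阈值的值
store = SimilarityStore()
store.put(("手机", "电话"), 0.42)
store.put(("手机", "电话"), 0.99)   # 近义词定义操作 -> 原地更新相似度
print(store.get(("手机", "电话")))  # 0.99
```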
在聚类装置201经过上述过程对初始聚类结果进行调整后,聚类装置201可以将第二聚类结果传递给交互装置202,交互装置202可以将第一聚类结果(用户调整)以及第二聚类结果(聚类装置201调整)呈现给用户,以便于用户查看经过调整后的聚类结果是否能够满足用户预期。实际应用中,若用户对调整后的聚类结果再次进行了调整,则聚类装置201可以基于上述类似过程,对用户未调整的聚类结果自动进行调整,并将其交由交互装置202再次进行呈现,直至最终得到的聚类结果满足用户的预期。
基于上述介绍的文本聚类系统,本申请实施例还提供了一种文本聚类方法,接下来从各装置交互的角度对该文本聚类方法进行介绍。
参见图5所示的文本聚类方法的流程图,该方法可以应用于如图2所示的文本聚类系统,该方法具体可以包括:
S501:交互装置202接收用户提供的多个文本。
本实施例中,用户所提供的文本,例如可以是需要聚类分发到部门的工单文档,或者可以是作为客服语料的客服工单文档,或者可以是人机对话场景中用户所提出的问题文本和/或针对用户问题的答案文本等。
并且,交互装置202所接收到的文本中,可以包含标准文本以及待聚类文本。其中,标准文本已经完成聚类,而待聚类文本尚未进行聚类。当然,交互装置202所接收到的文本中,也可以是全部为待聚类文本。
值得注意的是,本实施例中是以用户向交互装置202输入多个文本为例进行示例性说明,在其它可能的实施方式中,用户也可以是直接向聚类装置201输入多个文本,本实施例对此并不进行限定。
S502:交互装置202将多个文本传递给聚类装置201中的通信模块400。
S503:聚类装置201中的预处理模块401对通信模块400传递的多个文本进行预处理,并将预处理后的文本的相关信息传递给聚类模块402。
本实施例中,对多个文本进行预处理,可以是对多个文本进行分词、错误纠正、去噪、去除停用词以及检测每个单词的词性中的任意一种或多种。其具体实现,可参见前述相关之处描述,在此不做赘述。
S504:聚类装置201中的聚类模块402根据经过预处理后的文本的相关信息,对多个文本进行聚类,得到初始聚类结果,并将聚类过程中所涉及的中间信息传递给存储模块403中保存。
其中,文本的相关信息,可以包括文本中所包含的各个单词、每个单词的词性、以及文本中的单词在该文本中的词序等信息。
本实施例中,聚类模块402在对多个文本进行聚类时,具体可以是由文本相似度计算单元4021计算出不同文本之间的相似度,再由文本聚类单元4022根据不同文本之间的相似度将相似度较高的文本聚集为一类,得到多个不同的已聚类文本集合,分别对应于不同的聚类类目。同时,聚类类目表征单元4023为该聚类类目确定中心文本、中心句以及关键词中的任意一种或多种。在聚类过程中所涉及的中间信息,如文本之间的相似度等,可以记录于存储模块403中,具体可以是记录于存储模块403中的存储单元4032,并在索引单元4031中建立索引。
其中,聚类模块402对于多个文本的具体聚类过程以及存储模块403存储中间信息的具体实现,可以参见前述相关之处描述,在此不做赘述。
S505:聚类装置201通过通信模块400将初始聚类结果传递给交互装置202。
S506:交互装置202向用户呈现初始聚类结果。
S507:交互装置202响应用户针对于初始聚类结果的调整操作,将用户所调整部分的聚类结果更新为第一聚类结果,并将调整操作传递给聚类装置201中的通信模块400。
S508:聚类装置201根据该调整操作,将初始聚类结果中的其它部分聚类结果更新为第二聚类结果,并将第二聚类结果通过通信模块400传递给交互装置202。
其中,聚类装置201根据用户对部分聚类结果的调整操作,更新存储模块403中所保存的中间信息,并基于更新后的中间信息对用户未调整的聚类结果进行更新,得到第二聚类结果。其中,聚类装置201基于该调整操作得到第二聚类结果的具体实现,可以参见前述相关之处描述,在此不做赘述。
S509:交互装置202呈现更新后的第一聚类结果以及第二聚类结果。
根据本申请实施例的交互装置202以及聚类装置201可对应于执行本申请实施例中描述的方法,并且交互装置202以及聚类装置201中的各个模块的上述和其它操作和/或功能分别为了实现图5中的各个方法的相应流程,为了简洁,在此不再赘述。
图6提供了一种计算机系统。图6所示的计算机系统600包括一个计算机,该计算机具体可以用于实现上述图4所示实施例中聚类装置201的功能。
计算机系统600包括总线601、处理器602、通信接口603和存储器604。处理器602、存储器604和通信接口603之间通过总线601通信。总线601可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图6中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。通信接口603用于与外部通信,例如接收交互装置202发送的多个文本以及向交互装置202传输初始聚类结果等。
其中,处理器602可以为中央处理器(central processing unit,CPU)。存储器604可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM)。存储器604还可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM),快闪存储器,HDD或SSD。
存储器604中存储有可执行代码,处理器602执行该可执行代码以执行前述文本聚类方法。
具体地,在实现图4所示实施例的情况下,且图4实施例中所描述的各模块为通过软件实现的情况下,执行图4中的预处理模块401、聚类模块402、存储模块403所需的软件或程序代码存储在存储器604中,通信模块400功能通过通信接口603实现,处理器602用于执行存储器604中的指令,执行应用于聚类装置201的文本聚类方法。在其它实施方式中,存储器604还可以用于存储数据,存储模块403功能可以通过该存储器604实现。
值得注意的是,图6所示的计算机系统600是以包括一个计算机为例进行示例性说明,在其它可能的实施例中,计算机系统还可以包括多个计算机,该计算机系统中的多个不同的计算机相互配合,共同执行上述文本聚类方法。此时,上述预处理模块401、聚类模块402以及存储模块403可以位于多个不同的计算机上。为便于理解,下面以预处理模块401、聚类模块402位于同一计算机,而存储模块403位于另一计算机为例进行示例性说明。
参见图7,图7提供了另一种计算机系统。图7所示的计算机系统700包括两个计算机,分别为计算机710以及计算机720,这两个计算机之间相互协作,用于实现上述图4所示实施例中聚类装置201的功能。
其中,计算机710包括总线711、处理器712、通信接口713和存储器714。处理器712、存储器714和通信接口713之间通过总线711通信。计算机720包括总线721、处理器722、通信接口723和存储器724。处理器722、存储器724和通信接口723之间通过总线721通信。总线711以及总线721可以是PCI总线或EISA总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图7各个计算机中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。通信接口713用于与外部通信,例如接收交互装置202发送的多个文本以及向交互装置202传输初始聚类结果等,通信接口723用于实现计算机710与计算机720之间的交互。
其中,处理器712以及处理器722可以为CPU。存储器714以及存储器724可以包括易失性存储器,例如RAM。存储器714还可以包括非易失性存储器,例如ROM、快闪存储器、HDD或SSD。
存储器714以及存储器724中存储有可执行代码,处理器712以及处理器722分别执行相应存储器中可执行代码以执行前述文本聚类方法。
具体地,在实现图4所示实施例的情况下,且图4实施例中所描述的各模块为通过软件实现的情况下,执行图4中的预处理模块401、聚类模块402所需的软件或程序代码存储在存储器714中,执行图4中的存储模块403所需的软件或程序代码存储在存储器724中,通信模块400功能通过通信接口713实现,处理器712用于执行存储器714中的指令,处理器722用于执行存储器724中的指令,相互配合执行应用于聚类装置201的文本聚类方法。
当然,在其它可能的实施例中,当计算机系统包括多个不同的计算机时,预处理模块401以及聚类模块402也可以是位于不同的计算机等,本申请对此并不进行限定。
本申请实施例还提供了一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行上述应用于聚类装置201的文本聚类方法。
本申请实施例还提供了一种计算机程序产品,所述计算机程序产品被计算机执行时,所述计算机执行前述文本聚类方法的任一方法。该计算机程序产品可以为一个软件安装包,在需要使用前述文本聚类方法的任一方法的情况下,可以下载该计算机程序产品并在计算机上执行该计算机程序产品。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。

Claims (23)

  1. 一种文本聚类系统,其特征在于,所述系统包括:
    聚类装置、交互装置;
    所述聚类装置,用于对多个文本进行聚类,得到初始聚类结果;
    所述交互装置,用于呈现从所述聚类装置获取的所述初始聚类结果,并响应针对所述初始聚类结果中第一部分的调整操作,得到第一聚类结果;
    所述聚类装置,还用于根据所述调整操作,将所述初始聚类结果中的第二部分更新为第二聚类结果。
  2. 根据权利要求1所述的系统,其特征在于,所述聚类装置,还用于对聚类得到所述初始聚类结果的过程中涉及的中间信息进行记录,并根据所述中间信息以及所述调整操作将所述初始聚类结果中的第二部分更新为所述第二聚类结果。
  3. 根据权利要求2所述的系统,其特征在于,所述中间信息包括所述多个文本中单词之间的相似度、文本之间的相似度、单词的权重值以及单词属性的定义等信息中的任意一种或多种。
  4. 根据权利要求1至3任一项所述的系统,其特征在于,所述调整操作,包括所述多个文本中单词属性的定义操作、单词之间关联性定义操作、文本之间关联性定义操作、聚类类目定义操作、噪音标注操作以及聚类类目特征的标注操作中的任意一种或多种。
  5. 根据权利要求1至4任一项所述的系统,其特征在于,所述聚类装置具体用于:
    计算所述多个文本中不同文本之间的相似度;
    根据所述不同文本之间的相似度,计算所述多个文本中不同文本与聚类类目之间的相似度,并基于所述不同文本与聚类类目之间的相似度确定所述初始聚类结果;
    计算用于表征聚类类目特征的文本与关键词。
  6. 根据权利要求1至5任一项所述的系统,其特征在于,所述多个文本中包括标准文本以及待聚类文本,所述标准文本已完成聚类;
    所述聚类装置,具体用于根据所述标准文本对所述待聚类文本进行聚类。
  7. 根据权利要求1至6任一项所述的系统,其特征在于,所述聚类装置,具体用于对所述多个文本进行预处理,所述预处理包括分词、错误纠正、去噪、去除停用词、词性检测中的任意一种或多种,并对经过预处理的多个文本进行聚类,得到所述初始聚类结果。
  8. 一种文本聚类方法,其特征在于,所述方法应用于聚类装置,所述方法包括:
    对多个文本进行聚类,得到初始聚类结果;
    向交互装置发送所述初始聚类结果;
    根据所述交互装置发送的针对于所述初始聚类结果中第一部分的调整操作,将所述初始聚类结果中的第二部分更新为第二聚类结果。
  9. 根据权利要求8所述的方法,其特征在于,所述方法还包括:
    对聚类得到所述初始聚类结果的过程中涉及的中间信息进行记录;
    则所述根据所述交互装置发送的针对于所述初始聚类结果中第一部分的调整操作,将所述初始聚类结果中的第二部分更新为第二聚类结果,包括:
    根据所述中间信息以及所述调整操作将所述初始聚类结果中的第二部分更新为所述第二聚类结果。
  10. 根据权利要求9所述的方法,其特征在于,所述中间信息包括所述多个文本中单词之间的相似度、文本之间的相似度、单词的权重值、以及单词属性的定义等信息中的任意一种或多种。
  11. 根据权利要求8至10任一项所述的方法,其特征在于,所述调整操作,包括所述多个文本中单词属性的定义操作、单词之间关联性定义操作、文本之间关联性定义操作、聚类类目定义操作、噪音标注操作以及聚类类目特征的标注操作中的任意一种或多种。
  12. 根据权利要求8至11任一项所述的方法,其特征在于,所述对多个文本进行聚类,得到初始聚类结果,包括:
    计算所述多个文本中不同文本之间的相似度;
    根据所述不同文本之间的相似度,计算所述多个文本中不同文本与聚类类目之间的相似度,并基于所述不同文本与聚类类目之间的相似度确定所述初始聚类结果;
    计算用于表征聚类类目特征的文本与关键词。
  13. 根据权利要求8至12任一项所述的方法,其特征在于,所述多个文本中包括标准文本以及待聚类文本,所述标准文本已完成聚类;
    则,所述对多个文本进行聚类,得到初始聚类结果,包括:
    根据所述标准文本对所述待聚类文本进行聚类。
  14. 根据权利要求8至13任一项所述的方法,其特征在于,所述对多个文本进行聚类,得到初始聚类结果,包括:
    对所述多个文本进行预处理,所述预处理包括分词、错误纠正、去噪、去除停用词以及词性检测中的任意一种或多种;
    对经过预处理的多个文本进行聚类,得到所述初始聚类结果。
  15. 一种聚类装置,其特征在于,所述聚类装置包括:
    聚类模块,用于对多个文本进行聚类,得到初始聚类结果;
    通信模块,用于向交互装置发送所述初始聚类结果;
    所述聚类模块,还用于根据所述交互装置发送的针对于所述初始聚类结果中第一部分的调整操作,将所述初始聚类结果中的第二部分更新为第二聚类结果。
  16. 根据权利要求15所述的装置,其特征在于,所述装置还包括:
    存储模块,用于对聚类得到所述初始聚类结果的过程中涉及的中间信息进行记录;
    则,所述聚类模块,具体用于根据所述中间信息以及所述调整操作将所述初始聚类结果中的第二部分更新为所述第二聚类结果。
  17. 根据权利要求16所述的装置,其特征在于,所述中间信息包括所述多个文本中单词之间的相似度、文本之间的相似度、单词的权重值、以及单词属性的定义等信息中的任意一种或多种。
  18. 根据权利要求15至17任一项所述的装置,其特征在于,所述调整操作,包括所述多个文本中单词属性的定义操作、单词之间关联性定义操作、文本之间关联性定义操作、聚类类目定义操作、噪音标注操作以及聚类类目特征的标注操作中的任意一种或多种。
  19. 根据权利要求15至18任一项所述的装置,其特征在于,所述聚类模块,具体用于:
    计算所述多个文本中不同文本之间的相似度;
    根据所述不同文本之间的相似度,计算所述多个文本中不同文本与聚类类目之间的相似度,并基于所述不同文本与聚类类目之间的相似度确定所述初始聚类结果;
    计算用于表征聚类类目特征的文本与关键词。
  20. 根据权利要求15至19任一项所述的装置,其特征在于,所述多个文本中包括标准文本以及待聚类文本,所述标准文本已完成聚类;
    所述聚类模块,具体用于根据所述标准文本对所述待聚类文本进行聚类。
  21. 根据权利要求15至20任一项所述的装置,其特征在于,所述装置还包括:
    预处理模块,对所述多个文本进行预处理,所述预处理包括分词、错误纠正、去噪、去除停用词以及词性检测中的任意一种或多种;
    所述聚类模块,具体用于对经过预处理的多个文本进行聚类,得到所述初始聚类结果。
  22. 一种计算机系统,其特征在于,所述计算机系统包括至少一个计算机,所述至少一个计算机包括处理器和存储器;
    所述至少一个计算机的处理器用于执行所述至少一个计算机的存储器中存储的指令,执行如权利要求8至14任一项所述的方法。
  23. 一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行如权利要求8至14中任一项所述的方法。
PCT/CN2021/117691 2020-09-10 2021-09-10 一种文本聚类系统、方法、装置、设备及介质 WO2022053018A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010947082.6A CN114168729A (zh) 2020-09-10 2020-09-10 一种文本聚类系统、方法、装置、设备及介质
CN202010947082.6 2020-09-10

Publications (1)

Publication Number Publication Date
WO2022053018A1 true WO2022053018A1 (zh) 2022-03-17

Family

ID=80475606

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/117691 WO2022053018A1 (zh) 2020-09-10 2021-09-10 一种文本聚类系统、方法、装置、设备及介质

Country Status (2)

Country Link
CN (1) CN114168729A (zh)
WO (1) WO2022053018A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101571868A (zh) * 2009-05-25 2009-11-04 北京航空航天大学 一种基于信息瓶颈理论的文档聚类方法
CN102999516A (zh) * 2011-09-15 2013-03-27 北京百度网讯科技有限公司 一种文本分类的方法及装置
US10049148B1 (en) * 2014-08-14 2018-08-14 Medallia, Inc. Enhanced text clustering based on topic clusters
CN109508374A (zh) * 2018-11-19 2019-03-22 云南电网有限责任公司信息中心 基于遗传算法的文本数据半监督聚类方法
WO2020174672A1 (en) * 2019-02-28 2020-09-03 Nec Corporation Visualization method, visualization device and computer-readable storage medium

Also Published As

Publication number Publication date
CN114168729A (zh) 2022-03-11


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21866070

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21866070

Country of ref document: EP

Kind code of ref document: A1