CN112966104B - Text clustering method, text clustering device, text processing equipment and storage medium - Google Patents

Text clustering method, text clustering device, text processing equipment and storage medium

Info

Publication number
CN112966104B
Authority
CN
China
Prior art keywords
text
clustering
hyper-parameter
target
Prior art date
Legal status
Active
Application number
CN202110238054.1A
Other languages
Chinese (zh)
Other versions
CN112966104A (en)
Inventor
浦嘉澍
毛晓曦
范长杰
胡志鹏
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202110238054.1A
Publication of CN112966104A
Application granted
Publication of CN112966104B

Classifications

    • G06F16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/23: Pattern recognition; clustering techniques
    • G06F18/25: Pattern recognition; fusion techniques


Abstract

The invention provides a text clustering method, a text clustering device, a processing device and a storage medium, and relates to the technical field of data processing. The method comprises the following steps: recognizing a text to be processed with each of a plurality of preset language models to obtain text features, where the different language models are text feature recognition models trained in advance on different feature learning text data obtained from sample dialogue texts; clustering the text features output by each language model with the clustering algorithm corresponding to that language model to obtain a clustering result; and fusing the plurality of clustering results to obtain a target clustering result for the text to be processed. Recognizing the text to be processed with a plurality of language models yields a plurality of text features, making the recognized text features more accurate; clustering each set of text features with its corresponding clustering algorithm then yields a plurality of clustering results, and fusing those results into a target clustering result improves the accuracy of the clustering result as well.

Description

Text clustering method, text clustering device, text processing equipment and storage medium
Technical Field
The invention relates to the technical field of data processing, and in particular to a text clustering method, a text clustering device, a processing device and a storage medium.
Background
Automated dialogue is widely used in many industries and fields. It relies mainly on natural language understanding systems, which require preset intents to be configured; the acquisition of preset intents is therefore becoming increasingly important.
In the related art, a recognition model is used to recognize texts and obtain recognition results, the recognition results are clustered and analyzed to obtain clustering results, and the clustering results are labeled to obtain dialogue intents.
However, because the related art uses a single recognition model to obtain the recognition result, the recognition result is easily inaccurate when the amount of text to be recognized is small, which in turn makes the clustering result inaccurate.
Disclosure of Invention
The present invention aims to provide a text clustering method, a text clustering device, a processing device, and a storage medium, so as to solve the problem in the related art that, when a single recognition model is used to obtain the recognition result and the amount of text to be recognized is small, the recognition result is inaccurate and the clustering result is therefore inaccurate as well.
In order to achieve the above purpose, the embodiment of the present invention adopts the following technical solutions:
in a first aspect, an embodiment of the present invention provides a text clustering method, including:
recognizing a text to be processed with each of a plurality of preset language models to obtain text features, where the different language models are text feature recognition models trained in advance on different feature learning text data obtained from sample dialogue texts;
clustering the text features output by each language model with the clustering algorithm corresponding to that language model to obtain a clustering result;
and fusing the plurality of clustering results to obtain a target clustering result for the text to be processed.
Optionally, before the text to be processed is identified by using each of the preset multiple language models to obtain text features, the method further includes:
acquiring a plurality of feature learning text data from a sample dialog text;
and performing model training according to the feature learning text data to obtain the language models.
Optionally, the obtaining a plurality of feature learning text data from the sample dialog text includes:
acquiring the plurality of feature learning text data and the hyper-parameter learning text data corresponding to each feature learning text data from the sample dialogue text;
before clustering the text features output by each language model by adopting the clustering algorithm corresponding to each language model to obtain a clustering result, the method further comprises the following steps:
using each language model to recognize the hyper-parameter learning text data corresponding to the feature learning text data on which that language model was trained, to obtain hyper-parameter features;
searching a target hyper-parameter from the hyper-parameter feature;
and updating corresponding hyper-parameters in a preset clustering algorithm according to the target hyper-parameters to obtain the clustering algorithm corresponding to each language model.
Optionally, the obtaining the plurality of feature learning text data and the hyper-parameter learning text data corresponding to each feature learning text data from the sample dialogue text includes:
and carrying out multiple random data segmentation on the sample dialogue text, randomly acquiring one feature learning text data from the sample dialogue text according to a preset proportion in each random data segmentation process, and determining other text data in the sample dialogue text as the hyper-parameter learning text data corresponding to the one feature learning text data.
Optionally, the searching for the target hyper-parameter from the hyper-parameter feature includes:
and randomly searching the target hyper-parameter from the hyper-parameter feature.
Optionally, the randomly searching for the target hyper-parameter from the hyper-parameter feature includes:
and randomly searching the hyper-parameters corresponding to the hyper-parameter types from the hyper-parameter characteristics as the target hyper-parameters according to preset hyper-parameter types.
Optionally, the fusing the multiple clustering results to obtain the target clustering result of the text to be processed includes:
determining mutual information of each clustering result and other clustering results;
and determining the clustering result corresponding to the maximum mutual information as the target clustering result.
Optionally, the target clustering result includes: a plurality of classifications, the method further comprising:
and determining text intents according to the text features in the classification, and labeling the text intents for the text features in the classification.
In a second aspect, an embodiment of the present invention further provides a text clustering device, including:
the recognition module is used for recognizing the text to be processed with each of a plurality of preset language models to obtain text features, where the different language models are text feature recognition models trained in advance on different feature learning text data obtained from sample dialogue texts;
the clustering module is used for clustering the text features output by each language model with the clustering algorithm corresponding to that language model to obtain a clustering result;
and the fusion module is used for fusing the plurality of clustering results to obtain a target clustering result for the text to be processed.
Optionally, the apparatus further comprises:
the acquisition module is used for acquiring a plurality of feature learning text data from the sample dialogue text;
and the training module is used for performing model training on each piece of feature learning text data to obtain the language models.
Optionally, the obtaining module is further configured to obtain the plurality of feature learning text data and the hyper-parameter learning text data corresponding to each feature learning text data from the sample dialog text;
the device further comprises:
the first identification module is used for recognizing, with each language model, the hyper-parameter learning text data corresponding to the feature learning text data on which that language model was trained, to obtain the hyper-parameter features;
the searching module is used for searching the target hyper-parameter from the hyper-parameter characteristics;
and the updating module is used for updating corresponding hyper-parameters in a preset clustering algorithm according to the target hyper-parameters to obtain the clustering algorithm corresponding to each language model.
Optionally, the obtaining module is further configured to perform random data segmentation on the sample dialogue text multiple times, randomly obtain one piece of feature learning text data from the sample dialogue text at a preset proportion in each segmentation, and determine the remaining text data in the sample dialogue text as the hyper-parameter learning text data corresponding to that piece of feature learning text data.
Optionally, the searching module is further configured to randomly search the target hyper-parameter from the hyper-parameter feature.
Optionally, the searching module is further configured to randomly search, according to a preset hyper-parameter type, a hyper-parameter corresponding to the hyper-parameter type from the hyper-parameter features as the target hyper-parameter.
Optionally, the fusion module is further configured to determine mutual information between each clustering result and other clustering results; and determining the clustering result corresponding to the maximum mutual information as the target clustering result.
Optionally, the target clustering result includes: a plurality of classifications, the apparatus further comprising:
and the marking module is used for determining text intents according to the text features in the classification and marking the text intents for the text features in the classification.
In a third aspect, an embodiment of the present invention further provides a processing device, including: a memory storing a computer program executable by the processor, and a processor implementing the method of any of the first aspects when executing the computer program.
In a fourth aspect, an embodiment of the present invention further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is read and executed, the method of any one of the above first aspects is implemented.
The invention has the following beneficial effects. The embodiment of the application provides a text clustering method comprising: recognizing a text to be processed with each of a plurality of preset language models to obtain text features, where the different language models are text feature recognition models trained in advance on different feature learning text data obtained from sample dialogue texts; clustering the text features output by each language model with the clustering algorithm corresponding to that language model to obtain a clustering result; and fusing the plurality of clustering results to obtain a target clustering result for the text to be processed. Recognizing the text to be processed with a plurality of language models yields a plurality of text features, making the recognized text features more accurate; clustering each set of text features with its corresponding clustering algorithm then yields a plurality of clustering results, and fusing those results into a target clustering result improves the accuracy of the clustering results.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings show only some embodiments of the present invention and therefore should not be considered as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a text clustering method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a text clustering method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of a text clustering method according to an embodiment of the present invention;
fig. 4 is a schematic flowchart of a text clustering method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a text clustering apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description of the present application, it should be noted that terms such as "upper" and "lower", when used to indicate an orientation or positional relationship, are based on the orientation or positional relationship shown in the drawings or on the orientation or positional relationship in which the product of the application is usually placed in use. They are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be understood as limiting the application.
Furthermore, the terms "first," "second," and the like in the description and in the claims, as well as in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the features of the embodiments of the present application may be combined with each other without conflict.
In recent years, applications built on task-oriented chat robots have become ubiquitous in many areas. Although large-scale pre-trained language models have met with considerable success in open domains, task-oriented chat robots still rely on natural language understanding systems that translate user statements into specified dialogue intents and corresponding slot information. Most chat robots based on natural language understanding systems require preset intents, which means human effort must be introduced to edit the intents. In real scenes, however, the dialogue intents designed by developers may not cover the questions users actually ask, demand may change over time, and intent design is limited by the designers' own experience; the acquisition of preset intents is therefore becoming more and more important.
In the related art, a recognition model is used to recognize texts and obtain recognition results, the recognition results are clustered and analyzed to obtain clustering results, and the clustering results are labeled to obtain dialogue intents. Clustering algorithms can be divided into two categories: unsupervised clustering algorithms and semi-supervised clustering algorithms. An unsupervised clustering algorithm performs no text feature learning, whereas a semi-supervised clustering algorithm learns new text features from existing labeled data. Specifically, an unsupervised clustering algorithm includes two steps: first extracting text features, and then clustering on those features. A semi-supervised clustering algorithm includes one more step, performing feature learning before the text features are extracted.
In addition, an unsupervised clustering algorithm cannot make good use of labeled intents, even though such intents can be manually cleaned and compiled, are of high quality, and can guide subsequent intent clustering. Semi-supervised text clustering methods make up for this by using the labeled data: they learn a model for judging sentence similarity from the labeled intents and treat it as a task equivalent to clustering. These methods all make it possible to learn better text features from existing intents.
However, because the related art uses a single recognition model to obtain the recognition result, the recognition result is easily inaccurate when the amount of text to be recognized is small, which in turn makes the clustering result inaccurate.
To solve the foregoing problems in the related art, an embodiment of the present application provides a text clustering method, including: recognizing a text to be processed with each of a plurality of preset language models to obtain text features, where the different language models are text feature recognition models trained in advance on different feature learning text data obtained from sample dialogue texts; clustering the text features output by each language model with the clustering algorithm corresponding to that language model to obtain a clustering result; and fusing the plurality of clustering results to obtain a target clustering result for the text to be processed. Recognizing the text to be processed with a plurality of language models yields a plurality of text features, making the recognized text features more accurate; clustering each set of text features with its corresponding clustering algorithm then yields a plurality of clustering results, and fusing those results into a target clustering result improves the accuracy of the clustering results.
The embodiment of the present application provides a text clustering method, where an execution subject of the text clustering method may be a processing device, and the processing device may be a terminal, a server, or other types of devices with processing functions. For example, when the processing device is a terminal, the terminal may be a desktop computer, a notebook computer, or a tablet computer, among others. The following explains the text clustering method provided in the embodiment of the present application, with processing devices as execution subjects.
Fig. 1 is a schematic flow chart of a text clustering method according to an embodiment of the present invention, and as shown in fig. 1, the method may include:
s101, recognizing the text to be processed by adopting each preset language model in the plurality of language models to obtain text characteristics.
The different language models are text feature recognition models obtained by training different feature learning text data obtained based on sample dialogue texts in advance.
It should be noted that the feature learning text data corresponding to different language models may differ, but all of it comes from the sample dialogue text. Training with each piece of feature learning text data yields a corresponding text feature recognition model, giving the plurality of language models.
In one possible implementation, the preset language models may recognize the text to be processed simultaneously, each obtaining one set of text features; of course, the language models may also recognize the text to be processed in a preset order. Either way, the plurality of language models yields a plurality of text features.
In this embodiment of the present application, the text to be processed may include a plurality of dialog sentences, and the text feature may represent semantic information, sentence length information, and the like of each dialog sentence, and certainly, may also include other features of a dialog sentence, which is not specifically limited in this embodiment of the present application.
S102, clustering the text features output by each language model by adopting the clustering algorithm corresponding to that language model to obtain a clustering result.
Each language model may be preset with a corresponding clustering algorithm; that is, the language models and the clustering algorithms are in one-to-one correspondence, and the clustering algorithms may differ from one another.
In one possible implementation, the clustering algorithms may cluster the text features output by their corresponding language models simultaneously, or each may do so in a preset order; the plurality of clustering algorithms correspondingly yields a plurality of clustering results.
It should be noted that a plurality of clustering algorithms may belong to the same type of algorithm, but some parameters of the plurality of clustering algorithms may be different. For example, the plurality of clustering algorithms may be density clustering algorithms. The density clustering algorithm can simultaneously perform intention clustering and outlier rejection, and does not need to set the size of a cluster.
For example, the density clustering algorithm may be any one of the following: the DBSCAN algorithm, the OPTICS algorithm, or the HDBSCAN algorithm. The OPTICS algorithm can adapt to multiple regions with different densities and is highly flexible.
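For illustration only, a minimal sketch of this per-model clustering step is given below. It assumes Python and scikit-learn's DBSCAN implementation, neither of which the patent prescribes, and it follows DBSCAN's convention that the label -1 marks outliers; the cosine metric is likewise an assumption made for text features.

    # Minimal sketch: cluster one language model's text features with DBSCAN.
    # eps and min_samples stand in for the distance threshold and sample
    # threshold that the hyper-parameter search described later would supply.
    import numpy as np
    from sklearn.cluster import DBSCAN

    def cluster_features(features: np.ndarray, eps: float, min_samples: int) -> np.ndarray:
        """Cluster an (N, 768) feature matrix; DBSCAN labels outliers as -1."""
        return DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(features)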
S103, fusing the plurality of clustering results to obtain a target clustering result of the text to be processed.
Each clustering result can represent multiple categories of the text to be processed, and the target clustering result can also represent multiple categories of the text to be processed.
In addition, the target clustering result may be one of multiple clustering results, or may be a result obtained by processing multiple clustering results, which is not specifically limited in the embodiment of the present application.
In some embodiments, the processing device may fuse the multiple clustering results using a preset fusion algorithm, a preset fusion model, or some other manner to obtain the target clustering result of the text to be processed; this is not specifically limited in the embodiment of the present application.
In summary, an embodiment of the present application provides a text clustering method, including: recognizing a text to be processed with each of a plurality of preset language models to obtain text features, where the different language models are text feature recognition models trained in advance on different feature learning text data obtained from sample dialogue texts; clustering the text features output by each language model with the clustering algorithm corresponding to that language model to obtain a clustering result; and fusing the plurality of clustering results to obtain a target clustering result for the text to be processed. Recognizing the text to be processed with a plurality of language models yields a plurality of text features, making the recognized text features more accurate; clustering each set of text features with its corresponding clustering algorithm then yields a plurality of clustering results, and fusing those results into a target clustering result improves the accuracy of the clustering results.
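To make the three steps S101 to S103 concrete, the following Python sketch wires them together. Every name in it (encode, the clustering callables, fuse) is a placeholder assumption rather than an interface defined by the patent.

    # Illustrative end-to-end pipeline for S101-S103; helper names are
    # placeholders, not APIs defined by the patent.
    def text_clustering(texts, language_models, clustering_algorithms, fuse):
        clustering_results = []
        for model, cluster in zip(language_models, clustering_algorithms):
            features = model.encode(texts)                 # S101: one feature set per model
            clustering_results.append(cluster(features))   # S102: matching clustering algorithm
        return fuse(clustering_results)                    # S103: fuse into the target result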
Optionally, fig. 2 is a schematic flow chart of a text clustering method according to an embodiment of the present invention, and as shown in fig. 2, before a process of identifying a text to be processed by using each of a plurality of preset language models in S101 to obtain text features, the method may further include:
s201, obtaining a plurality of feature learning text data from the sample dialogue text.
The sentences in the sample dialogue text can be labeled with category labels, and correspondingly, the feature learning text data can be labeled with category labels.
In addition, the pieces of feature learning text data may differ from one another. The processing device may select a portion of the data directly from the sample dialogue text as feature learning text data. The feature learning text data may also be referred to as representation learning text data.
It should be noted that the processing device may sequentially obtain multiple pieces of feature learning text data from the sample dialog text, may simultaneously obtain multiple pieces of feature learning text data from the sample dialog text, and may also obtain multiple pieces of feature learning text data in other manners, which is not limited in this embodiment of the present application.
In one possible implementation, the sample dialogue text may be a dialogue log between a player and a robot in a game, a dialogue log between players, or a chat record between people in social software; this is not specifically limited by the embodiment of the present application.
And S202, respectively carrying out model training according to the feature learning text data to obtain a plurality of language models.
The processing device may be preset with a plurality of initial language models, and the number of the initial language models, the feature learning text data, and the language models may be the same.
In addition, the processing device may perform model training on a corresponding initial language model by using each feature learning text data to obtain a language model. And performing model training according to the plurality of feature learning text data and the plurality of initial language models to obtain a plurality of language models.
In some embodiments, during model training the processing device may update the network parameters in the initial language model based on a preset loss function, a preset optimizer, and a preset number of epochs (a hyper-parameter that defines how many times the learning algorithm passes over the whole training data set), training to obtain the plurality of language models. The language model may be a BERT model.
In summary, a plurality of feature learning text data are obtained from the sample dialog text, and model training is performed according to the plurality of feature learning text data, so as to obtain a plurality of language models. The acquired language model is more accurate, and the problem of overfitting is avoided.
Alternatively, the preset loss function may be a cross entropy loss function based on classification, and the preset optimizer may be an Adam (adaptive moment estimation) optimizer.
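Under these choices, one piece of feature learning text data could be used to fine-tune one BERT language model roughly as sketched below. The Hugging Face transformers API and the bert-base-chinese checkpoint are assumptions made for illustration, not choices fixed by the patent.

    # Sketch: fine-tune one BERT model on one piece of feature learning text
    # data (sentences with category labels), with the classification
    # cross-entropy loss and Adam optimizer mentioned above.
    import torch
    from torch.utils.data import DataLoader
    from transformers import BertForSequenceClassification, BertTokenizer

    def train_language_model(sentences, labels, num_classes, epochs=3):
        tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
        model = BertForSequenceClassification.from_pretrained(
            "bert-base-chinese", num_labels=num_classes)
        optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
        batches = DataLoader(list(zip(sentences, labels)), batch_size=16, shuffle=True)
        model.train()
        for _ in range(epochs):              # the epoch count is a hyper-parameter
            for texts, ys in batches:
                enc = tokenizer(list(texts), padding=True, truncation=True,
                                return_tensors="pt")
                loss = model(**enc, labels=torch.as_tensor(ys)).loss  # cross-entropy
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model

Repeating this with each piece of feature learning text data yields the plurality of language models.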
Optionally, obtaining a plurality of feature learning text data from the sample dialog text includes:
and acquiring a plurality of feature learning text data and the hyper-parameter learning text data corresponding to each feature learning text data from the sample dialogue text.
The processing device may obtain a first part of data from the sample dialogue text as one piece of feature learning text data, and a second part of data as the hyper-parameter learning text data corresponding to that feature learning text data.
It should be noted that the first part of data and the second part of data may directly constitute a sample dialog text; the sample dialog text may also be composed of the first part of data, the second part of data, and other data, which is not specifically limited in this embodiment of the present application.
In summary, the plurality of feature learning text data and the hyper-parameter learning text data corresponding to each feature learning text data are obtained directly from the sample dialogue text. Because the sample dialogue text is limited, obtaining the feature learning text data and the corresponding hyper-parameter learning text data in this way avoids the overfitting problem.
Optionally, fig. 3 is a schematic flow chart of a text clustering method according to an embodiment of the present invention, as shown in fig. 3, before the process of clustering text features output by each language model by using a clustering algorithm corresponding to each language model in S102 to obtain a clustering result, the method may further include:
s301, recognizing the hyper-parameter learning text data corresponding to the feature learning text data adopted by training each language model by adopting each language model to obtain the hyper-parameter features.
The hyper-parameter feature may be referred to as a text representation feature.
Additionally, the language model may include multiple layers, each of which outputs a tensor of size L x 768, where L is the length of the input sequence. The layers are indexed in order, and the tensor output by an earlier layer influences the tensors output by the later layers.
In one possible implementation, the processing device may take the L x 768 tensor output by a target layer among the multiple layers and perform a pooling operation on it, outputting a 768-dimensional vector as the hyper-parameter feature. The language model may include 12 layers, and the target layer may be the 10th layer.
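A minimal sketch of this extraction step follows; it assumes a Hugging Face BERT model and uses mean pooling, which is one common pooling choice the patent does not fix.

    # Sketch: extract a 768-dimensional hyper-parameter feature by pooling the
    # L x 768 tensor output by the target layer (layer 10 of 12, as above).
    import torch

    def extract_hyperparam_features(model, tokenizer, sentences, target_layer=10):
        enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc, output_hidden_states=True)
        hidden = out.hidden_states[target_layer]        # (batch, L, 768)
        mask = enc["attention_mask"].unsqueeze(-1)      # ignore padding positions
        return (hidden * mask).sum(1) / mask.sum(1)     # (batch, 768) pooled vectors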
S302, searching the target hyper-parameter from the hyper-parameter characteristics.
In this embodiment of the present application, the processing device may search for the target hyper-parameter from the hyper-parameter features using a preset search algorithm, a preset search model, or some other manner; this is not specifically limited in this embodiment of the present application.
S303, updating corresponding hyper-parameters in a preset clustering algorithm according to the target hyper-parameters to obtain a clustering algorithm corresponding to each language model.
The corresponding hyper-parameter in the preset clustering algorithm may be referred to as an initial hyper-parameter.
In some embodiments, the processing device replaces the initial hyper-parameter in the preset clustering algorithm with the corresponding target hyper-parameter; the target hyper-parameter can also be processed to obtain a processed target hyper-parameter, and the initial hyper-parameter in the preset clustering algorithm is replaced by the corresponding processed target hyper-parameter; this is not particularly limited by the embodiments of the present application.
Of course, the preset clustering algorithm may also have no initial hyper-parameter set, in which case the processing device may directly write the target hyper-parameter into the corresponding position in the preset clustering algorithm to obtain the clustering algorithm corresponding to each language model.
In conclusion, each language model is used to recognize the hyper-parameter learning text data corresponding to the feature learning text data on which it was trained, obtaining hyper-parameter features; a target hyper-parameter is searched from the hyper-parameter features; and the corresponding hyper-parameters in a preset clustering algorithm are updated according to the target hyper-parameter to obtain the clustering algorithm corresponding to each language model. Determination of the clustering algorithm is thus fully automatic, and the hyper-parameters of the clustering algorithm need not be preset manually.
Optionally, the obtaining the plurality of feature learning text data and the hyper-parameter learning text data corresponding to each feature learning text data from the sample dialogue text includes:
and carrying out random data segmentation on the sample dialogue text for multiple times, randomly acquiring feature learning text data from the sample dialogue text in a preset proportion in the random data segmentation process every time, and determining other text data in the sample dialogue text as hyper-parameter learning text data corresponding to the feature learning text data.
The processing device may preset the number of times and the ratio of division.
In some embodiments, at each time of segmentation, the processing device may segment the sample dialog text by using a segmentation scale to obtain two portions of data, one of which is used as the feature learning text data, and the other is used as the hyper-parameter learning text data corresponding to the feature learning text data. And performing multiple random data segmentation to obtain multiple feature learning text data and hyper-parameter learning text data corresponding to each feature learning text data.
Note that the segmentation ratio used for each random data segmentation is the same, but the feature learning text data and corresponding hyper-parameter learning text data obtained in each segmentation differ. The preset segmentation ratio may be 50 percent, that is, the feature learning text data and the corresponding hyper-parameter learning text data each account for half of the sample dialogue text. Of course, the preset segmentation ratio may also take other values, which is not specifically limited in the embodiment of the present application.
In the embodiment of the application, a piece of feature learning text data and its corresponding hyper-parameter learning text data do not overlap, which maximizes the generalization ability of the language model. Introducing the feature learning text data can also greatly reduce the clustering difficulty in a high-dimensional vector space.
It should be noted that, because the sample dialogue text is limited, a single segmentation of it into feature learning text data and hyper-parameter learning text data may cause an overfitting problem. The sample dialogue text is therefore segmented randomly multiple times, yielding multiple groups of feature learning text data and hyper-parameter learning text data, so that the overfitting problem is avoided.
In summary, the sample dialog text is subjected to multiple random data segmentations, one feature learning text data is randomly obtained from the sample dialog text in a preset proportion in each random data segmentation process, and other text data in the sample dialog text is determined as the hyper-parameter learning text data corresponding to the one feature learning text data. The feature learning text data and the hyper-parameter learning text data are not overlapped, and the generalization capability of the language model is improved.
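A minimal sketch of this repeated 50/50 random segmentation is given below; representing the sample dialogue text as a list of sentences is an assumption of the sketch.

    # Sketch: segment the sample dialogue text several times. In each round a
    # random half becomes feature learning data and the remaining half the
    # matching hyper-parameter learning data; the two never overlap in a round.
    import random

    def random_splits(sample_dialog_text, rounds, ratio=0.5, seed=0):
        rng = random.Random(seed)
        splits = []
        for _ in range(rounds):
            shuffled = list(sample_dialog_text)
            rng.shuffle(shuffled)
            cut = int(len(shuffled) * ratio)
            splits.append((shuffled[:cut], shuffled[cut:]))
        return splits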
Optionally, the process of searching the target hyper-parameter from the hyper-parameter feature in S302 may include:
and randomly searching the target hyper-parameter from the hyper-parameter characteristics.
In one possible implementation, the processing device may use a random search algorithm to search for the target hyper-parameter from the hyper-parameter features. There may be more than one target hyper-parameter.
Optionally, the process of randomly searching the target hyper-parameter from the hyper-parameter feature may include:
and randomly searching hyper-parameters corresponding to the hyper-parameter types from the hyper-parameter characteristics as target hyper-parameters according to preset hyper-parameter types.
There may be more than one preset hyper-parameter type.
In some embodiments, the processing device may randomly search the hyper-parameters corresponding to the hyper-parameter types from the hyper-parameter features according to each preset hyper-parameter type to obtain a plurality of hyper-parameters corresponding to the preset hyper-parameter types, so as to obtain the target hyper-parameters.
It should be noted that the processing device may randomly search the hyper-parameters corresponding to the hyper-parameter types from the hyper-parameter features according to each preset hyper-parameter type; the method may also adopt a preset sequence, sequentially and randomly search the hyper-parameters corresponding to the hyper-parameter type from the hyper-parameter features according to each preset hyper-parameter type, and may also adopt other ways to determine the target hyper-parameters, which is not specifically limited in the embodiment of the present application.
Optionally, the preset hyper-parameter types may include a distance threshold hyper-parameter type and a sample threshold hyper-parameter type; a distance threshold and a sample threshold are then obtained by random search from the hyper-parameter features and taken as the target hyper-parameters.
In summary, according to the preset hyper-parameter types, the hyper-parameters corresponding to each type are randomly searched from the hyper-parameter features as the target hyper-parameters. The hyper-parameters can thus be determined automatically, without manual presetting.
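For illustration, a random search over these two hyper-parameter types could look like the sketch below. The candidate ranges and the selection criterion score are assumptions of the sketch, since the patent does not fix a specific objective for the search.

    # Sketch: random search for a distance threshold (eps) and a sample
    # threshold (min_samples) evaluated on the hyper-parameter features.
    import random
    from sklearn.cluster import DBSCAN

    def random_search(features, score, trials=50, seed=0):
        rng = random.Random(seed)
        best, best_score = None, float("-inf")
        for _ in range(trials):
            eps = rng.uniform(0.05, 1.0)       # candidate distance threshold
            min_samples = rng.randint(2, 20)   # candidate sample threshold
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
            s = score(features, labels)        # caller-supplied criterion (assumed)
            if s > best_score:
                best, best_score = (eps, min_samples), s
        return best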
Optionally, fig. 4 is a schematic flow chart of the text clustering method according to the embodiment of the present invention, and as shown in fig. 4, the process of fusing the multiple clustering results in S103 to obtain the target clustering result of the text to be processed may include:
s401, determining mutual information of each clustering result and other clustering results.
The mutual information may be used to characterize the mutual relationship, i.e. the degree of association, between each clustering result and other clustering results. In addition, the other clustering results are clustering results except one clustering result in the plurality of clustering results.
In a possible implementation manner, the processing device may calculate mutual information between each clustering result and other clustering results, or may calculate mutual information between each clustering result and multiple clustering results, which is not specifically limited in this embodiment of the present application.
S402, determining the clustering result corresponding to the maximum mutual information as a target clustering result.
In some embodiments, a plurality of mutual information may be obtained by using the process of S401, the mutual information is sorted to determine the maximum mutual information, and the clustering result corresponding to the maximum mutual information is determined as the target clustering result.
Optionally, the processing device may perform the processes of S401 to S402 by using a BOK fusion model.
In another possible implementation, the processing device may instead use a CHM fusion model (a graph-based fusion method), which generates three results, namely CSPA, HGPA, and MCLA, according to different properties of the graph; it then calculates the Normalized Mutual Information between each of the CSPA, HGPA, and MCLA results and the multiple clustering results, and takes the result with the maximum mutual information as the final clustering output. CSPA is a cluster-based similarity partitioning algorithm, HGPA is a hypergraph-based partitioning algorithm, and MCLA is a meta-clustering-based partitioning algorithm.
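A minimal sketch of the fusion in S401 to S402 follows, using scikit-learn's normalized mutual information. Summing each result's NMI against all the other results is one reading of "mutual information with other clustering results" and is an assumption of the sketch.

    # Sketch: keep the clustering result whose total normalized mutual
    # information (NMI) against all the other results is largest.
    import numpy as np
    from sklearn.metrics import normalized_mutual_info_score

    def fuse_by_mutual_information(clusterings):
        """clusterings: a list of equal-length label arrays, one per model."""
        totals = [
            sum(normalized_mutual_info_score(a, b)
                for j, b in enumerate(clusterings) if j != i)
            for i, a in enumerate(clusterings)
        ]
        return clusterings[int(np.argmax(totals))]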
In summary, mutual information between each clustering result and other clustering results is determined, and the clustering result corresponding to the largest mutual information is determined as the target clustering result. The determined target clustering result can be more accurate.
Optionally, the target clustering result includes: a plurality of classifications, the method further comprising:
and determining text intents according to the text features in the classification, and labeling the text intents for the text features in the classification.
Each classification may include a plurality of text features, and each text feature corresponds to a piece of text data. Text features in the same classification may be labeled with the same initial label.
In one possible implementation, the processing device determines a text intent of a classification based on text features in the classification, and updates initial labels of the text features in the classification to the text intent. Similarly, text intents can be labeled for text features in multiple classifications.
Optionally, the designer may determine a text intention of a classification according to the text feature in the classification, and perform an editing intention operation, and the processing device may determine the text intention of the classification in response to the input editing intention operation, and update the initial label of the text feature in the classification to the text intention.
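As an illustrative sketch of this labeling step (the dictionary-based layout is an assumption, not a structure defined by the patent):

    # Sketch: once a text intent has been determined for each classification,
    # relabel every sentence in a cluster with that cluster's text intent.
    def label_intents(sentences, cluster_labels, intent_by_cluster):
        """intent_by_cluster: {cluster_id: edited text intent}."""
        return [
            {"text": s, "intent": intent_by_cluster.get(c, "unlabeled")}
            for s, c in zip(sentences, cluster_labels)
        ]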
In this embodiment of the present application, the processing device may use basic new intention discovery frameworks to execute the text clustering method provided herein. Each basic new intention discovery framework may comprise a language model, a corresponding clustering algorithm (a density clustering model), and a fusion model (fusion algorithm), and there may be multiple basic new intention discovery frameworks. One clustering result may be generated by a single basic new intention discovery framework, and the plurality of clustering results by a plurality of such frameworks.
It should be noted that the processing device may set preset parameters and perform initialization, including setting the segmentation ratio, setting the number of basic new intention discovery frameworks, determining the type of the fusion model and the type of the clustering algorithm, and initializing the language model. The language model may be initialized with pre-trained parameters.
In summary, an embodiment of the present application provides a text clustering method, including: recognizing a text to be processed with each of a plurality of preset language models to obtain text features, where the different language models are text feature recognition models trained in advance on different feature learning text data obtained from sample dialogue texts; clustering the text features output by each language model with the clustering algorithm corresponding to that language model to obtain a clustering result; and fusing the plurality of clustering results to obtain a target clustering result for the text to be processed. Recognizing the text to be processed with a plurality of language models yields a plurality of text features, making the recognized text features more accurate; clustering each set of text features with its corresponding clustering algorithm then yields a plurality of clustering results, and fusing those results into a target clustering result improves the accuracy of the clustering results as well. Moreover, the sample dialogue texts are used to the greatest extent, and their hidden information is used to guide the subsequent clustering; learning and error correction can be performed on the labeled sample dialogue texts; and no hyper-parameter needs to be set manually, so the process can be fully automated. This labeling approach has a low cost and good data utilization, and can minimize the cost of human resources.
In addition, the text clustering method provided by the embodiment of the application is suitable for chat logs containing a large number of outliers, which are common in real logs; based on this text clustering method, thousands of high-frequency intents have been found in real logs, so the method has strong practicability.
Fig. 5 is a schematic structural diagram of a text clustering device according to an embodiment of the present invention, and as shown in fig. 5, the device may include:
the recognition module 501 is configured to recognize a text to be processed by using each of a plurality of preset language models to obtain text features, where the different language models are text feature recognition models obtained by training different feature learning text data obtained in advance based on a sample dialog text;
a clustering module 502, configured to cluster the text features output by each language model by using a clustering algorithm corresponding to each language model to obtain a clustering result;
and the fusion module 503 is configured to fuse the multiple clustering results to obtain a target clustering result of the text to be processed.
Optionally, the apparatus further comprises:
the acquisition module is used for acquiring a plurality of feature learning text data from the sample dialogue text;
and the training module is used for performing model training on each piece of feature learning text data to obtain the plurality of language models.
Optionally, the obtaining module is further configured to obtain a plurality of feature learning text data from the sample dialog text, and the hyper-parameter learning text data corresponding to each feature learning text data;
the device still includes:
the first identification module is used for recognizing, with each language model, the hyper-parameter learning text data corresponding to the feature learning text data on which that language model was trained, to obtain the hyper-parameter features;
the searching module is used for searching the target hyper-parameter from the hyper-parameter characteristics;
and the updating module is used for updating corresponding hyper-parameters in the preset clustering algorithm according to the target hyper-parameters to obtain the clustering algorithm corresponding to each language model.
Optionally, the obtaining module is further configured to perform random data segmentation on the sample dialogue text multiple times, randomly obtain one piece of feature learning text data from the sample dialogue text at a preset proportion in each segmentation, and determine the remaining text data in the sample dialogue text as the hyper-parameter learning text data corresponding to that piece of feature learning text data.
Optionally, the searching module is further configured to randomly search the target hyper-parameter from the hyper-parameter feature.
Optionally, the searching module is further configured to randomly search, according to the preset hyper-parameter type, a hyper-parameter corresponding to the hyper-parameter type from the hyper-parameter features as a target hyper-parameter.
Optionally, the fusion module 503 is further configured to determine mutual information between each clustering result and the other clustering results, and to determine the clustering result corresponding to the maximum mutual information as the target clustering result.
Optionally, the target clustering result includes: a plurality of classifications, the apparatus further comprising:
and the marking module is used for determining the text intention according to the text characteristics in the classification and marking the text intention for each text characteristic in the classification.
The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
These modules may be one or more integrated circuits configured to implement the above methods, such as: one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), among others. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
Fig. 6 is a schematic structural diagram of a processing apparatus according to an embodiment of the present invention, and as shown in fig. 6, the processing apparatus may include: a processor 801 and a memory 802. The processing device may be a terminal or a server.
The memory 802 is used for storing programs, and the processor 801 calls the programs stored in the memory 802 to execute the above-mentioned method embodiments. The specific implementation and technical effects are similar, and are not described herein again.
Optionally, the present invention also provides a program product, for example a computer-readable storage medium, comprising a program which, when being executed by a processor, is adapted to carry out the above-mentioned method embodiments.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above are only preferred embodiments of the present invention and are not intended to limit it; various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A text clustering method, comprising:
performing multiple rounds of random data segmentation on a sample dialogue text, wherein in each round of random data segmentation one piece of feature-learning text data is randomly obtained from the sample dialogue text according to a preset proportion, and the remaining text data in the sample dialogue text is determined as the hyper-parameter-learning text data corresponding to that piece of feature-learning text data;
performing model training on each piece of feature-learning text data respectively, to obtain a plurality of language models;
recognizing a text to be processed with each language model of the plurality of language models, to obtain text features;
recognizing, with each language model, the hyper-parameter-learning text data corresponding to the feature-learning text data used to train that language model, to obtain hyper-parameter features;
searching for a target hyper-parameter from the hyper-parameter features;
updating the corresponding hyper-parameter in a preset clustering algorithm according to the target hyper-parameter, to obtain a clustering algorithm corresponding to each language model;
clustering the text features output by each language model with the clustering algorithm corresponding to that language model, to obtain a clustering result; and
fusing the plurality of clustering results to obtain a target clustering result of the text to be processed.
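By way of illustration only, the following Python sketch traces the flow recited in claim 1. It rests on assumptions the claim does not make: TfidfVectorizer stands in for the trained language model, DBSCAN for the preset clustering algorithm, and a fixed eps for the searched target hyper-parameter (the search itself is sketched after claim 3, and the fusion after claim 4).

```python
import random

from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer


def random_split(sample_texts, ratio=0.8, seed=0):
    """One round of 'random data segmentation': draw feature-learning text
    data at a preset proportion; the remainder becomes the corresponding
    hyper-parameter-learning text data."""
    shuffled = list(sample_texts)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


def clustering_rounds(sample_texts, texts_to_process, n_rounds=5):
    """Run one language-model-plus-clustering pass per random split and
    collect one clustering result per round; fusion is sketched separately."""
    results = []
    for seed in range(n_rounds):
        feature_data, hyper_data = random_split(sample_texts, seed=seed)
        model = TfidfVectorizer().fit(feature_data)   # stand-in "language model"
        hyper_features = model.transform(hyper_data)  # hyper-parameter features
        # Placeholder: the target hyper-parameter would be searched from
        # hyper_features (see claims 2-3); a fixed value keeps this runnable.
        eps = 0.5
        text_features = model.transform(texts_to_process)
        results.append(DBSCAN(eps=eps, min_samples=2).fit_predict(text_features))
    return results
```

Because each round trains on a different random split, the per-round models, and hence their clustering results, differ; the fusion recited in the last limitation exploits exactly that diversity.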
2. The method of claim 1, wherein searching for the target hyper-parameter from the hyper-parameter features comprises:
randomly searching for the target hyper-parameter from the hyper-parameter features.
3. The method of claim 2, wherein randomly searching for the target hyper-parameter from the hyper-parameter features comprises:
according to a preset hyper-parameter type, randomly searching the hyper-parameter features for a hyper-parameter corresponding to the hyper-parameter type, as the target hyper-parameter.
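A minimal sketch of the random search recited in claims 2-3, under assumptions the claims leave open: the preset hyper-parameter types are taken to be DBSCAN's eps and min_samples, and the silhouette score is used as the criterion for ranking candidates evaluated on the hyper-parameter features.

```python
import random

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score


def random_search_hyperparams(hyper_features, n_trials=20, seed=0):
    """Randomly sample candidates for the preset hyper-parameter types,
    cluster the hyper-parameter features with each candidate, and keep the
    best-scoring candidate as the target hyper-parameter."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_trials):
        candidate = {"eps": rng.uniform(0.1, 2.0),
                     "min_samples": rng.randint(2, 10)}
        labels = DBSCAN(**candidate).fit_predict(hyper_features)
        if len(set(labels) - {-1}) < 2:  # silhouette needs >= 2 clusters
            continue
        score = silhouette_score(hyper_features, labels)
        if score > best_score:
            best, best_score = candidate, score
    return best or {"eps": 0.5, "min_samples": 2}  # fallback if nothing scored
```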
4. The method according to any one of claims 1-3, wherein fusing the plurality of clustering results to obtain the target clustering result of the text to be processed comprises:
determining mutual information between each clustering result and the other clustering results; and
determining the clustering result corresponding to the maximum mutual information as the target clustering result.
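A minimal sketch of this fusion step, assuming normalized mutual information (NMI) as the concrete measure; the claim itself says only "mutual information". Each per-model clustering result is scored by its summed agreement with all the other results, and the highest-scoring one is returned as the target clustering result.

```python
from sklearn.metrics import normalized_mutual_info_score


def fuse_by_mutual_information(results):
    """results: one cluster-label array per language model. Return the
    result with the largest total NMI against all the other results."""
    totals = [
        sum(normalized_mutual_info_score(a, b)
            for j, b in enumerate(results) if j != i)
        for i, a in enumerate(results)
    ]
    return results[totals.index(max(totals))]
```

Combined with the earlier sketches, fuse_by_mutual_information(clustering_rounds(sample_texts, texts_to_process)) would run the claimed pipeline end to end.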
5. The method of any one of claims 1-3, wherein the target clustering result comprises a plurality of classifications, and the method further comprises:
determining a text intent according to the text features in each classification, and labeling the text features in the classification with the text intent.
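Claim 5 leaves the intent-determination step unspecified. One hypothetical stand-in, shown below, tags each classification with its most salient terms; label_clusters and its mean-TF-IDF heuristic are illustrative assumptions, not the patented method.

```python
import numpy as np


def label_clusters(texts, labels, vectorizer, top_k=3):
    """Tag each classification with the top-k terms having the highest mean
    TF-IDF weight among its member texts (the noise label -1 is skipped)."""
    X = vectorizer.transform(texts)
    vocab = np.array(vectorizer.get_feature_names_out())
    labels = np.asarray(labels)
    intents = {}
    for c in sorted(set(labels.tolist()) - {-1}):
        mean_tfidf = np.asarray(X[labels == c].mean(axis=0)).ravel()
        intents[c] = " / ".join(vocab[mean_tfidf.argsort()[::-1][:top_k]])
    return intents
```

Here vectorizer would be a fitted TfidfVectorizer and labels the target clustering result from the fusion step.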
6. A text clustering apparatus, comprising:
an acquisition module, configured to perform multiple rounds of random data segmentation on a sample dialogue text, wherein in each round of random data segmentation one piece of feature-learning text data is randomly obtained from the sample dialogue text according to a preset proportion, and the remaining text data in the sample dialogue text is determined as the hyper-parameter-learning text data corresponding to that piece of feature-learning text data;
a training module, configured to perform model training on each piece of feature-learning text data, to obtain a plurality of language models;
a recognition module, configured to recognize a text to be processed with each language model of the plurality of language models, to obtain text features;
a first recognition module, configured to recognize, with each language model, the hyper-parameter-learning text data corresponding to the feature-learning text data used to train that language model, to obtain hyper-parameter features;
a searching module, configured to search for a target hyper-parameter from the hyper-parameter features;
an updating module, configured to update the corresponding hyper-parameter in a preset clustering algorithm according to the target hyper-parameter, to obtain a clustering algorithm corresponding to each language model;
a clustering module, configured to cluster the text features output by each language model with the clustering algorithm corresponding to that language model, to obtain a clustering result; and
a fusion module, configured to fuse the plurality of clustering results to obtain a target clustering result of the text to be processed.
7. A processing device, comprising: a processor and a memory, the memory storing a computer program executable by the processor, wherein the processor, when executing the computer program, implements the method of any one of claims 1-5.
8. A storage medium having a computer program stored thereon, wherein the computer program, when read and executed, implements the method of any one of claims 1-5.
CN202110238054.1A 2021-03-04 2021-03-04 Text clustering method, text clustering device, text processing equipment and storage medium Active CN112966104B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110238054.1A CN112966104B (en) 2021-03-04 2021-03-04 Text clustering method, text clustering device, text processing equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112966104A (en) 2021-06-15
CN112966104B (en) 2022-07-12

Family

ID=76276663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110238054.1A Active CN112966104B (en) 2021-03-04 2021-03-04 Text clustering method, text clustering device, text processing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112966104B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1599766A2 (en) * 2003-03-05 2005-11-30 Centre National De La Recherche Scientifique (Cnrs) Mems device comprising an actuator generating a hysteresis driving motion
CN105468713A (en) * 2015-11-19 2016-04-06 西安交通大学 Multi-model fused short text classification method
CN108664633A (en) * 2018-05-15 2018-10-16 南京大学 A method of carrying out text classification using diversified text feature
CN109189931A (en) * 2018-09-05 2019-01-11 腾讯科技(深圳)有限公司 A kind of screening technique and device of object statement
CN111400499A (en) * 2020-03-24 2020-07-10 网易(杭州)网络有限公司 Training method of document classification model, document classification method, device and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Improved multi-view-based text clustering method; Wang Weihong et al.; Journal of Zhejiang University of Technology; 2021-01-25; pp. 1-8 *

Also Published As

Publication number Publication date
CN112966104A (en) 2021-06-15

Similar Documents

Publication Publication Date Title
CN109918673B (en) Semantic arbitration method and device, electronic equipment and computer-readable storage medium
CN110188202B (en) Training method and device of semantic relation recognition model and terminal
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN112070138B (en) Construction method of multi-label mixed classification model, news classification method and system
CN110598869B (en) Classification method and device based on sequence model and electronic equipment
CN111666766A (en) Data processing method, device and equipment
CN114691525A (en) Test case selection method and device
CN112836053A (en) Man-machine conversation emotion analysis method and system for industrial field
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN115086182A (en) Mail recognition model optimization method and device, electronic equipment and storage medium
CN110969005B (en) Method and device for determining similarity between entity corpora
CN109543175B (en) Method and device for searching synonyms
CN113486174B (en) Model training, reading understanding method and device, electronic equipment and storage medium
CN114282513A (en) Text semantic similarity matching method and system, intelligent terminal and storage medium
JP2013131075A (en) Classification model learning method, device, program, and review document classifying method
CN111782789A (en) Intelligent question and answer method and system
CN112966104B (en) Text clustering method, text clustering device, text processing equipment and storage medium
CN116070642A (en) Text emotion analysis method and related device based on expression embedding
CN115600595A (en) Entity relationship extraction method, system, equipment and readable storage medium
CN112989003B (en) Intention recognition method, device, processing equipment and medium
CN113139368B (en) Text editing method and system
CN115906835A (en) Chinese question text representation learning method based on clustering and contrast learning
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium
CN113297376A (en) Legal case risk point identification method and system based on meta-learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant