CN114510567A - Clustering-based new intention discovery method, device, equipment and storage medium

Info

Publication number: CN114510567A
Application number: CN202111592178.6A
Authority: CN (China)
Prior art keywords: clustering, label, intention, classifier, data
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 熊艺华, 杨双霞, 周志勇
Original and current assignee: Guangzhou Ifly Zunhong Information Technology Co ltd
Application filed by Guangzhou Ifly Zunhong Information Technology Co ltd on 2021-12-23, with priority to CN202111592178.6A

Classifications

    • G06F 16/353: information retrieval of unstructured textual data; clustering; classification into predefined classes
    • G06F 18/214: pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/23213: pattern recognition; non-hierarchical clustering using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06F 40/35: handling natural language data; semantic analysis; discourse or dialogue representation

Abstract

The application discloses a clustering-based new intention discovery method, device, equipment and storage medium. A classifier is first pre-trained on known intention data, and the cluster number is then selected through an optimized contour coefficient, which gives a good clustering effect. The known intention data and unlabeled data are combined to train the classifier, with the known intention data of the previous round serving as the supervision signal at each iteration; the known intention data are updated continuously until no new intention is added, at which point iteration stops and the alignment labels recording the discovered new intentions are output. The known intention data are thus fully utilized and the information exchange between the classification and clustering processes is strengthened, which better guides the clustering process and allows new intentions to be discovered accurately and sufficiently. This solves the technical problem in the prior art that the data of known intentions are not fully utilized and the difference between new and known intentions is not considered, resulting in a poor clustering effect and difficulty in accurately and sufficiently discovering new intentions.

Description

Clustering-based new intention discovery method, device, equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a clustering-based new intention discovery method, device, equipment and storage medium.
Background
When a user cannot answer a call, a telephone assistant can answer it in the user's place, carry out the corresponding interaction and guidance by understanding what the caller says, and record important incoming-call information to convey to the user. This reduces the loss of important information caused by missed calls, saves time and communication costs, and greatly improves efficiency in people's life and work. The premise of intention recognition is to discover as many user intentions as possible, train an intention recognition model on the basis of the discovered intentions, and design the corresponding interactive guidance. Intention discovery can mine users' new intentions and new points of interest, further improving the ability to recognize user intentions, perfecting the interaction guidance, and improving the user experience.
Existing approaches to the intention discovery task fall into three categories, each with shortcomings. The first is the classical unsupervised clustering algorithm, represented by K-Means. It has two defects: the cluster number must be set before clustering and directly affects the final clustering effect, yet the true number of clusters is unknown when performing intention discovery; and a classical unsupervised clustering algorithm cannot learn high-dimensional representations of text, so the distances between texts in the feature space are difficult to calculate accurately. The second is the unsupervised clustering algorithm based on deep learning, which first uses a deep neural network to extract high-dimensional features or encodings of the text, then uses these feature or encoding vectors to find cluster centers for clustering while training an intention classifier, and trains the model using the results of both the clustering and the classifier. This kind of method also has two disadvantages: no supervision signal is introduced, so the clustering is easily disturbed by outliers, which hurts the clustering effect; in particular, the unlabeled data of an intention discovery task usually mixes in data of known intentions and even data from other fields, making it difficult to cluster new intentions accurately. In addition, because the number and identity of the clusters change, the classifier parameters must be re-initialized each time. The third is the weakly supervised or semi-supervised clustering algorithm, which guides the clustering process with labeled data or constraints, for example by training a binary classification model on the labeled data to evaluate the clustering effect; the disadvantage is that most weakly supervised or semi-supervised signals still do not make full use of the labeled data. None of these three kinds of intention discovery methods fully utilizes the data of known intentions or considers the difference between new and known intentions, which results in a poor clustering effect and makes it difficult to discover new intentions accurately and sufficiently.
Disclosure of Invention
The application provides a clustering-based new intention discovery method, device, equipment and storage medium, which are used to solve the technical problems that, in the prior art, the data of known intentions are not fully utilized and the difference between a new intention and a known intention is not considered, so the clustering effect is poor and new intentions are difficult to discover accurately and sufficiently.
The first aspect of the present application provides a clustering-based new intention discovery method, which includes:
s101, pre-training a classifier according to known intention data;
s102, selecting a clustering number according to a preset contour coefficient;
s103, clustering label-free data according to the clustering number to generate a clustering result based on a K-means clustering algorithm, and aligning the clustering result with a real label to obtain an aligned label for finding a new intention;
s104, training the classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier;
s105, calculating KL divergence of the alignment label and the pseudo label to update the known intention data;
and S106, repeatedly executing the steps S101 to S105 until no new intention is added, and outputting the alignment label.
The method first pre-trains the classifier on the known intention data and then selects the cluster number through the optimized contour coefficient, which gives a good clustering effect. The known intention data and the label-free data are combined to train the classifier; at each iteration the known intention data of the previous round serve as the supervision signal, and the known intention data are updated continuously until no new intention is added, at which point the iteration stops and the alignment labels for the discovered new intentions are output. The known intention data are thus fully utilized and the information exchange between the classification and clustering processes is strengthened, which better guides the clustering process and discovers new intentions accurately and sufficiently. This solves the technical problems that, in the prior art, data of known intentions are not fully utilized and the difference between new and known intentions is not considered, so the clustering effect is poor and new intentions are difficult to discover accurately and sufficiently.
Optionally, the selecting a cluster number according to a preset contour coefficient includes:
selecting a clustering number according to the first contour coefficient and/or the second contour coefficient;
the first contour coefficient is:

l(i) = [formula image not reproduced in the source]

the second contour coefficient is:

l^a(i) = [formula image not reproduced in the source]

wherein l(i) is the contour coefficient with a penalty term, l^a(i) is the extended contour coefficient with a penalty term, s(i) is the traditional contour coefficient, s^a(i) is the extended contour coefficient, λ and γ are both hyper-parameters, K is the cluster number, N is the total number of samples, C_k is the cluster to which the sample i belongs,

[formula image not reproduced in the source]

and σ and u are respectively the standard deviation and the mean of the number of samples in the current cluster.
Optionally, before the training of the classifier according to the clustering result and the known intention data to obtain the pseudo label of the classifier, the method includes:

adjusting the number of classifier labels according to the number of cluster labels.
Optionally, the training the classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier includes:
calculating a joint loss according to the known intention data and the alignment label;
and updating the parameters of the classifier according to the joint loss to obtain the pseudo label of the classifier.
Optionally, the clustering of the label-free data according to the cluster number based on the K-means clustering algorithm to generate a clustering result, and the aligning of the clustering result with the real label to obtain an alignment label for discovering a new intention, include:
clustering the label-free data according to the clustering number based on a K-means clustering algorithm to generate a clustering result;
and aligning the clustering result and the real label through a Hungarian algorithm to obtain an aligned label for finding a new intention.
Optionally, the pre-training the classifier according to the known intention data includes:
extracting feature vectors of the known intention data based on a BERT pre-trained language model;
inputting the feature vector into a classifier to obtain a prediction label;
calculating the cross entropy loss of the predicted label and the real label;
and updating the classifier parameters according to the cross entropy loss.
Optionally, after the outputting of the alignment label once no new intention is added, the method further includes:
converting the implicit alignment tags into explicit intent tags by an intent tag generator.
A second aspect of the present application provides a clustering-based new intention discovery device, including:
the preprocessing unit is used for pre-training the classifier according to the known intention data;
the selecting unit is used for selecting the clustering number according to the preset contour coefficient;
the clustering unit is used for clustering the label-free data according to the clustering number based on a K-means clustering algorithm to generate a clustering result, and aligning the clustering result with the real label to obtain an aligned label for finding a new intention;
the training unit is used for training the classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier;
a calculation unit, used for calculating the KL divergence between the alignment label and the pseudo label to update the known intention data;
and the output unit is used for outputting the alignment label when no new intention is added.
A third aspect of the present application provides an electronic device comprising a processor and a memory storing a computer program, the processor implementing the steps of the clustering-based new intention discovery method according to the first aspect when executing the computer program.
A fourth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the clustering-based new intention discovery method according to the first aspect.
Drawings
Fig. 1 is a schematic flow chart of a clustering-based new intention discovery method according to an embodiment of the present application;
Fig. 2 is a block diagram of a model of a clustering-based new intention discovery method according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a clustering-based new intention discovery device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a clustering-based new intention discovery method, device, equipment and storage medium, which are used to solve the technical problems that, in the prior art, the data of known intentions are not fully utilized and the difference between a new intention and a known intention is not considered, so the clustering effect is poor and new intentions are difficult to discover accurately and sufficiently.
In order to make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the embodiments described below are only a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1 and fig. 2, an embodiment of the present application provides a method for discovering new ideas based on clustering, including:
step 101, pre-training the classifier according to the known intention data.
It should be noted that the known intention data is usually labeled data that has been used before; the classifier is pre-trained on this labeled known intention data to ensure that the subsequent classifier training proceeds smoothly.
Step 102, selecting a cluster number according to a preset contour coefficient.
It can be understood that, in this embodiment, the optimized contour coefficient (i.e., the preset contour coefficient; the contour coefficient is also known as the silhouette coefficient) is used to select the cluster number. Specifically, the cluster number may be selected according to the first contour coefficient and/or the second contour coefficient, so that the selected cluster number is more favorable for clustering and for discovering new intentions, as follows:
s(i) is obtained by substituting a first preset formula and a second preset formula into the traditional contour coefficient definition formula, wherein the first preset formula is:

$$a(i) = \frac{1}{|C_k| - 1} \sum_{j \in C_k,\, j \neq i} d(i, j)$$

the second preset formula is:

$$b(i) = \min_{k' \neq k} \frac{1}{|C_{k'}|} \sum_{j \in C_{k'}} d(i, j)$$

and the traditional contour coefficient definition formula is:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

wherein a(i) measures the distance between a sample point and the samples in the same cluster, i.e., the degree of cohesion within a cluster; b(i) measures the distance between the sample point and the samples in different clusters, i.e., the degree of separation between different clusters; s(i) is the traditional contour coefficient; i and j are samples and d(i, j) is the distance between them. The closer s(i) is to 1, the better the separation between different clusters and the cohesion within the same cluster.
s^a(i) is then obtained by calculation according to a third preset formula and a fourth preset formula, wherein the third preset formula is:

[formula image not reproduced in the source]

and the fourth preset formula is:

[formula image not reproduced in the source]
it should be noted that the optimized contour coefficients in this embodiment are improved in two ways compared to the conventional contour coefficients. One aspect is to introduce a distance between the new intent and the known intent data, extending b (i) to b (i) a. The difference between the new intention and the known intention data is more obvious, and the weight occupied by the distance between the new intention and the known intention data is dynamically adjusted through the adaptive parameter alpha, namely the new intention is better to be distinguished from the known intention data, and the smaller the adaptive parameter alpha is. On the other hand two penalty terms are added.
The first contour coefficient is:

l(i) = [formula image not reproduced in the source]

The second contour coefficient is:

l^a(i) = [formula image not reproduced in the source]

wherein l(i) is the contour coefficient with a penalty term, l^a(i) is the extended contour coefficient with a penalty term, s(i) is the traditional contour coefficient, s^a(i) is the extended contour coefficient, λ and γ are both hyper-parameters, K is the cluster number, N is the total number of samples, C_k is the cluster to which the sample i belongs,

[formula image not reproduced in the source]

and σ and u are respectively the standard deviation and the mean of the number of samples in the current cluster.
Because no clusters from a previous iteration exist during the first iteration, the cluster number is selected through the first contour coefficient in the first iteration and through the second contour coefficient in subsequent iterations. The former constrains the number of samples in each cluster to be as balanced as possible, which benefits clustering; the latter constrains newly added clusters to contain as few samples as possible during clustering, which helps mine new long-tail intentions in practical applications. λ and γ are the hyper-parameters of the two penalty terms, respectively.
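For illustration only, the following minimal sketch shows how the cluster number could be selected by scanning K and scoring each clustering with a penalized contour (silhouette) coefficient. Because the patent's penalty formulas appear only as images in the source, the concrete penalty used here (λ times the coefficient of variation σ/u of the cluster sizes, which rewards balanced clusters as described above) is an assumption rather than the patent's exact formula.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def penalized_silhouette(X, labels, lam=0.1):
    """Mean silhouette minus an assumed balance penalty lam * (sigma / u).

    The patent's exact penalty term is shown only as an image; this
    coefficient-of-variation penalty is a stand-in that matches the stated
    goal of keeping the cluster sizes balanced.
    """
    sizes = np.bincount(labels)           # number of samples in each cluster
    sigma, u = sizes.std(), sizes.mean()  # std and mean of the cluster sizes
    return silhouette_score(X, labels) - lam * sigma / u

def select_cluster_number(X, k_min=2, k_max=30, lam=0.1, seed=0):
    """Scan K and keep the value with the best penalized contour coefficient."""
    best_k, best_score = k_min, -np.inf
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = penalized_silhouette(X, labels, lam)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```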
Step 103, clustering the label-free data according to the cluster number based on the K-means clustering algorithm to generate a clustering result, and aligning the clustering result with the real labels to obtain alignment labels for discovering new intentions.
Step 104, training the classifier according to the clustering result and the known intention data to obtain the pseudo labels of the classifier.
Step 105, calculating the KL divergence between the alignment labels and the pseudo labels to update the known intention data.
It should be noted that, assuming the alignment labels and the pseudo labels obey probability distributions P and Q respectively, the KL divergence between the two distributions is calculated according to a fifth preset formula. A smaller KL divergence means the two distributions are closer, indicating a better clustering effect and a clearer distinction between the new intentions and the known intentions. If the current KL divergence is smaller than the KL divergence of the previous iteration, the currently added labels, i.e., the updated known intention data, are retained and used as the known intention data for the next iteration; otherwise, the method returns to the previous iteration. The fifth preset formula is:
$$\mathrm{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

where P and Q are the probability distributions of the alignment labels and the pseudo labels, respectively, and i indexes the label categories.
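As a concrete illustration, a minimal sketch of this check follows, assuming that the two distributions are estimated by normalized label counts; the helper names are illustrative and not from the patent.

```python
import numpy as np

def label_distribution(labels, num_classes):
    """Empirical label distribution, with a small epsilon to avoid log(0)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    return counts / counts.sum() + 1e-12

def kl_divergence(p, q):
    """KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))."""
    return float(np.sum(p * np.log(p / q)))

def should_keep_update(aligned_labels, pseudo_labels, num_classes, prev_kl):
    """Keep the newly added labels only if the KL divergence shrank this round."""
    kl = kl_divergence(label_distribution(aligned_labels, num_classes),
                       label_distribution(pseudo_labels, num_classes))
    return kl < prev_kl, kl
```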
Step 106, repeatedly executing steps 101 to 105 until no new intention is added, and outputting the alignment labels.
It should be noted that steps 101 to 105 are executed repeatedly, iteratively clustering and updating the known intention data. Fig. 2 shows the block diagram of this iteratively updated model. This embodiment introduces a training mode in which the known intention data are updated over multiple rounds of iteration; when no new intention has been added for several consecutive rounds, the iteration stops and the alignment labels, i.e., the discovered new intentions, are output.
Further, step 104 is preceded by: adjusting the number of classifier labels according to the number of cluster labels.
It should be noted that before the classifier is trained according to the clustering result and the known intention data, the number of classifier labels is adjusted. Specifically, if the number of cluster labels is greater than the number of classifier labels, classifier labels are added; otherwise, the current number of classifier labels is maintained. This ensures that training proceeds smoothly.
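One plausible way to implement this adjustment, assumed here because the patent does not spell out the mechanism, is to widen the classifier's final linear layer and copy the previously learned weights into the new head, as in the PyTorch sketch below.

```python
import torch
import torch.nn as nn

def expand_classifier_head(head: nn.Linear, new_num_labels: int) -> nn.Linear:
    """Widen a classification head to new_num_labels outputs, keeping old weights.

    This is an assumed mechanism for "adding classifier labels" when clustering
    yields more labels than the classifier currently has.
    """
    if new_num_labels <= head.out_features:
        return head                              # keep the current head as-is
    new_head = nn.Linear(head.in_features, new_num_labels)
    with torch.no_grad():                        # copy the learned rows over
        new_head.weight[:head.out_features] = head.weight
        new_head.bias[:head.out_features] = head.bias
    return new_head
```

Copying the old rows preserves what the classifier has already learned about the known intentions while adding capacity for the new cluster labels.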
Further, step 104 includes:
calculating a joint loss according to the known intention data and the alignment labels;
and updating the parameters of the classifier according to the joint loss to obtain the pseudo label of the classifier.
The joint loss consists of two parts: the first part is the cross-entropy loss between the classifier output and the labeled known intention data, which ensures that the classifier classifies the known intention data accurately; the second part is the cross-entropy loss between the classifier output and the alignment labels, which gives the classifier the ability to classify the new intentions.
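A minimal PyTorch sketch of this joint loss follows; the equal weighting of the two cross-entropy terms (w = 1.0) is an assumption, since the description only states that the joint loss consists of these two parts.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits_known, labels_known, logits_unlabeled, aligned_labels, w=1.0):
    """Joint loss: cross-entropy on labeled known-intention data plus
    cross-entropy on the aligned cluster labels of the unlabeled data.

    The relative weight w between the two terms is an assumption.
    """
    loss_known = F.cross_entropy(logits_known, labels_known)          # part one
    loss_aligned = F.cross_entropy(logits_unlabeled, aligned_labels)  # part two
    return loss_known + w * loss_aligned
```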
Further, step 103 comprises:
clustering the label-free data according to the clustering number based on a K-means clustering algorithm to generate a clustering result;
and aligning the clustering result with the real labels through the Hungarian algorithm to obtain alignment labels for discovering new intentions (a sketch follows).
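The following sketch illustrates this step with scikit-learn's KMeans and SciPy's linear_sum_assignment, a standard implementation of the Hungarian algorithm. The overlap matrix built on the labeled subset and the convention of renumbering unmatched clusters after the known labels are assumptions about details the description leaves open.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def cluster_and_align(X, k, true_labels, num_known):
    """Cluster with K-means, then align cluster ids to the known label ids.

    true_labels[i] is the known label of sample i, or -1 if unlabeled.
    Clusters left unmatched are treated as new intentions and renumbered
    after the known labels (an assumed convention).
    """
    cluster_ids = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

    # Overlap between each cluster and each known label on the labeled subset.
    overlap = np.zeros((k, num_known))
    for c, y in zip(cluster_ids, true_labels):
        if y >= 0:
            overlap[c, y] += 1

    # Hungarian algorithm: maximize total overlap (minimize its negation).
    rows, cols = linear_sum_assignment(-overlap)
    mapping = {c: y for c, y in zip(rows, cols) if overlap[c, y] > 0}

    # Clusters without a matched known label become new intention labels.
    next_label = num_known
    for c in range(k):
        if c not in mapping:
            mapping[c] = next_label
            next_label += 1
    return np.array([mapping[c] for c in cluster_ids])
```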
Further, step 101 comprises:
extracting feature vectors of the known intention data based on a BERT pre-trained language model;
inputting the feature vectors into a classifier to obtain a prediction label;
calculating the cross entropy loss of the predicted label and the real label;
and updating the classifier parameters according to the cross entropy loss.
This embodiment can use a large-scale Transformer-based pre-trained language model such as BERT to extract features of the input text, so that the extracted feature vectors are relatively comprehensive. The obtained feature vectors are input into the classifier to obtain predicted labels, and the cross-entropy loss between the predicted labels and the real labels is then calculated to update the classifier parameters, preparing for the subsequent iterative training.
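A minimal pre-training sketch for this step, using the Hugging Face transformers library, is given below. The checkpoint name ("bert-base-chinese"), the assumed number of known intentions (10), and the use of the [CLS] vector as the sentence feature are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint and label count; neither is mandated by the description.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")
classifier = nn.Linear(encoder.config.hidden_size, 10)  # 10 known intentions
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(classifier.parameters()), lr=2e-5)

def pretrain_step(texts, labels):
    """One step of cross-entropy pre-training on labeled known-intention data."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    features = encoder(**batch).last_hidden_state[:, 0]  # [CLS] feature vectors
    loss = nn.functional.cross_entropy(classifier(features), torch.tensor(labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```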
Further, step 106 is followed by:
the implicit alignment tags are converted to explicit intent tags by an intent tag generator.
Since the new intentions obtained by clustering take the form of implicit ids, the intention recognition task and interaction design of the subsequent telephone assistant usually require a human to inspect the data of each class and assign understandable explicit labels. To save labor and time costs, this embodiment can automatically generate explicit intention labels from the implicit alignment labels through an intention label generator. The specific steps of intention label generation are as follows (a sketch follows the list):
1. and performing syntactic dependency analysis on various input texts, and extracting the dynamic guest relationship in the sentence to form a preliminary intention label of 'dynamic word-noun'.
2. Count the frequency of the preliminary intention labels within each class and select a high-frequency label as the final intention label (intent label) of the class. In particular, when there is no high-frequency "verb-noun" label, i.e., the verb or the noun is missing, a high-frequency noun or verb alone is taken as the final intention label of the class.
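A sketch of such an intention label generator is given below, using spaCy for the dependency analysis. The description names no parser, so the library, the model, and the use of the "dobj" (direct object) relation to capture verb-object pairs are assumptions.

```python
from collections import Counter

import spacy

# Assumed parser; for Chinese text a model such as "zh_core_web_sm" would be
# loaded instead of the English one used here for illustration.
nlp = spacy.load("en_core_web_sm")

def generate_intent_label(texts):
    """Return a "verb-noun" label from the most frequent verb-object pair.

    Falls back to the most frequent noun or verb when no verb-object pair is
    found, mirroring the fallback described above.
    """
    pairs, nouns_verbs = Counter(), Counter()
    for doc in nlp.pipe(texts):
        for token in doc:
            if token.dep_ == "dobj":              # direct (verb-object) relation
                pairs[(token.head.lemma_, token.lemma_)] += 1
            if token.pos_ in ("NOUN", "VERB"):
                nouns_verbs[token.lemma_] += 1
    if pairs:
        verb, noun = pairs.most_common(1)[0][0]
        return f"{verb}-{noun}"
    return nouns_verbs.most_common(1)[0][0] if nouns_verbs else ""
```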
In this embodiment, the classifier is first pre-trained on the known intention data, and the cluster number is then selected through the optimized contour coefficient, which gives a better clustering effect. The known intention data and the label-free data are combined to train the classifier; at each iteration the known intention data of the previous round serve as the supervision signal, and the known intention data are updated continuously until no new intention is added, at which point the iteration stops and the alignment labels for the discovered new intentions are output. The known intention data are thus fully utilized and the information exchange between the classification and clustering processes is strengthened, which better guides the clustering process and discovers new intentions accurately and sufficiently. This solves the technical problems that, in the prior art, data of known intentions are not fully utilized and the difference between new and known intentions is not considered, so the clustering effect is poor and new intentions are difficult to discover accurately and sufficiently.
The following describes in detail an embodiment of the clustering-based new intention discovery device provided by the present application; the device described below and the method described above may be referred to correspondingly.
Referring to fig. 3, an embodiment of the present application provides a clustering-based new intention discovery device, including:
a pre-processing unit 201 for pre-training the classifier according to the known intention data.
A selecting unit 202, configured to select a cluster number according to a preset contour coefficient.
And the clustering unit 203 is used for clustering the unlabeled data according to the clustering number based on a K-means clustering algorithm to generate a clustering result, and aligning the clustering result with the real label to obtain an aligned label for finding a new intention.
And the training unit 204 is configured to train the classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier.
A calculation unit 205 for calculating KL divergences of the aligned tag and the pseudo tag to update the known intent data.
An output unit 206, configured to output the alignment label when no new intention is added.
Further, the selecting unit 202 is specifically configured to:
a cluster number is selected based on the first profile coefficient and/or the second profile coefficient.
The first contour coefficient is:

l(i) = [formula image not reproduced in the source]

The second contour coefficient is:

l^a(i) = [formula image not reproduced in the source]

wherein l(i) is the contour coefficient with a penalty term, l^a(i) is the extended contour coefficient with a penalty term, s(i) is the traditional contour coefficient, s^a(i) is the extended contour coefficient, λ and γ are both hyper-parameters, K is the cluster number, N is the total number of samples, C_k is the cluster to which the sample i belongs,

[formula image not reproduced in the source]

and σ and u are respectively the standard deviation and the mean of the number of samples in the current cluster.
Further, the device also comprises an adjusting unit, used for adjusting the number of classifier labels according to the number of cluster labels.
Further, the training unit 204 includes:
and the first calculation subunit is used for calculating the joint loss according to the known intention data and the alignment label.
And the first updating subunit is used for updating the classifier parameters according to the joint loss to obtain the pseudo label of the classifier.
Further, the clustering unit 203 includes:
and the clustering subunit is used for clustering the unlabeled data according to the clustering number based on a K-means clustering algorithm to generate a clustering result.
And the alignment subunit is used for aligning the clustering result with the real labels through the Hungarian algorithm to obtain alignment labels for discovering new intentions.
Further, the preprocessing unit 201 includes:
and the extraction subunit is used for extracting the feature vector of the known intention data based on the BERT pre-training language model.
And the input subunit is used for inputting the feature vectors into the classifier to obtain the prediction labels.
And the second calculation subunit is used for calculating the cross entropy loss of the prediction label and the real label.
And the second updating subunit is used for updating the classifier parameters according to the cross entropy loss.
Further, a conversion unit is included for converting the implicit alignment tags into explicit intent tags by the intent tag generator.
Fig. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 4, the present application also provides an electronic device, which may include: a processor 310, a communication interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communication interface 320 and the memory 330 communicate with one another through the communication bus 340. The processor 310 may invoke a computer program in the memory 330 to perform the steps of the clustering-based new intention discovery method, for example including:
pre-training the classifier according to known intention data;
selecting a cluster number according to a preset contour coefficient;
based on a K-means clustering algorithm, clustering the label-free data according to the clustering number to generate a clustering result, and aligning the clustering result with the real label to obtain an aligned label for finding a new intention;
training a classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier;
calculating KL divergence of the aligned tag and the pseudo tag to update the known intent data;
and repeatedly executing the steps until no new intention is added, and outputting the alignment label.
In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention or a part thereof, which essentially contributes to the prior art, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
On the other hand, embodiments of the present application further provide a computer-readable storage medium storing a computer program, where the computer program causes a processor to execute the steps of the method provided by each of the above embodiments, for example including:
pre-training the classifier according to known intention data;
selecting a cluster number according to a preset contour coefficient;
based on a K-means clustering algorithm, clustering the label-free data according to the clustering number to generate a clustering result, and aligning the clustering result with the real label to obtain an aligned label for finding a new intention;
training a classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier;
calculating KL divergence of the aligned tag and the pseudo tag to update the known intent data;
and repeatedly executing the steps until no new intention is added, and outputting the alignment label.
The processor-readable storage medium may be any available media or data storage device that can be accessed by a processor, including, but not limited to, magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only used to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A clustering-based new intention discovery method, characterized by comprising the following steps:
s101, pre-training a classifier according to known intention data;
s102, selecting a clustering number according to a preset contour coefficient;
s103, clustering unlabeled data according to the clustering number based on a K-means clustering algorithm to generate a clustering result, and aligning the clustering result with a real label to obtain an aligned label for finding a new intention;
s104, training the classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier;
s105, calculating KL divergence of the alignment label and the pseudo label to update the known intention data;
and S106, repeatedly executing the steps S101 to S105 until no new intention is added, and outputting the alignment label.
2. The method of claim 1, wherein the selecting a cluster number according to a preset profile coefficient comprises:
selecting a clustering number according to the first contour coefficient and/or the second contour coefficient;
the first contour coefficient is:

l(i) = [formula image not reproduced in the source]

the second contour coefficient is:

l^a(i) = [formula image not reproduced in the source]

wherein l(i) is the contour coefficient with a penalty term, l^a(i) is the extended contour coefficient with a penalty term, s(i) is the traditional contour coefficient, s^a(i) is the extended contour coefficient, λ and γ are both hyper-parameters, K is the cluster number, N is the total number of samples, C_k is the cluster to which the sample i belongs,

[formula image not reproduced in the source]

and σ and u are respectively the standard deviation and the mean of the number of samples in the current cluster.
3. The clustering-based new intention discovery method according to claim 1, wherein before the training of the classifier according to the clustering result and the known intention data to obtain the pseudo label of the classifier, the method comprises:
and adjusting the label number of the classifier according to the label number of the cluster.
4. The method of claim 3, wherein the training the classifier according to the clustering result and the known intention data to obtain the pseudo label of the classifier comprises:
calculating a joint loss according to the known intention data and the alignment label;
and updating the parameters of the classifier according to the joint loss to obtain a pseudo label of the classifier.
5. The clustering-based new intention discovery method according to claim 1, wherein the clustering of unlabeled data according to the cluster number based on the K-means clustering algorithm to generate a clustering result, and the aligning of the clustering result with the real label to obtain an alignment label for discovering a new intention, comprise:
clustering the label-free data according to the clustering number based on a K-means clustering algorithm to generate a clustering result;
and aligning the clustering result and the real label through a Hungarian algorithm to obtain an aligned label for finding a new intention.
6. The clustering-based new intention discovery method according to claim 1, wherein the pre-training of the classifier according to the known intention data comprises:
extracting feature vectors of known intention data based on a BERT pre-training language model;
inputting the feature vector into a classifier to obtain a prediction label;
calculating the cross entropy loss of the predicted label and the real label;
and updating the classifier parameters according to the cross entropy loss.
7. The clustering-based new intention discovery method according to claim 1, wherein after the outputting of the alignment label once no new intention is added, the method further comprises:
converting the implicit alignment tags into explicit intent tags by an intent tag generator.
8. A clustering-based new intention discovery device, characterized by comprising:
the preprocessing unit is used for pre-training the classifier according to the known intention data;
the selecting unit is used for selecting the clustering number according to the preset contour coefficient;
the clustering unit is used for clustering the label-free data according to the clustering number based on a K-means clustering algorithm to generate a clustering result, and aligning the clustering result with the real label to obtain an aligned label for finding a new intention;
the training unit is used for training the classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier;
a calculation unit, used for calculating the KL divergence between the alignment label and the pseudo label to update the known intention data;
and the output unit is used for outputting the alignment label when no new intention is added.
9. An electronic device comprising a processor and a memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the clustering-based new intention discovery method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the clustering-based new intention discovery method according to any one of claims 1 to 7.
CN202111592178.6A 2021-12-23 2021-12-23 Clustering-based new intention discovery method, device, equipment and storage medium Pending CN114510567A (en)

Priority Applications (1)

Application Number: CN202111592178.6A · Priority Date: 2021-12-23 · Filing Date: 2021-12-23 · Title: Clustering-based new intention discovery method, device, equipment and storage medium

Publications (1)

Publication Number: CN114510567A · Publication Date: 2022-05-17

Family ID: 81547948

Family Applications (1): CN202111592178.6A · Clustering-based new intention discovery method, device, equipment and storage medium

Country Status (1): CN · CN114510567A (en)

Cited By (1)

CN115168593A (cited by examiner) · priority date 2022-09-05 · publication date 2022-10-11 · 深圳爱莫科技有限公司 · Intelligent dialogue management system, method and processing equipment capable of self-learning


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination