CN114510567A - Clustering-based new intention discovery method, device, equipment and storage medium

Info

Publication number: CN114510567A
Application number: CN202111592178.6A
Authority: CN (China)
Prior art keywords: clustering, label, intention, classifier, data
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 熊艺华, 杨双霞, 周志勇
Original and current assignee: Guangzhou Ifly Zunhong Information Technology Co ltd
Application filed by Guangzhou Ifly Zunhong Information Technology Co ltd on 2021-12-23, with priority to CN202111592178.6A

Classifications

    • G06F 16/353: information retrieval of unstructured textual data; clustering; classification into predefined classes
    • G06F 18/214: pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/23213: pattern recognition; non-hierarchical clustering using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06F 40/35: handling natural language data; semantic analysis; discourse or dialogue representation

Abstract

The application discloses a clustering-based new intention discovery method, device, equipment and storage medium. A classifier is first pre-trained on known intention data, and the cluster number is then selected through an optimized contour coefficient, which gives a good clustering effect. The known intention data and unlabeled data are combined to train the classifier, with the known intention data of the previous round serving as the supervision signal at each iteration; the known intention data are updated continuously until no new intention is added, at which point iteration stops and the alignment labels recording the discovered new intentions are output. The known intention data are thus fully utilized and the information exchange between the classification and clustering processes is strengthened, which better guides the clustering process and allows new intentions to be discovered accurately and sufficiently. This solves the technical problem in the prior art that the data of known intentions are not fully utilized and the difference between new and known intentions is not considered, resulting in a poor clustering effect and difficulty in accurately and sufficiently discovering new intentions.

Description

Clustering-based new intention discovery method, device, equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a clustering-based new intention discovery method, device, equipment and storage medium.
Background
When a user cannot answer a call, a telephone assistant can answer it in the user's place, carry out the corresponding interaction and guidance by understanding what the caller says, and record important incoming-call information to convey to the user. This reduces the loss of important information caused by missed calls, saves time and communication costs, and greatly improves efficiency in people's life and work. The premise of intention recognition is to discover as many user intentions as possible, train an intention recognition model on the basis of the discovered intentions, and design the corresponding interactive guidance. Intention discovery can mine users' new intentions and new points of interest, further improving the ability to recognize user intentions, perfecting the interaction guidance, and improving the user experience.
Existing approaches to the intention discovery task fall into three categories, each with shortcomings. The first is the classical unsupervised clustering algorithm, represented by K-Means. It has two defects: the cluster number must be set before clustering and directly affects the final clustering effect, yet the true number of clusters is unknown when performing intention discovery; and a classical unsupervised clustering algorithm cannot learn high-dimensional representations of text, so the distances between texts in the feature space are difficult to calculate accurately. The second is the unsupervised clustering algorithm based on deep learning, which first uses a deep neural network to extract high-dimensional features or encodings of the text, then uses these feature or encoding vectors to find cluster centers for clustering while training an intention classifier, and trains the model using the results of both the clustering and the classifier. This kind of method also has two disadvantages: no supervision signal is introduced, so the clustering is easily disturbed by outliers, which hurts the clustering effect; in particular, the unlabeled data of an intention discovery task usually mixes in data of known intentions and even data from other fields, making it difficult to cluster new intentions accurately. In addition, because the number and identity of the clusters change, the classifier parameters must be re-initialized each time. The third is the weakly supervised or semi-supervised clustering algorithm, which guides the clustering process with labeled data or constraints, for example by training a binary classification model on the labeled data to evaluate the clustering effect; the disadvantage is that most weakly supervised or semi-supervised signals still do not make full use of the labeled data. None of these three kinds of intention discovery methods fully utilizes the data of known intentions or considers the difference between new and known intentions, which results in a poor clustering effect and makes it difficult to discover new intentions accurately and sufficiently.
Disclosure of Invention
The application provides a clustering-based new intention discovery method, device, equipment and storage medium, which are used to solve the technical problems that, in the prior art, the data of known intentions are not fully utilized and the difference between a new intention and a known intention is not considered, so the clustering effect is poor and new intentions are difficult to discover accurately and sufficiently.
The first aspect of the present application provides a clustering-based new intention discovery method, which includes:
s101, pre-training a classifier according to known intention data;
s102, selecting a clustering number according to a preset contour coefficient;
s103, clustering label-free data according to the clustering number to generate a clustering result based on a K-means clustering algorithm, and aligning the clustering result with a real label to obtain an aligned label for finding a new intention;
s104, training the classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier;
s105, calculating KL divergence of the alignment label and the pseudo label to update the known intention data;
and S106, repeatedly executing the steps S101 to S105 until no new intention is added, and outputting the alignment label.
The method first pre-trains the classifier on the known intention data and then selects the cluster number through the optimized contour coefficient, which gives a good clustering effect. The known intention data and the label-free data are combined to train the classifier; at each iteration the known intention data of the previous round serve as the supervision signal, and the known intention data are updated continuously until no new intention is added, at which point the iteration stops and the alignment labels for the discovered new intentions are output. The known intention data are thus fully utilized and the information exchange between the classification and clustering processes is strengthened, which better guides the clustering process and discovers new intentions accurately and sufficiently. This solves the technical problems that, in the prior art, data of known intentions are not fully utilized and the difference between new and known intentions is not considered, so the clustering effect is poor and new intentions are difficult to discover accurately and sufficiently.
Optionally, the selecting a cluster number according to a preset contour coefficient includes:
selecting a clustering number according to the first contour coefficient and/or the second contour coefficient;
the first contour coefficient is:

l(i) = [formula image not reproduced in the source]

the second contour coefficient is:

l^a(i) = [formula image not reproduced in the source]

wherein l(i) is the contour coefficient with a penalty term, l^a(i) is the extended contour coefficient with a penalty term, s(i) is the traditional contour coefficient, s^a(i) is the extended contour coefficient, λ and γ are both hyper-parameters, K is the cluster number, N is the total number of samples, C_k is the cluster to which the sample i belongs,

[formula image not reproduced in the source]

and σ and u are respectively the standard deviation and the mean of the number of samples in the current cluster.
Optionally, before the training of the classifier according to the clustering result and the known intention data to obtain the pseudo label of the classifier, the method includes:

adjusting the number of classifier labels according to the number of cluster labels.
Optionally, the training the classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier includes:
calculating a joint loss according to the known intention data and the alignment label;
and updating the parameters of the classifier according to the joint loss to obtain the pseudo label of the classifier.
Optionally, the clustering of the label-free data according to the cluster number based on the K-means clustering algorithm to generate a clustering result, and the aligning of the clustering result with the real label to obtain an alignment label for discovering a new intention, include:
clustering the label-free data according to the clustering number based on a K-means clustering algorithm to generate a clustering result;
and aligning the clustering result and the real label through a Hungarian algorithm to obtain an aligned label for finding a new intention.
Optionally, the pre-training the classifier according to the known intention data includes:
extracting feature vectors of the known intention data based on a BERT pre-trained language model;
inputting the feature vector into a classifier to obtain a prediction label;
calculating the cross entropy loss of the predicted label and the real label;
and updating the classifier parameters according to the cross entropy loss.
Optionally, after the outputting of the alignment label once no new intention is added, the method further includes:
converting the implicit alignment tags into explicit intent tags by an intent tag generator.
A second aspect of the present application provides a clustering-based new intention discovery device, including:
the preprocessing unit is used for pre-training the classifier according to the known intention data;
the selecting unit is used for selecting the clustering number according to the preset contour coefficient;
the clustering unit is used for clustering the label-free data according to the clustering number based on a K-means clustering algorithm to generate a clustering result, and aligning the clustering result with the real label to obtain an aligned label for finding a new intention;
the training unit is used for training the classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier;
a calculation unit, used for calculating the KL divergence between the alignment label and the pseudo label to update the known intention data;
and the output unit is used for outputting the alignment label when no new intention is added.
A third aspect of the present application provides an electronic device comprising a processor and a memory storing a computer program, the processor implementing the steps of the clustering-based new intention discovery method according to the first aspect when executing the computer program.
A fourth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the clustering-based new intention discovery method according to the first aspect.
Drawings
Fig. 1 is a schematic flow chart of a clustering-based new intention discovery method according to an embodiment of the present application;
Fig. 2 is a block diagram of a model of a clustering-based new intention discovery method according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a clustering-based new intention discovery device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a clustering-based new intention discovery method, device, equipment and storage medium, which are used to solve the technical problems that, in the prior art, the data of known intentions are not fully utilized and the difference between a new intention and a known intention is not considered, so the clustering effect is poor and new intentions are difficult to discover accurately and sufficiently.
In order to make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the embodiments described below are only a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1 and fig. 2, an embodiment of the present application provides a method for discovering new ideas based on clustering, including:
step 101, pre-training the classifier according to the known intention data.
It should be noted that the known intention data is usually labeled data that has been used before; the classifier is pre-trained on this labeled known intention data to ensure that the subsequent classifier training proceeds smoothly.
Step 102, selecting a cluster number according to a preset contour coefficient.
It can be understood that, in this embodiment, the optimized contour coefficient (i.e., the preset contour coefficient; the contour coefficient is also known as the silhouette coefficient) is used to select the cluster number. Specifically, the cluster number may be selected according to the first contour coefficient and/or the second contour coefficient, so that the selected cluster number is more favorable for clustering and for discovering new intentions, as follows:
s(i) is obtained by substituting a first preset formula and a second preset formula into the traditional contour coefficient definition formula, wherein the first preset formula is:

$$a(i) = \frac{1}{|C_k| - 1} \sum_{j \in C_k,\, j \neq i} d(i, j)$$

the second preset formula is:

$$b(i) = \min_{k' \neq k} \frac{1}{|C_{k'}|} \sum_{j \in C_{k'}} d(i, j)$$

and the traditional contour coefficient definition formula is:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

wherein a(i) measures the distance between a sample point and the samples in the same cluster, i.e., the degree of cohesion within a cluster; b(i) measures the distance between the sample point and the samples in different clusters, i.e., the degree of separation between different clusters; s(i) is the traditional contour coefficient; i and j are samples and d(i, j) is the distance between them. The closer s(i) is to 1, the better the separation between different clusters and the cohesion within the same cluster.
s^a(i) is then obtained by calculation according to a third preset formula and a fourth preset formula, wherein the third preset formula is:

[formula image not reproduced in the source]

and the fourth preset formula is:

[formula image not reproduced in the source]
it should be noted that the optimized contour coefficients in this embodiment are improved in two ways compared to the conventional contour coefficients. One aspect is to introduce a distance between the new intent and the known intent data, extending b (i) to b (i) a. The difference between the new intention and the known intention data is more obvious, and the weight occupied by the distance between the new intention and the known intention data is dynamically adjusted through the adaptive parameter alpha, namely the new intention is better to be distinguished from the known intention data, and the smaller the adaptive parameter alpha is. On the other hand two penalty terms are added.
The first contour coefficient is:

l(i) = [formula image not reproduced in the source]

The second contour coefficient is:

l^a(i) = [formula image not reproduced in the source]

wherein l(i) is the contour coefficient with a penalty term, l^a(i) is the extended contour coefficient with a penalty term, s(i) is the traditional contour coefficient, s^a(i) is the extended contour coefficient, λ and γ are both hyper-parameters, K is the cluster number, N is the total number of samples, C_k is the cluster to which the sample i belongs,

[formula image not reproduced in the source]

and σ and u are respectively the standard deviation and the mean of the number of samples in the current cluster.
Because no clusters from a previous iteration exist during the first iteration, the cluster number is selected through the first contour coefficient in the first iteration and through the second contour coefficient in subsequent iterations. The former constrains the number of samples in each cluster to be as balanced as possible, which benefits clustering; the latter constrains newly added clusters to contain as few samples as possible during clustering, which helps mine new long-tail intentions in practical applications. λ and γ are the hyper-parameters of the two penalty terms, respectively.
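For illustration only, the following minimal sketch shows how the cluster number could be selected by scanning K and scoring each clustering with a penalized contour (silhouette) coefficient. Because the patent's penalty formulas appear only as images in the source, the concrete penalty used here (λ times the coefficient of variation σ/u of the cluster sizes, which rewards balanced clusters as described above) is an assumption rather than the patent's exact formula.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def penalized_silhouette(X, labels, lam=0.1):
    """Mean silhouette minus an assumed balance penalty lam * (sigma / u).

    The patent's exact penalty term is shown only as an image; this
    coefficient-of-variation penalty is a stand-in that matches the stated
    goal of keeping the cluster sizes balanced.
    """
    sizes = np.bincount(labels)           # number of samples in each cluster
    sigma, u = sizes.std(), sizes.mean()  # std and mean of the cluster sizes
    return silhouette_score(X, labels) - lam * sigma / u

def select_cluster_number(X, k_min=2, k_max=30, lam=0.1, seed=0):
    """Scan K and keep the value with the best penalized contour coefficient."""
    best_k, best_score = k_min, -np.inf
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = penalized_silhouette(X, labels, lam)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```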
Step 103, clustering the label-free data according to the cluster number based on the K-means clustering algorithm to generate a clustering result, and aligning the clustering result with the real labels to obtain alignment labels for discovering new intentions.
Step 104, training the classifier according to the clustering result and the known intention data to obtain the pseudo labels of the classifier.
Step 105, calculating the KL divergence between the alignment labels and the pseudo labels to update the known intention data.
It should be noted that, assuming the alignment labels and the pseudo labels obey probability distributions P and Q respectively, the KL divergence between the two distributions is calculated according to a fifth preset formula. A smaller KL divergence means the two distributions are closer, indicating a better clustering effect and a clearer distinction between the new intentions and the known intentions. If the current KL divergence is smaller than the KL divergence of the previous iteration, the currently added labels, i.e., the updated known intention data, are retained and used as the known intention data for the next iteration; otherwise, the method returns to the previous iteration. The fifth preset formula is:
$$\mathrm{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

where P and Q are the probability distributions of the alignment labels and the pseudo labels, respectively, and i indexes the label categories.
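As a concrete illustration, a minimal sketch of this check follows, assuming that the two distributions are estimated by normalized label counts; the helper names are illustrative and not from the patent.

```python
import numpy as np

def label_distribution(labels, num_classes):
    """Empirical label distribution, with a small epsilon to avoid log(0)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    return counts / counts.sum() + 1e-12

def kl_divergence(p, q):
    """KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))."""
    return float(np.sum(p * np.log(p / q)))

def should_keep_update(aligned_labels, pseudo_labels, num_classes, prev_kl):
    """Keep the newly added labels only if the KL divergence shrank this round."""
    kl = kl_divergence(label_distribution(aligned_labels, num_classes),
                       label_distribution(pseudo_labels, num_classes))
    return kl < prev_kl, kl
```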
Step 106, repeatedly executing steps 101 to 105 until no new intention is added, and outputting the alignment labels.
It should be noted that steps 101 to 105 are executed repeatedly, iteratively clustering and updating the known intention data. Fig. 2 shows the block diagram of this iteratively updated model. This embodiment introduces a training mode in which the known intention data are updated over multiple rounds of iteration; when no new intention has been added for several consecutive rounds, the iteration stops and the alignment labels, i.e., the discovered new intentions, are output.
Further, step 104 is preceded by: adjusting the number of classifier labels according to the number of cluster labels.
It should be noted that before the classifier is trained according to the clustering result and the known intention data, the number of classifier labels is adjusted. Specifically, if the number of cluster labels is greater than the number of classifier labels, classifier labels are added; otherwise, the current number of classifier labels is maintained. This ensures that training proceeds smoothly.
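One plausible way to implement this adjustment, assumed here because the patent does not spell out the mechanism, is to widen the classifier's final linear layer and copy the previously learned weights into the new head, as in the PyTorch sketch below.

```python
import torch
import torch.nn as nn

def expand_classifier_head(head: nn.Linear, new_num_labels: int) -> nn.Linear:
    """Widen a classification head to new_num_labels outputs, keeping old weights.

    This is an assumed mechanism for "adding classifier labels" when clustering
    yields more labels than the classifier currently has.
    """
    if new_num_labels <= head.out_features:
        return head                              # keep the current head as-is
    new_head = nn.Linear(head.in_features, new_num_labels)
    with torch.no_grad():                        # copy the learned rows over
        new_head.weight[:head.out_features] = head.weight
        new_head.bias[:head.out_features] = head.bias
    return new_head
```

Copying the old rows preserves what the classifier has already learned about the known intentions while adding capacity for the new cluster labels.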
Further, step 104 includes:
calculating a joint loss according to the known intention data and the alignment labels;
and updating the parameters of the classifier according to the joint loss to obtain the pseudo label of the classifier.
The joint loss consists of two parts: the first part is the cross-entropy loss between the classifier output and the labeled known intention data, which ensures that the classifier classifies the known intention data accurately; the second part is the cross-entropy loss between the classifier output and the alignment labels, which gives the classifier the ability to classify the new intentions.
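A minimal PyTorch sketch of this joint loss follows; the equal weighting of the two cross-entropy terms (w = 1.0) is an assumption, since the description only states that the joint loss consists of these two parts.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits_known, labels_known, logits_unlabeled, aligned_labels, w=1.0):
    """Joint loss: cross-entropy on labeled known-intention data plus
    cross-entropy on the aligned cluster labels of the unlabeled data.

    The relative weight w between the two terms is an assumption.
    """
    loss_known = F.cross_entropy(logits_known, labels_known)          # part one
    loss_aligned = F.cross_entropy(logits_unlabeled, aligned_labels)  # part two
    return loss_known + w * loss_aligned
```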
Further, step 103 comprises:
clustering the label-free data according to the clustering number based on a K-means clustering algorithm to generate a clustering result;
and aligning the clustering result with the real labels through the Hungarian algorithm to obtain alignment labels for discovering new intentions (a sketch follows).
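The following sketch illustrates this step with scikit-learn's KMeans and SciPy's linear_sum_assignment, a standard implementation of the Hungarian algorithm. The overlap matrix built on the labeled subset and the convention of renumbering unmatched clusters after the known labels are assumptions about details the description leaves open.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def cluster_and_align(X, k, true_labels, num_known):
    """Cluster with K-means, then align cluster ids to the known label ids.

    true_labels[i] is the known label of sample i, or -1 if unlabeled.
    Clusters left unmatched are treated as new intentions and renumbered
    after the known labels (an assumed convention).
    """
    cluster_ids = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

    # Overlap between each cluster and each known label on the labeled subset.
    overlap = np.zeros((k, num_known))
    for c, y in zip(cluster_ids, true_labels):
        if y >= 0:
            overlap[c, y] += 1

    # Hungarian algorithm: maximize total overlap (minimize its negation).
    rows, cols = linear_sum_assignment(-overlap)
    mapping = {c: y for c, y in zip(rows, cols) if overlap[c, y] > 0}

    # Clusters without a matched known label become new intention labels.
    next_label = num_known
    for c in range(k):
        if c not in mapping:
            mapping[c] = next_label
            next_label += 1
    return np.array([mapping[c] for c in cluster_ids])
```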
Further, step 101 comprises:
extracting feature vectors of the known intention data based on a BERT pre-trained language model;
inputting the feature vectors into a classifier to obtain a prediction label;
calculating the cross entropy loss of the predicted label and the real label;
and updating the classifier parameters according to the cross entropy loss.
This embodiment can use a large-scale Transformer-based pre-trained language model such as BERT to extract features of the input text, so that the extracted feature vectors are relatively comprehensive. The obtained feature vectors are input into the classifier to obtain predicted labels, and the cross-entropy loss between the predicted labels and the real labels is then calculated to update the classifier parameters, preparing for the subsequent iterative training.
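A minimal pre-training sketch for this step, using the Hugging Face transformers library, is given below. The checkpoint name ("bert-base-chinese"), the assumed number of known intentions (10), and the use of the [CLS] vector as the sentence feature are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint and label count; neither is mandated by the description.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")
classifier = nn.Linear(encoder.config.hidden_size, 10)  # 10 known intentions
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(classifier.parameters()), lr=2e-5)

def pretrain_step(texts, labels):
    """One step of cross-entropy pre-training on labeled known-intention data."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    features = encoder(**batch).last_hidden_state[:, 0]  # [CLS] feature vectors
    loss = nn.functional.cross_entropy(classifier(features), torch.tensor(labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```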
Further, step 106 is followed by:
the implicit alignment tags are converted to explicit intent tags by an intent tag generator.
Since the new intentions obtained by clustering take the form of implicit ids, the intention recognition task and interaction design of the subsequent telephone assistant usually require a human to inspect the data of each class and assign understandable explicit labels. To save labor and time costs, this embodiment can automatically generate explicit intention labels from the implicit alignment labels through an intention label generator. The specific steps of intention label generation are as follows (a sketch follows the list):
1. and performing syntactic dependency analysis on various input texts, and extracting the dynamic guest relationship in the sentence to form a preliminary intention label of 'dynamic word-noun'.
2. Count the frequency of the preliminary intention labels within each class and select a high-frequency label as the final intention label (intent label) of the class. In particular, when there is no high-frequency "verb-noun" label, i.e., the verb or the noun is missing, a high-frequency noun or verb alone is taken as the final intention label of the class.
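A sketch of such an intention label generator is given below, using spaCy for the dependency analysis. The description names no parser, so the library, the model, and the use of the "dobj" (direct object) relation to capture verb-object pairs are assumptions.

```python
from collections import Counter

import spacy

# Assumed parser; for Chinese text a model such as "zh_core_web_sm" would be
# loaded instead of the English one used here for illustration.
nlp = spacy.load("en_core_web_sm")

def generate_intent_label(texts):
    """Return a "verb-noun" label from the most frequent verb-object pair.

    Falls back to the most frequent noun or verb when no verb-object pair is
    found, mirroring the fallback described above.
    """
    pairs, nouns_verbs = Counter(), Counter()
    for doc in nlp.pipe(texts):
        for token in doc:
            if token.dep_ == "dobj":              # direct (verb-object) relation
                pairs[(token.head.lemma_, token.lemma_)] += 1
            if token.pos_ in ("NOUN", "VERB"):
                nouns_verbs[token.lemma_] += 1
    if pairs:
        verb, noun = pairs.most_common(1)[0][0]
        return f"{verb}-{noun}"
    return nouns_verbs.most_common(1)[0][0] if nouns_verbs else ""
```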
In this embodiment, the classifier is first pre-trained on the known intention data, and the cluster number is then selected through the optimized contour coefficient, which gives a better clustering effect. The known intention data and the label-free data are combined to train the classifier; at each iteration the known intention data of the previous round serve as the supervision signal, and the known intention data are updated continuously until no new intention is added, at which point the iteration stops and the alignment labels for the discovered new intentions are output. The known intention data are thus fully utilized and the information exchange between the classification and clustering processes is strengthened, which better guides the clustering process and discovers new intentions accurately and sufficiently. This solves the technical problems that, in the prior art, data of known intentions are not fully utilized and the difference between new and known intentions is not considered, so the clustering effect is poor and new intentions are difficult to discover accurately and sufficiently.
The following describes in detail an embodiment of the clustering-based new intention discovery device provided by the present application; the device described below and the method described above may be referred to correspondingly.
Referring to fig. 3, an embodiment of the present application provides a clustering-based new intention discovery device, including:
a pre-processing unit 201 for pre-training the classifier according to the known intention data.
A selecting unit 202, configured to select a cluster number according to a preset contour coefficient.
And the clustering unit 203 is used for clustering the unlabeled data according to the clustering number based on a K-means clustering algorithm to generate a clustering result, and aligning the clustering result with the real label to obtain an aligned label for finding a new intention.
And the training unit 204 is configured to train the classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier.
A calculation unit 205 for calculating KL divergences of the aligned tag and the pseudo tag to update the known intent data.
An output unit 206, configured to output the alignment label when no new intention is added.
Further, the selecting unit 202 is specifically configured to:
a cluster number is selected based on the first profile coefficient and/or the second profile coefficient.
The first contour coefficient is:

l(i) = [formula image not reproduced in the source]

The second contour coefficient is:

l^a(i) = [formula image not reproduced in the source]

wherein l(i) is the contour coefficient with a penalty term, l^a(i) is the extended contour coefficient with a penalty term, s(i) is the traditional contour coefficient, s^a(i) is the extended contour coefficient, λ and γ are both hyper-parameters, K is the cluster number, N is the total number of samples, C_k is the cluster to which the sample i belongs,

[formula image not reproduced in the source]

and σ and u are respectively the standard deviation and the mean of the number of samples in the current cluster.
Further, the device also comprises an adjusting unit, used for adjusting the number of classifier labels according to the number of cluster labels.
Further, the training unit 204 includes:
and the first calculation subunit is used for calculating the joint loss according to the known intention data and the alignment label.
And the first updating subunit is used for updating the classifier parameters according to the joint loss to obtain the pseudo label of the classifier.
Further, the clustering unit 203 includes:
and the clustering subunit is used for clustering the unlabeled data according to the clustering number based on a K-means clustering algorithm to generate a clustering result.
And the alignment subunit is used for aligning the clustering result with the real labels through the Hungarian algorithm to obtain alignment labels for discovering new intentions.
Further, the preprocessing unit 201 includes:
and the extraction subunit is used for extracting the feature vector of the known intention data based on the BERT pre-training language model.
And the input subunit is used for inputting the feature vectors into the classifier to obtain the prediction labels.
And the second calculation subunit is used for calculating the cross entropy loss of the prediction label and the real label.
And the second updating subunit is used for updating the classifier parameters according to the cross entropy loss.
Further, a conversion unit is included for converting the implicit alignment tags into explicit intent tags by the intent tag generator.
Fig. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 4, the present application also provides an electronic device, which may include: a processor 310, a communication interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communication interface 320 and the memory 330 communicate with one another through the communication bus 340. The processor 310 may invoke a computer program in the memory 330 to perform the steps of the clustering-based new intention discovery method, for example including:
pre-training the classifier according to known intention data;
selecting a cluster number according to a preset contour coefficient;
based on a K-means clustering algorithm, clustering the label-free data according to the clustering number to generate a clustering result, and aligning the clustering result with the real label to obtain an aligned label for finding a new intention;
training a classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier;
calculating KL divergence of the aligned tag and the pseudo tag to update the known intent data;
and repeatedly executing the steps until no new intention is added, and outputting the alignment label.
In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention or a part thereof, which essentially contributes to the prior art, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
On the other hand, embodiments of the present application further provide a computer-readable storage medium storing a computer program, where the computer program causes a processor to execute the steps of the method provided by each of the above embodiments, for example including:
pre-training the classifier according to known intention data;
selecting a cluster number according to a preset contour coefficient;
based on a K-means clustering algorithm, clustering the label-free data according to the clustering number to generate a clustering result, and aligning the clustering result with the real label to obtain an aligned label for finding a new intention;
training a classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier;
calculating KL divergence of the aligned tag and the pseudo tag to update the known intent data;
and repeatedly executing the steps until no new intention is added, and outputting the alignment label.
The processor-readable storage medium may be any available media or data storage device that can be accessed by a processor, including, but not limited to, magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only used to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A clustering-based new intention discovery method, characterized by comprising the following steps:
s101, pre-training a classifier according to known intention data;
s102, selecting a clustering number according to a preset contour coefficient;
s103, clustering unlabeled data according to the clustering number based on a K-means clustering algorithm to generate a clustering result, and aligning the clustering result with a real label to obtain an aligned label for finding a new intention;
s104, training the classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier;
s105, calculating KL divergence of the alignment label and the pseudo label to update the known intention data;
and S106, repeatedly executing the steps S101 to S105 until no new intention is added, and outputting the alignment label.
2. The method of claim 1, wherein the selecting a cluster number according to a preset profile coefficient comprises:
selecting a clustering number according to the first contour coefficient and/or the second contour coefficient;
the first contour coefficient is:

l(i) = [formula image not reproduced in the source]

the second contour coefficient is:

l^a(i) = [formula image not reproduced in the source]

wherein l(i) is the contour coefficient with a penalty term, l^a(i) is the extended contour coefficient with a penalty term, s(i) is the traditional contour coefficient, s^a(i) is the extended contour coefficient, λ and γ are both hyper-parameters, K is the cluster number, N is the total number of samples, C_k is the cluster to which the sample i belongs,

[formula image not reproduced in the source]

and σ and u are respectively the standard deviation and the mean of the number of samples in the current cluster.
3. The clustering-based new intention discovery method according to claim 1, wherein before the training of the classifier according to the clustering result and the known intention data to obtain the pseudo label of the classifier, the method comprises:
and adjusting the label number of the classifier according to the label number of the cluster.
4. The method of claim 3, wherein the training the classifier according to the clustering result and the known intention data to obtain the pseudo label of the classifier comprises:
calculating a joint loss according to the known intention data and the alignment label;
and updating the parameters of the classifier according to the joint loss to obtain a pseudo label of the classifier.
5. The clustering-based new intention discovery method according to claim 1, wherein the clustering of unlabeled data according to the cluster number based on the K-means clustering algorithm to generate a clustering result, and the aligning of the clustering result with the real label to obtain an alignment label for discovering a new intention, comprise:
clustering the label-free data according to the clustering number based on a K-means clustering algorithm to generate a clustering result;
and aligning the clustering result and the real label through a Hungarian algorithm to obtain an aligned label for finding a new intention.
6. The clustering-based new intention discovery method according to claim 1, wherein the pre-training of the classifier according to the known intention data comprises:
extracting feature vectors of known intention data based on a BERT pre-training language model;
inputting the feature vector into a classifier to obtain a prediction label;
calculating the cross entropy loss of the predicted label and the real label;
and updating the classifier parameters according to the cross entropy loss.
7. The clustering-based new intention discovery method according to claim 1, wherein after the outputting of the alignment label once no new intention is added, the method further comprises:
converting the implicit alignment tags into explicit intent tags by an intent tag generator.
8. A clustering-based new intention discovery device, characterized by comprising:
the preprocessing unit is used for pre-training the classifier according to the known intention data;
the selecting unit is used for selecting the clustering number according to the preset contour coefficient;
the clustering unit is used for clustering the label-free data according to the clustering number based on a K-means clustering algorithm to generate a clustering result, and aligning the clustering result with the real label to obtain an aligned label for finding a new intention;
the training unit is used for training the classifier according to the clustering result and the known intention data to obtain a pseudo label of the classifier;
a calculation unit, used for calculating the KL divergence between the alignment label and the pseudo label to update the known intention data;
and the output unit is used for outputting the alignment label when no new intention is added.
9. An electronic device comprising a processor and a memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the clustering-based new intention discovery method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the clustering-based new intention discovery method according to any one of claims 1 to 7.
CN202111592178.6A 2021-12-23 2021-12-23 Clustering-based new intention discovery method, device, equipment and storage medium Pending CN114510567A (en)

Priority Applications (1)

Application Number: CN202111592178.6A · Priority Date: 2021-12-23 · Filing Date: 2021-12-23 · Title: Clustering-based new intention discovery method, device, equipment and storage medium

Publications (1)

Publication Number: CN114510567A · Publication Date: 2022-05-17

Family ID: 81547948

Family Applications (1): CN202111592178.6A · Clustering-based new intention discovery method, device, equipment and storage medium

Country Status (1): CN · CN114510567A (en)

Cited By (1)

CN115168593A (cited by examiner) · priority date 2022-09-05 · publication date 2022-10-11 · 深圳爱莫科技有限公司 · Intelligent dialogue management system, method and processing equipment capable of self-learning


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination