CN111950294A - Intention identification method and device based on multi-parameter K-means algorithm and electronic equipment - Google Patents


Info

Publication number
CN111950294A
Authority
CN
China
Prior art keywords
fusion
clustering
class
clustering result
K-means algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010721000.6A
Other languages
Chinese (zh)
Inventor
孔醍
刘宗全
张家兴
Current Assignee
Beijing Qibao Xinan Technology Co ltd
Original Assignee
Beijing Qibao Xinan Technology Co ltd
Priority date
Application filed by Beijing Qibao Xinan Technology Co ltd filed Critical Beijing Qibao Xinan Technology Co ltd
Priority to CN202010721000.6A priority Critical patent/CN111950294A/en
Publication of CN111950294A publication Critical patent/CN111950294A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an intention identification method and device based on a multi-parameter K-means algorithm, and an electronic device. The method comprises the following steps: establishing a sample data set, wherein the sample data set comprises a plurality of semantic vectors obtained by converting dialog text; performing multi-round clustering processing on the sample data set by using a K-means algorithm, wherein each round of clustering adopts a different K value, and outputting an initial clustering result; setting a fusion strategy and determining the initial clustering results to be fused, wherein the fusion strategy comprises fusion parameters and fusion rules; fusing the initial clustering results to be fused according to the fusion rules to form a final clustering result; and performing intention recognition on the voice input in the current user dialog based on the final clustering result. By adopting an improved K-means algorithm, the method achieves more accurate intention classification and recognition, improves intention clustering quality, and optimizes the overall method.

Description

Intention identification method and device based on multi-parameter K-means algorithm and electronic equipment
Technical Field
The invention relates to the field of computer information processing, in particular to an intention identification method and device based on a multi-parameter K-means algorithm and electronic equipment.
Background
With the development of Internet technology, dialog systems have found wide application in e-commerce, smart devices and other areas, and are attracting increasing attention. Intent recognition is a primary and important task in a dialog system and is a multi-classification problem; accordingly, a multi-classification model is needed for the corresponding processing. In fact, classification is a basic problem of machine learning, alongside regression, clustering, and the like.
Clustering classifies existing data objects so that intra-class similarity is as large as possible and inter-class similarity is as small as possible. Generally, in dividing the data, only the simple similarity between items is taken as the criterion: there is no background knowledge or corresponding hypothesis, categories and their properties need not be defined in advance, and the division follows the natural attributes of the data, so cluster analysis is considered an unsupervised analysis method. Clustering finds application in many areas, such as biology, statistics, neural networks, information retrieval, image processing, and data mining. However, how to use such domain knowledge to improve the quality of cluster analysis is an important research topic in semi-supervised cluster analysis.
Existing clustering algorithms are generally divided into five categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods and model-based methods. Among them, the classical k-means algorithm is a partitioning-based clustering method. The traditional k-means clustering algorithm is an unsupervised learning method that classifies data according to basic optimization criteria and the most natural correlations, without using any real-world knowledge the user could provide. However, there is still much room for improvement in clustering quality and method optimization.
Therefore, there is a need for an intention recognition method based on a multi-parameter K-means algorithm that offers better clustering quality and further optimization.
Disclosure of Invention
In order to solve the above problems, the present invention provides an intention recognition method based on a multi-parameter K-means algorithm, applied to recognition of user intention in an intelligent voice robot, comprising: establishing a sample data set, wherein the sample data set comprises a plurality of semantic vectors converted from a dialog text, and the dialog text is converted from voice input when a user has a dialog with the intelligent voice robot; performing multi-round clustering processing on the sample data set by using a K-means algorithm, wherein each round of clustering adopts different K values, and outputting an initial clustering result, wherein the K value is the number of initial central vectors in the K-means algorithm; setting a fusion strategy and determining an initial clustering result to be fused, wherein the fusion strategy comprises fusion parameters and fusion rules; according to the fusion rule, carrying out fusion processing on the initial clustering result to be fused to form a final clustering result; and based on the final clustering result, performing intention recognition on the voice input when the current user has a conversation with the intelligent voice robot.
Preferably, the setting of the fusion policy includes: setting a plurality of fusion parameters, wherein the fusion parameters comprise at least two of purity, overall purity, purity gain, noise ratio, coverage and contour coefficient in the same class; the purity reaches a set threshold for purity and/or the noise ratio is less than a set threshold for noise.
Preferably, the setting of the fusion policy further comprises: setting a fusion rule, wherein the fusion rule comprises a first rule with the same or similar semantic and a second rule with the vector similarity exceeding a set threshold; and performing fusion processing on the different sets to be fused according to the first rule and/or the second rule, and adaptively selecting an optimal k' value.
Preferably, the method further comprises the following steps: setting the sample to obey normal distribution, wherein the number of classes in the initial clustering result is increased along with the increase of the number of rounds of the multi-round clustering processing, and the number of classes in the initial clustering result is gradually reduced when the number of rounds of the multi-round clustering processing reaches a specific number of rounds; and determining a Qrglow class set and a Qrghigh class set as initial clustering results to be fused based on a rule that at least one parameter in the fusion parameters is optimal.
Preferably, the method further comprises the following steps: and screening samples with one-to-one, one-to-many and many-to-many sample relationships in the Qrglow class set and the Qrghigh class set according to the fusion rule, and performing fusion processing to adaptively select an optimal k' value so as to output a final clustering result.
Preferably, the method further comprises the following steps: the value of k' is at kQrglowAnd k isQrghighIn the meantime.
Preferably, the method further comprises the following steps: performing multiple rounds of clustering processing by using a centroid algorithm, wherein each round of clustering processing comprises the following steps: setting an initial k value; and randomly generating k class center vectors, and iteratively updating the class center vectors by using the K-means algorithm until the distance between the class center vector of the current iteration and that of the previous iteration is less than a specified threshold value.
Preferably, iteratively updating the class of center vectors using the K-means algorithm comprises: calculating Euclidean distance from the sample to each class center vector; in Euclidean distances from the sample to various central vectors, the class where the class central vector with the minimum distance is located is used as the class to which the sample belongs in the iteration; and taking the mean vector of the samples belonging to the same class as the class center vector of the next iteration.
In addition, the invention also provides an intention recognition device based on the multi-parameter K-means algorithm, which is applied to recognition of the intention of the user in the intelligent voice robot and comprises the following steps: the system comprises an establishing module, a processing module and a processing module, wherein the establishing module is used for establishing a sample data set, the sample data set comprises a plurality of semantic vectors obtained by converting a dialog text, and the dialog text is converted from a voice input when a user dialogues with the intelligent voice robot; the clustering processing module is used for carrying out multi-round clustering processing on the sample data set by using a K-means algorithm, wherein each round of clustering adopts different K values, and an initial clustering result is output, and the K values are the number of initial central vectors in the K-means algorithm; the system comprises a setting module, a clustering module and a fusion module, wherein the setting module is used for setting a fusion strategy and determining an initial clustering result to be fused, and the fusion strategy comprises fusion parameters and fusion rules; the fusion processing module is used for carrying out fusion processing on the initial clustering result to be fused according to the fusion rule to form a final clustering result; and the recognition module is used for recognizing the intention of the voice input when the current user is in conversation with the intelligent voice robot based on the final clustering result.
Preferably, the setting of the fusion policy includes: setting a plurality of fusion parameters, wherein the fusion parameters comprise at least two of purity, overall purity, purity gain, noise ratio, coverage and contour coefficient in the same class; the purity reaches a set threshold for purity and/or the noise ratio is less than a set threshold for noise.
Preferably, the setting of the fusion policy further comprises: setting a fusion rule, wherein the fusion rule comprises a first rule with the same or similar semantic and a second rule with the vector similarity exceeding a set threshold; and performing fusion processing on the different sets to be fused according to the first rule and/or the second rule, and adaptively selecting an optimal k' value.
Preferably, the method further comprises the following steps: setting the sample to obey normal distribution, wherein the number of classes in the initial clustering result is increased along with the increase of the number of rounds of the multi-round clustering processing, and the number of classes in the initial clustering result is gradually reduced when the number of rounds of the multi-round clustering processing reaches a specific number of rounds; and determining a Qrglow class set and a Qrghigh class set as initial clustering results to be fused based on a rule that at least one parameter in the fusion parameters is optimal.
Preferably, the system further comprises a screening module, wherein the screening module screens samples in which the sample relationship between the Qrglow type set and the Qrghigh type set is one-to-one, one-to-many, or many-to-many according to the fusion rule, and performs fusion processing to adaptively select an optimal k' value so as to output a final clustering result.
Preferably, the method further comprises the following steps: the value of k' is at kQrglowAnd k isQrghighIn the meantime.
Preferably, the method further comprises the following steps: performing multiple rounds of clustering processing by using a centroid algorithm, wherein each round of clustering processing comprises the following steps: setting an initial k value; and randomly generating k class center vectors, and iteratively updating the class center vectors by using the K-means algorithm until the distance between the class center vector of the current iteration and that of the previous iteration is less than a specified threshold value.
Preferably, the method further comprises a calculation module, wherein the calculation module is used for calculating the Euclidean distance from the sample to each class center vector; in Euclidean distances from the sample to various central vectors, the class where the class central vector with the minimum distance is located is used as the class to which the sample belongs in the iteration; and taking the mean vector of the samples belonging to the same class as the class center vector of the next iteration.
In addition, the present invention also provides an electronic device, wherein the electronic device includes: a processor; and a memory storing computer executable instructions that, when executed, cause the processor to perform the multi-parameter K-means algorithm based intent recognition method of the present invention.
Furthermore, the present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium stores one or more programs which, when executed by a processor, implement the multi-parameter K-means algorithm-based intention recognition method of the present invention.
Advantageous effects
Compared with the prior art, the intention identification method adopts the improved K-means algorithm, carries out multi-round clustering processing on the user dialog text to be identified, and carries out fusion denoising on the clustering results of the multi-round clustering through multi-parameters and fusion rules, thereby realizing more accurate intention classification and identification, improving the intention clustering quality and optimizing the method.
Drawings
In order to make the technical problems solved by the present invention, the technical means adopted and the technical effects obtained more clear, the following will describe in detail the embodiments of the present invention with reference to the accompanying drawings. It should be noted, however, that the drawings described below are only illustrations of exemplary embodiments of the invention, from which other embodiments can be derived by those skilled in the art without inventive faculty.
FIG. 1 is a flow chart of an example of the multi-parameter K-means algorithm based intent recognition method of the present invention.
FIG. 2 is a flow chart of another example of the multi-parameter K-means algorithm based intent recognition method of the present invention.
FIG. 3 is a flow chart of yet another example of the multi-parameter K-means algorithm based intent recognition method of the present invention.
FIG. 4 is a schematic block diagram of an example of the intention identifying apparatus based on the multi-parameter K-means algorithm of the present invention.
Fig. 5 is a schematic structural block diagram of another example of the intention identifying apparatus of the present invention based on the multi-parameter K-means algorithm.
Fig. 6 is a schematic structural block diagram of still another example of the intention identifying apparatus of the present invention based on the multi-parameter K-means algorithm.
Fig. 7 is a block diagram of an exemplary embodiment of an electronic device according to the present invention.
Fig. 8 is a block diagram of an exemplary embodiment of a computer-readable medium according to the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings. The exemplary embodiments, however, may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art. The same reference numerals denote the same or similar elements, components, or parts in the drawings, and thus their repetitive description will be omitted.
Features, structures, characteristics or other details described in a particular embodiment do not preclude the fact that the features, structures, characteristics or other details may be combined in a suitable manner in one or more other embodiments in accordance with the technical idea of the invention.
In describing particular embodiments, the present invention has been described with reference to features, structures, characteristics or other details that are within the purview of one skilled in the art to provide a thorough understanding of the embodiments. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific features, structures, characteristics, or other details.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these terms should not be construed as limiting. These phrases are used to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention.
The term "and/or" includes any and all combinations of one or more of the associated listed items.
In order to further improve the accuracy of intention identification and classification, the invention provides an intention identification method based on a multi-parameter K-means algorithm, which adopts the improved K-means algorithm to perform multi-round clustering processing on a user dialog text to be identified, and performs fusion denoising on the clustering results of the multi-round clustering through multi-parameter and fusion rules, thereby realizing more accurate intention classification and identification, improving the quality of intention clustering and optimizing the method.
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
Example 1
Hereinafter, an embodiment of the multi-parameter K-means algorithm-based intention recognition method of the present invention will be described with reference to fig. 1 to 3.
FIG. 1 is a flow chart of an example of the multi-parameter K-means algorithm based intent recognition method of the present invention.
As shown in fig. 1, an intention recognition method based on a multi-parameter K-means algorithm includes the following steps.
Step S101, establishing a sample data set, wherein the sample data set comprises a plurality of semantic vectors converted from conversation texts, and the conversation texts are converted from voices input by a user during conversation with the intelligent voice robot.
And S102, performing multi-round clustering processing on the sample data set by using a K-means algorithm, wherein each round of clustering adopts different K values, and outputting an initial clustering result, wherein the K value is the number of initial central vectors in the K-means algorithm.
Step S103, setting a fusion strategy and determining an initial clustering result to be fused, wherein the fusion strategy comprises fusion parameters and fusion rules.
And step S104, performing fusion processing on the initial clustering result to be fused according to the fusion rule to form a final clustering result.
And step S105, based on the final clustering result, performing intention recognition on the voice input when the current user has a conversation with the intelligent voice robot.
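As an illustrative sketch only, the five steps S101 to S105 can be arranged into a pipeline skeleton. The function names below (`embed`, `cluster`, `fuse`) are hypothetical stand-ins for the semantic-vector conversion, a single K-means round and the fusion strategy; they are not APIs defined by the patent.

```python
def intent_recognition_pipeline(dialog_texts, query_text, k_values, embed, cluster, fuse):
    """Hypothetical skeleton of steps S101-S105; `embed`, `cluster` and
    `fuse` are injected stand-ins, not functions named by the patent."""
    # S101: build the sample data set of semantic vectors from dialog texts.
    samples = [embed(t) for t in dialog_texts]
    # S102: multi-round clustering, each round with a different k value.
    initial_results = [cluster(samples, k) for k in k_values]
    # S103/S104: fuse the per-round results into the final clustering result.
    final_clusters = fuse(initial_results)
    # S105: recognize the intent of the current input as the nearest class set.
    q = embed(query_text)

    def center(cl):
        # Mean vector of a class set.
        dim = len(cl[0])
        return [sum(v[d] for v in cl) / len(cl) for d in range(dim)]

    return min(range(len(final_clusters)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(q, center(final_clusters[i]))))
```

For example, with a toy `embed` that maps a text to its length, the current input is assigned to whichever fused class set has the nearest mean vector.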
In the present example, the method of the present invention is applied to recognition of a user's intention in an intelligent voice robot, and a specific process will be described below.
First, in step S101, a sample data set is established, where the sample data set includes a plurality of semantic vectors converted from a dialog text, where the dialog text is converted from a voice input by a user when the user has a dialog with the intelligent voice robot.
In this example, when a user has a conversation with the intelligent voice robot, the user's conversational voice input is obtained and converted into a conversational text.
Specifically, the dialog text includes intention category information. For example, suppose "A" is the name of a financial product or financial service: user 1 enters "I want to know about A", forming intent 1, and user 2 enters "What is the interest rate pricing of A?", forming intent 2.
Preferably, the dialog text is preprocessed and divided into tag sentences, fallback (catch-all) sentences and intermediate sentences for intent classification.
Further, the dialog text of the user is subjected to semantic vector conversion, for example, using a BERT pre-training model, to form a semantic vector representation for intent recognition.
It should be noted that for semantic vector conversion, in other examples, a RoBERTa model, a DistilBERT model, or the like may also be used. The foregoing is illustrative only and is not to be construed as limiting the invention.
Next, in step S102, a K-means algorithm is used to perform multiple rounds of clustering processing on the sample data set, each round of clustering adopts a different K value, and an initial clustering result is output, where the K value is the number of initial center vectors in the K-means algorithm.
In this example, it is set that the samples follow a normal distribution, the number of classes in the initial clustering result increases as the number of rounds of the multi-round clustering process increases, and the number of classes in the initial clustering result gradually decreases when the number of rounds of the multi-round clustering process reaches a certain number of rounds.
Specifically, the clustering calculation is performed using an algorithm of a centroid, for example.
The clustering algorithm is not limited to this, and the above description is only given as a preferred example, and is not to be construed as limiting the present invention, and other examples may be an algorithm based on probability distribution, an EM algorithm, and the like.
Further, each round of clustering process comprises: setting an initial k value; randomly generating K class center vectors, and iteratively updating the class center vectors by using a K-means algorithm until the distance between the class center vector in the current iteration and the class center vector in the last iteration is less than a specified threshold value.
In this example, for the multi-round clustering process, each round of clustering takes a different k value and outputs an initial clustering result.
Further, calculating the Euclidean distance from each sample in the sample data set to each class center vector; in Euclidean distances from each sample to various central vectors, the class of the class central vector with the minimum distance is taken as the class to which the sample belongs in the iteration; and taking the mean vector of the samples belonging to the same class as the class center vector of the next iteration.
In order to more clearly illustrate the method of the present invention, the calculation principle of the method of the present invention will be described in detail below.
In the inventive method, a centroid-based algorithm is used to calculate the center vector for each cluster (in this specification, referred to as a class set). Specifically, the sample data set is divided into k classes, and the selection of the value of k is set by the technician. The specific flow of the algorithm is as follows.
First, initialize the center vectors μ_1, μ_2, ..., μ_k of the k classes.

Allocation stage. Determine the class of each sample according to the current class centers: for each sample x_i, calculate its distance to each class center μ_j,

d_ij = ||x_i - μ_j||,

and allocate the sample to the class whose center is closest.

Update stage. Update the class center of each class according to the allocation of the previous step, taking the mean vector of all samples of the class:

μ_j = (1 / |C_j|) Σ_{x_i ∈ C_j} x_i,

where C_j denotes the set of samples currently allocated to class j. The allocation and update stages are repeated until the stopping criterion is met.
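The allocation and update stages can be written out as a minimal, self-contained K-means loop. This is a sketch under the assumptions stated in the text (Euclidean distance, mean-vector update, stop when the largest center shift falls below a threshold), not the patent's exact implementation.

```python
import math
import random

def kmeans(samples, k, tol=1e-4, max_iter=100, seed=0):
    """Minimal centroid-based K-means over lists of floats (a sketch)."""
    rng = random.Random(seed)
    # Initialize mu_1 .. mu_k from k distinct samples.
    centers = [list(s) for s in rng.sample(samples, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Allocation stage: assign each sample to its nearest class center.
        clusters = [[] for _ in range(k)]
        for x in samples:
            j = min(range(k), key=lambda j: math.dist(x, centers[j]))
            clusters[j].append(x)
        # Update stage: new center = mean vector of the class's samples.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append([sum(m[d] for m in members) / len(members)
                                    for d in range(dim)])
            else:
                new_centers.append(centers[j])  # keep an empty class's old center
        # Stop when every class center moved less than the threshold.
        shift = max(math.dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, clusters
```

On two well-separated groups of 1-D points, the loop converges to the two group means regardless of which samples seed the centers.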
Thus, through the algorithm, multi-round clustering processing is executed, and an initial clustering result corresponding to the number of rounds is obtained.
Next, in step S103, a fusion policy is set, and an initial clustering result to be fused is determined, where the fusion policy includes a fusion parameter and a fusion rule.
Specifically, setting the fusion policy includes: a plurality of fusion parameters are set, the fusion parameters including at least two of purity, overall purity, purity gain, noise ratio, coverage and contour factor within the same class.
Preferably, the fusion parameters include purity, overall purity, purity gain, and noise ratio within the same class.
Note that, in this example, the coverage ratio refers to the ratio of the number of tag sentences, fallback sentences or intermediate sentences to the number of samples in each class set, and the overall coverage ratio refers to the ratio of that number to the total size of the sample data set. The definition is not limited thereto; in other application scenarios, the coverage ratio may be redefined according to the specific service.
Specifically, it is judged for the parameter, for example, whether the purity reaches a purity setting threshold, and/or the noise ratio is smaller than a noise setting threshold.
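The patent does not give formulas for these parameters; the sketch below assumes the common definitions of purity (share of the majority label in a class set, given a labeled reference) and noise ratio (share of samples carrying noise labels), and checks the two gates just described.

```python
from collections import Counter

def class_purity(labels):
    """Purity of one class set: share of its majority label (assumed definition)."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def overall_purity(clusters):
    """Sample-weighted purity over all class sets."""
    total = sum(len(c) for c in clusters)
    return sum(class_purity(c) * len(c) for c in clusters) / total

def passes_fusion_check(clusters, noise_labels, purity_threshold=0.8, noise_threshold=0.1):
    """The two gates above: purity reaches its set threshold AND the
    noise ratio stays below its set threshold (thresholds are examples)."""
    total = sum(len(c) for c in clusters)
    noise = sum(1 for c in clusters for lab in c if lab in noise_labels)
    return overall_purity(clusters) >= purity_threshold and noise / total < noise_threshold
```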
As shown in fig. 2, a step S201 of setting a fusion rule is further included.
In step S201, a fusion rule is set. Specifically, the fusion rule includes a first rule setting the semanteme to be the same or similar and a second rule setting the vector similarity to exceed a set threshold.
Further, based on a rule that at least one parameter of the fusion parameters is optimal, for example, a parameter optimal rule that purity is maximum, noise ratio is minimum, and the like, a Qrglow class set and a Qrghigh class set are determined as initial clustering results to be fused.
For example, the Qrglow class set contains 9 classes obtained with k = 4, and the Qrghigh class set contains 12 classes obtained with k = 10.
Next, in step S104, according to the fusion rule, the initial clustering results to be fused are fused to form a final clustering result.
In this example, the different class sets to be fused (including the Qrglow class set and the Qrghigh class set) are fused according to the first rule and/or the second rule.
Specifically, according to the fusion rule, samples whose relationship between the Qrglow class set (e.g., k_Qrglow = 4, 9 classes) and the Qrghigh class set (e.g., k_Qrghigh = 10, 12 classes) is one-to-one, one-to-many or many-to-many are screened out, and fusion processing is performed.
Further, for example, the samples of the first class in the Qrglow class set are in a one-to-many relationship with the samples of the first, third, and fifth classes in the Qrghigh class set. For example, some samples in the Qrglow class set do not appear in the Qrghigh class set. For another example, the samples of the second, sixth and ninth classes in the Qrglow class set and the samples of the sixth and tenth classes in the Qrghigh class set are many-to-many, and so on.
For the class sets to be fused screened from the two class sets of the Qrglow class set and the Qrghigh class set, matching fusion is performed specifically according to the fact that semantic words are the same or similar (a first rule) and/or vector similarity exceeds a set threshold (a second rule), and an optimal k' value is selected in a self-adaptive mode to output a final clustering result.
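A minimal sketch of the second rule, under the assumption that each class is represented by its center vector and that "vector similarity" means cosine similarity: a Qrghigh class whose center is sufficiently similar to some Qrglow class is folded into it; otherwise it survives as its own class, and the number of surviving classes is taken as the adaptively selected k'.

```python
import math

def cosine(u, v):
    # Cosine similarity of two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def fuse_by_similarity(low_centers, high_centers, threshold=0.95):
    """Toy fusion under the second rule: fold each Qrghigh class center into
    its most similar Qrglow class when cosine similarity exceeds `threshold`,
    otherwise keep it as a new class. len(result) plays the role of k'."""
    fused = [list(c) for c in low_centers]
    for hc in high_centers:
        sims = [cosine(hc, lc) for lc in low_centers]
        best = max(range(len(sims)), key=lambda i: sims[i])
        if sims[best] < threshold:
            fused.append(list(hc))  # no near-duplicate among Qrglow: keep it
    return fused
```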
Preferably, the method further comprises the following step: the value of k' lies between k_Qrglow and k_Qrghigh.
Therefore, compared with the traditional K-means algorithm, the improved K-means algorithm yields a more accurate clustering result; in other words, the k' value is more accurate.
Next, in step S105, based on the final clustering result, intention recognition is performed on the voice input when the current user has a conversation with the intelligent voice robot.
In this example, an intent category database is built based on the clustering results.
Specifically, the dialogue voice input of the current user is acquired in real time, semantic vector conversion is performed on the intention category information of the dialogue voice input, and the class set to which the current user's intention category belongs is determined based on the intention category database and the intention category information, for intention recognition.
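A minimal sketch of this lookup step follows, assuming the intent category database stores one centroid vector per class set and assignment is by nearest centroid. The category names and toy vectors are hypothetical; real semantic vectors would come from the conversion step described above.

```python
# Hedged sketch: assign an incoming utterance's semantic vector to the class
# set whose centroid is nearest in Euclidean distance.
import math

def nearest_intent(vector, intent_db):
    """intent_db: mapping intent_name -> class centroid vector."""
    def dist(name):
        c = intent_db[name]
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(vector, c)))
    return min(intent_db, key=dist)

intent_db = {
    "check_balance": [0.9, 0.1],  # hypothetical intent categories
    "repay_loan": [0.1, 0.9],
}
label = nearest_intent([0.8, 0.2], intent_db)  # nearest to "check_balance"
```

An input far from every centroid could instead be flagged as a new intention, matching the new-intention labeling described next.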
Further, under the condition that the intention of the current user is judged to be a new intention, new intention labeling is carried out, and a corresponding reply is generated.
Preferably, the data related to the new intent is added to a sample data set to be classified for updating the sample data set.
In another example, the method further comprises the step of setting a data update time, and updating the sample data set to be classified according to the data update time.
In another example, the method further includes a step of denoising the fused class set.
Specifically, the method comprises the steps of judging whether the purity of each type of set reaches a set purity threshold value and/or judging whether the noise ratio of each type of set is smaller than a set noise threshold value so as to perform denoising processing.
Preferably, a minimum sample number threshold value in the class set is set, and the class set with the sample number smaller than the minimum sample number threshold value is used as a noise set or a removal set.
Specifically, the purity, the noise ratio, and the coverage of each class set are calculated.
Further, based on the calculation result, the class set with the calculated purity being greater than or equal to the set threshold of the purity is used as a reserved class set; taking the class set with the calculated noise ratio smaller than the noise setting threshold value as a reserved class set; and/or using the class set with the calculated coverage ratio in the set ratio range as a reserved class set.
In this example, the number of class sets k' is determined based on all class sets to be retained and class sets to be removed.
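The denoising filter described above can be sketched as follows. All threshold values (purity 0.8, noise 0.2, minimum 5 samples) are assumptions for the example; the patent only states that such thresholds are set.

```python
# Illustrative filter for the denoising step: a class set is retained only if
# its purity reaches the purity threshold, its noise ratio is below the noise
# threshold, and it holds at least the minimum number of samples; the count of
# retained class sets gives k'.

def keep_class(cls, purity_min=0.8, noise_max=0.2, min_samples=5):
    return (cls["purity"] >= purity_min
            and cls["noise_ratio"] < noise_max
            and cls["n_samples"] >= min_samples)

classes = [
    {"purity": 0.90, "noise_ratio": 0.05, "n_samples": 40},
    {"purity": 0.60, "noise_ratio": 0.30, "n_samples": 25},  # removed: impure and noisy
    {"purity": 0.95, "noise_ratio": 0.02, "n_samples": 3},   # removed: below minimum size
]
kept = [c for c in classes if keep_class(c)]
k_prime = len(kept)  # only the first class set survives
```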
The above-described procedure of the intent recognition method based on the multi-parameter K-means algorithm is only for explanation of the present invention, wherein the order and number of steps are not particularly limited. In addition, the steps in the method may also be split into two or three, for example, the step S104 is split into steps S104 and S301 (see fig. 3), or some steps may also be combined into one step, and the adjustment is performed according to an actual example.
Compared with the prior art, the intention identification method adopts the improved K-means algorithm, carries out multi-round clustering processing on the user dialog text to be identified, and carries out fusion denoising on the clustering results of the multi-round clustering through multi-parameters and fusion rules, thereby realizing more accurate intention classification and identification, improving the intention clustering quality and optimizing the method.
Those skilled in the art will appreciate that all or part of the steps to implement the above-described embodiments are implemented as programs (computer programs) executed by a computer data processing apparatus. When the computer program is executed, the method provided by the invention can be realized. Furthermore, the computer program may be stored in a computer readable storage medium, which may be a readable storage medium such as a magnetic disk, an optical disk, a ROM, a RAM, or a storage array composed of a plurality of storage media, such as a magnetic disk or a magnetic tape storage array. The storage medium is not limited to centralized storage, but may be distributed storage, such as cloud storage based on cloud computing.
Embodiments of the apparatus of the present invention are described below, which may be used to perform method embodiments of the present invention. The details described in the device embodiments of the invention should be regarded as complementary to the above-described method embodiments; reference is made to the above-described method embodiments for details not disclosed in the apparatus embodiments of the invention.
Example 2
Referring to fig. 4, 5 and 6, the present invention also provides an intention recognition apparatus 400 based on a multi-parameter K-means algorithm, applied to recognition of user intention in an intelligent voice robot, including: the establishing module 401 is configured to establish a sample data set, where the sample data set includes a plurality of semantic vectors converted from a dialog text, where the dialog text is converted from a voice input by a user when the user has a dialog with the intelligent voice robot; a clustering module 402, configured to perform multiple rounds of clustering on the sample data set by using a K-means algorithm, where each round of clustering uses a different K value, and output an initial clustering result, where the K value is the number of initial center vectors in the K-means algorithm; a setting module 403, configured to set a fusion policy and determine an initial clustering result to be fused, where the fusion policy includes a fusion parameter and a fusion rule; a fusion processing module 404, configured to perform fusion processing on the initial clustering result to be fused according to the fusion rule to form a final clustering result; and the recognition module 405 performs intention recognition on the voice input when the current user has a conversation with the intelligent voice robot based on the final clustering result.
Preferably, the setting of the fusion policy includes: setting a plurality of fusion parameters, wherein the fusion parameters include at least two of purity within the same class, overall purity, purity gain, noise ratio, coverage, and silhouette coefficient; the purity reaches the set purity threshold and/or the noise ratio is less than the set noise threshold.
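Two of these fusion parameters can be sketched for a single class set as below. The formulas used (purity as the share of the majority label, noise ratio as the share of samples whose similarity to the class center falls below a cutoff) are common conventions and an assumption here, since the patent does not fix the formulas.

```python
# Hedged sketch of two fusion parameters for one class set.
from collections import Counter

def purity(labels):
    """labels: provisional intent labels of the samples in one class set."""
    if not labels:
        return 0.0
    return Counter(labels).most_common(1)[0][1] / len(labels)

def noise_ratio(similarities, cutoff=0.5):
    """similarities: each sample's similarity to the class center vector."""
    if not similarities:
        return 0.0
    return sum(1 for s in similarities if s < cutoff) / len(similarities)

p = purity(["repay", "repay", "repay", "balance"])  # 3 of 4 samples share a label
n = noise_ratio([0.9, 0.8, 0.4, 0.7])               # 1 of 4 samples below the cutoff
```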
Preferably, the setting of the fusion policy further comprises: setting a fusion rule, wherein the fusion rule comprises a first rule with the same or similar semantic and a second rule with the vector similarity exceeding a set threshold; and performing fusion processing on the different sets to be fused according to the first rule and/or the second rule, and adaptively selecting an optimal k' value.
Preferably, the method further comprises the following steps: assuming that the samples obey a normal distribution, the number of classes in the initial clustering result increases as the number of rounds of the multi-round clustering processing increases, and gradually decreases after the number of rounds reaches a specific number; and determining a Qrglow class set and a Qrghigh class set as the initial clustering results to be fused based on the rule that at least one of the fusion parameters is optimal.
As shown in fig. 5, the method further includes a screening module 501, where the screening module 501 screens samples in the Qrglow class set and the Qrghigh class set according to the fusion rule, where the sample relationship is one-to-one, one-to-many, and many-to-many, and performs fusion processing to adaptively select an optimal k' value, so as to output a final clustering result.
Preferably, the method further comprises the following step: the value of k' lies between k_Qrglow and k_Qrghigh.
Preferably, the method further comprises the following steps: performing the multiple rounds of clustering processing by using a centroid algorithm, wherein each round of clustering processing includes: setting an initial k value; and randomly generating k class center vectors, and iteratively updating the class center vectors by using the K-means algorithm until the distance between the class center vector of the current iteration and that of the previous iteration is less than a specified threshold.
As shown in fig. 6, further comprising a calculating module 601, where the calculating module 601 is configured to calculate euclidean distances from the samples to each class center vector; in Euclidean distances from the sample to various central vectors, the class where the class central vector with the minimum distance is located is used as the class to which the sample belongs in the iteration; and taking the mean vector of the samples belonging to the same class as the class center vector of the next iteration.
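The per-round procedure carried out by the clustering and calculating modules can be sketched in pure Python as follows: random initial centroids, Euclidean assignment of each sample to its nearest class center, mean-vector update, and stopping once the centroids move less than a threshold. The 2-D toy samples are for illustration only; real inputs would be the semantic vectors of the sample data set.

```python
# Minimal sketch of one round of the K-means procedure described above.
import math
import random

def kmeans(samples, k, tol=1e-4, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(samples, k)  # k random initial class center vectors
    for _ in range(max_iter):
        # Assign each sample to the class of its nearest center (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for s in samples:
            j = min(range(k), key=lambda i: math.dist(s, centroids[i]))
            clusters[j].append(s)
        # The mean vector of each class becomes the next class center vector.
        new_centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Stop when the centers move less than the specified threshold.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, clusters

samples = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
centroids, clusters = kmeans(samples, k=2)  # two well-separated classes of 2 samples each
```

In the patented method this round is repeated with a different k each time, and the per-round results become the initial clustering results that the fusion step consumes.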
In embodiment 2, descriptions of the portions that are the same as in embodiment 1 are omitted.
Compared with the prior art, the intention recognition device adopts the improved K-means algorithm to perform multi-round clustering processing on the user dialog text to be recognized, and performs fusion denoising on the clustering results of the multi-round clustering through multi-parameters and fusion rules, so that more accurate intention classification and recognition are realized, the intention clustering quality is improved, and the method is optimized.
Those skilled in the art will appreciate that the modules in the above-described embodiments of the apparatus may be distributed as described in the apparatus, and may be correspondingly modified and distributed in one or more apparatuses other than the above-described embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Example 3
In the following, embodiments of the electronic device of the present invention are described, which may be regarded as specific physical implementations for the above-described embodiments of the method and apparatus of the present invention. Details described in the embodiments of the electronic device of the invention should be considered supplementary to the embodiments of the method or apparatus described above; for details which are not disclosed in embodiments of the electronic device of the invention, reference may be made to the above-described embodiments of the method or the apparatus.
Fig. 7 is a block diagram of an exemplary embodiment of an electronic device according to the present invention. An electronic device 200 according to the invention will be described below with reference to fig. 7. The electronic device 200 shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 7, the electronic device 200 is embodied in the form of a general purpose computing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one memory unit 220, a bus 230 connecting different system components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.
Wherein the storage unit stores program code executable by the processing unit 210 to cause the processing unit 210 to perform steps according to various exemplary embodiments of the present invention described in the above-mentioned electronic device processing method section of the present specification. For example, the processing unit 210 may perform the steps as shown in fig. 1.
The memory unit 220 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)2201 and/or a cache memory unit 2202, and may further include a read only memory unit (ROM) 2203.
The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 230 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 200 may also communicate with one or more external devices 300 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 200, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 200 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 250. Also, the electronic device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 260. The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments of the present invention described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a computer-readable storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, or a network device, etc.) execute the above-mentioned method according to the present invention. The computer program, when executed by a data processing apparatus, enables the computer readable medium to carry out the above-described methods of the invention.
As shown in fig. 8, the computer program may be stored on one or more computer readable media. The computer readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable storage medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable storage medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In summary, the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functionality of some or all of the components in embodiments in accordance with the invention may be implemented in practice using a general purpose data processing device such as a microprocessor or a Digital Signal Processor (DSP). The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
While the foregoing embodiments have described the objects, aspects and advantages of the present invention in further detail, it should be understood that the present invention is not inherently related to any particular computer, virtual machine or electronic device, and various general-purpose machines may be used to implement the present invention. The invention is not to be considered as limited to the specific embodiments thereof, but is to be understood as being modified in all respects, all changes and equivalents that come within the spirit and scope of the invention.

Claims (10)

1. An intention recognition method based on a multi-parameter K-means algorithm is applied to recognition of user intention in an intelligent voice robot, and is characterized by comprising the following steps:
establishing a sample data set, wherein the sample data set comprises a plurality of semantic vectors converted from a dialog text, and the dialog text is converted from voice input when a user has a dialog with the intelligent voice robot;
performing multi-round clustering processing on the sample data set by using a K-means algorithm, wherein each round of clustering adopts different K values, and outputting an initial clustering result, wherein the K value is the number of initial central vectors in the K-means algorithm;
setting a fusion strategy and determining an initial clustering result to be fused, wherein the fusion strategy comprises fusion parameters and fusion rules;
according to the fusion rule, carrying out fusion processing on the initial clustering result to be fused to form a final clustering result;
and based on the final clustering result, performing intention recognition on the voice input when the current user has a conversation with the intelligent voice robot.
2. The intent recognition method according to claim 1, wherein said setting a fusion policy comprises:
setting a plurality of fusion parameters, wherein the fusion parameters comprise at least two of purity within the same class, overall purity, purity gain, noise ratio, coverage, and silhouette coefficient;
the purity reaches a set threshold for purity and/or the noise ratio is less than a set threshold for noise.
3. The intention recognition method according to claim 1 or 2, wherein the setting of the fusion policy further comprises:
setting a fusion rule, wherein the fusion rule comprises a first rule with the same or similar semantic and a second rule with the vector similarity exceeding a set threshold;
and performing fusion processing on the different sets to be fused according to the first rule and/or the second rule, and adaptively selecting an optimal k' value.
4. The intention recognition method according to any one of claims 1 to 3, characterized by further comprising:
setting the sample to obey normal distribution, wherein the number of classes in the initial clustering result is increased along with the increase of the number of rounds of the multi-round clustering processing, and the number of classes in the initial clustering result is gradually reduced when the number of rounds of the multi-round clustering processing reaches a specific number of rounds;
and determining a Qrglow class set and a Qrghigh class set as initial clustering results to be fused based on a rule that at least one parameter in the fusion parameters is optimal.
5. The intention recognition method according to any one of claims 1 to 4, characterized by further comprising:
and screening samples with one-to-one, one-to-many and many-to-many sample relationships in the Qrglow class set and the Qrghigh class set according to the fusion rule, and performing fusion processing to adaptively select an optimal k' value so as to output a final clustering result.
6. The intention recognition method according to any one of claims 1 to 5, characterized by further comprising:
the value of k' is at kQrglowAnd k isQrghighIn the meantime.
7. The intention recognition method according to any one of claims 1 to 6, characterized by further comprising:
performing multiple rounds of clustering processing by using a centroid algorithm, wherein each round of clustering processing comprises the following steps:
setting an initial k value;
and randomly generating K class center vectors, and iteratively updating the class center vectors by using a K-means algorithm until the distance between the class center vector of the current iteration and that of the previous iteration is less than a specified threshold.
8. An intention recognition device based on a multi-parameter K-means algorithm is applied to recognition of user intention in an intelligent voice robot, and is characterized by comprising the following components:
the system comprises an establishing module, a processing module and a processing module, wherein the establishing module is used for establishing a sample data set, the sample data set comprises a plurality of semantic vectors obtained by converting a dialog text, and the dialog text is converted from a voice input when a user dialogues with the intelligent voice robot;
the clustering processing module is used for carrying out multi-round clustering processing on the sample data set by using a K-means algorithm, wherein each round of clustering adopts different K values, and an initial clustering result is output, and the K values are the number of initial central vectors in the K-means algorithm;
the system comprises a setting module, a clustering module and a fusion module, wherein the setting module is used for setting a fusion strategy and determining an initial clustering result to be fused, and the fusion strategy comprises fusion parameters and fusion rules;
the fusion processing module is used for carrying out fusion processing on the initial clustering result to be fused according to the fusion rule to form a final clustering result;
and the recognition module is used for recognizing the intention of the voice input when the current user is in conversation with the intelligent voice robot based on the final clustering result.
9. An electronic device, wherein the electronic device comprises:
a processor; and the number of the first and second groups,
a memory storing computer executable instructions that, when executed, cause the processor to perform the multi-parameter K-means algorithm based intent recognition method according to any of claims 1 to 7.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores one or more programs which, when executed by a processor, implement the multi-parameter K-means algorithm-based intention recognition method of any one of claims 1 to 7.
CN202010721000.6A 2020-07-24 2020-07-24 Intention identification method and device based on multi-parameter K-means algorithm and electronic equipment Withdrawn CN111950294A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010721000.6A CN111950294A (en) 2020-07-24 2020-07-24 Intention identification method and device based on multi-parameter K-means algorithm and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010721000.6A CN111950294A (en) 2020-07-24 2020-07-24 Intention identification method and device based on multi-parameter K-means algorithm and electronic equipment

Publications (1)

Publication Number Publication Date
CN111950294A true CN111950294A (en) 2020-11-17

Family

ID=73340908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010721000.6A Withdrawn CN111950294A (en) 2020-07-24 2020-07-24 Intention identification method and device based on multi-parameter K-means algorithm and electronic equipment

Country Status (1)

Country Link
CN (1) CN111950294A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530409A (en) * 2020-12-01 2021-03-19 平安科技(深圳)有限公司 Voice sample screening method and device based on geometry and computer equipment
CN112989040A (en) * 2021-03-10 2021-06-18 河南中原消费金融股份有限公司 Dialog text labeling method and device, electronic equipment and storage medium
CN113658710A (en) * 2021-08-11 2021-11-16 东软集团股份有限公司 Data matching method and related equipment thereof
CN113688323A (en) * 2021-09-03 2021-11-23 支付宝(杭州)信息技术有限公司 Method and device for constructing intention triggering strategy and intention identification
CN116719831A (en) * 2023-08-03 2023-09-08 四川中测仪器科技有限公司 Standard database establishment and update method for health monitoring


Similar Documents

Publication Publication Date Title
WO2021174757A1 (en) Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium
CN111950294A (en) Intention identification method and device based on multi-parameter K-means algorithm and electronic equipment
CN110377916B (en) Word prediction method, word prediction device, computer equipment and storage medium
US20200258007A1 (en) Systems and methods for automatically configuring training data for training machine learning models of a machine learning-based dialogue system
CN112270546A (en) Risk prediction method and device based on stacking algorithm and electronic equipment
CN110532558B (en) Multi-intention recognition method and system based on sentence structure deep parsing
CN110297888B (en) Domain classification method based on prefix tree and cyclic neural network
CN112348519A (en) Method and device for identifying fraudulent user and electronic equipment
CN111191000A (en) Dialog management method, device and system of intelligent voice robot
JP2014026455A (en) Media data analysis device, method and program
US20200272435A1 (en) Systems and methods for virtual programming by artificial intelligence
CN110223134B (en) Product recommendation method based on voice recognition and related equipment
CN113434683A (en) Text classification method, device, medium and electronic equipment
CN112035626A (en) Rapid identification method and device for large-scale intentions and electronic equipment
CN111368878A (en) Optimization method based on SSD target detection, computer equipment and medium
JP2021081713A (en) Method, device, apparatus, and media for processing voice signal
CN112418320A (en) Enterprise association relation identification method and device and storage medium
CN112100339A (en) User intention recognition method and device for intelligent voice robot and electronic equipment
CN111966798A (en) Intention identification method and device based on multi-round K-means algorithm and electronic equipment
CN113870863A (en) Voiceprint recognition method and device, storage medium and electronic equipment
CN113918710A (en) Text data processing method and device, electronic equipment and readable storage medium
US20190228072A1 (en) Information processing device, learning method, and storage medium
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
CN114510567A (en) Clustering-based new idea finding method, device, equipment and storage medium
CN112328784B (en) Data information classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201117