CN112256844A - Text classification method and device - Google Patents
- Publication number
- CN112256844A (application CN201911148366.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/35—Clustering; Classification
Abstract
The application provides a text classification method and device. The method comprises: obtaining the example questions corresponding to each sub-intention in a current scenario, wherein an example question is a standard phrasing of a class of questions; expanding the example questions corresponding to each sub-intention according to the questions input by users in the current scenario and a text matching algorithm, to obtain the expanded example questions corresponding to each sub-intention; and merging all the sub-intentions according to a confusion matrix, the confusion degree between every two sub-intentions, and the expanded example questions corresponding to each sub-intention, to obtain N sub-intentions, where N is a positive integer. The online corpora in a new scenario are thereby classified automatically and reasonably; compared with manual classification, less time is consumed and accuracy is improved.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a text classification method and apparatus.
Background
At present, in a customer service robot dialogue system, text classification is mainly used to identify the intention behind a user's query, and it plays a very important role in such systems. The text classification process is as follows: first, online corpora (that is, questions input by users) are manually labeled; after labeling is completed, the labeled data are analyzed with a confusion matrix and tuned so that the classification becomes more reasonable; finally, a classification model is obtained by training on the adjusted, classified data with a classification algorithm.
When the service scope of a customer service robot dialogue system is expanded, a large amount of manpower is generally consumed to label the data, adjust the classification, and train a new classification model that fits the newly added scenario. In the prior art, when the business is expanded, business personnel analyze the online corpora, classify them according to business type, and then label all the online corpora of the new scenario according to the classification result. As before, after labeling is completed, the labeled data are analyzed with a confusion matrix and tuned so that the classification becomes more reasonable, and finally a classification model is obtained by training on the adjusted, classified data with a classification algorithm.
When business personnel classify the online corpora according to business type, some sub-intentions (a sub-intention is a class of questions that the customer service robot can directly respond to in one scenario) are usually merged into a larger category. Such manual classification is time-consuming and has low accuracy.
Disclosure of Invention
The application provides a text classification method and a text classification device, which are used to solve the problems of long time consumption and low accuracy of the conventional method during service expansion.
In a first aspect, the present application provides a text classification method, including:
acquiring the example questions corresponding to each sub-intention in a current scenario, wherein an example question is a standard phrasing of a class of questions;
expanding the example questions corresponding to each sub-intention according to the questions input by users in the current scenario and a text matching algorithm, to obtain the expanded example questions corresponding to each sub-intention;
and merging all the sub-intentions according to a confusion matrix, the confusion degree between every two sub-intentions, and the expanded example questions corresponding to each sub-intention, to obtain N sub-intentions, wherein N is a positive integer.
Further, the expanding the example questions corresponding to each sub-intention according to the questions input by users in the current scenario and the text matching algorithm to obtain the expanded example questions corresponding to each sub-intention includes:
screening, from the questions input by users in the current scenario and through a text matching algorithm, the questions that match the example questions corresponding to each sub-intention;
and respectively taking the screened matching questions as the expanded example questions corresponding to each sub-intention.
Further, the merging all the sub-intentions according to the confusion matrix, the confusion degree between every two sub-intentions, and the expanded example questions corresponding to each sub-intention to obtain N sub-intentions includes:
S1, inputting the example questions corresponding to each sub-intention into the confusion matrix in turn, to obtain the sub-intention predicted for each example question, the harmonic mean of the accuracy and recall of each sub-intention, and the harmonic mean of the overall accuracy and recall of all sub-intentions;
S2, calculating the confusion degree of any two sub-intentions among all the sub-intentions according to the actual example questions and the predicted example questions corresponding to the sub-intentions;
S3, merging the M sub-intentions with the largest confusion degree to obtain a new sub-intention, wherein M is a preset positive integer, and the number of example questions corresponding to the new sub-intention is the sum of the numbers of example questions corresponding to the M merged sub-intentions;
continuing to perform S1-S3 on the new sub-intention and the sub-intentions other than the M merged sub-intentions, until the number of all sub-intentions is N, N being a preset value; or,
continuing to perform S1-S3 on the new sub-intention and the sub-intentions other than the M merged sub-intentions, until the harmonic mean of the accuracy and recall of each of the resulting N sub-intentions is greater than a first preset threshold and the harmonic mean of the overall accuracy and recall of the N sub-intentions is greater than a second preset threshold.
Further, the calculating the confusion degree of any two sub-intentions among all the sub-intentions according to the actual example questions and the predicted example questions corresponding to the sub-intentions includes:
calculating, through the following calculation formula, the confusion degree of any two sub-intentions cate_i and cate_j among all the sub-intentions according to the actual example questions and the predicted example questions corresponding to the sub-intentions,
where N_(cate_i, cate_j) denotes the number of example questions whose actual sub-intention is cate_i but which are predicted as cate_j, N_(cate_i) denotes the number of example questions whose actual sub-intention is cate_i, and N̂_(cate_i) denotes the number of example questions predicted as cate_i.
Further, before the calculating the confusion degree of any two sub-intentions among all the sub-intentions according to the actual example questions and the predicted example questions corresponding to the sub-intentions, the method further includes:
determining that the harmonic mean of the accuracy and recall of each sub-intention is smaller than the first preset threshold, and that the harmonic mean of the overall accuracy and recall of all sub-intentions is smaller than the second preset threshold.
In a second aspect, the present application provides a text classification apparatus, comprising:
an acquisition module, configured to acquire the example questions corresponding to each sub-intention in a current scenario, wherein an example question is a standard phrasing of a class of questions;
a question expansion module, configured to expand the example questions corresponding to each sub-intention according to the questions input by users in the current scenario and a text matching algorithm, to obtain the expanded example questions corresponding to each sub-intention;
and a processing module, configured to merge all the sub-intentions according to a confusion matrix, the confusion degree between every two sub-intentions, and the expanded example questions corresponding to each sub-intention, to obtain N sub-intentions, wherein N is a positive integer.
Further, the question expansion module is configured to:
screen, from the questions input by users in the current scenario and through a text matching algorithm, the questions that match the example questions corresponding to each sub-intention;
and respectively take the screened matching questions as the expanded example questions corresponding to each sub-intention.
Further, the processing module is configured to perform the following operations:
S1, inputting the example questions corresponding to each sub-intention into the confusion matrix in turn, to obtain the sub-intention predicted for each example question, the harmonic mean of the accuracy and recall of each sub-intention, and the harmonic mean of the overall accuracy and recall of all sub-intentions;
S2, calculating the confusion degree of any two sub-intentions among all the sub-intentions according to the actual example questions and the predicted example questions corresponding to the sub-intentions;
S3, merging the M sub-intentions with the largest confusion degree to obtain a new sub-intention, wherein M is a preset positive integer, and the number of example questions corresponding to the new sub-intention is the sum of the numbers of example questions corresponding to the M merged sub-intentions;
continuing to perform S1-S3 on the new sub-intention and the sub-intentions other than the M merged sub-intentions, until the number of all sub-intentions is N, N being a preset value; or,
continuing to perform S1-S3 on the new sub-intention and the sub-intentions other than the M merged sub-intentions, until the harmonic mean of the accuracy and recall of each of the resulting N sub-intentions is greater than a first preset threshold and the harmonic mean of the overall accuracy and recall of the N sub-intentions is greater than a second preset threshold.
Further, the processing module is specifically configured to:
calculate, through the following calculation formula, the confusion degree of any two sub-intentions cate_i and cate_j among all the sub-intentions according to the actual example questions and the predicted example questions corresponding to the sub-intentions,
where N_(cate_i, cate_j) denotes the number of example questions whose actual sub-intention is cate_i but which are predicted as cate_j, N_(cate_i) denotes the number of example questions whose actual sub-intention is cate_i, and N̂_(cate_i) denotes the number of example questions predicted as cate_i.
Further, the processing module is further configured to:
before the confusion degree of any two sub-intentions among all the sub-intentions is calculated according to the actual example questions and the predicted example questions corresponding to the sub-intentions, determine that the harmonic mean of the accuracy and recall of each sub-intention is smaller than the first preset threshold and that the harmonic mean of the overall accuracy and recall of all sub-intentions is smaller than the second preset threshold.
According to the text classification method and device, the example questions corresponding to each sub-intention in the current scenario are obtained; the example questions corresponding to each sub-intention are expanded according to the questions input by users in the current scenario and a text matching algorithm, to obtain the expanded example questions corresponding to each sub-intention; and finally all sub-intentions are merged according to a confusion matrix, the confusion degree between every two sub-intentions, and the expanded example questions corresponding to each sub-intention, to obtain N sub-intentions. Several severely confused sub-intentions are merged into one category according to the confusion degree, so that the online corpora in the new scenario are classified automatically and reasonably; compared with manual classification, less time is consumed and accuracy is improved.
Drawings
In order to more clearly illustrate the technical solutions in the present application or the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show some embodiments of the present application, and that other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a schematic view of an application scenario of the present application;
FIG. 2 is a flowchart of an embodiment of a text classification method provided in the present application;
FIG. 3 is a flowchart of an embodiment of a text classification method provided in the present application;
fig. 4 is a schematic structural diagram of a text classification apparatus provided in the present application;
fig. 5 is a schematic diagram of a hardware structure of an electronic device provided in the present application.
Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the present application will be clearly and completely described below with reference to the drawings in the present application, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First, some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
1. Example question: a standard phrasing that represents a certain class of questions, also called a standard question. For example, two example questions under the sub-intention of modifying an invoice header are: (1) I want to modify the invoice header. (2) I want to modify the company name in the invoice header.
2. Sub-intention: a class of questions that the customer service robot can directly respond to in one scenario.
3. Online corpora: the questions input by a user when talking with the customer service robot.
In the existing text classification method, when the business is expanded, business personnel generally classify the online corpora according to business type, merging some sub-intentions (that is, sets of corpora under one class of scenario) into a larger category; all the online corpora of the new scenario are then labeled according to the classification result; after labeling is completed, the labeled data are analyzed with a confusion matrix and tuned so that the classification becomes more reasonable; and finally a classification model is obtained by training on the adjusted, classified data with a classification algorithm. In contrast, the present application obtains the example questions corresponding to each sub-intention in the current scenario and expands them according to the questions input by users in the current scenario and a text matching algorithm, so that the expanded example questions can basically cover all the sub-intentions of the current scenario; finally, all sub-intentions are merged according to a confusion matrix, the confusion degree between every two sub-intentions, and the expanded example questions corresponding to each sub-intention, to obtain N sub-intentions. The online corpora in the new scenario are thereby classified automatically and reasonably; compared with manual classification, less time is consumed and accuracy is improved. A specific implementation process of the text classification method according to the embodiments of the present application is described in detail below with reference to the drawings.
In the embodiments of the present application, the customer service robot is a machine device that automatically conducts conversations; it may specifically be a terminal device such as a smartphone, tablet computer, or notebook computer, or another machine device with a specific shape and functions.
The text classification method may be executed by a text classification apparatus. Fig. 1 is a schematic diagram of an application scenario of the text classification method. As shown in Fig. 1, online corpora (that is, questions input by users) are generated during conversations between the customer service robot and users. When a new scenario is added, the relevant operators first analyze the online corpora to obtain the example questions corresponding to each sub-intention in the current new scenario and input them into the text classification apparatus provided by the application. After obtaining these example questions, the apparatus merges the sub-intentions according to the text classification method provided by the application to obtain a classification result. Business personnel can then label all the online corpora of the new scenario according to the classification result; after labeling is completed, the labeled data are analyzed with a confusion matrix and tuned so that the classification becomes more reasonable; and finally a classification model is obtained by training on the adjusted, classified data with a classification algorithm.
Fig. 2 is a flowchart of an embodiment of a text classification method provided in the present application, and as shown in fig. 2, the method of the present embodiment may include:
S101, acquiring the example questions corresponding to each sub-intention in the current scenario, wherein an example question is a standard phrasing of a class of questions.
When a new scenario is added, the relevant operators may analyze the corpora to obtain the example questions corresponding to each sub-intention in the current new scenario. Because labor cost is limited, the number of example questions obtained this way is generally small, so S102 is then performed to obtain more questions of the current scenario.
S102, expanding the example questions corresponding to each sub-intention according to the questions input by users in the current scenario and a text matching algorithm, to obtain the expanded example questions corresponding to each sub-intention.
Specifically, S102 may be: screening, from the questions input by users in the current scenario and through a text matching algorithm, the questions that match the example questions corresponding to each sub-intention, and respectively taking the screened matching questions as the expanded example questions corresponding to each sub-intention. The text matching algorithm may be, for example, a similarity matching algorithm. After the expanded example questions corresponding to each sub-intention are obtained through the text matching algorithm, they may be cleaned manually to improve accuracy. For example, the number of example questions per sub-intention analyzed by the operators is usually 2-10, while the number of expanded example questions is 10-200, and the question categories of the expanded example questions can basically cover all question types of the new scenario.
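The patent does not name a particular text matching algorithm for the screening step above. The following is a minimal sketch using a simple Jaccard word-overlap similarity; the function names, the 0.5 threshold, and the data layout are illustrative assumptions, not part of the patent.

```python
def jaccard(a, b):
    """Word-overlap similarity; a stand-in for the unspecified text matching algorithm."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def expand_examples(sub_intent_examples, user_questions, threshold=0.5):
    """Attach each user-input question to every sub-intention one of whose
    example questions it matches, yielding the expanded example questions."""
    expanded = {intent: list(examples) for intent, examples in sub_intent_examples.items()}
    for q in user_questions:
        for intent, examples in sub_intent_examples.items():
            if any(jaccard(q, ex) >= threshold for ex in examples):
                expanded[intent].append(q)
    return expanded
```

In practice the expanded questions would then be cleaned manually, as the text notes, since a pure similarity screen inevitably admits some mismatches.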
S103, merging all the sub-intentions according to the confusion matrix, the confusion degree between every two sub-intentions, and the expanded example questions corresponding to each sub-intention, to obtain N sub-intentions, where N is a positive integer.
As an implementable manner, S103 may specifically include:
and S1, sequentially inputting the example questions corresponding to each sub-intention into a confusion matrix to obtain the sub-intentions predicted by each example question, the harmonic mean value of the accuracy and the recall ratio of each sub-intention and the harmonic mean value of the overall accuracy and the recall ratio of all the sub-intentions.
The confusion matrix (confusion_matrix) is a tool used in text classification to obtain the accuracy and recall of each class, so as to judge whether the classification is reasonable. The tool usually splits the data into several groups of equal (or nearly equal) size, then takes each group in turn as the test set with the remaining groups as the training set, so that prediction results are obtained for all the data.
For example, suppose 20 example questions correspond to sub-intention 1. The example questions corresponding to sub-intention 1 are input into the confusion matrix in turn, and the outputs are the sub-intention predicted for each example question, the harmonic mean of the accuracy and recall of each sub-intention, and the harmonic mean of the overall accuracy and recall of all sub-intentions. The sub-intention predicted for each example question is the sub-intention the question is predicted to belong to; for example, example question 1 corresponding to sub-intention 1 may be predicted as sub-intention 2, while example question 2 corresponding to sub-intention 1 may be predicted as sub-intention 1 itself, and so on.
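The harmonic mean of accuracy and recall referred to in S1 is the F1 score. Given the actual and predicted sub-intention for each example question (how the predictions are produced, e.g. by cross-validated classification, is left unspecified here as in the text), the per-sub-intention and overall values can be computed as below; taking the overall value as the macro average across sub-intentions is an assumption, since the patent does not specify the averaging.

```python
from collections import Counter

def f1_scores(actual, predicted):
    """Per-class F1 (harmonic mean of precision/accuracy and recall) and
    the macro-averaged overall F1, from parallel label lists."""
    labels = set(actual) | set(predicted)
    tp, fp, fn = Counter(), Counter(), Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1
        else:
            fp[p] += 1  # p wrongly predicted
            fn[a] += 1  # a missed
    per_class = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    overall = sum(per_class.values()) / len(per_class)
    return per_class, overall
```

Both outputs are then compared against the first and second preset thresholds in the judging step described later.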
S2, calculating the confusion degree of any two sub-intentions among all the sub-intentions according to the actual example questions and the predicted example questions corresponding to the sub-intentions.
As an implementable manner, S2 may specifically be:
calculating, through the following calculation formula, the confusion degree of any two sub-intentions cate_i and cate_j among all the sub-intentions according to the actual example questions and the predicted example questions corresponding to the sub-intentions,
where N_(cate_i, cate_j) denotes the number of example questions whose actual sub-intention is cate_i but which are predicted as cate_j, N_(cate_i) denotes the number of example questions whose actual sub-intention is cate_i, and N̂_(cate_i) denotes the number of example questions predicted as cate_i.
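The confusion-degree formula itself does not survive legibly in this text; only the counts it uses are described. The sketch below therefore implements one plausible symmetric reading — cross-predictions between the two sub-intentions divided by their combined actual sizes — and should be treated as an assumption, not the patent's exact formula.

```python
from collections import Counter

def confusion_degree(actual, predicted, ci, cj):
    """Confusion between sub-intentions ci and cj, from parallel label lists.
    NOTE: the exact formula is not reproduced in the source; this symmetric
    ratio of cross-predictions to combined class size is one plausible reading."""
    pair = Counter(zip(actual, predicted))      # (actual, predicted) pair counts
    n_actual = Counter(actual)                  # N_(cate) per sub-intention
    cross = pair[(ci, cj)] + pair[(cj, ci)]     # N_(ci,cj) + N_(cj,ci)
    total = n_actual[ci] + n_actual[cj]
    return cross / total if total else 0.0
```

Under this reading the score is symmetric in its two arguments and lies in [0, 1], with 0 meaning the two sub-intentions are never mistaken for each other.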
S3, merging the M sub-intentions with the largest confusion degree to obtain a new sub-intention, where M is a preset positive integer and the number of example questions corresponding to the new sub-intention is the sum of the numbers of example questions corresponding to the M merged sub-intentions.
Specifically, the M sub-intentions with the largest confusion degree are those most easily confused with one another; they are merged into a single sub-intention, also described as being merged into one larger category. The number of example questions corresponding to the merged sub-intention is the sum of the numbers of example questions corresponding to the M sub-intentions. M is a preset positive integer and can be set in advance according to the business type of the current scenario.
Then, in one implementable manner, S1-S3 continue to be performed on the new sub-intention and the sub-intentions other than the M merged sub-intentions, until the number of all sub-intentions is N, N being a preset value.
For example, suppose 50 sub-intentions are obtained, each corresponding to a plurality of example questions, and M is 3. After the first merge, the 3 most confused sub-intentions are merged into a new sub-intention, leaving 48 sub-intentions. S1-S3 are then performed again on the new sub-intention and the other 47 sub-intentions, and the merging continues; for example, if N is preset to 30, merging continues until 30 sub-intentions remain.
In another implementable manner, S1-S3 continue to be performed on the new sub-intention and the sub-intentions other than the M merged sub-intentions, until the harmonic mean of the accuracy and recall of each of the resulting N sub-intentions is greater than a first preset threshold and the harmonic mean of the overall accuracy and recall of the N sub-intentions is greater than a second preset threshold.
For example, suppose 60 sub-intentions are obtained, each corresponding to a plurality of example questions, and M is 5. After the first merge, the 5 most confused sub-intentions are merged into a new sub-intention, leaving 56 sub-intentions. S1-S3 are then performed again on the new sub-intention and the other 55 sub-intentions, and the merging continues until the harmonic mean of the accuracy and recall of each of the resulting N sub-intentions is greater than the first preset threshold and the harmonic mean of the overall accuracy and recall of the N sub-intentions is greater than the second preset threshold; for example, N = 35 may satisfy these conditions.
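The iteration of S1-S3 with the preset-N stopping condition can be sketched as a greedy merging loop. Here M is fixed at 2 for simplicity (the patent allows any preset M), and the pairwise confusion scores are supplied by a caller-provided function rather than recomputed from a confusion matrix at each round; both simplifications are assumptions.

```python
from itertools import combinations

def merge_until(examples_by_intent, n_target, confusion_fn):
    """Greedily merge the pair of sub-intentions with the highest confusion
    until only n_target sub-intentions remain. `confusion_fn(a, b)` scores
    a pair of sub-intention names; merged example-question lists add up."""
    intents = dict(examples_by_intent)  # name -> list of example questions
    while len(intents) > n_target:
        a, b = max(combinations(intents, 2), key=lambda p: confusion_fn(p[0], p[1]))
        # The new sub-intention's example questions are the union of the pair's.
        intents[a + "+" + b] = intents.pop(a) + intents.pop(b)
    return intents
```

A production version would re-run S1 (re-predict and re-score) after every merge, since merging changes the confusion structure of the remaining sub-intentions.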
In this embodiment, optionally, before S2, the method may further include:
and determining that the harmonic mean value of the accuracy rate and the recall rate of each sub-intention is smaller than a first preset threshold value, and the harmonic mean value of the overall accuracy rate and the recall rate of all sub-intentions is smaller than a second preset threshold value.
Specifically, after S1 is performed to obtain the harmonic mean of the accuracy and recall of each sub-intention and the harmonic mean of the overall accuracy and recall of all sub-intentions, it is first judged whether the former is smaller than the first preset threshold and whether the latter is smaller than the second preset threshold. If so, S2 is performed; if not, the classification of all the sub-intentions obtained in S102 is determined to be reasonable and no merging is needed.
According to the text classification method provided by this embodiment, the example questions corresponding to each sub-intention in the current scenario are obtained; they are expanded according to the questions input by users in the current scenario and a text matching algorithm, to obtain the expanded example questions corresponding to each sub-intention; and finally all sub-intentions are merged according to a confusion matrix, the confusion degree between every two sub-intentions, and the expanded example questions corresponding to each sub-intention, to obtain N sub-intentions. The most severely confused sub-intentions, that is, sub-intentions with similar phrasings, are merged into one category according to the confusion degree. The online corpora in the new scenario are thereby classified automatically and reasonably; compared with manual classification, less time is consumed and accuracy is improved.
The following describes the technical solution of the embodiment of the method shown in fig. 2 in detail by using a specific embodiment.
Fig. 3 is a flowchart of an embodiment of a text classification method provided in the present application, and as shown in fig. 3, the method of the present embodiment may include:
s201, obtaining example questions corresponding to each sub-intention in the current scene, wherein the example questions are standard questions of a class of questions.
S202, expanding the example problem corresponding to each sub-intention according to the problem input by the user in the current scene and a text matching algorithm to obtain the expanded example problem corresponding to each sub-intention.
S203, inputting the example questions corresponding to each sub-intention into a confusion matrix in sequence to obtain the sub-intention predicted by each example question, the harmonic mean value of the accuracy and the recall ratio of each sub-intention and the harmonic mean value of the overall accuracy and the recall ratio of all sub-intents.
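The per-intent and overall metrics produced in S203 can be sketched in Python. This is a minimal illustration, assuming predictions are available as parallel lists of actual and predicted sub-intent labels; the function name and data representation are illustrative, not from the patent. The overall figure is micro-averaged, which for single-label classification equals accuracy.

```python
from collections import Counter

def f1_scores(actual, predicted):
    """Per-class F1 (harmonic mean of precision and recall) and an
    overall micro-averaged F1, from parallel label lists."""
    tp = Counter()               # correctly predicted count per class
    actual_n = Counter(actual)   # actual count per class
    pred_n = Counter(predicted)  # predicted count per class
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1
    per_class = {}
    for c in actual_n:
        precision = tp[c] / pred_n[c] if pred_n[c] else 0.0
        recall = tp[c] / actual_n[c]
        per_class[c] = (2 * precision * recall / (precision + recall)
                        if precision + recall else 0.0)
    # Micro-averaged F1: for single-label classification this reduces
    # to overall accuracy (total correct / total examples).
    overall = sum(tp.values()) / len(actual)
    return per_class, overall
```

Each per-class value would then be compared against the first preset threshold, and the overall value against the second.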
S204, judging whether the harmonic mean of the accuracy and the recall of each sub-intent is smaller than a first preset threshold and whether the harmonic mean of the overall accuracy and recall of all sub-intents is smaller than a second preset threshold; if so, executing S205, and if not, determining that the classification of all sub-intents obtained in S202 is reasonable and requires no further merging.
And S205, calculating the confusion degree of any two sub-intents in all the sub-intents according to the actual example problem and the predicted example problem corresponding to the sub-intents.
Specifically, the confusion degree between any two sub-intents catei and catej among all the sub-intents is calculated, according to the actual example questions and the predicted example questions corresponding to the sub-intents, by the following calculation formula,
where N(catei, catej) is the number of example questions whose actual sub-intent is catei but which are predicted as catej, N(catei) is the number of example questions whose actual sub-intent is catei, and N̂(catei) is the number of example questions predicted as catei.
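The calculation formula itself does not survive in this text. A symmetric form consistent with the surrounding definitions of N(catei, catej) and N(catei) would be confusion(catei, catej) = N(catei, catej)/N(catei) + N(catej, catei)/N(catej); both this reconstruction and the sketch below are assumptions, not the patented formula.

```python
from collections import Counter

def confusion_degree(actual, predicted, cate_i, cate_j):
    """Assumed symmetric confusion degree between two sub-intents.

    Uses confusion(i, j) = N_ij / N_i + N_ji / N_j, where N_ij counts
    example questions whose actual sub-intent is i but whose predicted
    sub-intent is j. This is a reconstruction, not the published formula.
    """
    cross = Counter(zip(actual, predicted))  # N_{a,p} for every label pair
    n_actual = Counter(actual)               # N_i per sub-intent
    n_ij = cross[(cate_i, cate_j)]
    n_ji = cross[(cate_j, cate_i)]
    return n_ij / n_actual[cate_i] + n_ji / n_actual[cate_j]
```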
S206, combining the M sub-intents with the maximum confusion degree to obtain a new sub-intention, wherein M is a preset positive integer, and the number of example problems corresponding to the new sub-intention is the sum of the number of example problems corresponding to the M sub-intents with the maximum confusion degree.
S207, continuing to execute S203-S206 according to the new sub-intents and the sub-intents except for the M sub-intents until the number of all the sub-intents is N, wherein N is a preset value.
Or, continuing to execute S203-S206 according to the new sub-intents and the sub-intents except the M sub-intents until the harmonic mean value of the accuracy and the recall ratio of each of the N sub-intents is greater than a first preset threshold value, and the harmonic mean value of the overall accuracy and the recall ratio of the N sub-intents is greater than a second preset threshold value.
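The iteration of S203 to S206, with both stopping conditions, can be sketched as follows. Everything here is illustrative: `score_fn` and `confusion_fn` stand in for the confusion-matrix scoring and pairwise confusion calculation described above, and the merge simply concatenates example-question lists, matching the statement that the new sub-intent's question count is the sum of the merged counts.

```python
from itertools import combinations

def merge_until(intents, questions, score_fn, confusion_fn,
                n_target, f1_min, overall_min, m=2):
    """Iteratively merge the M most-confused sub-intents (S203-S206).

    intents      -- dict mapping sub-intent name -> list of example questions
    score_fn     -- returns (per_intent_f1, overall_f1) for current intents
    confusion_fn -- returns the confusion degree of a tuple of intent names
    Stops when only n_target sub-intents remain, or when every per-intent
    F1 exceeds f1_min and the overall F1 exceeds overall_min.
    """
    while len(intents) > n_target:
        per_f1, overall = score_fn(intents, questions)
        if all(v > f1_min for v in per_f1.values()) and overall > overall_min:
            break  # thresholds met: the classification is reasonable
        # Pick the group of m sub-intents with the highest confusion degree...
        group = max(combinations(intents, m),
                    key=lambda g: confusion_fn(g, intents, questions))
        # ...and merge them into one new sub-intent whose example-question
        # count is the sum of the merged sub-intents' counts.
        merged = [q for name in group for q in intents.pop(name)]
        intents["+".join(group)] = merged
    return intents
```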
After the N sub-intents are obtained in S207, business personnel may optionally make further adjustments to the N sub-intents in light of business requirements, finally obtaining a first-version definition of the classification, that is, sub-intents with similar phrasings are merged into one class.
Verification shows that after labeling is finished, the existing manual classification method needs more than three rounds of confusion analysis and optimization to meet the requirements (a first preset threshold of 0.7 and a second preset threshold of 0.85), whereas the method provided by this application meets them after only one or two rounds. Overall, actual statistics show that labor efficiency is improved by about 30%.
Fig. 4 is a schematic structural diagram of a text classification apparatus provided in the present application, and as shown in fig. 4, the apparatus of this embodiment may include: an acquisition module 11, a problem expansion module 12 and a processing module 13, wherein,
the obtaining module 11 is configured to obtain an example question corresponding to each sub-intention in a current scene, where the example question is a standard question of a class of questions;
the question expansion module 12 is configured to expand the example question corresponding to each sub-intention according to the question input by the user in the current scene and a text matching algorithm, so as to obtain an example question corresponding to each expanded sub-intention;
the processing module 13 is configured to merge all the sub-intents according to the confusion matrix, the confusion degree between every two sub-intents, and the extended example problem corresponding to each sub-intention, so as to obtain N sub-intents, where N is a positive integer.
Optionally, the problem expansion module 12 is configured to:
screening example problems matched with the example problems corresponding to each sub-intention from the problems input by the user in the current scene through a text matching algorithm;
and respectively taking the screened example questions matched with the example questions corresponding to each sub-intention as the example questions corresponding to each expanded sub-intention.
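The text matching algorithm used for the screening above is not specified. As a stand-in, the standard library's `difflib.SequenceMatcher` can screen user questions that resemble each sub-intent's example question; the threshold value and all names below are illustrative assumptions, not the patented matcher.

```python
from difflib import SequenceMatcher

def expand_examples(example_questions, user_questions, threshold=0.6):
    """Screen user questions matching each sub-intent's example question.

    example_questions -- dict: sub-intent name -> standard example question
    Returns dict: sub-intent name -> [example question] + matched questions.
    SequenceMatcher is a stand-in for the unspecified text matching
    algorithm; threshold is an illustrative parameter.
    """
    expanded = {}
    for intent, example in example_questions.items():
        matches = [q for q in user_questions
                   if SequenceMatcher(None, example, q).ratio() >= threshold]
        expanded[intent] = [example] + matches
    return expanded
```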
Optionally, the processing module 13 is configured to perform the following operations:
S1, sequentially inputting the example questions corresponding to each sub-intention into a confusion matrix to obtain the sub-intentions predicted by each example question, the harmonic average value of the accuracy and the recall ratio of each sub-intention and the harmonic average value of the overall accuracy and the recall ratio of all the sub-intentions;
S2, calculating the confusion degree of any two sub-intents in all the sub-intents according to the actual example problem and the predicted example problem corresponding to the sub-intents;
S3, combining the M sub-intents with the maximum confusion degree to obtain a new sub-intention, wherein M is a preset positive integer, and the number of example problems corresponding to the new sub-intention is the sum of the number of example problems corresponding to the M sub-intents with the maximum confusion degree;
continuing to execute S1-S3 according to the new sub-intents and the sub-intents except for the M sub-intents until the number of all the sub-intents is N, wherein N is a preset value; or,
S1-S3 are continuously executed according to the new sub-intents and the sub-intents except for the M sub-intents until the harmonic mean value of the accuracy and the recall ratio of each of the N sub-intents is larger than a first preset threshold value, and the harmonic mean value of the overall accuracy and the recall ratio of the N sub-intents is larger than a second preset threshold value.
Optionally, the processing module 13 is specifically configured to:
calculating, by the following calculation formula, the confusion degree between any two sub-intents catei and catej among all the sub-intents according to the actual example questions and the predicted example questions corresponding to the sub-intents,
where N(catei, catej) is the number of example questions whose actual sub-intent is catei but which are predicted as catej, N(catei) is the number of example questions whose actual sub-intent is catei, and N̂(catei) is the number of example questions predicted as catei.
Optionally, the processing module 13 is further configured to:
before calculating the confusion degree of any two sub-intentions in all the sub-intentions according to the actual example questions and the predicted example questions corresponding to the sub-intentions, determining that the harmonic mean value of the accuracy rate and the recall rate of each sub-intention is smaller than a first preset threshold value, and the harmonic mean value of the overall accuracy rate and the recall rate of all the sub-intentions is smaller than a second preset threshold value.
The apparatus provided in the embodiment of the present application may implement the method embodiment, and specific implementation principles and technical effects thereof may be referred to the method embodiment, which is not described herein again.
Fig. 5 is a schematic diagram of a hardware structure of an electronic device provided in the present application. As shown in fig. 5, the electronic device 20 of the present embodiment may include: a memory 21 and a processor 22;
a memory 21 for storing a computer program;
a processor 22 for executing the computer program stored in the memory to implement the text classification method in the above-described embodiments. Reference may be made in particular to the description relating to the method embodiments described above.
Alternatively, the memory 21 may be separate or integrated with the processor 22.
When the memory 21 is a device separate from the processor 22, the electronic device 20 may further include:
a bus 23 for connecting the memory 21 and the processor 22.
Optionally, this embodiment further includes: a communication interface 24, the communication interface 24 being connectable to the processor 22 via a bus 23. The processor 22 may control the communication interface 24 to implement the above-described receiving and transmitting functions of the electronic device 20.
The electronic device provided by this embodiment can be used to execute the above method, and its implementation manner and technical effect are similar, and this embodiment is not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit. The unit formed by the modules can be realized in a hardware form, and can also be realized in a form of hardware and a software functional unit.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a usb disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The computer-readable storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. A method of text classification, comprising:
acquiring an example question corresponding to each sub-intention in a current scene, wherein the example question is a standard question method of a class of questions;
expanding the example problem corresponding to each sub-intention according to the problem input by the user in the current scene and a text matching algorithm to obtain the expanded example problem corresponding to each sub-intention;
and according to the confusion matrix, the confusion degree between every two sub-intents and the example problem corresponding to each expanded sub-intention, combining all the sub-intents to obtain N sub-intents, wherein N is a positive integer.
2. The method according to claim 1, wherein the expanding the example question corresponding to each sub-intention according to the question and the text matching algorithm input by the user in the current scene to obtain the expanded example question corresponding to each sub-intention comprises:
screening example problems matched with the example problems corresponding to each sub-intention from the problems input by the user in the current scene through a text matching algorithm;
and respectively taking the screened example questions matched with the example questions corresponding to each sub-intention as the example questions corresponding to each expanded sub-intention.
3. The method according to claim 1, wherein the merging all sub-intents according to the confusion matrix, the confusion degree between every two sub-intents and the augmented example question corresponding to each sub-intention results in N sub-intents, comprising:
S1, sequentially inputting the example questions corresponding to each sub-intention into the confusion matrix to obtain the sub-intentions predicted by each example question, the harmonic average value of the accuracy and the recall ratio of each sub-intention and the harmonic average value of the overall accuracy and the recall ratio of all the sub-intentions;
S2, calculating the confusion degree of any two sub-intents in all the sub-intents according to the actual example problem and the predicted example problem corresponding to the sub-intents;
S3, combining the M sub-intents with the maximum confusion degree to obtain a new sub-intention, wherein M is a preset positive integer, and the number of example problems corresponding to the new sub-intention is the sum of the number of example problems corresponding to the M sub-intents with the maximum confusion degree;
continuing to execute the S1-S3 according to the new sub-intents and sub-intents except the M sub-intents until the number of all sub-intents is N, wherein N is a preset value; or,
continuing to execute the S1-S3 according to the new sub-intents and sub-intents except the M sub-intents until the obtained harmonic mean of the accuracy and the recall of each of the N sub-intents is greater than a first preset threshold, and the harmonic mean of the overall accuracy and the recall of the N sub-intents is greater than a second preset threshold.
4. The method according to claim 3, wherein the calculating the confusion of any two sub-intents in all the sub-intents according to the actual example problem and the predicted example problem corresponding to the sub-intents comprises:
calculating, by the following calculation formula, the confusion degree between any two sub-intents catei and catej among all the sub-intents according to the actual example questions and the predicted example questions corresponding to the sub-intents.
5. The method according to claim 3, wherein before calculating the confusion of any two sub-intents in all sub-intents according to the actual example problem and the predicted example problem corresponding to the sub-intents, the method further comprises:
and determining that the harmonic mean value of the accuracy rate and the recall rate of each sub-intention is smaller than the first preset threshold value, and the harmonic mean value of the overall accuracy rate and the recall rate of all sub-intentions is smaller than the second preset threshold value.
6. A text classification apparatus, comprising:
the acquisition module is used for acquiring an example question corresponding to each sub-intention in the current scene, wherein the example question is a standard question method of a class of questions;
the question expansion module is used for expanding the example question corresponding to each sub-intention according to the question input by the user in the current scene and a text matching algorithm to obtain the expanded example question corresponding to each sub-intention;
and the processing module is used for merging all the sub-intents according to the confusion matrix, the confusion degree between every two sub-intents and the example problem corresponding to each expanded sub-intention to obtain N sub-intents, wherein N is a positive integer.
7. The apparatus of claim 6, wherein the problem expansion module is configured to:
screening example problems matched with the example problems corresponding to each sub-intention from the problems input by the user in the current scene through a text matching algorithm;
and respectively taking the screened example questions matched with the example questions corresponding to each sub-intention as the example questions corresponding to each expanded sub-intention.
8. The apparatus of claim 6, wherein the processing module is configured to:
S1, sequentially inputting the example questions corresponding to each sub-intention into the confusion matrix to obtain the sub-intentions predicted by each example question, the harmonic average value of the accuracy and the recall ratio of each sub-intention and the harmonic average value of the overall accuracy and the recall ratio of all the sub-intentions;
S2, calculating the confusion degree of any two sub-intents in all the sub-intents according to the actual example problem and the predicted example problem corresponding to the sub-intents;
S3, combining the M sub-intents with the maximum confusion degree to obtain a new sub-intention, wherein M is a preset positive integer, and the number of example problems corresponding to the new sub-intention is the sum of the number of example problems corresponding to the M sub-intents with the maximum confusion degree;
continuing to execute the S1-S3 according to the new sub-intents and sub-intents except the M sub-intents until the number of all sub-intents is N, wherein N is a preset value; or,
continuing to execute the S1-S3 according to the new sub-intents and sub-intents except the M sub-intents until the obtained harmonic mean of the accuracy and the recall of each of the N sub-intents is greater than a first preset threshold, and the harmonic mean of the overall accuracy and the recall of the N sub-intents is greater than a second preset threshold.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method for text classification according to any one of claims 1 to 5.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the text classification method of any of claims 1-5 via execution of the executable instructions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911148366.2A CN112256844B (en) | 2019-11-21 | 2019-11-21 | Text classification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911148366.2A CN112256844B (en) | 2019-11-21 | 2019-11-21 | Text classification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112256844A true CN112256844A (en) | 2021-01-22 |
CN112256844B CN112256844B (en) | 2024-09-20 |
Family
ID=74223850
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911148366.2A Active CN112256844B (en) | 2019-11-21 | 2019-11-21 | Text classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112256844B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114330562A (en) * | 2021-12-31 | 2022-04-12 | 大箴(杭州)科技有限公司 | Small sample refinement classification and multi-classification model construction method |
CN117235270A (en) * | 2023-11-16 | 2023-12-15 | 中国人民解放军国防科技大学 | Text classification method and device based on belief confusion matrix and computer equipment |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103996312A (en) * | 2014-05-23 | 2014-08-20 | 北京理工大学 | Pilotless automobile control system with social behavior interaction function |
CN107292338A (en) * | 2017-06-14 | 2017-10-24 | 大连海事大学 | A kind of feature selection approach based on sample characteristics Distribution value degree of aliasing |
US20180114142A1 (en) * | 2016-10-26 | 2018-04-26 | Swiss Reinsurance Company Ltd. | Data extraction engine for structured, semi-structured and unstructured data with automated labeling and classification of data patterns or data elements therein, and corresponding method thereof |
CN109902285A (en) * | 2019-01-08 | 2019-06-18 | 平安科技(深圳)有限公司 | Corpus classification method, device, computer equipment and storage medium |
CN109934293A (en) * | 2019-03-15 | 2019-06-25 | 苏州大学 | Image-recognizing method, device, medium and obscure perception convolutional neural networks |
CN109948664A (en) * | 2019-02-28 | 2019-06-28 | 深圳智链物联科技有限公司 | Charge mode recognition methods, device, terminal device and storage medium |
US20190260694A1 (en) * | 2018-02-16 | 2019-08-22 | Mz Ip Holdings, Llc | System and method for chat community question answering |
CN110413746A (en) * | 2019-06-25 | 2019-11-05 | 阿里巴巴集团控股有限公司 | The method and device of intention assessment is carried out to customer problem |
- 2019-11-21: application CN201911148366.2A filed; granted as patent CN112256844B (status: Active)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103996312A (en) * | 2014-05-23 | 2014-08-20 | 北京理工大学 | Pilotless automobile control system with social behavior interaction function |
US20180114142A1 (en) * | 2016-10-26 | 2018-04-26 | Swiss Reinsurance Company Ltd. | Data extraction engine for structured, semi-structured and unstructured data with automated labeling and classification of data patterns or data elements therein, and corresponding method thereof |
CN107292338A (en) * | 2017-06-14 | 2017-10-24 | 大连海事大学 | A kind of feature selection approach based on sample characteristics Distribution value degree of aliasing |
US20190260694A1 (en) * | 2018-02-16 | 2019-08-22 | Mz Ip Holdings, Llc | System and method for chat community question answering |
CN109902285A (en) * | 2019-01-08 | 2019-06-18 | 平安科技(深圳)有限公司 | Corpus classification method, device, computer equipment and storage medium |
CN109948664A (en) * | 2019-02-28 | 2019-06-28 | 深圳智链物联科技有限公司 | Charge mode recognition methods, device, terminal device and storage medium |
CN109934293A (en) * | 2019-03-15 | 2019-06-25 | 苏州大学 | Image-recognizing method, device, medium and obscure perception convolutional neural networks |
CN110413746A (en) * | 2019-06-25 | 2019-11-05 | 阿里巴巴集团控股有限公司 | The method and device of intention assessment is carried out to customer problem |
Non-Patent Citations (1)
Title |
---|
YANG Chunni; FENG Chaosheng: "A multi-intent recognition model combining syntactic features and convolutional neural networks", Computer Applications, no. 07 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114330562A (en) * | 2021-12-31 | 2022-04-12 | 大箴(杭州)科技有限公司 | Small sample refinement classification and multi-classification model construction method |
CN114330562B (en) * | 2021-12-31 | 2023-09-26 | 大箴(杭州)科技有限公司 | Small sample refinement classification and multi-classification model construction method |
CN117235270A (en) * | 2023-11-16 | 2023-12-15 | 中国人民解放军国防科技大学 | Text classification method and device based on belief confusion matrix and computer equipment |
CN117235270B (en) * | 2023-11-16 | 2024-02-02 | 中国人民解放军国防科技大学 | Text classification method and device based on belief confusion matrix and computer equipment |
Also Published As
Publication number | Publication date |
---|---|
CN112256844B (en) | 2024-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9087108B2 (en) | Determination of category information using multiple stages | |
EP3413221A1 (en) | Risk assessment method and system | |
CN112733042B (en) | Recommendation information generation method, related device and computer program product | |
CN107423613A (en) | The method, apparatus and server of device-fingerprint are determined according to similarity | |
CN111931809A (en) | Data processing method and device, storage medium and electronic equipment | |
WO2020253506A1 (en) | Contract content extraction method and apparatus, and computer device and storage medium | |
CN111159404B (en) | Text classification method and device | |
CN109918645B (en) | Method and device for deeply analyzing text, computer equipment and storage medium | |
CN105630931A (en) | Document classification method and device | |
CN112256844A (en) | Text classification method and device | |
CN108153719A (en) | Merge the method and apparatus of electrical form | |
CN116090867A (en) | Index rule generation method and device, electronic equipment and storage medium | |
CN115730605B (en) | Data analysis method based on multidimensional information | |
CN110968664A (en) | Document retrieval method, device, equipment and medium | |
CN116340831B (en) | Information classification method and device, electronic equipment and storage medium | |
CN106651408B (en) | Data analysis method and device | |
CN111782541A (en) | Test case generation method, device, equipment and computer readable storage medium | |
CN115062132A (en) | Recognition model training method and device, and intention type recognition method and device | |
CN111625619A (en) | Query omission method and device, computer readable medium and electronic equipment | |
CN114860608A (en) | Scene construction based system automation testing method, device, equipment and medium | |
CN115099934A (en) | High-latency customer identification method, electronic equipment and storage medium | |
CN114257523A (en) | User perception prediction method, system, device and computer storage medium | |
CN111694962A (en) | Data processing method and device | |
CN112016308A (en) | Language identification method | |
CN111625458A (en) | Service system testing method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||