CN108920694B - Short text multi-label classification method and device - Google Patents

Short text multi-label classification method and device

Info

Publication number: CN108920694B
Application number: CN201810769761.1A
Authority: CN (China)
Prior art keywords: classification, probability, probability set, classified, labels
Other languages: Chinese (zh)
Other versions: CN108920694A
Inventors: 熊文灿, 廖翔, 周继烈, 张昊, 刘铭, 张骏, 单培, 李士勇, 张瑞飞, 李广刚
Current assignee: China Science and Technology (Beijing) Co., Ltd.
Original assignee: Dingfu Intelligent Technology Co Ltd
Application filed by Dingfu Intelligent Technology Co Ltd; publication of application CN108920694A; application granted and published as CN108920694B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract

The application provides a short text multi-label classification method and device. The method includes: obtaining a first forward classification probability set by using single classification models corresponding to the classification labels; screening the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set; judging whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if so, determining the classification label corresponding to that forward classification probability as a first classification category to which the short text to be classified belongs; if not, determining the classification label corresponding to that forward classification probability as a residual classification label; and classifying the short text to be classified by using a multi-classification model to obtain a second classification category set. The method and device first perform initial classification processing on the short text to be classified and then secondary classification processing, so that multi-label classification of short texts is achieved while the complexity of data processing is reduced and the data processing speed is increased.

Description

Short text multi-label classification method and device
Technical Field
The present application relates to the field of text classification, and in particular, to a short text multi-label classification method and apparatus.
Background
With the rapid development of the internet in recent years, various information interaction platforms generate a large number of short texts. Short texts touch all areas of daily life and have gradually become a frequently used and widely accepted mode of communication; for example, report information in the public security field, e-commerce comments, and intelligent question-and-answer systems are all sources of large volumes of short text. How to mine effective information from massive short texts has been a subject of extensive research in recent years. Text classification is an effective method for text mining, but because short texts are short and their lexical features are sparse, traditional long-text classification methods are not applicable.
At present, Convolutional Neural Network (CNN) technology is widely applied in the field of Natural Language Processing (NLP). A convolutional neural network comprises several layers, namely convolutional layers, pooling layers, a fully-connected layer and a classification layer: the convolutional and pooling layers extract feature words from the short text to be classified, the fully-connected layer integrates the feature words, and finally the classification layer classifies the short text. However, since the classifier used by the classification layer is a single-class classifier, it cannot meet the requirement of multi-label classification of the short text to be classified.
Disclosure of Invention
The application provides a short text multi-label classification method and device, which aim to solve the problem that multi-classification of short texts to be classified cannot be realized because a classifier used by a classification layer is a single-class classifier.
In a first aspect, the present application provides a short text multi-label classification method, including:
acquiring short texts to be classified;
obtaining a first forward classification probability set by using a single classification model corresponding to the classification labels, wherein the first forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different classification labels and the corresponding classification labels, which are obtained by using the single classification model;
screening the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set;
judging whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if the forward classification probability is greater than or equal to the first preset classification threshold, determining a classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs;
if the forward classification probability is smaller than the first preset classification threshold, determining the classification label corresponding to the forward classification probability as a residual classification label;
classifying the short texts to be classified by utilizing a multi-classification model to obtain a second forward classification probability set, wherein the multi-classification model is composed of two classification models corresponding to residual classification labels, and the second forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different residual classification labels and corresponding residual classification labels, which are obtained by utilizing the multi-classification model;
screening the forward classification probabilities in the second forward classification probability set to obtain a second classification category set, wherein the second classification category set is composed of the classification labels corresponding to the forward classification probabilities retained after the second forward classification probability set is screened;
and merging the first classification category and the second classification category to obtain a classification result.
In a second aspect, the present application provides a short text multi-label classification apparatus, including:
the first acquisition module is used for acquiring short texts to be classified;
the single classification model calculation module is used for obtaining a first forward classification probability set by using a single classification model corresponding to the classification labels, wherein the first forward classification probability set consists of forward classification probabilities of the short texts to be classified in different classification labels and the corresponding classification labels, which are obtained by using the single classification model;
the first screening module is used for screening the forward classification probability in the first forward classification probability set to obtain a first target forward classification probability set;
a judging module, configured to judge whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if the forward classification probability is greater than or equal to the first preset classification threshold, determine a classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs;
if the forward classification probability is smaller than the first preset classification threshold, determining the classification label corresponding to the forward classification probability as a residual classification label;
the multi-classification model calculation module is used for classifying the short texts to be classified by utilizing a multi-classification model to obtain a second forward classification probability set, wherein the multi-classification model is composed of two classification models corresponding to residual classification labels, and the second forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different residual classification labels and corresponding residual classification labels, which are obtained by utilizing the multi-classification model;
the second screening module is used for screening the forward classification probabilities in the second forward classification probability set to obtain a second classification category set, wherein the second classification category set is composed of the classification labels corresponding to the forward classification probabilities retained after the second forward classification probability set is screened;
and the output module is used for merging the first classification category and the second classification category to obtain a classification result.
According to the technical scheme above, the present application provides a short text multi-label classification method and device. The method first performs initial classification processing on the short text to be classified by using single classification models, and then performs secondary classification processing on it by using a multi-classification model composed of two-classification models, so that multi-label classification of short texts can be realized while the complexity of data processing is reduced and the data processing speed is increased.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below; it will be apparent to those skilled in the art that other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a short text multi-label classification method according to an embodiment of the present application;
fig. 2 is a flowchart of a short text multi-label classification method according to another embodiment of the present application;
fig. 3 is a flowchart of a short text multi-label classification method according to another embodiment of the present application;
fig. 4 is a schematic structural diagram of a short text multi-label classification apparatus provided in the present application;
FIG. 5 is a schematic structural diagram of a first screening module;
FIG. 6 is a schematic structural diagram of a second screening module;
FIG. 7 is a schematic structural diagram of a second acquisition module;
FIG. 8 is a schematic view of another structure of the second screening module.
Detailed Description
Referring to fig. 1, in a first aspect, an embodiment of the present application provides a short text multi-label classification method, including the following steps:
step 101: and acquiring short texts to be classified.
A short text is a text that is short compared with a long text, and the short text to be classified is a short text that needs to be classified. For example, it may be report information in the public security field, status information in an instant messaging application such as a statement or status log in a QQ community, a web page fragment, a short message, or a microblog.
Step 102: and obtaining a first forward classification probability set by using a single classification model corresponding to the classification labels, wherein the first forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different classification labels and the corresponding classification labels, which are obtained by using the single classification model.
The classification labels can be set by workers according to actual classification requirements; for example, in the public security field, workers can set the classification labels to "burglary", "robbery", and so on. Each classification label has a corresponding single classification model, i.e., the number of single classification models is the same as the number of classification labels. The single classification models can adopt existing single-class classifiers, which this embodiment does not limit. The forward classification probability is the probability that the short text to be classified belongs to the category of the classification label.
The forward classification probabilities of the short text to be classified for the different classification labels are calculated by using the single classification models, and the first forward classification probability set is obtained from these forward classification probabilities and their corresponding classification labels.
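As an illustration of step 102, the following is a minimal sketch in Python, assuming each trained single classification model exposes a predict_proba-style method that returns the forward classification probability for its label; the class, function and label names and the probability values are illustrative, not taken from the patent.

```python
# Sketch of step 102 (illustrative names): one single classification model per
# classification label; each returns the forward classification probability
# that the short text to be classified belongs to its label.

class StubModel:
    """Stands in for a trained single-class classifier."""
    def __init__(self, p):
        self.p = p

    def predict_proba(self, text):
        return self.p  # forward classification probability for this label

def build_first_probability_set(text, single_models):
    """single_models: dict mapping classification label -> single classification model."""
    return {label: model.predict_proba(text) for label, model in single_models.items()}

single_models = {
    "burglary": StubModel(0.7),
    "robbery": StubModel(0.2),
}
first_set = build_first_probability_set("short text to be classified", single_models)
# {'burglary': 0.7, 'robbery': 0.2}
```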
Step 103: and screening the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set.
The forward classification probabilities in the first forward classification probability set are screened: the forward classification probabilities that do not meet a preset condition, together with their corresponding classification labels, are removed, and only those that meet the preset condition are retained. This reduces the amount of data to be processed and increases the data processing speed.
Step 104: judging whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and executing a step 105 if the forward classification probability is greater than or equal to the first preset classification threshold; if the forward classification probability is less than the first preset classification threshold, step 106 will be performed.
The first preset classification threshold may be preset by a worker. It may be the same value for all classification labels, or different values for different classification labels; for example, the first preset classification thresholds of the labels "burglary" and "robbery" may both be 0.6, or the threshold of "burglary" may be 0.8 while that of "robbery" is 0.6.
Step 105: and determining the classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs.
Step 106: and determining the classification label corresponding to the forward classification probability as a residual classification label.
The forward classification probabilities in the first target forward classification probability set are screened again: for each forward classification probability that meets the screening condition, the corresponding classification label is determined as a first classification category to which the short text to be classified belongs, and the other forward classification probabilities in the first target forward classification probability set are passed on for subsequent processing. This completes the initial classification of the short text to be classified, reduces the complexity of subsequent data processing, and increases the data processing speed.
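To make steps 104-106 concrete, here is a minimal sketch of the threshold split, assuming the first preset classification thresholds are held per label in a dict; the names and values are illustrative and reuse the "burglary"/"robbery" labels mentioned above.

```python
# Sketch of steps 104-106: labels whose forward classification probability meets
# the first preset classification threshold become first classification
# categories (step 105); the rest become residual classification labels (step 106).

def split_by_first_threshold(target_set, thresholds):
    first_categories, remaining_labels = [], []
    for label, prob in target_set.items():
        if prob >= thresholds[label]:
            first_categories.append(label)   # step 105
        else:
            remaining_labels.append(label)   # step 106
    return first_categories, remaining_labels

target_set = {"burglary": 0.7, "robbery": 0.2}
thresholds = {"burglary": 0.6, "robbery": 0.6}  # per-label thresholds may differ
print(split_by_first_threshold(target_set, thresholds))
# (['burglary'], ['robbery'])
```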
Step 107: and classifying the short texts to be classified by utilizing a multi-classification model to obtain a second forward classification probability set, wherein the multi-classification model is composed of two classification models corresponding to the residual classification labels, and the second forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different residual classification labels, which are obtained by utilizing the multi-classification model to calculate.
Step 108: and screening the forward classification probabilities in the second forward classification probability set to obtain a second classification probability set, wherein the second classification probability set is composed of classification labels corresponding to the forward classification probabilities obtained after the second forward classification probability set is screened and corresponding residual classification labels.
Step 109: and merging the first classification category and the second classification category to obtain a classification result.
According to the technical scheme above, the short text multi-label classification method provided by this embodiment first performs initial classification processing on the short text to be classified by using single classification models, and then performs secondary classification processing on it by using a multi-classification model composed of two-classification models, so that multi-label classification of short texts can be realized while the complexity of data processing is reduced and the data processing speed is increased.
Referring to fig. 2, another embodiment of the present application provides a short text multi-label classification method, including the following steps:
step 201: and acquiring short texts to be classified.
A short text is a text that is short compared with a long text, and the short text to be classified is a short text that needs to be classified. For example, it may be report information in the public security field, status information in an instant messaging application such as a statement or status log in a QQ community, a web page fragment, a short message, or a microblog.
Step 202: and obtaining a first forward classification probability set by using a single classification model corresponding to the classification labels, wherein the first forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different classification labels and the corresponding classification labels, which are obtained by using the single classification model.
Each classification label has a corresponding single classification model, namely the number of the single classification models is the same as that of the classification labels. The forward classification probability is the probability that the short text to be classified belongs to the class of the classification label. The single classification model may adopt an existing single classifier, and this embodiment is not limited.
The forward classification probabilities of the short text to be classified for the different classification labels are calculated by using the single classification models, and the first forward classification probability set is obtained from these forward classification probabilities and their corresponding classification labels.
Step 203: and ordering the forward classification probabilities in the first forward classification probability set from high to low.
Step 204: and extracting the N forward classification probabilities and the corresponding classification labels before sequencing to obtain a first target forward classification probability set.
For example, after the short text to be classified "came home and found the house burgled; the door lock is intact and the safe in the house has been pried open" is processed by the single classification models, a first forward classification probability set is obtained: { door kicking 0.048, safe prying 0.11, door not pried 0.8, windowsill footprint 0.026, window glass breaking 0.07, wall hole digging 0.0003 }. The set is then sorted from high to low, giving { door not pried 0.8, safe prying 0.11, window glass breaking 0.07, door kicking 0.048, windowsill footprint 0.026, wall hole digging 0.0003 }, and the first N forward classification probabilities and their corresponding classification labels are extracted to obtain the first target forward classification probability set. N can be set according to actual requirements; if N is 4, the first target forward classification probability set is { door not pried 0.8, safe prying 0.11, window glass breaking 0.07, door kicking 0.048 }.
The forward classification probabilities in the first forward classification probability set are thus screened: those that do not meet the preset condition are removed and only those that meet it are retained, which reduces the amount of data to be processed and increases the data processing speed.
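A minimal sketch of the sort-and-truncate screening of steps 203-204 follows, reusing the worked example above; holding the probability set in a Python dict is an assumption of the sketch.

```python
# Sketch of steps 203-204: sort the first forward classification probability set
# from high to low and keep the first N entries together with their labels.

first_set = {
    "door kicking": 0.048, "safe prying": 0.11, "door not pried": 0.8,
    "windowsill footprint": 0.026, "window glass breaking": 0.07,
    "wall hole digging": 0.0003,
}
N = 4
first_target_set = dict(sorted(first_set.items(), key=lambda kv: kv[1], reverse=True)[:N])
print(first_target_set)
# {'door not pried': 0.8, 'safe prying': 0.11, 'window glass breaking': 0.07, 'door kicking': 0.048}
```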
Step 205: judging whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and executing step 206 if the forward classification probability is greater than or equal to the first preset classification threshold; if the forward classification probability is smaller than the first preset classification threshold, step 207 is executed.
The first preset classification threshold can be preset by a worker. The first preset classification thresholds corresponding to different classification labels can be the same value or different values; for example, the first preset classification thresholds of the labels "burglary" and "robbery" may both be 0.6, or the threshold of "burglary" may be 0.8 while that of "robbery" is 0.6. The forward classification probability corresponding to each classification label is compared with the first preset classification threshold corresponding to that label.
Step 206: and determining the classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs.
Step 207: and determining the classification label corresponding to the forward classification probability as a residual classification label.
For example, continuing the example of step 204, assume the first preset classification threshold corresponding to each classification label is 0.8. In the first target forward classification probability set, "door not pried 0.8" is equal to the first preset classification threshold, so "door not pried" is determined as a classification category of the short text to be classified. Meanwhile, the classification labels whose forward classification probabilities are smaller than the first preset classification threshold are determined as the remaining classification labels, namely "safe prying", "window glass breaking" and "door kicking".
Assuming instead that the first preset classification thresholds corresponding to the classification labels are all 0.85, no forward classification probability in the first target forward classification probability set is greater than or equal to the first preset classification threshold, and the classification labels corresponding to all the forward classification probabilities in the set become residual classification labels, namely "door not pried", "safe prying", "window glass breaking" and "door kicking".
The forward classification probabilities in the first target forward classification probability set are screened again: for each forward classification probability that meets the screening condition, the corresponding classification label is determined as a first classification category to which the short text to be classified belongs, and the other forward classification probabilities in the first target forward classification probability set are passed on for subsequent processing. This completes the initial classification of the short text to be classified, reduces the complexity of subsequent data processing, and increases the data processing speed.
Step 208: and classifying the short texts to be classified by utilizing a multi-classification model to obtain a second forward classification probability set, wherein the multi-classification model is composed of two classification models corresponding to residual classification labels, and the second forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different residual classification labels and corresponding residual classification labels, which are obtained by utilizing the multi-classification model.
The multi-classification model is composed of at least one two-classification model; the number of two-classification models is the same as the number of remaining classification labels, in one-to-one correspondence, and the two-classification models corresponding to different remaining classification labels may be the same or different. Each two-classification model calculates the forward classification probability that the short text to be classified belongs to its corresponding classification label. Existing two-classification models can be adopted, for example a Logistic regression model, which is an existing efficient two-class classifier; several two-classification models together form the multi-classification model, which is used to calculate the forward classification probabilities of the short text to be classified for the different remaining classification labels. For example, if the forward classification probabilities of the short text to be classified "came home and found the house burgled; the door lock is intact and the safe in the house has been pried open" for the remaining classification labels "door not pried", "safe prying", "window glass breaking" and "door kicking" are 0.9, 0.8, 0.45 and 0.6 respectively, the second forward classification probability set is { door not pried 0.9, safe prying 0.8, window glass breaking 0.45, door kicking 0.6 }.
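The patent leaves the concrete two-classification model open and names Logistic regression as one option. Below is a minimal sketch of such a multi-classification model built as a bank of binary Logistic regression classifiers; using scikit-learn's TfidfVectorizer and LogisticRegression is an assumption of the sketch, and a real system would need tokenization suited to the source language and training data in which each label has both positive and negative samples.

```python
# Sketch of step 208: one binary classifier per residual classification label;
# together they form the multi-classification model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_multi_classification_model(samples, remaining_labels):
    """samples: list of (text, label_set) training pairs."""
    texts = [text for text, _ in samples]
    models = {}
    for label in remaining_labels:
        y = [1 if label in labels else 0 for _, labels in samples]
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
        clf.fit(texts, y)  # binary target: does the sample carry this label?
        models[label] = clf
    return models

def second_forward_probability_set(text, models):
    # predict_proba returns [[P(negative), P(positive)]]; [0][1] is the forward
    # classification probability that the text belongs to the label.
    return {label: clf.predict_proba([text])[0][1] for label, clf in models.items()}
```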
Step 209: and judging whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, executing step 210.
Similarly, the second preset classification threshold may be preset by a worker, and the second preset classification thresholds corresponding to different remaining classification tags may be the same value or different values.
Step 210: and determining the classification label corresponding to the forward classification probability as a second classification category to which the short text to be classified belongs.
Step 211: and collecting all the second classification categories to obtain a second classification category set.
Assuming the second preset classification threshold is 0.5, the forward classification probabilities in the second forward classification probability set that are greater than or equal to the second preset classification threshold are "door not pried 0.9", "safe prying 0.8" and "door kicking 0.6". "Door not pried", "safe prying" and "door kicking" are therefore determined as second classification categories, giving the second classification category set { "door not pried", "safe prying", "door kicking" }.
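Expressed as code, the second screening of steps 209-211 is a simple filter; the sketch below reuses the example values and the 0.5 threshold from the text.

```python
# Sketch of steps 209-211: keep the labels whose forward classification
# probability is at least the second preset classification threshold.
second_set = {"door not pried": 0.9, "safe prying": 0.8,
              "window glass breaking": 0.45, "door kicking": 0.6}
second_threshold = 0.5
second_category_set = {label for label, p in second_set.items() if p >= second_threshold}
# second_category_set == {'door not pried', 'safe prying', 'door kicking'}
```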
Step 212: and merging the first classification category and the second classification category to obtain a classification result.
In order to further optimize the classification result and make it more accurate, another embodiment of the present application introduces classification mutual exclusion label pairs. Specifically, referring to fig. 3, another embodiment of the present application provides a short text multi-label classification method, including the following steps:
step 301: and obtaining the training sample marked with the classification label.
The classification labels can be set by the staff according to actual classification requirements; for example, in the public security field, the staff can set the classification labels to "door not pried", "safe prying", "robbery", and so on. The staff label the training samples with classification labels one by one; for example, for the training sample "came home and found the house burgled; the door lock is intact and the safe in the house has been pried open", the staff label the sample with the classification labels "door not pried" and "safe prying".
Step 302: and calculating to obtain a classification mutual exclusion probability matrix according to the classification labels of the training samples, wherein the classification mutual exclusion probability matrix is composed of the probability that each classification label and one classification label in other classification labels appear in the same training sample.
The classification mutual exclusion probability matrix reflects the likelihood that any two classification labels appear in the same training sample at the same time. For example, if all the classification labels are "door not pried", "safe prying", "door kicking" and "window glass breaking", a classification mutual exclusion matrix such as the one shown in the following table can be obtained from the classification labels of each training sample.
[Table: classification mutual exclusion probability matrix over the labels "door not pried", "safe prying", "door kicking" and "window glass breaking"; rendered only as an image (Figure BDA0001729956070000071) in the original publication.]
The probability values in the classification mutual exclusion probability matrix are calculated by the following formula:
K = N1/N2, where K is the probability value, N1 is the number of training samples labeled with both classification labels, and N2 is the total number of training samples labeled with either of the two classification labels. The classification mutual exclusion matrix generated by this calculation is checked by workers to prevent calculation errors caused by the selection of the training samples.
For example, to calculate the probability that "door not pried" and "door kicking" appear in the same training sample, the number of training samples labeled with both classification labels is counted first, then the number of training samples labeled with either of the two classification labels is counted, and the probability of their simultaneous occurrence is calculated with the above formula.
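A minimal sketch of steps 301-302 and the K = N1/N2 formula follows, assuming each training sample is represented simply by its set of classification labels; the sample data is illustrative.

```python
# Sketch of steps 301-302: for every pair of classification labels, K = N1/N2,
# where N1 counts training samples labeled with both labels and N2 counts
# training samples labeled with at least one of the two.
from itertools import combinations

def mutual_exclusion_matrix(label_sets):
    labels = sorted(set().union(*label_sets))
    matrix = {}
    for a, b in combinations(labels, 2):
        n1 = sum(1 for s in label_sets if a in s and b in s)
        n2 = sum(1 for s in label_sets if a in s or b in s)
        matrix[(a, b)] = n1 / n2 if n2 else 0.0
    return matrix

label_sets = [  # illustrative labeled training samples
    {"door not pried", "safe prying"},
    {"door kicking", "window glass breaking"},
    {"door not pried", "window glass breaking"},
]
K = mutual_exclusion_matrix(label_sets)
print(K[("door kicking", "door not pried")])
# 0.0 -> below a preset mutual exclusion threshold of 0.4, so the two labels
# form a classification mutual exclusion label pair (steps 303-304)
```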
Step 303: and sequentially judging whether the probability of each classification label in the mutual exclusion probability matrix and one classification label in other classification labels appearing in the same training sample is smaller than a preset mutual exclusion threshold, if so, executing step 304.
Step 304: and determining the two classification labels corresponding to the probability as a classification mutual exclusion label pair.
The preset mutual exclusion threshold can be preset by a worker; two classification labels whose probability is smaller than the preset mutual exclusion threshold are determined as a classification mutual exclusion label pair. Taking the classification mutual exclusion matrix above as an example, if the preset mutual exclusion threshold is 0.4, it can be seen that "door kicking" and "door not pried" form a classification mutual exclusion label pair.
It should be noted that classification mutual exclusion label pairs can also be set directly by the staff according to the actual situation. For example, the classification labels "door not pried" and "door prying" can be directly set as a classification mutual exclusion label pair according to common sense.
The staff can store the classification mutual exclusion label pairs obtained through steps 301-304, together with any directly set pairs, in a database for later use.
Step 305: and acquiring short texts to be classified.
A short text is a text that is short compared with a long text, and the short text to be classified is a short text that needs to be classified. For example, it may be report information in the public security field, status information in an instant messaging application such as a statement or status log in a QQ community, a web page fragment, a short message, or a microblog.
Step 306: and obtaining a first forward classification probability set by using a single classification model corresponding to the classification labels, wherein the first forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different classification labels and the corresponding classification labels, which are obtained by using the single classification model.
Each classification label has a corresponding single classification model, namely the number of the single classification models is the same as that of the classification labels. The forward classification probability is the probability that the short text to be classified belongs to the class of the classification label. The single classification model may adopt an existing single classifier, and this embodiment is not limited.
The forward classification probabilities of the short text to be classified for the different classification labels are calculated by using the single classification models, and the first forward classification probability set is obtained from these forward classification probabilities and their corresponding classification labels.
Step 307: and ordering the forward classification probabilities in the first forward classification probability set from high to low.
Step 308: and extracting the N forward classification probabilities and the corresponding classification labels before sequencing to obtain a first target forward classification probability set.
For example, after the short text to be classified "came home and found the house burgled; the door lock is intact and the safe in the house has been pried open" is processed by the single classification models, a first forward classification probability set is obtained: { door kicking 0.048, safe prying 0.11, door not pried 0.8, windowsill footprint 0.026, window glass breaking 0.07, wall hole digging 0.0003 }. The set is then sorted from high to low, giving { door not pried 0.8, safe prying 0.11, window glass breaking 0.07, door kicking 0.048, windowsill footprint 0.026, wall hole digging 0.0003 }, and the first N forward classification probabilities and their corresponding classification labels are extracted to obtain the first target forward classification probability set; if N is 4, this set is { door not pried 0.8, safe prying 0.11, window glass breaking 0.07, door kicking 0.048 }.
The forward classification probabilities in the first forward classification probability set are thus screened: those that do not meet the preset condition are removed and only those that meet it are retained, which reduces the amount of data to be processed and increases the data processing speed.
Step 309: judging whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if the forward classification probability is greater than or equal to the first preset classification threshold, executing step 310; if the forward classification probability is smaller than the first preset classification threshold, step 311 is executed.
The first preset classification threshold can be preset by a worker. The first preset classification thresholds corresponding to different classification labels can be the same value or different values; for example, the first preset classification thresholds of the labels "burglary" and "robbery" may both be 0.6, or the threshold of "burglary" may be 0.8 while that of "robbery" is 0.6. The forward classification probability corresponding to each classification label is compared with the first preset classification threshold corresponding to that label.
Step 310: and determining the classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs.
Step 311: and determining the classification label corresponding to the forward classification probability as a residual classification label.
For example, continuing the example of step 308, assume the first preset classification threshold corresponding to each classification label is 0.8. In the first target forward classification probability set, "door not pried 0.8" is equal to the first preset classification threshold, so "door not pried" is determined as a classification category of the short text to be classified. Meanwhile, the classification labels whose forward classification probabilities are smaller than the first preset classification threshold are determined as the remaining classification labels, namely "safe prying", "window glass breaking" and "door kicking".
Assuming instead that the first preset classification thresholds corresponding to the classification labels are all 0.85, no forward classification probability in the first target forward classification probability set is greater than or equal to the first preset classification threshold, and the classification labels corresponding to all the forward classification probabilities in the set become residual classification labels, namely "door not pried", "safe prying", "window glass breaking" and "door kicking".
The forward classification probabilities in the first target forward classification probability set are screened again: for each forward classification probability that meets the screening condition, the corresponding classification label is determined as a first classification category to which the short text to be classified belongs, and the other forward classification probabilities in the first target forward classification probability set are passed on for subsequent processing. This completes the initial classification of the short text to be classified, reduces the complexity of subsequent data processing, and increases the data processing speed.
Step 312: and classifying the short texts to be classified by utilizing a multi-classification model to obtain a second forward classification probability set, wherein the multi-classification model is composed of two classification models corresponding to residual classification labels, and the second forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different residual classification labels and corresponding residual classification labels, which are obtained by utilizing the multi-classification model.
The multi-classification model is composed of at least one two-classification model; the number of two-classification models is the same as the number of remaining classification labels, in one-to-one correspondence, and the two-classification models corresponding to different remaining classification labels may be the same or different. Each two-classification model calculates the forward classification probability that the short text to be classified belongs to its corresponding classification label. Existing two-classification models can be adopted, for example a Logistic regression model, which is an existing efficient two-class classifier; several two-classification models together form the multi-classification model, which is used to calculate the forward classification probabilities of the short text to be classified for the different remaining classification labels. For example, if the forward classification probabilities of the short text to be classified "came home and found the house burgled; the door lock is intact and the safe in the house has been pried open" for the remaining classification labels "door not pried", "safe prying", "window glass breaking" and "door kicking" are 0.9, 0.8, 0.45 and 0.6 respectively, the second forward classification probability set is { door not pried 0.9, safe prying 0.8, window glass breaking 0.45, door kicking 0.6 }.
Step 313: and determining whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, executing step 314.
Similarly, the second preset classification threshold may be preset by a worker, and the second preset classification thresholds corresponding to different remaining classification tags may be the same value or different values.
Step 314: and extracting the forward classification probability and the corresponding classification label to obtain a second target forward classification probability set.
Assuming the second preset classification threshold corresponding to each classification label is 0.5, the forward classification probabilities in the second forward classification probability set that are greater than or equal to the second preset classification threshold are extracted, giving the second target forward classification probability set { door not pried 0.9, safe prying 0.8, door kicking 0.6 }.
Step 315: judging, by using the classification mutual exclusion label pairs, whether a classification mutual exclusion label pair exists in the second target forward classification probability set; if one exists, executing step 316; if none exists, executing step 317.
Step 316: and removing the classification label corresponding to the lower forward classification probability in the classification mutual exclusion label pair, and determining the classification label corresponding to the residual forward classification probability in the second target forward classification probability set as the second classification category to which the short text to be classified belongs.
As can be seen from the example of steps 302 to 304, "door not pried" and "door kicking" form a classification mutual exclusion label pair. Therefore the classification label with the lower forward classification probability in the pair, i.e., "door kicking", is removed and "door not pried" is kept, finally giving the second classification categories "door not pried" and "safe prying".
Step 317: and determining the classification labels corresponding to all the forward classification probabilities in the second target forward classification probability set as a second classification category to which the short text to be classified belongs.
If no classification mutual exclusion label pair exists in the second target forward classification probability set, the classification labels corresponding to all the forward classification probabilities in the set are determined as second classification categories to which the short text to be classified belongs.
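A minimal sketch of steps 315-317 follows, assuming the classification mutual exclusion label pairs are available as tuples; the data reuses the worked example above.

```python
# Sketch of steps 315-317: for each classification mutual exclusion label pair
# present in the second target forward classification probability set, remove
# the label with the lower forward classification probability (step 316); the
# labels that remain are the second classification categories (step 317).

def apply_mutual_exclusion(target_set, exclusion_pairs):
    result = dict(target_set)
    for a, b in exclusion_pairs:
        if a in result and b in result:
            del result[a if result[a] < result[b] else b]
    return list(result)

second_target = {"door not pried": 0.9, "safe prying": 0.8, "door kicking": 0.6}
pairs = [("door not pried", "door kicking")]
print(apply_mutual_exclusion(second_target, pairs))
# ['door not pried', 'safe prying']
```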
Step 318: and collecting all the second classification categories to obtain a second classification category set.
Step 319: and merging the first classification category and the second classification category to obtain a classification result.
According to the technical scheme above, the short text multi-label classification method provided by this embodiment first performs initial classification processing on the short text to be classified by using single classification models, and then performs secondary classification processing on it by using a multi-classification model composed of two-classification models, so that multi-label classification of short texts can be realized while the complexity of data processing is reduced and the data processing speed is increased.
Referring to fig. 4, in a second aspect, the present application provides a short text multi-label classification apparatus, including:
a first obtaining module 401, configured to obtain a short text to be classified;
a single classification model calculation module 402, configured to obtain a first forward classification probability set by using a single classification model corresponding to a classification tag, where the first forward classification probability set is composed of forward classification probabilities of the short text to be classified in different classification tags and corresponding classification tags, which are obtained by using the single classification model;
a first screening module 403, configured to screen the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set;
a determining module 404, configured to determine whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if the forward classification probability is greater than or equal to the first preset classification threshold, determine a classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs;
if the forward classification probability is smaller than the first preset classification threshold, determining the classification label corresponding to the forward classification probability as a residual classification label;
a multi-classification model calculation module 405, configured to classify the short text to be classified by using a multi-classification model to obtain a second forward classification probability set, where the multi-classification model is composed of two classification models corresponding to remaining classification labels, and the second forward classification probability set is composed of forward classification probabilities of the short text to be classified in different remaining classification labels, which are calculated by using the multi-classification model;
a second screening module 406, configured to screen the forward classification probabilities in the second forward classification probability set to obtain a second classification category set, where the second classification category set is composed of the classification labels corresponding to the forward classification probabilities retained after the second forward classification probability set is screened;
and the output module 407 is configured to merge the first classification category and the second classification category to obtain a classification result.
According to the technical scheme above, the short text multi-label classification device provided by this embodiment performs initial classification processing on the short text to be classified by using single classification models, and then performs secondary classification processing on it by using a multi-classification model composed of two-classification models, so that multi-label classification of short texts can be realized while the complexity of data processing is reduced and the data processing speed is increased.
Further, referring to fig. 5, the first screening module 403 includes:
a sorting unit 501, configured to sort the forward classification probabilities in the first forward classification probability set from high to low;
an extracting unit 502, configured to extract the N forward classification probabilities before the sorting and the corresponding classification labels thereof, to obtain a first target forward classification probability set.
Further, referring to fig. 6, the second filtering module 406 includes:
a first determining unit 601, configured to determine whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, determine a classification label corresponding to the forward classification probability as a second classification category to which the short text to be classified belongs;
a first output unit 602, configured to aggregate all the second classification categories to obtain a second classification category set.
Further, referring to fig. 7 and 8, the apparatus further includes:
a second obtaining module 701, configured to obtain classification mutual exclusion label pairs;
the second filtering module 406 includes:
a screening unit 801, configured to determine whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, extract the forward classification probability and a corresponding classification label to obtain a second target forward classification probability set;
a second determining unit 802, configured to determine, by using the classification mutual exclusion label pairs, whether a classification mutual exclusion label pair exists in the second target forward classification probability set; if one exists, remove the classification label corresponding to the lower forward classification probability in the pair, and determine the classification labels corresponding to the remaining forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs;
if no classification mutual exclusion label pair exists in the second target forward classification probability set, determine the classification labels corresponding to all the forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs;
a second output unit 803, configured to aggregate all the second classification categories to obtain a second classification category set.
Further, referring to fig. 7, the second obtaining module 701 includes:
an obtaining unit 7011, configured to obtain a training sample labeled with a classification label;
a classification mutual exclusion probability matrix calculation unit 7012, configured to calculate a classification mutual exclusion probability matrix from the classification labels of the training samples, where the classification mutual exclusion probability matrix is formed by the probabilities that each classification label appears in the same training sample as each of the other classification labels;
a classification mutual exclusion tag pair determining unit 7013, configured to sequentially determine whether a probability that each classification tag in the mutual exclusion probability matrix and one classification tag in the other classification tags appear in the same training sample is smaller than a preset mutual exclusion threshold, and if so, determine two classification tags corresponding to the probability as a classification mutual exclusion tag pair.
According to the technical scheme above, the present application provides a short text multi-label classification method and device. The method first performs initial classification processing on the short text to be classified by using single classification models, and then performs secondary classification processing on it by using a multi-classification model composed of two-classification models, so that multi-label classification of short texts can be realized while the complexity of data processing is reduced and the data processing speed is increased.
Those skilled in the art will clearly understand that the techniques in the embodiments of the present application may be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solutions in the embodiments of the present application may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present application.
The embodiments of the present disclosure are described in a progressive manner; the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on its differences from the others. In particular, since the apparatus embodiment is substantially similar to the method embodiment, its description is relatively brief, and the relevant parts can be found in the description of the method embodiment.

Claims (8)

1. A short text multi-label classification method is characterized by comprising the following steps:
acquiring a short text to be classified;
obtaining a first forward classification probability set by using single classification models corresponding to classification labels, wherein the first forward classification probability set is composed of the forward classification probabilities of the short text to be classified under different classification labels, obtained by using the single classification models, together with the corresponding classification labels;
screening the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set;
judging whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if the forward classification probability is greater than or equal to the first preset classification threshold, determining the classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs;
if the forward classification probability is smaller than the first preset classification threshold, determining the classification label corresponding to the forward classification probability as a remaining classification label;
classifying the short text to be classified by using a multi-classification model to obtain a second forward classification probability set, wherein the multi-classification model is composed of binary classification models corresponding to the remaining classification labels, and the second forward classification probability set is composed of the forward classification probabilities of the short text to be classified under the different remaining classification labels, obtained by using the multi-classification model, together with the corresponding remaining classification labels;
screening the forward classification probabilities in the second forward classification probability set to obtain a second classification category set, wherein the second classification category set is composed of the classification labels corresponding to the forward classification probabilities retained after the second forward classification probability set is screened;
and combining the first classification categories and the second classification categories to obtain a classification result;
wherein the screening the forward classification probabilities in the second forward classification probability set to obtain the second classification category set comprises:
judging whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, determining the classification label corresponding to the forward classification probability as a second classification category to which the short text to be classified belongs;
and aggregating all the second classification categories to obtain the second classification category set.
2. The method of claim 1, wherein the screening the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set comprises:
sorting the forward classification probabilities in the first forward classification probability set from high to low;
and extracting the top N forward classification probabilities and their corresponding classification labels to obtain the first target forward classification probability set.
3. The method of claim 1, wherein, before the obtaining the short text to be classified, the method further comprises:
acquiring classification mutual exclusion label pairs;
and the screening the forward classification probabilities in the second forward classification probability set to obtain the second classification category set comprises:
judging whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, extracting the forward classification probability and the corresponding classification label to obtain a second target forward classification probability set;
judging, by using the classification mutual exclusion label pairs, whether a classification mutual exclusion label pair exists in the second target forward classification probability set; if such a pair exists, removing the classification label corresponding to the smaller forward classification probability in the pair, and determining the classification labels corresponding to the remaining forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs;
if no classification mutual exclusion label pair exists in the second target forward classification probability set, determining the classification labels corresponding to all the forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs;
and aggregating all the second classification categories to obtain the second classification category set.
4. The method of claim 3, wherein the acquiring the classification mutual exclusion label pairs comprises:
acquiring training samples labeled with classification labels;
calculating a classification mutual exclusion probability matrix from the classification labels of the training samples, wherein the classification mutual exclusion probability matrix is composed of the probabilities that each classification label appears in the same training sample as each of the other classification labels;
and judging, for each probability in the classification mutual exclusion probability matrix, whether the probability that the corresponding pair of classification labels appears in the same training sample is smaller than a preset mutual exclusion threshold, and if so, determining the two classification labels corresponding to that probability as a classification mutual exclusion label pair.
5. A short text multi-label classification device, comprising:
the first acquisition module is used for acquiring a short text to be classified;
the single classification model calculation module is used for obtaining a first forward classification probability set by using single classification models corresponding to classification labels, wherein the first forward classification probability set is composed of the forward classification probabilities of the short text to be classified under different classification labels, obtained by using the single classification models, together with the corresponding classification labels;
the first screening module is used for screening the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set;
a judging module, configured to judge whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if the forward classification probability is greater than or equal to the first preset classification threshold, determine the classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs;
if the forward classification probability is smaller than the first preset classification threshold, determine the classification label corresponding to the forward classification probability as a remaining classification label;
the multi-classification model calculation module is used for classifying the short text to be classified by using a multi-classification model to obtain a second forward classification probability set, wherein the multi-classification model is composed of binary classification models corresponding to the remaining classification labels, and the second forward classification probability set is composed of the forward classification probabilities of the short text to be classified under the different remaining classification labels, obtained by using the multi-classification model, together with the corresponding remaining classification labels;
the second screening module is used for screening the forward classification probabilities in the second forward classification probability set to obtain a second classification category set, wherein the second classification category set is composed of the classification labels corresponding to the forward classification probabilities retained after the second forward classification probability set is screened;
the output module is used for combining the first classification categories and the second classification categories to obtain a classification result;
the second screening module includes:
a first judging unit, configured to judge whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, determine the classification label corresponding to the forward classification probability as a second classification category to which the short text to be classified belongs;
and a first output unit, configured to aggregate all the second classification categories to obtain the second classification category set.
6. The apparatus of claim 5, wherein the first screening module comprises:
a sorting unit, configured to sort the forward classification probabilities in the first forward classification probability set from high to low;
and an extraction unit, configured to extract the top N forward classification probabilities and their corresponding classification labels to obtain the first target forward classification probability set.
7. The apparatus of claim 5, wherein the apparatus further comprises:
the second obtaining module is used for acquiring classification mutual exclusion label pairs;
the second screening module includes:
a screening unit, configured to judge whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, extract the forward classification probability and the corresponding classification label to obtain a second target forward classification probability set;
a second judging unit, configured to judge, by using the classification mutual exclusion label pairs, whether a classification mutual exclusion label pair exists in the second target forward classification probability set; if such a pair exists, remove the classification label corresponding to the smaller forward classification probability in the pair, and determine the classification labels corresponding to the remaining forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs;
if no classification mutual exclusion label pair exists in the second target forward classification probability set, determine the classification labels corresponding to all the forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs;
and a second output unit, configured to aggregate all the second classification categories to obtain the second classification category set.
8. The apparatus of claim 7, wherein the second obtaining module comprises:
an acquisition unit, configured to acquire training samples labeled with classification labels;
a classification mutual exclusion probability matrix calculation unit, configured to calculate a classification mutual exclusion probability matrix from the classification labels of the training samples, wherein the classification mutual exclusion probability matrix is composed of the probabilities that each classification label appears in the same training sample as each of the other classification labels;
and a classification mutual exclusion label pair determining unit, configured to judge, for each probability in the classification mutual exclusion probability matrix, whether the probability that the corresponding pair of classification labels appears in the same training sample is smaller than a preset mutual exclusion threshold, and if so, determine the two classification labels corresponding to that probability as a classification mutual exclusion label pair.
CN201810769761.1A 2018-07-13 2018-07-13 Short text multi-label classification method and device Active CN108920694B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810769761.1A CN108920694B (en) 2018-07-13 2018-07-13 Short text multi-label classification method and device


Publications (2)

Publication Number Publication Date
CN108920694A CN108920694A (en) 2018-11-30
CN108920694B (en) 2020-08-28

Family

ID=64412717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810769761.1A Active CN108920694B (en) 2018-07-13 2018-07-13 Short text multi-label classification method and device

Country Status (1)

Country Link
CN (1) CN108920694B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948160B (en) * 2019-03-15 2023-04-18 智者四海(北京)技术有限公司 Short text classification method and device
CN110458245B (en) * 2019-08-20 2021-11-02 图谱未来(南京)人工智能研究院有限公司 Multi-label classification model training method, data processing method and device
CN112883189A (en) * 2021-01-26 2021-06-01 浙江香侬慧语科技有限责任公司 Text classification method and device based on label description, storage medium and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101059796A (en) * 2006-04-19 2007-10-24 中国科学院自动化研究所 Two-stage combined file classification method based on probability subject
CN105447031A (en) * 2014-08-28 2016-03-30 百度在线网络技术(北京)有限公司 Training sample labeling method and device
CN106156120A (en) * 2015-04-07 2016-11-23 阿里巴巴集团控股有限公司 The method and apparatus that character string is classified
CN106897428A (en) * 2017-02-27 2017-06-27 腾讯科技(深圳)有限公司 Text classification feature extracting method, file classification method and device
CN107273500A (en) * 2017-06-16 2017-10-20 中国电子技术标准化研究院 Text classifier generation method, file classification method, device and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10229117B2 (en) * 2015-06-19 2019-03-12 Gordon V. Cormack Systems and methods for conducting a highly autonomous technology-assisted review classification




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190906

Address after: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant after: China Science and Technology (Beijing) Co., Ltd.

Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building block A Room 601

Applicant before: Beijing Shenzhou Taiyue Software Co., Ltd.

CB02 Change of applicant information

Address after: 19 / F-B, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Applicant after: Dingfu Intelligent Technology Co., Ltd

Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.

GR01 Patent grant