CN108920694B - Short text multi-label classification method and device - Google Patents

Short text multi-label classification method and device

Info

Publication number: CN108920694B
Application number: CN201810769761.1A
Authority: CN (China)
Prior art keywords: classification, probability, probability set, classified, labels
Other languages: Chinese (zh)
Other versions: CN108920694A
Inventors: 熊文灿, 廖翔, 周继烈, 张昊, 刘铭, 张骏, 单培, 李士勇, 张瑞飞, 李广刚
Current assignee: China Science and Technology (Beijing) Co., Ltd.
Original assignee: Dingfu Intelligent Technology Co Ltd
Application filed by Dingfu Intelligent Technology Co Ltd; publication of application CN108920694A; application granted and published as CN108920694B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract

The application provides a short text multi-label classification method and device. The method includes: obtaining a first forward classification probability set by using single classification models corresponding to the classification labels; screening the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set; judging whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if so, determining the classification label corresponding to that forward classification probability as a first classification category to which the short text to be classified belongs; if not, determining the classification label corresponding to that forward classification probability as a residual classification label; and classifying the short text to be classified by using a multi-classification model to obtain a second classification category set. The method and device first perform initial classification processing on the short text to be classified and then secondary classification processing, so that multi-label classification of short texts is achieved while the complexity of data processing is reduced and the data processing speed is increased.

Description

Short text multi-label classification method and device
Technical Field
The present application relates to the field of text classification, and in particular, to a short text multi-label classification method and apparatus.
Background
With the rapid development of the internet in recent years, various information interaction platforms generate a large number of short texts. Short texts touch all areas of daily life and have gradually become a frequently used and widely accepted mode of communication; for example, report information in the public security field, e-commerce comments, and intelligent question-and-answer systems are all sources of large volumes of short text. How to mine effective information from massive short texts has been a subject of extensive research in recent years. Text classification is an effective method for text mining, but because short texts are short and their lexical features are sparse, traditional long-text classification methods are not applicable.
At present, Convolutional Neural Network (CNN) technology is widely applied in the field of Natural Language Processing (NLP). A convolutional neural network comprises several layers, namely convolutional layers, pooling layers, a fully-connected layer and a classification layer: the convolutional and pooling layers extract feature words from the short text to be classified, the fully-connected layer integrates the feature words, and finally the classification layer classifies the short text. However, since the classifier used by the classification layer is a single-class classifier, it cannot meet the requirement of multi-label classification of the short text to be classified.
Disclosure of Invention
The application provides a short text multi-label classification method and device, which aim to solve the problem that multi-classification of short texts to be classified cannot be realized because a classifier used by a classification layer is a single-class classifier.
In a first aspect, the present application provides a short text multi-label classification method, including:
acquiring short texts to be classified;
obtaining a first forward classification probability set by using a single classification model corresponding to the classification labels, wherein the first forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different classification labels and the corresponding classification labels, which are obtained by using the single classification model;
screening the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set;
judging whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if the forward classification probability is greater than or equal to the first preset classification threshold, determining a classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs;
if the forward classification probability is smaller than the first preset classification threshold, determining the classification label corresponding to the forward classification probability as a residual classification label;
classifying the short texts to be classified by utilizing a multi-classification model to obtain a second forward classification probability set, wherein the multi-classification model is composed of two classification models corresponding to residual classification labels, and the second forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different residual classification labels and corresponding residual classification labels, which are obtained by utilizing the multi-classification model;
screening the forward classification probabilities in the second forward classification probability set to obtain a second classification category set, wherein the second classification category set is composed of the classification labels corresponding to the forward classification probabilities retained after the second forward classification probability set is screened;
and merging the first classification category and the second classification category to obtain a classification result.
In a second aspect, the present application provides a short text multi-label classification apparatus, including:
the first acquisition module is used for acquiring short texts to be classified;
the single classification model calculation module is used for obtaining a first forward classification probability set by using a single classification model corresponding to the classification labels, wherein the first forward classification probability set consists of forward classification probabilities of the short texts to be classified in different classification labels and the corresponding classification labels, which are obtained by using the single classification model;
the first screening module is used for screening the forward classification probability in the first forward classification probability set to obtain a first target forward classification probability set;
a judging module, configured to judge whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if the forward classification probability is greater than or equal to the first preset classification threshold, determine a classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs;
if the forward classification probability is smaller than the first preset classification threshold, determining the classification label corresponding to the forward classification probability as a residual classification label;
the multi-classification model calculation module is used for classifying the short texts to be classified by utilizing a multi-classification model to obtain a second forward classification probability set, wherein the multi-classification model is composed of two classification models corresponding to residual classification labels, and the second forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different residual classification labels and corresponding residual classification labels, which are obtained by utilizing the multi-classification model;
the second screening module is used for screening the forward classification probabilities in the second forward classification probability set to obtain a second classification category set, wherein the second classification category set is composed of the classification labels corresponding to the forward classification probabilities retained after the second forward classification probability set is screened;
and the output module is used for merging the first classification category and the second classification category to obtain a classification result.
According to the technical scheme above, the present application provides a short text multi-label classification method and device. The method first performs initial classification processing on the short text to be classified by using single classification models, and then performs secondary classification processing on it by using a multi-classification model composed of two-classification models, so that multi-label classification of short texts can be realized while the complexity of data processing is reduced and the data processing speed is increased.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below; it will be apparent to those skilled in the art that other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a short text multi-label classification method according to an embodiment of the present application;
fig. 2 is a flowchart of a short text multi-label classification method according to another embodiment of the present application;
fig. 3 is a flowchart of a short text multi-label classification method according to another embodiment of the present application;
fig. 4 is a schematic structural diagram of a short text multi-label classification apparatus provided in the present application;
FIG. 5 is a schematic structural diagram of a first screening module;
FIG. 6 is a schematic structural diagram of a second screening module;
FIG. 7 is a schematic structural diagram of a second acquisition module;
FIG. 8 is a schematic view of another structure of the second screening module.
Detailed Description
Referring to fig. 1, in a first aspect, an embodiment of the present application provides a short text multi-label classification method, including the following steps:
step 101: and acquiring short texts to be classified.
A short text is a text that is short compared with a long text, and the short text to be classified is a short text that needs to be classified. For example, it may be report information in the public security field, status information in an instant messaging application such as a statement or status log in a QQ community, a web page fragment, a short message, or a microblog.
Step 102: and obtaining a first forward classification probability set by using a single classification model corresponding to the classification labels, wherein the first forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different classification labels and the corresponding classification labels, which are obtained by using the single classification model.
The classification labels can be set by workers according to actual classification requirements; for example, in the public security field, workers can set the classification labels to "burglary", "robbery", and so on. Each classification label has a corresponding single classification model, i.e., the number of single classification models is the same as the number of classification labels. The single classification models can adopt existing single-class classifiers, which this embodiment does not limit. The forward classification probability is the probability that the short text to be classified belongs to the category of the classification label.
The forward classification probabilities of the short text to be classified for the different classification labels are calculated by using the single classification models, and the first forward classification probability set is obtained from these forward classification probabilities and their corresponding classification labels.
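As an illustration of step 102, the following is a minimal sketch in Python, assuming each trained single classification model exposes a predict_proba-style method that returns the forward classification probability for its label; the class, function and label names and the probability values are illustrative, not taken from the patent.

```python
# Sketch of step 102 (illustrative names): one single classification model per
# classification label; each returns the forward classification probability
# that the short text to be classified belongs to its label.

class StubModel:
    """Stands in for a trained single-class classifier."""
    def __init__(self, p):
        self.p = p

    def predict_proba(self, text):
        return self.p  # forward classification probability for this label

def build_first_probability_set(text, single_models):
    """single_models: dict mapping classification label -> single classification model."""
    return {label: model.predict_proba(text) for label, model in single_models.items()}

single_models = {
    "burglary": StubModel(0.7),
    "robbery": StubModel(0.2),
}
first_set = build_first_probability_set("short text to be classified", single_models)
# {'burglary': 0.7, 'robbery': 0.2}
```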
Step 103: and screening the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set.
The forward classification probabilities in the first forward classification probability set are screened: the forward classification probabilities that do not meet a preset condition, together with their corresponding classification labels, are removed, and only those that meet the preset condition are retained. This reduces the amount of data to be processed and increases the data processing speed.
Step 104: judging whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and executing a step 105 if the forward classification probability is greater than or equal to the first preset classification threshold; if the forward classification probability is less than the first preset classification threshold, step 106 will be performed.
The first preset classification threshold may be preset by a worker. It may be the same value for all classification labels, or different values for different classification labels; for example, the first preset classification thresholds of the labels "burglary" and "robbery" may both be 0.6, or the threshold of "burglary" may be 0.8 while that of "robbery" is 0.6.
Step 105: and determining the classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs.
Step 106: and determining the classification label corresponding to the forward classification probability as a residual classification label.
The forward classification probabilities in the first target forward classification probability set are screened again: for each forward classification probability that meets the screening condition, the corresponding classification label is determined as a first classification category to which the short text to be classified belongs, and the other forward classification probabilities in the first target forward classification probability set are passed on for subsequent processing. This completes the initial classification of the short text to be classified, reduces the complexity of subsequent data processing, and increases the data processing speed.
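To make steps 104-106 concrete, here is a minimal sketch of the threshold split, assuming the first preset classification thresholds are held per label in a dict; the names and values are illustrative and reuse the "burglary"/"robbery" labels mentioned above.

```python
# Sketch of steps 104-106: labels whose forward classification probability meets
# the first preset classification threshold become first classification
# categories (step 105); the rest become residual classification labels (step 106).

def split_by_first_threshold(target_set, thresholds):
    first_categories, remaining_labels = [], []
    for label, prob in target_set.items():
        if prob >= thresholds[label]:
            first_categories.append(label)   # step 105
        else:
            remaining_labels.append(label)   # step 106
    return first_categories, remaining_labels

target_set = {"burglary": 0.7, "robbery": 0.2}
thresholds = {"burglary": 0.6, "robbery": 0.6}  # per-label thresholds may differ
print(split_by_first_threshold(target_set, thresholds))
# (['burglary'], ['robbery'])
```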
Step 107: and classifying the short texts to be classified by utilizing a multi-classification model to obtain a second forward classification probability set, wherein the multi-classification model is composed of two classification models corresponding to the residual classification labels, and the second forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different residual classification labels, which are obtained by utilizing the multi-classification model to calculate.
Step 108: and screening the forward classification probabilities in the second forward classification probability set to obtain a second classification probability set, wherein the second classification probability set is composed of classification labels corresponding to the forward classification probabilities obtained after the second forward classification probability set is screened and corresponding residual classification labels.
Step 109: and merging the first classification category and the second classification category to obtain a classification result.
According to the technical scheme above, the short text multi-label classification method provided by this embodiment first performs initial classification processing on the short text to be classified by using single classification models, and then performs secondary classification processing on it by using a multi-classification model composed of two-classification models, so that multi-label classification of short texts can be realized while the complexity of data processing is reduced and the data processing speed is increased.
Referring to fig. 2, another embodiment of the present application provides a short text multi-label classification method, including the following steps:
step 201: and acquiring short texts to be classified.
A short text is a text that is short compared with a long text, and the short text to be classified is a short text that needs to be classified. For example, it may be report information in the public security field, status information in an instant messaging application such as a statement or status log in a QQ community, a web page fragment, a short message, or a microblog.
Step 202: and obtaining a first forward classification probability set by using a single classification model corresponding to the classification labels, wherein the first forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different classification labels and the corresponding classification labels, which are obtained by using the single classification model.
Each classification label has a corresponding single classification model, namely the number of the single classification models is the same as that of the classification labels. The forward classification probability is the probability that the short text to be classified belongs to the class of the classification label. The single classification model may adopt an existing single classifier, and this embodiment is not limited.
The forward classification probabilities of the short text to be classified for the different classification labels are calculated by using the single classification models, and the first forward classification probability set is obtained from these forward classification probabilities and their corresponding classification labels.
Step 203: and ordering the forward classification probabilities in the first forward classification probability set from high to low.
Step 204: and extracting the N forward classification probabilities and the corresponding classification labels before sequencing to obtain a first target forward classification probability set.
For example, after the short text to be classified "came home and found the house burgled; the door lock is intact and the safe in the house has been pried open" is processed by the single classification models, a first forward classification probability set is obtained: { door kicking 0.048, safe prying 0.11, door not pried 0.8, windowsill footprint 0.026, window glass breaking 0.07, wall hole digging 0.0003 }. The set is then sorted from high to low, giving { door not pried 0.8, safe prying 0.11, window glass breaking 0.07, door kicking 0.048, windowsill footprint 0.026, wall hole digging 0.0003 }, and the first N forward classification probabilities and their corresponding classification labels are extracted to obtain the first target forward classification probability set. N can be set according to actual requirements; if N is 4, the first target forward classification probability set is { door not pried 0.8, safe prying 0.11, window glass breaking 0.07, door kicking 0.048 }.
The forward classification probabilities in the first forward classification probability set are thus screened: those that do not meet the preset condition are removed and only those that meet it are retained, which reduces the amount of data to be processed and increases the data processing speed.
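A minimal sketch of the sort-and-truncate screening of steps 203-204 follows, reusing the worked example above; holding the probability set in a Python dict is an assumption of the sketch.

```python
# Sketch of steps 203-204: sort the first forward classification probability set
# from high to low and keep the first N entries together with their labels.

first_set = {
    "door kicking": 0.048, "safe prying": 0.11, "door not pried": 0.8,
    "windowsill footprint": 0.026, "window glass breaking": 0.07,
    "wall hole digging": 0.0003,
}
N = 4
first_target_set = dict(sorted(first_set.items(), key=lambda kv: kv[1], reverse=True)[:N])
print(first_target_set)
# {'door not pried': 0.8, 'safe prying': 0.11, 'window glass breaking': 0.07, 'door kicking': 0.048}
```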
Step 205: judging whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and executing step 206 if the forward classification probability is greater than or equal to the first preset classification threshold; if the forward classification probability is smaller than the first preset classification threshold, step 207 is executed.
The first preset classification threshold can be preset by a worker. The first preset classification thresholds corresponding to different classification labels can be the same value or different values; for example, the first preset classification thresholds of the labels "burglary" and "robbery" may both be 0.6, or the threshold of "burglary" may be 0.8 while that of "robbery" is 0.6. The forward classification probability corresponding to each classification label is compared with the first preset classification threshold corresponding to that label.
Step 206: and determining the classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs.
Step 207: and determining the classification label corresponding to the forward classification probability as a residual classification label.
For example, continuing the example of step 204, assume the first preset classification threshold corresponding to each classification label is 0.8. In the first target forward classification probability set, "door not pried 0.8" is equal to the first preset classification threshold, so "door not pried" is determined as a classification category of the short text to be classified. Meanwhile, the classification labels whose forward classification probabilities are smaller than the first preset classification threshold are determined as the remaining classification labels, namely "safe prying", "window glass breaking" and "door kicking".
Assuming instead that the first preset classification thresholds corresponding to the classification labels are all 0.85, no forward classification probability in the first target forward classification probability set is greater than or equal to the first preset classification threshold, and the classification labels corresponding to all the forward classification probabilities in the set become residual classification labels, namely "door not pried", "safe prying", "window glass breaking" and "door kicking".
The forward classification probabilities in the first target forward classification probability set are screened again: for each forward classification probability that meets the screening condition, the corresponding classification label is determined as a first classification category to which the short text to be classified belongs, and the other forward classification probabilities in the first target forward classification probability set are passed on for subsequent processing. This completes the initial classification of the short text to be classified, reduces the complexity of subsequent data processing, and increases the data processing speed.
Step 208: and classifying the short texts to be classified by utilizing a multi-classification model to obtain a second forward classification probability set, wherein the multi-classification model is composed of two classification models corresponding to residual classification labels, and the second forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different residual classification labels and corresponding residual classification labels, which are obtained by utilizing the multi-classification model.
The multi-classification model is composed of at least one two-classification model; the number of two-classification models is the same as the number of remaining classification labels, in one-to-one correspondence, and the two-classification models corresponding to different remaining classification labels may be the same or different. Each two-classification model calculates the forward classification probability that the short text to be classified belongs to its corresponding classification label. Existing two-classification models can be adopted, for example a Logistic regression model, which is an existing efficient two-class classifier; several two-classification models together form the multi-classification model, which is used to calculate the forward classification probabilities of the short text to be classified for the different remaining classification labels. For example, if the forward classification probabilities of the short text to be classified "came home and found the house burgled; the door lock is intact and the safe in the house has been pried open" for the remaining classification labels "door not pried", "safe prying", "window glass breaking" and "door kicking" are 0.9, 0.8, 0.45 and 0.6 respectively, the second forward classification probability set is { door not pried 0.9, safe prying 0.8, window glass breaking 0.45, door kicking 0.6 }.
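The patent leaves the concrete two-classification model open and names Logistic regression as one option. Below is a minimal sketch of such a multi-classification model built as a bank of binary Logistic regression classifiers; using scikit-learn's TfidfVectorizer and LogisticRegression is an assumption of the sketch, and a real system would need tokenization suited to the source language and training data in which each label has both positive and negative samples.

```python
# Sketch of step 208: one binary classifier per residual classification label;
# together they form the multi-classification model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_multi_classification_model(samples, remaining_labels):
    """samples: list of (text, label_set) training pairs."""
    texts = [text for text, _ in samples]
    models = {}
    for label in remaining_labels:
        y = [1 if label in labels else 0 for _, labels in samples]
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
        clf.fit(texts, y)  # binary target: does the sample carry this label?
        models[label] = clf
    return models

def second_forward_probability_set(text, models):
    # predict_proba returns [[P(negative), P(positive)]]; [0][1] is the forward
    # classification probability that the text belongs to the label.
    return {label: clf.predict_proba([text])[0][1] for label, clf in models.items()}
```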
Step 209: and judging whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, executing step 210.
Similarly, the second preset classification threshold may be preset by a worker, and the second preset classification thresholds corresponding to different remaining classification tags may be the same value or different values.
Step 210: and determining the classification label corresponding to the forward classification probability as a second classification category to which the short text to be classified belongs.
Step 211: and collecting all the second classification categories to obtain a second classification category set.
Assuming the second preset classification threshold is 0.5, the forward classification probabilities in the second forward classification probability set that are greater than or equal to the second preset classification threshold are "door not pried 0.9", "safe prying 0.8" and "door kicking 0.6". "Door not pried", "safe prying" and "door kicking" are therefore determined as second classification categories, giving the second classification category set { "door not pried", "safe prying", "door kicking" }.
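Expressed as code, the second screening of steps 209-211 is a simple filter; the sketch below reuses the example values and the 0.5 threshold from the text.

```python
# Sketch of steps 209-211: keep the labels whose forward classification
# probability is at least the second preset classification threshold.
second_set = {"door not pried": 0.9, "safe prying": 0.8,
              "window glass breaking": 0.45, "door kicking": 0.6}
second_threshold = 0.5
second_category_set = {label for label, p in second_set.items() if p >= second_threshold}
# second_category_set == {'door not pried', 'safe prying', 'door kicking'}
```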
Step 212: and merging the first classification category and the second classification category to obtain a classification result.
In order to further optimize the classification result and make it more accurate, another embodiment of the present application introduces classification mutual exclusion label pairs. Specifically, referring to fig. 3, another embodiment of the present application provides a short text multi-label classification method, including the following steps:
step 301: and obtaining the training sample marked with the classification label.
The classification labels can be set by the staff according to actual classification requirements; for example, in the public security field, the staff can set the classification labels to "door not pried", "safe prying", "robbery", and so on. The staff label the training samples with classification labels one by one; for example, for the training sample "came home and found the house burgled; the door lock is intact and the safe in the house has been pried open", the staff label the sample with the classification labels "door not pried" and "safe prying".
Step 302: and calculating to obtain a classification mutual exclusion probability matrix according to the classification labels of the training samples, wherein the classification mutual exclusion probability matrix is composed of the probability that each classification label and one classification label in other classification labels appear in the same training sample.
The classification mutual exclusion probability matrix reflects the likelihood that any two classification labels appear in the same training sample at the same time. For example, if all the classification labels are "door not pried", "safe prying", "door kicking" and "window glass breaking", a classification mutual exclusion matrix such as the one shown in the following table can be obtained from the classification labels of each training sample.
[Table: classification mutual exclusion probability matrix over the labels "door not pried", "safe prying", "door kicking" and "window glass breaking"; rendered only as an image (Figure BDA0001729956070000071) in the original publication.]
The probability values in the classification mutual exclusion probability matrix are calculated by the following formula:
K = N1/N2, where K is the probability value, N1 is the number of training samples labeled with both classification labels, and N2 is the total number of training samples labeled with either of the two classification labels. The classification mutual exclusion matrix generated by this calculation is checked by workers to prevent calculation errors caused by the selection of the training samples.
For example, to calculate the probability that "door not pried" and "door kicking" appear in the same training sample, the number of training samples labeled with both classification labels is counted first, then the number of training samples labeled with either of the two classification labels is counted, and the probability of their simultaneous occurrence is calculated with the above formula.
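A minimal sketch of steps 301-302 and the K = N1/N2 formula follows, assuming each training sample is represented simply by its set of classification labels; the sample data is illustrative.

```python
# Sketch of steps 301-302: for every pair of classification labels, K = N1/N2,
# where N1 counts training samples labeled with both labels and N2 counts
# training samples labeled with at least one of the two.
from itertools import combinations

def mutual_exclusion_matrix(label_sets):
    labels = sorted(set().union(*label_sets))
    matrix = {}
    for a, b in combinations(labels, 2):
        n1 = sum(1 for s in label_sets if a in s and b in s)
        n2 = sum(1 for s in label_sets if a in s or b in s)
        matrix[(a, b)] = n1 / n2 if n2 else 0.0
    return matrix

label_sets = [  # illustrative labeled training samples
    {"door not pried", "safe prying"},
    {"door kicking", "window glass breaking"},
    {"door not pried", "window glass breaking"},
]
K = mutual_exclusion_matrix(label_sets)
print(K[("door kicking", "door not pried")])
# 0.0 -> below a preset mutual exclusion threshold of 0.4, so the two labels
# form a classification mutual exclusion label pair (steps 303-304)
```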
Step 303: and sequentially judging whether the probability of each classification label in the mutual exclusion probability matrix and one classification label in other classification labels appearing in the same training sample is smaller than a preset mutual exclusion threshold, if so, executing step 304.
Step 304: and determining the two classification labels corresponding to the probability as a classification mutual exclusion label pair.
The preset mutual exclusion threshold can be preset by a worker; two classification labels whose probability is smaller than the preset mutual exclusion threshold are determined as a classification mutual exclusion label pair. Taking the classification mutual exclusion matrix above as an example, if the preset mutual exclusion threshold is 0.4, it can be seen that "door kicking" and "door not pried" form a classification mutual exclusion label pair.
It should be noted that classification mutual exclusion label pairs can also be set directly by the staff according to the actual situation. For example, the classification labels "door not pried" and "door prying" can be directly set as a classification mutual exclusion label pair according to common sense.
The staff can store the classification mutual exclusion label pairs obtained through steps 301-304, together with any directly set pairs, in a database for later use.
Step 305: and acquiring short texts to be classified.
A short text is a text that is short compared with a long text, and the short text to be classified is a short text that needs to be classified. For example, it may be report information in the public security field, status information in an instant messaging application such as a statement or status log in a QQ community, a web page fragment, a short message, or a microblog.
Step 306: and obtaining a first forward classification probability set by using a single classification model corresponding to the classification labels, wherein the first forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different classification labels and the corresponding classification labels, which are obtained by using the single classification model.
Each classification label has a corresponding single classification model, namely the number of the single classification models is the same as that of the classification labels. The forward classification probability is the probability that the short text to be classified belongs to the class of the classification label. The single classification model may adopt an existing single classifier, and this embodiment is not limited.
The forward classification probabilities of the short text to be classified for the different classification labels are calculated by using the single classification models, and the first forward classification probability set is obtained from these forward classification probabilities and their corresponding classification labels.
Step 307: and ordering the forward classification probabilities in the first forward classification probability set from high to low.
Step 308: and extracting the N forward classification probabilities and the corresponding classification labels before sequencing to obtain a first target forward classification probability set.
For example, after the short text to be classified "came home and found the house burgled; the door lock is intact and the safe in the house has been pried open" is processed by the single classification models, a first forward classification probability set is obtained: { door kicking 0.048, safe prying 0.11, door not pried 0.8, windowsill footprint 0.026, window glass breaking 0.07, wall hole digging 0.0003 }. The set is then sorted from high to low, giving { door not pried 0.8, safe prying 0.11, window glass breaking 0.07, door kicking 0.048, windowsill footprint 0.026, wall hole digging 0.0003 }, and the first N forward classification probabilities and their corresponding classification labels are extracted to obtain the first target forward classification probability set; if N is 4, this set is { door not pried 0.8, safe prying 0.11, window glass breaking 0.07, door kicking 0.048 }.
The forward classification probabilities in the first forward classification probability set are thus screened: those that do not meet the preset condition are removed and only those that meet it are retained, which reduces the amount of data to be processed and increases the data processing speed.
Step 309: judging whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if the forward classification probability is greater than or equal to the first preset classification threshold, executing step 310; if the forward classification probability is smaller than the first preset classification threshold, step 311 is executed.
The first preset classification threshold can be preset by a worker. The first preset classification thresholds corresponding to different classification labels can be the same value or different values; for example, the first preset classification thresholds of the labels "burglary" and "robbery" may both be 0.6, or the threshold of "burglary" may be 0.8 while that of "robbery" is 0.6. The forward classification probability corresponding to each classification label is compared with the first preset classification threshold corresponding to that label.
Step 310: and determining the classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs.
Step 311: and determining the classification label corresponding to the forward classification probability as a residual classification label.
For example, continuing the example of step 308, assume the first preset classification threshold corresponding to each classification label is 0.8. In the first target forward classification probability set, "door not pried 0.8" is equal to the first preset classification threshold, so "door not pried" is determined as a classification category of the short text to be classified. Meanwhile, the classification labels whose forward classification probabilities are smaller than the first preset classification threshold are determined as the remaining classification labels, namely "safe prying", "window glass breaking" and "door kicking".
Assuming instead that the first preset classification thresholds corresponding to the classification labels are all 0.85, no forward classification probability in the first target forward classification probability set is greater than or equal to the first preset classification threshold, and the classification labels corresponding to all the forward classification probabilities in the set become residual classification labels, namely "door not pried", "safe prying", "window glass breaking" and "door kicking".
The forward classification probabilities in the first target forward classification probability set are screened again: for each forward classification probability that meets the screening condition, the corresponding classification label is determined as a first classification category to which the short text to be classified belongs, and the other forward classification probabilities in the first target forward classification probability set are passed on for subsequent processing. This completes the initial classification of the short text to be classified, reduces the complexity of subsequent data processing, and increases the data processing speed.
Step 312: and classifying the short texts to be classified by utilizing a multi-classification model to obtain a second forward classification probability set, wherein the multi-classification model is composed of two classification models corresponding to residual classification labels, and the second forward classification probability set is composed of forward classification probabilities of the short texts to be classified in different residual classification labels and corresponding residual classification labels, which are obtained by utilizing the multi-classification model.
The multi-classification model is composed of at least one two-classification model; the number of two-classification models is the same as the number of remaining classification labels, in one-to-one correspondence, and the two-classification models corresponding to different remaining classification labels may be the same or different. Each two-classification model calculates the forward classification probability that the short text to be classified belongs to its corresponding classification label. Existing two-classification models can be adopted, for example a Logistic regression model, which is an existing efficient two-class classifier; several two-classification models together form the multi-classification model, which is used to calculate the forward classification probabilities of the short text to be classified for the different remaining classification labels. For example, if the forward classification probabilities of the short text to be classified "came home and found the house burgled; the door lock is intact and the safe in the house has been pried open" for the remaining classification labels "door not pried", "safe prying", "window glass breaking" and "door kicking" are 0.9, 0.8, 0.45 and 0.6 respectively, the second forward classification probability set is { door not pried 0.9, safe prying 0.8, window glass breaking 0.45, door kicking 0.6 }.
Step 313: and determining whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, executing step 314.
Similarly, the second preset classification threshold may be preset by a worker, and the second preset classification thresholds corresponding to different remaining classification tags may be the same value or different values.
Step 314: and extracting the forward classification probability and the corresponding classification label to obtain a second target forward classification probability set.
Assuming the second preset classification threshold corresponding to each classification label is 0.5, the forward classification probabilities in the second forward classification probability set that are greater than or equal to the second preset classification threshold are extracted, giving the second target forward classification probability set { door not pried 0.9, safe prying 0.8, door kicking 0.6 }.
Step 315: judging, by using the classification mutual exclusion label pairs, whether a classification mutual exclusion label pair exists in the second target forward classification probability set; if one exists, executing step 316; if none exists, executing step 317.
Step 316: and removing the classification label corresponding to the lower forward classification probability in the classification mutual exclusion label pair, and determining the classification label corresponding to the residual forward classification probability in the second target forward classification probability set as the second classification category to which the short text to be classified belongs.
As can be seen from the example of steps 302 to 304, "door not pried" and "door kicking" form a classification mutual exclusion label pair. Therefore the classification label with the lower forward classification probability in the pair, i.e., "door kicking", is removed and "door not pried" is kept, finally giving the second classification categories "door not pried" and "safe prying".
Step 317: and determining the classification labels corresponding to all the forward classification probabilities in the second target forward classification probability set as a second classification category to which the short text to be classified belongs.
If no classification mutual exclusion label pair exists in the second target forward classification probability set, the classification labels corresponding to all the forward classification probabilities in the set are determined as second classification categories to which the short text to be classified belongs.
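A minimal sketch of steps 315-317 follows, assuming the classification mutual exclusion label pairs are available as tuples; the data reuses the worked example above.

```python
# Sketch of steps 315-317: for each classification mutual exclusion label pair
# present in the second target forward classification probability set, remove
# the label with the lower forward classification probability (step 316); the
# labels that remain are the second classification categories (step 317).

def apply_mutual_exclusion(target_set, exclusion_pairs):
    result = dict(target_set)
    for a, b in exclusion_pairs:
        if a in result and b in result:
            del result[a if result[a] < result[b] else b]
    return list(result)

second_target = {"door not pried": 0.9, "safe prying": 0.8, "door kicking": 0.6}
pairs = [("door not pried", "door kicking")]
print(apply_mutual_exclusion(second_target, pairs))
# ['door not pried', 'safe prying']
```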
Step 318: and collecting all the second classification categories to obtain a second classification category set.
Step 319: and merging the first classification category and the second classification category to obtain a classification result.
According to the technical scheme above, the short text multi-label classification method provided by this embodiment first performs initial classification processing on the short text to be classified by using single classification models, and then performs secondary classification processing on it by using a multi-classification model composed of two-classification models, so that multi-label classification of short texts can be realized while the complexity of data processing is reduced and the data processing speed is increased.
Referring to fig. 4, in a second aspect, the present application provides a short text multi-label classification apparatus, including:
a first obtaining module 401, configured to obtain a short text to be classified;
a single classification model calculation module 402, configured to obtain a first forward classification probability set by using a single classification model corresponding to a classification tag, where the first forward classification probability set is composed of forward classification probabilities of the short text to be classified in different classification tags and corresponding classification tags, which are obtained by using the single classification model;
a first screening module 403, configured to screen the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set;
a determining module 404, configured to determine whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if the forward classification probability is greater than or equal to the first preset classification threshold, determine a classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs;
if the forward classification probability is smaller than the first preset classification threshold, determining the classification label corresponding to the forward classification probability as a residual classification label;
a multi-classification model calculation module 405, configured to classify the short text to be classified by using a multi-classification model to obtain a second forward classification probability set, where the multi-classification model is composed of two classification models corresponding to remaining classification labels, and the second forward classification probability set is composed of forward classification probabilities of the short text to be classified in different remaining classification labels, which are calculated by using the multi-classification model;
a second screening module 406, configured to screen the forward classification probabilities in the second forward classification probability set to obtain a second classification category set, where the second classification category set is composed of the classification labels corresponding to the forward classification probabilities retained after the second forward classification probability set is screened;
and the output module 407 is configured to merge the first classification category and the second classification category to obtain a classification result.
According to the technical scheme above, the short text multi-label classification device provided by this embodiment performs initial classification processing on the short text to be classified by using single classification models, and then performs secondary classification processing on it by using a multi-classification model composed of two-classification models, so that multi-label classification of short texts can be realized while the complexity of data processing is reduced and the data processing speed is increased.
Further, referring to fig. 5, the first screening module 403 includes:
a sorting unit 501, configured to sort the forward classification probabilities in the first forward classification probability set from high to low;
an extracting unit 502, configured to extract the N forward classification probabilities before the sorting and the corresponding classification labels thereof, to obtain a first target forward classification probability set.
Further, referring to fig. 6, the second filtering module 406 includes:
a first determining unit 601, configured to determine whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, determine a classification label corresponding to the forward classification probability as a second classification category to which the short text to be classified belongs;
a first output unit 602, configured to aggregate all the second classification categories to obtain a second classification category set.
Further, referring to fig. 7 and 8, the apparatus further includes:
a second obtaining module 701, configured to obtain classification mutual exclusion label pairs;
the second filtering module 406 includes:
a screening unit 801, configured to determine whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, extract the forward classification probability and a corresponding classification label to obtain a second target forward classification probability set;
a second determining unit 802, configured to determine, by using the classification mutual exclusion label pairs, whether a classification mutual exclusion label pair exists in the second target forward classification probability set; if one exists, remove the classification label corresponding to the lower forward classification probability in the pair, and determine the classification labels corresponding to the remaining forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs;
if no classification mutual exclusion label pair exists in the second target forward classification probability set, determine the classification labels corresponding to all the forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs;
a second output unit 803, configured to aggregate all the second classification categories to obtain a second classification category set.
Further, referring to fig. 7, the second obtaining module 701 includes:
an obtaining unit 7011, configured to obtain a training sample labeled with a classification label;
a classification mutual exclusion probability matrix calculation unit 7012, configured to calculate a classification mutual exclusion probability matrix from the classification labels of the training samples, where the classification mutual exclusion probability matrix is formed by the probabilities that each classification label appears in the same training sample as each of the other classification labels;
a classification mutual exclusion tag pair determining unit 7013, configured to sequentially determine whether a probability that each classification tag in the mutual exclusion probability matrix and one classification tag in the other classification tags appear in the same training sample is smaller than a preset mutual exclusion threshold, and if so, determine two classification tags corresponding to the probability as a classification mutual exclusion tag pair.
According to the technical scheme above, the present application provides a short text multi-label classification method and device. The method first performs initial classification processing on the short text to be classified by using single classification models, and then performs secondary classification processing on it by using a multi-classification model composed of two-classification models, so that multi-label classification of short texts can be realized while the complexity of data processing is reduced and the data processing speed is increased.
Those skilled in the art will clearly understand that the techniques in the embodiments of the present application may be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solutions in the embodiments of the present application may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present application.
The embodiments of the present disclosure are described in a progressive manner; the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on its differences from the others. In particular, since the apparatus embodiment is substantially similar to the method embodiment, its description is relatively brief, and the relevant parts can be found in the description of the method embodiment.

Claims (8)

1. A short text multi-label classification method is characterized by comprising the following steps:
acquiring a short text to be classified;
obtaining a first forward classification probability set by using single classification models corresponding to classification labels, wherein the first forward classification probability set is composed of the forward classification probabilities of the short text to be classified under different classification labels, obtained by using the single classification models, together with the corresponding classification labels;
screening the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set;
judging whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if the forward classification probability is greater than or equal to the first preset classification threshold, determining the classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs;
if the forward classification probability is smaller than the first preset classification threshold, determining the classification label corresponding to the forward classification probability as a remaining classification label;
classifying the short text to be classified by using a multi-classification model to obtain a second forward classification probability set, wherein the multi-classification model is composed of binary classification models corresponding to the remaining classification labels, and the second forward classification probability set is composed of the forward classification probabilities of the short text to be classified under the different remaining classification labels, obtained by using the multi-classification model, together with the corresponding remaining classification labels;
screening the forward classification probabilities in the second forward classification probability set to obtain a second classification category set, wherein the second classification category set is composed of the classification labels corresponding to the forward classification probabilities retained after the second forward classification probability set is screened;
and combining the first classification categories and the second classification categories to obtain a classification result;
wherein the screening the forward classification probabilities in the second forward classification probability set to obtain the second classification category set comprises:
judging whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, determining the classification label corresponding to the forward classification probability as a second classification category to which the short text to be classified belongs;
and aggregating all the second classification categories to obtain the second classification category set.
2. The method of claim 1, wherein the screening the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set comprises:
sorting the forward classification probabilities in the first forward classification probability set from high to low;
and extracting the top N forward classification probabilities and their corresponding classification labels to obtain the first target forward classification probability set.
3. The method of claim 1, wherein, before the obtaining the short text to be classified, the method further comprises:
acquiring classification mutual exclusion label pairs;
and the screening the forward classification probabilities in the second forward classification probability set to obtain the second classification category set comprises:
judging whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, extracting the forward classification probability and the corresponding classification label to obtain a second target forward classification probability set;
judging, by using the classification mutual exclusion label pairs, whether a classification mutual exclusion label pair exists in the second target forward classification probability set; if such a pair exists, removing the classification label corresponding to the smaller forward classification probability in the pair, and determining the classification labels corresponding to the remaining forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs;
if no classification mutual exclusion label pair exists in the second target forward classification probability set, determining the classification labels corresponding to all the forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs;
and aggregating all the second classification categories to obtain the second classification category set.
4. The method of claim 3, wherein the acquiring the classification mutual exclusion label pairs comprises:
acquiring training samples labeled with classification labels;
calculating a classification mutual exclusion probability matrix from the classification labels of the training samples, wherein the classification mutual exclusion probability matrix is composed of the probabilities that each classification label appears in the same training sample as each of the other classification labels;
and judging, for each probability in the classification mutual exclusion probability matrix, whether the probability that the corresponding pair of classification labels appears in the same training sample is smaller than a preset mutual exclusion threshold, and if so, determining the two classification labels corresponding to that probability as a classification mutual exclusion label pair.
5. A short text multi-label classification device, comprising:
the first acquisition module is used for acquiring a short text to be classified;
the single classification model calculation module is used for obtaining a first forward classification probability set by using single classification models corresponding to classification labels, wherein the first forward classification probability set is composed of the forward classification probabilities of the short text to be classified under different classification labels, obtained by using the single classification models, together with the corresponding classification labels;
the first screening module is used for screening the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set;
a judging module, configured to judge whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if the forward classification probability is greater than or equal to the first preset classification threshold, determine the classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs;
if the forward classification probability is smaller than the first preset classification threshold, determine the classification label corresponding to the forward classification probability as a remaining classification label;
the multi-classification model calculation module is used for classifying the short text to be classified by using a multi-classification model to obtain a second forward classification probability set, wherein the multi-classification model is composed of binary classification models corresponding to the remaining classification labels, and the second forward classification probability set is composed of the forward classification probabilities of the short text to be classified under the different remaining classification labels, obtained by using the multi-classification model, together with the corresponding remaining classification labels;
the second screening module is used for screening the forward classification probabilities in the second forward classification probability set to obtain a second classification category set, wherein the second classification category set is composed of the classification labels corresponding to the forward classification probabilities retained after the second forward classification probability set is screened;
the output module is used for combining the first classification categories and the second classification categories to obtain a classification result;
the second screening module includes:
a first judging unit, configured to judge whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, determine the classification label corresponding to the forward classification probability as a second classification category to which the short text to be classified belongs;
and a first output unit, configured to aggregate all the second classification categories to obtain the second classification category set.
6. The apparatus of claim 5, wherein the first screening module comprises:
a sorting unit, configured to sort the forward classification probabilities in the first forward classification probability set from high to low;
and an extraction unit, configured to extract the top N forward classification probabilities and their corresponding classification labels to obtain the first target forward classification probability set.
7. The apparatus of claim 5, wherein the apparatus further comprises:
the second obtaining module is used for acquiring classification mutual exclusion label pairs;
the second screening module includes:
a screening unit, configured to judge whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, extract the forward classification probability and the corresponding classification label to obtain a second target forward classification probability set;
a second judging unit, configured to judge, by using the classification mutual exclusion label pairs, whether a classification mutual exclusion label pair exists in the second target forward classification probability set; if such a pair exists, remove the classification label corresponding to the smaller forward classification probability in the pair, and determine the classification labels corresponding to the remaining forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs;
if no classification mutual exclusion label pair exists in the second target forward classification probability set, determine the classification labels corresponding to all the forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs;
and a second output unit, configured to aggregate all the second classification categories to obtain the second classification category set.
8. The apparatus of claim 7, wherein the second obtaining module comprises:
an acquisition unit, configured to acquire training samples labeled with classification labels;
a classification mutual exclusion probability matrix calculation unit, configured to calculate a classification mutual exclusion probability matrix from the classification labels of the training samples, wherein the classification mutual exclusion probability matrix is composed of the probabilities that each classification label appears in the same training sample as each of the other classification labels;
and a classification mutual exclusion label pair determining unit, configured to judge, for each probability in the classification mutual exclusion probability matrix, whether the probability that the corresponding pair of classification labels appears in the same training sample is smaller than a preset mutual exclusion threshold, and if so, determine the two classification labels corresponding to that probability as a classification mutual exclusion label pair.
CN201810769761.1A 2018-07-13 2018-07-13 Short text multi-label classification method and device Active CN108920694B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810769761.1A CN108920694B (en) 2018-07-13 2018-07-13 Short text multi-label classification method and device


Publications (2)

Publication Number Publication Date
CN108920694A CN108920694A (en) 2018-11-30
CN108920694B (en) 2020-08-28

Family

ID=64412717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810769761.1A Active CN108920694B (en) 2018-07-13 2018-07-13 Short text multi-label classification method and device

Country Status (1)

Country Link
CN (1) CN108920694B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948160B (en) * 2019-03-15 2023-04-18 智者四海(北京)技术有限公司 Short text classification method and device
CN110458245B (en) * 2019-08-20 2021-11-02 图谱未来(南京)人工智能研究院有限公司 Multi-label classification model training method, data processing method and device
CN112883189A (en) * 2021-01-26 2021-06-01 浙江香侬慧语科技有限责任公司 Text classification method and device based on label description, storage medium and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101059796A (en) * 2006-04-19 2007-10-24 中国科学院自动化研究所 Two-stage combined file classification method based on probability subject
CN105447031A (en) * 2014-08-28 2016-03-30 百度在线网络技术(北京)有限公司 Training sample labeling method and device
CN106156120A (en) * 2015-04-07 2016-11-23 阿里巴巴集团控股有限公司 The method and apparatus that character string is classified
CN106897428A (en) * 2017-02-27 2017-06-27 腾讯科技(深圳)有限公司 Text classification feature extracting method, file classification method and device
CN107273500A (en) * 2017-06-16 2017-10-20 中国电子技术标准化研究院 Text classifier generation method, file classification method, device and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10229117B2 (en) * 2015-06-19 2019-03-12 Gordon V. Cormack Systems and methods for conducting a highly autonomous technology-assisted review classification




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190906

Address after: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant after: China Science and Technology (Beijing) Co., Ltd.

Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building block A Room 601

Applicant before: Beijing Shenzhou Taiyue Software Co., Ltd.

CB02 Change of applicant information

Address after: 19 / F-B, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Applicant after: Dingfu Intelligent Technology Co., Ltd

Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.

GR01 Patent grant