Detailed Description
Referring to fig. 1, in a first aspect, an embodiment of the present application provides a short text multi-label classification method, including the following steps:
Step 101: acquire the short text to be classified.
A short text is a text that is relatively short compared with a long text, and the short text to be classified is a short text that needs to be classified. For example, the short text to be classified may be report information in the public security field, status information in an instant messaging application (such as a statement or status log from a QQ community), a web page fragment, a short message, or a microblog post.
Step 102: obtain a first forward classification probability set using the single classification models corresponding to the classification labels, where the first forward classification probability set is composed of the forward classification probabilities of the short text to be classified under the different classification labels, together with the corresponding classification labels, as computed by the single classification models.
The classification labels can be set by staff according to actual classification requirements; for example, in the public security field, staff may set classification labels such as 'burglary' and 'robbery'. Each classification label has a corresponding single classification model, i.e., the number of single classification models equals the number of classification labels. The single classification models may adopt any existing single classifier; this embodiment is not limited in this respect. The forward classification probability is the probability that the short text to be classified belongs to the category of a given classification label.
The forward classification probabilities of the short text to be classified under the different classification labels are calculated using the single classification models, and the first forward classification probability set is assembled from these probabilities and their corresponding classification labels.
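As a minimal sketch of this per-label computation (in Python; the `forward_probability` interface and the way the models are supplied are illustrative assumptions, since the embodiment does not prescribe a particular single classifier):

```python
# Sketch of step 102: one single classification model per classification
# label; each model returns the forward classification probability that
# the short text belongs to that label's category.

def first_forward_probability_set(text, single_models):
    """single_models maps each classification label to a trained single
    classification model; the result pairs each label with its forward
    classification probability for the given text."""
    return {label: model.forward_probability(text)
            for label, model in single_models.items()}
```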
Step 103: screen the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set.
Screening the forward classification probabilities in the first forward classification probability set removes the forward classification probabilities, and their corresponding classification labels, that do not meet a preset condition, and retains only those that do. This reduces the amount of data to be processed and increases the processing speed.
Step 104: determine whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold; if so, execute step 105; if the forward classification probability is less than the first preset classification threshold, execute step 106.
The first preset classification threshold may be preset by staff. It may be the same value for all classification labels or a different value for each; for example, the first preset classification thresholds of the classification labels 'burglary' and 'robbery' may both be 0.6, or 'burglary' may use 0.8 while 'robbery' uses 0.6.
Step 105: determine the classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs.
Step 106: determine the classification label corresponding to the forward classification probability as a remaining classification label.
The forward classification probabilities in the first target forward classification probability set are thus screened again: for each forward classification probability that meets the screening condition, the corresponding classification label is determined as a first classification category to which the short text to be classified belongs, and the other forward classification probabilities in the set are passed on for subsequent processing. This completes the initial classification of the short text to be classified, reduces the complexity of subsequent data processing, and increases the processing speed.
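Steps 104-106 amount to a per-label threshold comparison. A minimal sketch, assuming the thresholds are supplied as a per-label mapping:

```python
# Sketch of steps 104-106: probabilities at or above the first preset
# classification threshold yield first classification categories; the
# rest yield remaining classification labels for the second stage.

def split_by_first_threshold(target_set, thresholds):
    """target_set: {label: forward probability}; thresholds: {label: float},
    which may assign every label the same value or per-label values."""
    first_categories = [label for label, p in target_set.items()
                        if p >= thresholds[label]]
    remaining_labels = [label for label, p in target_set.items()
                        if p < thresholds[label]]
    return first_categories, remaining_labels
```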
Step 107: classify the short text to be classified using a multi-classification model to obtain a second forward classification probability set, where the multi-classification model is composed of two-classification (binary) models corresponding to the remaining classification labels, and the second forward classification probability set is composed of the forward classification probabilities of the short text to be classified under the different remaining classification labels, as computed by the multi-classification model.
Step 108: screen the forward classification probabilities in the second forward classification probability set to obtain a second classification category set, which is composed of the remaining classification labels corresponding to the forward classification probabilities retained after screening.
Step 109: merge the first classification categories and the second classification category set to obtain the classification result.
In summary, the short text multi-label classification method provided by this embodiment first performs an initial classification of the short text to be classified using single classification models, and then performs a secondary classification using a multi-classification model composed of binary classification models. This realizes multi-label classification of short texts while reducing the complexity of data processing and increasing the processing speed.
Referring to fig. 2, another embodiment of the present application provides a short text multi-label classification method, including the following steps:
Step 201: acquire the short text to be classified.
A short text is a text that is relatively short compared with a long text, and the short text to be classified is a short text that needs to be classified. For example, the short text to be classified may be report information in the public security field, status information in an instant messaging application (such as a statement or status log from a QQ community), a web page fragment, a short message, or a microblog post.
Step 202: obtain a first forward classification probability set using the single classification models corresponding to the classification labels, where the first forward classification probability set is composed of the forward classification probabilities of the short text to be classified under the different classification labels, together with the corresponding classification labels, as computed by the single classification models.
Each classification label has a corresponding single classification model, i.e., the number of single classification models equals the number of classification labels. The forward classification probability is the probability that the short text to be classified belongs to the category of a given classification label. The single classification model may adopt any existing single classifier; this embodiment is not limited in this respect.
The forward classification probabilities of the short text to be classified under the different classification labels are calculated using the single classification models, and the first forward classification probability set is assembled from these probabilities and their corresponding classification labels.
Step 203: sort the forward classification probabilities in the first forward classification probability set from high to low.
Step 204: extract the top N forward classification probabilities and their corresponding classification labels to obtain a first target forward classification probability set.
For example, suppose the short text to be classified is 'came home and found the house had been burgled; the door lock is intact and the safe in the house has been pried open'. After processing by the single classification models, the first forward classification probability set is {door kicked: 0.048, safe pried: 0.11, door not pried: 0.8, footprint on windowsill: 0.026, window glass broken: 0.07, wall hole dug: 0.0003}. Sorting this set yields {door not pried: 0.8, safe pried: 0.11, window glass broken: 0.07, door kicked: 0.048, footprint on windowsill: 0.026, wall hole dug: 0.0003}. The top N forward classification probabilities and their corresponding classification labels are then extracted to obtain the first target forward classification probability set; N can be set according to actual requirements. If N is 4, the first target forward classification probability set is {door not pried: 0.8, safe pried: 0.11, window glass broken: 0.07, door kicked: 0.048}.
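A minimal sketch of this sorting-and-extraction screening, using the probabilities from the example above (the function name is illustrative):

```python
# Sketch of steps 203-204: sort the first forward classification
# probability set from high to low and keep the top N entries.

def top_n(prob_set, n):
    ranked = sorted(prob_set.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])

first_set = {"door kicked": 0.048, "safe pried": 0.11, "door not pried": 0.8,
             "footprint on windowsill": 0.026, "window glass broken": 0.07,
             "wall hole dug": 0.0003}
print(top_n(first_set, 4))
# {'door not pried': 0.8, 'safe pried': 0.11,
#  'window glass broken': 0.07, 'door kicked': 0.048}
```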
Screening the forward classification probabilities in the first forward classification probability set in this way removes the forward classification probabilities that do not meet the preset condition and retains only those that do, reducing the amount of data to be processed and increasing the processing speed.
Step 205: determine whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold; if so, execute step 206; if the forward classification probability is smaller than the first preset classification threshold, execute step 207.
The first preset classification threshold may be preset by staff, and the thresholds corresponding to different classification labels may be the same value or different values; for example, the first preset classification thresholds of the classification labels 'burglary' and 'robbery' may both be 0.6, or 'burglary' may use 0.8 while 'robbery' uses 0.6. Each forward classification probability is compared with the first preset classification threshold corresponding to its own classification label.
Step 206: determine the classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs.
Step 207: determine the classification label corresponding to the forward classification probability as a remaining classification label.
For example, continuing the example of step 204, assume the first preset classification threshold corresponding to each classification label is 0.8. In the first target forward classification probability set, 'door not pried: 0.8' equals the first preset classification threshold, so 'door not pried' is determined as a first classification category of the short text to be classified. Meanwhile, the classification labels whose forward classification probabilities are smaller than the first preset classification threshold are determined as the remaining classification labels, namely 'safe pried', 'window glass broken', and 'door kicked'.
If instead the first preset classification thresholds corresponding to the classification labels are all 0.85, no forward classification probability in the first target forward classification probability set reaches the threshold, and all the classification labels corresponding to the forward classification probabilities in the set become remaining classification labels, namely 'door not pried', 'safe pried', 'window glass broken', and 'door kicked'.
The forward classification probabilities in the first target forward classification probability set are thus screened again: for each forward classification probability that meets the screening condition, the corresponding classification label is determined as a first classification category to which the short text to be classified belongs, and the other forward classification probabilities in the set are passed on for subsequent processing. This completes the initial classification of the short text to be classified, reduces the complexity of subsequent data processing, and increases the processing speed.
Step 208: classify the short text to be classified using a multi-classification model to obtain a second forward classification probability set, where the multi-classification model is composed of binary classification models corresponding to the remaining classification labels, and the second forward classification probability set is composed of the forward classification probabilities of the short text to be classified under the different remaining classification labels, together with those labels, as computed by the multi-classification model.
The multi-classification model is composed of at least one binary classification model; the number of binary classification models equals the number of remaining classification labels, in one-to-one correspondence. The binary classification models corresponding to different remaining classification labels may be the same or different. Each binary classification model computes the forward classification probability that the short text to be classified belongs to the category of its classification label. Existing binary classifiers may be used, such as a logistic regression model, an efficient existing binary classifier; several binary classification models together form the multi-classification model, which computes the forward classification probabilities of the short text under the different remaining classification labels. For example, for the short text to be classified 'came home and found the house had been burgled; the door lock is intact and the safe in the house has been pried open', suppose the forward classification probabilities under the remaining classification labels 'door not pried', 'safe pried', 'window glass broken', and 'door kicked' are 0.9, 0.8, 0.45, and 0.6 respectively; the second forward classification probability set is then {door not pried: 0.9, safe pried: 0.8, window glass broken: 0.45, door kicked: 0.6}.
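Since the passage names logistic regression as one possible two-classification model, the following sketch uses scikit-learn's LogisticRegression; the TF-IDF features and the per-label binary training targets are assumptions, not part of the embodiment:

```python
# Sketch of step 208, assuming scikit-learn. Logistic regression is the
# binary classifier the passage mentions; the TF-IDF features and the
# per-label binary targets (y_by_label) are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_multi_classification_model(train_texts, y_by_label, remaining_labels):
    """One binary logistic regression model per remaining classification
    label; together they form the multi-classification model."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(train_texts)
    models = {label: LogisticRegression(max_iter=1000).fit(X, y_by_label[label])
              for label in remaining_labels}
    return vectorizer, models

def second_forward_probability_set(text, vectorizer, models):
    """Forward classification probability of the short text under each
    remaining classification label (positive-class probability)."""
    x = vectorizer.transform([text])
    return {label: float(model.predict_proba(x)[0, 1])
            for label, model in models.items()}
```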
Step 209: determine whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold; if so, execute step 210.
Similarly, the second preset classification threshold may be preset by staff, and the second preset classification thresholds corresponding to different remaining classification labels may be the same value or different values.
Step 210: determine the classification label corresponding to the forward classification probability as a second classification category to which the short text to be classified belongs.
Step 211: collect all the second classification categories to obtain a second classification category set.
Assuming the second preset classification threshold is 0.5, the forward classification probabilities in the second forward classification probability set that are greater than or equal to the threshold are 'door not pried: 0.9', 'safe pried: 0.8', and 'door kicked: 0.6'. 'Door not pried', 'safe pried', and 'door kicked' are therefore determined as second classification categories, giving the second classification category set {'door not pried', 'safe pried', 'door kicked'}.
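A minimal sketch of this second screening, reproducing the example above:

```python
# Sketch of steps 209-211: keep each remaining classification label whose
# forward classification probability reaches the second preset threshold.

def second_classification_category_set(second_set, threshold=0.5):
    return {label for label, p in second_set.items() if p >= threshold}

categories = second_classification_category_set(
    {"door not pried": 0.9, "safe pried": 0.8,
     "window glass broken": 0.45, "door kicked": 0.6})
print(sorted(categories))
# ['door kicked', 'door not pried', 'safe pried']
```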
Step 212: merge the first classification categories and the second classification category set to obtain the classification result.
To further optimize the classification result and make it more accurate, another embodiment of the present application introduces classification mutually exclusive tag pairs. Specifically, referring to fig. 3, another embodiment of the present application provides a short text multi-label classification method, including the following steps:
Step 301: obtain training samples labeled with classification labels.
The classification labels can be set by staff according to actual classification requirements; for example, in the public security field, staff may set classification labels such as 'door not pried', 'safe pried', and 'robbery'. Staff label the training samples one by one; for example, for the training sample 'came home and found the house had been burgled; the door lock is intact and the safe in the house has been pried open', staff mark the classification labels 'door not pried' and 'safe pried'.
Step 302: calculate a classification mutual exclusion probability matrix from the classification labels of the training samples, where the classification mutual exclusion probability matrix is composed of the probabilities that each classification label appears in the same training sample as each of the other classification labels.
The classification mutual exclusion probability matrix reflects the likelihood that any two classification labels appear in the same training sample at the same time. For example, if all the classification labels are 'door not pried', 'safe pried', 'door kicked', and 'window glass broken', a classification mutual exclusion matrix as shown in the following table can be obtained from the classification labels of the training samples.
The probability values in the classification mutual exclusion probability matrix are calculated by the formula
K = N1 / N2,
where K is the probability value, N1 is the number of training samples containing both classification labels, and N2 is the total number of training samples containing either of the two classification labels. The classification mutual exclusion matrix generated after calculation is checked by staff to prevent errors caused by the selection of the training samples.
For example, to calculate the probability that 'door not pried' and 'door kicked' appear in the same training sample, first count the number of training samples whose classification labels include both 'door not pried' and 'door kicked', then count the number of training samples whose classification labels include either 'door not pried' or 'door kicked', and compute the probability with the formula above.
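A minimal sketch of steps 302-304, assuming each training sample's labels are given as a set (the representation is an assumption):

```python
# Sketch of steps 302-304: for every pair of classification labels,
# K = N1 / N2, where N1 counts training samples labelled with both tags
# and N2 counts samples labelled with either; pairs whose K is below the
# preset mutual exclusion threshold are mutually exclusive tag pairs.
from itertools import combinations

def mutual_exclusion_pairs(sample_labels, threshold):
    """sample_labels: one set of classification labels per training sample."""
    all_labels = sorted({l for labels in sample_labels for l in labels})
    pairs = []
    for a, b in combinations(all_labels, 2):
        n1 = sum(1 for labels in sample_labels if a in labels and b in labels)
        n2 = sum(1 for labels in sample_labels if a in labels or b in labels)
        if n2 > 0 and n1 / n2 < threshold:
            pairs.append((a, b))
    return pairs
```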
Step 303: sequentially determine whether each probability in the classification mutual exclusion probability matrix, i.e., the probability that a given classification label appears in the same training sample as one of the other classification labels, is smaller than a preset mutual exclusion threshold; if so, execute step 304.
Step 304: determine the two classification labels corresponding to that probability as a classification mutually exclusive tag pair.
The preset mutual exclusion threshold may be preset by staff, and the two classification labels whose co-occurrence probability is smaller than the threshold are determined as a classification mutually exclusive tag pair. Taking the classification mutual exclusion matrix above as an example, if the preset mutual exclusion threshold is 0.4, it can be seen that 'door kicked' and 'door not pried' form a classification mutually exclusive tag pair.
It should be noted that classification mutually exclusive tag pairs can also be set directly by staff according to the actual situation. For example, the classification labels 'door not pried' and 'safe pried' may be set directly as a classification mutually exclusive tag pair according to common sense.
Staff may store the classification mutually exclusive tag pairs obtained through steps 301-304, together with any directly set pairs, in a database for later use.
Step 305: acquire the short text to be classified.
A short text is a text that is relatively short compared with a long text, and the short text to be classified is a short text that needs to be classified. For example, the short text to be classified may be report information in the public security field, status information in an instant messaging application (such as a statement or status log from a QQ community), a web page fragment, a short message, or a microblog post.
Step 306: obtain a first forward classification probability set using the single classification models corresponding to the classification labels, where the first forward classification probability set is composed of the forward classification probabilities of the short text to be classified under the different classification labels, together with the corresponding classification labels, as computed by the single classification models.
Each classification label has a corresponding single classification model, i.e., the number of single classification models equals the number of classification labels. The forward classification probability is the probability that the short text to be classified belongs to the category of a given classification label. The single classification model may adopt any existing single classifier; this embodiment is not limited in this respect.
The forward classification probabilities of the short text to be classified under the different classification labels are calculated using the single classification models, and the first forward classification probability set is assembled from these probabilities and their corresponding classification labels.
Step 307: sort the forward classification probabilities in the first forward classification probability set from high to low.
Step 308: extract the top N forward classification probabilities and their corresponding classification labels to obtain a first target forward classification probability set.
For example, suppose the short text to be classified is 'came home and found the house had been burgled; the door lock is intact and the safe in the house has been pried open'. After processing by the single classification models, the first forward classification probability set is {door kicked: 0.048, safe pried: 0.11, door not pried: 0.8, footprint on windowsill: 0.026, window glass broken: 0.07, wall hole dug: 0.0003}. Sorting this set yields {door not pried: 0.8, safe pried: 0.11, window glass broken: 0.07, door kicked: 0.048, footprint on windowsill: 0.026, wall hole dug: 0.0003}. The top N forward classification probabilities and their corresponding classification labels are then extracted to obtain the first target forward classification probability set; if N is 4, the first target forward classification probability set is {door not pried: 0.8, safe pried: 0.11, window glass broken: 0.07, door kicked: 0.048}.
Screening the forward classification probabilities in the first forward classification probability set in this way removes the forward classification probabilities that do not meet the preset condition and retains only those that do, reducing the amount of data to be processed and increasing the processing speed.
Step 309: determine whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold; if so, execute step 310; if the forward classification probability is smaller than the first preset classification threshold, execute step 311.
The first preset classification threshold may be preset by staff, and the thresholds corresponding to different classification labels may be the same value or different values; for example, the first preset classification thresholds of the classification labels 'burglary' and 'robbery' may both be 0.6, or 'burglary' may use 0.8 while 'robbery' uses 0.6. Each forward classification probability is compared with the first preset classification threshold corresponding to its own classification label.
Step 310: determine the classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs.
Step 311: determine the classification label corresponding to the forward classification probability as a remaining classification label.
For example, continuing the example of step 308, assume the first preset classification threshold corresponding to each classification label is 0.8. In the first target forward classification probability set, 'door not pried: 0.8' equals the first preset classification threshold, so 'door not pried' is determined as a first classification category of the short text to be classified. Meanwhile, the classification labels whose forward classification probabilities are smaller than the first preset classification threshold are determined as the remaining classification labels, namely 'safe pried', 'window glass broken', and 'door kicked'.
If instead the first preset classification thresholds corresponding to the classification labels are all 0.85, no forward classification probability in the first target forward classification probability set reaches the threshold, and all the classification labels corresponding to the forward classification probabilities in the set become remaining classification labels, namely 'door not pried', 'safe pried', 'window glass broken', and 'door kicked'.
The forward classification probabilities in the first target forward classification probability set are thus screened again: for each forward classification probability that meets the screening condition, the corresponding classification label is determined as a first classification category to which the short text to be classified belongs, and the other forward classification probabilities in the set are passed on for subsequent processing. This completes the initial classification of the short text to be classified, reduces the complexity of subsequent data processing, and increases the processing speed.
Step 312: classify the short text to be classified using a multi-classification model to obtain a second forward classification probability set, where the multi-classification model is composed of binary classification models corresponding to the remaining classification labels, and the second forward classification probability set is composed of the forward classification probabilities of the short text to be classified under the different remaining classification labels, together with those labels, as computed by the multi-classification model.
The multi-classification model is composed of at least one binary classification model; the number of binary classification models equals the number of remaining classification labels, in one-to-one correspondence, and the binary classification models corresponding to different remaining classification labels may be the same or different. Each binary classification model computes the forward classification probability that the short text to be classified belongs to the category of its classification label. Existing binary classifiers may be used, such as a logistic regression model, an efficient existing binary classifier; several binary classification models together form the multi-classification model, which computes the forward classification probabilities of the short text under the different remaining classification labels. For example, for the short text to be classified 'came home and found the house had been burgled; the door lock is intact and the safe in the house has been pried open', suppose the forward classification probabilities under the remaining classification labels 'door not pried', 'safe pried', 'window glass broken', and 'door kicked' are 0.9, 0.8, 0.45, and 0.6 respectively; the second forward classification probability set is then {door not pried: 0.9, safe pried: 0.8, window glass broken: 0.45, door kicked: 0.6}.
Step 313: determine whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold; if so, execute step 314.
Similarly, the second preset classification threshold may be preset by staff, and the second preset classification thresholds corresponding to different remaining classification labels may be the same value or different values.
Step 314: extract the forward classification probabilities and their corresponding classification labels to obtain a second target forward classification probability set.
Assuming the second preset classification threshold corresponding to each classification label is 0.5, the forward classification probabilities in the second forward classification probability set that are greater than or equal to the threshold are extracted, giving the second target forward classification probability set {door not pried: 0.9, safe pried: 0.8, door kicked: 0.6}.
Step 315: determine, using the classification mutually exclusive tag pairs, whether any mutually exclusive tag pair exists in the second target forward classification probability set; if so, execute step 316; if not, execute step 317.
Step 316: remove the classification label corresponding to the lower forward classification probability in the mutually exclusive tag pair, and determine the classification labels corresponding to the remaining forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs.
As can be seen from the example of steps 302-304, 'door not pried' and 'door kicked' form a classification mutually exclusive tag pair. The classification label with the lower forward classification probability in the pair, 'door kicked' (0.6), is therefore removed and 'door not pried' (0.9) is kept, finally yielding 'door not pried' and 'safe pried'.
Step 317: determine the classification labels corresponding to all the forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs.
If no classification mutually exclusive tag pair exists in the second target forward classification probability set, the classification labels corresponding to all the forward classification probabilities may be determined as categories to which the short text to be classified belongs.
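A minimal sketch of the mutual-exclusion screening of steps 315-317, using the classification mutually exclusive tag pair from the example above:

```python
# Sketch of steps 315-317: if both tags of a classification mutually
# exclusive pair survive the second screening, remove the one with the
# lower forward classification probability (on a tie, this sketch keeps
# the first tag of the pair).

def apply_mutual_exclusion(target_set, exclusion_pairs):
    result = dict(target_set)
    for a, b in exclusion_pairs:
        if a in result and b in result:
            result.pop(a if result[a] < result[b] else b)
    return set(result)

print(sorted(apply_mutual_exclusion(
    {"door not pried": 0.9, "safe pried": 0.8, "door kicked": 0.6},
    [("door kicked", "door not pried")])))
# ['door not pried', 'safe pried']
```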
Step 318: collect all the second classification categories to obtain a second classification category set.
Step 319: merge the first classification categories and the second classification category set to obtain the classification result.
In summary, the short text multi-label classification method provided by this embodiment first performs an initial classification of the short text to be classified using single classification models, then performs a secondary classification using a multi-classification model composed of binary classification models, and additionally screens the result with classification mutually exclusive tag pairs. This realizes multi-label classification of short texts while reducing the complexity of data processing and increasing the processing speed.
Referring to fig. 4, in a second aspect, the present application provides a short text multi-label classification apparatus, including:
a first obtaining module 401, configured to obtain a short text to be classified;
a single classification model calculation module 402, configured to obtain a first forward classification probability set using the single classification models corresponding to the classification labels, where the first forward classification probability set is composed of the forward classification probabilities of the short text to be classified under the different classification labels, together with the corresponding classification labels, as computed by the single classification models;
a first screening module 403, configured to screen the forward classification probabilities in the first forward classification probability set to obtain a first target forward classification probability set;
a determining module 404, configured to determine whether each forward classification probability in the first target forward classification probability set is greater than or equal to a first preset classification threshold, and if the forward classification probability is greater than or equal to the first preset classification threshold, determine a classification label corresponding to the forward classification probability as a first classification category to which the short text to be classified belongs;
and, if the forward classification probability is smaller than the first preset classification threshold, determine the classification label corresponding to the forward classification probability as a remaining classification label;
a multi-classification model calculation module 405, configured to classify the short text to be classified using a multi-classification model to obtain a second forward classification probability set, where the multi-classification model is composed of binary classification models corresponding to the remaining classification labels, and the second forward classification probability set is composed of the forward classification probabilities of the short text to be classified under the different remaining classification labels, as computed by the multi-classification model;
a second screening module 406, configured to screen the forward classification probabilities in the second forward classification probability set to obtain a second classification category set, which is composed of the classification labels corresponding to the forward classification probabilities retained after screening;
and an output module 407, configured to merge the first classification categories and the second classification category set to obtain the classification result.
In summary, the short text multi-label classification apparatus provided by this embodiment performs an initial classification of the short text to be classified using single classification models, and then performs a secondary classification using a multi-classification model composed of binary classification models, thereby realizing multi-label classification of short texts while reducing the complexity of data processing and increasing the processing speed.
Further, referring to fig. 5, the first screening module 403 includes:
a sorting unit 501, configured to sort the forward classification probabilities in the first forward classification probability set from high to low;
an extracting unit 502, configured to extract the top N forward classification probabilities after sorting and their corresponding classification labels to obtain a first target forward classification probability set.
Further, referring to fig. 6, the second screening module 406 includes:
a first determining unit 601, configured to determine whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold, and if the forward classification probability is greater than or equal to the second preset classification threshold, determine a classification label corresponding to the forward classification probability as a second classification category to which the short text to be classified belongs;
a first output unit 602, configured to aggregate all the second classification categories to obtain a second classification category set.
Further, referring to fig. 7 and 8, the apparatus further includes:
a second obtaining module 701, configured to obtain classification mutually exclusive tag pairs;
the second screening module 406 includes:
a screening unit 801, configured to determine whether each forward classification probability in the second forward classification probability set is greater than or equal to a second preset classification threshold and, if so, extract the forward classification probability and its corresponding classification label to obtain a second target forward classification probability set;
a second determining unit 802, configured to determine, using the classification mutually exclusive tag pairs, whether any mutually exclusive tag pair exists in the second target forward classification probability set; if so, remove the classification label corresponding to the lower forward classification probability in the pair and determine the classification labels corresponding to the remaining forward classification probabilities in the second target forward classification probability set as second classification categories to which the short text to be classified belongs;
and, if no classification mutually exclusive tag pair exists in the second target forward classification probability set, determine the classification labels corresponding to all the forward classification probabilities in the set as second classification categories to which the short text to be classified belongs;
a second output unit 803, configured to aggregate all the second classification categories to obtain a second classification category set.
Further, referring to fig. 7, the second obtaining module 701 includes:
an obtaining unit 7011, configured to obtain training samples labeled with classification labels;
a classification mutual exclusion probability matrix calculation unit 7012, configured to calculate a classification mutual exclusion probability matrix from the classification labels of the training samples, where the matrix is composed of the probabilities that each classification label appears in the same training sample as each of the other classification labels;
a classification mutually exclusive tag pair determining unit 7013, configured to sequentially determine whether each probability in the classification mutual exclusion probability matrix is smaller than a preset mutual exclusion threshold and, if so, determine the two classification labels corresponding to that probability as a classification mutually exclusive tag pair.
In summary, the present application provides a short text multi-label classification method and apparatus. The method first performs an initial classification of the short text to be classified using single classification models and then performs a secondary classification using a multi-classification model composed of binary classification models, thereby realizing multi-label classification of short texts while reducing the complexity of data processing and increasing the processing speed.
Those skilled in the art will clearly understand that the techniques in the embodiments of the present application may be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present application may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the method described in the embodiments, or in certain parts of the embodiments, of the present application.
The embodiments of the present disclosure are described in a progressive manner; for the same or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, since the apparatus embodiment is substantially similar to the method embodiment, its description is relatively brief, and reference may be made to the corresponding parts of the method embodiment.