CN110968693A

CN110968693A - Multi-label text classification calculation method based on ensemble learning

Info

Publication number: CN110968693A
Application number: CN201911085655.2A
Authority: CN
Inventors: 马应龙; 闫君璐; 李莉敏; 张冰; 陈亮; 王乔木; 张大伟; 王玮; 郗子月
Original assignee: State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; North China Electric Power University; Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; North China Electric Power University; Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Priority date: 2019-11-08
Filing date: 2019-11-08
Publication date: 2020-04-07

Abstract

The invention belongs to the technical field of computer text classification, and particularly relates to a multi-label text classification calculation method based on ensemble learning, which comprises the following steps: step 1: preprocessing an original data set, segmenting sentences into independent words, and deleting non-keywords; step 2: performing feature extraction and vectorization on the text using term frequency-inverse document frequency (TF-IDF); step 3: decomposing the multi-label learning problem into a plurality of independent binary classification problems by adopting the Binary Relevance method, wherein each binary classification problem corresponds to one label in the label space; and step 4: classifying the labels by adopting an ensemble learning algorithm. The method reduces time complexity, improves training speed, improves the generalization capability of the weak learner, reduces the risk of overfitting, and increases the robustness of the model.

Description

Multi-label text classification calculation method based on ensemble learning

Technical Field

The invention belongs to the technical field of computer text classification, and particularly relates to a multi-label text classification calculation method based on ensemble learning.

Background

Classification, a form of data analysis and mining, extracts models that describe important data classes and predict the class of data objects. Depending on the number of class labels assigned to each sample, classification problems are divided into single-label and multi-label problems. The purpose of multi-label classification is to predict, for an example that may be associated with multiple classes, which labels are associated with it.

Multi-label learning algorithms can be broadly divided into two families: problem transformation methods and algorithm adaptation methods. The first family is algorithm independent: it converts a multi-label classification task into one or more single-label classification, regression, or label ranking tasks, so that the multi-label problem is solved in another learning scenario. Representative algorithms include Binary Relevance (BR) and Classifier Chains (CC), which convert the multi-label learning task into binary classification tasks; Calibrated Label Ranking, a second-order method that converts it into a label ranking task; and Random k-labelsets, which converts it into multi-class classification tasks. The second family extends a specific learning algorithm so that it processes multi-label data directly. Common algorithms such as decision trees, support vector machines, neural networks, Bayesian methods, and boosting can be adapted in this way. Representative algorithms include ML-kNN, which adapts lazy learning; ML-DT, which adapts decision trees; Rank-SVM, which adapts kernel techniques; and CML, which adapts information-theoretic techniques.

The existing algorithms all have disadvantages to some degree. Algorithms such as BR, CC, and LP may create a sample imbalance problem after the problem transformation: the binary classifiers are induced from data sets in which negative examples tend to outnumber positive examples. A further problem is the high dimensionality of the label space, which may worsen the sample imbalance and also increases the number of classifiers to be trained. Robustness may also be poor, especially in the handling of noise points. ML-kNN consists of two main steps. In the first step, the k nearest neighbors in the training set are identified for each test case. In the second step, the label set of the test case is determined by the maximum a posteriori principle, based on statistical information obtained from the label sets of these neighbors. However, the time complexity of this approach is relatively high.

Disclosure of Invention

To address the above technical problems, the invention provides a multi-label text classification calculation method based on ensemble learning, which comprises the following steps:

step 1: preprocessing an original data set, segmenting sentences into independent words, and deleting non-keywords;

step 2: performing feature extraction and vectorization on the text using term frequency-inverse document frequency (TF-IDF);

step 3: decomposing the multi-label learning problem into a plurality of independent binary classification problems by adopting the Binary Relevance method, wherein each binary classification problem corresponds to one label in the label space;

step 4: classifying the labels by adopting an ensemble learning algorithm.

The non-keywords include: pronouns, prepositions, conjunctions.

The step 2 comprises the following steps: the number of times a word appears in the text and the frequency of occurrence in the data set are counted to calculate the importance of the word in the data set.

The step 3 comprises the following steps:

first, judging whether each training example is associated with a given label, and creating a corresponding binary training set;

then, inducing a binary classifier for each label by using the Binary Relevance method, wherein the training examples related to the label are positive examples and the training examples not related to the label are negative examples; for an example whose label relevance is unknown, the relevant label set is predicted by combining the outputs of the binary classifiers.

The step 4 comprises the following steps:

a boosting algorithm is adopted to connect a plurality of base classifiers in series; after each base classifier has classified the samples, new weights based on the negative gradient are recalculated according to the positive and negative errors of the classification and passed to the next base classifier; the errors are corrected continuously over the iterations.

The invention has the beneficial effects that:

The method combines the Binary Relevance method with ensemble learning: the multi-label problem is first transformed into single-label problems, and the data are then trained by classifiers connected in series. The method performs well on the data sets used in the experiments; it reduces time complexity, improves training speed, improves the generalization capability of the weak learner, reduces the risk of overfitting, and improves the robustness of the model.

Since the labels are independent, labels can be added and deleted without affecting the rest of the model. This makes the method suitable for changing or dynamic scenarios and offers the opportunity for parallel implementation.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The invention provides a multi-label text classification calculation method based on ensemble learning, the steps of which are described in detail below.

Preprocessing is an important task in preparing the data set, and it is crucial to preprocess the data before a machine learning method is applied. It consists of two subtasks: (1) word segmentation and (2) stop-word removal.

The purpose of word segmentation is to segment a sentence into individual words. The original data is only a sequence of characters, and word segmentation divides the text into small units, such as English words or Chinese phrases. This makes it possible to analyze the content of the text in more depth and to understand the meaning the text expresses.

The purpose of stop-word removal is to eliminate unnecessary words, such as pronouns, prepositions, conjunctions, and other non-keywords. These words mainly serve to reinforce the grammatical structure of a sentence. Words that are common in text but provide little valuable information about its content are called stop words.
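The two preprocessing subtasks can be sketched as follows. This is a minimal illustration for English text only; the stop-word list and the function name `preprocess` are assumptions made for the example and do not come from the patent. Chinese text would need a dedicated word segmenter, which is not shown here.

```python
import re

# Hypothetical, deliberately tiny stop-word list (pronouns, prepositions,
# conjunctions). A real system would use a fuller, language-specific list.
STOP_WORDS = {
    "i", "you", "he", "she", "it", "we", "they",
    "in", "on", "at", "of", "to", "with",
    "and", "or", "but",
}

def preprocess(sentence: str) -> list[str]:
    """Segment a sentence into lowercase words and drop stop words."""
    # Word segmentation: keep runs of letters and digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    # Stop-word removal: discard tokens found in the stop-word list.
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("She reported engine faults in June and it failed")` keeps only `reported`, `engine`, `faults`, `june`, and `failed`.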

In text classification, a document is described by a vector of features and feature values, also called attributes and attribute values. A common form of text representation assigns a TF-IDF (term frequency-inverse document frequency) weight to each word (feature). TF-IDF is a weighting technique commonly used in data mining. It statistically combines the frequency of a word within a text and its document frequency across the whole data set to calculate the importance of the word in the data set. The advantage of this approach is that it filters out words that are common but contribute little to the meaning of an article, while retaining the key words that shape the meaning of the text.

TF-IDF is an abbreviation of Term Frequency-Inverse Document Frequency. It consists of two parts, TF and IDF.

TF, the term frequency, represents how frequently a given keyword occurs in a text.

IDF is the inverse document frequency. The document frequency is the number of documents in the data set in which a keyword appears, and the inverse document frequency is, as the name implies, inversely related to it. The fewer documents contain the feature word t, the larger its IDF value, and the better the word distinguishes between categories.

Thus, the more frequently a feature word occurs in a text and the fewer documents in the whole data set contain it, the higher its TF-IDF value. TF-IDF can therefore filter out common but non-key words and retain important feature words. The TF-IDF value is the product of the two parts:

TF-IDF = TF × IDF
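The product above can be illustrated with a short sketch. The patent does not state the exact TF and IDF formulas, so this example assumes one common variant: TF is the word count divided by the document length, and IDF is the natural logarithm of the number of documents divided by the number of documents containing the word. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {word: TF-IDF weight} dict per tokenized document.

    Assumed variant: TF = count / document length,
    IDF = ln(number of documents / documents containing the word).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights
```

Under this variant, a word that appears in every document receives weight zero, which matches the filtering behavior described above.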

Multi-label text classification

In general, it is not possible to assign each document a single label, because category spaces naturally overlap. As previously described, in multi-label text classification a document may belong to multiple classes. Not only does the training data contain documents with multiple labels, the classifier must also be able to map a single document to multiple classes, so the training algorithm must be adjusted to process multiple labels. Multi-label text classification may be defined as a classification task in which a classifier or a set of classifiers assigns each document zero or more predefined class labels.

The basic idea of the Binary Relevance algorithm is to decompose the multi-label learning problem into q independent binary classification problems, where each binary classification problem corresponds to one label in the label space.

For the j-th class label, denoted $y_j$, Binary Relevance first determines whether each training example is related to $y_j$ and creates a corresponding binary training set $D_j$:

$$D_j = \{(x_i, \phi(Y_i, y_j)) \mid 1 \le i \le N\}$$

where

$$\phi(Y_i, y_j) = \begin{cases} +1, & y_j \in Y_i \\ -1, & \text{otherwise} \end{cases}$$

On this basis, a binary classifier $p_j: X \to \mathbb{R}$ is induced by a binary learning algorithm $B$, i.e. $p_j \leftarrow B(D_j)$. Thus, for any multi-label training sample $(x_i, Y_i)$, $x_i$ participates in the learning process of all $q$ binary classifiers. For a relevant label $y_j \in Y_i$, $x_i$ is treated as a positive example in inducing $p_j$. On the other hand, for an irrelevant label $y_k \in \bar{Y}_i$, $x_i$ is treated as a negative example. This training strategy is called cross-training.
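The construction of the binary training sets can be sketched as follows. The sketch assumes that the indicator φ returns +1 when the label is relevant to the example and -1 otherwise, which is the usual Binary Relevance convention; the function name `binary_training_sets` and the data layout are hypothetical.

```python
def binary_training_sets(samples, labels):
    """Build one binary training set D_j per label y_j.

    samples: list of (x_i, Y_i) pairs, where Y_i is the set of labels of x_i.
    labels:  the label space [y_1, ..., y_q].
    Returns {y_j: [(x_i, +1 or -1), ...]}.
    """
    return {
        # Assumed indicator: +1 if y_j is relevant to x_i, otherwise -1.
        y_j: [(x_i, 1 if y_j in Y_i else -1) for x_i, Y_i in samples]
        for y_j in labels
    }
```

Each sample appears in all q training sets, once per label, which is the cross-training strategy described above.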

For an unseen sample $x$, Binary Relevance predicts its relevant label set $Y$ by querying the label relevance on each binary classifier and then combining the relevant labels:

$$Y = \{y_j \mid p_j(x) > 0,\ 1 \le j \le q\}$$

Note that the predicted label set $Y$ is empty when all binary classifiers produce negative outputs. To avoid empty predictions, the T-Criterion rule may be applied:

$$Y = \{y_j \mid p_j(x) > 0,\ 1 \le j \le q\} \cup \{y_{j^*} \mid j^* = \arg\max_{1 \le j \le q} p_j(x)\}$$

Briefly, the T-Criterion rule supplements the first equation by including the class label with the largest (least negative) output when none of the binary classifiers predicts a positive value.
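The prediction rule with the T-Criterion fallback can be sketched as follows. The function name `predict_labels` and the dictionary layout are hypothetical; the real-valued scores stand for the outputs of the per-label binary classifiers.

```python
def predict_labels(scores: dict[str, float]) -> set[str]:
    """Combine per-label classifier outputs into a predicted label set.

    scores maps each label y_j to the real-valued output p_j(x).
    """
    # Basic rule: keep every label whose classifier output is positive.
    predicted = {label for label, score in scores.items() if score > 0}
    if not predicted and scores:
        # T-Criterion: no positive output, so fall back to the label
        # with the largest (least negative) output.
        predicted = {max(scores, key=scores.get)}
    return predicted
```

The fallback only matters when every output is non-positive; when at least one output is positive, the label with the largest output is already in the set.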

For the learning algorithm, we select boosting from among the ensemble learning algorithms, i.e. several base classifiers are connected in series. After each base classifier has classified the samples, new weights based on the negative gradient are recalculated according to the classification errors and passed to the next base classifier. Over the iterations, the error is corrected continuously. Here, the negative gradient serves as a measure of the error of the previous round's base learner, and the next round of learning corrects that error by fitting the negative gradient, as in gradient descent. The final goal is a fitting function $f(x)$ that approaches the true value as closely as possible. Because it is difficult to find a very accurate fitting function directly, a less accurate $f_0(x)$ is initialized first, and the residual is calculated as the difference between the true value $y$ and $f_0(x)$:

$$r_0(x) = y - f_0(x)$$

The problem now turns to finding a suitable $h_0$ to fit $r_0$. The fitting function of the next step can be expressed as $f_1(x) = f_0(x) + h_0(x)$. However, a residual still remains:

$$r_1(x) = y - f_1(x)$$

The next suitable $h_1$ is then found to fit $r_1$, and the fitting function can be expressed as

$$f_2(x) = f_1(x) + h_1(x)$$

The iteration continues in this way, each round making a correction so that $f(x)$ continually approaches the true value $y$. The fitting function $f_m(x)$ generated by the m-th iteration is equal to the function $f_{m-1}(x)$ of the previous iteration plus the function $h_{m-1}$ fitted to the residual of the previous iteration. This can be expressed as:

$$f_m(x) = f_{m-1}(x) + h_{m-1}(x), \quad m = 1, 2, \ldots, M$$
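Unrolling this recursion makes the additive structure explicit. The short derivation below uses only the definitions of the residual and the update given above: each new residual is the previous residual minus the function fitted to it, and the final model is the initial function plus the sum of all fitted corrections.

```latex
\begin{aligned}
r_m(x) &= y - f_m(x) = y - f_{m-1}(x) - h_{m-1}(x) = r_{m-1}(x) - h_{m-1}(x), \\
f_M(x) &= f_0(x) + \sum_{m=0}^{M-1} h_m(x).
\end{aligned}
```

If each $h_{m-1}$ fits $r_{m-1}$ well, the remaining residual $r_m$ shrinks from round to round.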

This is the general iterative training procedure of a boosting algorithm. Specifically, the weak classifier is initialized first:

$$f_0(x) = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, \rho)$$

The constant $\rho$ minimizes the initial prediction loss function $L$. The loss function here is the square loss $L(y, f(x)) = (y - f(x))^2$, so the total loss over all $N$ samples is

$$\sum_{i=1}^{N} L(y_i, f(x_i))$$

The purpose of the iteration is to minimize the loss by following the direction in which it decreases fastest. In each iteration, the negative gradient of $L(y, f(x))$ with respect to $f(x)$ is calculated:

$$-g_m(x_i) = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}$$

where $i = 1, 2, 3, \ldots, N$.

A fitting function $h(x; \alpha)$ is then constructed to fit the negative gradient $-g_m(x)$, and $\alpha_m$ is obtained by minimizing the square error:

$$\alpha_m = \arg\min_{\alpha} \sum_{i=1}^{N} \left[-g_m(x_i) - h(x_i; \alpha)\right]^2$$

Based on the obtained parameter $\alpha_m$, the optimal $\beta_m$ is calculated by:

$$\beta_m = \arg\min_{\beta} \sum_{i=1}^{N} L\left(y_i, f_{m-1}(x_i) + \beta h(x_i; \alpha_m)\right)$$

Finally, the results are merged into the model, and the prediction function is updated:

$$f_m(x) = f_{m-1}(x) + \beta_m h_m(x; \alpha_m)$$

The iteration terminates when the preset number of iterations $M$ is reached.
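The iterative procedure can be sketched in a few lines for the square loss. This is a minimal illustration under stated assumptions, not the patented implementation: the weak learner is assumed to be a one-dimensional regression stump, the functions `fit_stump`, `stump_predict`, `gradient_boost`, and `boosted_predict` are hypothetical names, and the step size is found in closed form, which is possible because the loss is quadratic. For the loss (y - f)^2 the negative gradient is 2(y - f), so it is proportional to the residual.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on 1-D inputs.

    Returns (threshold, left_value, right_value).
    """
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right) if right else lv
        err = sum((r - (lv if x <= t else rv)) ** 2
                  for x, r in zip(xs, targets))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def stump_predict(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def gradient_boost(xs, ys, n_rounds):
    """Fit f_M(x) = f_0 + sum of beta_m * h_m(x) under the loss (y - f)^2."""
    f0 = sum(ys) / len(ys)  # the constant that minimizes the square loss
    preds = [f0] * len(ys)
    model = []
    for _ in range(n_rounds):
        # Negative gradient of (y - f)^2 with respect to f is 2 * (y - f).
        neg_grad = [2.0 * (y - p) for y, p in zip(ys, preds)]
        # Fit the weak learner to the negative gradient by least squares.
        stump = fit_stump(xs, neg_grad)
        h = [stump_predict(stump, x) for x in xs]
        # Closed-form line search for beta under the square loss.
        denom = sum(v * v for v in h)
        beta = (sum((y - p) * v for y, p, v in zip(ys, preds, h)) / denom
                if denom else 0.0)
        preds = [p + beta * v for p, v in zip(preds, h)]
        model.append((beta, stump))
    return f0, model

def boosted_predict(f0, model, x):
    return f0 + sum(beta * stump_predict(s, x) for beta, s in model)
```

In a Binary Relevance setting, one such model could be trained per label with targets +1 and -1, and the sign of the output used as the label decision; that pairing is an assumption of this sketch.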

To evaluate the performance of the multi-label classification algorithm, we tested the algorithm on the following two differently sized data sets. Table 1 shows the data set information used.

TABLE 1 Data set information

Name	Type	Number of samples	Number of labels
TMC 2007	Text	28596	22
EPdata	Text	155	22

The method of the present invention was compared with the ML-kNN algorithm.

The performance of each algorithm is evaluated on three measures: precision, recall, and F1 value.

True Positive (TP): the number of positive samples predicted as positive

True Negative (TN): the number of negative samples predicted as negative

False Positive (FP): the number of negative samples predicted as positive

False Negative (FN): the number of positive samples predicted as negative

Precision is defined with respect to the prediction results: it indicates how many of the samples predicted to be positive are truly positive. A positive prediction has two possible outcomes: a positive sample predicted as positive (TP) or a negative sample predicted as positive (FP), i.e.

$$P = \frac{TP}{TP + FP}$$

Recall is defined with respect to the original samples: it indicates how many of the positive samples are predicted correctly. There are again two possibilities: a positive sample predicted as positive (TP) or a positive sample predicted as negative (FN), i.e.

$$R = \frac{TP}{TP + FN}$$

The F1 value is the harmonic mean of precision and recall:

$$F1 = \frac{2PR}{P + R}$$
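The three measures follow directly from the TP, FP, and FN counts, as the sketch below shows. The function name `precision_recall_f1` is hypothetical, and returning 0.0 when a denominator is zero is a convention assumed for this example, not something the patent specifies.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion counts.

    Assumed convention: a measure is 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For instance, with 6 true positives, 2 false positives, and 4 false negatives, precision is 0.75 and recall is 0.6.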

TABLE 2 Comparison of precision results

Data set	ML-kNN	Method of the present invention
TMC 2007	0.61	0.67
EPdata	0.18	0.51

TABLE 3 Comparison of recall results

Data set	ML-kNN	Method of the present invention
TMC 2007	0.62	0.65
EPdata	0.13	0.42

TABLE 4 Comparison of F1 value results

Data set	ML-kNN	Method of the present invention
TMC 2007	0.57	0.62
EPdata	0.14	0.44

From the above experimental results, it can be seen that, for the same measure, the same algorithm performs very differently on different data sets, and different algorithms also behave differently on the same data set. Comparing the two algorithms, the method of the present invention scores higher than ML-kNN on every measure on both data sets; on EPdata, for example, the F1 value rises from 0.14 to 0.44. This illustrates the effectiveness of combining ensemble learning with Binary Relevance.

The embodiments are merely preferred embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily made by those skilled in the art within the technical scope of the present invention will be covered by the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

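1. A multi-label text classification calculation method based on ensemble learning, characterized by comprising the following steps:

step 1: preprocessing an original data set, segmenting sentences into independent words, and deleting non-keywords;

step 2: performing feature extraction and vectorization on the text using term frequency-inverse document frequency (TF-IDF);

step 3: decomposing the multi-label learning problem into a plurality of independent binary classification problems by adopting the Binary Relevance method, wherein each binary classification problem corresponds to one label in the label space;

step 4: classifying the labels by adopting an ensemble learning algorithm.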

2. The multi-label text classification computation method of claim 1, wherein the non-keywords comprise: pronouns, prepositions, conjunctions.

3. The multi-label text classification calculation method according to claim 1, wherein the step 2 comprises: the number of times a word appears in the text and the frequency of occurrence in the data set are counted to calculate the importance of the word in the data set.

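4. The multi-label text classification calculation method according to claim 1, wherein the step 3 comprises:

first, judging whether each training example is associated with a given label, and creating a corresponding binary training set;

then, inducing a binary classifier for each label by using the Binary Relevance method, wherein the training examples related to the label are positive examples and the training examples not related to the label are negative examples; for an example whose label relevance is unknown, the relevant label set is predicted by combining the outputs of the binary classifiers.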

5. The multi-label text classification calculation method according to claim 1, wherein the step 4 comprises:

a boosting algorithm is adopted to connect a plurality of base classifiers in series; after each base classifier has classified the samples, new weights based on the negative gradient are recalculated according to the positive and negative errors of the classification and passed to the next base classifier; the errors are corrected continuously over the iterations.