CN110968693A - Multi-label text classification calculation method based on ensemble learning


Info

Publication number
CN110968693A
CN110968693A
Authority
CN
China
Prior art keywords
label
text
binary
classification
calculation method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911085655.2A
Other languages
Chinese (zh)
Inventor
马应龙
闫君璐
李莉敏
张冰
陈亮
王乔木
张大伟
王玮
郗子月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
North China Electric Power University
Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
North China Electric Power University
Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd, North China Electric Power University, Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN201911085655.2A priority Critical patent/CN110968693A/en
Publication of CN110968693A publication Critical patent/CN110968693A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention belongs to the technical field of computer text classification, and particularly relates to a multi-label text classification calculation method based on ensemble learning, which comprises the following steps: step 1: preprocessing an original data set, segmenting sentences into independent words, and deleting non-keywords; step 2: performing feature extraction and vectorization on the text by means of term frequency-inverse document frequency; step 3: decomposing the multi-label learning problem into a plurality of independent binary classification problems by a binary relevance method, wherein each binary classification problem corresponds to one label in the label space; step 4: classifying the labels by an ensemble learning algorithm. The method reduces time complexity, improves training speed, improves the generalization ability of the weak learners, reduces the risk of overfitting, and increases the robustness of the model.

Description

Multi-label text classification calculation method based on ensemble learning
Technical Field
The invention belongs to the technical field of computer text classification, and particularly relates to a multi-label text classification calculation method based on ensemble learning.
Background
Classification, a form of data analysis and mining, extracts models that describe important classes of data and can be used to predict the class of a data object. According to the number of class labels a sample may receive, classification problems divide into single-label classification and multi-label classification. The purpose of multi-label classification is to predict, for an example that may be associated with multiple classes, which labels are relevant to it.
Multi-label learning algorithms fall broadly into two categories: problem transformation methods and algorithm adaptation methods. The first group of methods is algorithm-independent: they convert a multi-label classification task into one or more single-label classification, regression, or label ranking tasks, solving the multi-label learning problem by transforming it into other learning scenarios. Representative algorithms include Binary Relevance (BR) and Classifier Chains (CC), which convert the multi-label learning task into binary classification tasks; Calibrated Label Ranking, a second-order method that converts it into a label ranking task; and Random k-Labelsets, which converts it into multi-class classification tasks. The second group of methods extends a specific learning algorithm to handle multi-label data directly: a common learning algorithm is modified so that it can process multi-label data, thereby solving the multi-label learning problem. Common algorithms such as decision trees, support vector machines, neural networks, Bayesian methods, and boosting can all be adapted in this way. Representative algorithms include ML-kNN, which adapts lazy learning; ML-DT, which adapts decision trees; Rank-SVM, which adapts kernel techniques; and CML, which adapts information-theoretic techniques.
Existing algorithms all have drawbacks to varying degrees. Algorithms such as BR, CC, and LP can suffer from sample imbalance once the problem transformation is applied: the binary classifiers induced from the data set tend to see far more negative examples than positive ones. A further problem arises when the label space is high-dimensional, which can aggravate the sample imbalance and also increases the number of classifiers to be trained. Robustness can also be poor, especially in the presence of noisy points. ML-kNN consists of two main steps: first, for each test instance, its k nearest neighbors in the training set are identified; second, based on statistics gathered from the label sets of these neighbors, the maximum a posteriori principle is used to determine the label set of the test instance. However, the time complexity of this approach is relatively high.
Disclosure of Invention
To address the above technical problems, the invention provides a multi-label text classification calculation method based on ensemble learning, which comprises the following steps:
step 1: preprocessing an original data set, segmenting sentences into independent words, and deleting non-keywords;
step 2: performing feature extraction and vectorization on the text by means of term frequency-inverse document frequency;
step 3: decomposing the multi-label learning problem into a plurality of independent binary classification problems by a binary relevance method, wherein each binary classification problem corresponds to one label in the label space;
step 4: classifying the labels by an ensemble learning algorithm.
The non-keywords include pronouns, prepositions, and conjunctions.
Step 2 comprises: counting the number of times a word appears in a text and its document frequency across the data set to calculate the importance of the word in the data set.
Step 3 comprises:
firstly, judging whether each training example is associated with each label, and creating the corresponding binary training sets;
then inducing the binary classifiers with the binary relevance method, wherein training examples relevant to a label are positive examples and training examples irrelevant to that label are negative examples; for an example whose label relevance is unknown, the relevant label set is predicted.
Step 4 comprises:
adopting a boosting algorithm to connect a plurality of base classifiers in series; after each base classifier is classified, new weights based on the negative gradient are recomputed from the classification errors and loaded into the next base classifier; the errors are corrected continuously over the iterations.
The invention has the beneficial effects that:
the method combines a binary correlation method with ensemble learning, singles the labels in a problem transformation mode, and then trains data in a serial classifier mode. The method has the advantages that good performance of the method is reflected on a data set used in an experiment, time complexity is reduced, training speed is improved, generalization capability of a weak learner is improved, overfitting risk is reduced, and robustness of a model is improved.
Since the tags are independent, tags can be added and deleted without affecting the rest of the model. This makes it suitable for changing or dynamic scenarios and results and offers the opportunity for parallel implementation.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention provides a multi-label text classification calculation method based on ensemble learning, which comprises the following steps:
step 1: preprocessing an original data set, segmenting sentences into independent words, and deleting non-keywords;
step 2: performing feature extraction and vectorization on the text by means of term frequency-inverse document frequency;
step 3: decomposing the multi-label learning problem into a plurality of independent binary classification problems by a binary relevance method, wherein each binary classification problem corresponds to one label in the label space;
step 4: classifying the labels by an ensemble learning algorithm.
The preprocessing stage is an important task in data set construction, and preprocessing the data before applying a machine learning method is crucial. It consists of two subtasks: (1) word segmentation and (2) stop-word removal. A minimal sketch of this step is given after the following two paragraphs.
The purpose of word segmentation is to split a sentence into individual words. Raw text is merely a sequence of characters; word segmentation divides it into small units such as English words or Chinese phrases, so that the content of the text can be analyzed more deeply and the meaning it expresses can be understood.
The purpose of stop-word removal is to eliminate unnecessary words such as pronouns, prepositions, conjunctions, and other non-keywords. These words mainly serve to reinforce the grammatical structure of a sentence; they are common in text yet provide little valuable information about its content, and are therefore called stop words.
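As a minimal sketch of step 1, assuming a Chinese corpus segmented with the third-party jieba library and a small hand-picked stop-word list (both are illustrative assumptions; the patent names neither a segmenter nor a lexicon):

```python
import jieba  # third-party Chinese word-segmentation library (assumed choice)

# Illustrative stop-word list; a real system would load a full lexicon.
STOP_WORDS = {"的", "了", "和", "在", "是", "有", "就"}

def preprocess(sentence: str) -> list[str]:
    """Step 1: segment a sentence into individual words, then drop stop words."""
    tokens = jieba.lcut(sentence)  # word segmentation
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]

print(preprocess("电网设备在运行中发生了故障"))
```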
In text classification, a document is described by a vector of features and feature values, also called attributes and attribute values. A common form of text representation assigns a TF-IDF (term frequency-inverse document frequency) weight to each word (feature). TF-IDF is a weighting technique commonly used in data mining: by simple statistics, it combines how often a word occurs within a text with the word's document frequency across the whole data set to compute the word's importance in the data set. The advantage of this approach is that it filters out words that are common but contribute little to the meaning of an article, while retaining the key words that shape the meaning expressed by the whole text.
TF-IDF is an abbreviation of Term Frequency-Inverse Document Frequency. It consists of two parts, TF and IDF.
TF, the term frequency, is the frequency with which a keyword occurs in a text.
IDF is the inverse document frequency. Document frequency refers to the number of documents in the data set in which a keyword appears; the inverse document frequency is, as the name implies, its inverse. The fewer documents in which a feature word t appears, the larger its IDF value and the stronger the word's ability to discriminate between categories.
Thus, the more frequently a feature word occurs within a text and the fewer documents of the data set it appears in, the higher its TF-IDF value. TF-IDF can therefore filter out common but non-key words while retaining important feature words. The TF-IDF value is the product of the two:
TF-IDF = TF × IDF
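As a sketch of the TF-IDF vectorization of step 2 (scikit-learn's TfidfVectorizer is used here as one common implementation; the library choice and the toy documents are assumptions, since the patent does not name a specific tool):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each document is the space-joined output of the preprocessing step.
docs = [
    "grid equipment fault report",
    "customer service work order about billing",
    "fault repair work order for grid equipment",
]

vectorizer = TfidfVectorizer()        # weight(t, d) = TF(t, d) * IDF(t)
X = vectorizer.fit_transform(docs)    # sparse document-term matrix

print(X.shape)                              # (3, number of distinct terms)
print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
```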
Multi-label text classification
In general, the natural overlap of category spaces makes it impossible to assign each document a single label. As noted above, in multi-label text classification a document may belong to multiple classes: not only does the training data contain documents with multiple labels, the classifier must also be able to map a single document to multiple classes, and the training algorithm must be adjusted to handle multiple labels. Multi-label text classification can be defined as a classification task in which a classifier, or a set of classifiers, assigns each document zero or more predefined class labels.
The basic idea of this algorithm is to decompose the multi-label learning problem into q independent binary classification problems, where each binary classification problem corresponds to one label in the label space.
For the j-th class label y_j, Binary Relevance first determines whether each training example is relevant to y_j and creates the corresponding binary training set
D_j = {(x_i, φ(Y_i, y_j)) | 1 ≤ i ≤ N}
where
φ(Y_i, y_j) = +1 if y_j ∈ Y_i, and −1 otherwise.
On this basis, a binary learning algorithm B induces a binary classifier p_j: X → ℝ, i.e., p_j ← B(D_j). Thus, for any multi-label training sample (x_i, Y_i), x_i participates in the learning process of all q binary classifiers. For a relevant label y_j ∈ Y_i, x_i is treated as a positive example when inducing p_j; for an irrelevant label y_j ∉ Y_i, x_i is treated as a negative example. The above training strategy is called cross-training.
For an unknown sample x, Binary Relevance predicts its relevant label set Y by querying each binary classifier and combining the relevant labels:
Y = {y_j | p_j(x) > 0, 1 ≤ j ≤ q}
Note that the predicted label set Y is empty when all binary classifiers produce negative outputs. To avoid producing an empty prediction, the T-Criterion rule may be applied:
Y = {y_j | p_j(x) > 0, 1 ≤ j ≤ q} ∪ {y_j* | j* = arg max_{1≤j≤q} p_j(x)}
In brief, when no binary classifier predicts a positive value, the T-Criterion rule supplements the prediction with the class label whose output is largest (least negative).
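A from-scratch sketch of this decomposition and of the T-Criterion fallback follows (the logistic-regression base learner and the function names are illustrative assumptions; the patent only requires one binary classifier per label):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_binary_relevance(X, Y):
    """Induce one binary classifier p_j per label column of Y (n_samples x q)."""
    return [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

def predict_t_criterion(classifiers, x):
    """Return {y_j | p_j(x) > 0}; if empty, fall back to the label with the
    largest (least negative) output, as the T-Criterion rule prescribes."""
    scores = np.array([c.decision_function(x.reshape(1, -1))[0] for c in classifiers])
    relevant = {j for j, s in enumerate(scores) if s > 0}
    return relevant if relevant else {int(np.argmax(scores))}
```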
For the learning algorithm we choose boosting from among the ensemble algorithms, i.e., several base classifiers connected in series. After each base classifier is trained, new weights based on the negative gradient are recomputed from the classification errors and loaded into the next base classifier, so the errors are corrected continuously over the iterations. Here the negative gradient serves as a measure of the error of the previous round's base learner: in the manner of gradient descent, the next round of learning fits the negative gradient. Ultimately we need a fitting function f(x) that approximates the true value y as closely as possible, but it is difficult to find a highly accurate fitting function directly. We therefore first initialize a less accurate f_0(x) and compute the residual as its difference from the true value y:
r_0(x) = y − f_0(x)
The problem now becomes finding a suitable h_0 to fit r_0; the next fitting function can be expressed as f_1(x) = f_0(x) + h_0(x). However, a residual still remains:
r_1(x) = y − f_1(x)
We then need the next suitable h_1 to fit r_1, and the fitting function becomes
f_2(x) = f_1(x) + h_1(x)
The iteration continues in this fashion, each round making a correction so that f(x) continually approaches the true value y. The fitting function f_m(x) produced by the m-th iteration equals the previous iteration's function f_{m−1}(x) plus the function h_{m−1} fitted to the previous iteration's residual. This can be expressed as:
f_m(x) = f_{m−1}(x) + h_{m−1}(x),  m = 1, 2, ..., M
This is the general iterative training procedure of the boosting algorithm. Specifically, the weak classifier is first initialized as
f_0(x) = arg min_ρ Σ_{i=1}^{N} L(y_i, ρ)
where the constant ρ minimizes the initial prediction loss function L. The loss function here is the square loss L(y, f(x)) = (y − f(x))², so the total loss over all N samples is
L = Σ_{i=1}^{N} (y_i − f(x_i))²
The purpose of the iteration is to minimize this loss by moving in the direction of steepest descent, so at each iteration the negative gradient of L(y, f(x)) with respect to f(x) is computed:
−g_m(x_i) = −[∂L(y_i, f(x_i)) / ∂f(x_i)], evaluated at f(x) = f_{m−1}(x)
where i = 1, 2, ..., N.
A fitting function h(x; α) is then constructed to fit the negative gradient −g_m(x), with the parameter α obtained by minimizing the squared error:
α_m = arg min_α Σ_{i=1}^{N} [−g_m(x_i) − h(x_i; α)]²
Based on the obtained parameter α_m, the optimal step size β_m is computed according to:
β_m = arg min_β Σ_{i=1}^{N} L(y_i, f_{m−1}(x_i) + β h(x_i; α_m))
Finally, the result is merged into the model and the prediction function is updated:
f_m(x) = f_{m−1}(x) + β_m h(x; α_m)
The iteration terminates when the preset number of iterations M is reached.
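A compact sketch of this training loop for the square loss, where the negative gradient equals the residual (the regression-tree base learner, tree depth, learning rate, and M are illustrative assumptions, not fixed by the patent):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=50, lr=0.1):
    """Each round fits h_m to the residual y - f_{m-1}(x), i.e. the negative
    gradient of the square loss, then updates f_m = f_{m-1} + lr * h_m."""
    f0 = y.mean()                        # f_0: constant minimizing the square loss
    f = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        residual = y - f                 # negative gradient for the square loss
        h = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        f += lr * h.predict(X)
        trees.append(h)
    return f0, trees

def boost_predict(model, X, lr=0.1):
    f0, trees = model
    return f0 + lr * sum(t.predict(X) for t in trees)
```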
To evaluate the performance of the multi-label classification algorithm, we tested it on the following two data sets of different sizes. Table 1 shows the data set information.
TABLE 1. Data set information

Name       Format   Number of samples   Number of labels
TMC 2007   Text     28596               22
EPdata     Text     155                 22
The method of the present invention was compared with the ML-kNN algorithm.
We evaluate the performance of each algorithm on three measures: precision, recall, and the F1 value.
True Positive (TP): the number of positive examples predicted as positive
True Negative (TN): the number of negative examples predicted as negative
False Positive (FP): the number of negative examples predicted as positive
False Negative (FN): the number of positive examples predicted as negative
Precision is defined with respect to the prediction results: it indicates how many of the samples predicted as positive are truly positive. A sample predicted as positive is either actually positive (TP) or actually negative (FP), so
Precision = TP / (TP + FP)
Recall is defined with respect to the original samples: it indicates how many of the positive examples in the sample set are predicted correctly. An original positive example is either predicted as positive (TP) or predicted as negative (FN), so
Recall = TP / (TP + FN)
The F1 value is the weighted harmonic mean of precision and recall:
F1 = 2 × Precision × Recall / (Precision + Recall)
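For reference, these three measures can be computed directly from the counts above (a minimal sketch; the example counts are invented for illustration):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example with invented counts: 60 TP, 30 FP, 35 FN
print(precision_recall_f1(60, 30, 35))  # ≈ (0.667, 0.632, 0.649)
```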
TABLE 2. Comparison of precision results

           ML-kNN   Algorithm of this invention
TMC 2007   0.61     0.67
EPdata     0.18     0.51

TABLE 3. Comparison of recall results

           ML-kNN   Algorithm of this invention
TMC 2007   0.62     0.65
EPdata     0.13     0.42

TABLE 4. Comparison of F1 value results

           ML-kNN   Algorithm of this invention
TMC 2007   0.57     0.62
EPdata     0.14     0.44
The above experimental results show that the same algorithm can perform very differently on different data sets for the same metric, and that different algorithms also behave differently on the same data set. Comparing the two algorithms, the present invention holds an overall advantage on every measured index, which demonstrates the effectiveness of combining ensemble learning with Binary Relevance.
The embodiments are merely preferred embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily made by those skilled in the art within the technical scope of the present invention will be covered by the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A multi-label text classification calculation method based on ensemble learning, characterized by comprising the following steps:
step 1: preprocessing an original data set, segmenting sentences into independent words, and deleting non-keywords;
step 2: performing feature extraction and vectorization on the text by means of term frequency-inverse document frequency;
step 3: decomposing the multi-label learning problem into a plurality of independent binary classification problems by a binary relevance method, wherein each binary classification problem corresponds to one label in the label space;
step 4: classifying the labels by an ensemble learning algorithm.
2. The multi-label text classification calculation method of claim 1, wherein the non-keywords comprise: pronouns, prepositions, and conjunctions.
3. The multi-label text classification calculation method according to claim 1, wherein step 2 comprises: counting the number of times a word appears in a text and its document frequency across the data set to calculate the importance of the word in the data set.
4. The multi-label text classification calculation method according to claim 1, wherein step 3 comprises:
firstly, judging whether each training example is associated with each label, and creating the corresponding binary training sets;
then inducing the binary classifiers with the binary relevance method, wherein training examples relevant to a label are positive examples and training examples irrelevant to that label are negative examples; for an example whose label relevance is unknown, the relevant label set is predicted.
5. The multi-label text classification calculation method according to claim 1, wherein step 4 comprises:
adopting a boosting algorithm to connect a plurality of base classifiers in series; after each base classifier is classified, new weights based on the negative gradient are recomputed from the classification errors and loaded into the next base classifier; the errors are corrected continuously over the iterations.
CN201911085655.2A 2019-11-08 2019-11-08 Multi-label text classification calculation method based on ensemble learning Pending CN110968693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911085655.2A CN110968693A (en) 2019-11-08 2019-11-08 Multi-label text classification calculation method based on ensemble learning

Publications (1)

Publication Number Publication Date
CN110968693A (en) 2020-04-07

Family

ID=70030440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911085655.2A Pending CN110968693A (en) 2019-11-08 2019-11-08 Multi-label text classification calculation method based on ensemble learning

Country Status (1)

Country Link
CN (1) CN110968693A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140250032A1 (en) * 2013-03-01 2014-09-04 Xerox Corporation Methods, systems and processor-readable media for simultaneous sentiment analysis and topic classification with multiple labels
CN106156029A (en) * 2015-03-24 2016-11-23 中国人民解放军国防科学技术大学 The uneven fictitious assets data classification method of multi-tag based on integrated study
CN107577785A (en) * 2017-09-15 2018-01-12 南京大学 A kind of level multi-tag sorting technique suitable for law identification
CN109036577A (en) * 2018-07-27 2018-12-18 合肥工业大学 Diabetic complication analysis method and device
CN109036556A (en) * 2018-08-29 2018-12-18 王雁 A method of keratoconus case is diagnosed based on machine learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MA Zhanguo et al.: "Improved Terms Weighting Algorithm of Text" *
YU Xuehao et al.: "Multi-label text classification for a power information and telecommunication customer service system based on BR and GBDT" (基于BR和GBDT的电力信息通信客服系统多标签文本分类) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182213A (en) * 2020-09-27 2021-01-05 中润普达(十堰)大数据中心有限公司 Modeling method based on abnormal lacrimation feature cognition
CN112182213B (en) * 2020-09-27 2022-07-05 中润普达(十堰)大数据中心有限公司 Modeling method based on abnormal lacrimation feature cognition
CN113011337A (en) * 2021-03-19 2021-06-22 山东大学 Chinese character library generation method and system based on deep meta learning
CN113705215A (en) * 2021-08-27 2021-11-26 南京大学 Meta-learning-based large-scale multi-label text classification method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination