CN110968693A - Multi-label text classification calculation method based on ensemble learning - Google Patents
Multi-label text classification calculation method based on ensemble learning
- Publication number
- CN110968693A
- Authority
- CN
- China
- Prior art keywords
- label
- text
- binary
- classification
- calculation method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The invention belongs to the technical field of computer text classification, and particularly relates to a multi-label text classification calculation method based on ensemble learning, which comprises the following steps: step 1: preprocessing an original data set by segmenting sentences into individual words and deleting non-keywords; step 2: extracting features and vectorizing the text using term frequency-inverse document frequency (TF-IDF); step 3: decomposing the multi-label learning problem into a plurality of independent binary classification problems by the Binary Relevance method, wherein each binary classification problem corresponds to one label in the label space; and step 4: classifying the labels with an ensemble learning algorithm. The method reduces the time complexity, improves the training speed, improves the generalization capability of the weak learner, reduces the risk of overfitting, and increases the robustness of the model.
Description
Technical Field
The invention belongs to the technical field of computer text classification, and particularly relates to a multi-label text classification calculation method based on ensemble learning.
Background
Classification, a form of data analysis and mining, extracts models that describe important classes of data and uses them to predict the class of data objects. Depending on the number of class labels predicted for each sample, classification problems are divided into single-label and multi-label problems. The purpose of multi-label classification is to predict, for an example that may be associated with several classes, which labels are associated with it.
Multi-label learning algorithms can be broadly divided into two families: problem transformation methods and algorithm adaptation methods. The first family is algorithm independent: it converts a multi-label classification task into one or more single-label classification, regression, or label ranking tasks, so that the problem is solved in another learning scenario. Representative algorithms include Binary Relevance (BR) and Classifier Chains (CC), which convert the multi-label learning task into binary classification tasks; Calibrated Label Ranking, a second-order method that converts it into a label ranking task; and Random k-labelsets, which converts it into multi-class classification tasks. The second family extends specific learning algorithms so that they process multi-label data directly. Common algorithms such as decision trees, support vector machines, neural networks, Bayesian methods, and boosting can be adapted in this way. Representative algorithms include ML-kNN, which adapts lazy learning; ML-DT, which adapts decision trees; Rank-SVM, which adapts kernel techniques; and CML, which adapts information-theoretic techniques.
Existing algorithms have several disadvantages. Algorithms such as BR, CC, and LP may create a sample imbalance problem after the problem transformation: the data set yields n binary classifiers in which the negative examples tend to outnumber the positive examples. A further problem is the high dimensionality of the label space, which may aggravate the sample imbalance and also increases the number of classifiers to be trained. Robustness may also be poor, especially in the handling of noise points. ML-kNN consists of two main steps. In the first step, the k nearest neighbors in the training set are identified for each test case. In the second step, the maximum a posteriori principle is used to determine the label set of the test case from statistical information obtained from the label sets of these neighbors. However, the time complexity of this approach is relatively high.
Disclosure of Invention
Aiming at the technical problem, the invention provides a multi-label text classification calculation method based on ensemble learning, which comprises the following steps:
step 1: preprocessing an original data set by segmenting sentences into individual words and deleting non-keywords;
step 2: extracting features and vectorizing the text using term frequency-inverse document frequency (TF-IDF);
step 3: decomposing the multi-label learning problem into a plurality of independent binary classification problems by the Binary Relevance method, wherein each binary classification problem corresponds to one label in the label space;
step 4: classifying the labels with an ensemble learning algorithm.
The non-keywords include pronouns, prepositions, and conjunctions.
Step 2 comprises: counting the number of times a word appears in a text and its document frequency in the data set, in order to calculate the importance of the word in the data set.
Step 3 comprises:
first, judging whether each training example is associated with a given label, and creating the corresponding binary training set;
then inducing a binary classifier for each label with a binary learning algorithm, wherein training examples relevant to the label are positive examples and training examples irrelevant to the label are negative examples; for an example whose label relevance is unknown, its relevant label set is predicted.
Step 4 comprises:
using a boosting algorithm to chain a plurality of base classifiers in series; after each base classifier has classified the samples, new weights based on the negative gradient are recalculated from the positive and negative errors of the classification and passed to the next base classifier, so that the errors are corrected continuously over the iterations.
The invention has the beneficial effects that:
The method combines the Binary Relevance method with ensemble learning: it handles each label separately through problem transformation and then trains on the data with classifiers chained in series. The method performs well on the data sets used in the experiments; it reduces the time complexity, improves the training speed, improves the generalization capability of the weak learner, reduces the risk of overfitting, and improves the robustness of the model.
Since the labels are independent, labels can be added and deleted without affecting the rest of the model. This makes the method suitable for changing or dynamic scenarios and offers the opportunity for parallel implementation.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention provides a multi-label text classification calculation method based on ensemble learning, which comprises the following steps of:
step 1: preprocessing an original data set by segmenting sentences into individual words and deleting non-keywords;
step 2: extracting features and vectorizing the text using term frequency-inverse document frequency (TF-IDF);
step 3: decomposing the multi-label learning problem into a plurality of independent binary classification problems by the Binary Relevance method, wherein each binary classification problem corresponds to one label in the label space;
step 4: classifying the labels with an ensemble learning algorithm.
Preprocessing is an important stage in preparing the data set and is crucial when machine learning methods are applied to the data. It consists of two subtasks: (1) word segmentation and (2) stop-word removal.
The purpose of word segmentation is to split a sentence into individual words. The raw data is merely a sequence of characters; word segmentation divides the text into small units, such as English words or Chinese phrases, so that the content of the text can be analyzed in more depth and its intended meaning understood.
The purpose of stop-word removal is to eliminate unnecessary words, such as pronouns, prepositions, conjunctions, and other non-keywords. These words mainly serve to hold the structure of a sentence together. Words that are common in text but contribute little valuable information about its content are called stop words.
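The two preprocessing subtasks can be sketched in Python as follows. This is only an illustration under stated assumptions: the regular-expression tokenizer suits whitespace-delimited languages such as English (Chinese text would need a dedicated word segmenter), and both the function name `preprocess` and the tiny stop-word list are hypothetical, not taken from the patent.

```python
import re

# Hypothetical, deliberately small stop-word list covering the categories
# named in the text: pronouns, prepositions, and conjunctions.
STOP_WORDS = {
    "i", "you", "he", "she", "it", "we", "they",   # pronouns
    "in", "on", "at", "of", "to", "with",          # prepositions
    "and", "or", "but",                            # conjunctions
}

def preprocess(sentence: str) -> list[str]:
    """Split a sentence into lowercase word tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The engine failed and it was shut down in flight"))
```

A production system would normally take its stop-word list from a language resource rather than hard-coding it.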
In text classification, a document is described by a vector of features and feature values, also called attributes and attribute values. A common text representation assigns a TF-IDF (term frequency-inverse document frequency) weight to each word (feature). TF-IDF is a weighting technique commonly used in data mining. It measures the importance of a word in the whole data set statistically, from the frequency with which the word occurs in a text and its document frequency in the whole data set. The advantage of this approach is that it filters out words that are common but contribute little to the meaning of the article, while retaining the key words that shape the meaning of the text.
TF-IDF is an abbreviation of Term Frequency-Inverse Document Frequency. It consists of two parts, TF and IDF.
TF, the term frequency, represents the frequency of occurrence of a certain keyword in the text.
IDF is the inverse document frequency. The document frequency is the number of texts in the data set in which a keyword appears, and the inverse document frequency, as the name implies, is inversely related to it. The fewer texts contain a feature word t, the larger its IDF value and the better the word distinguishes between categories.
Thus, the more frequently a feature word occurs in a text and the lower its document frequency in the whole data set, the higher its TF-IDF value. TF-IDF can therefore filter out common but non-key words and retain important feature words. The TF-IDF value is the product of the two parts:
$$\text{TF-IDF} = \text{TF} \times \text{IDF}$$
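A minimal Python sketch of this weighting follows. The patent does not give the exact formulas for TF and IDF, so the sketch assumes one common variant: TF is the term count divided by the document length, and IDF is the natural logarithm of the number of documents divided by the number of documents containing the term. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: TF-IDF weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Assumed variant: TF = count / document length, IDF = ln(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["engine", "failure", "engine"],
    ["engine", "landing"],
    ["weather", "landing"],
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "failure" receives a higher weight than "engine" even though "engine" occurs twice, because "engine" also appears in another document. A term that appears in every document gets weight zero under this variant.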
Multi-label text classification
In general, documents cannot each be assigned to a single label, because categories naturally overlap. As described above, in multi-label text classification a document may belong to several classes. Not only does the training data contain documents with multiple labels; the classifier must also be able to map a single document to multiple classes, so the training algorithm must be adapted to handle multiple labels. Multi-label text classification may be defined as a classification task in which a classifier or a set of classifiers assigns each document to zero or more predefined class labels.
The basic idea of the Binary Relevance algorithm is to decompose the multi-label learning problem into q independent binary classification problems, where each binary classification problem corresponds to one label in the label space.
For the j-th class label, denoted $y_j$, Binary Relevance first determines whether each training example is relevant to $y_j$ and creates the corresponding binary training set $D_j$:
$$D_j = \{(x_i, \phi(Y_i, y_j)) \mid 1 \le i \le N\}$$
where $\phi(Y_i, y_j) = +1$ if $y_j \in Y_i$ and $-1$ otherwise. On this basis, a binary classifier $p_j: X \to \mathbb{R}$ is induced with a binary learning algorithm $B$, i.e. $p_j \leftarrow B(D_j)$. Thus, every multi-label training sample $(x_i, Y_i)$ takes part in the learning process of all $q$ binary classifiers. For a relevant label $y_j \in Y_i$, $x_i$ is regarded as a positive example for inducing $p_j$. On the other hand, for an irrelevant label $y_k \notin Y_i$, $x_i$ is regarded as a negative example. This training strategy is called cross-training.
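The construction of the binary training sets can be sketched in Python as follows. The sketch assumes that a relevant label is encoded as +1 and an irrelevant one as -1, and it uses string identifiers as stand-ins for TF-IDF feature vectors; the function name `binary_training_sets` and the sample labels are hypothetical.

```python
def binary_training_sets(X, Y, labels):
    """Build one binary training set D_j per label y_j (cross-training).

    X is a list of examples, Y the list of their label sets, and labels the
    label space. Each example appears in every D_j, as a positive (+1) or a
    negative (-1) example (assumed encoding).
    """
    return {
        y_j: [(x_i, +1 if y_j in Y_i else -1) for x_i, Y_i in zip(X, Y)]
        for y_j in labels
    }

X = ["doc0", "doc1", "doc2"]                      # stand-ins for feature vectors
Y = [{"engine"}, {"engine", "weather"}, set()]    # label set of each document
D = binary_training_sets(X, Y, labels=["engine", "weather"])
print(D["weather"])
```

Each of the resulting data sets can then be handed to any binary learning algorithm, independently of the others.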
For an unseen sample $x$, Binary Relevance predicts its relevant label set $Y$ by querying each binary classifier for the relevance of its label and then combining the relevant labels:
$$Y = \{y_j \mid p_j(x) > 0,\ 1 \le j \le q\}$$
Note that the predicted label set $Y$ is empty when all binary classifiers produce negative outputs. To avoid empty predictions, the T-Criterion rule may be applied:
$$Y = \{y_j \mid p_j(x) > 0,\ 1 \le j \le q\} \cup \{y_{j^*} \mid j^* = \arg\max_{1 \le j \le q} p_j(x)\}$$
Briefly, when none of the binary classifiers predicts a positive value, the T-Criterion rule supplements the first equation by including the class label with the largest (least negative) output.
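The prediction rule with the T-Criterion fallback can be sketched in Python as follows. The sketch assumes the real-valued classifier outputs are already available as a non-empty dictionary from label to score; the function name `predict_labels` and the sample scores are hypothetical.

```python
def predict_labels(scores: dict[str, float]) -> set[str]:
    """Combine per-label classifier outputs into a predicted label set.

    scores maps each label to the real-valued output of its binary
    classifier; it is assumed to be non-empty.
    """
    predicted = {label for label, s in scores.items() if s > 0}
    if not predicted:
        # T-Criterion: no classifier is positive, so fall back to the label
        # with the largest (least negative) output.
        predicted = {max(scores, key=scores.get)}
    return predicted

print(predict_labels({"engine": 0.3, "weather": -1.0, "landing": 0.1}))
print(predict_labels({"engine": -0.3, "weather": -1.0}))
```

When at least one output is positive, the label with the largest output is already in the set, so adding it only in the empty case gives the same result as taking the union in every case.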
For the learning algorithm, we select the boosting algorithm from among the ensemble algorithms, i.e. several base classifiers are chained in series. After each base classifier has classified the samples, new weights based on the negative gradient are recalculated from the positive and negative errors of the classification and passed to the next base classifier, so that the error is corrected continuously over the iterations. Here the negative gradient of the loss function serves as the measure of the error of the previous round's base learner, and the next round of learning fits this negative gradient. Ultimately we need a fitting function $f(x)$ that approaches the true value $y$ as closely as possible. Because it is difficult to find an accurate fitting function directly, we first initialize a less accurate $f_0(x)$ and compute the residual as its difference from the true value $y$:
$$r_0(x) = y - f_0(x)$$
The problem now becomes finding a suitable $h_0$ to fit $r_0$. The next fitting function can then be written as $f_1(x) = f_0(x) + h_0(x)$, but a residual still remains:
$$r_1(x) = y - f_1(x)$$
The next suitable $h_1$ is then found to fit $r_1$, and the fitting function becomes
$$f_2(x) = f_1(x) + h_1(x)$$
The iteration continues in this way, each round correcting the previous one, so that $f(x)$ keeps approaching the true value $y$. The fitting function $f_m(x)$ produced by the $m$-th iteration equals the function $f_{m-1}(x)$ of the previous iteration plus the function $h_{m-1}$ fitted to the residual of the previous iteration:
$$f_m(x) = f_{m-1}(x) + h_{m-1}(x), \quad m = 1, 2, \ldots, M$$
This is the general iterative training procedure of the boosting algorithm. Specifically, the weak classifier is first initialized as
$$f_0(x) = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, \rho)$$
where the constant $\rho$ minimizes the initial prediction loss $L$. The loss function here is the squared loss $L(y, f(x)) = (y - f(x))^2$, so the total loss over all $N$ samples is $\sum_{i=1}^{N} L(y_i, f(x_i))$. The purpose of the iteration is to minimize this loss by moving in the direction in which it decreases fastest, so in each iteration the negative gradient of $L(y, f(x))$ with respect to $f(x)$ is calculated:
$$-g_m(x_i) = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}$$
A fitting function $h(x; \alpha)$ is then constructed to fit the negative gradient $-g_m(x)$, and $\alpha_m$ is obtained by minimizing the squared error:
$$\alpha_m = \arg\min_{\alpha, \beta} \sum_{i=1}^{N} \left[-g_m(x_i) - \beta h(x_i; \alpha)\right]^2$$
Based on the obtained parameter $\alpha_m$, the optimal $\beta_m$ is calculated as
$$\beta_m = \arg\min_{\beta} \sum_{i=1}^{N} L\left(y_i, f_{m-1}(x_i) + \beta h(x_i; \alpha_m)\right)$$
Finally, the result is merged into the model and the prediction function is updated:
$$f_m(x) = f_{m-1}(x) + \beta_m h_m(x; \alpha_m)$$
The iteration terminates when the preset number of iterations $M$ is reached.
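The iterative procedure can be sketched in plain Python for a single binary problem with one numeric feature. This is an illustration under stated assumptions, not the patented implementation: the patent does not specify the base learner, so the sketch uses one-split regression stumps, takes the training mean as the initial constant (the minimizer of the squared loss), and computes the step size by a closed-form line search. All function names are hypothetical.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump: returns (threshold, left, right)."""
    best = None
    for t in sorted(set(xs)):
        left = [g for x, g in zip(xs, targets) if x <= t]
        right = [g for x, g in zip(xs, targets) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right) if right else lv
        err = sum((g - (lv if x <= t else rv)) ** 2 for x, g in zip(xs, targets))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def stump_predict(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def gradient_boost(xs, ys, n_rounds=20):
    """Gradient boosting for the squared loss L(y, f) = (y - f)^2."""
    f0 = sum(ys) / len(ys)            # constant that minimizes the squared loss
    preds = [f0] * len(xs)
    model = []
    for _ in range(n_rounds):
        # Negative gradient of (y - f)^2 with respect to f is 2 * (y - f).
        neg_grad = [2 * (y - f) for y, f in zip(ys, preds)]
        stump = fit_stump(xs, neg_grad)
        h = [stump_predict(stump, x) for x in xs]
        denom = sum(v * v for v in h)
        if denom == 0:                # nothing left to fit
            break
        # Line search: beta minimizing sum((y - f - beta * h)^2).
        beta = sum((y - f) * v for y, f, v in zip(ys, preds, h)) / denom
        preds = [f + beta * v for f, v in zip(preds, h)]
        model.append((beta, stump))
    return f0, model

def boost_predict(f0, model, x):
    return f0 + sum(beta * stump_predict(s, x) for beta, s in model)

xs = [1, 2, 3, 4, 5, 6]               # one hypothetical feature value per sample
ys = [-1, -1, -1, 1, 1, 1]            # +1 / -1 relevance of one label
f0, model = gradient_boost(xs, ys, n_rounds=5)
print(boost_predict(f0, model, 2), boost_predict(f0, model, 5))
```

In a Binary Relevance setting, one such model would be trained per label, and the sign of its output would serve as the relevance decision for that label.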
To evaluate the performance of the multi-label classification algorithm, we tested it on the following two data sets of different sizes. Table 1 shows the data set information.
TABLE 1 Data set information
Name | Type | Number of samples | Number of labels |
---|---|---|---|
TMC 2007 | Text | 28596 | 22 |
EPdata | Text | 155 | 22 |
The method of the present invention was compared with the ML-kNN algorithm.
We evaluate the performance of each algorithm on three measures: precision, recall, and F1 value.
True Positive (TP): the number of positive examples predicted as positive
True Negative (TN): the number of negative examples predicted as negative
False Positive (FP): the number of negative examples predicted as positive
False Negative (FN): the number of positive examples predicted as negative
Precision refers to our prediction results: it indicates how many of the samples predicted as positive are truly positive. A positive prediction has two possible outcomes, a positive example predicted as positive (TP) or a negative example predicted as positive (FP), i.e.
$$P = \frac{TP}{TP + FP}$$
Recall refers to our original samples: it indicates how many of the positive examples in the sample are predicted correctly. There are again two possibilities, an original positive example predicted as positive (TP) or an original positive example predicted as negative (FN), i.e.
$$R = \frac{TP}{TP + FN}$$
The F1 value is the harmonic mean of precision and recall:
$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$
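The three measures can be computed from the counts defined above, as in the following Python sketch. Returning 0.0 when a denominator is zero is an assumed convention, and the function name `precision_recall_f1` is hypothetical; the patent does not state how the measures are averaged over labels, so the sketch handles a single set of counts.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    # Assumed convention: a measure with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
print(p, r, f1)
```

With 6 true positives, 2 false positives, and 4 false negatives, precision is 6/8, recall is 6/10, and F1 is their harmonic mean, 2/3.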
TABLE 2 Comparison of precision results
Data set | ML-kNN | Method of the present invention |
---|---|---|
TMC 2007 | 0.61 | 0.67 |
EPdata | 0.18 | 0.51 |
TABLE 3 Comparison of recall results
Data set | ML-kNN | Method of the present invention |
---|---|---|
TMC 2007 | 0.62 | 0.65 |
EPdata | 0.13 | 0.42 |
TABLE 4 Comparison of F1 value results
Data set | ML-kNN | Method of the present invention |
---|---|---|
TMC 2007 | 0.57 | 0.62 |
EPdata | 0.14 | 0.44 |
From the above experimental results, it can be seen that, for the same measure, the same algorithm performs very differently on different data sets, and that different algorithms also behave differently on the same data set. Comparing the two algorithms shows that the method of the present invention scores higher than ML-kNN on all three measures on both data sets, which illustrates the effectiveness of combining ensemble learning with Binary Relevance.
The embodiments are merely preferred embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily made by those skilled in the art within the technical scope of the present invention will be covered by the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
1. A multi-label text classification calculation method based on ensemble learning is characterized by comprising the following steps:
step 1: preprocessing an original data set by segmenting sentences into individual words and deleting non-keywords;
step 2: extracting features and vectorizing the text using term frequency-inverse document frequency (TF-IDF);
step 3: decomposing the multi-label learning problem into a plurality of independent binary classification problems by the Binary Relevance method, wherein each binary classification problem corresponds to one label in the label space;
step 4: classifying the labels with an ensemble learning algorithm.
2. The multi-label text classification calculation method according to claim 1, wherein the non-keywords comprise pronouns, prepositions, and conjunctions.
3. The multi-label text classification calculation method according to claim 1, wherein step 2 comprises: counting the number of times a word appears in a text and its document frequency in the data set, in order to calculate the importance of the word in the data set.
4. The multi-label text classification calculation method according to claim 1, wherein step 3 comprises:
first, judging whether each training example is associated with a given label, and creating the corresponding binary training set;
then inducing a binary classifier for each label with a binary learning algorithm, wherein training examples relevant to the label are positive examples and training examples irrelevant to the label are negative examples; for an example whose label relevance is unknown, its relevant label set is predicted.
5. The multi-label text classification calculation method according to claim 1, wherein step 4 comprises:
using a boosting algorithm to chain a plurality of base classifiers in series; after each base classifier has classified the samples, new weights based on the negative gradient are recalculated from the positive and negative errors of the classification and passed to the next base classifier, so that the errors are corrected continuously over the iterations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911085655.2A CN110968693A (en) | 2019-11-08 | 2019-11-08 | Multi-label text classification calculation method based on ensemble learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110968693A (en) | 2020-04-07 |
Family
ID=70030440
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911085655.2A Pending CN110968693A (en) | 2019-11-08 | 2019-11-08 | Multi-label text classification calculation method based on ensemble learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110968693A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112182213A (en) * | 2020-09-27 | 2021-01-05 | 中润普达(十堰)大数据中心有限公司 | Modeling method based on abnormal lacrimation feature cognition |
CN113011337A (en) * | 2021-03-19 | 2021-06-22 | 山东大学 | Chinese character library generation method and system based on deep meta learning |
CN113705215A (en) * | 2021-08-27 | 2021-11-26 | 南京大学 | Meta-learning-based large-scale multi-label text classification method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140250032A1 (en) * | 2013-03-01 | 2014-09-04 | Xerox Corporation | Methods, systems and processor-readable media for simultaneous sentiment analysis and topic classification with multiple labels |
CN106156029A (en) * | 2015-03-24 | 2016-11-23 | 中国人民解放军国防科学技术大学 | The uneven fictitious assets data classification method of multi-tag based on integrated study |
CN107577785A (en) * | 2017-09-15 | 2018-01-12 | 南京大学 | A kind of level multi-tag sorting technique suitable for law identification |
CN109036577A (en) * | 2018-07-27 | 2018-12-18 | 合肥工业大学 | Diabetic complication analysis method and device |
CN109036556A (en) * | 2018-08-29 | 2018-12-18 | 王雁 | A method of keratoconus case is diagnosed based on machine learning |
Non-Patent Citations (2)
Title |
---|
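MA ZHANGUO et al.: "《Improved Terms Weighting Algorithm of Text》" * |
YU XUEHAO et al.: "《Multi-label text classification for electric power information and communication customer service systems based on BR and GBDT》" * |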
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112182213A (en) * | 2020-09-27 | 2021-01-05 | 中润普达(十堰)大数据中心有限公司 | Modeling method based on abnormal lacrimation feature cognition |
CN112182213B (en) * | 2020-09-27 | 2022-07-05 | 中润普达(十堰)大数据中心有限公司 | Modeling method based on abnormal lacrimation feature cognition |
CN113011337A (en) * | 2021-03-19 | 2021-06-22 | 山东大学 | Chinese character library generation method and system based on deep meta learning |
CN113705215A (en) * | 2021-08-27 | 2021-11-26 | 南京大学 | Meta-learning-based large-scale multi-label text classification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||