CN106815369B - A text classification method based on the XGBoost classification algorithm - Google Patents
A text classification method based on the XGBoost classification algorithm Download PDF Info
- Publication number
- CN106815369B (application CN201710060026.9A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- sample
- xgboost
- sorting algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The method provided by this invention extracts labeled words via Labeled-LDA to compute feature values, then performs text classification with the XGBoost classification algorithm. Compared with the common approach of building the feature space from a vector space model and classifying with an ordinary algorithm, it consumes far less memory. This is because Chinese text contains millions of distinct words, so a word-level feature space is extremely high-dimensional, consumes enormous memory, and may not even be processable on a single machine, whereas there are no more than about 10,000 Chinese characters in common use, often only about 1,000 in frequent use, so the dimensionality drops dramatically, and XGBoost accepts dictionary rather than matrix input. The invention also proposes Labeled-LDA as a novel supervised feature selection algorithm with latent semantics: feature selection with Labeled-LDA exploits both the semantic information that LDA mines from a large corpus and the class information carried by the text labels. Preprocessing is simple and no hand-crafted feature extraction is needed; combined with the powerful distributed ensemble learning algorithm XGBoost, this improves both classification accuracy and performance.
Description
Technical field
The present invention relates to the field of text classification, and in particular to a text classification method based on the XGBoost classification algorithm.
Background technique
Text classification methods are widely used in search engines, personalized recommendation systems, public opinion monitoring, and other fields, and are a key component of efficiently managing and accurately locating massive amounts of information.
The common framework for text classification is based on machine learning: data preprocessing, followed by feature extraction, feature selection, and classification.
Feature extraction represents text with a unified method or model that captures the text's features and can easily be converted into mathematical language, and hence into a mathematical model that a computer can process. The currently popular text representations are the vector space model and latent semantic analysis models. The vector space model has the advantages of simplicity, convenient computation, and little preprocessing, but it ignores semantic relations between features and represents text only shallowly.
Feature selection is one of the key problems in machine learning; the quality of its result directly affects the classifier's accuracy and generalization. Feature selection picks the most effective features out of a feature set in order to reduce the dimensionality of the feature space, remove irrelevant or redundant features, shorten running time, improve the interpretability of the analysis results, and uncover structure hidden in high-dimensional data. Depending on whether class information is available, feature selection divides into supervised and unsupervised methods. Common methods in text classification include document frequency, mutual information, information gain, and the chi-square statistic (CHI).
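As an illustration of the chi-square statistic (CHI) mentioned above, the following sketch computes the standard CHI score between one term and one class from a 2x2 contingency table of document counts. The counts here are made up for illustration; the function name is ours, not the patent's.

```python
def chi_square(a, b, c, d):
    """CHI statistic for a term-class contingency table.

    a: docs in the class containing the term
    b: docs outside the class containing the term
    c: docs in the class without the term
    d: docs outside the class without the term
    """
    n = a + b + c + d
    numerator = n * (a * d - c * b) ** 2
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return numerator / denominator

# Hypothetical counts for one term and one class; a higher score
# means the term's occurrence depends more strongly on the class.
score = chi_square(10, 20, 30, 40)
print(round(score, 4))
```

A term that appears only inside the class (or only outside it) scores highest, which is why CHI is used to rank candidate features.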
Classification is the final and most important step of text classification. The naive Bayes algorithm is a typical classifier: it computes the probability that a text belongs to a particular class from Bayes' formula, estimates very few parameters, is not very sensitive to missing data, is simple and fast, and in theory attains the smallest error rate among classification algorithms. In practice this is not always the case, because its assumption that attributes are mutually independent rarely holds in real applications. Decision tree algorithms are easy to understand and interpret, handle both numerical and categorical attributes, and can produce feasible, well-performing results on large data sources in relatively little time, but they have difficulty handling missing data and tend to ignore correlations between attributes in the dataset. Other classification methods include LR, KNN, and SVM.
These are all base learners; ensemble learning combines multiple learners and often achieves markedly better generalization than any single learner. By how the individual learners are generated, ensemble methods currently fall into two classes: sequential methods, in which the individual learners depend strongly on one another and must be generated serially, represented by Boosting; and parallel methods, in which the individual learners have no strong dependence and can be generated simultaneously, represented by Bagging and random forests. From the viewpoint of the bias-variance decomposition, Boosting mainly reduces bias, so it can build a very strong ensemble from learners with quite weak generalization; Bagging mainly reduces variance, so it is most effective on learners that are sensitive to sample perturbation, such as unpruned decision trees and neural networks. The XGBoost classification algorithm is an ensemble method based on Boosting; compared with the traditional ensemble algorithm GBDT, the XGBoost classification algorithm has the following eight major advantages:
One: traditional GBDT uses CART as the base classifier, while the XGBoost classification algorithm also supports linear classifiers.
Two: traditional GBDT uses only first-order derivative information during optimization, whereas the XGBoost classification algorithm performs a second-order Taylor expansion of the cost function, using both first and second derivatives.
Three: the XGBoost classification algorithm adds a regularization term to the cost function to control model complexity. The regularizer contains the number of leaf nodes of the tree and the squared L2 norm of the scores output at each leaf node. From the bias-variance trade-off perspective, the regularizer reduces the model's variance, makes the learned model simpler, and prevents overfitting.
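For reference, the second-order expansion and the regularizer described in points two and three can be written in the notation of the XGBoost paper (a sketch; the patent text itself gives no formulas):

```latex
\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}\left[\, g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}
```

where $g_i$ and $h_i$ are the first and second derivatives of the loss $l(y_i, \hat{y}_i^{(t-1)})$ with respect to the previous iteration's prediction, $T$ is the number of leaves of the new tree $f_t$, and $w$ is the vector of leaf scores.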
Four: shrinkage, equivalent to a learning rate. After each iteration, the XGBoost classification algorithm multiplies the leaf node weights by this coefficient, chiefly to weaken the influence of each tree and leave more room for the later trees to learn.
Five: column subsampling. The XGBoost classification algorithm borrows this technique from random forests; supporting column subsampling not only reduces overfitting but also reduces computation.
Six: handling of missing values. For samples with missing feature values, the XGBoost classification algorithm automatically learns the split direction.
Seven: parallelism. Boosting is a serial procedure, and the XGBoost classification algorithm's parallelism is not across trees: one iteration can only begin after the previous one finishes (the cost function of iteration t contains the predictions of the first t-1 iterations). Instead, the XGBoost classification algorithm is parallel at the feature granularity. The most time-consuming step in learning a decision tree is sorting the feature values (in order to determine the best split point). The XGBoost classification algorithm sorts the data once before training and stores it in a block structure that is reused across later iterations, which greatly reduces computation. This block structure also makes parallelism possible: when splitting a node, the gain of every feature must be computed and the feature with the maximum gain chosen for the split, and these per-feature gain computations can run in multiple threads.
Eight: a parallelizable approximate histogram algorithm. When a tree node is split, the gain of every candidate split point of every feature must be computed, i.e., all possible split points are enumerated greedily. When the data cannot be loaded into memory at once, or in distributed settings, this greedy algorithm becomes very inefficient, so the XGBoost classification algorithm also proposes a parallelizable approximate histogram algorithm for efficiently generating candidate split points.
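The difference between exact greedy enumeration and the approximate scheme in point eight can be sketched in plain Python: instead of testing every distinct feature value as a split point, candidates are drawn from quantiles of the feature. This is a simplified sketch; XGBoost's actual algorithm uses a weighted quantile sketch.

```python
def exact_candidates(values):
    """Greedy enumeration: every distinct feature value is a candidate split."""
    return sorted(set(values))

def quantile_candidates(values, num_bins):
    """Approximate scheme: only (num_bins - 1) quantile cut points are tried."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[k * n // num_bins] for k in range(1, num_bins)]
    return sorted(set(cuts))

feature = [0.1 * i for i in range(1000)]      # 1000 distinct values
print(len(exact_candidates(feature)))          # 1000 candidates to evaluate
print(len(quantile_candidates(feature, 16)))   # only 15 candidates to evaluate
```

Evaluating 15 candidates instead of 1000 per feature is what makes the approximate algorithm practical for out-of-core and distributed training.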
Summary of the invention
To overcome the defects of the prior art described above, namely low classification performance, large memory consumption, and low classification accuracy, the present invention provides a text classification method based on the XGBoost classification algorithm.
To achieve the above objective of the invention, the technical solution adopted is as follows:
A text classification method based on the XGBoost classification algorithm, comprising the following steps:
S1. Obtain multiple samples, each comprising text content and a label for the text.
S2. Divide the samples obtained in step S1 into training samples and prediction samples in a certain proportion; the training samples form the training set and the prediction samples form the prediction set.
S3. For each training sample, separate every two adjacent characters in its text content with a space, then use the training sample's label as the Labeled-LDA label input and the training sample's text content as the Labeled-LDA text input.
S4. Set the number of Labeled-LDA iterations to K, then train iteratively on the training samples.
S5. After the iterations, each training sample yields two documents: one maps each word to its corresponding word code, and the other maps topics to words, i.e., the number of times each word code occurs under each topic. Merging the two documents gives the number of times each word in the training samples occurs under each topic. For each topic, sort the words by occurrence count and take the m words most related to that topic as the training samples' LLDA words.
S6. For each training sample, count the occurrences in its text content of each LLDA word obtained in step S5 and use that count as the value of the LLDA word; the resulting per-sample values for each LLDA word are input to the XGBoost classification algorithm, which is then trained.
S7. The model is now trained and the prediction set must be predicted: apply steps S3-S5 to the prediction set, then use the trained model to predict the class of each prediction sample in the prediction set.
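Steps S3-S6 can be sketched end to end in plain Python. Labeled-LDA training is replaced here by a simple per-label character-frequency count (an assumption for illustration only; the patent uses the actual Labeled-LDA topic-word counts), and the per-sample dictionaries produced at the end are the values that would be fed to XGBoost.

```python
from collections import Counter

# S1/S2: tiny hypothetical training set of (text, label) pairs;
# characters are already space-separated as required by S3.
train = [("团 购 大 优 惠", "ad"),
         ("新 闻 评 论 正 常", "normal"),
         ("超 低 价 抢 购", "ad")]

# S3/S4 stand-in: group character counts by label.
by_label = {}
for text, label in train:
    by_label.setdefault(label, Counter()).update(text.split())

# S5 stand-in: the m most frequent characters per label become LLDA words.
m = 3
llda_words = set()
for counts in by_label.values():
    llda_words.update(w for w, _ in counts.most_common(m))

# S6: per-sample dictionary of LLDA-word counts (zero counts omitted).
features = [{w: c for w, c in Counter(text.split()).items() if w in llda_words}
            for text, _ in train]
print(features[0])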
Preferably, in step S6, if an LLDA word occurs 0 times in the text, that LLDA word is not input to the XGBoost classification algorithm; that is, whereas ordinary classification algorithms take matrix input, this method takes dictionary input, which saves memory and handles missing values more conveniently, quickly, and sensibly.
Preferably, K is 1000.
Compared with the prior art, the beneficial effects of the present invention are:
The method provided by the invention extracts labeled words via Labeled-LDA to compute feature values, then performs text classification with the XGBoost classification algorithm. Compared with the common approach of building the feature space from a vector space model and classifying with an ordinary algorithm, it consumes far less memory. This is because Chinese text contains millions of distinct words, so a word-level feature space is extremely high-dimensional, consumes enormous memory, and may not even be processable on a single machine, whereas there are no more than about 10,000 Chinese characters in common use, often only about 1,000 in frequent use, so the dimensionality drops dramatically, and XGBoost accepts dictionary rather than matrix input. The invention also proposes Labeled-LDA as a novel supervised feature selection algorithm with latent semantics: feature selection with Labeled-LDA exploits both the semantic information that LDA mines from a large corpus and the class information carried by the text labels. Preprocessing is simple, no hand-crafted feature extraction is needed, and the powerful distributed ensemble learning algorithm XGBoost improves both classification accuracy and performance.
Detailed description of the invention
Fig. 1 is a flowchart of the method.
Specific embodiment
The attached figures are for illustrative purposes only and should not be construed as limiting the patent.
The present invention is further elaborated below in conjunction with the drawings and embodiments.
Embodiment 1
This embodiment comprises 3 concrete cases, classifying 3 text corpora with different characteristics: a public English corpus, WebKB, with samples lacking any content removed, and two Chinese corpora, one a public long-text corpus (the Fudan University text classification corpus) and the other a highly imbalanced Chinese short-text corpus of news comments divided into the two classes normal and advertisement, with a positive-to-negative ratio on the order of 2742/42416 = 0.065.
Table 1: Summary of the text classification datasets
As shown in Figure 1, the specific implementation steps of the present text classification method, which uses Labeled-LDA for feature extraction and the XGBoost classification algorithm, are as follows:
Step 1: Text Pretreatment
Prepare a batch of labeled text sets in advance, such as the 3 cases, and randomly divide training and prediction sets in an 8:2 ratio (news comments); if a public dataset already provides a split (WebKB and Fudan University text classification), use it directly. Denoise all texts and unify the encoding to UTF-8. For Chinese text, separate every character with a space to simplify subsequent processing. Since the present invention is based on character features, and commonly used characters number no more than about 10,000, and Labeled-LDA has strong feature selection ability, the preprocessing steps usual in Chinese text classification, such as removing punctuation, digits, and stop words, can be omitted; the experimental results also show that such preprocessing is unnecessary. For English text, preprocessing converts all capitals to lowercase and replaces punctuation with spaces, which better delimits word boundaries and matches English writing conventions.
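The preprocessing just described, spacing out Chinese characters, lowercasing English, and replacing punctuation with spaces, might look like the following sketch (the function names are ours, not the patent's):

```python
import re

def preprocess_chinese(text: str) -> str:
    """Separate every character with a space so each character is a feature."""
    return " ".join(ch for ch in text if not ch.isspace())

def preprocess_english(text: str) -> str:
    """Lowercase everything and replace punctuation with spaces."""
    return re.sub(r"[^\w\s]", " ", text.lower())

print(preprocess_chinese("文本分类"))
print(preprocess_english("Hello, World!"))
```

After this step both languages are uniform streams of space-separated tokens, which is all that Labeled-LDA and the later counting steps require.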
Step 2: Feature selection on the training set with Labeled-LDA
The experiments of the present invention use the Stanford natural language processing tool TMT. First set the various Labeled-LDA parameters, including the text segmentation mode, the text filtering mode, the number of labels, the number of iterations, and so on. Segmentation is by spaces, the number of labels depends on the specific corpus, and the number of iterations is uniformly 1000. After training, two files are obtained: a word-to-code file and a file of the occurrence counts of each word code under each topic. Merging the two files gives the occurrence counts of each word under each topic. Then, under each topic, sort the words by occurrence count and take the top N most frequent. The resulting words are those left after Labeled-LDA feature selection; there may be duplicates across topics, and after removing duplicates these words are the LLDA words. They carry rich semantic information, and because they were obtained by training under the strong supervision signal of the labels, they also have strong feature representation ability.
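Merging the two Labeled-LDA output files and selecting the top-N characters per topic, as described above, can be sketched as follows. File contents are replaced by in-memory dictionaries, and the file formats are assumptions for illustration.

```python
# File 1 (assumed format): word -> integer code.
word_to_code = {"价": 0, "购": 1, "闻": 2, "评": 3}

# File 2 (assumed format): (topic, word code) -> occurrence count.
topic_code_counts = {("ad", 0): 9, ("ad", 1): 7, ("ad", 2): 1,
                     ("normal", 2): 8, ("normal", 3): 6, ("normal", 1): 2}

# Merge: occurrence count of each *word* under each topic.
code_to_word = {c: w for w, c in word_to_code.items()}
topic_word_counts = {}
for (topic, code), count in topic_code_counts.items():
    topic_word_counts.setdefault(topic, {})[code_to_word[code]] = count

# Per topic, sort by count, keep the top N, and deduplicate across topics.
N = 2
llda_words = set()
for topic, counts in topic_word_counts.items():
    llda_words.update(sorted(counts, key=counts.get, reverse=True)[:N])
print(sorted(llda_words))
```

The deduplicated union over all topics is the final LLDA vocabulary used as the feature set.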
Step 3: Converting the training and prediction sets into XGBoost's input format
XGBoost's input format is simple, convenient, and sensible. For each sample of the training and prediction sets, because words are treated as features, whenever processing reaches a word that is an LLDA word, the feature value of the corresponding LLDA feature is incremented by 1. For each sample, an LLDA word that does not occur has a feature value of 0 for its corresponding feature, and that feature is simply not included in the XGBoost input.
Step 4: Parameter setting, training, and prediction for the XGBoost classification algorithm
Set the parameters of the XGBoost classification algorithm and choose the parameters with the best overall effect. The final parameter settings are: 200 iterations; the number of classes (depending on the dataset); classifier type gbtree; objective function multi:softmax; learning rate eta = 0.3; maximum tree depth max_depth = 6; and so on. The trained model then classifies the prediction set.
Step 5: Assessing model performance
Since the datasets differ greatly, different evaluation criteria are used for each dataset so that the evaluation is more reasonable. WebKB has 4 classes and is assessed with the micro-averaged F1 value (micro-F1) and the macro-averaged F1 value (macro-F1).
They are defined as follows:
P_i is the precision of the i-th class, R_i is the recall of the i-th class, and n is the number of classes. Fudan University text classification is assessed with micro-averaged precision (micro-P); news comments, whose classes are extremely imbalanced, are assessed with the precision P, recall R, and F value of the negative class.
The model defined by the present invention is denoted LLDA-XG; the experimental results are as follows:
Table 2: Classification performance on the WebKB corpus
| | RCC | Centroid | NB | Winnow | SVM | LLDA-XG |
|---|---|---|---|---|---|---|
| micro_F1 | 89.64 | 79.95 | 86.45 | 81.40 | 89.26 | 91.40 |
| macro_F1 | 88.02 | 78.47 | 86.01 | 77.85 | 87.66 | 90.48 |
Table 3: Classification performance on the news comment corpus
| | Bayes | SGDC | Bagging+KNN | SVM | RF | GBDT | Adaboost | LLDA-XG |
|---|---|---|---|---|---|---|---|---|
| P | 69.55 | 77.71 | 97.00 | 93.65 | 96.4 | 92.22 | 95.80 | 94.52 |
| R | 91.62 | 91.98 | 85.20 | 86.81 | 89.84 | 88.77 | 85.56 | 92.34 |
| F value | 79.08 | 84.24 | 90.70 | 90.10 | 92.99 | 90.46 | 90.40 | 93.42 |
Table 4: Classification performance on the Fudan University text classification corpus
| Model | Accuracy |
|---|---|
| Bag of words + logistic regression | 92.08 |
| Bigrams + logistic regression | 92.97 |
| Bag of words + SVM | 93.02 |
| Bigrams + SVM | 93.03 |
| Averaged word vectors + logistic regression | 86.89 |
| Labeled-LDA | 90.80 |
| Convolutional neural network | 94.04 |
| Recurrent convolutional network | 95.20 |
| LLDA-XG | 95.41 |
From Tables 2, 3, and 4 it can be seen that the LLDA-XG model achieves good results on the WebKB corpus, the news comment corpus, and the Fudan University text classification corpus. On the WebKB corpus, both micro_F1 and macro_F1 exceed 90. On the news comment corpus, LLDA-XG's precision is not the highest, but its recall and F value both are, showing that LLDA-XG maintains both very high precision and very high recall and favors balanced performance. On the Fudan University text classification corpus, the baseline results are taken from the doctoral thesis of Lai Siwei of the University of the Chinese Academy of Sciences, which studies combining neural networks with word vectors; LLDA-XG performs outstandingly there, with classification accuracy higher than the recurrent convolutional neural network, far less preprocessing, and much faster running time: an ordinary single machine obtains results in four or five minutes, whereas the neural networks require a large amount of computation time.
Table 5: Classification performance and running time of LLDA-XG on the WebKB corpus
| Number of features | 215 | 411 | 925 | 7288 |
|---|---|---|---|---|
| micro-F1 | 90.24 | 90.82 | 91.40 | 91.61 |
| macro-F1 | 89.19 | 89.98 | 90.48 | 90.69 |
| Time (s) | 3.740 | 4.560 | 5.759 | 8.984 |
Table 6: Classification performance and running time of LLDA-XG on the news comment corpus
| Number of features | 154 | 289 | 674 | 4154 |
|---|---|---|---|---|
| P | 94.39 | 93.91 | 94.53 | 94.13 |
| R | 90.02 | 90.73 | 92.34 | 91.44 |
| F | 92.15 | 92.29 | 93.42 | 92.77 |
| Time (s) | 6.739 | 7.716 | 8.756 | 10.589 |
Table 7: Classification performance and running time of LLDA-XG on the Fudan University text classification corpus
| Number of features | 408 | 665 | 1329 | 6707 |
|---|---|---|---|---|
| Accuracy | 94.59 | 95.20 | 95.41 | 95.39 |
| Time (s) | 145.701 | 206.362 | 278.524 | 342.354 |
In Table 5, when the number of features increases from 215 to 411, micro-F1 rises by 0.58 and macro-F1 by 0.79; from 411 to 925, micro-F1 rises by 0.58 and macro-F1 by 0.5; from 925 to 7288, roughly a 7-fold increase, micro-F1 and macro-F1 rise by only 0.21, a negligible gain, while the running time nearly doubles. In Table 6, when the number of features increases from 154 to 289, precision P actually drops by 0.48 while recall R rises by 0.71 and the F value by 0.14; from 289 to 674, precision rises by 0.62, recall by 1.61, and the F value by 1.13; but from 674 to 4154, precision drops by 0.4, recall by 0.9, and the F value by 0.65. On the Fudan University text classification task, accuracy rises by 0.61 when the number of features increases from 408 to 665, rises by 0.21 from 665 to 1329, and drops by 0.02 from 1329 to 6707.
From Tables 5, 6, and 7 it can be seen that the more features the LLDA-XG model uses, the longer it takes, while classification performance grows slowly and, at high feature counts, even declines, possibly because overtraining causes overfitting. This illustrates the importance of feature selection: the features that most affect classification performance are often a minority, the feature set contains many redundant or even noisy features, and more features mean more computation time with very limited performance gain, or even overfitting. It also shows that Labeled-LDA has very strong feature selection ability: whether for Chinese or English, long text or short text, it selects very good features from a large feature set, reducing overall running time while keeping performance stable and efficient. Moreover, the feature extraction of the invention is character-based, so extraction is very convenient; no manual work or expert knowledge is needed to spend vast resources extracting features, and the preprocessing is simple with little running time. Combined with the fast and efficient XGBoost classification algorithm, which handles missing values very well, overall performance is superior and stable. In short, experiments verify that the proposed text classification model LLDA-XG can be widely applied to text classification tasks; its preprocessing is simple and fast, its overall running time and performance are both outstanding, and it has high robustness and practical value.
Obviously, the above embodiments are merely examples given to clearly illustrate the present invention and are not a limitation on its embodiments. Those of ordinary skill in the art may make other variations or changes on the basis of the above description; it is neither necessary nor possible to exhaust all embodiments. Any modifications, equivalent replacements, and improvements made within the spirit and principle of the invention shall fall within the protection scope of the claims of the present invention.
Claims (3)
1. A text classification method based on the XGBoost classification algorithm, characterized by comprising the following steps:
S1. Obtain multiple samples, each comprising text content and a label for the text.
S2. Divide the samples obtained in step S1 into training samples and prediction samples in a certain proportion; the training samples form the training set and the prediction samples form the prediction set.
S3. For each training sample, separate every two adjacent characters in its text content with a space, then use the training sample's label as the Labeled-LDA label input and the training sample's text content as the Labeled-LDA text input.
S4. Set the number of Labeled-LDA iterations to K, then train iteratively on the training samples.
S5. After the iterations, each training sample yields two documents: one maps each word to its corresponding word code, and the other maps topics to words, i.e., the number of times each word code occurs under each topic. Merging the two documents gives the number of times each word in the training samples occurs under each topic. For each topic, sort the words by occurrence count and take the m words most related to that topic as the training samples' LLDA words.
S6. For each training sample, count the occurrences in its text content of each LLDA word obtained in step S5 and use that count as the value of the LLDA word; the resulting per-sample values for each LLDA word are input to the XGBoost classification algorithm, which is then trained.
S7. The model is now trained and the prediction set must be predicted: apply steps S3-S5 to the prediction set, then use the trained model to predict the class of each prediction sample in the prediction set.
2. The text classification method based on the XGBoost classification algorithm according to claim 1, characterized in that: in step S6, if an LLDA word occurs 0 times in the text, the LLDA word is not input to the XGBoost classification algorithm.
3. The text classification method based on the XGBoost classification algorithm according to claim 1, characterized in that: K is 1000.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710060026.9A CN106815369B (en) | 2017-01-24 | 2017-01-24 | A kind of file classification method based on Xgboost sorting algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710060026.9A CN106815369B (en) | 2017-01-24 | 2017-01-24 | A kind of file classification method based on Xgboost sorting algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106815369A CN106815369A (en) | 2017-06-09 |
CN106815369B true CN106815369B (en) | 2019-09-20 |
Family
ID=59112460
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710060026.9A Active CN106815369B (en) | 2017-01-24 | 2017-01-24 | A kind of file classification method based on Xgboost sorting algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106815369B (en) |
Families Citing this family (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463965B (en) * | 2017-08-16 | 2024-03-26 | 湖州易有科技有限公司 | Deep learning-based fabric attribute picture acquisition and recognition method and recognition system |
CN107798083A (en) * | 2017-10-17 | 2018-03-13 | 广东广业开元科技有限公司 | A kind of information based on big data recommends method, system and device |
CN107992531B (en) * | 2017-11-21 | 2020-11-27 | 吉浦斯信息咨询(深圳)有限公司 | News personalized intelligent recommendation method and system based on deep learning |
CN108090201A (en) * | 2017-12-20 | 2018-05-29 | 珠海市君天电子科技有限公司 | A kind of method, apparatus and electronic equipment of article content classification |
CN108257052B (en) * | 2018-01-16 | 2022-04-22 | 中南大学 | Online student knowledge assessment method and system |
CN108536650B (en) * | 2018-04-03 | 2022-04-26 | 北京京东尚科信息技术有限公司 | Method and device for generating gradient lifting tree model |
CN108627798B (en) * | 2018-04-04 | 2022-03-11 | 北京工业大学 | WLAN indoor positioning algorithm based on linear discriminant analysis and gradient lifting tree |
CN108563638B (en) * | 2018-04-13 | 2021-08-10 | 武汉大学 | Microblog emotion analysis method based on topic identification and integrated learning |
CN108868805B (en) * | 2018-06-08 | 2019-11-15 | 西安电子科技大学 | Shield method for correcting error based on statistical analysis in conjunction with XGboost |
CN108985335B (en) * | 2018-06-19 | 2021-04-27 | 中国原子能科学研究院 | Integrated learning prediction method for irradiation swelling of nuclear reactor cladding material |
CN109189937B (en) | 2018-08-22 | 2021-02-09 | 创新先进技术有限公司 | Feature relationship recommendation method and device, computing device and storage medium |
WO2020082734A1 (en) * | 2018-10-24 | 2020-04-30 | 平安科技(深圳)有限公司 | Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium |
CN109543187B (en) * | 2018-11-23 | 2021-09-17 | 中山大学 | Method and device for generating electronic medical record characteristics and storage medium |
CN109754002A (en) * | 2018-12-24 | 2019-05-14 | 上海大学 | A kind of steganalysis hybrid integrated method based on deep learning |
CN109933667A (en) * | 2019-03-19 | 2019-06-25 | 中国联合网络通信集团有限公司 | Textual classification model training method, file classification method and equipment |
CN110334732A (en) * | 2019-05-20 | 2019-10-15 | 北京思路创新科技有限公司 | A kind of Urban Air Pollution Methods and device based on machine learning |
CN110318327A (en) * | 2019-06-10 | 2019-10-11 | 长安大学 | A kind of surface evenness prediction technique based on random forest |
CN111079031B (en) * | 2019-12-27 | 2023-09-12 | 北京工业大学 | Weight classification method for importance of blog with respect to disaster information based on deep learning and XGBoost algorithm |
CN111753058B (en) * | 2020-06-30 | 2023-06-02 | 北京信息科技大学 | Text viewpoint mining method and system |
CN111831824B (en) * | 2020-07-16 | 2024-02-09 | 民生科技有限责任公司 | Public opinion positive and negative surface classification method |
CN111859074B (en) * | 2020-07-29 | 2023-12-29 | 东北大学 | Network public opinion information source influence evaluation method and system based on deep learning |
CN112699944B (en) * | 2020-12-31 | 2024-04-23 | 中国银联股份有限公司 | Training method, processing method, device, equipment and medium for returning list processing model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104063472A (en) * | 2014-06-30 | 2014-09-24 | 电子科技大学 | KNN text classifying method for optimizing training sample set |
CN104331498A (en) * | 2014-11-19 | 2015-02-04 | 亚信科技(南京)有限公司 | Method for automatically classifying webpage content visited by Internet users |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060142993A1 (en) * | 2004-12-28 | 2006-06-29 | Sony Corporation | System and method for utilizing distance measures to perform text classification |
Application Events
Date | Event |
---|---|
2017-01-24 | Application CN201710060026.9A filed in China (CN); granted as patent CN106815369B; status: Active |
Non-Patent Citations (1)
Title |
---|
A New Text Classification Algorithm Based on the Labeled-LDA Model; Li Wenbo; Chinese Journal of Computers; 2008-04-30; full text *
Also Published As
Publication number | Publication date |
---|---|
CN106815369A (en) | 2017-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106815369B (en) | A kind of file classification method based on Xgboost sorting algorithm | |
CN110059181B (en) | Short text label method, system and device for large-scale classification system | |
Huang et al. | Naive Bayes classification algorithm based on small sample set | |
CN109740154A (en) | A kind of online comment fine granularity sentiment analysis method based on multi-task learning | |
CN107301171A (en) | A kind of text emotion analysis method and system learnt based on sentiment dictionary | |
Liu et al. | Deep learning approaches for link prediction in social network services | |
CN108197144B (en) | Hot topic discovery method based on BTM and Single-pass | |
Sundus et al. | A deep learning approach for arabic text classification | |
CN101587493A (en) | Text classification method | |
CN104834940A (en) | Medical image inspection disease classification method based on support vector machine (SVM) | |
CN110795564B (en) | Text classification method lacking negative cases | |
CN109299270A (en) | A kind of text data unsupervised clustering based on convolutional neural networks | |
MidhunChakkaravarthy | Evolutionary and incremental text document classifier using deep learning | |
CN103268346A (en) | Semi-supervised classification method and semi-supervised classification system | |
CN103514168B (en) | Data processing method and device | |
Kakade et al. | A neural network approach for text document classification and semantic text analytics | |
CN115098690B (en) | Multi-data document classification method and system based on cluster analysis | |
CN114881172A (en) | Software vulnerability automatic classification method based on weighted word vector and neural network | |
Zhu et al. | Chinese texts classification system | |
Nohuddin et al. | Content analytics based on random forest classification technique: An empirical evaluation using online news dataset | |
Al Hasan et al. | Clustering Analysis of Bangla News Articles with TF-IDF & CV Using Mini-Batch K-Means and K-Means | |
Klungpornkun et al. | Hierarchical text categorization using level based neural networks of word embedding sequences with sharing layer information | |
CN114625868A (en) | Electric power data text classification algorithm based on selective ensemble learning | |
Sami et al. | Incorporating random forest trees with particle swarm optimization for automatic image annotation | |
Pavithra et al. | A review article on semi-supervised clustering framework for high dimensional data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||