CN106815369A - A kind of file classification method based on Xgboost sorting algorithms - Google Patents

A kind of file classification method based on Xgboost sorting algorithms Download PDF

Info

Publication number
CN106815369A
CN106815369A (application CN201710060026.9A)
Authority
CN
China
Prior art keywords
text
word
xgboost
sample
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710060026.9A
Other languages
Chinese (zh)
Other versions
CN106815369B (en
Inventor
庞宇明
任江涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Sun Yat Sen University filed Critical National Sun Yat Sen University
Priority to CN201710060026.9A priority Critical patent/CN106815369B/en
Publication of CN106815369A publication Critical patent/CN106815369A/en
Application granted granted Critical
Publication of CN106815369B publication Critical patent/CN106815369B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method provided by the invention extracts label words via Labeled-LDA to compute feature values, and then performs text classification with the XGBoost classification algorithm. Compared with methods that build the feature space from an ordinary vector space model and classify with a conventional algorithm, it consumes far less memory: the words contained in Chinese text number in the millions, so word-level features yield a very high-dimensional space that consumes enormous memory and may not even fit on a single machine, whereas Chinese characters in common use number fewer than 10,000, often only around 1,000 in practice, so the dimensionality drops substantially; moreover, XGBoost supports input in dictionary rather than matrix form. The invention also proposes a novel supervised feature selection algorithm with latent semantics, the Labeled-LDA algorithm: performing feature selection with Labeled-LDA both mines the semantic information of a large corpus through LDA and exploits the class information contained in the text labels. Preprocessing is simple and no painstaking feature engineering is required; combined with the powerful distributed ensemble learning algorithm XGBoost, the accuracy and performance of classification are improved.

Description

A text classification method based on the XGBoost classification algorithm
Technical field
The present invention relates to the field of text classification, and in particular to a text classification method based on the XGBoost classification algorithm.
Background technology
Text classification methods have been widely applied in fields such as search engines, personalized recommendation systems, and public opinion monitoring, and are a key link in efficiently managing and accurately locating massive amounts of information.
The conventional framework for text classification is based on machine learning classification algorithms and comprises steps such as data preprocessing, feature extraction, feature selection, and feature classification.
Feature extraction represents text with a unified method and model; the method or model captures the characteristics of the text and can readily be converted into mathematical language, and hence into a mathematical model that a computer can process. The currently popular text representation methods are the vector space model and latent semantic analysis models. The vector space model has the advantages of simplicity, convenient computation, and little preprocessing, but also the shortcomings of ignoring semantic relations between features and representing text with insufficient depth.
Feature selecting is one of key issue of machine learning, and the quality of feature selecting result directly affects grader Nicety of grading and Generalization Capability.Feature selecting is that some maximally effective features are picked out from one group of feature so as to reduce feature sky Between dimension, and reach and reject uncorrelated or redundancy feature, reduce run time and improve the intelligibility of analysis result, find Structure hidden in high dimensional data and other effects.Whether there is classification information according to data, feature selecting can be divided into supervision and nothing Supervise two classes.The feature selection approach commonly used in text classification has:Document frequencies, mutual information, information gain and chi The methods such as amount (CHI).
Feature classification is the final and most important link in text classification. The naive Bayes algorithm is a typical classification algorithm: using Bayes' rule, it computes the probability that a text belongs to a particular category. It requires very few estimated parameters, is relatively insensitive to missing data, is simple and fast, and in theory has the lowest error rate among classification algorithms; in practice this is not always the case, because the method assumes the attributes are mutually independent, an assumption that often fails in real applications. Decision tree algorithms are easy to understand and explain, can handle both categorical and numerical attributes, and produce workable, effective results on large data sources in relatively short time; but decision trees handle missing data poorly and tend to ignore correlations between attributes in the data set. Other feature classification methods include LR, KNN, and SVM. All of these are base learners; ensemble learning, by combining multiple learners, often achieves significantly better generalization than any single learner. According to how the individual learners are generated, current ensemble methods divide into two classes: those with strong dependencies between individual learners, which must be generated serially (sequential methods), and those without strong dependencies, which can be generated in parallel. The former is represented by Boosting, the latter by Bagging and random forests. From the viewpoint of the bias-variance decomposition, Boosting mainly reduces bias, so it can build a very strong ensemble out of learners with quite weak generalization performance. Bagging mainly reduces variance, so its effect is most apparent on learners that are easily perturbed by the sample, such as unpruned decision trees and neural networks. The XGBoost classification algorithm is an ensemble learning method based on Boosting; compared with the traditional ensemble algorithm GBDT, it has the following eight main advantages:
First, whereas traditional GBDT uses CART as the base classifier, the XGBoost classification algorithm also supports linear classifiers.
Second, traditional GBDT uses only first-order derivative information in optimization, while XGBoost performs a second-order Taylor expansion of the cost function, using both first and second derivatives.
Third, XGBoost adds a regularization term to the cost function to control model complexity. The term comprises the number of leaf nodes of the tree and the sum of squares of the L2 norm of the score output at each leaf node. From the viewpoint of the bias-variance tradeoff, the regularization term reduces the model's variance, yields a simpler learned model, and prevents overfitting.
Fourth, shrinkage, equivalent to a learning rate. After each iteration, XGBoost multiplies the leaf-node weights by this coefficient, mainly to weaken the influence of each individual tree and leave more learning room for later trees.
Fifth, column subsampling. XGBoost borrows this from random forests; it not only reduces overfitting but also reduces computation.
Sixth, handling of missing values. For samples whose feature values are missing, XGBoost can automatically learn their splitting direction.
Seventh, parallelism. Boosting is inherently sequential, so XGBoost's parallelism is not across trees; an iteration can only begin after the previous one finishes (the cost function of iteration t contains the predictions of iteration t-1). Instead, XGBoost parallelizes at the feature granularity. The most time-consuming step of learning a decision tree is sorting the feature values (to determine the best split point); before training, XGBoost pre-sorts the data and saves it in a block structure that is reused in later iterations, greatly reducing computation. The block structure also makes parallelism possible: when splitting a node, the gain of every feature must be computed and the feature with the largest gain chosen to split on, so the gain computations of the different features can run on multiple threads.
Eighth, a parallelizable approximate histogram algorithm. When splitting a tree node, the gain of every candidate split point of every feature must be computed, i.e., all possible split points are enumerated greedily. When the data cannot be loaded into memory at once, or in distributed settings, the greedy algorithm becomes very inefficient, so XGBoost also proposes a parallelizable approximate histogram algorithm that efficiently generates candidate split points.
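The second-order gain (advantages 2 and 3) and the pre-sorted single-pass split scan (advantage 7) can be sketched in a few lines. This is a simplified single-feature illustration, not the library's implementation; the function names and the default regularization values are my own assumptions:

```python
def split_gain(gl, hl, gr, hr, lam=1.0, gamma=0.0):
    """XGBoost-style structure-score gain for one candidate split.
    gl/hl: sums of first/second derivatives in the left child;
    gr/hr: the same sums for the right child (both derivative orders
    are used, per advantage 2). lam penalises leaf weights (advantage 3),
    gamma penalises adding a leaf."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma

def best_split(values, grads, hess, lam=1.0):
    """Scan one pre-sorted feature column in a single pass (advantage 7)
    and return (best gain, split value)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gl = hl = 0.0
    gtot, htot = sum(grads), sum(hess)
    best = (float("-inf"), None)
    for pos in order[:-1]:
        gl += grads[pos]
        hl += hess[pos]
        gain = split_gain(gl, hl, gtot - gl, htot - hl, lam)
        if gain > best[0]:
            best = (gain, values[pos])
    return best

g, v = best_split([1, 2, 3, 4], [-1.0, -1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0])
assert v == 2  # the gradients change sign between the 2nd and 3rd points
```

The real library additionally routes missing values to a learned default direction and replaces the exact scan with the approximate histogram of advantage 8.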
Summary of the invention
To overcome the defects of the prior art described above, namely low classification performance, large memory consumption, and low classification accuracy, the present invention provides a text classification method based on the XGBoost classification algorithm.
To achieve the above objects of the invention, the adopted technical scheme is:
A text classification method based on the XGBoost classification algorithm, comprising the following steps:
S1. Obtain multiple samples, each sample comprising text content and the text's label;
S2. Divide all samples obtained in step S1 into training samples and prediction samples in a given proportion, the training samples forming the training set and the prediction samples forming the prediction set;
S3. For each training sample, separate every pair of adjacent characters in its text content with a space; then input the training sample's label as the Labeled-LDA label and the training sample's text content as the Labeled-LDA text;
S4. Set the number of Labeled-LDA iterations to K, then iteratively train on the training samples;
S5. After the iterations, each training sample yields two documents: one maps words to their word codes, the other maps topics to word codes, i.e., records the number of times each word code appears under each topic; merge the two documents to obtain the number of times each word in the training samples appears under each topic; for each topic, rank the words by occurrence count and select the m words most related to the topic as the training sample's LLDA words;
S6. For each training sample, count the occurrences in its text content of each LLDA word obtained in step S5 and take the count as the value of that feature; feed the resulting value of every sample on every LLDA word into the XGBoost classification algorithm, then train the XGBoost classifier;
S7. The model is now trained and the prediction set must be predicted: apply steps S3~S5 to the prediction set, then use the trained model to classify each prediction sample in the prediction set.
Preferably, in step S6, if an LLDA word occurs zero times in a text, that LLDA word is not fed into the XGBoost classification algorithm; that is, whereas an ordinary classification algorithm takes matrix-format input, this method takes dictionary-format input, which saves memory and handles missing values more conveniently, quickly, and reasonably.
Preferably, K is 1000.
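Step S6 together with the preferred dictionary-format input can be sketched as follows; the function name and the toy sample are my own, not from the patent:

```python
from collections import Counter

def llda_features(text, llda_words):
    """Step S6 sketch: count how often each selected LLDA word occurs in a
    sample's text (space-separated per step S3) and keep only non-zero
    counts, matching the dictionary-style input fed to XGBoost."""
    counts = Counter(text.split())
    return {w: counts[w] for w in llda_words if counts[w] > 0}

sample = "机 器 学 习 文 本 分 类 文 本"
feats = llda_features(sample, ["文", "本", "类", "图"])
assert feats == {"文": 2, "本": 2, "类": 1}  # "图" is absent, so it is omitted
```

Omitting the zero-count entries is exactly what keeps the representation sparse and memory-friendly.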
Compared with the prior art, the beneficial effects of the invention are:
The method provided by the invention extracts label words via Labeled-LDA to compute feature values, and then performs text classification with the XGBoost classification algorithm. Compared with methods that build the feature space from an ordinary vector space model and classify with a conventional algorithm, it consumes far less memory: the words contained in Chinese text number in the millions, so word-level features yield a very high-dimensional space that consumes enormous memory and may not even fit on a single machine, whereas Chinese characters in common use number fewer than 10,000, often only around 1,000 in practice, so the dimensionality drops substantially; moreover, XGBoost supports input in dictionary rather than matrix form. The invention also proposes a novel supervised feature selection algorithm with latent semantics, the Labeled-LDA algorithm: performing feature selection with Labeled-LDA both mines the semantic information of a large corpus through LDA and exploits the class information contained in the text labels. Preprocessing is simple and no painstaking feature engineering is required; combined with the powerful distributed ensemble learning algorithm XGBoost, the accuracy and performance of classification are improved.
Brief description of the drawings
Fig. 1 is the flow chart of method.
Specific embodiment
The accompanying drawing is for illustration only and shall not be construed as a limitation of this patent.
The present invention is further described below with reference to the drawing and the embodiments.
Embodiment 1
This embodiment comprises three concrete cases, classifying three text corpora with different characteristics: one public English corpus, WebKB, with samples lacking any content removed; and two Chinese corpora, one of which is a public long-text corpus, the Fudan University text classification corpus, and the other a highly imbalanced Chinese short-text corpus, news analysis, divided into the two classes normal and advertisement with a positive-to-negative ratio on the order of 2742/42416 = 0.065.
Table 1: Summary of the text classification data sets
As shown in Fig. 1, the specific implementation steps of the present text classification method, which uses Labeled-LDA for feature extraction and the XGBoost classification algorithm for classification, are as follows:
Step 1: Text preprocessing
A batch of classified text sets is prepared in advance, e.g. the three cases above. The training set and prediction set are randomly divided in an 8:2 ratio (news analysis); if a public data set already provides a division (WebKB and Fudan University text classification), it is used directly. All texts are cleaned and uniformly encoded as UTF-8. For Chinese text, every character is separated with a space for convenient downstream processing. Because the present invention uses character features, the commonly used characters number no more than 10,000, and Labeled-LDA has strong feature selection ability, the preprocessing conventional in Chinese text classification, such as removing punctuation, digits, and stop words, can be dispensed with; the experimental results also show that such preprocessing is unnecessary. For English text, preprocessing consists of converting all capitals to lowercase and replacing punctuation with spaces, which better delimits word boundaries and matches English writing style.
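As a rough sketch of this preprocessing step (the function names are mine; the patent specifies only the behavior):

```python
import string

def preprocess_chinese(text):
    """Chinese preprocessing per step 1: the method is character-based, so
    it only inserts a space between every pair of adjacent characters;
    punctuation and stop-word removal are left to Labeled-LDA's selection."""
    return " ".join(text.replace(" ", ""))

def preprocess_english(text):
    """English preprocessing per step 1: lowercase all capitals and replace
    punctuation with spaces to keep word boundaries."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)

assert preprocess_chinese("文本分类") == "文 本 分 类"
assert preprocess_english("Hello, World!") == "hello  world "
```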
Step 2: Feature selection on the training set with Labeled-LDA
The experiments use the Stanford natural language processing tool TMT. The Labeled-LDA parameters are set first, including the text segmentation mode and text filtering mode, the number of labels, the number of iterations, and so on. Text is segmented on spaces; the number of labels depends on the actual corpus; the number of iterations is uniformly 1000. After training ends, two files are obtained: one maps words to word codes, the other records the number of times each word code appears under each topic. Merging the two files yields the number of occurrences of each word under each topic. Then, under each topic, the words are sorted by occurrence count and the top N are taken; the words so obtained are exactly those retained by Labeled-LDA feature selection. They may contain duplicates; after de-duplication they are the LLDA words. These words carry richer semantic information and, being the result of training under the strong supervision signal of the labels, have strong feature representation ability.
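The merge-and-rank selection described here can be sketched as follows, assuming the merged per-topic word counts are already in a dictionary; the function name and the toy numbers are illustrative only:

```python
def select_llda_words(topic_word_counts, top_n):
    """Step 2 sketch: given per-topic word counts (the merge of the two TMT
    output files), keep the top_n most frequent words under every topic and
    take their union, removing duplicates, to form the LLDA word set."""
    selected = set()
    for topic, counts in topic_word_counts.items():
        ranked = sorted(counts, key=counts.get, reverse=True)
        selected.update(ranked[:top_n])
    return selected

counts = {
    "sports": {"ball": 9, "game": 7, "team": 5, "the": 2},
    "finance": {"stock": 8, "game": 6, "bank": 4, "the": 1},
}
# "game" ranks highly under both topics but appears only once after dedup.
assert select_llda_words(counts, 2) == {"ball", "game", "stock"}
```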
Step 3: Converting the training set and prediction set into XGBoost's input format
XGBoost's input format is simple, convenient, and reasonable. For each sample of the training set and prediction set, since characters are treated as features, whenever the current character being processed is an LLDA word, the feature value corresponding to that LLDA word is incremented by 1; for each sample, an LLDA word that does not occur has feature value 0 for the corresponding feature, and that feature is simply not included in the XGBoost input;
Step 4: Parameter setting, training, and prediction of the XGBoost classification algorithm
The XGBoost parameters are set by selecting those with the best overall effect. The final XGBoost parameters are: 200 iterations; number of classes as appropriate to the data set; classifier type gbtree; objective function multi:softmax; learning rate eta = 0.3; maximum tree depth max_depth = 6; and so on. The trained model then classifies the prediction set;
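For illustration, the settings listed above could be collected into the kind of parameter dictionary that the xgboost package's training API accepts; the num_class value here is only an example (WebKB has 4 categories), and the actual xgboost training calls are omitted:

```python
# Hypothetical parameter dictionary mirroring step 4; the xgboost package
# itself is assumed and deliberately not imported here.
params = {
    "booster": "gbtree",           # classifier type
    "objective": "multi:softmax",  # multi-class objective
    "eta": 0.3,                    # learning rate (shrinkage, advantage 4)
    "max_depth": 6,                # maximum tree depth
    "num_class": 4,                # e.g. WebKB has 4 categories
}
num_boost_round = 200              # the 200 iterations set in step 4
```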
Step 5: Evaluating model performance
Because the data sets differ greatly, different evaluation criteria are used for different data sets so that the evaluation is more reasonable. WebKB has 4 categories and is evaluated with the micro-averaged F1 (micro-F1) and macro-averaged F1 (macro-F1), defined as follows, where P_i is the precision of the i-th class, R_i the recall of the i-th class, and n the number of classes:
macro-F1 = (1/n) * sum_{i=1..n} [ 2 * P_i * R_i / (P_i + R_i) ]
micro-F1 pools the counts of all n classes into a single precision P and recall R and takes 2 * P * R / (P + R).
Fudan University text classification is evaluated with the micro-averaged precision (micro-P); news analysis, whose classes are especially imbalanced, is evaluated with the precision P, recall R, and F value of the negative samples.
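The two averages can be sketched directly from these definitions; the function names and the toy numbers are mine, not from the patent:

```python
def macro_f1(per_class_pr):
    """Macro-averaged F1: average each class's F1, weighting classes equally.
    per_class_pr is a list of (precision, recall) pairs, one per class."""
    f1s = [2 * p * r / (p + r) if p + r else 0.0 for p, r in per_class_pr]
    return sum(f1s) / len(f1s)

def micro_f1(per_class_counts):
    """Micro-averaged F1: pool true positives, false positives, and false
    negatives over all classes before computing precision and recall.
    per_class_counts is a list of (tp, fp, fn) triples, one per class."""
    tp = sum(c[0] for c in per_class_counts)
    fp = sum(c[1] for c in per_class_counts)
    fn = sum(c[2] for c in per_class_counts)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

assert abs(macro_f1([(1.0, 1.0), (0.5, 0.5)]) - 0.75) < 1e-9
```

Micro averaging weights every document equally, so it favours the large classes; macro averaging weights every class equally, which is why both are reported for WebKB.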
The model defined by the present invention is denoted LLDA-XG. The experimental results are as follows:
Table 2: Classification performance on the WebKB corpus

          RCC    Centroid  NB     Winnow  SVM    LLDA-XG
micro-F1  89.64  79.95     86.45  81.40   89.26  91.40
macro-F1  88.02  78.47     86.01  77.85   87.66  90.48
Table 3: Classification performance on the news analysis corpus

   Bayes  SGDC   Bagging+KNN  SVM    RF     GBDT   Adaboost  LLDA-XG
P  69.55  77.71  97.00        93.65  96.40  92.22  95.80     94.52
R  91.62  91.98  85.20        86.81  89.84  88.77  85.56     92.34
F  79.08  84.24  90.70        90.10  92.99  90.46  90.40     93.42
Table 4: Classification performance on the Fudan University text classification corpus
As can be seen from Tables 2, 3, and 4, the LLDA-XG model achieves good results on the WebKB corpus, the news analysis corpus, and the Fudan University text classification corpus alike. On WebKB, both micro-F1 and macro-F1 exceed 90. On the news analysis corpus, the precision of LLDA-XG is not the highest, but its recall and F value both are, showing that LLDA-XG can keep precision high while also keeping recall high, with more emphasis on balanced performance. On the Fudan University text classification corpus, the experimental reference is Dr. Lai Siwei's doctoral dissertation at the University of the Chinese Academy of Sciences, which combines neural networks with word vectors; LLDA-XG performs outstandingly, with classification accuracy higher than the recurrent convolutional neural network, far less preprocessing, and a fast running time: an ordinary single computer obtains the result in four or five minutes, whereas the neural network requires a large amount of computation time.
Table 5: Classification time performance of LLDA-XG on the WebKB corpus

Number of features  215    411    925    7288
micro-F1            90.24  90.82  91.40  91.61
macro-F1            89.19  89.98  90.48  90.69
Time (s)            3.740  4.560  5.759  8.984
Table 6: Classification time performance of LLDA-XG on the news analysis corpus

Number of features  154    289    674    4154
P                   94.39  93.91  94.53  94.13
R                   90.02  90.73  92.34  91.44
F                   92.15  92.29  93.42  92.77
Time (s)            6.739  7.716  8.756  10.589
Table 7: Classification time performance of LLDA-XG on the Fudan University text classification corpus

Number of features  408      665      1329     6707
Accuracy            94.59    95.20    95.41    95.39
Time (s)            145.701  206.362  278.524  342.354
In Table 5, when the number of features rises from 215 to 411, micro-F1 rises by 0.58 and macro-F1 by 0.79; from 411 to 925, micro-F1 rises by 0.58 and macro-F1 by 0.5; from 925 to 7288, the number of features grows about 7-fold while micro-F1 and macro-F1 rise by only 0.21, a negligible gain, and the running time nearly doubles. In Table 6, when the number of features rises from 154 to 289, the precision P actually falls by 0.48 while the recall R rises by 0.71 and the F value by 0.14; from 289 to 674, precision rises by 0.62, recall by 1.61, and the F value by 1.13; but from 674 to 4154, precision falls by 0.4, recall by 0.9, and the F value by 0.65. On the Fudan University text classification task, accuracy rises by 0.61 when the number of features goes from 408 to 665 and by 0.21 from 665 to 1329, but falls by 0.02 from 1329 to 6707. Tables 5, 6, and 7 show that the more features the LLDA-XG model uses, the longer it takes, while classification performance grows slowly and even declines once there are too many features, probably overfitting caused by overtraining. This also illustrates the importance of feature selection: often a small minority of features has the largest influence on classification performance, while the feature set contains a large number of redundant or even noisy features; the more features, the longer the running time, with very limited performance gains or even overfitting. It also shows that Labeled-LDA has very strong feature selection ability: whether for Chinese or English text, long or short, it selects very high-quality features from a large pool, so the overall running time falls while performance stays stable and efficient. Moreover, the feature extraction of the invention is character-based and very convenient, requiring neither manual work nor expert knowledge and ample resources to extract features, and its simple preprocessing takes little time. Combined with the speed and efficiency of the XGBoost classification algorithm and its great ability to handle missing values, the overall performance is superior and stable. In summary, experiments verify that the proposed text classification model LLDA-XG can be widely applied to text classification tasks; its preprocessing is simple and fast, its overall running time and performance are both outstanding, and it has very high robustness and practical value.
Obviously, the above embodiment is merely an example given to clearly illustrate the present invention and is not a limitation on the embodiments of the invention. Those of ordinary skill in the art may make other changes in different forms on the basis of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims of the present invention.

Claims (3)

1. A text classification method based on the XGBoost classification algorithm, characterized by comprising the following steps:
S1. Obtain multiple samples, each sample comprising text content and the text's label;
S2. Divide all samples obtained in step S1 into training samples and prediction samples in a given proportion, the training samples forming the training set and the prediction samples forming the prediction set;
S3. For each training sample, separate every pair of adjacent characters in its text content with a space; then input the training sample's label as the Labeled-LDA label and the training sample's text content as the Labeled-LDA text;
S4. Set the number of Labeled-LDA iterations to K, then iteratively train on the training samples;
S5. After the iterations, each training sample yields two documents: one maps words to their word codes, the other maps topics to word codes, i.e., records the number of times each word code appears under each topic; merge the two documents to obtain the number of times each word in the training samples appears under each topic; for each topic, rank the words by occurrence count and select the m words most related to the topic as the training sample's LLDA words;
S6. For each training sample, count the occurrences in its text content of each LLDA word obtained in step S5 and take the count as the value of that feature; feed the resulting value of every sample on every LLDA word into the XGBoost classification algorithm, then train the XGBoost classifier;
S7. The model is now trained and the prediction set must be predicted: apply steps S3~S5 to the prediction set, then use the trained model to classify each prediction sample in the prediction set.
2. The text classification method based on the XGBoost classification algorithm according to claim 1, characterized in that in step S6, if an LLDA word occurs zero times in a text, that LLDA word is not fed into the XGBoost classification algorithm.
3. The text classification method based on the XGBoost classification algorithm according to claim 1, characterized in that K is 1000.
CN201710060026.9A 2017-01-24 2017-01-24 A kind of file classification method based on Xgboost sorting algorithm Active CN106815369B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710060026.9A CN106815369B (en) 2017-01-24 2017-01-24 A kind of file classification method based on Xgboost sorting algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710060026.9A CN106815369B (en) 2017-01-24 2017-01-24 A kind of file classification method based on Xgboost sorting algorithm

Publications (2)

Publication Number Publication Date
CN106815369A true CN106815369A (en) 2017-06-09
CN106815369B CN106815369B (en) 2019-09-20

Family

ID=59112460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710060026.9A Active CN106815369B (en) 2017-01-24 2017-01-24 A kind of file classification method based on Xgboost sorting algorithm

Country Status (1)

Country Link
CN (1) CN106815369B (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463965A (en) * 2017-08-16 2017-12-12 湖州易有科技有限公司 Deep learning-based fabric attribute image acquisition and recognition method and system
CN107798083A (en) * 2017-10-17 2018-03-13 广东广业开元科技有限公司 Big data-based information recommendation method, system and device
CN107992531A (en) * 2017-11-21 2018-05-04 吉浦斯信息咨询(深圳)有限公司 News personalized intelligent recommendation method and system based on deep learning
CN108090201A (en) * 2017-12-20 2018-05-29 珠海市君天电子科技有限公司 Article content classification method, apparatus and electronic device
CN108257052A (en) * 2018-01-16 2018-07-06 中南大学 Online student knowledge assessment method and system
CN108536650A (en) * 2018-04-03 2018-09-14 北京京东尚科信息技术有限公司 Method and device for generating gradient boosting tree model
CN108563638A (en) * 2018-04-13 2018-09-21 武汉大学 Microblog sentiment analysis method based on topic identification and ensemble learning
CN108627798A (en) * 2018-04-04 2018-10-09 北京工业大学 WLAN indoor positioning algorithm based on linear discriminant analysis and gradient boosting tree
CN108868805A (en) * 2018-06-08 2018-11-23 西安电子科技大学 Shield deviation correction method based on statistical analysis combined with XGBoost
CN108985335A (en) * 2018-06-19 2018-12-11 中国原子能科学研究院 Ensemble learning prediction method for void swelling of nuclear reactor cladding materials
CN109543187A (en) * 2018-11-23 2019-03-29 中山大学 Method, device and storage medium for generating electronic medical record features
CN109754002A (en) * 2018-12-24 2019-05-14 上海大学 Hybrid ensemble steganalysis method based on deep learning
CN109933667A (en) * 2019-03-19 2019-06-25 中国联合网络通信集团有限公司 Text classification model training method, text classification method and device
CN110318327A (en) * 2019-06-10 2019-10-11 长安大学 Road surface evenness prediction method based on random forest
CN110334732A (en) * 2019-05-20 2019-10-15 北京思路创新科技有限公司 Urban air pollution prediction method and device based on machine learning
CN111079031A (en) * 2019-12-27 2020-04-28 北京工业大学 Weighted classification method for blog disaster information importance based on deep learning and the XGBoost algorithm
WO2020082734A1 (en) * 2018-10-24 2020-04-30 平安科技(深圳)有限公司 Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium
TWI705341B (en) * 2018-08-22 2020-09-21 香港商阿里巴巴集團服務有限公司 Feature relationship recommendation method and device, computing equipment and storage medium
CN111753058A (en) * 2020-06-30 2020-10-09 北京信息科技大学 Text viewpoint mining method and system
CN111831824A (en) * 2020-07-16 2020-10-27 民生科技有限责任公司 Public opinion positive and negative sentiment classification method
CN111859074A (en) * 2020-07-29 2020-10-30 东北大学 Internet public opinion information source influence assessment method and system based on deep learning
CN112699944A (en) * 2020-12-31 2021-04-23 中国银联股份有限公司 Order-return processing model training method, processing method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060142993A1 (en) * 2004-12-28 2006-06-29 Sony Corporation System and method for utilizing distance measures to perform text classification
CN104063472A (en) * 2014-06-30 2014-09-24 电子科技大学 KNN text classifying method for optimizing training sample set
CN104331498A (en) * 2014-11-19 2015-02-04 亚信科技(南京)有限公司 Method for automatically classifying webpage content visited by Internet users

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Wenbo: "A New Text Classification Algorithm Based on the Labeled-LDA Model", Chinese Journal of Computers *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463965B (en) * 2017-08-16 2024-03-26 湖州易有科技有限公司 Deep learning-based fabric attribute picture acquisition and recognition method and recognition system
CN107463965A (en) * 2017-08-16 2017-12-12 湖州易有科技有限公司 Deep learning-based fabric attribute image acquisition and recognition method and system
CN107798083A (en) * 2017-10-17 2018-03-13 广东广业开元科技有限公司 Big data-based information recommendation method, system and device
CN107992531A (en) * 2017-11-21 2018-05-04 吉浦斯信息咨询(深圳)有限公司 News personalized intelligent recommendation method and system based on deep learning
CN107992531B (en) * 2017-11-21 2020-11-27 吉浦斯信息咨询(深圳)有限公司 News personalized intelligent recommendation method and system based on deep learning
CN108090201A (en) * 2017-12-20 2018-05-29 珠海市君天电子科技有限公司 Article content classification method, apparatus and electronic device
CN108257052B (en) * 2018-01-16 2022-04-22 中南大学 Online student knowledge assessment method and system
CN108257052A (en) * 2018-01-16 2018-07-06 中南大学 Online student knowledge assessment method and system
CN108536650B (en) * 2018-04-03 2022-04-26 北京京东尚科信息技术有限公司 Method and device for generating gradient boosting tree model
CN108536650A (en) * 2018-04-03 2018-09-14 北京京东尚科信息技术有限公司 Method and device for generating gradient boosting tree model
CN108627798A (en) * 2018-04-04 2018-10-09 北京工业大学 WLAN indoor positioning algorithm based on linear discriminant analysis and gradient boosting tree
CN108627798B (en) * 2018-04-04 2022-03-11 北京工业大学 WLAN indoor positioning algorithm based on linear discriminant analysis and gradient boosting tree
CN108563638A (en) * 2018-04-13 2018-09-21 武汉大学 Microblog sentiment analysis method based on topic identification and ensemble learning
CN108563638B (en) * 2018-04-13 2021-08-10 武汉大学 Microblog sentiment analysis method based on topic identification and ensemble learning
CN108868805A (en) * 2018-06-08 2018-11-23 西安电子科技大学 Shield deviation correction method based on statistical analysis combined with XGBoost
CN108868805B (en) * 2018-06-08 2019-11-15 西安电子科技大学 Shield deviation correction method based on statistical analysis combined with XGBoost
CN108985335A (en) * 2018-06-19 2018-12-11 中国原子能科学研究院 Ensemble learning prediction method for void swelling of nuclear reactor cladding materials
CN108985335B (en) * 2018-06-19 2021-04-27 中国原子能科学研究院 Ensemble learning prediction method for irradiation swelling of nuclear reactor cladding material
TWI705341B (en) * 2018-08-22 2020-09-21 香港商阿里巴巴集團服務有限公司 Feature relationship recommendation method and device, computing equipment and storage medium
US11244232B2 (en) 2018-08-22 2022-02-08 Advanced New Technologies Co., Ltd. Feature relationship recommendation method, apparatus, computing device, and storage medium
WO2020082734A1 (en) * 2018-10-24 2020-04-30 平安科技(深圳)有限公司 Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium
CN109543187A (en) * 2018-11-23 2019-03-29 中山大学 Method, device and storage medium for generating electronic medical record features
CN109543187B (en) * 2018-11-23 2021-09-17 中山大学 Method, device and storage medium for generating electronic medical record features
CN109754002A (en) * 2018-12-24 2019-05-14 上海大学 Hybrid ensemble steganalysis method based on deep learning
CN109933667A (en) * 2019-03-19 2019-06-25 中国联合网络通信集团有限公司 Text classification model training method, text classification method and device
CN110334732A (en) * 2019-05-20 2019-10-15 北京思路创新科技有限公司 Urban air pollution prediction method and device based on machine learning
CN110318327A (en) * 2019-06-10 2019-10-11 长安大学 Road surface evenness prediction method based on random forest
CN111079031B (en) * 2019-12-27 2023-09-12 北京工业大学 Weighted classification method for blog disaster information importance based on deep learning and the XGBoost algorithm
CN111079031A (en) * 2019-12-27 2020-04-28 北京工业大学 Weighted classification method for blog disaster information importance based on deep learning and the XGBoost algorithm
CN111753058B (en) * 2020-06-30 2023-06-02 北京信息科技大学 Text viewpoint mining method and system
CN111753058A (en) * 2020-06-30 2020-10-09 北京信息科技大学 Text viewpoint mining method and system
CN111831824B (en) * 2020-07-16 2024-02-09 民生科技有限责任公司 Public opinion positive and negative sentiment classification method
CN111831824A (en) * 2020-07-16 2020-10-27 民生科技有限责任公司 Public opinion positive and negative sentiment classification method
CN111859074A (en) * 2020-07-29 2020-10-30 东北大学 Internet public opinion information source influence assessment method and system based on deep learning
CN111859074B (en) * 2020-07-29 2023-12-29 东北大学 Network public opinion information source influence evaluation method and system based on deep learning
CN112699944A (en) * 2020-12-31 2021-04-23 中国银联股份有限公司 Order-return processing model training method, processing method, device, equipment and medium
CN112699944B (en) * 2020-12-31 2024-04-23 中国银联股份有限公司 Order-return processing model training method, processing method, device, equipment and medium

Also Published As

Publication number Publication date
CN106815369B (en) 2019-09-20

Similar Documents

Publication Publication Date Title
CN106815369B (en) Text classification method based on the XGBoost classification algorithm
Hidayat et al. Sentiment analysis of twitter data related to Rinca Island development using Doc2Vec and SVM and logistic regression as classifier
CN109740154A Fine-grained sentiment analysis method for online comments based on multi-task learning
CN111125358B (en) Text classification method based on hypergraph
Li et al. Nonparametric bayes pachinko allocation
CN107301171A Text sentiment analysis method and system based on sentiment dictionary learning
CN107025284A Recognition method for sentiment tendency of online comment text and convolutional neural network model
CN109376242A Text classification algorithm based on recurrent neural network variants and convolutional neural networks
CN107944480A Enterprise industry classification method
CN108388651A Text classification method based on graph kernels and convolutional neural networks
CN108108355A Text sentiment analysis method and system based on deep learning
CN111339754B (en) Case public opinion abstract generation method based on case element sentence association graph convolution
CN109508453A Cross-media information target component correlation analysis system and association analysis method
CN105868184A (en) Chinese name recognition method based on recurrent neural network
CN109189926A Construction method for a technical paper corpus
CN101587493A (en) Text classification method
CN105373606A (en) Unbalanced data sampling method in improved C4.5 decision tree algorithm
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN106815310A Hierarchical clustering method and system for massive document sets
CN107145516A Text clustering method and system
CN107679135A Topic detection and tracking method and device for network text big data
CN109062958B (en) Primary school composition automatic classification method based on TextRank and convolutional neural network
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN111813939A (en) Text classification method based on representation enhancement and fusion
CN103268346A (en) Semi-supervised classification method and semi-supervised classification system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant