CN107066553A - A short text classification method based on convolutional neural network and random forest - Google Patents

A short text classification method based on convolutional neural network and random forest

Info

Publication number
CN107066553A
CN107066553A
Authority
CN
China
Prior art keywords
feature
random forest
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710181062.0A
Other languages
Chinese (zh)
Other versions
CN107066553B (en)
Inventor
刘泽锦
王洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201710181062.0A priority Critical patent/CN107066553B/en
Publication of CN107066553A publication Critical patent/CN107066553A/en
Application granted granted Critical
Publication of CN107066553B publication Critical patent/CN107066553B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a short text classification method based on a convolutional neural network and a random forest, belonging to the fields of text classification and deep learning. To address the insufficient generalization ability caused by using Softmax as the classifier of a convolutional neural network, a short text classification algorithm combining a convolutional neural network and a random forest (CNN-RF) is proposed. The method first proposes a dual word-vector convolutional neural network to fully extract the high-order features of short texts, and then uses a random forest as the classifier of those high-order features, thereby improving short text classification performance. Results on three public experimental datasets show that CNN-RF has a clear advantage over other algorithms on multiple evaluation metrics.

Description

A short text classification method based on a convolutional neural network and a random forest
Technical field
The invention belongs to the fields of text classification and deep learning and relates to a short text classification method based on a convolutional neural network and a random forest. It is applicable to tasks such as the classification or sentiment classification of massive short text data, for example microblogs, SMS messages and user queries, and can also be used in system services such as search engines and information retrieval.
Background art
With the rapid development of the Internet in recent years, various information exchange platforms produce large amounts of short text (Short Text). These short texts touch every field of daily life and have become a frequent and widely accepted form of communication; e-commerce reviews, web information retrieval and intelligent question-answering systems, for example, are all sources of massive short text. How to mine useful information from massive short text has been widely studied in recent years. Text classification is an effective method of text mining, but because short texts are short and their term features are sparse, traditional long-text classification techniques are no longer applicable. Short text classification (Short Text Classification) can, to some extent, address the challenges faced in short text applications; it is a research hotspot for scholars at home and abroad and an important task in the field of natural language processing (NLP). Current text classification methods are mainly based on statistical learning or machine learning: a classifier is trained on a manually annotated corpus using statistical or machine learning methods and then applied to the dataset to be classified. Mainstream machine learning methods include Naive Bayes (NB), the Support Vector Machine (SVM), Logistic Regression (LR), Softmax Regression (SR), Random Forest (RF) and Deep Neural Networks (DNN). Long-text classification methods that have been successful are difficult to apply directly to short text classification, so classification algorithms for short text have become an urgent research problem. The main challenges of short text classification are:
1) Short text keyword features are sparse. Compared with long texts rich in terms, a short text often contains only a few effective keywords, and when the text is represented with a vector space model it is difficult to fully mine the associations between features;
2) In open domains (such as microblogs and search engines), information updates quickly; a single short text carries little information, yet the total volume of text is large and the overlap between items of information is small;
3) New words, new terminology and colloquialisms appear in large numbers and are generally difficult for existing classification systems to handle.
Scholars at home and abroad have carried out meaningful research on the short text classification problem. The first class of approaches is based on short text feature expansion: Bouaziz et al. use the Latent Dirichlet Allocation (LDA) model to learn the distribution of topics, and of words over topics, in Wikipedia data, expand short texts with frequent words under the same topic, perform feature selection on the expansion words with random semantic forests, and then classify. Other scholars obtain word co-occurrence pattern sets through association rule mining (FP-Growth) as the basis for text feature expansion, using word-relation confidence as the weight during expansion before classifying. XH Phan et al. build a global corpus by crawling massive Internet data, learn an LDA topic model of the global corpus, then perform model estimation on the short text corpus to be classified with the global LDA topic model to obtain the topic distribution of each short text, use that distribution to expand the features of the short text, and finally classify. Methods of this first class inevitably introduce noise during short text expansion, which degrades classification performance.
The second class comprises methods based on deep learning: Socher et al. applied the Recursive Neural Network (RNN) to sentence-level sentiment analysis and achieved notable improvements on classification tasks over several datasets such as SST; Kalchbrenner et al. [8] used convolutional neural networks (Convolutional Neural Network, CNN) for sentence-level short text classification and proposed the Dynamic Convolutional Neural Network (DCNN), which performed well on multiple datasets, further demonstrating the potential of convolutional neural networks in short text classification research. The input of neural network methods is usually randomly initialized or uses pre-trained word vectors. Word vectors can be trained in many ways: different corpora, models and preprocessing produce word vectors with different meanings, each portraying word semantics from a different angle. Because short text features are sparse, combining multiple kinds of word vectors can be considered in order to extract features fully and improve the feature extraction ability of the convolutional neural network. In addition, when Softmax is used as the classifier of a convolutional network, training is generally performed with the BP algorithm, which only considers minimizing the training error; owing to local minima, vanishing gradients, overfitting and similar phenomena, it is difficult to bring the neural network to its optimal generalization ability. Random forest is an ensemble learning method based on Bootstrap Aggregation (Bagging); by combining many decision trees, the model acquires strong tolerance and robustness to outliers and noise, overcoming the insufficient generalization ability of a single decision tree. Random forests have many advantages, such as:
1) little parameter tuning is needed and training is fast;
2) the training process is essentially free of overfitting;
3) robustness to noise disturbance is high.
Summary of the invention
It is an object of the invention to propose a short text classification algorithm (CNN-RF) combining a dual word-vector convolutional neural network with a random forest. The dual word-vector convolutional neural network takes two kinds of pre-trained word vectors as input, so it can fully extract short text features and overcome their sparsity; a random forest then performs the classification, strengthening the generalization ability of the model. Training of the CNN-RF model is divided into two stages: 1) the pre-training stage: the dual word-vector convolutional network is trained with Softmax as the classifier, and the model parameters are saved; 2) the classifier training stage: the parameters from the pre-training stage are kept fixed, the fully connected layer is fed into a random forest, the random forest is trained on the high-order features, and its parameters are saved. Experiments show that only a few epochs of pre-training are needed for the model of the classifier training stage to converge and reach good classification performance.
To achieve the above object, the technical solution adopted by the present invention is a short text classification method based on a convolutional neural network and a random forest, comprising the following steps:
Step 1: Segment all Chinese texts in the corpus to be classified, obtain two sets of word vectors for the corpus using the word2vec and GloVe word-vector training tools respectively, and represent each text as two matrices of equal dimensions; perform a two-dimensional convolution operation on each matrix to obtain two convolutional-layer feature maps.
Step 2: After the convolution operation, apply a pooling operation to each of the two convolutional-layer feature maps to obtain two pooling-layer feature matrices; apply a nonlinear sigmoid transformation to the pooling-layer feature matrices to obtain two pooling-layer feature maps.
Step 3: Perform a convolution operation on the two pooling-layer feature maps obtained in step 2 to obtain the final single fully connected layer feature map.
Step 4: Use the fully connected feature map obtained in step 3 as the input dataset of the random forest layer and apply Bootstrap sampling to this set. Bootstrap sampling is a statistical sampling method: for a dataset D with m samples, sampling with replacement m times yields a new dataset D'. Clearly D and D' have the same size, and because sampling is with replacement, some samples appear repeatedly in D' while others do not appear at all.
Step 5: Build a CART classification and regression tree for each of the multiple Bootstrap sample sets using the Gini index. The Gini index is used for feature selection: the feature space is split on the selected feature, the feature is removed from the feature set after the split, and feature selection and splitting are applied recursively to the left and right subtrees until a stopping condition is met. In addition, to prevent decision-tree overfitting, the method applies pre-pruning. The multiple decision trees are combined to make a joint classification decision for each sample, usually by majority voting (see the sketch after this step).
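The patent publishes no source code; the following is a minimal illustrative sketch of the Bootstrap sampling of step 4 and the Gini-based CART forest with majority voting of step 5, using NumPy and scikit-learn. The feature matrix X and labels y are hypothetical placeholders for the fully connected layer features, and the tree count and depth cap are assumed values.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))   # placeholder for the fully connected layer features
y = rng.integers(0, 5, size=1000)  # placeholder class labels

N_TREES = 80
trees = []
for _ in range(N_TREES):
    idx = rng.integers(0, len(X), size=len(X))       # draw m samples with replacement -> D'
    tree = DecisionTreeClassifier(criterion="gini",  # Gini index drives feature selection
                                  max_depth=10)      # pre-pruning via a depth cap
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Combine the trees by majority vote over their individual predictions
votes = np.stack([t.predict(X[:10]) for t in trees])                # shape (N_TREES, 10)
pred = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
print(pred)
```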
Compared with the prior art, the present invention has the following beneficial effects.
The random forest replaces the fully connected Softmax layer of the convolutional neural network, enhancing the robustness of the overall classification method, reducing model overfitting and strengthening generalization ability; the dual word-vector convolutional neural network can extract richer features; no complex parse tree is relied on, feature extraction requiring only convolution and max pooling over time (Max Pooling Over Time), and the resulting high-level abstract structural features are fed to the random forest layer for classification. From the bias-variance perspective, ensembling multiple models reduces the variance of the classification model and improves its stability. The method needs no complex feature expansion process, which would usually introduce noise and be laborious; it makes full use of the short text's own information and, compared with a traditional single-channel word-vector convolutional network, substantially alleviates the sparsity of short text data and can extract features fully. The max-pooling-over-time operation also solves the problem of variable-length short text input, and the dual pre-trained word-vector convolutional network effectively improves the accuracy of short text classification. Experiments show that only a few epochs of pre-training are needed for the method to reach good results.
Brief description of the drawings
Fig. 1 is the pre-trained word-vector generation model, a schematic of the skip-gram model
Fig. 2 is the classification model combining the convolutional neural network with the random forest
Fig. 3 compares accuracy (ACC) with NB, CART, RF and CNN on the three datasets
Fig. 4 compares precision (Pr), recall (Re) and F1 with NB, CART, RF and CNN on the Fudan dataset
Fig. 5 compares precision (Pr), recall (Re) and F1 with NB, CART, RF and CNN on the MR dataset
Fig. 6 compares precision (Pr), recall (Re) and F1 with NB, CART, RF and CNN on the Weibo dataset
Fig. 7.1 shows how the three evaluation metrics of the RF algorithm change with the number of decision trees on the Fudan dataset
Fig. 7.2 shows how the three evaluation metrics of the present method change with the number of decision trees on the Fudan dataset
Embodiment
In order to make the purpose, technical solution and features of the present invention clearer, the invention is further described below in conjunction with specific embodiments and with reference to the drawings.
The present invention replaces the fully connected Softmax layer of the convolutional neural network with a random forest (Random Forest), enhancing the robustness of the overall classification method, preventing model overfitting and strengthening generalization; it further uses a dual word-vector convolutional neural network suited to extracting richer high-order features. The specific improvements of the invention can be summarized as follows: 1) two sets of pre-trained word vectors are used instead of randomly initialized word vectors; compared with previous methods or bag-of-words models, this reduces feature dimensionality and extracts rich features; 2) randomly initialized word vectors also require parameter updates of the word-vector matrix, whereas this method needs no such operation, improving model efficiency; 3) no feature expansion or complex operations such as syntactic parse trees are introduced, avoiding noise in the model's subsequent feature extraction and classification; 4) as in a traditional neural network, features are first extracted with convolution-pooling-Softmax layers, and after a certain number of epochs the output of the fully connected layer becomes high-order structural features; 5) a random forest replaces Softmax for classification, which effectively improves the generalization ability of the model, prevents overfitting and strengthens classification performance. Experiments on three public datasets (Fudan, Weibo, MR) show that CNN-RF has a clear advantage over other methods on multiple evaluation metrics.
Fig. 1 shows the skip-gram model of the word2vec word vectors used by the invention; Fig. 2 shows the structure used by the short text classification method based on a convolutional neural network and a random forest. With the two sets of pre-trained word vectors, each short text in the corpus is first constructed into two word-vector matrices, 2-D convolution and max-pooling-over-time operations are performed, the features of the two channels are then combined with a convolution operation, pre-training is carried out, and finally the classification model is built with a random forest. The specific implementation process is divided into a pre-training stage and a classifier training stage:
One: The pre-training stage
Step 1: After obtaining the two sets of word vectors, for corpus D let x denote a text and let $x_i \in \mathbb{R}^k$ denote the word vector of the i-th word in the text; a sentence of length n is then expressed in the following form:

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$$

Here $\oplus$ denotes vector concatenation and n is the length of the longest sentence in the training corpus. Texts shorter than n are padded with the special symbol <PAD>, whose vector representation is drawn from a uniform distribution on (-0.25, 0.25). With word-vector length k, every text x is now expressed as two single-channel (Channel) two-dimensional $n \times k$ matrices, which serve as the two input layers (as shown in the sketch below).
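As a minimal sketch of this padding scheme (assuming `w2v` and `glove` are dictionaries mapping a segmented token to a k-dimensional NumPy vector; the names are illustrative):

```python
import numpy as np

K = 100                                   # word-vector dimension used in the experiments
rng = np.random.default_rng(42)
PAD = rng.uniform(-0.25, 0.25, size=K)    # <PAD> vector, uniform on (-0.25, 0.25)

def text_to_matrix(tokens, vectors, n):
    """Stack word vectors row-wise and pad the sentence to length n with <PAD>."""
    rows = [vectors.get(t, PAD) for t in tokens[:n]]
    rows += [PAD] * (n - len(rows))
    return np.stack(rows)                 # shape (n, K): one single-channel input matrix

# Each text yields two single-channel n-by-k matrices, one per embedding:
# channel_a = text_to_matrix(tokens, w2v, n); channel_b = text_to_matrix(tokens, glove, n)
```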
Step 2: Perform a convolution operation on each of the two input layers, applying a filter $W \in \mathbb{R}^{h \times k}$ to the word-vector sequence $x_{i:i+h-1} = \{x_i, x_{i+1}, \ldots, x_{i+h-1}\}$:

$$C_i = f(W \cdot x_{i:i+h-1} + b)$$

where h is the size of the filter's word window, $b \in \mathbb{R}$ is a bias term and f is a nonlinear activation function. The filter W acts over the whole word-vector sequence $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$ to produce the convolutional-layer feature map

$$C_{conv} = [C_{conv,1}, C_{conv,2}, \ldots, C_{conv,n-h+1}]$$

To extract features fully, m filters of different spans are set during training, denoted $\{W_1, W_2, \ldots, W_m\}$, with $s_j$ filters of the j-th kind and generally $s_1 = s_2 = \cdots = s_m = s$, producing m × s feature maps. Max-pooling-over-time is then applied to each individual feature map $C_{conv}$ to obtain the most important feature of the feature map (a sketch follows this step):

$$\hat{C}_{pool} = \max[C_{conv,1}, C_{conv,2}, \ldots, C_{conv,n-h+1}]$$
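A minimal NumPy sketch of the single-filter convolution $C_i = f(W \cdot x_{i:i+h-1} + b)$ and of max-pooling-over-time follows; the sentence matrix, filter and activation are illustrative assumptions:

```python
import numpy as np

def conv_feature_map(sent, W, b, f=np.tanh):
    """Slide a filter W of word-window size h over the (n, k) word-vector matrix."""
    n, _ = sent.shape
    h = W.shape[0]
    return np.array([f(np.sum(W * sent[i:i + h]) + b) for i in range(n - h + 1)])

def max_pool_over_time(c_conv):
    return c_conv.max()   # keep only the most important feature of the map

rng = np.random.default_rng(1)
sent = rng.normal(size=(20, 100))       # n = 20 words, k = 100 dimensions
W, b = rng.normal(size=(3, 100)), 0.0   # one filter with word window h = 3
print(max_pool_over_time(conv_feature_map(sent, W, b)))
```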
Step 3: Step 2 produces m × s pooling-layer features, which are concatenated to obtain the pooling-layer feature $C_{pool}^{l}$, where l = 1, 2 indexes the pooling-layer features of the two sets of word vectors respectively.
Step 4: Perform a convolution operation on the two pooling-layer features to obtain the final fully connected layer feature $C_{final}$, with components $C_{final,i}$:

$$C_{final,i} = f\left(\sum_{l=1}^{2} W \cdot C_{pool,\,i:i+1}^{l} + b^{l}\right)$$
Step 5: A Softmax classifier is attached after the fully connected layer features. The model of the whole pre-training stage is trained with the Adam mini-batch gradient descent (Mini-batch Gradient Descent) algorithm, the parameters of each layer are adjusted with the BP algorithm, and the parameters θ of the whole CNN are recorded after convergence. Dropout and L2 regularization are used during training to prevent overfitting (a sketch of this stage follows).
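Since the patent publishes no source code, the following is a minimal TensorFlow/Keras sketch of the pre-training stage under stated assumptions: the maximum length N, feature width, class count and the names `Xa_train`, `Xb_train`, `y_train` are placeholders, and the cross-channel convolutional merge of step 4 is approximated here by concatenation followed by a regularized dense layer:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

N, K = 50, 100                     # assumed max sentence length; vector dimension 100
FILTER_SIZES, S = (2, 3, 4), 100   # filter spans and count per span, as in the experiments

def conv_pool(x):
    """Convolution plus max-pooling-over-time for one embedding channel."""
    pooled = []
    for h in FILTER_SIZES:
        c = layers.Conv2D(S, (h, K), activation="relu")(x)  # (batch, N-h+1, 1, S)
        pooled.append(layers.GlobalMaxPooling2D()(c))       # max over time -> (batch, S)
    return layers.Concatenate()(pooled)                     # (batch, m*s)

in_a = layers.Input(shape=(N, K, 1))   # word2vec channel
in_b = layers.Input(shape=(N, K, 1))   # GloVe channel
merged = layers.Concatenate()([conv_pool(in_a), conv_pool(in_b)])
merged = layers.Dropout(0.5)(merged)                        # Dropout 0.5 as in the experiments
feats = layers.Dense(300, activation="sigmoid", name="final_features",
                     kernel_regularizer=regularizers.l2(0.001))(merged)
out = layers.Dense(5, activation="softmax")(feats)          # Softmax head for pre-training

model = tf.keras.Model([in_a, in_b], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit([Xa_train, Xb_train], y_train, batch_size=64, epochs=3)  # few epochs suffice
```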
Two: The classifier training stage
Step 6: Read the parameters θ from step 5, replace the Softmax model with a random forest model, and feed the fully connected layer features $C_{final}$ into the random forest for training. First the number N of decision trees in the forest is set and Bootstrap sampling is performed to obtain N datasets; next the parameters $\theta_n$ of each of the N trees are learned. Because the training processes of the individual trees in the forest do not affect one another, parallel training is used in the experiments to accelerate this step.
Step 7: After the individual decision trees have been trained, the output of the CNN-RF model is obtained by voting (a sketch follows):

$$c^{*} = \arg\max_{c}\left\{\sum_{i=1}^{N} I\left(T_i(x) = c\right)\right\}$$

where $T_i(x)$ is the classification result of tree i for sample x (i.e., its vote), $c^{*}$ is the final class assigned to the sample, and N is the number of decision trees in the random forest. Because the dimension of the fully connected layer feature $C_{final}$ fed to the random forest is small (for typical datasets m × s < 10³), the cost of building the random forest is very small.
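A minimal sketch of this classifier training stage, continuing the assumptions of the previous block (`Xa_*`, `Xb_*` and `y_train` are hypothetical arrays):

```python
from sklearn.ensemble import RandomForestClassifier

# Keep the pre-trained CNN fixed and read out the fully connected layer features C_final
feature_extractor = tf.keras.Model(model.inputs,
                                   model.get_layer("final_features").output)
C_final_train = feature_extractor.predict([Xa_train, Xb_train])

rf = RandomForestClassifier(n_estimators=80,  # N trees; ~50-80 sufficed on the Fudan data
                            criterion="gini",
                            n_jobs=-1)        # trees are independent, so train in parallel
rf.fit(C_final_train, y_train)

# predict() aggregates the trees (scikit-learn averages class probabilities, which
# approximates the majority vote c* = argmax_c sum_i I(T_i(x) = c))
pred = rf.predict(feature_extractor.predict([Xa_test, Xb_test]))
```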
The method combines the feature extraction ability of the CNN with the generalization ability of the random forest; the generalization ability can be analyzed from three aspects: 1) from the statistical viewpoint, because the hypothesis space of a learning task is often very large, multiple hypotheses may reach the same level of performance on the training set, and a single decision tree may then generalize poorly because of mis-selection; 2) from the feature extraction viewpoint, the dual word vectors portray the meaning of a word from two angles, enriching the short text information and expanding the feature information relative to a single word vector; 3) in terms of representation, the true hypothesis of some learning tasks may not lie within the hypothesis space of the current decision-tree algorithm, and a single classification method may then fail to find a suitable hypothesis, whereas the Bootstrap sampling of the random forest reduces the model's dependence on the data and the variance of the model, giving it better generalization ability.
Experimental equipment and environment
Win7 32-bit operating system, Intel Xeon E5 processor, CPU frequency 3.30 GHz, 16 GB memory. The experimental code uses Python; the deep learning environment is TensorFlow combined with the scikit-learn framework.
Experimental results and explanation
The method was tested on the Fudan Chinese dataset, the Weibo dataset provided by NLPIR, and the MR review sentiment classification dataset. The Fudan Chinese dataset contains a training corpus of 9804 documents and a test corpus of 9833 documents in 20 categories; the invention uses the news titles in the Fudan Chinese dataset as the short text classification corpus and selects only 5 of its categories, namely C3-Art, C32-Agriculture, C34-Economy, C7-History and C38-Politics, 7120 title documents in total. The Weibo dataset contains 21 categories in total; the invention uses all categories except "humanities and art", "advertising and public welfare" and "campus", 18 categories and 36412 microblog texts in total. For the Weibo and MR datasets, which have no training/test split, 10-fold cross-validation was carried out in the experiments, making the results more convincing.
Preprocessing and parameter settings
In the experiments two sets of word vectors are used: the first is trained with word2vec skip-gram, the second with the GloVe model. The word vectors are trained on each dataset itself; only for the Fudan dataset are the news contents and news titles used together as the word-vector training corpus. In preprocessing, HanLP is used for Chinese word segmentation and stop-word removal. The dimension of both sets of word vectors is set to 100; the filter sizes in the convolutional neural network are 2, 3 and 4, with 100 filters of each size; the Dropout parameter is set to 0.5 and the L2 regularization parameter to 0.001. Because of differences in preprocessing, word-vector corpora and method choices, the experimental results of different authors on the same dataset deviate somewhat. To verify the classification performance of CNN-RF, the various classification models and the method of this paper were therefore implemented under an identical preprocessing mechanism for the comparative experiments of classification performance.
Experimental setup and evaluation metrics
The method proposed by the invention is compared with four algorithms: Naive Bayes (NB), the classification and regression tree (CART), random forest (RF) and the CNN network proposed by Kim. In NB, CART and RF, the feature vector used for classification is the sum of the word vectors of the words in the text. The experiments take accuracy, precision, recall and the F1 value (F1-measure) as evaluation criteria, calculated as follows:
1) accuracy: $ACC = \dfrac{TP + TN}{N}$
2) precision: $Pr = \dfrac{TP}{TP + FP}$
3) recall: $Re = \dfrac{TP}{TP + FN}$
4) F1 value (F1-measure): $F1 = \dfrac{2 \cdot Pr \cdot Re}{Pr + Re}$
where TP is the number of positive samples predicted as positive, TN the number of negative samples predicted as negative, FN the number of positive samples predicted as negative, FP the number of negative samples predicted as positive, and N the total number of samples. The experiments then analyze the influence of increasing the number of decision trees on the RF and CNN-RF methods, and finally compare the convergence speeds of the CNN-RF and CNN algorithms.
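As a minimal sketch, the four metrics can be computed with scikit-learn; `y_true` and `y_pred` are assumed label arrays, and macro averaging is one common choice for the multi-class case:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

acc = accuracy_score(y_true, y_pred)                    # (TP + TN) / N
pr  = precision_score(y_true, y_pred, average="macro")  # TP / (TP + FP), averaged per class
re  = recall_score(y_true, y_pred, average="macro")     # TP / (TP + FN), averaged per class
f1  = f1_score(y_true, y_pred, average="macro")         # 2 * Pr * Re / (Pr + Re)
print(f"ACC={acc:.4f} Pr={pr:.4f} Re={re:.4f} F1={f1:.4f}")
```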
Analysis of experimental results
First, an accuracy comparison of the five algorithms is carried out on the 3 datasets. As can be seen from Fig. 3, the CNN-RF method proposed by the invention achieves the highest accuracy on all 3 datasets, improving on CNN by 1.7% on the Fudan dataset, by 1.6% on the Weibo dataset and by 0.8% on the MR dataset. The deep-learning-based CNN method is second only to CNN-RF and better than the other three methods, while the accuracies of NB and CART are both below that of the ensemble method RF. The analysis shows that the ensemble method, which combines multiple models, improves generalization over single models but is weaker than the deep learning CNN method; CNN obtains better accuracy by extracting abstract structural features, and CNN-RF combines the advantages of both, hence its better results.
The results of the five algorithms on the Fudan Chinese dataset are shown in Fig. 4. The experimental data show that the RF algorithm exceeds the CART and NB algorithms on all three metrics of precision, recall and F1, confirming that the ensemble-learning-based method indeed adds tolerance to noise and enhances the generalization ability of the classifier. In precision, RF is 1.0% higher than CNN, but in recall CNN is 6.1% higher than RF, so overall CNN exceeds RF by 2.5% in F1; CNN also reaches the best recall among the methods, 92.8%, which is 0.6% higher than the CNN-RF algorithm. Except for recall, where it trails CNN, the CNN-RF algorithm further strengthens the generalization ability of the model, improving precision by 4.1% and F1 by 1.9% over CNN; CNN-RF achieves the best results in accuracy and F1.
The results of the five algorithms on the MR dataset are shown in Fig. 5; MR is a binary sentiment classification dataset. CNN-RF is highest on all three evaluation metrics, about 1.2% higher than CNN and 4.4% higher than RF in F1. Unlike on the other two datasets, the precision, recall and F1 of CNN-RF on the MR dataset all exceed those of CNN, by 1.5%, 1.1% and 1.3% respectively.
The results of the five algorithms on the Weibo dataset are shown in Fig. 6. From the data, the recall of RF is still poor, although its precision is 7.6% higher than that of the CNN algorithm; the CNN algorithm achieves the highest recall, 15.6% and 9.2% higher than the RF and CNN-RF algorithms respectively, which leaves RF's F1 5.1% lower than CNN's. But because CNN performs poorly in precision, its F1 is lower than that of CNN-RF. CNN-RF obtains the best results in precision and F1: in precision CNN-RF is 11% higher than CNN, and it reaches the best F1, 6% and 0.9% higher than RF and CNN respectively.
In summary, the CNN-RF method is insensitive to the length of the short text dataset, the dual word-vector convolutional neural network can extract features fully, and the generalization ability of the model is better than that of the other four algorithms. By contrast, the CART and NB algorithms perform worst; the RF ensemble learning approach lifts generalization to some extent but, because it only uses features formed by summing the word2vec word vectors of the initial words, its classification performance is worse than CNN-RF's. The CNN-RF method exploits the abstract high-order features extracted by the dual word-vector CNN and combines many decision trees to strengthen the generalization ability of the model; its overall performance is better than CNN and RF on all datasets. Relative to CNN, F1 improves by 1.9%, 0.9% and 1.3% on the 3 datasets respectively, and the experiments verify the effectiveness of the method of the invention.
Regarding the influence of the number of decision trees in the random forest, tests were carried out on the Fudan Chinese dataset; the results are shown in Fig. 7.1 and Fig. 7.2, where the number of decision trees increases from 10 to 200 in increments of 10, 20 settings in total. Fig. 7.1 shows the RF algorithm and Fig. 7.2 the method of this paper. It can be seen that the three evaluation metrics of both CNN-RF and RF rise at first as the number of trees n increases; for RF the three metrics stabilize once the number of trees reaches about 80, while for CNN-RF they essentially stabilize after the number reaches 50.

Claims (2)

1. A short text classification method based on a convolutional neural network and a random forest, characterized in that the method comprises the following steps:
Step 1: Segment all Chinese texts in the corpus to be classified, obtain two sets of word vectors for the corpus using the word2vec and GloVe word-vector training tools respectively, and represent each text as two matrices of equal dimensions; perform a two-dimensional convolution operation on each matrix to obtain two convolutional-layer feature maps;
Step 2: After the convolution operation, apply a pooling operation to each of the two convolutional-layer feature maps to obtain two pooling-layer feature matrices; apply a nonlinear sigmoid transformation to the pooling-layer feature matrices to obtain two pooling-layer feature maps;
Step 3: Perform a convolution operation on the two pooling-layer feature maps obtained in step 2 to obtain the final single fully connected layer feature map;
Step 4: Use the fully connected feature map obtained in step 3 as the input dataset of the random forest layer and apply Bootstrap sampling to this set; Bootstrap sampling is a statistical sampling method: for a dataset D with m samples, sampling with replacement m times yields a new dataset D'; clearly D and D' have the same size, and because sampling is with replacement, some samples appear repeatedly in D' while others do not appear at all;
Step 5: Build a CART classification and regression tree for each of the multiple Bootstrap sample sets using the Gini index; the Gini index is used for feature selection: the feature space is split on the selected feature, the feature is removed from the feature set after the split, and feature selection and splitting are applied recursively to the left and right subtrees until a stopping condition is met; in addition, to prevent decision-tree overfitting, the method applies pre-pruning; the multiple decision trees are combined to make a joint classification decision for each sample, usually by majority voting.
2. The short text classification method based on a convolutional neural network and a random forest according to claim 1, characterized in that:
the specific implementation process of the method is divided into a pre-training stage and a classifier training stage:
One: The pre-training stage
Step 1: After obtaining the two sets of word vectors, for corpus D let x denote a text and let $x_i \in \mathbb{R}^k$ denote the word vector of the i-th word in the text; a sentence of length n is then expressed in the following form:

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$$

where $\oplus$ denotes vector concatenation and n is the length of the longest sentence in the training corpus; texts shorter than n are padded with the special symbol <PAD>, whose vector representation is drawn from a uniform distribution on (-0.25, 0.25); with word-vector length k, every text x is now expressed as two single-channel (Channel) two-dimensional $n \times k$ matrices, which serve as the two input layers;
Step 2: Perform a convolution operation on each of the two input layers, applying a filter $W \in \mathbb{R}^{h \times k}$ to the word-vector sequence $x_{i:i+h-1} = \{x_i, x_{i+1}, \ldots, x_{i+h-1}\}$:

$$C_i = f(W \cdot x_{i:i+h-1} + b)$$

where h is the size of the filter's word window, $b \in \mathbb{R}$ is a bias term and f is a nonlinear activation function; the filter W acts over the whole word-vector sequence $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$ to produce the convolutional-layer feature map

$$C_{conv} = [C_{conv,1}, C_{conv,2}, \ldots, C_{conv,n-h+1}]$$

to extract features fully, m filters of different spans are set during training, denoted $\{W_1, W_2, \ldots, W_m\}$, with $s_j$ filters of the j-th kind and generally $s_1 = s_2 = \cdots = s_m = s$, producing m × s feature maps; max-pooling-over-time is then applied to each individual feature map $C_{conv}$ to obtain the most important feature of the feature map:

$$\hat{C}_{pool} = \max[C_{conv,1}, C_{conv,2}, \ldots, C_{conv,n-h+1}]$$
Step 3: Step 2 produces m × s pooling-layer features, which are concatenated to obtain the pooling-layer feature $C_{pool}^{l}$, where l = 1, 2 indexes the pooling-layer features of the two sets of word vectors respectively;
Step 4: Perform a convolution operation on the two pooling-layer features to obtain the final fully connected layer feature $C_{final}$, with components $C_{final,i}$:

$$C_{final,i} = f\left(\sum_{l=1}^{2} W \cdot C_{pool,\,i:i+1}^{l} + b^{l}\right)$$
Step 5: A Softmax classifier is attached after the fully connected layer features; the model of the whole pre-training stage is trained with the Adam mini-batch gradient descent algorithm, the parameters of each layer are adjusted with the BP algorithm, and the parameters θ of the whole CNN are recorded after convergence; Dropout and L2 regularization are used during training to prevent overfitting;
Two: The classifier training stage
Step 6: Read the parameters θ from step 5, replace the Softmax model with a random forest model, and feed the fully connected layer features $C_{final}$ into the random forest for training; first the number N of decision trees in the forest is set and Bootstrap sampling is performed to obtain N datasets; next the parameters $\theta_n$ of each of the N trees are learned; because the training processes of the individual trees in the forest do not affect one another, parallel training is used in the experiments to accelerate this step;
Step 7: After the individual decision trees have been trained, the output of the CNN-RF model is obtained by voting:

$$c^{*} = \arg\max_{c}\left\{\sum_{i=1}^{N} I\left(T_i(x) = c\right)\right\}$$

where $T_i(x)$ is the classification result of tree i for sample x (i.e., its vote), $c^{*}$ is the final class assigned to the sample, and N is the number of decision trees in the random forest; because the dimension of the fully connected layer feature $C_{final}$ fed to the random forest is small (for typical datasets m × s < 10³), the cost of building the random forest is very small.
CN201710181062.0A 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest Active CN107066553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710181062.0A CN107066553B (en) 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710181062.0A CN107066553B (en) 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest

Publications (2)

Publication Number Publication Date
CN107066553A true CN107066553A (en) 2017-08-18
CN107066553B (en) 2021-01-01

Family

ID=59618101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710181062.0A Active CN107066553B (en) 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest

Country Status (1)

Country Link
CN (1) CN107066553B (en)

Cited By (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368613A (en) * 2017-09-05 2017-11-21 中国科学院自动化研究所 Short text sentiment analysis method and device
CN107767378A (en) * 2017-11-13 2018-03-06 浙江中医药大学 The multi-modal Magnetic Resonance Image Segmentation methods of GBM based on deep neural network
CN107798331A (en) * 2017-09-05 2018-03-13 赵彦明 From zoom image sequence characteristic extracting method and device
CN107886474A (en) * 2017-11-22 2018-04-06 北京达佳互联信息技术有限公司 Image processing method, device and server
CN107957993A (en) * 2017-12-13 2018-04-24 北京邮电大学 The computational methods and device of english sentence similarity
CN108108751A (en) * 2017-12-08 2018-06-01 浙江师范大学 A kind of scene recognition method based on convolution multiple features and depth random forest
CN108108351A (en) * 2017-12-05 2018-06-01 华南理工大学 A kind of text sentiment classification method based on deep learning built-up pattern
CN108122562A (en) * 2018-01-16 2018-06-05 四川大学 A kind of audio frequency classification method based on convolutional neural networks and random forest
CN108231067A (en) * 2018-01-13 2018-06-29 福州大学 Sound scenery recognition methods based on convolutional neural networks and random forest classification
CN108509508A (en) * 2018-02-11 2018-09-07 北京原点时空信息技术有限公司 Short message big data inquiry based on Java technology and analysis system and its method
CN108733801A (en) * 2018-05-17 2018-11-02 武汉大学 A kind of moving-vision search method towards digital humanity
CN108776805A (en) * 2018-05-03 2018-11-09 北斗导航位置服务(北京)有限公司 It is a kind of establish image classification model, characteristics of image classification method and device
CN108829671A (en) * 2018-06-04 2018-11-16 北京百度网讯科技有限公司 Method, apparatus, storage medium and the terminal device of decision based on survey data
CN108875808A (en) * 2018-05-17 2018-11-23 延安职业技术学院 A kind of book classification method based on artificial intelligence
CN108920586A (en) * 2018-06-26 2018-11-30 北京工业大学 A kind of short text classification method based on depth nerve mapping support vector machines
CN108959924A (en) * 2018-06-12 2018-12-07 浙江工业大学 A kind of Android malicious code detecting method of word-based vector sum deep neural network
CN109002532A (en) * 2018-07-17 2018-12-14 电子科技大学 Behavior trend mining analysis method and system based on student data
CN109165294A (en) * 2018-08-21 2019-01-08 安徽讯飞智能科技有限公司 Short text classification method based on Bayesian classification
CN109214298A (en) * 2018-08-09 2019-01-15 盈盈(杭州)网络技术有限公司 A kind of Asia women face value Rating Model method based on depth convolutional network
CN109493886A (en) * 2018-12-13 2019-03-19 西安电子科技大学 Speech-emotion recognition method based on feature selecting and optimization
CN109543084A (en) * 2018-11-09 2019-03-29 西安交通大学 A method of establishing the detection model of the hidden sensitive text of network-oriented social media
CN109670182A (en) * 2018-12-21 2019-04-23 合肥工业大学 A kind of extremely short file classification method of magnanimity indicated based on text Hash vectorization
WO2019080484A1 (en) * 2017-10-26 2019-05-02 北京深鉴智能科技有限公司 Method of pruning convolutional neural network based on feature map variation
CN109739986A (en) * 2018-12-28 2019-05-10 合肥工业大学 A kind of complaint short text classification method based on Deep integrating study
CN109843401A (en) * 2017-10-17 2019-06-04 腾讯科技(深圳)有限公司 A kind of AI object behaviour model optimization method and device
CN110019787A (en) * 2017-09-30 2019-07-16 北京国双科技有限公司 Neural network model generation method, text emotion analysis method and relevant apparatus
CN110020431A (en) * 2019-03-06 2019-07-16 平安科技(深圳)有限公司 Feature extracting method, device, computer equipment and the storage medium of text information
CN110069634A (en) * 2019-04-24 2019-07-30 北京泰迪熊移动科技有限公司 A kind of method, apparatus and computer readable storage medium generating classification model
CN110134786A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text classification method based on theme term vector and convolutional neural networks
CN110222173A (en) * 2019-05-16 2019-09-10 吉林大学 Short text sensibility classification method and device neural network based
CN110263344A (en) * 2019-06-25 2019-09-20 名创优品(横琴)企业管理有限公司 A kind of text emotion analysis method, device and equipment based on mixed model
CN110309304A (en) * 2019-06-04 2019-10-08 平安科技(深圳)有限公司 A kind of file classification method, device, equipment and storage medium
CN110377915A (en) * 2019-07-25 2019-10-25 腾讯科技(深圳)有限公司 Sentiment analysis method, apparatus, storage medium and the equipment of text
CN110781333A (en) * 2019-06-26 2020-02-11 杭州鲁尔物联科技有限公司 Method for processing unstructured monitoring data of cable-stayed bridge based on machine learning
CN111353512A (en) * 2018-12-20 2020-06-30 长沙智能驾驶研究院有限公司 Obstacle classification method, obstacle classification device, storage medium and computer equipment
CN111352926A (en) * 2018-12-20 2020-06-30 北京沃东天骏信息技术有限公司 Data processing method, device, equipment and readable storage medium
CN111401063A (en) * 2020-06-03 2020-07-10 腾讯科技(深圳)有限公司 Text processing method and device based on multi-pool network and related equipment
CN111753081A (en) * 2019-03-28 2020-10-09 百度(美国)有限责任公司 Text classification system and method based on deep SKIP-GRAM network
CN111813939A (en) * 2020-07-13 2020-10-23 南京睿晖数据技术有限公司 Text classification method based on representation enhancement and fusion
CN111897921A (en) * 2020-08-04 2020-11-06 广西财经学院 Text retrieval method based on word vector learning and mode mining fusion expansion
WO2020233344A1 (en) * 2019-05-21 2020-11-26 深圳壹账通智能科技有限公司 Searching method and apparatus, and storage medium
CN112182219A (en) * 2020-10-09 2021-01-05 杭州电子科技大学 Online service abnormity detection method based on log semantic analysis
CN112329877A (en) * 2020-11-16 2021-02-05 山西三友和智慧信息技术股份有限公司 Voting mechanism-based web service classification method and system
CN112347247A (en) * 2020-10-29 2021-02-09 南京大学 Specific category text title binary classification method based on LDA and Bert
CN112418354A (en) * 2020-12-15 2021-02-26 江苏满运物流信息有限公司 Goods source information classification method and device, electronic equipment and storage medium
CN112487811A (en) * 2020-10-21 2021-03-12 上海旻浦科技有限公司 Cascading information extraction system and method based on reinforcement learning
WO2021082861A1 (en) * 2019-10-31 2021-05-06 平安科技(深圳)有限公司 Scoring method and apparatus, electronic device, and storage medium
CN113342970A (en) * 2020-11-24 2021-09-03 中电万维信息技术有限责任公司 Multi-label complex text classification method
CN114154561A (en) * 2021-11-15 2022-03-08 国家电网有限公司 Electric power data management method based on natural language processing and random forest
CN114511330A (en) * 2022-04-18 2022-05-17 山东省计算中心(国家超级计算济南中心) Improved CNN-RF-based Ethernet workshop Pompe deception office detection method and system
CN115064184A (en) * 2022-06-28 2022-09-16 镁佳(北京)科技有限公司 Audio file musical instrument content identification vector representation method and device
CN116226702A (en) * 2022-09-09 2023-06-06 武汉中数医疗科技有限公司 Thyroid sampling data identification method based on bioelectrical impedance
CN117473095A (en) * 2023-12-27 2024-01-30 合肥工业大学 Short text classification method and system based on theme enhancement word representation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160034788A1 (en) * 2014-07-30 2016-02-04 Adobe Systems Incorporated Learning image categorization using related attributes
CN106156781A (en) * 2016-07-12 2016-11-23 北京航空航天大学 Sequence convolutional neural networks construction method and image processing method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160034788A1 (en) * 2014-07-30 2016-02-04 Adobe Systems Incorporated Learning image categorization using related attributes
CN106156781A (en) * 2016-07-12 2016-11-23 北京航空航天大学 Sequence convolutional neural networks construction method and image processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YOON KIM: "Convolutional Neural Networks for Sentence Classification", arXiv:1408.5882 (https://arxiv.org/abs/1408.5882) *
夏从零 (Xia Congling): "News text classification based on event convolution features" (基于事件卷积特征的新闻文本分类), 《计算机应用研究》 (Application Research of Computers) *

Cited By (80)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798331A (en) * 2017-09-05 2018-03-13 赵彦明 From zoom image sequence characteristic extracting method and device
CN107798331B (en) * 2017-09-05 2021-11-26 赵彦明 Method and device for extracting characteristics of off-zoom image sequence
CN107368613B (en) * 2017-09-05 2020-02-28 中国科学院自动化研究所 Short text sentiment analysis method and device
CN107368613A (en) * 2017-09-05 2017-11-21 中国科学院自动化研究所 Short text sentiment analysis method and device
CN110019787A (en) * 2017-09-30 2019-07-16 北京国双科技有限公司 Neural network model generation method, text emotion analysis method and relevant apparatus
CN109843401B (en) * 2017-10-17 2020-11-24 腾讯科技(深圳)有限公司 AI object behavior model optimization method and device
CN109843401A (en) * 2017-10-17 2019-06-04 腾讯科技(深圳)有限公司 A kind of AI object behaviour model optimization method and device
WO2019080484A1 (en) * 2017-10-26 2019-05-02 北京深鉴智能科技有限公司 Method of pruning convolutional neural network based on feature map variation
CN107767378A (en) * 2017-11-13 2018-03-06 浙江中医药大学 The multi-modal Magnetic Resonance Image Segmentation methods of GBM based on deep neural network
CN107767378B (en) * 2017-11-13 2020-08-04 浙江中医药大学 GBM multi-mode magnetic resonance image segmentation method based on deep neural network
CN107886474A (en) * 2017-11-22 2018-04-06 北京达佳互联信息技术有限公司 Image processing method, device and server
CN108108351B (en) * 2017-12-05 2020-05-22 华南理工大学 Text emotion classification method based on deep learning combination model
CN108108351A (en) * 2017-12-05 2018-06-01 华南理工大学 A kind of text sentiment classification method based on deep learning built-up pattern
CN108108751B (en) * 2017-12-08 2021-11-12 浙江师范大学 Scene recognition method based on convolution multi-feature and deep random forest
CN108108751A (en) * 2017-12-08 2018-06-01 浙江师范大学 A kind of scene recognition method based on convolution multiple features and depth random forest
CN107957993B (en) * 2017-12-13 2020-09-25 北京邮电大学 English sentence similarity calculation method and device
CN107957993A (en) * 2017-12-13 2018-04-24 北京邮电大学 The computational methods and device of english sentence similarity
CN108231067A (en) * 2018-01-13 2018-06-29 福州大学 Sound scenery recognition methods based on convolutional neural networks and random forest classification
CN108122562A (en) * 2018-01-16 2018-06-05 四川大学 A kind of audio frequency classification method based on convolutional neural networks and random forest
CN108509508A (en) * 2018-02-11 2018-09-07 北京原点时空信息技术有限公司 Short message big data inquiry based on Java technology and analysis system and its method
CN108776805A (en) * 2018-05-03 2018-11-09 北斗导航位置服务(北京)有限公司 It is a kind of establish image classification model, characteristics of image classification method and device
CN108733801B (en) * 2018-05-17 2020-06-09 武汉大学 Digital-human-oriented mobile visual retrieval method
CN108733801A (en) * 2018-05-17 2018-11-02 武汉大学 A kind of moving-vision search method towards digital humanity
CN108875808A (en) * 2018-05-17 2018-11-23 延安职业技术学院 A kind of book classification method based on artificial intelligence
CN108829671B (en) * 2018-06-04 2021-08-20 北京百度网讯科技有限公司 Decision-making method and device based on survey data, storage medium and terminal equipment
CN108829671A (en) * 2018-06-04 2018-11-16 北京百度网讯科技有限公司 Method, apparatus, storage medium and the terminal device of decision based on survey data
CN108959924A (en) * 2018-06-12 2018-12-07 浙江工业大学 A kind of Android malicious code detecting method of word-based vector sum deep neural network
CN108920586A (en) * 2018-06-26 2018-11-30 北京工业大学 A kind of short text classification method based on depth nerve mapping support vector machines
CN109002532A (en) * 2018-07-17 2018-12-14 电子科技大学 Behavior trend mining analysis method and system based on student data
CN109214298B (en) * 2018-08-09 2021-06-08 盈盈(杭州)网络技术有限公司 Asian female color value scoring model method based on deep convolutional network
CN109214298A (en) * 2018-08-09 2019-01-15 盈盈(杭州)网络技术有限公司 A kind of Asia women face value Rating Model method based on depth convolutional network
CN109165294B (en) * 2018-08-21 2021-09-24 安徽讯飞智能科技有限公司 Short text classification method based on Bayesian classification
CN109165294A (en) * 2018-08-21 2019-01-08 安徽讯飞智能科技有限公司 Short text classification method based on Bayesian classification
CN109543084B (en) * 2018-11-09 2021-01-19 西安交通大学 Method for establishing detection model of hidden sensitive text facing network social media
CN109543084A (en) * 2018-11-09 2019-03-29 西安交通大学 A method of establishing the detection model of the hidden sensitive text of network-oriented social media
CN109493886A (en) * 2018-12-13 2019-03-19 西安电子科技大学 Speech-emotion recognition method based on feature selecting and optimization
CN111352926B (en) * 2018-12-20 2024-03-08 北京沃东天骏信息技术有限公司 Method, device, equipment and readable storage medium for data processing
CN111353512A (en) * 2018-12-20 2020-06-30 长沙智能驾驶研究院有限公司 Obstacle classification method, obstacle classification device, storage medium and computer equipment
CN111352926A (en) * 2018-12-20 2020-06-30 北京沃东天骏信息技术有限公司 Data processing method, device, equipment and readable storage medium
CN109670182B (en) * 2018-12-21 2023-03-24 合肥工业大学 Massive extremely short text classification method based on text hash vectorization representation
CN109670182A (en) * 2018-12-21 2019-04-23 合肥工业大学 A kind of extremely short file classification method of magnanimity indicated based on text Hash vectorization
CN109739986A (en) * 2018-12-28 2019-05-10 合肥工业大学 A kind of complaint short text classification method based on Deep integrating study
CN110020431A (en) * 2019-03-06 2019-07-16 平安科技(深圳)有限公司 Feature extracting method, device, computer equipment and the storage medium of text information
CN111753081B (en) * 2019-03-28 2023-06-09 百度(美国)有限责任公司 System and method for text classification based on deep SKIP-GRAM network
CN111753081A (en) * 2019-03-28 2020-10-09 百度(美国)有限责任公司 Text classification system and method based on deep SKIP-GRAM network
CN110069634A (en) * 2019-04-24 2019-07-30 北京泰迪熊移动科技有限公司 A method, apparatus and computer-readable storage medium for generating a classification model
CN110134786A (en) * 2019-05-14 2019-08-16 南京大学 A short text classification method based on topic word vectors and convolutional neural networks
CN110222173A (en) * 2019-05-16 2019-09-10 吉林大学 Neural-network-based short text sentiment classification method and device
CN110222173B (en) * 2019-05-16 2022-11-04 吉林大学 Short text emotion classification method and device based on neural network
WO2020233344A1 (en) * 2019-05-21 2020-11-26 深圳壹账通智能科技有限公司 Searching method and apparatus, and storage medium
CN110309304A (en) * 2019-06-04 2019-10-08 平安科技(深圳)有限公司 A text classification method, device, equipment and storage medium
CN110263344B (en) * 2019-06-25 2022-04-19 创优数字科技(广东)有限公司 Text emotion analysis method, device and equipment based on hybrid model
CN110263344A (en) * 2019-06-25 2019-09-20 名创优品(横琴)企业管理有限公司 A text emotion analysis method, device and equipment based on a hybrid model
CN110781333A (en) * 2019-06-26 2020-02-11 杭州鲁尔物联科技有限公司 Method for processing unstructured monitoring data of cable-stayed bridge based on machine learning
CN110377915B (en) * 2019-07-25 2022-11-29 腾讯科技(深圳)有限公司 Text emotion analysis method and device, storage medium and equipment
CN110377915A (en) * 2019-07-25 2019-10-25 腾讯科技(深圳)有限公司 Text emotion analysis method, apparatus, storage medium and equipment
WO2021082861A1 (en) * 2019-10-31 2021-05-06 平安科技(深圳)有限公司 Scoring method and apparatus, electronic device, and storage medium
CN111401063A (en) * 2020-06-03 2020-07-10 腾讯科技(深圳)有限公司 Text processing method and device based on multi-pool network and related equipment
CN111401063B (en) * 2020-06-03 2020-09-11 腾讯科技(深圳)有限公司 Text processing method and device based on multi-pool network and related equipment
CN111813939A (en) * 2020-07-13 2020-10-23 南京睿晖数据技术有限公司 Text classification method based on representation enhancement and fusion
CN111897921A (en) * 2020-08-04 2020-11-06 广西财经学院 Text retrieval method based on fused expansion of word vector learning and pattern mining
CN112182219A (en) * 2020-10-09 2021-01-05 杭州电子科技大学 Online service anomaly detection method based on log semantic analysis
CN112487811B (en) * 2020-10-21 2021-07-06 上海旻浦科技有限公司 Cascading information extraction system and method based on reinforcement learning
CN112487811A (en) * 2020-10-21 2021-03-12 上海旻浦科技有限公司 Cascading information extraction system and method based on reinforcement learning
CN112347247B (en) * 2020-10-29 2023-10-13 南京大学 Specific-category text title classification method based on LDA and BERT
CN112347247A (en) * 2020-10-29 2021-02-09 南京大学 A specific-category text title binary classification method based on LDA and BERT
CN112329877A (en) * 2020-11-16 2021-02-05 山西三友和智慧信息技术股份有限公司 Voting mechanism-based web service classification method and system
CN113342970A (en) * 2020-11-24 2021-09-03 中电万维信息技术有限责任公司 Multi-label complex text classification method
CN113342970B (en) * 2020-11-24 2023-01-03 中电万维信息技术有限责任公司 Multi-label complex text classification method
CN112418354B (en) * 2020-12-15 2022-07-15 江苏满运物流信息有限公司 Goods source information classification method and device, electronic equipment and storage medium
CN112418354A (en) * 2020-12-15 2021-02-26 江苏满运物流信息有限公司 Goods source information classification method and device, electronic equipment and storage medium
CN114154561A (en) * 2021-11-15 2022-03-08 国家电网有限公司 Electric power data management method based on natural language processing and random forest
CN114154561B (en) * 2021-11-15 2024-02-27 国家电网有限公司 Electric power data management method based on natural language processing and random forest
CN114511330B (en) * 2022-04-18 2022-12-13 山东省计算中心(国家超级计算济南中心) Ethereum Ponzi scheme detection method and system based on improved CNN-RF
CN114511330A (en) * 2022-04-18 2022-05-17 山东省计算中心(国家超级计算济南中心) An improved-CNN-RF-based Ethereum Ponzi scheme detection method and system
CN115064184A (en) * 2022-06-28 2022-09-16 镁佳(北京)科技有限公司 Method and device for vector representation of musical instrument content identification in audio files
CN116226702A (en) * 2022-09-09 2023-06-06 武汉中数医疗科技有限公司 Thyroid sampling data identification method based on bioelectrical impedance
CN116226702B (en) * 2022-09-09 2024-04-26 武汉中数医疗科技有限公司 Thyroid sampling data identification method based on bioelectrical impedance
CN117473095A (en) * 2023-12-27 2024-01-30 合肥工业大学 Short text classification method and system based on topic-enhanced word representation
CN117473095B (en) * 2023-12-27 2024-03-29 合肥工业大学 Short text classification method and system based on topic-enhanced word representation

Also Published As

Publication number Publication date
CN107066553B (en) 2021-01-01

Similar Documents

Publication Publication Date Title
CN107066553A (en) A short text classification method based on convolutional neural networks and random forest
CN108595632B (en) Hybrid neural network text classification method fusing abstract and main body characteristics
CN107193801B (en) Short text feature optimization and emotion analysis method based on deep belief network
CN106294593B (en) Relation extraction method combining clause-level distant supervision and semi-supervised ensemble learning
CN104765769B (en) A word-vector-based short text query expansion and retrieval method
CN100583101C (en) Text categorization feature selection and weight computation method based on domain knowledge
CN110866117A (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN103678670B (en) Micro-blog hot word and hot topic mining system and method
CN105528437B (en) A question answering system construction method based on structured text knowledge extraction
CN110298032A (en) Text classification corpus labeling training system
CN105389379A (en) Spam article classification method based on distributed feature representation of text
CN106202372A (en) A method for sentiment classification of online text information
CN103970729A (en) Multi-topic extraction method based on semantic categories
CN110175221B (en) Spam short message identification method combining word vectors with machine learning
CN107122349A (en) A text feature word extraction method based on word2vec-LDA models
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
KR20190063978A (en) Automatic classification method of unstructured data
CN104008090A (en) Multi-subject extraction method based on concept vector model
CN101751455A (en) Method for automatically generating titles using artificial intelligence technology
Mehmood et al. A precisely xtreme-multi channel hybrid approach for Roman Urdu sentiment analysis
CN104484380A (en) Personalized search method and personalized search device
CN110222172A (en) A multi-source network public opinion topic discovery method based on improved hierarchical clustering
CN108920586A (en) A short text classification method based on deep neural mapping support vector machines
Sadr et al. Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms
Halevy et al. Discovering structure in the universe of attribute names

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant