CN107066553B - Short text classification method based on convolutional neural network and random forest

Info

Publication number
CN107066553B
CN107066553B (application CN201710181062.0A)
Authority
CN
China
Prior art keywords
training
random forest
feature
cnn
neural network
Prior art date
Legal status
Active
Application number
CN201710181062.0A
Other languages
Chinese (zh)
Other versions
CN107066553A (en)
Inventor
刘泽锦 (Liu Zejin)
王洁 (Wang Jie)
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201710181062.0A priority Critical patent/CN107066553B/en
Publication of CN107066553A publication Critical patent/CN107066553A/en
Application granted granted Critical
Publication of CN107066553B publication Critical patent/CN107066553B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/243 - Classification techniques relating to the number of classes
    • G06F18/24323 - Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a short text classification method based on a convolutional neural network and a random forest, belonging to the fields of text classification and deep learning. To address the insufficient generalization ability caused by adopting Softmax as the classifier of a convolutional neural network, a short text classification algorithm combining a convolutional neural network and a random forest (CNN-RF) is provided. The method first proposes a double word vector convolutional neural network to fully extract the high-order features of short texts, and then adopts a random forest as the high-order feature classifier, thereby improving the short text classification effect. Results on three public experimental data sets show that CNN-RF has significant advantages over other algorithms on multiple evaluation indices.

Description

Short text classification method based on convolutional neural network and random forest
Technical Field
The invention belongs to the fields of text classification and deep learning, and relates to a short text classification method based on a neural network and a random forest, which can be used for topic or sentiment classification of massive short text data such as microblogs, short messages and user queries, and can support system services such as search engines and information retrieval.
Background
With the rapid development of the Internet in recent years, various information interaction platforms generate large amounts of short texts, which touch every area of people's lives and have gradually become a common and widely accepted mode of communication. E-commerce comments, web information retrieval and intelligent question answering systems, for example, are all sources of massive short texts. How to mine effective information from massive short texts has been a subject of extensive research in recent years. Text classification is an effective method for text mining, but traditional long text classification methods are not applicable because short texts are short and their lexical features sparse. Short text classification technology can meet these application challenges to a certain extent; it is one of the research hotspots of scholars at home and abroad in recent years and a vital task in the field of Natural Language Processing (NLP). Current text classification methods are mainly based on statistical learning or machine learning: a classifier is trained on manually labeled corpora with statistical or machine learning methods and then applied to the data set to be classified. Mainstream machine learning methods include Naive Bayes (NB), Support Vector Machines (SVM), Logistic Regression (LR), multinomial logistic regression (Softmax Regression, SR), Random Forest (RF), Deep Neural Networks (DNN) and the like. Long text classification methods that succeed in the text classification field are difficult to apply directly to short texts, so classification algorithms for short texts have become a research problem that must be solved. The challenges of short text classification are mainly as follows:
1) short texts have sparse keyword features: compared with ordinary long texts with rich terms, a short text often contains only a few effective keywords, so when a vector space model is used to represent the text, correlations among features are difficult to mine fully;
2) in open domains (such as microblogs and search engines), information is updated quickly; the information content of a single short text is small while the total amount of text information is extremely large, and the overlap between items of information is small;
3) new words, new phrases and colloquial expressions keep emerging and are often difficult for existing classification systems to process.
Scholars at home and abroad have carried out meaningful research and exploration on the short text classification problem. The first type of approach is based on short text feature expansion: Bouaziz et al. use a Latent Dirichlet Allocation (LDA) model to learn topics and the distribution of words over topics on Wikipedia data, expand short texts with high-frequency words under the same topic, and select features of the expanded words with a random semantic forest before classifying; other scholars obtain a set of word co-occurrence patterns through association rule mining (FP-Growth) and use it as the basis for text feature expansion, with word relation confidence as the weight supporting the expansion, to complete feature expansion and classification of short texts; XH Phan et al. build a global corpus by crawling massive Internet data, obtain a topic model of the global corpus with the LDA topic model method, then perform topic inference on the short text corpus to be classified with the global LDA topic model to obtain the topic distribution of each short text, expand the short text's features with this topic distribution, and finally classify. Methods of this first type inevitably introduce noise when expanding short text features, which degrades classification.
The second category is based on deep learning: Socher et al. apply a Recursive Neural Network (RNN) model to sentence-level sentiment analysis tasks and achieve gains on classification tasks over multiple data sets such as SST; Kalchbrenner et al. [8] use a Convolutional Neural Network (CNN) for the sentence-level short text classification task and propose the Dynamic Convolutional Neural Network (DCNN) model, which performs well on several data sets and further verifies the potential of convolutional neural networks in short text classification research. The inputs of neural network methods are usually randomly initialized or pre-trained word vectors. Word vectors can be trained in many ways; different corpora, models and preprocessing yield word vectors with different meanings, and different word vectors describe word semantics from different angles. Because short text features are sparse, combining several kinds of word vectors can be considered in order to extract features fully and improve the feature extraction ability of the convolutional neural network. In addition, when Softmax is used as the classifier of a convolutional network, training generally uses the BP algorithm, which considers only minimizing the training error; because of local minima, vanishing gradients, overfitting and similar phenomena, the neural network can hardly reach optimal generalization ability. The random forest is an ensemble learning method based on Bootstrap Aggregating (Bagging); by combining multiple decision trees, the model gains strong tolerance and robustness to outliers and noise, overcoming the insufficient generalization ability of a single decision tree. Random forests have many advantages, for example:
1) few parameters need tuning, and training is fast;
2) overfitting is largely avoided during training;
3) robustness to noise disturbance is high.
Disclosure of Invention
The invention aims to provide a short text classification algorithm (CNN-RF) combining a double word vector convolutional neural network with a random forest. The double word vector convolutional neural network uses two pre-trained word vectors as input, can fully extract short text features, and overcomes the defect of sparse short text features; a random forest is then adopted for classification to enhance the generalization ability of the model. Training of the CNN-RF model is divided into two phases: 1) a pre-training stage: the double word vector convolutional network is trained with Softmax as the classifier, and the model parameters are saved; 2) a classifier training stage: with the pre-training parameters held fixed, the fully connected layer is fed into a random forest, which is trained on the high-order features, and its parameters are saved. In the experiments, only a few pre-training epochs are needed for the classifier training stage to converge, and a good classification effect is achieved.
In order to achieve the purpose, the technical scheme adopted by the invention is a short text classification method based on a convolutional neural network and a random forest, comprising the following steps:
Step 1: segment all Chinese texts in the corpus to be classified into words, train two groups of word vectors for the corpus with the word2vec and GloVe word vector training tools respectively, and represent each text as two matrices of equal dimensions; perform a two-dimensional convolution operation on each of the two matrices to obtain two convolutional layer feature maps.
Step 2: after the convolution operation, perform a pooling operation on each of the two convolutional layer feature maps to obtain two pooling layer feature matrices; apply a nonlinear sigmoid transformation to the pooling layer feature matrices to obtain two pooling layer feature maps.
Step 3: perform a convolution operation on the two pooling layer feature maps obtained in step 2 to obtain a final single fully connected layer feature map.
Step 4: take the fully connected feature map obtained in step 3 as the input data set of the random forest layer and perform Bootstrap sampling on it. Bootstrap sampling is a statistical sampling method: for a data set D with m samples, sampling with replacement m times yields a new data set D'. D and D' obviously have the same size, and because the sampling is done with replacement, some samples occur repeatedly in D' while others do not occur at all.
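As an illustration of this Bootstrap resampling, the following minimal Python sketch draws one resampled set D' (assuming NumPy arrays X and y holding the fully connected features and their labels; all names are illustrative, not the patent's code):

```python
import numpy as np

def bootstrap_sample(X, y, seed=0):
    """Draw m samples with replacement from a data set D of size m.

    D' has the same size as D; because sampling is done with
    replacement, some rows repeat while roughly 1/e (about 36.8%)
    of the original rows do not appear at all.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    idx = rng.integers(0, m, size=m)  # m draws with replacement
    return X[idx], y[idx]
```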
Step 5: build a classification and regression tree (CART) for each Bootstrap sample set using the Gini coefficient method. The Gini coefficient is used for feature selection: the feature space is split on the selected feature, the feature is removed from the feature set after the split, and feature selection and splitting are applied recursively to the left and right subtrees until a stopping condition is met. In addition, to prevent overfitting of the decision trees, the method adopts a pre-pruning operation. Multiple decision trees are combined to jointly decide the category of a sample, usually by voting.
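The Gini coefficient that drives feature selection in these CART trees can be computed as in the following minimal sketch (function names and inputs are assumptions made for illustration):

```python
import numpy as np

def gini(labels):
    """Gini impurity Gini(D) = 1 - sum_k p_k^2 for a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(y_left, y_right):
    """Weighted Gini impurity of a binary split; the feature (and
    threshold) minimizing this value is chosen at each CART node."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)
```

scikit-learn's DecisionTreeClassifier(criterion="gini") applies the same impurity criterion internally.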
Compared with the prior art, the invention has the following beneficial effects.
A Random Forest is adopted to replace the fully connected Softmax layer of the convolutional neural network, which strengthens the robustness of the whole classification method, reduces overfitting of the model and enhances its generalization ability; the double word vector convolutional neural network extracts richer features. The method does not depend on a complex syntactic parse tree: features are extracted only through convolution and max pooling over time, and the resulting high-level abstract structural features are sent to the random forest layer for classification. From a bias-variance perspective, integrating multiple models reduces the variance of the classification model and improves its stability. The method needs no complex feature expansion process, which usually introduces noise and costs time and labor; it makes full use of the short text's own information, and compared with a traditional single-channel word vector convolutional network it substantially alleviates the sparsity of short text data and extracts features fully. The max-pooling-over-time operation also handles variable-length short text input, so the double pre-trained word vector convolutional network can effectively improve the accuracy of short text classification. In the experiments, only a few pre-training epochs are needed to achieve a good effect.
Drawings
FIG. 1 is a schematic diagram of the pre-trained word vector generation model, the skip-gram model
FIG. 2 is the classification model combining the convolutional neural network and the random forest
FIG. 3 compares accuracy (ACC) with NB, CART, RF and CNN on the three data sets
FIG. 4 compares precision (Pr), recall (Re) and F1 values with NB, CART, RF and CNN on the Fudan data set
FIG. 5 compares precision (Pr), recall (Re) and F1 values with NB, CART, RF and CNN on the MR data set
FIG. 6 compares precision (Pr), recall (Re) and F1 values with NB, CART, RF and CNN on the Weibo data set
FIG. 7.1 shows how the three evaluation indices of the RF algorithm vary with the number of decision trees on the Fudan data set
FIG. 7.2 shows how the three evaluation indices of the proposed method vary with the number of decision trees on the Fudan data set
Detailed Description
In order to make the objects, technical solutions and features of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
According to the method, a Random Forest is adopted to replace the fully connected Softmax layer of the convolutional neural network, which strengthens the robustness of the whole classification method, prevents overfitting of the model and enhances its generalization ability; furthermore, a double word vector convolutional neural network is adopted, which is suited to extracting richer high-order features. The specific improvements of the invention can be summarized as follows: 1) two groups of pre-trained word vectors replace randomly initialized word vectors, which, compared with conventional methods or bag-of-words models, reduces feature dimensionality while extracting sufficient features; 2) randomly initialized word vectors also require updating the parameters of the word vector matrix, an operation the method does not need, improving model efficiency; 3) no feature expansion or complex machinery such as syntactic parse trees is introduced, so no noise enters the model's subsequent feature extraction and classification; 4) features are first extracted with a convolution-pooling-softmax pipeline similar to a traditional neural network; after a certain number of epochs, the output features of the fully connected layer become high-order structural features; 5) a random forest replaces softmax for classification, which effectively improves the generalization ability of the model, prevents overfitting and enhances the classification effect. Experiments on three public data sets (Fudan, Weibo, MR) show that CNN-RF has clear advantages over other methods on multiple evaluation indices.
FIG. 1 shows the skip-gram model of the word2vec word vector model adopted by the invention; FIG. 2 shows the structure adopted by the short text classification method based on a convolutional neural network and a random forest. For the two groups of pre-trained word vectors, the short texts in the corpus are first built into two word vector matrices, on which 2-dimensional convolution and max-pooling-over-time operations are performed; a convolution operation then combines the features of the two channels for pre-training, and finally a random forest is used to build the classification model. The specific implementation comprises a pre-training stage and a classifier training stage:
I. Pre-training stage
Step 1: after the two groups of word vectors are obtained, for corpus D, let x denote a text and let $x_i \in \mathbb{R}^k$ denote the word vector of the i-th word in the text. A sentence of length n is represented in the form

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$$

where $\oplus$ denotes the vector concatenation operation and n is the length of the longest sentence in the training corpus. A text shorter than n is completed with the special symbol <PAD>, represented by a vector generated from a uniform distribution on (-0.25, 0.25). With word vector length k, each text x is now represented as two matrices in $\mathbb{R}^{n \times k}$, i.e., two input layers, each a single-channel two-dimensional matrix.
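A minimal sketch of this input construction, assuming two pre-trained embedding lookups of equal dimension k (e.g. gensim KeyedVectors); the variable names and the out-of-vocabulary treatment are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def pad_vector(k):
    """One shared vector for <PAD>, drawn from U(-0.25, 0.25)."""
    return rng.uniform(-0.25, 0.25, size=k)

def text_to_matrix(tokens, kv, n, k, pad_vec):
    """Map a segmented text to an n x k matrix, padded to length n.

    kv maps word -> k-dimensional vector (e.g. gensim KeyedVectors);
    out-of-vocabulary words also get a U(-0.25, 0.25) vector.
    """
    rows = [kv[w] if w in kv else rng.uniform(-0.25, 0.25, size=k)
            for w in tokens[:n]]
    rows += [pad_vec] * (n - len(rows))
    return np.vstack(rows)

# each text x becomes two single-channel n x k input matrices:
# x_w2v = text_to_matrix(tokens, w2v_vectors, n, 100, pad_w2v)
# x_glo = text_to_matrix(tokens, glove_vectors, n, 100, pad_glo)
```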
Step 2: convolution operations are performed separately on the two input layers, using a filter $W \in \mathbb{R}^{h \times k}$ that acts on the word vector window $x_{i:i+h-1} = \{x_i, x_{i+1}, \ldots, x_{i+h-1}\}$:

$$C_i = f(W \cdot x_{i:i+h-1} + b)$$

where h is the size of the filter's word window, $b \in \mathbb{R}$ is a bias term, and f is a nonlinear activation function. The filter W acts on the entire word vector sequence $x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}$ to generate a convolutional layer feature map $C_{conv} \in \mathbb{R}^{n-h+1}$:

$$C_{conv} = [C_{conv,1}, C_{conv,2}, \ldots, C_{conv,n-h+1}]$$

In order to fully extract features, m filter spans are set in the training process, denoted $\{W_1, W_2, \ldots, W_m\}$, with $s_i$ filters of the i-th span; usually $s_1 = s_2 = \cdots = s_m = s$, i.e., m × s feature maps are generated. A max-pooling-over-time operation then acts on each single feature map $C_{conv}$ to obtain the most important feature in the feature map:

$$\hat{C} = \max\{C_{conv}\}$$
Step 3: step 2 generates m × s pooled features, which are concatenated to obtain the pooling layer feature $C_{pool}^{(l)} \in \mathbb{R}^{ms}$, where l = 1, 2 respectively denotes the pooling layer features of the two groups of word vectors.
Step 4: a convolution operation is performed on the two pooling layer features to obtain the final fully connected layer feature $C_{final}$, whose i-th component is

$$C_{final,i} = f(W \cdot [C_{pool,i}^{(1)}, C_{pool,i}^{(2)}] + b)$$
Step 5: a Softmax classifier is attached after the fully connected layer features; the model of the whole pre-training stage is trained with the Adam mini-batch gradient descent algorithm, the parameters of each layer are adjusted with the BP algorithm, and the parameters θ of the whole CNN are recorded after convergence. Dropout and L2 regularization are used during training to prevent overfitting.
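The following condensed Keras/TensorFlow sketch illustrates this pre-training stage with the parameter settings reported in the experiments (filter sizes 2, 3 and 4 with 100 filters each, Dropout 0.5, L2 = 0.001); the exact fusion of the two channels and the size of the fully connected layer are assumptions, not the patent's code:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_cnn(n, k, num_classes, spans=(2, 3, 4), s=100, l2=1e-3):
    """Double word vector CNN: per-channel convolution plus
    max-pooling-over-time, channel fusion into C_final, and a
    Softmax head used only during pre-training."""
    in_w2v = layers.Input(shape=(n, k))    # word2vec channel
    in_glove = layers.Input(shape=(n, k))  # GloVe channel

    def channel(x):
        pooled = []
        for h in spans:                    # m filter spans, s filters each
            c = layers.Conv1D(s, h, activation="relu")(x)
            pooled.append(layers.GlobalMaxPooling1D()(c))  # over time
        return layers.concatenate(pooled)  # m*s pooled features

    merged = layers.concatenate([channel(in_w2v), channel(in_glove)])
    merged = layers.Dropout(0.5)(merged)
    c_final = layers.Dense(128, activation="sigmoid", name="c_final",
                           kernel_regularizer=regularizers.l2(l2))(merged)
    out = layers.Dense(num_classes, activation="softmax")(c_final)
    model = tf.keras.Model([in_w2v, in_glove], out)
    model.compile(optimizer="adam",        # Adam mini-batch training
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# a few pre-training epochs suffice before the classifier is swapped:
# model = build_cnn(n=50, k=100, num_classes=5)
# model.fit([X1_train, X2_train], y_train, batch_size=64, epochs=5)
```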
II. Classifier training stage
Step 6: the parameters θ from step 5 are read, the Softmax model is replaced with a random forest model, and the fully connected layer features $C_{final}$ are sent into the random forest for training. The number N of decision trees in the forest is set first, Bootstrap sampling yields N data sets, and the parameters $\theta_n$ of each of the N trees are then learned. Because the training processes of the trees in the forest do not influence one another, parallel training is adopted in the test to increase speed.
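A sketch of this classifier training stage, assuming the pre-trained model and the input matrices X1_*, X2_* from the previous sketches; n_jobs=-1 provides the parallel tree training mentioned above:

```python
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier

# freeze the pre-trained parameters theta and expose the C_final layer
feature_extractor = tf.keras.Model(model.inputs,
                                   model.get_layer("c_final").output)
H_train = feature_extractor.predict([X1_train, X2_train])

# N decision trees with Gini splitting, each grown on a Bootstrap
# sample; n_jobs=-1 trains the trees in parallel
rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            bootstrap=True, n_jobs=-1, random_state=0)
rf.fit(H_train, y_train)

# classification on the test features
y_pred = rf.predict(feature_extractor.predict([X1_test, X2_test]))
```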
Step 7: after the training of the single decision trees is finished, the output of the CNN-RF model is finally obtained by voting:
$$c^* = \arg\max_{c} \sum_{i=1}^{N} I(T_i(x) = c)$$
$T_i(x)$ is the classification result, i.e., the vote, of tree i for sample x; $c^*$ is the final category of the sample; and N is the number of decision trees in the random forest. Because the dimension of the fully connected layer feature $C_{final}$ seen by the random forest is not large (for the data sets used, m × s < 10³), the overhead of building the random forest is very small.
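The voting rule corresponds to the following explicit computation (illustrative only; scikit-learn's forest internally averages per-tree class probabilities rather than counting hard votes):

```python
import numpy as np

def vote(trees, x):
    """Majority vote c* = argmax_c sum_i I(T_i(x) = c) over N trees.

    `trees` is any list of fitted classifiers exposing predict();
    this is a conceptual sketch of the formula above.
    """
    votes = [t.predict(x.reshape(1, -1))[0] for t in trees]
    classes, counts = np.unique(votes, return_counts=True)
    return classes[np.argmax(counts)]
```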
The method combines the feature extraction ability of the CNN with the generalization ability of the random forest, which can be analyzed from three aspects: 1) statistically, because the hypothesis space of a learning task is often very large, several hypotheses may reach the same level of performance on the training set; a single decision tree may then generalize poorly because the wrong hypothesis is selected; 2) in terms of feature extraction, the double word vectors describe word meaning from two angles, enriching short text information and expanding the feature information relative to a single word vector; 3) in terms of representation, the true hypothesis of some learning tasks may lie outside the hypothesis space of the current decision tree algorithm, in which case a single classifier cannot search beyond its own hypothesis space; the random forest's Bootstrap sampling reduces the machine learning model's dependence on the data and the variance of the model, giving it better generalization ability.
Experimental facility and required environment
A Windows 7 32-bit operating system, an Intel Xeon E5 processor with a 3.30 GHz main frequency, and 16 GB of memory were used. The experimental code is written in Python; the deep learning environment is TensorFlow combined with the scikit-learn framework.
Results and description of the experiments
Experiments are carried out on the Fudan Chinese data set, the Weibo data set provided by NLPIR, and the MR review sentiment classification data set. The Fudan Chinese data set contains 9804 training documents and 9833 test documents in 20 categories; the invention uses its news titles as the short text classification corpus and selects only 5 categories, namely C3-Art, C32-Agriculture, C34-Economy, C7-History and C38-Politics, 7120 title documents in total. The Weibo data set has 21 categories in total; the invention uses all categories except "human art", "advertisements" and "campus", 18 categories and 36412 microblog texts in total. For Weibo and MR, which are not divided into training and test sets, 10-fold cross validation is carried out in the experiments, which makes the experimental results more convincing.
Preprocessing and parameter setting
Two groups of word vectors are adopted in the experiments: the first is obtained by skip-gram training in word2vec, the second by the GloVe model. The corpora for training the word vectors come from each data set itself; only for the Fudan data set are the news contents and news titles used together as the word vector training corpus. In preprocessing, HanLP is adopted for Chinese word segmentation, and stop words are removed. The dimensionality of both groups of word vectors is set to 100; the filter sizes in the convolutional neural network are 2, 3 and 4, with 100 filters of each size; the Dropout parameter is set to 0.5 and the L2 regularization parameter to 0.001. Owing to differences in preprocessing, word vector corpora and method choices, the experimental results of different authors deviate somewhat on the same data set. To verify the classification performance of CNN-RF, several classification models and the classification method of the invention are therefore implemented on the same preprocessing mechanism for a comparative experiment.
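A sketch of the word vector pre-training step under these settings, using gensim's skip-gram implementation for the word2vec group; the window and min_count values are assumptions, and the GloVe group would be trained analogously with the GloVe toolkit:

```python
from gensim.models import Word2Vec

# `sentences`: a list of token lists produced by HanLP segmentation
# with stop words removed (illustrative variable, not the patent's code)
w2v = Word2Vec(sentences, vector_size=100, sg=1,  # sg=1 selects skip-gram
               window=5, min_count=1, workers=4)
w2v_vectors = w2v.wv  # KeyedVectors used to build the input matrices
```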
Experimental setup and evaluation index
The invention is compared with four algorithms: Naive Bayes (NB), the classification and regression tree (CART), Random Forest (RF), and the CNN network proposed by Kim. The feature vector used for classification in NB, CART and RF is the sum of the word vectors of the words in each text. Accuracy, precision, recall and the F1 value (F1-measure) are adopted as evaluation criteria and are calculated as follows:
1) accuracy:

$$\text{accuracy} = \frac{TP + TN}{N}$$

2) precision:

$$\text{precision} = \frac{TP}{TP + FP}$$

3) recall:

$$\text{recall} = \frac{TP}{TP + FN}$$

4) F1 value (F1-measure):

$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
where TP is the number of positive samples predicted as positive, TN the number of negative samples predicted as negative, FN the number of positive samples predicted as negative, FP the number of negative samples predicted as positive, and N the total number of samples. The experiments then analyze the influence of an increasing number of decision trees on the RF and CNN-RF methods, and finally compare the convergence speed of the CNN-RF method with that of the CNN algorithm.
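Given the predictions, the four indices can be computed directly, for example with scikit-learn (the macro averaging shown for the multi-class case is an assumption; the patent does not state an averaging mode):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

acc = accuracy_score(y_test, y_pred)
pr = precision_score(y_test, y_pred, average="macro")
re = recall_score(y_test, y_pred, average="macro")
f1 = f1_score(y_test, y_pred, average="macro")
print(f"ACC={acc:.3f} Pr={pr:.3f} Re={re:.3f} F1={f1:.3f}")
```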
Analysis of Experimental results
First, the accuracy of the five algorithms is compared on the 3 data sets. As can be seen from FIG. 3, the CNN-RF method proposed by the invention has the highest accuracy on all 3 data sets: it improves on CNN by 1.7% on the Fudan data set, 1.6% on the Weibo data set and 0.8% on the MR data set. The deep learning-based CNN method is second only to CNN-RF and better than the other three methods, while the accuracy of NB and CART is lower than that of the ensemble learning method RF. Analysis of the experimental results shows that ensemble learning, by combining multiple models, improves generalization ability over a single model but remains weaker than the deep learning-based CNN, which obtains better accuracy by extracting abstract structural features. CNN-RF combines the advantages of both and therefore achieves the best results.
The results of the five algorithms on the Fudan Chinese data set are shown in FIG. 4. According to the experimental data, the precision, recall and F1 value of the RF algorithm exceed those of the CART and NB algorithms, showing that the ensemble learning-based method increases tolerance to noise disturbance and strengthens the generalization ability of the classifier. In precision the RF algorithm is 1.0% higher than CNN, but CNN is 6.1% higher than RF in recall, so that overall CNN exceeds RF by 2.5% in F1 value; CNN also reaches the best recall among the methods, 92.8%, which is 0.6% higher than CNN-RF. Apart from this shortfall in recall relative to CNN, the CNN-RF algorithm further enhances the generalization ability of the model: its precision is 4.1% higher than CNN's and its F1 value 1.9% higher, and CNN-RF obtains the best results in precision and F1 value.
The results of the five algorithms on the MR data set are shown in FIG. 5; MR is a binary sentiment data set. CNN-RF is highest on all three evaluation indices, about 1.2% above CNN and 4.4% above RF in F1 measure. Unlike on the other two data sets, the precision, recall and F1 value of CNN-RF on the MR data set exceed those of CNN by 1.5%, 1.1% and 1.3% respectively.
The results of the five algorithms on the Weibo data set are shown in FIG. 6. The data show that the recall of RF again performs poorly, although its precision is 7.6% higher than that of the CNN algorithm; the CNN algorithm achieves the highest recall, 15.6% and 9.2% higher than the RF and CNN-RF algorithms respectively, with the result that the F1 value of RF is 5.1% lower than that of CNN. However, CNN's F1 value is lower than CNN-RF's because of its poor precision. CNN-RF obtains the best results in precision and F1 value: its precision is 11% higher than CNN's, and its F1 value is the best, 6% and 0.9% higher than RF and CNN respectively.
In conclusion, the CNN-RF method is insensitive to the text length of the short text data sets, the double word vector convolutional neural network extracts features fully, and the generalization ability of the model is better than that of the other four algorithms. By contrast, the CART algorithm is the least effective, below even the NB algorithm; the ensemble learning method RF brings some improvement in generalization ability, but its classification effect is worse than CNN-RF because it uses only the initial word2vec word vectors, summed as features. The CNN-RF method first uses the abstract high-order features extracted by the double word vector CNN and then combines multiple decision trees to strengthen the generalization ability of the model, so its overall performance across the data sets is better than that of the CNN and RF methods. Compared with CNN, the F1 values on the 3 data sets improve by 1.9%, 0.9% and 1.3% respectively, and the experimental results prove the effectiveness of the method.
Regarding the influence of the number of decision trees in the random forest, experiments were performed on the Fudan Chinese data set; the results are shown in FIGS. 7.1 and 7.2, in which the number of decision trees is increased from 10 to 200 in 20 increments of 10. FIG. 7.1 shows the RF algorithm and FIG. 7.2 the method of the invention. Initially, as the number n of decision trees increases, all three evaluation indices of both CNN-RF and RF rise; for RF the three indices stabilize once the number of decision trees reaches 80, while for CNN-RF they essentially stabilize after the number reaches 50.

Claims (2)

1. A short text classification method based on a convolutional neural network and a random forest, characterized in that the method comprises the following steps:
step 1: segmenting all Chinese texts in a corpus to be classified into words, using the word2vec and GloVe word vector training tools respectively to obtain two groups of word vectors for the corpus, and representing each text as two matrices of equal dimensions; performing a two-dimensional convolution operation on each of the two matrices to obtain two convolutional layer feature maps;
step 2: after the convolution operation, performing a pooling operation on each of the two convolutional layer feature maps to obtain two pooling layer feature matrices; applying a nonlinear sigmoid transformation to the pooling layer feature matrices to obtain two pooling layer feature maps;
step 3: performing a convolution operation on the two pooling layer feature maps obtained in step 2 to obtain a final single fully connected layer feature map;
step 4: taking the fully connected feature map obtained in step 3 as the input data set of a random forest layer and performing Bootstrap sampling on the data set: for a data set D with m samples, sampling with replacement m times yields a new data set D'; D and D' obviously have the same size, and sampling with replacement means that some samples occur repeatedly in D' while others do not occur at all;
step 5: establishing a classification and regression tree (CART) for each of the several Bootstrap sample sets using the Gini coefficient method, wherein the Gini coefficient is used for feature selection: the feature space is divided on the selected feature, the feature is removed from the feature set after the division, and feature selection and feature division are performed recursively on the left and right subtrees until a stopping condition is met; in addition, to prevent overfitting of the decision trees, a pre-pruning operation is adopted; and combining a plurality of decision trees to jointly decide the category of a sample by voting.
2. The method for short text classification based on convolutional neural network and random forest as claimed in claim 1, wherein:
the specific implementation process of the method comprises a pre-training stage and a classifier training stage:
I. Pre-training stage
step 1: after the two groups of word vectors are obtained, for data set D, let x denote a text and let $x_i \in \mathbb{R}^k$ denote the word vector of the i-th word in the text; a sentence of length n is represented in the form

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$$

where $\oplus$ denotes the vector concatenation operation and n is the length of the longest sentence in the training corpus; a text shorter than n is completed with the special symbol <PAD>, represented by a vector generated from a uniform distribution on (-0.25, 0.25); for a word vector of dimension k, each text x is represented as two single-channel two-dimensional matrices in $\mathbb{R}^{n \times k}$, i.e., two input layers;
step 2: performing convolution operations separately on the two input layers, using a filter $W \in \mathbb{R}^{h \times k}$ acting on the word vector window $x_{i:i+h-1} = \{x_i, x_{i+1}, \ldots, x_{i+h-1}\}$:

$$C_i = f(W \cdot x_{i:i+h-1} + b)$$

where h is the size of the filter's word window, $b \in \mathbb{R}$ is a bias term, and f is a nonlinear activation function; the filter W acts on the entire word vector sequence $x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}$ to generate a convolutional layer feature map $C_{conv} \in \mathbb{R}^{n-h+1}$:

$$C_{conv} = [C_{conv,1}, C_{conv,2}, \ldots, C_{conv,n-h+1}]$$

in order to fully extract features, m filter spans $\{W_1, W_2, \ldots, W_m\}$ are set in the training process, with $s_i$ filters of the i-th span and $s_1 = s_2 = \cdots = s_m = s$, i.e., m × s feature maps are generated; a max-pooling-over-time operation then acts on each single feature map $C_{conv}$ to obtain the most important feature in the feature map:

$$\hat{C} = \max\{C_{conv}\}$$
step 3: step 2 generates m × s pooled features, which are concatenated to obtain the pooling layer feature $C_{pool}^{(l)} \in \mathbb{R}^{ms}$, where l = 1, 2 respectively denotes the pooling layer features of the two groups of word vectors;
step 4: performing a convolution operation on the two pooling layer features to obtain the final fully connected layer feature $C_{final}$, whose i-th component is

$$C_{final,i} = f(W \cdot [C_{pool,i}^{(1)}, C_{pool,i}^{(2)}] + b);$$
step 5: attaching a Softmax classifier after the fully connected layer features, training the model of the whole pre-training stage with the Adam mini-batch gradient descent algorithm, adjusting the parameters of each layer with the BP algorithm, and recording the parameters θ of the whole CNN after convergence; Dropout and L2 regularization are adopted during training to prevent overfitting;
II. Classifier training stage
step 6: reading the parameters θ from step 5, replacing the Softmax model with a random forest model, and sending the fully connected layer features $C_{final}$ into the random forest for training; the number N of decision trees in the forest is set first, Bootstrap sampling yields N data sets, and the parameters $\theta_n$ of each of the N trees are then learned; because the training processes of the trees in the forest do not influence one another, parallel training is adopted in the test to increase speed;
step 7: after the training of the single decision trees is finished, the output of the CNN-RF model is finally obtained by voting:
$$c^* = \arg\max_{c} \sum_{i=1}^{N} I(T_i(x) = c)$$
$T_i(x)$ is the classification result, i.e., the vote, of tree i for sample x; $c^*$ is the final category of the sample; and N is the number of decision trees in the random forest; because the dimension of the fully connected layer feature $C_{final}$ is small, with m × s < 10³ for the data sets used, the overhead of building the random forest is very small.
CN201710181062.0A 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest Active CN107066553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710181062.0A CN107066553B (en) 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710181062.0A CN107066553B (en) 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest

Publications (2)

Publication Number Publication Date
CN107066553A CN107066553A (en) 2017-08-18
CN107066553B (en) 2021-01-01

Family

ID=59618101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710181062.0A Active CN107066553B (en) 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest

Country Status (1)

Country Link
CN (1) CN107066553B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160034788A1 (en) * 2014-07-30 2016-02-04 Adobe Systems Incorporated Learning image categorization using related attributes
CN106156781A (en) * 2016-07-12 2016-11-23 北京航空航天大学 Sequence convolutional neural networks construction method and image processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Convolutional Neural Networks for Sentence Classification; Yoon Kim; https://arxiv.org/abs/1408.5882; 2014-09-03; full text *
News text classification based on event convolution features (基于事件卷积特征的新闻文本分类); Xia Congling (夏从零); Application Research of Computers (计算机应用研究); 2017-04-30; full text *

Also Published As

Publication number Publication date
CN107066553A (en) 2017-08-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant