CN107066553B - Short text classification method based on convolutional neural network and random forest

Info

Publication number
CN107066553B
CN107066553B (application CN201710181062.0A)
Authority
CN
China
Prior art keywords
training
random forest
feature
cnn
neural network
Prior art date
Legal status
Active
Application number
CN201710181062.0A
Other languages
Chinese (zh)
Other versions
CN107066553A (en)
Inventor
刘泽锦 (Liu Zejin)
王洁 (Wang Jie)
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201710181062.0A priority Critical patent/CN107066553B/en
Publication of CN107066553A publication Critical patent/CN107066553A/en
Application granted granted Critical
Publication of CN107066553B publication Critical patent/CN107066553B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/243 - Classification techniques relating to the number of classes
    • G06F18/24323 - Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a short text classification method based on a convolutional neural network and a random forest, belonging to the fields of text classification and deep learning. To address the insufficient generalization ability caused by adopting Softmax as the classifier of a convolutional neural network, a short text classification algorithm combining a convolutional neural network and a random forest (CNN-RF) is provided. The method first proposes a double word vector convolutional neural network to fully extract the high-order features of short texts, and then adopts a random forest as the high-order feature classifier, thereby improving the short text classification effect. Results on three public experimental data sets show that CNN-RF has significant advantages over other algorithms on multiple evaluation indices.

Description

Short text classification method based on convolutional neural network and random forest
Technical Field
The invention belongs to the fields of text classification and deep learning, and relates to a short text classification method based on a neural network and a random forest, which can be used for topic or sentiment classification of massive short text data such as microblogs, short messages and user queries, and can support system services such as search engines and information retrieval.
Background
With the rapid development of the Internet in recent years, various information interaction platforms generate large amounts of short texts, which touch every area of people's lives and have gradually become a common and widely accepted mode of communication. E-commerce comments, web information retrieval and intelligent question answering systems, for example, are all sources of massive short texts. How to mine effective information from massive short texts has been a subject of extensive research in recent years. Text classification is an effective method for text mining, but traditional long text classification methods are not applicable because short texts are short and their lexical features sparse. Short text classification technology can meet these application challenges to a certain extent; it is one of the research hotspots of scholars at home and abroad in recent years and a vital task in the field of Natural Language Processing (NLP). Current text classification methods are mainly based on statistical learning or machine learning: a classifier is trained on manually labeled corpora with statistical or machine learning methods and then applied to the data set to be classified. Mainstream machine learning methods include Naive Bayes (NB), Support Vector Machines (SVM), Logistic Regression (LR), multinomial logistic regression (Softmax Regression, SR), Random Forest (RF), Deep Neural Networks (DNN) and the like. Long text classification methods that succeed in the text classification field are difficult to apply directly to short texts, so classification algorithms for short texts have become a research problem that must be solved. The challenges of short text classification are mainly as follows:
1) short texts have sparse keyword features: compared with ordinary long texts with rich terms, a short text often contains only a few effective keywords, so when a vector space model is used to represent the text, correlations among features are difficult to mine fully;
2) in open domains (such as microblogs and search engines), information is updated quickly; the information content of a single short text is small while the total amount of text information is extremely large, and the overlap between items of information is small;
3) new words, new phrases and colloquial expressions keep emerging and are often difficult for existing classification systems to process.
Scholars at home and abroad have carried out meaningful research and exploration on the short text classification problem. The first type of approach is based on short text feature expansion: Bouaziz et al. use a Latent Dirichlet Allocation (LDA) model to learn topics and the distribution of words over topics on Wikipedia data, expand short texts with high-frequency words under the same topic, and select features of the expanded words with a random semantic forest before classifying; other scholars obtain a set of word co-occurrence patterns through association rule mining (FP-Growth) and use it as the basis for text feature expansion, with word relation confidence as the weight supporting the expansion, to complete feature expansion and classification of short texts; XH Phan et al. build a global corpus by crawling massive Internet data, obtain a topic model of the global corpus with the LDA topic model method, then perform topic inference on the short text corpus to be classified with the global LDA topic model to obtain the topic distribution of each short text, expand the short text's features with this topic distribution, and finally classify. Methods of this first type inevitably introduce noise when expanding short text features, which degrades classification.
The second category is based on deep learning: Socher et al. apply a Recursive Neural Network (RNN) model to sentence-level sentiment analysis tasks and achieve gains on classification tasks over multiple data sets such as SST; Kalchbrenner et al. [8] use a Convolutional Neural Network (CNN) for the sentence-level short text classification task and propose the Dynamic Convolutional Neural Network (DCNN) model, which performs well on several data sets and further verifies the potential of convolutional neural networks in short text classification research. The inputs of neural network methods are usually randomly initialized or pre-trained word vectors. Word vectors can be trained in many ways; different corpora, models and preprocessing yield word vectors with different meanings, and different word vectors describe word semantics from different angles. Because short text features are sparse, combining several kinds of word vectors can be considered in order to extract features fully and improve the feature extraction ability of the convolutional neural network. In addition, when Softmax is used as the classifier of a convolutional network, training generally uses the BP algorithm, which considers only minimizing the training error; because of local minima, vanishing gradients, overfitting and similar phenomena, the neural network can hardly reach optimal generalization ability. The random forest is an ensemble learning method based on Bootstrap Aggregating (Bagging); by combining multiple decision trees, the model gains strong tolerance and robustness to outliers and noise, overcoming the insufficient generalization ability of a single decision tree. Random forests have many advantages, for example:
1) few parameters need tuning, and training is fast;
2) overfitting is largely avoided during training;
3) robustness to noise disturbance is high.
Disclosure of Invention
The invention aims to provide a short text classification algorithm (CNN-RF) combining a double word vector convolutional neural network with a random forest. The double word vector convolutional neural network uses two pre-trained word vectors as input, can fully extract short text features, and overcomes the defect of sparse short text features; a random forest is then adopted for classification to enhance the generalization ability of the model. Training of the CNN-RF model is divided into two phases: 1) a pre-training stage: the double word vector convolutional network is trained with Softmax as the classifier, and the model parameters are saved; 2) a classifier training stage: with the pre-training parameters held fixed, the fully connected layer is fed into a random forest, which is trained on the high-order features, and its parameters are saved. In the experiments, only a few pre-training epochs are needed for the classifier training stage to converge, and a good classification effect is achieved.
In order to achieve the purpose, the technical scheme adopted by the invention is a short text classification method based on a convolutional neural network and a random forest, comprising the following steps:
Step 1: segment all Chinese texts in the corpus to be classified into words, train two groups of word vectors for the corpus with the word2vec and GloVe word vector training tools respectively, and represent each text as two matrices of equal dimensions; perform a two-dimensional convolution operation on each of the two matrices to obtain two convolutional layer feature maps.
Step 2: after the convolution operation, perform a pooling operation on each of the two convolutional layer feature maps to obtain two pooling layer feature matrices; apply a nonlinear sigmoid transformation to the pooling layer feature matrices to obtain two pooling layer feature maps.
Step 3: perform a convolution operation on the two pooling layer feature maps obtained in step 2 to obtain a final single fully connected layer feature map.
Step 4: take the fully connected feature map obtained in step 3 as the input data set of the random forest layer and perform Bootstrap sampling on it. Bootstrap sampling is a statistical sampling method: for a data set D with m samples, sampling with replacement m times yields a new data set D'. D and D' obviously have the same size, and because the sampling is done with replacement, some samples occur repeatedly in D' while others do not occur at all.
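As an illustration of this Bootstrap resampling, the following minimal Python sketch draws one resampled set D' (assuming NumPy arrays X and y holding the fully connected features and their labels; all names are illustrative, not the patent's code):

```python
import numpy as np

def bootstrap_sample(X, y, seed=0):
    """Draw m samples with replacement from a data set D of size m.

    D' has the same size as D; because sampling is done with
    replacement, some rows repeat while roughly 1/e (about 36.8%)
    of the original rows do not appear at all.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    idx = rng.integers(0, m, size=m)  # m draws with replacement
    return X[idx], y[idx]
```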
Step 5: build a classification and regression tree (CART) for each Bootstrap sample set using the Gini coefficient method. The Gini coefficient is used for feature selection: the feature space is split on the selected feature, the feature is removed from the feature set after the split, and feature selection and splitting are applied recursively to the left and right subtrees until a stopping condition is met. In addition, to prevent overfitting of the decision trees, the method adopts a pre-pruning operation. Multiple decision trees are combined to jointly decide the category of a sample, usually by voting.
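The Gini coefficient that drives feature selection in these CART trees can be computed as in the following minimal sketch (function names and inputs are assumptions made for illustration):

```python
import numpy as np

def gini(labels):
    """Gini impurity Gini(D) = 1 - sum_k p_k^2 for a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(y_left, y_right):
    """Weighted Gini impurity of a binary split; the feature (and
    threshold) minimizing this value is chosen at each CART node."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)
```

scikit-learn's DecisionTreeClassifier(criterion="gini") applies the same impurity criterion internally.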
Compared with the prior art, the invention has the following beneficial effects.
A Random Forest is adopted to replace the fully connected Softmax layer of the convolutional neural network, which strengthens the robustness of the whole classification method, reduces overfitting of the model and enhances its generalization ability; the double word vector convolutional neural network extracts richer features. The method does not depend on a complex syntactic parse tree: features are extracted only through convolution and max pooling over time, and the resulting high-level abstract structural features are sent to the random forest layer for classification. From a bias-variance perspective, integrating multiple models reduces the variance of the classification model and improves its stability. The method needs no complex feature expansion process, which usually introduces noise and costs time and labor; it makes full use of the short text's own information, and compared with a traditional single-channel word vector convolutional network it substantially alleviates the sparsity of short text data and extracts features fully. The max-pooling-over-time operation also handles variable-length short text input, so the double pre-trained word vector convolutional network can effectively improve the accuracy of short text classification. In the experiments, only a few pre-training epochs are needed to achieve a good effect.
Drawings
FIG. 1 is a schematic diagram of the pre-trained word vector generation model, the skip-gram model
FIG. 2 is the classification model combining the convolutional neural network and the random forest
FIG. 3 compares accuracy (ACC) with NB, CART, RF and CNN on the three data sets
FIG. 4 compares precision (Pr), recall (Re) and F1 values with NB, CART, RF and CNN on the Fudan data set
FIG. 5 compares precision (Pr), recall (Re) and F1 values with NB, CART, RF and CNN on the MR data set
FIG. 6 compares precision (Pr), recall (Re) and F1 values with NB, CART, RF and CNN on the Weibo data set
FIG. 7.1 shows how the three evaluation indices of the RF algorithm vary with the number of decision trees on the Fudan data set
FIG. 7.2 shows how the three evaluation indices of the proposed method vary with the number of decision trees on the Fudan data set
Detailed Description
In order to make the objects, technical solutions and features of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
According to the method, a Random Forest is adopted to replace the fully connected Softmax layer of the convolutional neural network, which strengthens the robustness of the whole classification method, prevents overfitting of the model and enhances its generalization ability; furthermore, a double word vector convolutional neural network is adopted, which is suited to extracting richer high-order features. The specific improvements of the invention can be summarized as follows: 1) two groups of pre-trained word vectors replace randomly initialized word vectors, which, compared with conventional methods or bag-of-words models, reduces feature dimensionality while extracting sufficient features; 2) randomly initialized word vectors also require updating the parameters of the word vector matrix, an operation the method does not need, improving model efficiency; 3) no feature expansion or complex machinery such as syntactic parse trees is introduced, so no noise enters the model's subsequent feature extraction and classification; 4) features are first extracted with a convolution-pooling-softmax pipeline similar to a traditional neural network; after a certain number of epochs, the output features of the fully connected layer become high-order structural features; 5) a random forest replaces softmax for classification, which effectively improves the generalization ability of the model, prevents overfitting and enhances the classification effect. Experiments on three public data sets (Fudan, Weibo, MR) show that CNN-RF has clear advantages over other methods on multiple evaluation indices.
FIG. 1 shows the skip-gram model of the word2vec word vector model adopted by the invention; FIG. 2 shows the structure adopted by the short text classification method based on a convolutional neural network and a random forest. For the two groups of pre-trained word vectors, the short texts in the corpus are first built into two word vector matrices, on which 2-dimensional convolution and max-pooling-over-time operations are performed; a convolution operation then combines the features of the two channels for pre-training, and finally a random forest is used to build the classification model. The specific implementation comprises a pre-training stage and a classifier training stage:
I. Pre-training stage
Step 1: after the two groups of word vectors are obtained, for corpus D, let x denote a text and let $x_i \in \mathbb{R}^k$ denote the word vector of the i-th word in the text. A sentence of length n is represented in the form

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$$

where $\oplus$ denotes the vector concatenation operation and n is the length of the longest sentence in the training corpus. A text shorter than n is completed with the special symbol <PAD>, represented by a vector generated from a uniform distribution on (-0.25, 0.25). With word vector length k, each text x is now represented as two matrices in $\mathbb{R}^{n \times k}$, i.e., two input layers, each a single-channel two-dimensional matrix.
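A minimal sketch of this input construction, assuming two pre-trained embedding lookups of equal dimension k (e.g. gensim KeyedVectors); the variable names and the out-of-vocabulary treatment are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def pad_vector(k):
    """One shared vector for <PAD>, drawn from U(-0.25, 0.25)."""
    return rng.uniform(-0.25, 0.25, size=k)

def text_to_matrix(tokens, kv, n, k, pad_vec):
    """Map a segmented text to an n x k matrix, padded to length n.

    kv maps word -> k-dimensional vector (e.g. gensim KeyedVectors);
    out-of-vocabulary words also get a U(-0.25, 0.25) vector.
    """
    rows = [kv[w] if w in kv else rng.uniform(-0.25, 0.25, size=k)
            for w in tokens[:n]]
    rows += [pad_vec] * (n - len(rows))
    return np.vstack(rows)

# each text x becomes two single-channel n x k input matrices:
# x_w2v = text_to_matrix(tokens, w2v_vectors, n, 100, pad_w2v)
# x_glo = text_to_matrix(tokens, glove_vectors, n, 100, pad_glo)
```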
Step 2: convolution operations are performed separately on the two input layers, using a filter $W \in \mathbb{R}^{h \times k}$ that acts on the word vector window $x_{i:i+h-1} = \{x_i, x_{i+1}, \ldots, x_{i+h-1}\}$:

$$C_i = f(W \cdot x_{i:i+h-1} + b)$$

where h is the size of the filter's word window, $b \in \mathbb{R}$ is a bias term, and f is a nonlinear activation function. The filter W acts on the entire word vector sequence $x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}$ to generate a convolutional layer feature map $C_{conv} \in \mathbb{R}^{n-h+1}$:

$$C_{conv} = [C_{conv,1}, C_{conv,2}, \ldots, C_{conv,n-h+1}]$$

In order to fully extract features, m filter spans are set in the training process, denoted $\{W_1, W_2, \ldots, W_m\}$, with $s_i$ filters of the i-th span; usually $s_1 = s_2 = \cdots = s_m = s$, i.e., m × s feature maps are generated. A max-pooling-over-time operation then acts on each single feature map $C_{conv}$ to obtain the most important feature in the feature map:

$$\hat{C} = \max\{C_{conv}\}$$
Step 3: step 2 generates m × s pooled features, which are concatenated to obtain the pooling layer feature $C_{pool}^{(l)} \in \mathbb{R}^{ms}$, where l = 1, 2 respectively denotes the pooling layer features of the two groups of word vectors.
Step 4: a convolution operation is performed on the two pooling layer features to obtain the final fully connected layer feature $C_{final}$, whose i-th component is

$$C_{final,i} = f(W \cdot [C_{pool,i}^{(1)}, C_{pool,i}^{(2)}] + b)$$
Step 5: a Softmax classifier is attached after the fully connected layer features; the model of the whole pre-training stage is trained with the Adam mini-batch gradient descent algorithm, the parameters of each layer are adjusted with the BP algorithm, and the parameters θ of the whole CNN are recorded after convergence. Dropout and L2 regularization are used during training to prevent overfitting.
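The following condensed Keras/TensorFlow sketch illustrates this pre-training stage with the parameter settings reported in the experiments (filter sizes 2, 3 and 4 with 100 filters each, Dropout 0.5, L2 = 0.001); the exact fusion of the two channels and the size of the fully connected layer are assumptions, not the patent's code:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_cnn(n, k, num_classes, spans=(2, 3, 4), s=100, l2=1e-3):
    """Double word vector CNN: per-channel convolution plus
    max-pooling-over-time, channel fusion into C_final, and a
    Softmax head used only during pre-training."""
    in_w2v = layers.Input(shape=(n, k))    # word2vec channel
    in_glove = layers.Input(shape=(n, k))  # GloVe channel

    def channel(x):
        pooled = []
        for h in spans:                    # m filter spans, s filters each
            c = layers.Conv1D(s, h, activation="relu")(x)
            pooled.append(layers.GlobalMaxPooling1D()(c))  # over time
        return layers.concatenate(pooled)  # m*s pooled features

    merged = layers.concatenate([channel(in_w2v), channel(in_glove)])
    merged = layers.Dropout(0.5)(merged)
    c_final = layers.Dense(128, activation="sigmoid", name="c_final",
                           kernel_regularizer=regularizers.l2(l2))(merged)
    out = layers.Dense(num_classes, activation="softmax")(c_final)
    model = tf.keras.Model([in_w2v, in_glove], out)
    model.compile(optimizer="adam",        # Adam mini-batch training
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# a few pre-training epochs suffice before the classifier is swapped:
# model = build_cnn(n=50, k=100, num_classes=5)
# model.fit([X1_train, X2_train], y_train, batch_size=64, epochs=5)
```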
II. Classifier training stage
Step 6: the parameters θ from step 5 are read, the Softmax model is replaced with a random forest model, and the fully connected layer features $C_{final}$ are sent into the random forest for training. The number N of decision trees in the forest is set first, Bootstrap sampling yields N data sets, and the parameters $\theta_n$ of each of the N trees are then learned. Because the training processes of the trees in the forest do not influence one another, parallel training is adopted in the test to increase speed.
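A sketch of this classifier training stage, assuming the pre-trained model and the input matrices X1_*, X2_* from the previous sketches; n_jobs=-1 provides the parallel tree training mentioned above:

```python
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier

# freeze the pre-trained parameters theta and expose the C_final layer
feature_extractor = tf.keras.Model(model.inputs,
                                   model.get_layer("c_final").output)
H_train = feature_extractor.predict([X1_train, X2_train])

# N decision trees with Gini splitting, each grown on a Bootstrap
# sample; n_jobs=-1 trains the trees in parallel
rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            bootstrap=True, n_jobs=-1, random_state=0)
rf.fit(H_train, y_train)

# classification on the test features
y_pred = rf.predict(feature_extractor.predict([X1_test, X2_test]))
```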
Step 7: after the training of the single decision trees is finished, the output of the CNN-RF model is finally obtained by voting:
$$c^* = \arg\max_{c} \sum_{i=1}^{N} I(T_i(x) = c)$$
$T_i(x)$ is the classification result, i.e., the vote, of tree i for sample x; $c^*$ is the final category of the sample; and N is the number of decision trees in the random forest. Because the dimension of the fully connected layer feature $C_{final}$ seen by the random forest is not large (for the data sets used, m × s < 10³), the overhead of building the random forest is very small.
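The voting rule corresponds to the following explicit computation (illustrative only; scikit-learn's forest internally averages per-tree class probabilities rather than counting hard votes):

```python
import numpy as np

def vote(trees, x):
    """Majority vote c* = argmax_c sum_i I(T_i(x) = c) over N trees.

    `trees` is any list of fitted classifiers exposing predict();
    this is a conceptual sketch of the formula above.
    """
    votes = [t.predict(x.reshape(1, -1))[0] for t in trees]
    classes, counts = np.unique(votes, return_counts=True)
    return classes[np.argmax(counts)]
```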
The method combines the feature extraction ability of the CNN with the generalization ability of the random forest, which can be analyzed from three aspects: 1) statistically, because the hypothesis space of a learning task is often very large, several hypotheses may reach the same level of performance on the training set; a single decision tree may then generalize poorly because the wrong hypothesis is selected; 2) in terms of feature extraction, the double word vectors describe word meaning from two angles, enriching short text information and expanding the feature information relative to a single word vector; 3) in terms of representation, the true hypothesis of some learning tasks may lie outside the hypothesis space of the current decision tree algorithm, in which case a single classifier cannot search beyond its own hypothesis space; the random forest's Bootstrap sampling reduces the machine learning model's dependence on the data and the variance of the model, giving it better generalization ability.
Experimental facility and required environment
A Windows 7 32-bit operating system, an Intel Xeon E5 processor with a 3.30 GHz main frequency, and 16 GB of memory were used. The experimental code is written in Python; the deep learning environment is TensorFlow combined with the scikit-learn framework.
Results and description of the experiments
Experiments are carried out on the Fudan Chinese data set, the Weibo data set provided by NLPIR, and the MR review sentiment classification data set. The Fudan Chinese data set contains 9804 training documents and 9833 test documents in 20 categories; the invention uses its news titles as the short text classification corpus and selects only 5 categories, namely C3-Art, C32-Agriculture, C34-Economy, C7-History and C38-Politics, 7120 title documents in total. The Weibo data set has 21 categories in total; the invention uses all categories except "human art", "advertisements" and "campus", 18 categories and 36412 microblog texts in total. For Weibo and MR, which are not divided into training and test sets, 10-fold cross validation is carried out in the experiments, which makes the experimental results more convincing.
Preprocessing and parameter setting
Two groups of word vectors are adopted in the experiments: the first is obtained by skip-gram training in word2vec, the second by the GloVe model. The corpora for training the word vectors come from each data set itself; only for the Fudan data set are the news contents and news titles used together as the word vector training corpus. In preprocessing, HanLP is adopted for Chinese word segmentation, and stop words are removed. The dimensionality of both groups of word vectors is set to 100; the filter sizes in the convolutional neural network are 2, 3 and 4, with 100 filters of each size; the Dropout parameter is set to 0.5 and the L2 regularization parameter to 0.001. Owing to differences in preprocessing, word vector corpora and method choices, the experimental results of different authors deviate somewhat on the same data set. To verify the classification performance of CNN-RF, several classification models and the classification method of the invention are therefore implemented on the same preprocessing mechanism for a comparative experiment.
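A sketch of the word vector pre-training step under these settings, using gensim's skip-gram implementation for the word2vec group; the window and min_count values are assumptions, and the GloVe group would be trained analogously with the GloVe toolkit:

```python
from gensim.models import Word2Vec

# `sentences`: a list of token lists produced by HanLP segmentation
# with stop words removed (illustrative variable, not the patent's code)
w2v = Word2Vec(sentences, vector_size=100, sg=1,  # sg=1 selects skip-gram
               window=5, min_count=1, workers=4)
w2v_vectors = w2v.wv  # KeyedVectors used to build the input matrices
```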
Experimental setup and evaluation index
The invention is compared with four algorithms: Naive Bayes (NB), the classification and regression tree (CART), Random Forest (RF), and the CNN network proposed by Kim. The feature vector used for classification in NB, CART and RF is the sum of the word vectors of the words in each text. Accuracy, precision, recall and the F1 value (F1-measure) are adopted as evaluation criteria and are calculated as follows:
1) accuracy:

$$\text{accuracy} = \frac{TP + TN}{N}$$

2) precision:

$$\text{precision} = \frac{TP}{TP + FP}$$

3) recall:

$$\text{recall} = \frac{TP}{TP + FN}$$

4) F1 value (F1-measure):

$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
where TP is the number of positive samples predicted as positive, TN the number of negative samples predicted as negative, FN the number of positive samples predicted as negative, FP the number of negative samples predicted as positive, and N the total number of samples. The experiments then analyze the influence of an increasing number of decision trees on the RF and CNN-RF methods, and finally compare the convergence speed of the CNN-RF method with that of the CNN algorithm.
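Given the predictions, the four indices can be computed directly, for example with scikit-learn (the macro averaging shown for the multi-class case is an assumption; the patent does not state an averaging mode):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

acc = accuracy_score(y_test, y_pred)
pr = precision_score(y_test, y_pred, average="macro")
re = recall_score(y_test, y_pred, average="macro")
f1 = f1_score(y_test, y_pred, average="macro")
print(f"ACC={acc:.3f} Pr={pr:.3f} Re={re:.3f} F1={f1:.3f}")
```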
Analysis of Experimental results
First, the accuracy of the five algorithms is compared on the 3 data sets. As can be seen from FIG. 3, the CNN-RF method proposed by the invention has the highest accuracy on all 3 data sets: it improves on CNN by 1.7% on the Fudan data set, 1.6% on the Weibo data set and 0.8% on the MR data set. The deep learning-based CNN method is second only to CNN-RF and better than the other three methods, while the accuracy of NB and CART is lower than that of the ensemble learning method RF. Analysis of the experimental results shows that ensemble learning, by combining multiple models, improves generalization ability over a single model but remains weaker than the deep learning-based CNN, which obtains better accuracy by extracting abstract structural features. CNN-RF combines the advantages of both and therefore achieves the best results.
The results of the five algorithms on the Fudan Chinese data set are shown in FIG. 4. According to the experimental data, the precision, recall and F1 value of the RF algorithm exceed those of the CART and NB algorithms, showing that the ensemble learning-based method increases tolerance to noise disturbance and strengthens the generalization ability of the classifier. In precision the RF algorithm is 1.0% higher than CNN, but CNN is 6.1% higher than RF in recall, so that overall CNN exceeds RF by 2.5% in F1 value; CNN also reaches the best recall among the methods, 92.8%, which is 0.6% higher than CNN-RF. Apart from this shortfall in recall relative to CNN, the CNN-RF algorithm further enhances the generalization ability of the model: its precision is 4.1% higher than CNN's and its F1 value 1.9% higher, and CNN-RF obtains the best results in precision and F1 value.
The results of the five algorithms on the MR data set are shown in FIG. 5; MR is a binary sentiment data set. CNN-RF is highest on all three evaluation indices, about 1.2% above CNN and 4.4% above RF in F1 measure. Unlike on the other two data sets, the precision, recall and F1 value of CNN-RF on the MR data set exceed those of CNN by 1.5%, 1.1% and 1.3% respectively.
The results of the five algorithms on the Weibo data set are shown in FIG. 6. The data show that the recall of RF again performs poorly, although its precision is 7.6% higher than that of the CNN algorithm; the CNN algorithm achieves the highest recall, 15.6% and 9.2% higher than the RF and CNN-RF algorithms respectively, with the result that the F1 value of RF is 5.1% lower than that of CNN. However, CNN's F1 value is lower than CNN-RF's because of its poor precision. CNN-RF obtains the best results in precision and F1 value: its precision is 11% higher than CNN's, and its F1 value is the best, 6% and 0.9% higher than RF and CNN respectively.
In conclusion, the CNN-RF method is insensitive to the text length of the short text data sets, the double word vector convolutional neural network extracts features fully, and the generalization ability of the model is better than that of the other four algorithms. By contrast, the CART algorithm is the least effective, below even the NB algorithm; the ensemble learning method RF brings some improvement in generalization ability, but its classification effect is worse than CNN-RF because it uses only the initial word2vec word vectors, summed as features. The CNN-RF method first uses the abstract high-order features extracted by the double word vector CNN and then combines multiple decision trees to strengthen the generalization ability of the model, so its overall performance across the data sets is better than that of the CNN and RF methods. Compared with CNN, the F1 values on the 3 data sets improve by 1.9%, 0.9% and 1.3% respectively, and the experimental results prove the effectiveness of the method.
Regarding the influence of the number of decision trees in the random forest, experiments were performed on the Fudan Chinese data set; the results are shown in FIGS. 7.1 and 7.2, in which the number of decision trees is increased from 10 to 200 in 20 increments of 10. FIG. 7.1 shows the RF algorithm and FIG. 7.2 the method of the invention. Initially, as the number n of decision trees increases, all three evaluation indices of both CNN-RF and RF rise; for RF the three indices stabilize once the number of decision trees reaches 80, while for CNN-RF they essentially stabilize after the number reaches 50.

Claims (2)

1. A short text classification method based on a convolutional neural network and a random forest, characterized in that the method comprises the following steps:
step 1: segmenting all Chinese texts in a corpus to be classified into words, using the word2vec and GloVe word vector training tools respectively to obtain two groups of word vectors for the corpus, and representing each text as two matrices of equal dimensions; performing a two-dimensional convolution operation on each of the two matrices to obtain two convolutional layer feature maps;
step 2: after the convolution operation, performing a pooling operation on each of the two convolutional layer feature maps to obtain two pooling layer feature matrices; applying a nonlinear sigmoid transformation to the pooling layer feature matrices to obtain two pooling layer feature maps;
step 3: performing a convolution operation on the two pooling layer feature maps obtained in step 2 to obtain a final single fully connected layer feature map;
step 4: taking the fully connected feature map obtained in step 3 as the input data set of a random forest layer and performing Bootstrap sampling on the data set: for a data set D with m samples, sampling with replacement m times yields a new data set D'; D and D' obviously have the same size, and sampling with replacement means that some samples occur repeatedly in D' while others do not occur at all;
step 5: establishing a classification and regression tree (CART) for each of the several Bootstrap sample sets using the Gini coefficient method, wherein the Gini coefficient is used for feature selection: the feature space is divided on the selected feature, the feature is removed from the feature set after the division, and feature selection and feature division are performed recursively on the left and right subtrees until a stopping condition is met; in addition, to prevent overfitting of the decision trees, a pre-pruning operation is adopted; and combining a plurality of decision trees to jointly decide the category of a sample by voting.
2. The method for short text classification based on convolutional neural network and random forest as claimed in claim 1, wherein:
the specific implementation process of the method comprises a pre-training stage and a classifier training stage:
I. Pre-training stage
step 1: after the two groups of word vectors are obtained, for data set D, let x denote a text and let $x_i \in \mathbb{R}^k$ denote the word vector of the i-th word in the text; a sentence of length n is represented in the form

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$$

where $\oplus$ denotes the vector concatenation operation and n is the length of the longest sentence in the training corpus; a text shorter than n is completed with the special symbol <PAD>, represented by a vector generated from a uniform distribution on (-0.25, 0.25); for a word vector of dimension k, each text x is represented as two single-channel two-dimensional matrices in $\mathbb{R}^{n \times k}$, i.e., two input layers;
step 2: performing convolution operations separately on the two input layers, using a filter $W \in \mathbb{R}^{h \times k}$ acting on the word vector window $x_{i:i+h-1} = \{x_i, x_{i+1}, \ldots, x_{i+h-1}\}$:

$$C_i = f(W \cdot x_{i:i+h-1} + b)$$

where h is the size of the filter's word window, $b \in \mathbb{R}$ is a bias term, and f is a nonlinear activation function; the filter W acts on the entire word vector sequence $x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}$ to generate a convolutional layer feature map $C_{conv} \in \mathbb{R}^{n-h+1}$:

$$C_{conv} = [C_{conv,1}, C_{conv,2}, \ldots, C_{conv,n-h+1}]$$

in order to fully extract features, m filter spans $\{W_1, W_2, \ldots, W_m\}$ are set in the training process, with $s_i$ filters of the i-th span and $s_1 = s_2 = \cdots = s_m = s$, i.e., m × s feature maps are generated; a max-pooling-over-time operation then acts on each single feature map $C_{conv}$ to obtain the most important feature in the feature map:

$$\hat{C} = \max\{C_{conv}\}$$
step 3: step 2 generates m × s pooled features, which are concatenated to obtain the pooling layer feature $C_{pool}^{(l)} \in \mathbb{R}^{ms}$, where l = 1, 2 respectively denotes the pooling layer features of the two groups of word vectors;
step 4: performing a convolution operation on the two pooling layer features to obtain the final fully connected layer feature $C_{final}$, whose i-th component is

$$C_{final,i} = f(W \cdot [C_{pool,i}^{(1)}, C_{pool,i}^{(2)}] + b);$$
step 5: attaching a Softmax classifier after the fully connected layer features, training the model of the whole pre-training stage with the Adam mini-batch gradient descent algorithm, adjusting the parameters of each layer with the BP algorithm, and recording the parameters θ of the whole CNN after convergence; Dropout and L2 regularization are adopted during training to prevent overfitting;
II. Classifier training stage
step 6: reading the parameters θ from step 5, replacing the Softmax model with a random forest model, and sending the fully connected layer features $C_{final}$ into the random forest for training; the number N of decision trees in the forest is set first, Bootstrap sampling yields N data sets, and the parameters $\theta_n$ of each of the N trees are then learned; because the training processes of the trees in the forest do not influence one another, parallel training is adopted in the test to increase speed;
step 7: after the training of the single decision trees is finished, the output of the CNN-RF model is finally obtained by voting:
$$c^* = \arg\max_{c} \sum_{i=1}^{N} I(T_i(x) = c)$$
$T_i(x)$ is the classification result, i.e., the vote, of tree i for sample x; $c^*$ is the final category of the sample; and N is the number of decision trees in the random forest; because the dimension of the fully connected layer feature $C_{final}$ is small, with m × s < 10³ for the data sets used, the overhead of building the random forest is very small.
CN201710181062.0A 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest Active CN107066553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710181062.0A CN107066553B (en) 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710181062.0A CN107066553B (en) 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest

Publications (2)

Publication Number Publication Date
CN107066553A CN107066553A (en) 2017-08-18
CN107066553B (en) 2021-01-01

Family

ID=59618101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710181062.0A Active CN107066553B (en) 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest

Country Status (1)

Country Link
CN (1) CN107066553B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160034788A1 (en) * 2014-07-30 2016-02-04 Adobe Systems Incorporated Learning image categorization using related attributes
CN106156781A (en) * 2016-07-12 2016-11-23 北京航空航天大学 Sequence convolutional neural networks construction method and image processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Convolutional Neural Networks for Sentence Classification; Yoon Kim; https://arxiv.org/abs/1408.5882; 2014-09-03; full text *
News text classification based on event convolution features (基于事件卷积特征的新闻文本分类); Xia Congling (夏从零); Application Research of Computers (计算机应用研究); 2017-04-30; full text *

Also Published As

Publication number Publication date
CN107066553A (en) 2017-08-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant